LiveSketch: Query Perturbations for Guided Sketch-based Visual Search

John Collomosse 1,2, Tu Bui 1, and Hailin Jin 2

1 Centre for Vision Speech and Signal Processing, University of Surrey
2 Creative Intelligence Lab, Adobe Research

Abstract

LiveSketch is a novel algorithm for searching large image collections using hand-sketched queries. LiveSketch tackles the inherent ambiguity of sketch search by creating visual suggestions that augment the query as it is drawn, making query specification an iterative rather than one-shot process that helps disambiguate users’ search intent. Our technical contributions are: a triplet convnet architecture that incorporates an RNN based variational autoencoder to search for images using vector (stroke-based) queries; real-time clustering to identify likely search intents (and so, targets within the search embedding); and the use of backpropagation from those targets to perturb the input stroke sequence, so suggesting alterations to the query in order to guide the search. We show improvements in accuracy and time-to-task over contemporary baselines using a 67M image corpus.

1. Introduction

Determining user intent from a visual search query remains an open challenge, particularly in sketch based image retrieval (SBIR) over millions of images, where a sketched shape can yield plausible yet unexpected matches. For example, a user’s sketch of a dog might return a map of the United States that ostensibly resembles the shape (structure) drawn, but is not relevant. Free-hand sketches are often incomplete and ambiguous descriptions of desired image content [8]. This limits the ability of sketch to communicate search intent, particularly over large image datasets.

This paper proposes LiveSketch; a novel interactive SBIR technique in which users iterate to refine their sketched query, selecting and integrating sketch embellishments suggested by the system in order to disambiguate search intent and so improve the relevance of results (Fig. 1). A core novelty of our approach lies within the method by which visual suggestions are generated, exploiting the reversibility of deep neural networks (DNNs) that are commonly used to encode image features to create the search index in visual search systems [26, 13, 7, 6]. By identifying clusters of likely target intents for the user’s search, we reverse the DNN encoder to explain how such clusters could be generated by adapting the query.

Figure 1. LiveSketch helps disambiguate SBIR for large datasets, where a shape sketched from scratch (top left) can yield results that do not match the users’ search intent. LiveSketch iteratively suggests refinements to users’ sketched queries to guide the search (Iter. 1-3), based on the user indicating relevant clusters of results (right). This interaction disambiguates and quickly guides the search towards results that match users’ search intent (subsec. 4.3).

We are inspired by adversarial perturbations (APs) that use backpropagation to generate ‘adversarial’ image examples [12] that induce object mis-classification [24, 2, 23] to a targeted category. In our context of visual search, we similarly backpropagate to perturb the sketched query from its current state toward one (or more) targets identified in the search embedding by the user. As such, the query becomes a ‘living sketch’ on the canvas that reacts interactively to intents expressed by the user, forming the basis for subsequent search iterations. The use of a single, live sketch to collaboratively guide the search differs from prior approaches such as ShadowDraw [22] that ghost hundreds of top results on the canvas. We propose three technical contributions:
1) Vector Queries for Sketch based Image Retrieval. We learn a joint search embedding that unifies vector graphic and raster representations of visual structure, encoded by recurrent (RNN) and convnet (CNN) branches of a novel triplet DNN architecture. Uniquely, this embedding enables the retrieval of raster (e.g. photo) content using sketched queries encoded as a sequence of strokes.

2879

Page 2: LiveSketch: Query Perturbations for Guided Sketch-Based ...openaccess.thecvf.com/content_CVPR_2019/papers/...ghost hundreds of top results on the canvas. We propose three technical

This higher level representation is shown not only to enhance search accuracy (subsec. 4.1) but also to enable perturbation of the query to form suggestions, without need for pixel regularization.
2) Guided Discovery of Search Intent. We make use of an auxiliary (semantic) embedding to cluster search results into pools, each representing a candidate search intent. For example, a circle on a stick might return clusters corresponding to balloons, signs, mushrooms. Deriving query suggestions from sketches drawn from these pools guides the user toward relevant content, clarifying intent by supplying contextual information not present in the query.
3) Query Perturbation. We propose an iterative strategy for SBIR query refinement in which the user’s query sketch is perturbed to incorporate the appearance of search intent(s) indicated by the user. We cast this as a search for a query perturbation that shifts the encoded query within the search embedding closer toward those selected intent(s), encoding that vector as a loss (in the spirit of APs) that is backpropagated through the DNN to update the sketch.

2. Related Work

Visual search is a long-standing problem within the computer vision and information retrieval communities, where the iterative presentation and refinement of results has been studied extensively as relevance feedback (RF) [29, 21, 20], although only sparsely for SBIR [16]. RF is driven by interactive markup of results at each search iteration. Users tag results as relevant or irrelevant, so tuning internal search parameters to improve results. Our work differs in that we modify the query itself to affect subsequent search iterations; queries may be further augmented by the user at each iteration. Recognizing the ambiguity present in sketched queries, we group putative results into semantic clusters and propose edits to the search query for each object class present.

Query expansion (QE) is an automated technique to improve search accuracy from a single, one-off visual query [19, 39, 33, 27] by recursively submitting search results as queries. LiveSketch contrasts with QE as it is an interactive system in which query refinements are suggested, and optionally incorporated by the user to help disambiguate search intent; so communicating more information than is present in a single, initial sketched query.

Deep learning, specifically CNNs (convnets), has been rapidly adopted for SBIR and more broadly for visual search, outperforming classical dictionary learning based models (e.g. bag of words) [30, 3]. Wang et al. [34] were arguably the first to explore CNNs for sketched 3D model retrieval via a contrastive loss network mapping sketches to rendered 2D views. Qi et al. [25] similarly learned correspondence between sketches and edge maps. Fine-grained SBIR was explored by Yu et al. [38] and Sangkloy et al. [28], who used a three-branch CNN with triplet loss for learning the cross-domain embedding. Triplet loss models have been used more broadly for visual search, e.g. using photographic queries [35, 26, 13]. Bui et al. [4, 6] perform cross-category retrieval using a triplet model and currently lead the Flickr15k [15] benchmark for SBIR. Their system was combined with a learned model of visual aesthetics [36] to constrain SBIR using stylistic cues in [7]. All of these prior techniques learn a deep encoder function that maps an image into a point in a metric search embedding where the distance between an image pair correlates to its similarity. Such embeddings can be binarized (e.g. via PQ [18]) for scalable search. In prior work, search embeddings were learned using rasterized sketches i.e. images, rather than vector representations of sketched strokes. In our approach we adopt a vector representation for sketches, building upon the SketchRNN variational autoencoder of Eck et al. previously applied to blend [14] and match [37] sketches with sketches. Here we adapt SketchRNN in a more general form both for our interactive search of photographs and for generating search suggestions, training with the Quickdraw50M dataset [1].

Our work is aligned to ShadowDraw [22], in which ghosts (edge-maps derived from top search results) are averaged and overlaid onto the sketch canvas (similarly, [40] for photo search). However our system differs both in intent and in method. ShadowDraw is intended to teach unskilled users to sketch, rather than as a search system in its own right [22]. The technical method also differs - our system uses deep neural networks (DNNs) both for search and for query guidance, and hallucinates a single manipulable sketch rather than a non-editable cloud of averaged suggestions. This declutters presentation of the suggestions and does not constrain suggestions to the space of existing images in the dataset. Our method produces query suggestions by identifying destination points within the search embedding and employing backpropagation through the deep network (with network weights fixed) in order to update the input query so that it maps to those destination points. The manipulation of input imagery with the goal of effecting change in the output embedding is common in the context of adversarial perturbations (APs), where image pixels are altered to change the classification (softmax) output of a CNN [24, 2]. We are inspired by FGSM [12], which directly backpropagates from classification loss to input pixels in order to induce noise that, whilst near-imperceptible, causes mis-classification with high confidence. Although we also backpropagate, our goal differs in that we aim for observable changes to the query that guide the user in refining their input. Our reimagining of APs for interactive query refinement in visual search is unique.

3. Methodology

LiveSketch accepts a query sketch Q in vector graphics form (as a variable length sequence of strokes), and searches a large (∼10^8) dataset of raster images I = {I1, ..., IN}. Our two-stream network architecture (Fig. 2) unifies both vector and raster modalities via a common search embedding (S). Sketch and image content are encoded via RNN and CNN branches respectively, unified via 4 fully connected (fc) layers; final layer activations yield S ∈ ℜ^256.



Figure 2. Overview of the proposed SBIR framework. A query sketch (Q, vector graphics form) and images (I, raster form) are encoded into the search embedding S via RNN and CNN branches, unified via four inner product layers. Images are encoded via SI(.); the image branch of [6]. Query sketches are encoded via SQ(.); the encoder stage of Fig. 3. An auxiliary semantic embedding Z clusters results to help the user pick search target(s) T in the search embedding. In the spirit of adversarial perturbation, the strokes Q are adjusted to minimize ||SQ(Q) − Ti||2 and so evolve the sketch toward the selected target(s).

The end-to-end RNN and CNN paths through the network describe the pair of encoding functions SQ(Q) and SI(Ii) for encoding the visual structure of sketches, and of images, respectively; the process for learning these functions is described in subsec. 3.1. Once learned, the image dataset is indexed (offline) by feeding forward all Ii ∈ I through SI(.). At query time, results for a given Q are obtained by ranking on ||SQ(Q) − SI(Ii)||2, where ||.||2 is the L2 norm.
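For illustration only, a minimal NumPy sketch of this ranking step (not the authors' implementation), assuming the 256-D image features have already been pre-computed via SI(.) and the query encoded via SQ(.):

```python
import numpy as np

def rank_by_l2(query_feat, image_feats, top_k=500):
    """Rank indexed images by L2 distance to an encoded sketch query.

    query_feat  : (256,) array, S_Q(Q)
    image_feats : (N, 256) array, S_I(I_i) for all indexed images
    Returns indices of the top_k nearest images and their distances.
    """
    dists = np.linalg.norm(image_feats - query_feat[None, :], axis=1)  # ||S_Q(Q) - S_I(I_i)||_2
    order = np.argsort(dists)[:top_k]
    return order, dists[order]

# usage (random stand-ins for real features)
if __name__ == "__main__":
    feats = np.random.randn(10000, 256).astype(np.float32)
    q = np.random.randn(256).astype(np.float32)
    idx, d = rank_by_l2(q, feats, top_k=5)
    print(idx, d)
```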

Fig. 2 provides an overview of our interactive search. Given an initial query Q, images embedded in S proximate to SQ(Q) are returned. Whilst these images share the visual structure of Q, the inherent ambiguity of sketch typically results in semantically diverse content, only a subset of which is relevant to the user’s search intent. We therefore invite the user to disambiguate their sketch intent via interaction. Search results are clustered within an ‘auxiliary’ semantic embedding Z. The user assigns relevance weights to a few (m = 3) dominant clusters. For each cluster C1, ..., Cm in Z, a search target T1, ..., Tm is identified in S (process described in subsec. 3.3). The targets receiving high weighting from the user represent visual structures that we will evolve the existing query sketch Q toward, in order to form a query suggestion (Q′) to guide the next search iteration.
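This interaction composes into a simple loop. The following sketch is purely illustrative: every callable it receives (encode_query, knn_search, cluster_semantic, pick_targets, perturb_query, get_user_weights) is a hypothetical stand-in for a component detailed in the remainder of this section, not part of any released implementation.

```python
# One LiveSketch iteration, written as a plain Python function.
def livesketch_iteration(Q, encode_query, knn_search, cluster_semantic,
                         pick_targets, perturb_query, get_user_weights,
                         k=500, m=3):
    q_s = encode_query(Q)                         # S_Q(Q): strokes -> search embedding S
    results = knn_search(q_s, k=k)                # k-NN over S_I(I_i), PQ-approximated in practice
    clusters = cluster_semantic(results, m=m)     # group results in auxiliary embedding Z
    targets = pick_targets(q_s, clusters)         # one search target T_i per cluster (subsec. 3.3.2)
    weights = get_user_weights(targets)           # user-assigned relevance weights (sliders)
    Q_prime = perturb_query(Q, targets, weights)  # backprop-perturbed suggestion (subsec. 3.4)
    return Q_prime, results, clusters
```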

Our query is represented in vector graphics form to enable suggestions to be generated via direct modification of the stroke sequence encoded by Q, avoiding the need for complex pixel-domain regularization. LiveSketch updates Q ↦ Q′ such that SQ(Q′) is closer to targets T1, ..., Tm than SQ(Q). Treating the weighted distances between those targets and SQ(Q′) as a loss, we fix SQ(.) and propagate gradients back via the RNN branch to perturb the input sequence of strokes (subsec. 3.4) and so suggest the modified sketch query.

R ∈ ℜ^256    Raster embedding [6]*
V ∈ ℜ^512    Vector graphics embedding*
S ∈ ℜ^256    Joint search embedding (structure)
Z ∈ ℜ^2048   Auxiliary embedding (semantic)

Table 1. Summary of the feature embeddings used in LiveSketch; * indicates intermediate embeddings not used in the search index.

Q′ may be further augmented by the user, and submitted for a further iteration of search.

3.1. Cross-modal Search Embedding (S)

We wish to learn a cross-modal search embedding in which a sketched query expressed as a variable length sequence of strokes, and an image indexed by the system (e.g. a photograph) containing similar visual structure, map to similar points within that embedding. We learn this representation using a triplet network (Fig. 4) comprising an RNN anchor (a) and siamese (i.e. identical, shared weights) positive and negative CNN branches (p/n). The RNN and CNN branches encode vector and raster content to intermediate embeddings V and R respectively; we describe how these are learned in subsecs. 3.1.1-3.1.2. The branches are unified by 4 fully-connected (fc) layers, with weight-sharing across all but the first layer to yield the common search embedding S. Thus the fc layers encode two functions mapping V ↦ S and R ↦ S respectively; we write these FV(.) and FR(.) and in subsec. 3.1.3 describe incorporation of these into the pair of end-to-end encoding functions SQ(.) and SI(.) for our network (Fig. 2).
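For concreteness, a minimal PyTorch sketch of this partial weight-sharing scheme (an illustration, not the released model; apart from the 512-D/256-D inputs and 256-D output, the hidden width and activation choices are assumptions):

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Maps V (512-D, RNN branch) and R (256-D, CNN branch) to S (256-D).

    Four fc layers per path; weights are shared on all but the first layer,
    mirroring the partial sharing described in subsec. 3.1.
    """
    def __init__(self, dim_v=512, dim_r=256, dim_s=256, hidden=512):
        super().__init__()
        self.fc1_v = nn.Linear(dim_v, hidden)   # modality-specific first layer (vector)
        self.fc1_r = nn.Linear(dim_r, hidden)   # modality-specific first layer (raster)
        self.shared = nn.Sequential(            # layers 2-4, shared across modalities
            nn.ReLU(), nn.Linear(hidden, hidden),
            nn.ReLU(), nn.Linear(hidden, hidden),
            nn.ReLU(), nn.Linear(hidden, dim_s))

    def forward_vector(self, v):                # F_V(.)
        return self.shared(self.fc1_v(v))

    def forward_raster(self, r):                # F_R(.)
        return self.shared(self.fc1_r(r))
```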

3.1.1 Sketch Variational Autoencoder

The RNN branch is a forward-backward LSTM encoder comprising the front half of a variational autoencoder (v.a.e.) for sketch encoding-decoding, adapted from the SketchRNN network of Eck et al. [14]. In the SketchRNN v.a.e., a deterministic latent representation (z0) is learned, alongside the parameters of a multi-variate Gaussian from which a non-deterministic (n.d.) representation (batchz) is sampled to drive the decoder and reconstruct the sketch through recurrence conditioned on batchz. The representation is learned via a combination of a reconstruction loss and a regularization ‘KL loss’ [17] over the multi-variate parameters, but as proposed [14] it can represent only up to a few object classes, making it unsuitable for web-scale SBIR.

Figure 3. Modified SketchRNN [14] (changes, blue) used to encode/decode stroke sequences via addition of a 512-D latent representation and classification loss. Integrates with Fig. 2 (anchor).



We adapt SketchRNN as follows (Fig. 3). We retrain from scratch using 3.5M sketches from Quickdraw50M (QD-3.5M; see Sec. 4), adding a low-dimensional (512-D) bottleneck after z0 from which batchz is sampled. We add a softmax classification loss to that bottleneck, weighted equally with the original reconstruction and KL loss terms. During training we reduce the covariance of the n.d. variate below 10^−2. A query sketch (Q) is coded as a sequence of 3-tuples Q = [q1, q2, ..., qn] where qi = (δx, δy, l), representing relative pen movements in x, y ∈ ℜ^2 and whether the pen is lifted l = [0, 1]; an abbreviated form of the 5-tuple coding in [14]. The intermediate embedding available at the bottleneck (V ∈ ℜ^512) is capable of reconstructing sketches across diverse object classes (c.f. Sec. 4.2). The encoder forms the anchor of the proposed triplet network (Fig. 4); we denote the encoding and decoding functions as VE(Q) ↦ V and VD(V) ↦ Q.
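To make the stroke coding concrete, a small illustrative helper that converts absolute polyline strokes into the (δx, δy, l) 3-tuple sequence; the exact pen-lift convention used here (l = 1 on the final point of each stroke) is an assumption for illustration, not taken from the paper:

```python
def strokes_to_offsets(strokes):
    """Convert a sketch given as a list of strokes (each a list of absolute
    (x, y) points) into a 3-tuple sequence q_i = (dx, dy, l): relative pen
    movement plus a pen-lift flag, set to 1 on the last point of a stroke."""
    seq, prev = [], (0.0, 0.0)
    for stroke in strokes:
        for j, (x, y) in enumerate(stroke):
            lift = 1 if j == len(stroke) - 1 else 0
            seq.append((x - prev[0], y - prev[1], lift))
            prev = (x, y)
    return seq

# usage: two strokes of a tiny square-ish shape
print(strokes_to_offsets([[(0, 0), (10, 0), (10, 10)], [(10, 10), (0, 10)]]))
```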

3.1.2 Raster Structure Encoder

To encode raster content, we adopt the architecture of Bui et al. [6] for the CNN branch. Their work employs a triplet network with a GoogLeNet Inception backbone [32] that unifies sketches (in raster form) and images within a joint search embedding. One important property is the partial sharing of weights between the sketch CNN branch (anchor) and the siamese image CNN (+/-) branches of their triplet network. Once trained, these branches yield two functions, RS(.) and RI(.), that map sketched and image content to a joint search embedding. Full details on the multi-stage training of this model are available in [6]; we use their pre-trained model and incorporate their joint embedding as the intermediate embedding R ∈ ℜ^256 in our work. Specifically, RS(.) is used to train our model (Fig. 4, p/n).

3.1.3 Training the Joint Search Embedding

The end-to-end triplet network (Fig. 4) is trained using sketches only: 3.5M sketches (10K × 345 object classes) sampled from the public Quickdraw50M dataset [1] (simplified via RDP [9], as in [14]) and rasterized by rendering pen movements as anti-aliased lines of width 1 pixel on a 256 × 256px canvas. The stroke sequence and the rendering of that sequence are fed through the anchor (a) and positive (p) branches, and a randomly selected rasterized sketch of a differing object class to the negative branch (n). Weights are optimized via ADAM using a triplet loss computed over activations available from the final shared fc layer (the search embedding S):

Ltrain(a, p, n) = [ m + ||SQ(a) − FR(RS(p))||2² − ||SQ(a) − FR(RS(n))||2² ]+        (1)

where m = 0.2 is a margin promoting convergence, and [x]+ indicates the non-negative part of x. Training yields weights for the fully connected (fc) layers – recall these are partially shared across the vector (a) and raster (p/n) branches, yielding FV(.) and FR(.).
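For reference, Eq. 1 can be written directly in PyTorch; this is a minimal sketch (not the authors' training code) that assumes the three 256-D embeddings have already been produced by the anchor and raster branches:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss of Eq. 1: anchor = S_Q(a) (vector branch), positive /
    negative = F_R(R_S(p)) / F_R(R_S(n)) (raster branch), squared L2
    distances, hinged at margin m = 0.2."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # ||S_Q(a) - F_R(R_S(p))||_2^2
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # ||S_Q(a) - F_R(R_S(n))||_2^2
    return F.relu(margin + d_pos - d_neg).mean()    # [.]_+ averaged over the batch

# usage with random embeddings standing in for network outputs
a, p, n = (torch.randn(8, 256) for _ in range(3))
print(triplet_loss(a, p, n).item())
```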

Figure 4. Training the LiveSketch network; an encoder that maps raster and vector content to a common search embedding. The search embedding is trained using raster and vector (stroke sequence) content sampled from QD-3.5M. During training, the CNN branches (p/n) are RS(.), i.e. the sketch branch of [6]. However branch RI(.) is used at inference time (Fig. 2).

The end-to-end function mapping a sketched query Q to our common search embedding S is:

SQ(Q) = FV(VE(Q)).        (2)

Fig. 5a shows the resulting embedding; raster and vector content representing similar visual structures mix within S, but distinct visual structures form discriminative clusters.

3.2. Search Implementation

Once trained, SQ(.) forms the RNN path within our search framework (Fig. 2, green) for encoding a vector sketch query Q. The CNN path SI(.) (Fig. 2, blue), used to index images for search, adopts the image branch of [6] (subsec. 3.1.2):

SI(I) = FR(RI(I)). (3)

Note the substitution of the sketch branch RS(.) that was used during training (eq. 1) for the image branch RI(.). Both functions map to the same intermediate embedding R; however we index images rather than sketches for SBIR.

3.3. Disambiguating Search Intent

Given a search query Q, a k-NN lookup within S is performed to identify a set of results J = [I1, ..., Ik], where J ⊆ I, minimising ||SQ(Q) − SI(Ii)||2; in practice, ||.||2 is approximated via product quantization (PQ) [18] for scalability and up to k = 500 results are returned. The results are clustered into candidate search intents, and presented to the user for feedback. Clustering is performed within an auxiliary embedding (Z) available from final layer activations of a ResNet50/ImageNet pre-trained CNN. We write this function Z(Ii), pre-computed ∀Ii ∈ I during indexing.
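A hedged sketch of such a PQ-approximated index, assuming the faiss library as the PQ implementation (the paper cites PQ [18] but does not name a library); the sub-quantizer settings are illustrative, not the authors' configuration:

```python
import numpy as np
import faiss  # assumed PQ implementation; the paper only cites PQ [18]

d = 256                                                   # dimensionality of S
feats = np.random.randn(100000, d).astype(np.float32)     # stand-in for S_I(I_i)

index = faiss.IndexPQ(d, 32, 8)   # 32 sub-quantizers x 8 bits (illustrative)
index.train(feats)                # learn the PQ codebooks offline
index.add(feats)                  # index the (compressed) image features

query = np.random.randn(1, d).astype(np.float32)          # S_Q(Q)
dists, ids = index.search(query, 500)                     # approximate k-NN, k = 500
```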



Figure 5. (a) Visualizing the search embedding (S) (for 20/345 random classes sampled from QuickDraw50M); sketches in vector (red) and raster (blue) modalities have been encoded via SQ(.) and SI(.) respectively. The learned representation is discriminative on visual structure but invariant to modality. (b) A k-NN search (L2, k = 500) yields search results in S local to the encoded sketch query SQ(Q); results share similar structure but span diverse semantics, e.g. a box with a cross atop returns boats, churches, windmills. Results are clustered in the auxiliary (semantic) embedding Z and presented to the user for ranking.

3.3.1 Clustering

Images local to SQ(Q) within the search embedding S may be semantically diverse; a single visual structure, e.g. a cross atop a box, may return churches, boats, windmills, etc. However these results will form distinct clusters within Z (Fig. 5b). We apply affinity propagation [11] to identify the dominant m = 3 clusters C = [c1, ..., cm] in Z. The algorithm constructs an affinity graph for all image pairs (Ia, Ib) ∈ J × J, scoring these:

d(Ia, Ib) = ||Z(Ia) − Z(Ib)||2.        (4)

Clustering is a greedy process that iteratively constructs C, selecting a best cluster ci = {I1, ..., Ik} from the graph, minimizing ρ(ci):

ρ(ci) = Σ_{(Ia,Ib) ∈ ci×ci} d(Ia, Ib) + W(ci, C).        (5)

where W(ci, C) is a penalty term that encourages semantic diversity by discouraging selection of ci containing images similar to those already in C:

W(ci, C) ∝ −log Σ_{(Ia,Ib) ∈ ci×χ(C)} d(Ia, Ib)        (6)

where χ(C) represents the set of images already present within clusters in set C.
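For orientation, a minimal sketch of this step using scikit-learn's AffinityPropagation as a stand-in for [11]; it ranks clusters by size and omits the diversity penalty W(ci, C) of Eqs. 5-6, so it approximates rather than reproduces the procedure above:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation  # stand-in implementation of [11]

def dominant_clusters(z_feats, m=3):
    """Cluster k-NN results in the semantic embedding Z and return the m
    largest clusters as arrays of result indices (diversity penalty omitted)."""
    labels = AffinityPropagation(damping=0.9, random_state=0).fit(z_feats).labels_
    sizes = np.bincount(labels)
    top = np.argsort(sizes)[::-1][:m]              # m most populated clusters
    return [np.flatnonzero(labels == c) for c in top]

# usage: three synthetic semantic "blobs" standing in for ResNet50 (Z) features
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2048)) for c in (0.0, 1.0, 2.0)])
print([len(c) for c in dominant_clusters(blobs, m=3)])
```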

3.3.2 Identifying Search Targets

For each cluster i = 1, ..., m we identify a representative image I∗i closest to the visual structure of the query:

I∗i = argmin_{Ij} ||SQ(Q) − SI(Ij)||2;  ∀Ij ∈ Ci.        (7)

Leveraging the Quickdraw50M dataset (QD-3.5M) of sketches (H), we identify the closest sketch Q∗i to each representative image:

Q∗i = argmin_{q ∈ H} ||SQ(q) − SI(I∗i)||2.        (8)

The set of these sketches T = {T1, ..., Tm}, where Ti = Q∗i, represents the set of search targets and the basis for perturbing the user’s query (Q) to suggest a new sketch Q′ that guides subsequent iterations of search.
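A small NumPy sketch of Eqs. 7-8 (illustrative only), assuming the image and QuickDraw sketch features have already been pre-computed in S via SI(.) and SQ(.) respectively:

```python
import numpy as np

def search_targets(q_s, img_feats_s, sketch_feats_s, clusters):
    """For each cluster, pick the representative image I*_i closest to the
    query in S (Eq. 7), then the QuickDraw sketch Q*_i closest to that image
    in S (Eq. 8). Returns one target sketch index per cluster."""
    targets = []
    for members in clusters:                                    # indices of cluster members
        d_img = np.linalg.norm(img_feats_s[members] - q_s, axis=1)
        rep = members[np.argmin(d_img)]                         # I*_i  (Eq. 7)
        d_sk = np.linalg.norm(sketch_feats_s - img_feats_s[rep], axis=1)
        targets.append(int(np.argmin(d_sk)))                    # Q*_i  (Eq. 8)
    return targets
```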

3.4. Sketch Perturbations for User Guidance

The search targets T are presented to the user, alongside sliders that enable the relevance of each to be interactively expressed as a set of weights Ω = {ω1, ..., ωm}. We seek a new sketch query Q′ that updates the original query Q to resemble the visual structure of those targets, in proportion to these user-supplied weights.

For brevity we introduce the following notation. QV∗i = VE(Q∗i) describes each search target within the RNN embedding V. Similarly QS∗i = SQ(Q∗i) describes each target within the search embedding S. We similarly use QV = VE(Q) and QS = SQ(Q) to denote the user’s sketch Q in V and S respectively. To perturb the sketch we seek Q′ (similarly written QV′ and QS′ within those embeddings). The availability of the v.a.e. decoder (subsec. 3.1.1) enables generation of new sketches (sequences of strokes) conditioned on any point within V. Our approach is to seek QV′ such that Q′ = VD(QV′) may be generated. The task of updating Q ↦ Q′ is therefore one of obtaining QV′ via interpolation between QV and targets QV∗i, as a function of user supplied weights and targets:

QV′ = f(QV; Ω, T).        (9)

We describe two strategies (instances of f) for computing QV′ from query QV (evaluating these in subsec. 4.2).

3.4.1 Linear Interpolation

A naïve solution is to linearly interpolate within the RNN embedding V, i.e.:

flinear(QV; Ω, T) = QV + Σ_{i=1}^{m} ωi (QV∗i − QV)        (10)

yielding QV′ via eq. 9, from which the sketch suggestion Q′ is generated via the RNN decoder Q′ = VD(QV′).
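Eq. 10 amounts to a weighted blend of RNN codes; a minimal NumPy sketch (illustrative, with random codes standing in for VE outputs):

```python
import numpy as np

def f_linear(q_v, target_vs, weights):
    """Eq. 10: move the query's RNN code q_v = V_E(Q) toward each target code
    V_E(Q*_i) in proportion to the user weight w_i (naive linear blend in V)."""
    q_v_prime = q_v.copy()
    for t, w in zip(target_vs, weights):
        q_v_prime += w * (t - q_v)
    return q_v_prime   # decode with V_D(.) to obtain the suggested sketch Q'

# usage with random 512-D codes standing in for V_E outputs
q = np.random.randn(512)
targets = [np.random.randn(512) for _ in range(3)]
print(f_linear(q, targets, [0.5, 0.3, 0.2]).shape)
```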



Figure 6. Generating Q′: linear (top) vs. non-linear (bottom) interpolation in RNN space (V); the latter due to backpropagation of loss eq. 12 that updates Q ↦ Q′ s.t. QS′ tends toward the search targets identified local to QS by the user. See Fig. 8 for examples.

However, although QS and QS∗i are local by construction, it is unlikely that QV and QV∗i will be local; nor is the manifold of plausible sketches within V linear. This can lead to the generation of implausible sketches (c.f. Sec. 4.2, Fig. 8).

3.4.2 Back-propagation

We therefore perform a non-linear interpolation in V, minimizing an objective that updates QV ↦ QV′ closer to search target(s) QV∗i via backprop through the fc layers FV(.) (eq. 2) to reduce the distance between QS and QS∗i:

D(QV′) = (1/m) Σ_{j=1}^{m} ωj ||QS′ − QS∗j||2².        (11)

This is analogous to FGSM adversarial perturbation (AP) of images in object recognition [12], where the input to the network is modified via backprop to influence its mapping to a classification embedding. In our context, we define a loss based on this distance in S, regularized by the constraint that the original and updated sketch should be nearby in V:

LAP(Q′) = D(QV′) + α||QV′ − QV||2.        (12)

Weight α = 0.1 was empirically set. An optimal QV′ is sought by backpropagation through the fc layer FV(.):

fAP(Q′; Ω, T) = argmin_{q′} LAP(VE(q′)).        (13)

Equipped with sliders to control relevance weights Ω on each of the targets, the user in effect performs a linear interpolation (between SQ(Q) and T) within S that causes a non-linear extrapolation from QV to the output point QV′, and ultimately the sketch suggestion via the RNN decoder Q′ = VD(QV′). Fig. 6 contrasts the linear and non-linear (backprop) approaches; visual examples in Fig. 8.
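A sketch of this optimisation in PyTorch (illustrative only; the optimiser, learning rate and step count are assumptions, not the authors' settings), treating the fc path FV(.) as a frozen differentiable module:

```python
import torch

def f_ap(q_v, target_s, weights, F_V, alpha=0.1, steps=200, lr=0.05):
    """Backprop-based perturbation (Eqs. 11-13). Starting from the query's
    RNN code q_v = V_E(Q), optimise q_v' so that its projection F_V(q_v') in
    the search embedding S moves toward the user-weighted targets Q^S*_j,
    while a penalty alpha*||q_v' - q_v||_2 keeps it near the original in V."""
    for p in F_V.parameters():                  # network weights stay fixed
        p.requires_grad_(False)
    q_v_prime = q_v.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([q_v_prime], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        q_s_prime = F_V(q_v_prime)              # Q^S' in the search embedding
        d = sum(w * (q_s_prime - t).pow(2).sum() for w, t in zip(weights, target_s))
        loss = d / len(target_s) + alpha * (q_v_prime - q_v).norm(p=2)   # Eq. 12
        loss.backward()                         # gradients flow through F_V only
        opt.step()
    return q_v_prime.detach()                   # decode with V_D(.) to obtain Q'

# usage with a random linear layer standing in for the (frozen) fc path F_V(.)
F_V = torch.nn.Linear(512, 256)
q_v = torch.randn(512)
targets = [torch.randn(256) for _ in range(3)]
q_v_new = f_ap(q_v, targets, [0.6, 0.3, 0.1], F_V)
```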

        Method           Class-level   Instance-level
S-I     LS (Ours)        38.40         30.81
        LS-R             35.26         29.48
        LS-R-I [6]       35.15         27.48
        Sketchy [28]     33.21         27.06
        Bui et al. [5]   12.59          8.76
S-S     V-R              34.88         18.80
        R-V              29.31         18.29
        V-R-shuffle      35.94         15.71
        R-V-shuffle      29.61         18.57

Table 2. Accuracy for sketch based recall of sketches (S-S) and images (S-I); evaluated using class and instance level mAP (%) over 345 vector queries (QD-345). Top: S-I ablations; raster query (LS-R) and raster intermediate embedding (LS-R-I) [6]. Bottom: S-S retrieval across query modalities; raster querying vector (R-V) and vector querying raster (V-R); also variants with shuffled stroke order (-shuffle).

4. Experiments and Discussion

We evaluate the performance of LiveSketch using the QuickDraw50M dataset [1] and a corpus of 67M stock photo and artwork images (Stock67M) from Adobe Stock¹.

QuickDraw50M is a dataset of 50M hand-drawn sketches crowdsourced via a gamified classification exercise (Quick, Draw!) [1]. Quickdraw50M is well suited to our work due to its class diversity (345 object classes), vector graphics (stroke sequence) format, and the casual, fast, throwaway act of sketching encouraged in the exercise, which reflects typical SBIR user behaviour [8] (vs. smaller, less category-diverse datasets such as TU-Berlin/Sketchy that contain higher fidelity sketches drawn with reference to a target photograph [28, 10]). We sample 3.5M sketches randomly with even class distribution from the Quickdraw50M training partition to create the training set (QD-3.5M; detail in subsec. 3.1.3). For sketch retrieval and interpolation experiments we sample 500 sketches per class (173K total) at random from the Quickdraw50M test partition to create an evaluation set (QD-173K). A query set of sketches (QD-345) is sampled from QD-173K, one sketch per object class, to serve as queries for our non-interactive experiments.

Stock67M is a diverse, unannotated corpus of images used to evaluate large-scale SBIR retrieval performance. The dataset was created by harvesting every public, unwatermarked image thumbnail from the Adobe Stock website in late 2016, yielding approximately 67M images at QVGA resolution.

4.1. Evaluating cross-modal search

We evaluate the performance of our cross-modal embedding (S) for sketch based retrieval of sketches and images.

Sketch2Sketch (S-S) Matching. We evaluate the ability of our embedding trained in subsec. 3.1.3 to discriminate between sketched visual structure, invariant to input modality (vector vs. raster). We train our model on QD-3.5M and retrieve sketches from the QD-173K corpus, using QD-345 as queries.

¹ Downloaded from https://stock.adobe.com in late 2016.



Figure 7. Performance of the joint search embedding for retrieval. Top: Sketch2Image (S-I) Precision@k curve for SBIR – see Tbl. 2 for mAP% and key to ablation notation. Bottom: Sketch2Sketch (S-S) cross-modal matching. Class-level (solid) and instance level (dash) mAP-recall curves for Vector-Raster (V-R), Raster-Vector (R-V), and stroke shuffling experiments (-shuffle).

Both vector queries retrieving raster content (V-R) and vice versa (R-V) are explored, using category (class-level) and fine-grain (instance-level) metrics. For the former we consider a retrieved record a match if it matches the sketched object class. For the latter, the exact same sketch must be returned. To run raster variants, a rasterized version of QD-173K is produced by rendering strokes to a 256 × 256 pixel canvas (see method of subsec. 3.1.3). Sketches from QD-173K are encoded to the search embedding S from their vector and raster form respectively via functions SQ(.) and FR(RS(.)). Fig. 5 visualizes the sketches within the search embedding; similar structures cluster together whilst the vector/raster modalities mix. Tbl. 2 (bot.) and Fig. 7 (bot.) characterize performance; vector queries (∼35% mAP) outperform raster (∼29% mAP) by a ∼5% margin. To explore this gain further, we shuffled the ordering of the vector strokes, retraining the model from scratch. We were surprised to see comparable performance at class level, suggesting this gain is due to the spatial continuity inherent to the stroke representation, rather than temporal information. The fractional increase may be due to the shuffling acting as data augmentation, enhancing invariance to stroke order. However, at instance level, ordering appears more significant (∼3% gain; V-R vs. V-R-shuffle).

Sketch2Image (S-I) Matching. We evaluate the performance of the search embedding for SBIR over Stock67M, using all QD-345 sketch queries (without user interaction in this experiment).

Figure 8. Comparison of sketch interpolation via our backprop (fAP) approach and linear interpolation (flinear) within V, for fine-grain variations of a boat and a bag (c.f. Tbl. 3).

Table 3. Perturbation method user study (MTurk) comparing the proposed query perturbation scheme inspired by adversarial perturbations (fAP) vs. linear interpolation variants (flinear, fSLERP).

Annotation of 67M images for every query is impractical; we instead crowd-source per-query annotation via Mechanical Turk (MTurk) for the top-k (k = 15) results and compute both mAP% and precision@k curves averaged across all 345 queries for each experiment. The annotation is crowd-sourced with 5 repetitions. Results are summarized in Tbl. 2 (S-I) and Fig. 7 (top). We perform two ablations to our proposed LiveSketch (LS) system: 1) querying with rasterized versions of the QD-345 queries (-R) using the proposed embedding S; 2) querying with rasterized queries in the intermediate embedding R (-R-I), which degenerates to [6]; we also baseline against two further recent SBIR techniques: the unshared triplet GoogLeNet-V1 architecture proposed by Sangkloy et al. [28], and the triplet edgemap approach of Bui et al. [5]. We compute class- and instance-level precision for all queries, resulting in 345 × 15 × 5 ≈ 26K MTurk annotations. Our embedding (LS) outperforms all ablations and baselines, with the vector query alone contributing significant margin over raster. The addition of fc layers to create the cross-modal embedding (-R) slightly improves (importantly, does not degrade) the intermediate raster embedding R available via [6]. The method significantly outperforms recent triplet SBIR approaches [28, 5]. Note that the S-I and S-S figures are non-comparable; they search different datasets.
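For clarity on the metric, a minimal sketch of per-query average precision from binary relevance judgments (a standard definition, not the authors' evaluation script; aggregation of the 5 MTurk judgments per result, e.g. by majority vote, is assumed upstream):

```python
def average_precision(relevant):
    """AP for one query from a binary relevance list over the top-k results
    (here k = 15, as judged by MTurk workers)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# usage: mAP over a handful of queries with toy relevance judgments
queries = [[1, 1, 0, 1, 0], [0, 1, 1, 0, 0]]
print(sum(average_precision(q) for q in queries) / len(queries))
```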



                      Ablations: seconds (missed)                                  Baselines: seconds (missed)
Method                LS (Ours)      LS-NI          LS-NI-R        LS-NI-R-I [6]    Sketchy [28]   Bui et al. [5]
Class-level T-T       24.90 (1.33)   38.33 (1.33)   31.74 (0.33)   19.12 (0.00)     46.20 (1.00)   40.13 (1.33)
Instance-level T-T    30.74 (2.00)   45.43 (1.67)   66.46 (3.67)   95.27 (3.67)     80.28 (2.67)   75.02 (1.33)
Mean Avg. T-T         27.67 (3.33)   41.72 (3.00)   45.92 (4.00)   42.69 (3.67)     60.90 (3.67)   54.88 (2.67)

Table 4. Time-to-task user study. Average time to retrieve 20 class- and instance-level search targets (18 participants, 3 per method). Comparing the LiveSketch (LS) interactive method with ablations: (-NI) non-interactive/one-shot; (-R) raster substitutes vector query; (-I) intermediate structure embedding; and with three baselines [6, 28, 5]. Times in seconds; parentheses total the averaged missed queries.

4.2. Evaluating Search Suggestions

MTurk was used to evaluate the relative performance of the sketch interpolation techniques used to form query suggestions (Q′). We benchmark linear (flinear) and spherical linear (SLERP, fSLERP) interpolation [14] in our RNN embedding V with the proposed approach fAP inspired by adversarial perturbations, in which the sketch is perturbed via non-linear interpolation in V due to backpropagation. We also compare to linear and SLERP interpolation within the embedding of the original SketchRNN network of Eck et al. [14], trained using the same data (QD-3.5M).

MTurkers were presented with a pair of sketches Q and Q′ sampled from QD-173K and shown a sequence of 10 sketches produced by each of the 5 interpolation methods. MTurkers were asked to indicate which interpolation looked most natural / human drawn. Each experiment run sampled 300 intra- and 300 inter-category pairs (Q, Q′) picked at random from QD-173K. The experiment was repeated 5 times, yielding 3k annotations from 25 unique MTurkers.

Tbl. 3 summarizes the user study results; an un-paired t-test [31] was run to determine significance (p). Backpropagation (fAP, proposed) outperformed direct linear interpolation (flinear) in V for inter- (18.0% vs. 26.7%, p < 0.002) and intra-category (18.7% vs. 25.3%, p < 0.030) cases (see Fig. 8 for visual examples). Statistically significant results were obtained for fSLERP at p < 0.03. In both cases preference was stronger for inter-category interpolation, likely due to the non-local nature of (Q, Q′) causing linear interpolation to deviate from the manifold of plausible sketches (enforced by fAP). Even linear interpolation in V enabled more natural interpolations in both inter- and intra-category cases vs. the original SketchRNN [14], but this was significant only for the former.

4.3. Evaluating Iterative Retrieval

We evaluate the efficacy of LiveSketch via a time-to-task experiment in which 18 participants were timed searching for 20 targets using 6 methods. We perform 3 ablations to our method (LS): 1) non-interactive (-NI), users are not offered sketch suggestions; 2) sketches are rasterized (-R) rather than processed as vector queries; 3) as -R but searching within the intermediate embedding R, which degenerates to [6] (-R-I). We also baseline against [28, 5].

Fig. 1 provides a representative query, suggestions and clustered results sampled from the study. Tbl. 4 summarizes the results, partitioning across class- and instance-level queries (10 each).

Class-level (category) queries prompted the user to search for a specific object (‘cruise ship sailing on the ocean’). Instance-level (fine-grain) queries prompted the user to search for a specific object with specific pose or visual attributes (‘church with three spires’, ‘side view of a shark swimming left’). Timing began on the first stroke drawn and ended when the user was satisfied that the target had been found (self-assessed). If a user took longer than three minutes to find a target, then the search time was capped and noted as a miss (bracketed in Tbl. 4).

A significant decrease in time-to-task (∼15s) was observed with the interactive method (LS) over the non-interactive variants (LS-NI, LS-NI-R, LS-NI-R-I), although query modality had negligible effect on mean time-to-task for the non-interactive cases. Baselines performed ∼10-20s slower, explainable via the lower retrieval performance reported in subsec. 4.1. In all cases, fine-grain queries took longer to identify, with greater instances of missed searches – however the margin over class-level searches was only ∼6s for the interactive method. Whilst category-level search time was not enhanced by the proposed method, the time taken to produce successful fine-grain sketch queries was significantly reduced, by up to 15s over non-interactive ablations and by a factor of 3 over baselines. All 6 methods used a PQ [18] index and took 30-40ms to run each query over the Stock67M corpus.

5. Conclusion

We presented a novel architecture that, for the first time, enables search of large image collections using sketched queries expressed as a variable length sequence of strokes (i.e. in ‘vector’ form). The vector modality enables a second contribution: live perturbation of the sketched query strokes to guide the user’s search. The search is guided towards likely search intents, interactively specified by user weights attributed to clusters of results returned at each search iteration. Perturbations were generated via backpropagation through the feature encoder network, inspired by adversarial perturbations (FGSM [12]) typically used to attack object classification systems, and applied for the first time here in the context of relevance feedback for visual search. We showed that our interactive system significantly reduces search time over a large (67M) image corpus, particularly for instance-level (fine-grain) SBIR, and that our search embedding (unifying vector/RNN and image/CNN modalities) performs competitively against three baselines [6, 28, 5]. Future work could focus on improvement of the RNN embedding, which can still produce implausible sketches for very detailed (high stroke count) drawings.



References

[1] The Quick, Draw! Dataset. https://github.com/googlecreativelab/quickdraw-dataset. Accessed: 2018-10-11.
[2] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. CoRR Abs, arXiv:1707.07397v2, 2017.
[3] T. Bui and J. Collomosse. Scalable sketch-based image retrieval using color gradient features. In Proc. ICCV Workshops, pages 1-8, 2015.
[4] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse. Generalisation and sharing in triplet convnets for sketch based visual search. CoRR Abs, arXiv:1611.05301, 2016.
[5] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse. Compact descriptors for sketch-based image retrieval using a triplet loss convolutional neural network. Computer Vision and Image Understanding (CVIU), 2017.
[6] T. Bui, L. Ribeiro, M. Ponti, and J. Collomosse. Sketching out the details: Sketch-based image retrieval using convolutional neural networks with multi-stage regression. Elsevier Computers & Graphics, 2018.
[7] J. Collomosse, T. Bui, M. Wilber, C. Fang, and H. Jin. Sketching with style: Visual search with sketches and aesthetic context. In Proc. ICCV, 2017.
[8] J. P. Collomosse, G. McNeill, and L. Watts. Free-hand sketch grouping for video retrieval. In Proc. ICPR, 2008.
[9] D. H. Douglas and T. K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, pages 112-122, 1973.
[10] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? In Proc. ACM SIGGRAPH, volume 31, pages 44:1-44:10, 2012.
[11] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315:972-976, 2007.
[12] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. CoRR Abs, arXiv:1412.6572, 2015.
[13] A. Gordo, J. Almazan, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In Proc. ECCV, pages 241-257, 2016.
[14] D. Ha and D. Eck. A neural representation of sketch drawings. In Proc. ICLR, 2018.
[15] R. Hu and J. Collomosse. A performance evaluation of gradient field HOG descriptor for sketch based image retrieval. Computer Vision and Image Understanding (CVIU), 117(7):790-806, 2013.
[16] S. James and J. Collomosse. Interactive video asset retrieval using sketched queries. In Proc. CVMP, 2014.
[17] N. Jaques, S. Gu, D. Bahdanau, J. Hernandez-Lobato, R. Turner, and D. Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with KL-control. In Proc. ICML, 2017.
[18] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a compact image representation. In Proc. CVPR, 2010.
[19] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. CVPR, 2007.
[20] A. Kovashka and K. Grauman. Attribute pivots for guiding relevance feedback in image search. In Proc. ICCV, 2013.
[21] A. Kovashka, D. Parikh, and K. Grauman. WhittleSearch: Image search with relative attribute feedback. In Proc. CVPR, 2012.
[22] Y. Lee, C. Zitnick, and M. Cohen. ShadowDraw: Real-time user guidance for freehand drawing. In Proc. SIGGRAPH, 2011.
[23] J. Lu, H. Sibai, E. Fabry, and D. Forsyth. No need to worry about adversarial examples in object detection for autonomous vehicles. CoRR Abs, arXiv:1707.03501, 2017.
[24] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In Proc. CVPR, 2017.
[25] Y. Qi, Y.-Z. Song, H. Zhang, and J. Liu. Sketch-based image retrieval via siamese convolutional neural network. In Proc. ICIP, pages 2460-2464, 2016.
[26] F. Radenovic, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In Proc. ECCV, pages 3-20, 2016.
[27] F. Radenovic, G. Tolias, and O. Chum. Deep shape matching. In Proc. ECCV, 2018.
[28] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: Learning to retrieve badly drawn bunnies. In Proc. ACM SIGGRAPH, 2016.
[29] L. Setia, J. Ick, and H. Burkhardt. SVM-based relevance feedback in image retrieval using invariant feature histograms. In Proc. ACM Multimedia, 2005.
[30] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV, 2003.
[31] W. Student. The probable error of a mean. Biometrika, 6:1-25, 1908.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. CVPR, 2015.
[33] G. Tolias and O. Chum. Asymmetric feature maps with application to sketch based retrieval. In Proc. CVPR, 2017.
[34] F. Wang, L. Kang, and Y. Li. Sketch-based 3D shape retrieval using convolutional neural networks. In Proc. CVPR, pages 1875-1883, 2015.
[35] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In Proc. CVPR, pages 1386-1393, 2014.
[36] M. Wilber, C. Fang, H. Jin, A. Hertzmann, J. Collomosse, and S. Belongie. BAM! The Behance Artistic Media dataset for recognition beyond photography. In Proc. ICCV, 2017.
[37] P. Xu, Y. Huang, T. Yuan, K. Pang, Y.-Z. Song, T. Xiang, and T. Hospedales. SketchMate: Deep hashing for million-scale human sketch retrieval. In Proc. CVPR, 2018.
[38] Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy. Sketch me that shoe. In Proc. CVPR, pages 799-807, 2016.
[39] J. Yuan, S. Bhattacharjee, W. Hong, and X. Ruan. Query adaptive instance search using object sketches. In Proc. ACM Multimedia, 2016.
[40] J.-Y. Zhu, Y.-J. Lee, and A. Efros. AverageExplorer: Interactive exploration and alignment of visual data collections. In Proc. SIGGRAPH, 2014.
