
Sketchforme: Composing Sketched Scenes from Text Descriptions for Interactive Applications

Forrest Huang
University of California, Berkeley
Berkeley, U.S.A.
[email protected]

John F. Canny
University of California, Berkeley
Berkeley, U.S.A.
[email protected]

[Figure 1 image: a user gives the description ‘a horse under a tree.’; Sketchforme composes sketches that feed language-learning applications and a sketching assistant.]

Figure 1: Sketchforme synthesizes sketched scenes corresponding to users’ text descriptions to support interactive applications.

ABSTRACT
Sketching and natural languages are effective communication media for interactive applications. We introduce Sketchforme, the first neural-network-based system that can generate sketches based on text descriptions specified by users. Sketchforme is capable of gaining high-level and low-level understanding of multi-object sketched scenes without being trained on sketched scene datasets annotated with text descriptions. The sketches composed by Sketchforme are expressive and realistic: we show in our user study that these sketches convey descriptions better than human-generated sketches in multiple cases, and 36.5% of those sketches are considered to be human-generated. We develop multiple interactive applications using these generated sketches, and show that Sketchforme can significantly improve language learning applications and support intelligent language-based sketching assistants.

ACM Classification Keywords
I.4.9 Image Processing and Computer Vision: Applications

Author Keywords
interactive applications; natural language; sketching; generative models; deep learning; Transformer; interactive machine learning.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

© 2019 Copyright held by the owner/author(s).

INTRODUCTION
Sketching is a natural and effective way for people to communicate artistic and functional ideas. Sketches are abstract drawings widely used by designers, engineers and educators as a thinking tool to materialize their vision while discarding unnecessary details. Sketching is also a popular form of artistic expression among amateur and professional artists. With the pervasive use of sketches across diverse fields, researchers in the HCI and graphics communities developed sketch-based interactive tools to enable intuitive and rich user experiences, such as assisted sketching systems [9, 14], design tools for prototyping [13], and animation authoring tools [3].

While using sketches as an interactive medium offers numerous benefits, producing meaningful and delightful sketches can be challenging for users and typically requires years of education and practice. This is partially due to the array of skills one needs to master for sketching: finding the appropriate abstraction level desired for the sketch, composing such abstraction with visual representations of individual objects, and conveying these objects via precise motor operations with drawing media. Researchers have developed computational methods to assist users in creating sketches, but these methods typically rely on rigid modeling of pixel- or stroke-level visual details without semantic understanding of the sketched content. This restricts the use cases of these approaches to improving the sketching mechanics of user-created sketches.

Recent advances in neural-network-based generative models have drastically increased machines’ ability to generate convincing graphical content, including sketches, from high-level concepts. The Sketch-RNN model [6] demonstrates that recurrent neural networks (RNNs) trained on crowd-sourced data can understand and generate original sketches from various concept classes. Using these techniques, we introduce Sketchforme, the first system that is capable of synthesizing sketches conditioned on natural language descriptions.

The contribution of this paper is two-fold. First, we contribute Sketchforme, a system that uses a novel two-step neural method for generating sketched scenes from text descriptions. Sketchforme first uses its Scene Composer, a neural network that learned high-level composition principles from datasets of human-annotated natural images containing text captions, bounding boxes of individual objects and class information, to generate scene composition layouts. Sketchforme then uses its Object Sketcher, a neural network that learned low-level sketching mechanics, to generate sketches adhering to the objects’ aspect ratios in the composition. Sketchforme composes these generated objects of certain aspect ratios into meaningful sketched scenes.

Second, we contribute and evaluate several applications, including a sketch-based language learning system and an intelligent sketching assistant, to exemplify the importance of Sketchforme in empowering novel sketch-based applications in Section 6. In these applications, Sketchforme creates new interactions and user experiences through the interplay between language and sketches. We envision that the connection Sketchforme draws between language and sketching will enable further engaging and natural human-computer interactions, and open up new avenues for self-expression by users.

RELATED WORK

Assisted sketching tools and tutorials
Prior works have augmented the creative process of sketching with automatically-generated and crowd-sourced drawing guidance. ShadowDraw [14] and EZ-sketching [18] used edge images traced from natural images to suggest realistic sketch strokes to users. The Drawing Assistant [9] extracts geometric structure guides to help users construct accurate drawings. PortraitSketch [22] provides sketching assistance specifically on facial sketches by adjusting geometry and stroke parameters. Researchers also developed crowd-sourced web applications to provide real-time feedback for users to correct and improve sketched strokes [16].

In addition to assisted sketching tools, researchers have also developed sketch tutorial tools to improve users’ sketching proficiency. How2Sketch [7] automatically generates multi-step tutorials for sketching 3D objects. Sketch-sketch revolution [4] provides first-hand experiences created by sketch experts for novice sketchers.

While these methods help users create refined sketches, none of them can synthesize sketches from semantic descriptions as Sketchforme’s sketch generation process does.

Neural-network-based sketch generation models
Sketchforme builds upon the Sketch-RNN model, the first neural-network-based sketch generation model [6]. Sketch-RNN consists of a sequence encoder-decoder model that can unconditionally generate stroke-based sketches based on object classes, and conditionally reconstruct sketches based on users’ input sketches. Sketchforme extends Sketch-RNN’s model for sketching individual objects to support conditional sketch generation based on aspect ratios in the composition layouts.

Neural-network-based text-to-image synthesis
Generating graphical content from text descriptions is a popular ongoing research problem. Recent work on Generative Adversarial Networks (GANs) [17, 23] shows promising results in generating realistic images from text descriptions. GAN-CLS [17] augments the GAN architecture to consider text descriptions, and subsequently generates images based on users’ text input. Extending these works, [8] introduces multiple components to first synthesize a composition and outlines from a text description, and subsequently generate images from this composition and these outlines. This is similar to Sketchforme’s multi-step approach to generating complete sketched scenes from natural language, except in the domain of natural images.

Neural Style Transfer
Several prior works in the computer vision community focus on the research problem of transferring styles between visual content. These works explore image stylization by matching statistics of feature maps (i.e., filters) of pre-trained models and with generative adversarial networks [24, 21]. One possible approach to sketch generation that arises from these techniques is to stylize synthetic images generated from text descriptions. However, this approach likely results in realistic, detailed sketch-style images that contain distracting artifacts. Sketchforme instead focuses on synthesizing abstract sketched scenes from scratch that capture the fundamental ideas of the messages communicated by the scenes.

SYSTEM DESCRIPTION
To support applications that afford sketch- and natural-language-based interactions, we developed Sketchforme, a system providing the core capability of synthesizing sketched scenes from natural language descriptions. Sketchforme implements a two-step approach to generate a complete scene from text descriptions, as illustrated in Figure 2. In the first step, Sketchforme uses its Scene Composer to synthesize composition layouts represented by bounding boxes of individual objects. These bounding boxes dictate the location, size and aspect ratio of the objects in the scene. Sketchforme’s Object Sketcher then uses this information in the second step of the generation process to generate the specific sketch strokes of these objects in their corresponding bounding boxes. These steps mirror the scene-sketching process suggested by sketching tutorials, where the overall composition of the scene is drafted before filling in details that characterize each object [2].
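As a concrete illustration, the pipeline can be summarized in a few lines of Python. This is a minimal sketch under assumed interfaces: scene_composer, object_sketcher and fit_to_box are hypothetical names for illustration, not the released implementation.

# Hypothetical interfaces illustrating Sketchforme's two-step pipeline.
def generate_scene_sketch(description, scene_composer, object_sketcher):
    # Step 1: generate a composition layout: a bounding box
    # [x, y, w, h] plus a class label for each object in the scene.
    layout = scene_composer.generate(description)
    strokes = []
    for box in layout:
        x, y, w, h = box.coords
        # Step 2: sketch each object with its class-specific model,
        # conditioned on the box's aspect ratio r = dy / dx.
        sketch = object_sketcher[box.label].generate(aspect_ratio=h / w)
        # Scale and translate the object sketch into its bounding box
        # (fit_to_box is a hypothetical helper).
        strokes.append(fit_to_box(sketch, x, y, w, h))
    return strokes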

By taking this two-step approach, Sketchforme is able to obtain both high-level scene understanding and knowledge of the relations between individual objects. This enables a multitude of applications that require such understanding. Moreover, this approach overcomes the difficulty end-to-end sketch generation methods have in capturing the global structures of sequential inputs [6]. End-to-end scene sketch generation would also require datasets of dedicated sketch-caption pairs, which are difficult for crowd-workers to create [25] and would be prohibitively large in scale due to the combinatorial explosion of objects in the scenes.

[Figure 2 image: the user’s description ‘a horse under a tree.’ passes through 1) the Scene Composer, which generates a composition layout, e.g., the bounding boxes [0.0, 0.0, 1.0, 0.94, “tree”] (r = 0.94) and [0.06, 0.52, 0.81, 0.48, “horse”] (r = 0.59), and 2) the Object Sketcher, which sketches each object.]

Figure 2: Overall system architecture of Sketchforme, which consists of two steps in its sketch generation process.

Scene Composer: Generating composition layouts
To generate composition layouts of scenes, we first model a composition layout as a sequence of n objects, such that each object generated by the network is represented with 8 values:

$$b_t = [x_t, y_t, w_t, h_t, l_t, box_t, start_t, end_t], \quad t \in [1, n]$$

The first four values describe the object’s bounding box in the scene: x-position, y-position, width and height; $l_t$ is the class label. The last three values are boolean flags used as extra ‘tokens’ to mark actual boxes, the beginning of sequences and the end of sequences.
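For illustration, the layout in Figure 2 could be encoded as the following sequence (a sketch: the class label is shown as a string rather than the label vector the network uses, and the field order follows the formula above).

# Each scene is a sequence of 8-value tuples:
# [x, y, w, h, label, box_flag, start_flag, end_flag]
START = [0, 0, 0, 0, None, 0, 1, 0]  # start-of-sequence token
END   = [0, 0, 0, 0, None, 0, 0, 1]  # end-of-sequence token

scene = [
    START,
    [0.00, 0.00, 1.00, 0.94, "tree",  1, 0, 0],
    [0.06, 0.52, 0.81, 0.48, "horse", 1, 0, 0],
    END,
]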

Using this sequential encoding of scenes, we designed a Transformer-based Mixture Density Network as our Scene Composer to generate realistic composition layouts. Transformer Networks [19] are state-of-the-art neural networks for sequence-to-sequence modeling tasks, such as machine translation and question answering. We use the Transformer to perform a novel task: generating a sequence of objects from a text description c, itself a sequence of words. As multiple scenes can correspond to the same text description, we feed the outputs of the Transformer Network into Gaussian Mixture Models (GMMs) to model the variation of scenes, forming a Mixture Density Network [1].

The generation process of the composition layouts involves taking the previous bounding box $b_{t-1}$ (or the start token) as an input and generating the current box $b_t$. At each time-step, the Transformer model generates an output $t_t$ conditioned on the text input $c$ and previously generated boxes $b_{1...t-1}$ using the self-attention and cross-attention mechanisms built into the architecture. This process is repeated for multiple bounding boxes until an end token is generated:

$$t_t = \mathrm{Transformer}([b_{1...t-1}; c]) \qquad (1)$$

$t_t$ is then projected to the appropriate dimensionality to parametrize the GMMs, with separate projection layers $W_{xy}$ and $W_{wh}$ modeling $P(x_t, y_t)$, the distribution of the bounding boxes’ positions, and $P(w_t, h_t)$, the distribution of the bounding boxes’ sizes. With these distributions, Sketchforme can generate bounding boxes $[x_t, y_t, w_t, h_t]$ by sampling from the distributions in Equations 2 and 3. The GMMs use the projected values as means and co-variances for mixtures of multi-variate Gaussian distributions $M$. These parameters are passed through appropriate activation functions (sigmoid, exp and tanh) to comply with their required ranges.

$$P(x_t, y_t) = \sum_{i \in M} \Pi_{1,i}\, \mathcal{N}(x_t, y_t \mid \mu_1, \Sigma_1), \quad [\Pi_1, \mu_1, \Sigma_1] = a(W_{xy}(t_t)) \qquad (2)$$

$$P(w_t, h_t) = \sum_{i \in M} \Pi_{2,i}\, \mathcal{N}(w_t, h_t \mid \mu_2, \Sigma_2), \quad [\Pi_2, \mu_2, \Sigma_2] = a(W_{wh}([x_t; y_t; t_t])) \qquad (3)$$

While $P(x_t, y_t)$ is modeled only from the first projection layer $W_{xy}$, we condition $P(w_t, h_t)$, the distribution over box sizes, on the position of the boxes, similar to [8]. To introduce this condition, we concatenate $t_t$ with the sampled values $[x_t, y_t]$ as input to the second projection layer, as described in Equation 3. The probabilities of the current box being an actual box, a start token or an end token are generated from the Transformer output using a softmax-activated third projection layer $W_c$:

$$P(box_t, start_t, end_t) = \mathrm{softmax}(W_c t_t) \qquad (4)$$

In addition, Sketchforme separately uses an LSTM to generate the class label vectors $l_t$, because the class labels given a certain description are assumed not to vary across examples. The full architecture of the Scene Composer is shown in Figure 3.
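A sketch of how the three projection heads in Equations 2-4 could be implemented follows. This is an assumption-laden PyTorch illustration: the layer names mirror $W_{xy}$, $W_{wh}$ and $W_c$, the 5-mixture GMMs match the training details below, and the exact parametrization (a weight, a 2-D mean, two standard deviations and a correlation per component) is our reading of the activation functions named above, not the released code.

import torch
import torch.nn as nn

class MDNHeads(nn.Module):
    def __init__(self, d_model=512, n_mix=5):
        super().__init__()
        self.n_mix = n_mix
        # Each 2-D mixture component: 1 weight + 2 means + 2 stds + 1 corr.
        self.W_xy = nn.Linear(d_model, n_mix * 6)      # Eq. 2
        self.W_wh = nn.Linear(d_model + 2, n_mix * 6)  # Eq. 3, sees [x_t; y_t]
        self.W_c = nn.Linear(d_model, 3)               # Eq. 4: box/start/end

    def gmm_params(self, raw):
        n = self.n_mix
        pi, mu, s, rho = raw.split([n, 2 * n, 2 * n, n], dim=-1)
        return (torch.softmax(pi, dim=-1),  # mixture weights sum to 1
                torch.sigmoid(mu),          # means normalized to [0, 1]
                torch.exp(s),               # standard deviations are positive
                torch.tanh(rho))            # correlations lie in (-1, 1)

    def forward(self, t_t, xy):
        # t_t: Transformer output; xy: sampled [x_t, y_t] for conditioning.
        p_xy = self.gmm_params(self.W_xy(t_t))
        p_wh = self.gmm_params(self.W_wh(torch.cat([xy, t_t], dim=-1)))
        p_tok = torch.softmax(self.W_c(t_t), dim=-1)
        return p_xy, p_wh, p_tok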

[Figure 3 image: GloVe vectors of the condition text ‘a person riding a horse’ feed a Transformer encoder-decoder whose output $t_t$ is projected through $W_{xy}$ and $W_{wh}$ into GMMs producing box coordinates $[x, y]$ and sizes $[w, h]$, and through $W_p$ into start/end/box probabilities; a separate 512-unit LSTM encoder-decoder generates the box labels $l_1, l_2, \ldots$]

Figure 3: Model architecture of the Scene Composer used in the first step of Sketchforme’s sketch generation process.


Object Sketcher: Generating individual sketches
After obtaining scene layouts from the Scene Composer, we designed a modified version of Sketch-RNN to generate the individual objects in Sketchforme according to the layouts. We adopt the decoder-only Sketch-RNN, which generates sketches of individual objects as sequences of individual strokes. Sketch-RNN’s sequential generation process involves generating the current stroke based on the previous strokes in the sketched object, a scheme commonly used in sequence modeling tasks. Sketch-RNN also uses a GMM to model the variation of sketch strokes.

While the decoder-only Sketch-RNN generates realistic sketches of individual objects in certain concept classes, the aspect ratio of the output sketches generated by the original Sketch-RNN model cannot be constrained. Hence, sketches generated by the original model may be unfit for assembling into scene sketches guided by the layout produced by the Scene Composer. Further, naive direct resizing of the sketches can produce sketches of unsatisfactory quality for complex scenes.

We modified Sketch-RNN into the Object Sketcher, which factors in the aspect ratios of sketches when generating individual objects. To incorporate this knowledge in the Sketch-RNN model, we compute the aspect ratio of each training sketch and concatenate the aspect ratio $r = \Delta y / \Delta x$ of the sketch with the previous input stroke in the sketch generation process of our modified Sketch-RNN, as shown in Figure 4. The new formulation and output of the Sketch-RNN at stroke-sequence $t$ is:

$$[h_t; c_t] = \mathrm{LSTM}([S_{t-1}; h_{t-1}; r]), \quad y_t = W h_t + b_t \qquad (5)$$
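A minimal sketch of this decoder step, assuming a plain LSTM cell in place of the HyperLSTM used in the actual model and the 20-mixture output dimensionality of the original Sketch-RNN; unlike Equation 5, the recurrent state is carried implicitly by the cell rather than concatenated into the input.

import torch
import torch.nn as nn

# Stroke-5 input [dx, dy, p1, p2, p3] plus the appended aspect ratio r.
lstm = nn.LSTMCell(input_size=5 + 1, hidden_size=2048)
W = nn.Linear(2048, 6 * 20 + 3)  # y_t: GMM and pen-state parameters

def decode_step(stroke_prev, r, state):
    # Appending r to every input reminds the decoder of the target
    # shape at each stroke.
    inp = torch.cat([stroke_prev, r], dim=-1)  # [S_{t-1}; r]
    h_t, c_t = lstm(inp, state)
    y_t = W(h_t)
    return y_t, (h_t, c_t)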

Since each Sketch-RNN model only handles a single object class, we train multiple sketching models and use the appropriate model, based on the class label in the layouts generated by the Scene Composer, when assembling the final sketched scene.

MODEL TRAINING AND DATA SOURCES
Sketchforme’s Scene Composer and Object Sketcher are trained on different datasets that encapsulate visual-scene-level knowledge and sketching knowledge separately. This relaxes the requirement for Sketchforme to be trained on a natural-language-annotated dataset of sketched scenes, which would need to provide highly varied scenes corresponding to realistic scene-caption pairs.

[Figure 4 image: a HyperLSTM with 2048 units takes each previous stroke $S_{t-1}$ concatenated with the aspect ratio $r$; at each step its output $y_t$ parametrizes a GMM and a categorical distribution from which the next stroke $S_t$ is sampled.]

Figure 4: Model architecture of the extended Sketch-RNN in Sketchforme’s Object Sketcher.

We trained the Scene Composer using the Visual Genome dataset [12], which contains natural language region descriptions and object relations of natural images, to demonstrate its flexibility in utilizing various types of scene-layout datasets. Object relations in the dataset each contain a ‘subject’ (e.g., ‘person’), a ‘predicate’ (e.g., ‘on’), and an ‘object’ (e.g., ‘car’), represented by class labels and bounding boxes of the participating objects in the image. Natural language region descriptions are represented by bounding boxes of the regions and description texts that correspond to the regions. We reconcile these two types of information using region graphs in the dataset that pair these two types of data. With the paired data of natural language descriptions and relations, we train the Scene Composer to generate composition layouts. We selected relations whose object classes are among the 100 most commonly used classes and whose predicates are among the 70 most common predicates in the dataset. This dataset of selected object classes and predicates contains 101,968 instances. We split it into a 70% training set, a 10% validation set and a 20% testing set.
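A sketch of this selection step; the field names loosely follow the Visual Genome JSON schema and are assumptions for illustration.

# Select training relations whose classes and predicates are frequent.
def select_relations(relationships, top_classes, top_predicates):
    selected = []
    for rel in relationships:
        subj, pred, obj = rel["subject"], rel["predicate"], rel["object"]
        if (subj["name"] in top_classes and obj["name"] in top_classes
                and pred in top_predicates):
            # Keep the class label and bounding box of both participants.
            selected.append((
                (subj["name"], subj["x"], subj["y"], subj["w"], subj["h"]),
                pred,
                (obj["name"], obj["x"], obj["y"], obj["w"], obj["h"]),
            ))
    return selected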

The Object Sketcher is trained with the Quick, Draw! [11] dataset, which consists of 70,000 training sketches, 2,500 validation sketches and 2,500 testing sketches for each of the 345 object categories in the dataset. As mentioned in Section 3, we preprocess the data by computing the aspect ratio of each sketch as an input to the Object Sketcher, in addition to the original stroke data.
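This preprocessing step could look like the following sketch, assuming sketches in the stroke-3 offset format ([dx, dy, pen_state]) used by the Quick, Draw! data.

import numpy as np

def aspect_ratio(strokes):
    # strokes: array of [dx, dy, pen_state] offsets (stroke-3 format).
    xy = np.cumsum(strokes[:, :2], axis=0)  # absolute pen positions
    dx = xy[:, 0].max() - xy[:, 0].min()    # sketch width
    dy = xy[:, 1].max() - xy[:, 1].min()    # sketch height
    return dy / dx                          # r = Δy / Δx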

Using these data sources, we train multiple neural networks of various configurations and loss functions in Sketchforme. The LSTM architecture in the Scene Composer for generating composition layouts is a stack of two hidden layers of size 512. Similarly, the Transformer Network has the configuration $(d_{model}, N_{layers}) = (512, 6)$.

The Scene Composer is trained by minimizing the negative log-likelihood of both the position and size data:

$$L_{xy} = -\sum_{i=1}^{n} \log(P(x_i, y_i)) \qquad (6)$$

$$L_{wh} = -\sum_{i=1}^{n} \log(P(w_i, h_i)) \qquad (7)$$

and the cross-entropy loss $L_p$ for the categorical outputs:


$$L_p = -\sum_{t=1}^{n} \Big( P(box_t)\log(P(box_t)) + P(start_t)\log(P(start_t)) + P(end_t)\log(P(end_t)) \Big) \qquad (8)$$

For generating the class labels, note that in our network each $l_t$ is represented as a 100-dimensional vector, with each value $l_{i,t}$ corresponding to the output probability of class $i$. $L_{class}$ is thus computed as:

$$L_{class} = -\sum_{t=1}^{n} \sum_{i=1}^{100} l_{i,t} \log(l_{i,t}) \qquad (9)$$

We combine these multiple losses with weight hyper-parameters to obtain a general training objective $L_{SC}$ for the Scene Composer:

$$L_{SC} = \lambda_1 L_{xy} + \lambda_2 L_{wh} + \lambda_3 L_p + \lambda_4 L_{class} \qquad (10)$$

The initial learning rate for the model is $1 \times 10^{-5}$. We use $\lambda_1 = 1.0$, $\lambda_2 = 1.0$, $\lambda_3 = 1 \times 10^{-5}$, $\lambda_4 = 1 \times 10^{-3}$. We use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ to minimize the loss function. We use 5 mixtures in each of the GMMs. We chose these hyper-parameters based on empirical experiments.
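Put together, a training step might look like the following sketch, assuming a model object exposing a hypothetical losses() method that returns the four terms of Equations 6-9.

import torch

def make_training_step(model, lambdas=(1.0, 1.0, 1e-5, 1e-3)):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5,
                                 betas=(0.9, 0.999))
    def training_step(batch):
        # losses() is a hypothetical interface computing Eqs. 6-9.
        L_xy, L_wh, L_p, L_class = model.losses(batch)
        L_SC = (lambdas[0] * L_xy + lambdas[1] * L_wh
                + lambdas[2] * L_p + lambdas[3] * L_class)  # Eq. 10
        optimizer.zero_grad()
        L_SC.backward()
        optimizer.step()
        return L_SC.item()
    return training_step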

The Object Sketcher uses a HyperLSTM cell [5] of size 2048 for the modified Sketch-RNN model. The loss function is identical to the reconstruction loss $L_R$ of the original Sketch-RNN model, which maximizes the log-likelihood of the generated probability distribution of the stroke data $S_t$ at each step $t$. The model is trained with an initial learning rate of 0.0001 and gradient clipping of 1.0.

EXPERIMENTS AND RESULTS
Central to evaluating Sketchforme’s success is assessing its effectiveness in generating realistic and relevant sketches and layouts from text descriptions. We evaluated the data generated by Sketchforme at each step of the generation process qualitatively and quantitatively to demonstrate its effectiveness in generating sketched scenes. We further conducted a user study on the overall utility of the generated sketches to explore their potential in supporting real-world applications.

Composition layout generation
The composition layouts generated by the Scene Composer in the first step of Sketchforme are represented as bounding boxes of individual objects in the scene. While the GMM in the Scene Composer already directly maximizes the log-likelihood of the data, we can evaluate the performance of the model by visualizing and comparing heat-maps created by super-positioning instances of real data and generated data.

Because Sketchforme considers the text input when generating the composition layouts, we should only compare the generated bounding boxes with bounding boxes from the dataset that are semantically similar to the text input. We obtain semantically similar ground-truth compositions by filtering the subjects, objects, and predicates based on the descriptions. For instance, the composition layouts generated from ‘a person riding a horse.’ are compared with all actual compositions having a ‘person’ subject, a riding-related predicate such as ‘on’ or ‘on top of’, and a ‘horse’ object.

The heat-maps in Figure 5 show the distributions of Sketchforme’s synthetic bounding boxes and the ground-truth bounding boxes from the dataset. From these heat-maps, we can obtain a holistic view of the generation performance of the model by visually evaluating the similarity between the heat-maps. We observe similar distributions between the actual relations and the generated composition layouts across all descriptions.

We can further approximate an overlap metric between the distributions using a Monte-Carlo simulation to obtain a quantitative measure of the model’s performance. To estimate the overlap between the generated data distribution and the dataset’s distribution, we generated 100 composition layouts for each prompt, and randomly sampled 1000 data points within each bounding box in these layouts. We estimate the overlap between the distributions by counting the number of data points that lie within the intersections between any generated and ground-truth bounding boxes. We compare Sketchforme’s performance with both a heuristic-based bounding box generator and a naive random bounding box generator. The heuristic-based generator simply places the second bounding box below the first for prompts with ‘above’-related predicates, and vice versa. The random generator samples the values that describe the bounding boxes from uniform distributions, serving as a naive baseline. Table 1 shows the percentage of the sampled data points that lie in the intersections. The overlap between real data and Sketchforme-synthesized data is higher than that of both the heuristic-based generator and the random generator by a large margin, which confirms our qualitative visual inspection of the heat-maps.
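One simple reading of this estimate as code (a sketch: boxes are (x, y, w, h) tuples, and the per-box sampling and counting follow the procedure described above).

import random

def overlap_metric(generated_boxes, truth_boxes, n_points=1000):
    # Sample n_points uniformly inside each generated box and count the
    # fraction that also falls inside some ground-truth box.
    hits, total = 0, 0
    for (x0, y0, w, h) in generated_boxes:
        for _ in range(n_points):
            px = x0 + random.random() * w
            py = y0 + random.random() * h
            if any(x <= px <= x + bw and y <= py <= y + bh
                   for (x, y, bw, bh) in truth_boxes):
                hits += 1
            total += 1
    return hits / total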

Description                Sketchforme  Heuristics  Random
a dog on a chair           89.1%        64.4%       61.6%
an elephant under a tree   68.4%        40.3%       30.6%
a person riding a horse    94.0%        57.7%       51.5%
a boat under a bridge      31.8%        15.0%       6.85%

Table 1: Overlap metric from Monte-Carlo simulations for each description between real data and generated/heuristics-generated/random data.

[Figure 5 images: heat-maps of generated vs. ground-truth bounding boxes for each object in ‘a dog on a chair’ (dog, chair), ‘an elephant under a tree’ (elephant, tree), ‘a person riding a horse’ (person, horse) and ‘a boat under a bridge’ (boat, bridge).]

Figure 5: Heat-maps generated by super-positioning Generated/Visual Genome (ground-truth) data. Each horizontal pair of heat-maps corresponds to an object under a description.

Generating individual object sketches at various aspect ratios
The main addition of Sketchforme to the original Sketch-RNN model is an additional input that allows the Object Sketcher to generate sketches based on the target aspect ratio ($r = \Delta y / \Delta x$) of the completed sketch. We evaluate this approach by generating sketches at various aspect ratios. The Object Sketcher is able to adhere to the input aspect ratios and generate individual object sketches coherent with the ratios. As shown in Figure 6, trees generated with ratio $r = 1.0$ can be perceived as significantly shorter than trees with $r = 2.0$.

[Figure 6 images: generated tree sketches at aspect ratios $r = \Delta y / \Delta x$ of 0.6, 1.0, 2.0 and 5.0.]

Figure 6: Generated sketches of trees with various aspect ratios by the extended Sketch-RNN model in Sketchforme.

Complete scene sketches
Combining the composition layout generation and the individual object sketch generation models, Sketchforme generates complete scene sketches directly from text descriptions. Several examples are shown in Figure 1 and Figure 7. In these figures, sketches that correspond to ‘a boat under a bridge’ consist of small boats under bridges, whereas using ‘an apple on a tree’ as the input creates sketches with small apples on large trees, following the actual sizes and proportions of the objects. Moreover, Sketchforme is able to generalize to the novel concept of ‘a cat on top of a horse,’ even though the only relation involving a cat and a horse in the Visual Genome dataset on which the model was trained is ‘a horse facing a cat.’ The sizes of the cats and horses in these sketches are in proportion to their actual sizes, and the cat is adequately placed on the back of the horse.

[Figure 7 images: complete scene sketches for the descriptions ‘a cat on top of a horse,’ ‘an apple on a tree,’ ‘a boat under a bridge,’ ‘a dog on top of a car,’ and ‘a bike next to a bench.’]

Figure 7: Complete scene sketches generated by Sketchforme. Sketchforme is able to generalize to novel concepts such as ‘a cat on top of a horse.’

Human perception user study
Sketchforme’s high-level goal is to augment users’ ability to communicate in sketches by generating realistic, plausible and coherent sketches for users to interact with and reference in their learning and communication processes. To complement the quantitative and qualitative evaluation of the sketches, we conducted user studies on Amazon Mechanical Turk (AMT) to gauge human subjects’ opinions on the realism of the sketches and their ability to convey the descriptions used to generate them.

Study Procedure
We recruited 51 human subjects on AMT and asked them to each review 50 sketches generated by either humans or Sketchforme. These 50 sketches are generated from five descriptions. The human-generated sketches were obtained from a separate AMT task, conducted prior to this user study, based on Quick, Draw! [11]; they are shown in Figure 8. In this study, subjects are provided with the complete sketched scene and the description that the scene is based on. Subjects are required to respond to the following questions:

1. Do you think this sketch was generated by a computer (AI)or a human?

2. On a scale of 1-5 (1 means the description is conveyed very poorly; 5 means it is conveyed very well), how well do you think the message is conveyed by the sketch?

The subjects are given 10 sketches as trial questions withanswers to (1) provided to them at the beginning of the task.After completing the trial tasks, the subjects’ answers to theremaining 40 sketches are aggregated as the actual study result.


This study protocol is similar to the perception studies commonly used to evaluate synthetic visual content generation techniques in the deep learning community [10]. At the end of the task, we also collected optional comments from the subjects and their perceived overall difficulty of the task.

[Figure 8 images: human and Sketchforme sketches side by side for ‘an airplane in front of a mountain,’ ‘an elephant under a tree,’ ‘a cat on top of a horse,’ ‘a clock on a building,’ and ‘a boat under a bridge.’]

Figure 8: Samples of sketches produced by humans and Sketchforme used in the AMT user study.

Results
The first question probes the realism of the sketches with a Turing-test-style question asking the subjects to determine whether the sketches were created by humans. Subjects on average considered 64.6% of the human-generated sketches to be generated by humans, while they considered 36.5% of Sketchforme-generated sketches to be generated by humans, as shown in Figure 9. Although the percentage of Sketchforme-generated sketches considered human-generated is statistically significantly lower ($p = 1.05 \times 10^{-10}$, paired t-test) than that of human-generated sketches, individual participants commented that it was difficult to distinguish between human-generated and Sketchforme-generated sketches. P2 mentioned that they "really couldn’t tell the difference in most images." P6 commented that they "didn’t know if it was human or a computer." These results demonstrate the potential of Sketchforme to generate realistic sketched scenes.

We hypothesize that one possible reason the Sketchforme-generated sketches were less often considered human-drawn is that the curves of the synthetic sketches are in general less jittery than human-drawn strokes. We suggest future work explore introducing stroke variation to generate more realistic sketches.

The results for the second question reflect the ability of the sketches to communicate their underlying descriptions. The average score for human-generated sketches is $\mu = 3.46$, whereas the average score for Sketchforme-generated sketches is $\mu = 3.21$, as shown in Figure 10. Although the Sketchforme-generated sketches achieved lower scores overall, they achieved statistically better average scores for sketches based on two of the descriptions: ‘a boat under a bridge’ and ‘an airplane in front of a mountain’ ($p < 0.0005$, paired t-test). There is also no significant difference between the scores of human- and Sketchforme-generated sketches based on ‘a cat on top of a horse.’ This shows the competitive performance of Sketchforme-generated sketches in communicating underlying descriptions for some scenes.

[Figure 9 chart: per-description percentages of sketches perceived as human-generated, comparing human-generated and Sketchforme-generated sketches across the five descriptions.]

Figure 9: Percentage of sketches considered by users as human-generated. On average, 64.6% of human-generated sketches are perceived as human-generated, while 36.5% of Sketchforme-generated sketches are perceived as human-generated.

APPLICATIONS
In this section, we explore several applications that can benefit from Sketchforme’s ability to synthesize compelling sketches from natural language descriptions.

Sketch-assisted Language Learning
Sketches have been shown to improve memory [20]. As language learning is a memory-intensive task, Sketchforme could support sketch-based language education applications. These sketches can potentially create engaging and effective learning processes and avoid rote learning.

Language Learning Application
To explore the possibility of Sketchforme supporting language learning, we built a basic language-learning application that educates learners with a translation task from German to English. In this application, learners are presented with a German phrase and are asked to translate it to English in a multiple-choice format, similar to learning term definitions from flash-cards. The application also implements the Leitner system [15] with three bins, repeating most frequently the phrases that learners make the most mistakes on. Using this system, phrases move between bins depending on the participants’ familiarity with the translations; a minimal scheduler of this kind is sketched below.
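This sketch illustrates a three-bin Leitner scheduler of the kind described; the review weights are an assumption for illustration, not the deployed implementation.

import random

def pick_phrase(bins, weights=(4, 2, 1)):
    # bins[0] holds the least familiar phrases and is drilled most often;
    # the weights are assumed review frequencies per bin.
    nonempty = [i for i in range(3) if bins[i]]
    i = random.choices(nonempty, weights=[weights[j] for j in nonempty])[0]
    return i, random.choice(bins[i])

def record_answer(bins, bin_idx, phrase, correct):
    # A correct answer promotes the phrase one bin; a mistake demotes it.
    bins[bin_idx].remove(phrase)
    bins[min(bin_idx + 1, 2) if correct else 0].append(phrase)

A session might start with bins = [list_of_phrases, [], []], promoting phrases toward the last bin as the learner answers correctly.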

We gathered 10 pairs of German-English sentences from a native German speaker and formed two sets of five translations each. In addition, deceptive English sentences are added as distractor choices in the multiple-choice test. We deployed this application on AMT to test whether presenting Sketchforme-generated sketches along with the phrases improves learning performance.


[Figure 10 chart: average scores (out of 5) for conveying the sketches’ descriptions, comparing human-generated and Sketchforme-generated sketches across the five descriptions.]

Figure 10: Average score for the conveying of descriptions by the sketches. Sketchforme-generated sketches perform better than human-generated sketches for ‘a boat under a bridge’ and ‘an airplane in front of a mountain.’

The full application, with the sketches presented to the users, is shown in Figure 11.

Figure 11: User interface for the language learning application powered by Sketchforme. Sketches significantly reduced the time taken to achieve a similar learning outcome.

Study Procedure
The study consists of a training phase and a testing phase for each participant. In the training phase, participants are presented with the correct answers after answering each question. A participant can only advance to the next phase once they have answered all translations correctly in succession, according to the Leitner system. In the testing phase, participants are given one chance to provide their answer to each translation without seeing the correct answers. The participants are divided into two conditions: the ‘control’ group only receives phrases on their interface during training, while the ‘treatment’ group receives both phrases and sketches generated by Sketchforme during the training phase. Both groups receive only the phrases during the testing phase. Moreover, we alternate our two sets of translations between the training and testing phases, such that participants never get consecutive training and testing phases for the same set of descriptions.

The performance of the participants during the study is monitored with multiple analytical metrics, including the completion time of each phase and scores on the testing phase. At the end of the study, we also provide surveys for participants to rate the difficulty of the task and the usefulness of the sketches (if applicable) on five-point Likert scales, and ask them to provide any additional suggestions for the interface.

Results
We recruited 38 participants on AMT. While we did not see significant differences ($p = 0.132$, unpaired t-test) between the ‘control’ and ‘treatment’ groups in the correctness of answers in the testing phase, we discovered that the time taken to complete the learning task was significantly less ($p = 0.011$, unpaired t-test) for the ‘treatment’ group, at 246 seconds on average compared to 338 seconds for the control group, as shown in Figure 12. Moreover, from the post-study survey, we discovered that the ‘treatment’ group in general found the sketches helpful for learning (rated 4.58 on average).

[Figure 12 chart: average time taken to complete the study: 246 seconds with sketches vs. 338 seconds without sketches; the difference is statistically significant.]

Figure 12: Average time taken to complete the language learning task among groups. With Sketchforme-generated sketches, participants took significantly less time to complete the study.

As Sketchforme is an automated system capable of generating sketches from free-form text descriptions, and given these promising results on sketch-assisted language learning, we envision Sketchforme supporting and improving large-scale language learning applications in the future.

Intelligent Sketching Assistant
We designed Sketchforme to support interactive sketching systems through its sequential architecture and multi-step generation process. To demonstrate Sketchforme’s capability of supporting interactive, human-in-the-loop sketching applications, we built a prototype of an intelligent sketching assistant reflecting two potential use-cases:

Auto-completion of scenes
As Sketchforme’s Scene Composer is a sequential architecture that takes the previous object in the scene to generate the next object, we can complete unfinished user scenes instead of starting from a blank canvas by seeding the generation with both the start token $b_0$ and an existing object in the scene created by the user, $b_1$. Figure 13 shows examples of Sketchforme completing a user’s sketch of a horse in step a by adding the other object involved in the scene.
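A sketch of how the sequential Scene Composer supports this; composer.decode_step, START_TOKEN and is_end_token are hypothetical names mirroring the decoding process of Equation 1.

# Scene auto-completion: seed the decoder with the user's existing box
# instead of decoding from the start token alone.
def complete_scene(composer, description, user_box, max_objects=8):
    boxes = [START_TOKEN, user_box]  # b_0 and the user-created b_1
    while len(boxes) < max_objects:
        b_next = composer.decode_step(boxes, description)  # Eq. 1
        if is_end_token(b_next):
            break
        boxes.append(b_next)
    return boxes[1:]  # the layout, including the user's object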


User-steerable generation
Sketchforme’s Scene Composer is capable of generating multiple potential candidate objects at each step while composing the scene layout of the generated sketch. As such, users can select their preferred scene layout from multiple candidates. Figure 13 shows multiple candidates proposed by Sketchforme based on a text description in step b. Moreover, since Sketch-RNN is also capable of generating a variety of sketches, users can also select their preferred sketch for each individual object in the scene.

[Figure 13 images: for the description ‘a horse under a tree.’, a) the user’s partial horse sketch and b) three candidate completions.]

Figure 13: Intelligent sketching assistant powered by Sketchforme. The user can a) create a partial sketch, and Sketchforme b) suggests multiple candidates from which the user chooses the sketch they prefer, given the description ‘a horse under a tree.’

LIMITATIONS

Occlusions and layer order
Sketchforme is trained to encode composition principles from a natural image dataset. In natural images, objects might occlude each other, affecting the sizes and positions of the bounding boxes in the composition layouts. Figure 14a shows several boats that were inadequately placed in front of parts of bridges that should have occluded them. To overcome this limitation, future systems can augment Sketchforme with advanced vision models that determine the layer order in the original natural images. The current Sketchforme system only considers a naive layer order determined by the sequence in which the composition layouts are generated.

Moreover, occluded compositions lead to the generation of overlapping sketches, requiring additional research on realistic methods to handle overlapping sketched objects. For instance, the model that generates composition layouts could enforce constraints to avoid overlaps in the sketches, or follow hand-crafted rules to handle overlaps.

Aspect ratios might be weak signals for object poses
Sketchforme utilizes the aspect ratios of bounding boxes as the primary signal to inform the shapes of sketches of individual objects. These shapes may consequently determine the poses of the sketched objects. Although these shapes can be sufficient to determine correct poses for some object classes, such as the ‘tree’ class, constraining the shapes might be a weak signal for other object classes. These shapes can suggest incoherent perspectives or partial sketches, such as the examples shown in Figure 14b, where only the faces of the elephants were sketched due to the aspect ratios provided to the extended Sketch-RNN model, which is inappropriate for composing sketched scenes. To mitigate this limitation, future work should model the pose of objects in sketches and natural images to augment the composition knowledge of the models, for example by incorporating complete masks of the objects.

[Figure 14 images: (a) Occluded Objects; (b) Incoherent Poses.]

Figure 14: Limitations of Sketchforme’s sketching process. In a), the boat is significantly occluded by the bridge, which affects the quality of the generated sketches. In b), the elephant was provided with a square bounding box, guiding the system to sketch only the face of the elephant, which is inappropriate for the scene.

FUTURE WORK

Conversational sketch suggestion and tutorial system
With the highly interactive media of sketching and language, combined with Sketchforme’s high-level and low-level understanding of each element of a sketched scene, we believe future work should explore conversational interfaces for generating sketches and interactive tutorial systems that guide users to create sketches coherent with text descriptions. Moreover, since the Object Sketcher in Sketchforme’s generation process is capable of completing partial sketches created by users, Sketchforme can suggest possible strokes following incomplete user sketches at the object level, which can be useful in sketch education applications.

Educational applications in other domains
The unique interplay between natural language and sketches embodied by Sketchforme creates possibilities for building new applications that utilize the interactive properties of sketching and language. In this paper, we explored the capability of Sketchforme in supporting basic language learning. We believe future work could explore other domains, such as science and engineering education, as text-annotated sketches are frequently used in these domains.

Coloring and animation
The sketches currently generated by Sketchforme are binary sketch strokes without colors or animations. Future work should explore colored and/or animated sketches to enable richer user experiences. For instance, the natural image datasets that Sketchforme used to train the models in the first step of its sketch-generation process could be used to determine possible colors of the sketched objects.

CONCLUSIONS
This paper presents Sketchforme, a novel sketching system capable of generating abstract scene sketches involving multiple objects based on natural language descriptions. Sketchforme adopts a novel two-step neural-network-based approach: the Scene Composer obtains high-level understanding of the layouts of sketched scenes, and the Object Sketcher obtains low-level understanding of sketching individual objects. These models can be trained without text-annotated datasets of sketched scenes. In the user study evaluating the expressiveness and realism of sketches generated by Sketchforme, human subjects considered Sketchforme-generated sketches more expressive than human-generated sketches for two of the five seeded descriptions. They also considered 36.5% of these sketches to be generated by humans. The sketches generated by Sketchforme significantly improved a sketch-assisted language learning system and enabled compelling intelligent features of a sketching assistant. Sketchforme possesses the potential to support interactive applications that utilize both sketches and natural language as interaction media, and to afford large-scale applications in sketching, language education and beyond.

REFERENCES
1. Christopher M. Bishop. 1994. Mixture Density Networks. Technical Report. Citeseer.

2. Jamie Combs and Brenda Hoddinott. 2011. Drawing for Dummies. Wiley.

3. Richard C. Davis, Brien Colwell, and James A. Landay. 2008. K-sketch: A ’Kinetic’ Sketch Pad for Novice Animators. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’08). ACM, New York, NY, USA, 413-422. DOI: http://dx.doi.org/10.1145/1357054.1357122

4. Jennifer Fernquist, Tovi Grossman, and George Fitzmaurice. 2011. Sketch-sketch Revolution: An Engaging Tutorial System for Guided Sketching and Application Learning. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST ’11). ACM, New York, NY, USA, 373-382. DOI: http://dx.doi.org/10.1145/2047196.2047245

5. David Ha, Andrew M. Dai, and Quoc V. Le. 2016. HyperNetworks. CoRR abs/1609.09106 (2016). http://arxiv.org/abs/1609.09106

6. David Ha and Douglas Eck. 2017. A Neural Representation of Sketch Drawings. CoRR abs/1704.03477 (2017). http://arxiv.org/abs/1704.03477

7. James W. Hennessey, Han Liu, Holger Winnemöller, Mira Dontcheva, and Niloy J. Mitra. 2017. How2Sketch: Generating Easy-To-Follow Tutorials for Sketching 3D Objects. Symposium on Interactive 3D Graphics and Games (2017).

8. Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. 2018. Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7986-7994.

9. Emmanuel Iarussi, Adrien Bousseau, and Theophanis Tsandilas. 2013. The Drawing Assistant: Automated Drawing Guidance and Feedback from Photographs. In ACM Symposium on User Interface Software and Technology (UIST). ACM, St Andrews, United Kingdom. https://hal.inria.fr/hal-00840853

10. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

11. Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. 2016. The Quick, Draw! - A.I. Experiment. (2016). https://quickdraw.withgoogle.com/

12. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vision 123, 1 (May 2017), 32-73. DOI: http://dx.doi.org/10.1007/s11263-016-0981-7

13. James A. Landay. 1996. SILK: Sketching Interfaces Like Krazy. In Conference Companion on Human Factors in Computing Systems (CHI ’96). ACM, New York, NY, USA, 398-399. DOI: http://dx.doi.org/10.1145/257089.257396

14. Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. 2011. ShadowDraw: Real-time User Guidance for Freehand Drawing. ACM Trans. Graph. 30, 4, Article 27 (July 2011), 10 pages. DOI: http://dx.doi.org/10.1145/2010324.1964922

15. S. Leitner and R. Totter. 1972. So lernt man lernen. Herder. https://books.google.com/books?id=sD3AAQAACAAJ

16. Alex Limpaecher, Nicolas Feltman, Adrien Treuille, and Michael Cohen. 2013. Real-time Drawing Assistance Through Crowdsourcing. ACM Trans. Graph. 32, 4, Article 54 (July 2013), 8 pages. DOI: http://dx.doi.org/10.1145/2461912.2462016

17. Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative Adversarial Text to Image Synthesis. CoRR abs/1605.05396 (2016). http://arxiv.org/abs/1605.05396

18. Qingkun Su, Wing Ho Andy Li, Jue Wang, and Hongbo Fu. 2014. EZ-sketching: Three-level Optimization for Error-tolerant Image Tracing. ACM Trans. Graph. 33, 4, Article 54 (July 2014), 9 pages. DOI: http://dx.doi.org/10.1145/2601097.2601202

19. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems. 5998-6008.

20. Jeffrey D. Wammes, Melissa E. Meade, and Myra A. Fernandes. 2016. The drawing effect: Evidence for reliable and robust memory benefits in free recall. The Quarterly Journal of Experimental Psychology 69, 9 (2016), 1752-1776. DOI: http://dx.doi.org/10.1080/17470218.2015.1094494 PMID: 26444654.


21. Lidan Wang, Vishwanath A. Sindagi, and Vishal M. Patel. 2017. High-Quality Facial Photo-Sketch Synthesis Using Multi-Adversarial Networks. CoRR abs/1710.10182 (2017). http://arxiv.org/abs/1710.10182

22. Jun Xie, Aaron Hertzmann, Wilmot Li, and Holger Winnemöller. 2014. PortraitSketch: Face Sketching Assistance for Novices. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (UIST ’14). ACM, New York, NY, USA, 407-417. DOI: http://dx.doi.org/10.1145/2642918.2647399

23. Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris N. Metaxas. 2016. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. CoRR abs/1612.03242 (2016). http://arxiv.org/abs/1612.03242

24. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. CoRR abs/1703.10593 (2017). http://arxiv.org/abs/1703.10593

25. Changqing Zou, Qian Yu, Ruofei Du, Haoran Mo, Yi-Zhe Song, Tao Xiang, Chengying Gao, Baoquan Chen, and Hao Zhang. 2018. SketchyScene: Richly-Annotated Scene Sketches. In Proceedings of the European Conference on Computer Vision (ECCV). 421-436.
