
Predicting Moves-on-Stills for Comic Art using Viewer Gaze Data

Eakta Jain (1), Yaser Sheikh (2), and Jessica Hodgins (2,3)

(1) University of Florida, (2) Carnegie Mellon University, (3) Disney Research Pittsburgh

Abstract—Comic art consists of a sequence of panels, each a different shape and size, that visually communicate the narrative to the reader. When comic books are retargeted to fixed size screens (e.g., handheld digital displays), the technique of moves-on-stills is employed to avoid scaling, cropping or otherwise distorting the artwork. Today, moves-on-stills are created by software where a user inputs the parameters of a desired camera move. We consider the problem of how the parameters for a good move-on-still may be predicted computationally. We observe that artists who work with visual media deliberately direct the flow of viewer attention across the scene. We present an algorithm to predict the move-on-still parameters on a given comic panel from the gaze data of multiple viewers looking at that panel. We demonstrate our algorithm on a variety of comic book panels, and evaluate its performance by comparing with a professional DVD.

Index Terms—I.3.3 Picture/Image Generation, I.3.7.a Animation, L.1.0.a Attention

I. INTRODUCTION

Integrating stills into motion pictures via camera moves is often known as the Ken Burns effect. This effect was traditionally used to engage the viewer through motion when photographs or paintings were shown on television or film. With the advent of personal handheld digital displays, presenting a still picture as a sequence of images seen through a viewport allowed users to view pictures with varied aspect ratios on a fixed size screen (Figure 1). For comic book art in particular, moves-on-stills offer the advantage of engaging a strength of digital displays (the presented material can be animated) while maintaining the defining characteristic of sequential art (each panel is a moment frozen in time).

A portion of this work was done while Eakta Jain was a graduate student at Carnegie Mellon University.

Manuscript received April 19, 2005; revised August 26, 2015.

[Fig. 1 labels: rostrum camera with vertical slider, artwork, and animation table; viewport (1.33:1); comic book panel (1:1.37); viewer gaze data (two clusters); our result for move-on-still based on gaze data.]

Fig. 1. Left: A move-on-still moves a viewport (e.g., 1.33:1) over original artwork (e.g., 1:1.37) to best display the art. Middle: Our method uses viewer gaze data as input. Fixation points are clustered based on their locations (green and blue). Right: The move-on-still predicted by our algorithm starts at the dotted rectangle and ends at the solid rectangle, revealing the mouth of the gun. This digital effect mimics the move-on-still traditionally created by a rostrum camera setup. Original comic panels © MARVEL.


Moves-on-stills were created by a camera operator filming the still with a rostrum camera on an animation table. Today, software such as Adobe Premiere and iMovie allow anyone with a computer to create this effect digitally. While the mechanics have been made simpler, a central question remains unexplored: what makes for a good camera move on a still? For example, if the picture contains a man holding an object in his hand, the camera move should probably present the man's face, and the object in his hand, either moving from the face to the object, or vice versa. Of these two moves, which one should be selected? Will Eisner famously wrote that (for comic art), "...the artist must...secure control of the reader's attention and dictate the sequence in which the reader will follow the narrative..." [1]. Therefore, a good camera move should respect this artistic intent.

However, generating a camera move that supports the narrative in the original still picture is a complex process: it involves guessing the artist's intent by recognizing objects such as 'man' and 'hand', understanding relationships such as 'holding', and figuring out how these relationships influence a camera move. Though computational algorithms are making strides towards some of these subtasks, how to create a camera move that supports the narrative is still not understood. Our key insight is that comic book artists make a conscious effort to direct the viewer's gaze; therefore, recording the gaze data of viewers should tell us the visual route that was used to discover (and conversely, may be used to present) the narrative in the picture. Figure 1 shows the viewer gaze data collected on a comic book panel, and our result.



The first challenge posed by this insight is that viewer eye movements are governed both by characteristics of the visual stimuli, and by individual-specific viewing strategies. As a result, no two viewers will have identical eye movements on a given visual stimulus. This inter-observer variability, and imperfect eyetracker calibration, yield noisy measurements of the regions of interest in the scene being observed. The second challenge is that eye movements do not directly map to camera moves. The eye movements used to traverse still images are typically saccadic movements, i.e., rapid movements of the eye to center the fovea at a region of interest; fixations occur as the eye rests on the region of interest between saccades [2]. Repeated saccadic sequences (scanpaths) are involved when an observer visually processes an image. While eye movements occur to efficiently process visual information, camera moves are intended to present information. In the example of the man holding the object, if the camera were to move as rapidly as the eye does, from the face of the man to the object in his hand, and then back and forth, the resulting video would be jarring to look at.

Previous work has shown that inter-observer variability is reduced in comic art [3], likely as a result of expert direction by skilled artists. This finding suggests that viewer gaze could be used to drive moves-on-stills for comic art. We present an algorithm to predict the parameters of a camera move-on-still from recorded viewer gaze data. Gaze data is converted into fixations and saccades by a dispersion based method using proprietary software provided with the eyetracker. Our method extracts features from fixation locations, saccade directions, and dwell times to predict the parameters of a virtual rostrum camera move (Figure 1). We demonstrate results on a variety of comic books, and evaluate our algorithm by comparing the predicted parameters with the parameters of professionally created camera moves.
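The fixation detection itself is done with the eyetracker vendor's proprietary software. As a rough, self-contained substitute, the sketch below implements the standard dispersion-threshold (I-DT) fixation detector; the function name, threshold values, and the (x, y, t) sample format are our assumptions, not the authors' pipeline.

```python
import numpy as np

def idt_fixations(samples, dispersion_px=30.0, min_duration_ms=100.0):
    """Dispersion-threshold (I-DT) fixation detection (illustrative sketch).

    samples: (N, 3) array of gaze samples [x, y, t_ms], sorted by time.
    Returns a list of (centroid_x, centroid_y, duration_ms) fixations;
    the rapid movements between fixations are the saccades.
    """
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        # Grow a window until it spans at least the minimum duration.
        j = i
        while j < n and samples[j, 2] - samples[i, 2] < min_duration_ms:
            j += 1
        if j >= n:
            break
        window = samples[i:j + 1]
        spread = (window[:, 0].max() - window[:, 0].min()) + \
                 (window[:, 1].max() - window[:, 1].min())
        if spread <= dispersion_px:
            # Extend the window while the dispersion stays under threshold.
            while j + 1 < n:
                w = samples[i:j + 2]
                if (w[:, 0].max() - w[:, 0].min()) + \
                   (w[:, 1].max() - w[:, 1].min()) > dispersion_px:
                    break
                j += 1
            window = samples[i:j + 1]
            fixations.append((window[:, 0].mean(), window[:, 1].mean(),
                              window[-1, 2] - window[0, 2]))
            i = j + 1
        else:
            i += 1  # No fixation starts here; slide the window forward.
    return fixations
```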

II. RELATED WORK

Viewer eye movements are driven both by bottom-up cues, and top-down influences [2]. Artistic wisdom is that photographers, graphic artists, and film directors are able to influence viewers to look at certain regions versus others. Researchers have recognized this difference in the visual perception of artistic stimuli versus 'natural' stimuli. For example, Dorr and colleagues found that the characteristics of eye movements on static images are different from those on continuous videos, and the movement patterns in Hollywood movies are significantly more coherent than natural movies [4]. The work of Jain and colleagues [3] suggests that the artist is able to purposefully direct the visual attention of readers through the pictorial narrative. We are inspired by these findings to use viewer gaze on comic art as an input to predict camera move parameters.

Computer graphics researchers have previously recognized that gaze is a source of high-level information about the underlying scene. Researchers have used viewer gaze to simplify three-dimensional models [5], guide non-photorealistic rendering [6], and crop images [7]. Recently, Jain and colleagues proposed algorithms to retarget widescreen videos based on eyetracking data [8]. We further this past research by observing that, in addition to the attended locations, the order in which people attended to the locations, and the amount of time they spent looking, is also informative of the underlying pictorial composition.

Previous work in computer graphics algorithms relating to comic art includes methods to summarize interactive game play as a sequence of comic images [9], convert a movie clip into a comic [10], and automatically generate manga layout [11]. In our work, we show that instead of a mouse based interface to move a camera or annotate the image, recording the eye movements of viewers implicitly provides the necessary information to create moves-on-stills for comic book images.

III. PREDICTING MOVES-ON-STILLS

We present an algorithm to predict the parameters of moves-on-stills on comic book images from the recorded viewer gaze data on those images. Moves-on-stills were traditionally filmed on a rostrum camera (Figure 1). The artwork was placed so that the camera framed a point of interest. Then it was slowly moved while being filmed so that the camera viewport tracked the point of interest, or panned from one point of interest to the next.


[Fig. 2 flowchart: eye tracking data from multiple viewers → fixation points, saccades, dwell times → unsupervised clustering gives points of interest; the most likely saccade direction gives the pan direction; the ratio of time spent per cluster determines track in or out → create framing window based on actual size of comic panel → render result.]

Fig. 2. An overview of our method.


The degrees of freedom for a digital move are the same as those that were afforded by the physical setup: one degree of vertical movement, toward or away from the animation table, that yields a track in or track out shot, and two degrees of movement on the animation table, North-South and East-West, that result in a pan shot. Rotation of the artwork about the vertical axis is rarely used for comics, and thus, we focus on only the translations. In the digital setup, these degrees of freedom are characterized by the position of the center of the viewport (or framing window), and the dimensions of the viewport around the point of interest. Figure 2 provides an overview of our method. The eyetracking data used in our results was collected as part of previous work, and details of the data collection can be found in [3].
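In code, these degrees of freedom reduce to two framing windows and an interpolation between them. The following minimal sketch (the class and field names are ours, not from the paper) makes the parameterization concrete.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MoveOnStill:
    """A two-point move: pan between two centers, track via a size change."""
    start_center: np.ndarray  # (x, y) in panel pixels
    end_center: np.ndarray
    start_size: np.ndarray    # (width, height) of the framing window
    end_size: np.ndarray

    def window_at(self, t):
        """Framing window at normalized time t in [0, 1] (linear interpolation)."""
        center = (1 - t) * self.start_center + t * self.end_center
        size = (1 - t) * self.start_size + t * self.end_size
        return center - size / 2, size  # top-left corner and size of the crop
```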

A. Points of interest

Fig. 3. The white dots mark the computed cluster centroid for each color-coded cluster. The red crosses and circles indicate the first and second fixations for each of the observers. Left: Noise in data capture leads to reading fixations that are outside the word bubble. Right: Reading and non-reading fixations can be quite close to each other, as in the case of the word bubble and the man's head. Original comic panels © MARVEL.

Points of interest are the semantically meaningful elements of the picture, which the artist used to tell the story, and which the viewport should frame. In the example image in Figure 3 (left), the points of interest are the two faces. It naturally follows that the locations of the recorded fixation points indicate regions of important semantic content. Fixations on word bubbles are not usually informative of the points of interest because word bubbles are placed to avoid occluding a pictorial element. Accordingly, we color segment the word bubbles, identify the fixations which lie inside the segmented region, and remove them. Unsupervised clustering is performed on the locations of the remaining fixation points, based on squared Euclidean distance.
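A minimal sketch of this filtering step, assuming the color segmentation has already produced a binary word-bubble mask (the mask format and the function name are our assumptions):

```python
import numpy as np

def remove_bubble_fixations(fixations, bubble_mask):
    """Drop fixations that land inside segmented word bubbles.

    fixations: (N, 2) array of (x, y) fixation locations in pixels.
    bubble_mask: (H, W) boolean array, True inside a word bubble.
    """
    xs = np.clip(fixations[:, 0].astype(int), 0, bubble_mask.shape[1] - 1)
    ys = np.clip(fixations[:, 1].astype(int), 0, bubble_mask.shape[0] - 1)
    keep = ~bubble_mask[ys, xs]
    return fixations[keep]
```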

We denote the ordered fixation locations for observer s on picture p by $X^{ps}$. Then, $X^{ps} = [X^{ps}_1, X^{ps}_2, \ldots, X^{ps}_{N^{ps}}]^T$, where $X^{ps}_l$ is the l-th fixation location, and $N^{ps}$ is the total number of fixations on this picture. The value $N^{ps}$ could be different because two observers might have different reading speeds, or because one of them fixated longer at each location, while the other jumped back and forth between the locations, for example. In addition, the same observer might attend to a different number of regions on different pictures, or change her speed of perusal.

We perform unsupervised k-means clustering to associate each fixation location $X^{ps}_l \in \mathbb{R}^2$ with its cluster via the label $z^{ps}_l = k$, $k \in \{1, 2, \ldots, K\}$, where K is the total number of clusters and $\Omega_k$ is the set of all fixations belonging to the cluster k. Then, the point of interest $\mu_k$ is the centroid of all the fixation locations belonging to a cluster $\Omega_k$,

$$\mu_k = \frac{1}{|\Omega_k|} \sum_{s=1}^{S} \sum_{l=1}^{N^{ps}} X^{ps}_l \, \delta(z^{ps}_l - k), \qquad (1)$$

where $\delta$ refers to the Kronecker delta function. The dwell time $\tau_k$ on a point of interest is the total duration of all fixations in its cluster. Let $t^{ps}_l$ be the duration of each fixation. Then,

$$\tau_k = \sum_{s=1}^{S} \sum_{l=1}^{N^{ps}} t^{ps}_l \, \delta(z^{ps}_l - k). \qquad (2)$$
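A sketch of Equations 1 and 2 using scikit-learn's k-means; pooling all observers' fixations into flat arrays is our simplification of the per-observer notation, and K = 2 is the value chosen later in this section.

```python
import numpy as np
from sklearn.cluster import KMeans

def points_of_interest(fix_xy, fix_dur, K=2, seed=0):
    """Cluster pooled fixations from all observers (Eqs. 1 and 2).

    fix_xy: (N, 2) fixation locations pooled over all observers.
    fix_dur: (N,) fixation durations in milliseconds.
    Returns centroids mu (K, 2), dwell times tau (K,), and labels z (N,).
    """
    z = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(fix_xy)
    mu = np.array([fix_xy[z == k].mean(axis=0) for k in range(K)])  # Eq. (1)
    tau = np.array([fix_dur[z == k].sum() for k in range(K)])       # Eq. (2)
    return mu, tau, z
```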


In practice, eyetracking data is noisy and it is not straightforward to identify the word bubble fixations. These fixations may lie outside the colored word bubbles because of calibration error. For example, the green colored fixation points in Figure 3 (left) lie slightly outside the word bubble, and are a source of error in the computed centroid (white circle). Sometimes though, the fixation points right outside the word bubbles might not be 'reading fixations', as in Figure 3 (right), where the fixations on the man's face are about as far away from the word bubble as the erroneous data. Thus, simple heuristics, such as dilating the color segmented word bubble, or identifying reading fixations by the typical saccade length between them, are not guaranteed to always work, and we opt to let the clustering method take care of such errors.

Because finding the number of meaningful clusters automatically is not a well-defined problem, we ascertained the number of clusters K based on detailed observations of the professionally produced Watchmen motion comic DVD from DC Comics. Two independent coders coded the camera moves for panels from four pages in the first chapter (31 panels). On average, 77% of the camera moves were marked as camera pans between two points of interest. The coders marked a single-point move when the camera tracked in or out but did not pan significantly. They marked a three or more point move when the scene was not a still picture, for example, when the objects in the panel were animated and the camera followed them. For moves on stills, two points of interest are largely sufficient. Therefore, we specified K = 2 for all our examples.

B. Pan across the image plane

The panning shot is generated by moving the viewport from one point of interest to the other. Let the direction of the pan be represented by the indicator variable $\Theta \in \{0, 1\}$,

$$\Theta = \begin{cases} 0 & \text{if } \mu_0 \rightarrow \mu_1, \\ 1 & \text{if } \mu_1 \rightarrow \mu_0, \end{cases} \qquad (3)$$

where $\mu_0 \rightarrow \mu_1$ denotes that the start point of interest is $\mu_0$ (the centroid of the first cluster) and the end point of interest is $\mu_1$ (the centroid of the second cluster), and $\mu_1 \rightarrow \mu_0$ denotes the opposite direction. This direction is predicted based on the onset cluster $\phi$, and the dominant direction $\xi$.

The onset cluster is the region that a new observer will likely look at when they first look at the given still image; that is, it is the cluster that contains the onset fixations of the observers. The onset cluster $\phi \in \{0, 1\}$ takes the value $\phi = 0$ if the first and second fixations fall in the first cluster, arbitrarily denoted cluster 1, and takes the value $\phi = 1$ if the first and second fixations fall in the second cluster, denoted cluster 2. The dominant direction is the direction in which an observer is likely to peruse the image, that is, the more likely saccade direction. The direction indicator $\xi \in \{0, 1\}$ takes the value $\xi = 0$ if the saccade direction is more often from cluster 1 to cluster 2, and $\xi = 1$ otherwise. Because people often look back and forth between two clusters, when the number of saccades from cluster 1 to cluster 2 is equal to the number of saccades from cluster 2 to cluster 1, the onset cluster is the tie-breaker.

The cues are integrated into a likelihood L that the pan will follow the direction represented by $\Theta$,

$$L(\Theta = i) = p(\phi = i)\, p(\xi = i). \qquad (4)$$

We denote the number of first and second fixations in a cluster by $n(\phi = i)$. Then, $p(\phi = i)$ is computed as

$$n(\phi = i) = \sum_{s=1}^{S} \delta(z^{ps}_1 - i) + \sum_{s=1}^{S} \delta(z^{ps}_2 - i), \qquad (5)$$

$$p(\phi = i) = \frac{n(\phi = i)}{\sum_{j=1}^{K} n(\phi = j)}. \qquad (6)$$

Both the first and second fixations are included while computing $\phi$ because the first fixation may not be captured reliably, and may be subject to center bias, that is, the tendency for observers to fixate in the center of the screen when the visual stimulus switches.

The probability that the dominant direction of eye movement is $\xi = 0$ is proportional to the number of times people look from the cluster $k_1 = 1$ to the cluster $k_2 = 2$, and vice versa for $\xi = 1$. Let us denote this count by $n(\xi = i)$. Then,

$$n(\xi = i) = \sum_{s=1}^{S} \sum_{l=1}^{N^{ps}-1} \delta(z^{ps}_l - k_1)\, \delta(z^{ps}_{l+1} - k_2), \qquad (7)$$

$$p(\xi = i) = \frac{n(\xi = i)}{\sum_{j=0}^{1} n(\xi = j)}. \qquad (8)$$

We estimate the direction of the pan across the image as

$$\Theta_{\mathrm{MLE}} = \arg\max_{\Theta} L(\Theta). \qquad (9)$$
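Combining Equations 4 through 9, a sketch of the direction estimate; it assumes each observer's fixations have been relabeled with cluster indices 0 and 1 (matching $\mu_0$ and $\mu_1$), and the guard against empty counts is ours:

```python
import numpy as np

def pan_direction(label_seqs):
    """Estimate the pan direction Theta (Eqs. 4-9).

    label_seqs: list of 1-D arrays; each holds one observer's ordered
    cluster labels (0 or 1), one label per fixation.
    Returns 0 for a pan mu_0 -> mu_1 and 1 for mu_1 -> mu_0.
    """
    n_phi = np.zeros(2)  # onset counts over first and second fixations (Eq. 5)
    n_xi = np.zeros(2)   # saccade-direction counts (Eq. 7)
    for z in label_seqs:
        for l in range(min(2, len(z))):
            n_phi[z[l]] += 1
        for a, b in zip(z[:-1], z[1:]):
            if a == 0 and b == 1:
                n_xi[0] += 1  # saccade from cluster 0 toward cluster 1
            elif a == 1 and b == 0:
                n_xi[1] += 1
    p_phi = n_phi / max(n_phi.sum(), 1)  # Eq. (6)
    p_xi = n_xi / max(n_xi.sum(), 1)     # Eq. (8)
    likelihood = p_phi * p_xi            # Eq. (4)
    if likelihood[0] != likelihood[1]:
        return int(np.argmax(likelihood))  # Eq. (9)
    return int(np.argmax(n_phi))  # full tie: the onset cluster decides
```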


Fig. 4 panel annotations:
(a) Onset cluster = Woman's face; Dominant direction = Woman's face to man's face; Chosen trajectory = Woman's face to man's face.
(b) Onset cluster = Woman's face; Dominant direction = Tie; Chosen trajectory = Woman's face to hand.
(c) Onset cluster = Ironman suit; Dominant direction = Tie; Chosen trajectory = Ironman suit to Tony Stark.
(d) Onset cluster = Man's head; Dominant direction = Bottom to top; Chosen trajectory = Bottom to top.

Fig. 4. Fixation data overlaid on example panels. The green and blue markers are for the different clusters. The red crosses indicate the first fixation for an observer and the red circles indicate the second fixation. The start cluster is marked with a red label. Original comic panels © MARVEL.

The examples in Figure 4 illustrate the various possibilities for the two cues. In panel (a), $\phi$ and $\xi$ both vote for cluster 2 as the starting cluster. In panels (b) and (c), there is no clear dominant direction (the observers look back and forth between the clusters) and the onset cluster $\phi$ is the tie-breaker for the estimated direction of pan. In panel (d), the onset cluster $\phi$ is the man's head, while the dominant direction $\xi$ is bottom to top. The onset cluster is the man's head because the first fixations were mostly at the bottom part of the screen, but the second fixations were on the man's head. However, the observers all finished looking at the image from bottom to top, leading to a strong dominant direction. The chosen direction for the pan is bottom to top for this panel.

C. Track in or out

The track in or out shot refers to the movement of the camera towards or away from a subject "...to increase or decrease its importance within the context of a story..." [12]. For a move-on-still, the camera should track in if the end point is more important than the start point, and track out if the start point is more important. Because people dwell longer on regions that are more interesting or important to them, the dwell time $\tau_i$ on a point of interest $\mu_i$ is a natural cue to drive the track shot.

Let $\mu_i$ be the start point of interest and $\mu_j$ be the end point of interest ($i, j \in \{0, 1\}$). Then, the tracking shot is a track in if $\tau_i < \tau_j$ and a track out if $\tau_i > \tau_j$. This shot is generated digitally by increasing, or decreasing, the size of the framing window based on the ratio of the dwell times. For a track in, the size of the start window is the largest window of the desired aspect ratio that can fit inside the panel, and the size of the end window is made smaller by scaling with the scaling factor $\gamma_j$,

$$\gamma_j = c + (1 - c)\,\frac{\tau_i}{\tau_j} \quad \text{where } \tau_i \le \tau_j. \qquad (10)$$

For a track out, the size of the end window is the largest window of the desired aspect ratio that can fit inside the panel, and the size of the start window is smaller by the scaling factor $\gamma_i$,

$$\gamma_i = c + (1 - c)\,\frac{\tau_j}{\tau_i} \quad \text{where } \tau_i > \tau_j. \qquad (11)$$

The parameter c = 0.75 for the Watchmen panels, and c = 0.5 for all other comic panels. This parameter allows the user to control how sharply the camera will track in or out. For example, a higher value of c would mean that there is not much difference in the sizes of the start and end window, even if the cluster dwell times are unequal. A lower value of c would result in a larger difference in window sizes, to create a more dramatic effect, for example. In our examples, we chose the values of c visually, based on how dramatic a track was comfortable to watch and the typical sizes of faces in each set. The final dimensions $D_i$ and $D_j$ of the framing windows are computed as

$$D_i = \gamma_i D, \qquad (12)$$
$$D_j = \gamma_j D, \qquad (13)$$

where D is the render size, e.g., $D = [640, 480]^T$.

The computation thus far fixes the direction the camera will move in a plane parallel to the still image, and whether it will move into or away from the still image. The final step in rendering the move-on-still is to compute the trajectory of the center of the framing window.
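A sketch of Equations 10 through 13; for brevity it omits clamping the full-size window to the largest window of the desired aspect ratio that fits inside the panel:

```python
import numpy as np

def window_sizes(tau_start, tau_end, D, c=0.5):
    """Start and end framing-window sizes from dwell times (Eqs. 10-13).

    tau_start, tau_end: total dwell times on the start and end points of
    interest (Eq. 2). D: render size, e.g. np.array([640, 480]).
    c: user parameter controlling how sharply the camera tracks.
    """
    if tau_start <= tau_end:
        # Track in: the end point is more important, so the end window shrinks.
        g_start, g_end = 1.0, c + (1 - c) * tau_start / tau_end  # Eq. (10)
    else:
        # Track out: the start window is the smaller one.
        g_start, g_end = c + (1 - c) * tau_end / tau_start, 1.0  # Eq. (11)
    return g_start * D, g_end * D  # Eqs. (12) and (13)
```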

D. Rendering the move-on-still

Because gaze data contains semantic information, we can identify the points of interest in a still picture, the order in which they should be traversed, and their relative importance for the viewer. Gaze data does not, however, provide any information about how the rostrum camera operator would frame these points of interest within the viewport (or framing window). We use the well-known photography rule of thirds to resolve this ambiguity. According to this rule, the framing window should be divided into a 3×3 grid with vertical and horizontal guides; the point of interest should lie on either a vertical or horizontal guide for a visually pleasing effect (see Figure 5).


[Fig. 5 labels: comic panel; framing window (W × H); horizontal third guides at h/3 and 2h/3, vertical guides at w/3 and 2w/3; pan direction; vertical move (biased to be smaller); vertical move (biased to be larger); unexpected move.]

Fig. 5. Left and middle: The center of the frame is computed so that the point of interest is framed according to the rule of thirds, and such that the point of interest has the same relative spatial location inside the framing window as it does in the panel. Right: We could use scripted rules to place the framing window, for example, the start point of interest lies on the left intersection of the guides and the end point lies on the right intersection of the guides. However, this type of scripted rule can result in a camera that moves opposite to the desired direction because of the selection of guide points.

Our method classifies the predicted move-on-still as a North-South move if the pan direction is between 15° and 165°, and as a South-North move if the pan direction is between −15° and −165° (similarly for East-West and West-East). For a North-South move (Figure 5, left), the x position of the framing window is computed such that the ratio of the distances of the point of interest to the boundaries is the same in the framing window as in the original panel. Let the start window be parametrized by the top-left corner $(x_1, y_1)$ and the bottom-right corner $(x_2, y_2)$. Its size $D_i$ was computed in Equation 12. Recall that the position of the point of interest is $(\mu_1(1), \mu_1(2))$. Then,

$$x_1 = \mu_1(1) - \frac{\mu_1(1)}{W}\, D_i(1), \qquad (14)$$

$$x_2 = \mu_1(1) + \frac{W - \mu_1(1)}{W}\, D_i(1). \qquad (15)$$

The y position of the framing window is computed such that the point of interest lies on the horizontal third guides. We experimented with two variants: first, using the upper third guide for the start window and the lower third guide for the end window (a smaller camera move, as in Figure 5 (left)), and second, using the lower third guide for the start window and the upper third guide for the end window (a larger camera move, as in Figure 5 (right)). For the first case,

$$y_1 = \mu_1(2) - \frac{1}{3}\, D_i(2), \qquad (16)$$

$$y_2 = \mu_1(2) + \frac{2}{3}\, D_i(2). \qquad (17)$$

Fig. 6. Two panels (Panel 3 and Panel 4) from a Spiderman comic book. Original comic panels © MARVEL. Top: Fixation locations (green circles and violet squares), computed points of interest (white dots). Bottom: In the camera move predicted by our algorithm, the dotted rectangle is the start window, the outlined rectangle is the end window.

The equations for the second case are similar. We found the larger camera move to be the more pleasing option as it displayed a greater amount of the underlying comic art. In Section V, we also numerically compare the two cases with a professional DVD.
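A sketch of the framing computation for a North-South move (Equations 14 through 17); packaging the two variants behind boolean flags is our own choice:

```python
def frame_window_ns(mu, D, W, larger_move=True, is_start=True):
    """Place the framing window for a North-South move (Eqs. 14-17).

    mu: (x, y) point of interest; D: (width, height) of the framing
    window; W: panel width. Returns the corners (x1, y1, x2, y2).
    """
    # Horizontal placement preserves the point's relative position (Eqs. 14, 15).
    x1 = mu[0] - (mu[0] / W) * D[0]
    x2 = mu[0] + ((W - mu[0]) / W) * D[0]
    # Vertical placement puts the point on a horizontal third guide:
    # smaller move = upper guide for the start window, lower for the end;
    # larger move = the opposite assignment.
    on_upper_guide = (is_start != larger_move)
    if on_upper_guide:
        y1, y2 = mu[1] - D[1] / 3.0, mu[1] + 2.0 * D[1] / 3.0  # Eqs. (16), (17)
    else:
        y1, y2 = mu[1] - 2.0 * D[1] / 3.0, mu[1] + D[1] / 3.0
    return x1, y1, x2, y2
```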

Figure 5 (right) shows a possible failure case if we rely only on scripted rules based on the direction of movement and the horizontal and vertical guides (frame the start point of interest on the upper left guide intersection and the end point of interest on the lower right guide intersection). The start point of interest is to the left of the end point of interest, but placing them on the guides leads to an opposite camera move. To avoid such cases, we use the relative spatial location of the point of interest in the still picture to place the framing window.

IV. RESULTS

We present moves-on-stills for panels taken from a variety of comic books: Watchmen, Ironman: Extremis, The Three Musketeers, Hulk: WWH, Black Panther, and The Amazing Spiderman: New Avengers. Our results are best seen as video, and are presented as part of the accompanying movie. We show a subset of the complete set of results as figures in this section. The fixation locations are either green circles or violet squares, based on their cluster. The red markers mark the first and second fixation points, which are used to compute the onset cluster probability for the pan shot. The large white dot markers are the computed points of interest. There is only one user-defined parameter in our method. The parameter c = 0.75 for the Watchmen panels, and c = 0.5 for all other comic panels.


Fig. 7. Three panels (52, 54, and 55) from a World War Hulk comic book. Original comic panels © MARVEL. Top: Fixation locations (green circles and violet squares), computed points of interest (white dots). Bottom: In the predicted camera move, the dotted rectangle is the start window, the outlined rectangle is the end window.


Figure 6 shows two example panels from the Spiderman comic book. In Panel 3, the points of interest are the sneaking man and the sleeping woman, and in Panel 4, the points of interest are the face of the man and the sleeping woman. These panels demonstrate why gaze is a valuable cue: relying on a face detector or similar non-contextual cues would likely have missed the sleeping lady as a point of interest, yet we can see that the artist intended for the viewer to look at her. For Panel 3, the camera pans from right to left, and for Panel 4, the camera pans from top to bottom.

Figure 7 shows three panels from Hulk: WWH. Our algorithm identifies the points of interest as the kneeling man and the cloaked man, and predicts a downward pan, a track with a pan, and an upward pan, respectively. Sample frames from the result are shown in Figure 8. Figure 9 shows example panels from the Ironman graphic novel. In Panel 3 and Panel 7, the points of interest are clearly the face and hand, and the suit and face, respectively. In Panel 31, the starting point of interest is shifted slightly to the right of the woman's face because of stray fixation points. The camera tracks out for Panels 3 and 7 to reveal the paper in the woman's hand and the man looking at the suit. The camera tracks in towards the man's face for Panel 31, correctly capturing the semantics of the scene.

Fig. 8. Three panels (52, 54, and 55) from a World War Hulk comic book. Original comic panels © MARVEL. Sample frames from the result movie.

Fig. 9. Three panels (3, 7, and 31) from Ironman. Original comic panels © MARVEL. Top: Fixation locations and the computed points of interest. Bottom: Camera move predicted by our algorithm. The dotted rectangle is the start window, the outlined rectangle is the end window.

V. EVALUATION

We evaluated the camera moves predicted by our algorithm by comparing with the camera moves created by a professional artist for the Watchmen motion comic DVD by DC Comics. Two independent coders watched an excerpt from the DVD and marked the center of the framing window on the corresponding comic panel. The coders coded four comic book pages (31 panels). Three panels were discarded because a coder made a mistake, or because the panel was not included in the DVD.

Amongst the remaining 28 panels, coder 1 marked 19 panels with two-point moves and coder 2 marked 24 panels with two-point moves (that is, the camera panned linearly between a start and an end point). Because our algorithm generates two-point moves, we numerically evaluate panels coded with multiple-point moves by considering only the largest segment of camera movement. The average root mean squared distance between the window centers marked by the two coders was 104.8 pixels, which is approximately 10% of the largest dimension of a panel. We filtered out the cases where the coders were not consistent by discarding the panels where the distance between their annotations was greater than 10% of the panel dimension (105 pixels). Thus, we have reliable ground truth for 20 panels.


[Fig. 10(a) legend: Procedure 1: smaller move (blue circles); Procedure 2: bias to edge (red squares); Procedure 3: aspect ratio move, average of 5 runs (green triangles); sample run from Procedure 3 (black plus signs). Angular axis 0°-180°, radial axis 100-500 pixels.]

Fig. 10. Evaluation and limitations. (a) Evaluation of our method based on ground truth obtained from a professional DVD. We show two versions of our method (red squares and blue circles), along with a naive procedural camera move, shown as green triangles. Points closer to the origin and the 0° line represent a good match with the DVD. (b) Top: because people fixate on the back of the head as well as the face, the cluster center shifts towards the ear. Bottom: though the camera move may be a two-point move, two clusters may not accurately extract the points of interest. Original comic panels © MARVEL.


Figure 10 (a) is a polar plot where the radial distance is the root mean squared distance between the window center predicted by our method and the window center for a professional DVD, averaged over both coders, and the angular value is the included angle between the direction predicted by our method and the professional DVD. Each marker represents one of the evaluated panels. The blue circles mark the version of our algorithm which is biased towards smaller camera moves, while the red squares show the version of our algorithm which is biased towards showing more of the underlying artwork (consequently, a larger camera move). The green triangle markers are a procedural camera move: the camera moves vertically or horizontally based on the aspect ratio of the input comic panel, and the direction (whether up or down, left or right) depends on a coin toss. The plotted points were obtained by averaging five runs. An example run is shown with black plus signs.
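As a sketch, one plausible reading of these two plot coordinates for a single panel against a single coder's annotation (the exact averaging over coders is our interpretation):

```python
import numpy as np

def compare_to_reference(pred_start, pred_end, ref_start, ref_end):
    """Metrics plotted in Fig. 10(a), per panel (illustrative sketch).

    All arguments are (x, y) window centers as numpy arrays. Returns the
    RMS distance between corresponding centers (radial axis) and the angle
    in degrees between the two move directions (angular axis).
    """
    rms = np.sqrt((np.sum((pred_start - ref_start) ** 2) +
                   np.sum((pred_end - ref_end) ** 2)) / 2.0)
    v_pred, v_ref = pred_end - pred_start, ref_end - ref_start
    cos_a = np.dot(v_pred, v_ref) / (np.linalg.norm(v_pred) *
                                     np.linalg.norm(v_ref))
    angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return rms, angle
```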

Points closer to the origin and the 0° line represent good performance because the predicted camera parameters match the DVD both in the location of the framing windows and the direction of the move. Points on the 180° line show the panels where our algorithm predicted a pan direction opposite to the DVD. Our gaze-driven method predicts the same direction as the DVD in 12 of the 20 test panels. In 2 panels, we predict a pan, whereas the coders annotate the camera move to be an in-place track. For 6 panels, we predict a direction opposite to the DVD. This might occur because people tend to look at faces first; if there is no clear dominant direction, the onset cluster is responsible for the direction of the camera move.

To better understand how well we were able to identify the points of interest (independent of direction), we computed the mean distance by ignoring the order of the start and end centers. Procedure 2 (our method, biased towards showing more of the comic panel) and Procedure 1 (our method, smaller move) are equivalent in terms of center distance, averaged over the test panels (115.8 and 115.6 pixels), and are closer to the DVD than the aspect-ratio-based move (average distance to DVD = 180 pixels). The advantage of our method is especially clear in the panels where the method is able to correctly identify that the points of interest are not located near the top or the bottom of the panel.

VI. DISCUSSION

Camera moves are the product of sophisticated creative thought; higher order cognitive processes are involved in recognizing visual elements, inferring their relationship, and figuring out a camera move that supports the narrative in the still picture. Though the cognitive processes are not well understood, a behavioral cue is available to be recorded: gaze. Our key insight is that recorded gaze data contains the information needed to predict plausible moves-on-stills for comic art pictures (given that the comic book artist was successful in directing the flow of viewer attention). We demonstrated an algorithm to extract the parameters of a two-point camera move from the gaze data of viewers. We evaluated our algorithm by comparing the predicted parameters with the parameters of professionally created moves-on-stills.

A limitation of our method is that although human fixations are clustered around objects of interest in the image, the computed centroid is a partial representation of this object. For example, in Figure 10(b) (top), people fixate on the back of the head as well as the face, and the k-means algorithm computes the cluster centroid to be closer to the ear than the front of the face. To fully describe the object of interest, we would need to compute a bounding box in addition to the centroid, to ensure that the camera frames the entire face. Our current implementation also assumes that there are (for the most part) two points of interest in a comic book panel, and thus, that it is sufficient to extract two clusters from the recorded viewer fixation points. We observed that even though the camera move may be well specified as a two-point move, the eye movements may not cluster well into two clusters, as in Figure 10(b) (bottom). An algorithm that continues to pan till it has enclosed all the fixated locations might capture this scene better. Saccade amplitudes and orientations could also be used to refine the computed clusters.



We create the track in or out shot from the ratio of the total fixation durations on the associated clusters, based on the principle that people tend to fixate longer on the region that is more interesting to them; they want to look at this region more closely, therefore a camera should track in towards it. Other cues to predict the size of the window could include metrics that describe the spread of the fixation locations in the associated cluster, such as the distance between the two farthest points in a cluster, or the standard deviation of the distribution of fixation locations. However, these metrics could be noisy when the fixation points in a cluster are not well modeled as a Gaussian (for example, the green cluster in Figure 4(b)). We note that the ambient-focal theory could explain how fixation duration and spread are related. According to this theory, there are two modes of visual processing: ambient processing, which is characterized by larger saccade amplitudes and shorter fixation durations (perhaps to explore the spatial arrangement of the scene), and focal processing, which is characterized by longer fixation durations and shorter saccades (probably to gather detailed information about a particular object) [13], [14], [15]. Thus, we expect that a longer total fixation duration on a cluster would likely be associated with a smaller spread in the fixation locations.

Finally, it may happen that camera moves created by a cinematographer, who knows where the story is going, will contain intentional reveals. Because our algorithm uses information from gaze data on only the current panel (and the viewer does not know what is going to happen next), the predicted moves will contain only as much surprising revelation as the comic book artist was able to achieve through manipulating viewer gaze.

REFERENCES

[1] W. Eisner, Comics and Sequential Art. W. W. Norton & Company, 2008.
[2] S. E. Palmer, Vision Science: Photons to Phenomenology. The MIT Press, 1999.
[3] E. Jain, Y. Sheikh, and J. Hodgins, "Inferring artistic intention in comic art through viewer gaze," in ACM Symposium on Applied Perception (SAP), 2012.
[4] M. Dorr, T. Martinetz, K. Gegenfurtner, and E. Barth, "Variability of eye movements when viewing dynamic natural scenes," Journal of Vision, vol. 10, 2010.
[5] S. Howlett, J. Hamill, and C. O'Sullivan, "Predicting and evaluating saliency for simplified polygonal models," ACM Transactions on Applied Perception (TAP), vol. 2, 2005.
[6] D. DeCarlo and A. Santella, "Stylization and abstraction of photographs," ACM Transactions on Graphics (TOG), vol. 21, no. 3, 2002.
[7] A. Santella, M. Agrawala, D. DeCarlo, D. Salesin, and M. Cohen, "Gaze-based interaction for semi-automatic photo cropping," in ACM Conference on Human Factors in Computing Systems (CHI), 2006.
[8] E. Jain, Y. Sheikh, A. Shamir, and J. Hodgins, "Gaze-driven video re-editing," ACM Transactions on Graphics (TOG), 2015.
[9] A. Shamir, M. Rubinstein, and T. Levinboim, "Generating comics from 3D interactive computer graphics," IEEE Computer Graphics & Applications, vol. 26, no. 3, 2006.
[10] M. Wang, R. Hong, X.-T. Yuan, S. Yan, and T.-S. Chua, "Movie2Comics: Towards a lively video content presentation," vol. 14, no. 3, 2012.
[11] Y. Cao, R. W. Lau, and A. B. Chan, "Look over here: Attention-directing composition of Manga elements," ACM Transactions on Graphics (TOG), vol. 33, no. 4, 2014.
[12] S. D. Katz, Film Directing Shot by Shot: Visualizing from Concept to Screen. Michael Wiese Productions, 1991.
[13] J. R. Antes, "The time course of picture viewing," Journal of Experimental Psychology, vol. 103, no. 1, 1974.
[14] P. J. A. Unema, S. Pannasch, M. Joos, and B. M. Velichkovsky, "Time course of information processing during scene perception: The relationship between saccade amplitude and fixation duration," Visual Cognition, vol. 12, 2005.
[15] B. Follet, O. Le Meur, and T. Baccino, "New insights into ambient and focal visual fixations using an automatic classification algorithm," i-Perception, vol. 2, no. 6, 2011.


Eakta Jain is an Assistant Professor of Computer and Information Science and Engineering at the University of Florida. She received her PhD and MS degrees from Carnegie Mellon University, and her B.Tech. from IIT Kanpur. Her research interests are perceptually-based computer graphics algorithms to create and manipulate artistic content, including traditional hand animation, comic art, and films.

Yaser Sheikh is an Associate Professor at the Robotics Institute, Carnegie Mellon University. His research interests span computer vision, computer graphics, and robotics. He has won Popular Science's Best of What's New Award, the Honda Initiation Award (2010), and the Hillman Fellowship for Excellence in Computer Science Research (2004). He received his PhD in 2006 from the University of Central Florida, and his BS degree from the Ghulam Ishaq Institute of Engineering Science and Technology in 2001.

Jessica Hodgins is a Professor in the Robotics Institute and Computer Science Department at Carnegie Mellon University, and Vice President of Research, Disney Research. She received her Ph.D. in Computer Science from Carnegie Mellon University in 1989. Her research focuses on computer graphics, animation, and robotics with an emphasis on generating and analyzing human motion. She has received an NSF Young Investigator Award, a Packard Fellowship, and a Sloan Fellowship. In 2010, she was awarded the ACM SIGGRAPH Computer Graphics Achievement Award.