
Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes

Tim J. Smith, Birkbeck, University of London, London, UK

Parag K. Mital, Goldsmiths, University of London, London, UK

Citation: Smith, T. J., & Mital, P. K. (2013). Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes. Journal of Vision, 13(8):16, 1-24, http://www.journalofvision.org/content/13/8/16, doi:10.1167/13.8.16. Received November 15, 2012; published July 17, 2013. ISSN 1534-7362.

Does viewing task influence gaze during dynamic scene viewing? Research into the factors influencing gaze allocation during free viewing of dynamic scenes has reported that the gaze of multiple viewers clusters around points of high motion (attentional synchrony), suggesting that gaze may be primarily under exogenous control. However, the influence of viewing task on gaze behavior in static scenes and during real-world interaction has been widely demonstrated. To dissociate exogenous from endogenous factors during dynamic scene viewing, we tracked participants' eye movements while they (a) freely watched unedited videos of real-world scenes (free viewing) or (b) quickly identified where the video was filmed (spot-the-location). Static scenes were also presented as controls for scene dynamics. Free viewing of dynamic scenes showed greater attentional synchrony, longer fixations, and more gaze to people and areas of high flicker compared with static scenes. These differences were minimized by the viewing task. In comparison with the free viewing of dynamic scenes, during the spot-the-location task fixation durations were shorter, saccade amplitudes were longer, and gaze exhibited less attentional synchrony and was biased away from areas of flicker and people. These results suggest that the viewing task can have a significant influence on gaze during a dynamic scene but that endogenous control is slow to kick in, as initial saccades default toward the screen center, areas of high motion, and people before shifting to task-relevant features. This default-like viewing behavior returns after the viewing task is completed, confirming that gaze behavior is more predictable during free viewing of dynamic than static scenes but that this may be due to a natural correlation between regions of interest (e.g., people) and motion.

    Introduction

Real-world visual scenes change dynamically. Objects, such as people, animals, vehicles, and environmental elements (e.g., leaves in the breeze), move relative to a static background, and our viewpoint on a scene changes through our own motion. This simple fact may not seem controversial but has been neglected by the majority of previous research into how we process visual scenes. Investigations into scene perception using static photographs have resulted in a substantial understanding of how we attend to static scenes and how this relates to subsequent perception and memory for scene details (for review, see Henderson, 2003; Tatler, Hayhoe, Land, & Ballard, 2011). For example, the magnitude of low-level visual features such as edges is significantly greater at fixated than nonfixated locations during static scene free viewing (Baddeley & Tatler, 2006; Foulsham & Underwood, 2008; Itti & Koch, 2001; Mannan, Ruddock, & Wooding, 1995; Tatler, Baddeley, & Gilchrist, 2005), but viewing task can significantly alter which semantic features of a scene we fixate (Buswell, 1935; Castelhano, Mack, & Henderson, 2009; Mills, Van der Stigchel, Hollingworth, Hoffman, & Dodd, 2011; Yarbus, 1967). However, comparable research into the same questions for more naturalistic dynamic scenes is only in its infancy (Carmi & Itti, 2006a, 2006b; Dorr, Martinetz, Gegenfurtner, & Barth, 2010; Itti, 2005; Le Meur, Le Callet, & Barba, 2007; Ross & Kowler, 2013; see Smith, 2013, for review). The work presented here endeavors to extend our understanding of gaze behavior in dynamic scenes by (a) examining the differences between gaze behavior in dynamic and static versions of the same scene and (b) identifying the influence of viewing task on gaze behavior in both types of scene.

Initial research into gaze behavior during dynamic scene viewing has revealed that the way we attend to a scene in motion may differ from how we attend to static scenes. In static scenes, fixations from multiple viewers have exhibited prioritization of some features of a scene over others (e.g., faces and foreground objects), although not at the same time (Mannan et al., 1997).


The lack of temporal demands on when scene features are relevant to the viewer results in a large degree of disagreement in where multiple viewers attend. This variability conflicts with computational models of visual saliency that attempt to predict the gaze location of an ideal viewer from low-level visual features by assuming exogenous (i.e., stimulus-driven) involuntary control of attention (e.g., Itti, 2000; Itti & Koch, 2001).

Endogenous (e.g., internal, cognitively driven) factors such as viewing task (Buswell, 1935; Castelhano et al., 2009; Land & Hayhoe, 2001; Yarbus, 1967), knowledge about layout and scene semantics (Henderson, Brockmole, Castelhano, & Mack, 2007; Henderson, Malcolm, & Schandl, 2009; Torralba, Oliva, Castelhano, & Henderson, 2006), and a preference for social features such as people, faces, and the subject of their eye gaze (Birmingham, Bischof, & Kingstone, 2008; Castelhano, Wieth, & Henderson, 2007) have a greater influence on gaze allocation in static scenes, and the inherent individual variability in endogenous control results in greater variability of gaze location over time.

Such variability is eradicated with the inclusion of motion. When viewing dynamic scenes, the gaze of multiple viewers exhibits a high degree of clustering in space and time, which we refer to as attentional synchrony (Smith & Henderson, 2008). For example, Goldstein, Woods, and Peli (2007) found that for more than half of the viewing time, the distribution of fixations from all viewers while watching movie clips occupied less than 12% of the screen area. Attentional synchrony has been observed for feature films (Carmi & Itti, 2006a, 2006b; Goldstein et al., 2007; Mital, Smith, Hill, & Henderson, 2011; Smith & Henderson, 2008; 't Hart et al., 2009; Tosi, Mecacci, & Pasquali, 1997), television (Mital et al., 2011; Sawahata et al., 2008), video clips with audio narration and subtitling (Ross & Kowler, 2013), and videos of real-world scenes (Cristino & Baddeley, 2009; Smith & Henderson, 2008; 't Hart et al., 2009). Although the degree of gaze clustering is always higher in dynamic than static scenes (Smith & Henderson, 2008), it varies with scene content and compositional factors such as the inclusion of editing (Dorr et al., 2010; Hasson et al., 2008; Mital et al., 2011; see Smith, 2013, for review). In natural real-world scenes, the absence of such compositional factors means that only naturally occurring features such as motion, change, and sudden onsets can capture attention (Wolfe & Horowitz, 2004).

Recent computational models of visual saliency have significantly increased their ability to predict human gaze location in dynamic scenes by including dynamic visual features such as motion and flicker (i.e., difference over time; Berg, Boehnke, Marino, Munoz, & Itti, 2009; Carmi & Itti, 2006a, 2006b; Le Meur et al., 2007; Mital et al., 2011; 't Hart et al., 2009; Vig, Dorr, & Barth, 2009) and by accounting for the large bias of gaze toward the screen center observed during dynamic scene free viewing (see Smith, 2013, for discussion; Tseng, Carmi, Cameron, Munoz, & Itti, 2009). Motion and flicker are some of the strongest independent predictors of gaze location in dynamic scenes, and their independent contributions are as high as, if not higher than, the weighted combination of all features in a model of visual salience (Carmi & Itti, 2006b; Mital et al., 2011). However, it would be false to conclude that such correlations between low-level features and gaze are evidence of a causal relationship. In a dynamic scene filmed from a static viewpoint, areas of high motion will coincide with areas of cognitive relevance such as people, animals, vehicles, and the objects acted upon by these agents (as has been shown in static scenes; Cerf, Frady, & Koch, 2009; Einhauser, Spain, & Perona, 2008; Elazary & Itti, 2008). It may be these scene semantics that are being prioritized (i.e., endogenously) by selective attention rather than attention being automatically (i.e., exogenously) captured by motion.

Such consistency in how multiple viewers attend to a scene has also been observed during real-world interactive tasks such as fixating the inside curve while driving (Land & Lee, 1994) and the point of a knife when cutting a sandwich (Hayhoe, Shrivastava, Mruczek, & Pelz, 2003). Gaze during these naturalistic tasks is closely coupled to the information needed for the immediate goal (Hayhoe & Land, 1999; Land, Mennie, & Rusted, 1999). For example, gaze falls only on task-relevant objects during a task but is equally distributed among all objects within the field of view prior to beginning the task (Hayhoe et al., 2003). If an agent were subject to involuntary, bottom-up control by visual salience while interacting with the real world, he or she would never be able to complete the task, as the areas of the scene with the greatest relevance to the behavior are often the least salient (e.g., the bend of a road). Instead, the perceived utility of an object, such as whether or not a pedestrian has collided with you in the past, appears to be more important than low-level features such as its trajectory in deciding whether you fixate that object (Jovancevic-Misic & Hayhoe, 2009; Jovancevic-Misic, Sullivan, & Hayhoe, 2006). This mismatch between existing theories of visual salience and the pragmatics of everyday vision has recently received a wealth of discussion (for volumes of collected articles on the topic, see Henderson, 2005; Tatler, 2009).

However, it is false to assume that the highly volitional nature of gaze control during real-world tasks indicates that a similar override of visual salience will be observed during noninteractive dynamic scene viewing.


Real-world behaviors differ from prerecorded dynamic scenes in that the viewer is also an agent within the environment and can partly control the dynamics of the scene through his or her own ego motion, viewpoint changes, and manipulation of objects within the environment. When viewing a prerecorded dynamic scene without a specified viewing task, the optimal viewing strategy may be to couple gaze to points of high motion, as these are likely to be the sources of the most information. To examine whether this coupling is volitional (endogenous) or due to involuntary capture (exogenous), the traditional method is to dissociate the two factors by altering the viewing task. Most previous demonstrations of attentional synchrony during dynamic scene viewing have used a free viewing or "follow the main character" viewing task (see Smith, 2013, for review). Such instructions may inadvertently bias attention toward the dynamic features of a scene. Indirect evidence of endogenous influences on gaze allocation has come from studies using prolonged or repeated presentations of dynamic scenes. Attentional synchrony decreases over repeated presentation of the same videos (Dorr et al., 2010) and over prolonged presentation of the same dynamic scene (Carmi & Itti, 2006a; Dorr et al., 2010; Mital et al., 2011), suggesting that increasing familiarity and memory for scene content leads to divergent gaze behavior. Recent evidence of optimal anticipation of moments of high salience in dynamic scenes suggests that viewers formulate and act on predictions about the future of visual events (Vig, Dorr, Martinetz, & Barth, 2011).

Only one study (that we are aware of) has attempted to directly manipulate viewing task during noninteractive dynamic scene viewing. Taya, Windridge, and Osman (2012) presented participants with videos of tennis matches filmed from a static camera position. Participants were initially instructed to free view the clips and then to watch a mixture of new and old clips while trying to decide who won the point. After each clip, they also ranked a list of objects (such as ball boy/girl, net, and ball) in terms of how much they thought they looked at each object. Although subjective ratings of attended objects differed significantly across the two viewing tasks, objective measures of intersaccadic intervals (approximately fixation durations) and gaze coherence showed no significant differences across tasks. Repetition of the same clips also failed to show the decrease in attentional synchrony previously observed (Dorr et al., 2010). Taya and colleagues (2012) suggested that the lack of task influence on gaze could be due to the highly constrained nature of a tennis match and the correspondence between points of high salience (e.g., movement) and the areas of interest during both free viewing and identifying who won the point (e.g., the players and the ball). They also presented videos with their original soundtrack, introducing audiovisual correspondences that may have cued attention to the same screen locations irrespective of viewing task.

The present study endeavors to overcome the limitations of Taya et al. (2012) by using less-constrained stimuli and a viewing task that directly dissociates endogenous from exogenous control. Naturalistic videos of indoor and outdoor scenes were presented to two sets of participants either under instruction to free view the clips or to identify the location depicted in the clips (referred to as spot-the-location). Free viewing was used as the comparison for the more specific searchlike spot-the-location task because in static scenes, gaze differences between free view and all other specific tasks (e.g., search, memory, or pleasantness rating) have been shown to be significantly larger than those between the specific tasks (Mills et al., 2011). The majority of the clips used in this study were filmed in Edinburgh, where the experiment was conducted, and participants were able to tell in a few seconds whether or not they knew the location. To spot the location, participants had to ignore dynamic, changeable features of the scene such as people and traffic and focus their attention on the static, permanent features such as buildings and landmarks. Therefore, if participants are able to employ endogenous control, gaze behavior during the spot-the-location task should be less predictable by exogenous factors such as flicker (amount of change), be less allocated to moving objects such as people, and exhibit less attentional synchrony than during free viewing, as viewer gaze is distributed to more diffuse and non-temporally defined elements of the scene. To identify the specific contribution of scene dynamics to gaze location, scenes were presented to participants either as videos or as static images.

    Method

    Participants

Fifty-six members of the Edinburgh University community participated for payment (17 men; mean age 22.74 years, minimum 18 years, maximum 38 years). All participants had normal or corrected-to-normal vision. Participation was voluntary. The experiment was conducted according to the British Psychological Society's ethical guidelines.

    Apparatus

Eye movements were monitored by an SR Research Eyelink 1000 eye tracker, and a nine-point calibration grid was used. Gaze position was sampled at 1000 Hz. Viewing was binocular, but only the right eye was tracked.


The images were presented on a 21-inch cathode ray tube (CRT) monitor at a viewing distance of 90 cm with a refresh rate of 140 Hz and a resolution of 800 × 600 pixels (subtending a visual angle of 25.7° × 19.4°). A chin rest was used throughout the experiment. The experiment was controlled with SR Research Experiment Builder software.

    Stimuli

Participants were presented 26 unique full-color real-world scenes on a black background aligned with the center of the screen. Thirteen of the scenes were static photographs (Windows Bitmap format; 720 × 576 pixels, 24 bit), and 13 were 20-s, silent videos (XviD MPEG-4 format; 720 × 576 pixels and 25 frames per second). All scenes were presented with black borders around the image/video, with the scene subtending a visual angle of 23.3° × 18.62°. Scenes were originally video clips, and representative static versions of these scenes were made by randomly choosing a frame 5 to 15 s into the video clips and exporting the frame as a Bitmap. Twenty-one of the scenes were filmed by the authors around Edinburgh using the movie function on a digital still camera. These scenes depicted a variety of contexts including indoor and outdoor locations with varying degrees of activity, number of people, and depth of field (see Figure 1). Scenes were filmed according to the following constraints: The camera was mounted on a tripod and not moved during the filming period, the tripod was randomly located in a scene and not framed in relation to a particular object, all scenes had to contain moving objects (typically people, animals, or vehicles) but not contain any text (e.g., signs) or critical sounds (e.g., dialogue), all scenes were lit with the original lighting found in the location (e.g., natural light [outdoors] or fluorescent lighting [indoors]), and all scenes were focused using the camera's autofocus.

The remaining five scenes were 20-s clips extracted from videos depicting individual people engaged in a task (e.g., washing a car). These scenes adhered to the filming constraints outlined above and were originally used to investigate the perception of events during human action (Zacks et al., 2001).

    Procedure

Participants were randomly allocated to one of two conditions: free viewing or spot-the-location. In the free-viewing condition, participants were told they would be presented 26 short everyday scenes, either as still photographs or videos, and were instructed to simply look at each scene. A blocked design was used, with either a block of 13 still images or a block of 13 videos presented first, followed by a block containing 13 scenes of the opposite type. Participants saw only one version of each scene. Balancing the order of blocks and scene types across subjects created four subject groups. An experimental trial took place as follows. First, calibration was checked using a central fixation point presented on the CRT. If gaze position was more than 0.5° away from the fixation point, a nine-point recalibration was performed. Otherwise, the trial began with a randomly positioned fixation cross. A random fixation cross was used in an attempt to minimize the central bias observed in previous studies (Le Meur et al., 2007; Tatler, 2007). Participants were instructed to fixate the cross. The cross was presented for 1000 ms before being replaced with the scene. In the static block, the still photograph was presented for 20 s before being removed and the trial terminated. In the dynamic block, the video began immediately and ended after 20 s. Eye movements were recorded throughout the viewing time. After 13 scenes were presented, the second block began with a recalibration followed by the presentation of the remaining 13 scenes of the other type.

In the spot-the-location condition, participants were presented the still images and videos using the same trial and block design as in the free-view condition but were instructed to try to identify the location depicted in each image/video. As soon as they recognized the location or decided that the location was not familiar, they pressed a button on a controller (Microsoft Sidewinder joypad). They were instructed to respond as quickly as possible, to encourage them to complete the location task before engaging with the scene in any other way. The scene would then remain on the screen for the remainder of the 20-s viewing time. After each scene was presented, the experimenter asked participants to verbally state if they recognized the location and, if so, where it was. Verbal answers were recorded and scored as correct or incorrect.

    Data analysis

    Gaze preprocessing

Eye-movement parsing algorithms such as that provided with the Eyelink eye tracker (SR Research) are incapable of dealing with eye movements during dynamic scenes, as they assume a sequence of fixations, saccades, and blinks. During dynamic scenes, the movement of objects relative to the camera and the participant's static viewpoint mean that smooth pursuit movements may also be observed. Smooth pursuits are distinct from fixations as they involve displacement over time while a target object is foveated.


Figure 1. Stills taken from the 26 scenes used in this experiment. For each participant, half of the scenes were presented as static images and half as videos.


They are distinguished from saccades by their relatively low acceleration and velocity. Thresholds of 8000°/s² and 30°/s are typically used for saccade detection (Stampe, 1993), and although pursuit of objects has been shown to be accurate up to 100°/s, the majority of pursuit movements are thought to occur at considerably slower speeds (Meyer, Lasker, & Robinson, 1985). Traditionally, studies intending to measure smooth pursuit eye movements use targets moving at known velocities and trajectories. As such, the existence of appropriate smooth pursuit movements can be identified by checking for a match between the velocity of the gaze and the target. The task of identifying smooth pursuits during spontaneous dynamic scene viewing is more difficult, as information is not known about the target location or velocity. Instead, the velocity and continuity of motion of the eyes, irrespective of the stimuli, must be used. The MarkEye algorithm provides such functionality and has been used with both human adults and primates (Berg et al., 2009).

The MarkEye algorithm processes raw X/Y monocular gaze coordinates using a combination of velocity thresholds and a windowed principal component analysis (PCA). The algorithm progresses through several stages. First, the raw X/Y data are smoothed using a low-pass filter (63 Hz Butterworth), and gaze velocities greater than 30°/s are marked as possible saccades. A sliding-window PCA is then used to compute the ratio of explained variances (minimum over maximum) for X and Y position. A ratio near zero indicates a straight-line movement; if the velocity is also greater than the saccade threshold, a saccade is identified. Ratios close to zero at velocities below the saccade threshold are identified as possible smooth pursuits or other low-velocity ocular artifacts (e.g., drift, or the eyelid obscuring the pupil prior to a blink). The remaining data are categorized as fixations. Saccades separated by fixations less than 60 ms in duration are assumed to belong to the same movement and are collapsed together. Finally, saccades with amplitude less than 1° or duration less than 15 ms are ignored and collapsed into adjacent movements.
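The following is a minimal Python sketch of this velocity-plus-windowed-PCA classification, not the authors' Matlab implementation. The filter cutoff (63 Hz) and saccade velocity threshold (30°/s) follow the text; the window length, PCA-ratio cutoff, and pixels-to-degrees scaling are illustrative assumptions.

```python
# Minimal sketch of MarkEye-style gaze classification (assumed parameters noted).
import numpy as np
from scipy.signal import butter, filtfilt

def classify_gaze(x, y, fs=1000.0, deg_per_px=0.03,
                  sacc_vel=30.0, pca_ratio_cut=0.1, win=20):
    """Label each sample as 'saccade', 'pursuit', or 'fixation'."""
    # Low-pass filter the raw gaze (63 Hz Butterworth, as in the text).
    b, a = butter(2, 63.0 / (fs / 2.0))
    xs, ys = filtfilt(b, a, x), filtfilt(b, a, y)

    # Sample-to-sample velocity in degrees per second.
    vel = np.hypot(np.diff(xs, prepend=xs[0]),
                   np.diff(ys, prepend=ys[0])) * deg_per_px * fs

    labels = np.array(['fixation'] * len(xs), dtype=object)
    for i in range(0, len(xs) - win, win):
        seg = np.c_[xs[i:i + win], ys[i:i + win]]
        # Windowed PCA via eigenvalues of the segment's covariance matrix.
        evals = np.linalg.eigvalsh(np.cov(seg.T))
        ratio = evals.min() / max(evals.max(), 1e-12)
        if ratio < pca_ratio_cut:               # straight-line movement
            if vel[i:i + win].max() > sacc_vel:
                labels[i:i + win] = 'saccade'   # fast and straight
            else:
                labels[i:i + win] = 'pursuit'   # slow and straight (or drift)
    return labels

# Example: 1 s of synthetic gaze containing one step (saccade-like) displacement.
t = np.arange(1000)
x = np.where(t < 500, 100.0, 300.0) + np.random.randn(1000)
y = np.full(1000, 300.0) + np.random.randn(1000)
print(set(classify_gaze(x, y)))
```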

The Matlab implementation of the MarkEye algorithm was used to parse the raw gaze data generated by the Eyelink 1000 eye tracker. Raw monocular gaze data were extracted for each participant and passed through the MarkEye algorithm. The same parameters were used for parsing the gaze data irrespective of whether participants were viewing static or dynamic stimuli. Any periods identified as tentative smooth pursuits (2.82% of all data), blinks, or classification error (due to tracker error/noise; 14.86% of all data) were removed from subsequent analyses. The properties of spontaneous smooth pursuit movements during naturalistic dynamic scene perception are a topic of great interest, but given that they were not the main focus of the present research, the decision was made to exclude these periods from further analyses.

All analyses of oculomotor measures presented here are based only on fixations with durations greater than 90 ms, to exclude false fixations created by the saccadic hook often observed in infrared eye trackers (Nystrom & Holmqvist, 2010). Fixations greater than 2000 ms are also excluded as outliers, as these probably indicate lapses of concentration by participants (Castelhano et al., 2009). These criteria excluded 5.2% of fixations. A total of 70,875 fixations remained for analysis.
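As a minimal illustration of this screening step (the list-of-dicts fixation format is an assumption, not the authors' data structure), only fixations between 90 ms and 2000 ms are retained:

```python
# Sketch of the fixation-duration screen described above (assumed data format).
fixations = [
    {"x": 400, "y": 300, "dur_ms": 45},    # saccadic-hook artifact: dropped
    {"x": 410, "y": 310, "dur_ms": 250},   # kept
    {"x": 200, "y": 150, "dur_ms": 2400},  # probable concentration lapse: dropped
]
valid = [f for f in fixations if 90 < f["dur_ms"] < 2000]
print(len(valid))  # -> 1
```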

To validate the oculomotor measures identified using the MarkEye algorithm, the raw gaze data were also parsed using the normal Eyelink algorithm (Stampe, 1993). The correlation between mean fixation durations per participant identified by the Eyelink algorithm and MarkEye was highly significant (Pearson correlations; static: r = 0.853, p < 0.001; dynamic: r = 0.864, p < 0.001). However, the mean fixation durations identified by the Eyelink algorithm were on average 35.2 ms longer than those identified by the MarkEye algorithm (static = 28.49 ms, dynamic = 41.91 ms), possibly due to the inclusion of drift and pursuit within Eyelink fixations. This suggests that our use of the MarkEye algorithm has not distorted the data in any way and provides a more conservative estimate of the oculomotor behaviors than the Eyelink algorithm.

    Quantifying attentional synchrony

The most striking difference in viewing behavior between static and dynamic scenes previously reported is the clustering of gaze across individuals (i.e., attentional synchrony; Mital et al., 2011; Smith & Henderson, 2008). Various methods have been proposed for expressing the degree of attentional synchrony during a dynamic scene, including entropy (Sawahata et al., 2008), Kullback-Leibler divergence (Rajashekar, Cormack, & Bovik, 2004), normalized scan path salience (Dorr et al., 2010; Peters, Iyer, Itti, & Koch, 2005; Taya et al., 2012), bivariate contour ellipse area (BCEA; Goldstein et al., 2007; Kosnik, Fikre, & Sekuler, 1986; Ross & Kowler, 2013), and Gaussian mixture modeling (GMM; Mital et al., 2011; Sawahata et al., 2008). Each method expresses slightly different properties of the gaze distribution, such as assuming all gaze is best expressed as a single diagonal cluster (BCEA) or as multiple spherical clusters (GMM), or describing the overall distribution (entropy). However, most methods have been shown to express the variation in attentional synchrony and have also been shown to correlate (Dorr et al., 2010). Here, our interest is in expressing two properties of attentional synchrony: (a) the variance of gaze around a single point and (b) the number of focal points around which gaze is clustered in a particular frame.


We therefore decided to use GMM, as this represents a collection of unlabeled data points as a mixture of Gaussians, each with a separate mean, covariance, and weight parameter. However, this approach requires knowing how many clusters are in the data a priori. Following Sawahata and colleagues (2008), we can discover the optimal number of clusters that explain a distribution of eye movements using model selection (model selection operates by minimizing the Bayesian information criterion; see Bishop, 2007, for further explanation of the algorithm). Alternatively, the number of clusters can be set a priori. If a single cluster is used, the algorithm will model all gaze points for a particular frame using a single Gaussian kernel approximated by a spherical covariance matrix. The closer the covariance of this Gaussian is to zero, the tighter the gaze clustering and therefore the greater the attentional synchrony. For ease of interpretation, the cluster covariance is expressed as the visual angle enclosing 68% of gaze points (i.e., 1 standard deviation of the full Gaussian spherical covariance matrix). When using a single Gaussian cluster, this measure is very similar to BCEA (Goldstein et al., 2007), although it does not assume independent horizontal and vertical variances and instead uses a spherical cluster. As such, our measure can be considered a more conservative estimate of attentional synchrony than BCEA. However, analysis of our data with both BCEA and single Gaussian modeling revealed strong significant correlations between the two measures (static free viewing: r = 0.863, p < 0.001; static spot-the-location: r = 0.888, p < 0.001; dynamic free viewing: r = 0.823, p < 0.001; dynamic spot-the-location: r = 0.874, p < 0.001). GMM was used to allow the second-stage analysis of fitting the minimal number of clusters per frame.

Cluster covariance was calculated for every frame of every scene under the two viewing instructions using CARPE, open-source visualization and analysis software for eye-tracking experiments using dynamic scenes (Mital et al., 2011). Static scene presentations were divided into 40-ms frames.
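A minimal sketch of both stages of this analysis is given below, using scikit-learn's GaussianMixture in place of the authors' CARPE software: a single spherical Gaussian per frame yields the synchrony measure (its standard deviation converted to visual angle), and fitting models with increasing numbers of components while minimizing BIC yields the per-frame cluster count. The pixels-per-degree constant (derived from the ~800-pixel, 25.7° display above) and the candidate range of cluster counts are assumptions.

```python
# Sketch of per-frame gaze clustering with spherical Gaussians (assumed scaling).
import numpy as np
from sklearn.mixture import GaussianMixture

def synchrony_radius(gaze_xy, px_per_deg=31.0):
    """1-SD radius (deg) of a single spherical Gaussian fit to one frame's gaze."""
    gmm = GaussianMixture(n_components=1, covariance_type='spherical')
    gmm.fit(gaze_xy)
    return np.sqrt(gmm.covariances_[0]) / px_per_deg

def n_clusters_by_bic(gaze_xy, k_max=8):
    """Number of spherical Gaussian clusters minimizing BIC for one frame."""
    bics = [GaussianMixture(n_components=k, covariance_type='spherical')
            .fit(gaze_xy).bic(gaze_xy) for k in range(1, k_max + 1)]
    return int(np.argmin(bics)) + 1

# Example: tightly clustered gaze (high synchrony) vs. two separate gaze groups.
rng = np.random.default_rng(0)
tight = rng.normal([400, 300], 15, size=(56, 2))
split = np.vstack([rng.normal([200, 300], 15, size=(28, 2)),
                   rng.normal([600, 300], 15, size=(28, 2))])
print(synchrony_radius(tight), synchrony_radius(split))    # small vs. large
print(n_clusters_by_bic(tight), n_clusters_by_bic(split))  # ~1 vs. ~2
```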

    Flicker

To identify the contribution of low-level visual features to gaze allocation in dynamic scenes, flicker was computed. Flicker (change in luminance over time) has been shown to correlate with other measures of motion, such as optic flow, and to be highly predictive of early gaze locations during dynamic scene free viewing (Carmi & Itti, 2006b; Mital et al., 2011). Flicker in a video filmed from a static viewpoint can be indicative of object motion, changes in environmental lighting and color, or video compression artifacts. For this analysis, flicker was calculated across luminance within a moving window of five frames, to minimize the influence of compression artifacts and increase the likelihood of detecting an actual change in the scene. Equation 1 describes the flicker computation at pixel (x, y) for time t, using the absolute difference between the current and previous frames' luminance values (I), computed from the CIELAB color space and averaged over the last five frames (N = 5):

$$F_{x,y}(t) \;=\; \frac{1}{N}\sum_{i=1}^{N}\left|\,I_{x,y}(t-i) - I_{x,y}(t-i-1)\,\right| \qquad (1)$$

Raw flicker around fixation on its own is insufficient to identify if gaze is specifically biased toward areas of change in the scene, as the entire scene could contain the same amount of flicker. To create a baseline estimate of how much flicker would be found around a control sample of gaze points, fixation locations were sampled with replacement from other frames of the same scene and projected onto the current frame. The average flicker within a 2° region around these control locations was calculated for each participant using Equation 1 and then averaged across all scenes and viewing tasks. Using the participant's own fixations from other time points in the same scene controls for individual gaze biases, the central tendency, and compositional biases in each scene (Mital et al., 2011).
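A minimal sketch of Equation 1 and the shuffled-fixation baseline, under assumed frame and fixation formats, might look as follows:

```python
# Sketch of the flicker measure (Eq. 1) and its control baseline (assumed formats).
import numpy as np

def flicker_map(frames, t, n=5):
    """Mean absolute luminance difference over the n frames preceding t (Eq. 1)."""
    diffs = [np.abs(frames[t - i].astype(float) - frames[t - i - 1])
             for i in range(1, n + 1)]
    return np.mean(diffs, axis=0)

def mean_flicker_at(fmap, x, y, radius_px=31):
    """Average flicker in a square window (~2 deg, assumed scaling) around a point."""
    h, w = fmap.shape
    return fmap[max(0, y - radius_px):min(h, y + radius_px),
                max(0, x - radius_px):min(w, x + radius_px)].mean()

# Example: a bright square drifting rightward on 576 x 720 luminance frames.
frames = np.zeros((30, 576, 720))
for t in range(30):
    frames[t, 200:260, 10 * t:10 * t + 60] = 100.0

fmap = flicker_map(frames, t=20)
fix_on_object = mean_flicker_at(fmap, x=210, y=230)
# Baseline: the same participant's fixations sampled from other frames of the scene.
control_fix = [(50, 500), (400, 100), (650, 300)]
baseline = np.mean([mean_flicker_at(fmap, x, y) for x, y in control_fix])
print(fix_on_object > baseline)  # gaze on the moving object sees more flicker
```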

    Dynamic regions of interest

To identify the objects fixated during the dynamic scenes, dynamic regions of interest (dROI) were used. As the main objects of interest in this study were people, rectangular regions were created around each person in the scene. Most eye-tracking analysis software provides tools for creating static regions of interest. Given the dynamic nature of our objects, a dedicated software tool, Gazeatron (Smith, 2006), was used. Gazeatron allows dROI to be semi-automatically "puppetted" on top of the moving people, translating and scaling the regions as appropriate to encompass each person. For some of our scenes, this resulted in a large number of small regions (e.g., Scene 12, depicting the lobby of the National Museum of Scotland; Figure 1), whereas for others, only one person was present (e.g., Scene 23, depicting a man doing laundry; Figure 1). These regions were created for both static and dynamic versions of each scene. For the frame depicted in the static version, the regions were identical to the dynamic version of each scene. The coincidence of each fixation with these dynamic regions was then identified, resulting in either a hit (1) or a miss (0) on people for each fixation.


To control for the probability of randomly fixating people given the area of the screen they occupy, a control set of dROI equal in number and size to the regions occupied by people was generated for each frame and randomly located on the screen. The intersections of each fixation with these control regions were also identified. Across all scenes, the number of dROI per frame and over time varied widely, but for simplicity, the mean probability of fixating the randomly located regions during static scene viewing will be presented as the control baseline: M = 0.235, SD = 0.027.
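A minimal sketch of the dROI hit test and its random-relocation control, assuming (x, y, w, h) rectangles and per-frame person boxes from the annotation tool:

```python
# Sketch of the dROI hit/miss scoring and random control regions (assumed formats).
import random

def hits_roi(fx, fy, boxes):
    """1 if fixation (fx, fy) falls inside any (x, y, w, h) box, else 0."""
    return int(any(x <= fx <= x + w and y <= fy <= y + h
                   for x, y, w, h in boxes))

def random_control_boxes(boxes, screen_w=720, screen_h=576, rng=random):
    """Equal-number, equal-size boxes relocated uniformly on the screen."""
    return [(rng.uniform(0, screen_w - w), rng.uniform(0, screen_h - h), w, h)
            for _, _, w, h in boxes]

# Example: one frame with two person boxes and one fixation landing on a person.
person_boxes = [(100, 200, 40, 120), (500, 180, 50, 140)]
fix = (520, 250)
print(hits_roi(*fix, person_boxes))                         # -> 1 (hit)
print(hits_roi(*fix, random_control_boxes(person_boxes)))   # chance-level hit
```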

    Results

    Spot-the-location results

For participants in the free-viewing condition, no response was required. In the spot-the-location condition, participants had to respond as quickly as possible via a button press when they identified the location depicted in the image/video or when they realized that they did not know the location. The mean probability of accurately identifying a location was 0.57 (SD = 0.154) for static scenes and 0.53 (SD = 0.16) for dynamic scenes. The difference was not significant (t < 1). Reaction times (RTs) for recognition/nonrecognition across the two scene types exhibited a significant main effect of whether the scene was recognized (repeated-measures analysis of variance [ANOVA]), F(1, 27) = 9.58, p < 0.01, ηp² = 0.262, but no effect of stimulus type (F < 1) or interaction between stimulus type and recognition, F(1, 27) = 2.582, p = 0.120, n.s. This trend toward an interaction can be attributed to mean RTs for nonrecognized scenes (5499 ms, SD = 2453 ms) being significantly longer than for recognized scenes (3990 ms, SD = 1740 ms), t(27) = 3.214, p < 0.01, in the dynamic condition but not in the static condition (nonrecognized = 4836 ms, SD = 2175 ms; recognized = 4229 ms, SD = 1890 ms), t(27) = 1.47, p = 0.153, n.s. This delay in recording a failure to recognize dynamic scenes might suggest that participants were distracted by the moving elements of the scene when encountering hard-to-recognize scenes. This difference between static and dynamic scenes will be explored in more detail during the analysis of participant eye movements.

    Oculomotor measures

Oculomotor measures (fixation durations and saccade amplitudes) were calculated using the MarkEye algorithm (Berg et al., 2009) and will be presented across both stimulus types (static and dynamic) and viewing tasks (free view vs. spot-the-location). Unless otherwise stated, gaze data during the spot-the-location task are analyzed only up to the response.

    Fixation durations

Fixation durations have been shown to vary with different viewing tasks, such as reading and scene viewing (Rayner, 1998). There is also some evidence that fixations during scene memorization are significantly longer than during search of the same scenes (Henderson, 2007; Henderson & Smith, 2009; Henderson, Weeks, & Hollingworth, 1999; Mills et al., 2011), although some studies have failed to replicate this difference (Castelhano et al., 2009). Fixation durations have also been shown to increase during static scene presentation (Antes, 1974; Buswell, 1935; Mills et al., 2011; Unema, Pannasch, Joos, & Velichkovsky, 2005). To look for stimulus and task influences on fixation durations in the present study, fixations were parsed from the raw data, and the mean (Figure 2) and change over time (Figure 3) of fixation durations were calculated for each condition.

As can be seen from Table 1 and Figure 2, mean fixation durations during dynamic scene viewing are longer (339 ms) than during static scene viewing (290 ms) in free viewing, and likewise during the spot-the-location task (253 ms vs. 231 ms, respectively). These differences are present even though smooth pursuit movements have been separated from fixation durations. A mixed ANOVA comparing scene type (static vs. dynamic) to viewing task (free view vs. spot-the-location) reveals a significant within-subjects effect of scene type, F(1, 54) = 102.67, p < 0.001, ηp² = 0.655; a between-subjects effect of task, F(1, 54) = 40.18, p < 0.001, ηp² = 0.427; and a significant interaction, F(1, 54) = 13.48, p < 0.001, ηp² = 0.20.

Figure 2. Mean fixation durations across the two stimulus types (static vs. dynamic scenes; x-axis) and viewing tasks (lines): free view versus spot-the-location before the response. Error bars represent ±1 SE.


The interaction is due to a differing effect of stimulus type within the two tasks. During free viewing, the mean difference between fixation durations in dynamic scenes and static scenes is large (48.61 ms), t(27) = 8.07, p < 0.001, compared with a smaller, but still significant, difference during spot-the-location (22.75 ms), t(27) = 6.22, p < 0.001. This suggests that spot-the-location exhibits eye movements similar to previously described scene search tasks (Henderson, 2007; Henderson et al., 1999; Henderson & Smith, 2009; Mills et al., 2011) and produces shorter fixations irrespective of whether the scene is static or dynamic.

It is also worthwhile to note that because of the fixed presentation time (20 s) during free viewing, longer average fixations during dynamic scenes will result in fewer fixations than during static scenes. The same is not necessarily true in the spot-the-location task, as the duration used for analysis is dependent on the participant's response.

To ensure that the task influence on fixation durations cannot be attributed to differences in when fixations were sampled (fixations during the full 20-s presentation are used for free viewing but only up until the response for spot-the-location), a time-course analysis was performed across the full trial duration. Dividing mean fixation durations into time bins relative to scene onset reveals a clear increase in fixation durations throughout the trial (Figure 3). Fixations during the first second after scene onset are significantly shorter (mean across all conditions = 213.63 ms, SD = 28.14) than in all other time periods (all ps < 0.001). These initial short fixations are probably due to initial orienting to points of high interest within the scene and away from the random fixation cross prior to scene onset. What these areas of interest are will be investigated in subsequent analyses.

To examine the change in fixation durations over time, a mixed ANOVA was performed.

Figure 3. Mean fixation duration (ms) over time (1000-ms time bins) for each scene type (static = green lines, dynamic = blue) and viewing task (free view = solid lines, spot-the-location = dashed lines). Fixations are analyzed for the full 20-s viewing duration in both tasks and not just up to the response in spot-the-location (as in Figure 2). Error bars represent ±1 SE.

                                         Free view                             Spot-the-location
                                     Static          Dynamic         t test    Static          Dynamic         t test
Fixation duration (ms)               290.68 (46.10)  339.29 (58.02)  ***       231.18 (30.16)  253.93 (40.06)  ***
Percentage of trial in fixation (%)  72.89 (16.05)   73.32 (17.41)   n.s.      61.82 (16.29)   65.62 (14.53)   .107
Saccade amplitude (°)                4.07 (0.63)     4.21 (0.64)     *         4.91 (0.62)     5.07 (0.57)     *
Percentage of trial in saccade (%)   8.02 (2.14)     6.48 (1.92)     ***       10.36 (2.93)    9.98 (2.28)     n.s.

Table 1. Mean (SD) oculomotor measures for participants across the two viewing tasks (free view vs. spot-the-location) and static versus dynamic stimuli. Notes: Significance levels of paired t tests between static and dynamic are reported as asterisks (* p < .05, *** p < .001).


Main effects of Time, F(19, 969) = 36.894, p < 0.001, ηp² = 0.413; Scene Type, F(1, 51) = 183.06, p < 0.001, ηp² = 0.782; and Task, F(1, 51) = 10.863, p < 0.01, ηp² = 0.176, were observed. A significant interaction between Type and Task was also observed, F(1, 51) = 7.080, p < 0.01, ηp² = 0.122, mirroring the interaction previously reported for the means. An interaction between Time and Scene Type, F(19, 969) = 2.172, p < 0.01, ηp² = 0.041, was also observed. This interaction is due to participants exhibiting a greater increase in fixation durations over time in dynamic scenes (mean increase = 126.7 ms) than in static scenes (mean increase = 95.28 ms; see Figure 3). The absence of a return to free-viewing durations by the end of the trial for the spot-the-location fixations is probably due to the fact that all trials are included irrespective of whether participants have made a location response and that those participants still searching are making shorter fixations on average.

To examine if the increase in fixation durations over time remains significant even when the extremely short fixations of the first second are removed, follow-up ANOVAs were performed within each condition without the first time bin. All main effects of Time remained significant: static free view, F(18, 468) = 4.276, p < 0.001; static spot-the-location, F(18, 468) = 3.992, p < 0.001; dynamic free view, F(18, 468) = 3.497, p < 0.001; dynamic spot-the-location, F(18, 468) = 5.485, p < 0.001. This increase in fixation durations during dynamic and static scene presentation replicates previous observations in static scenes (Antes, 1974; Buswell, 1935; Mills et al., 2011; Unema et al., 2005), extends them to dynamic scene viewing, and demonstrates that the task and type differences in mean fixation durations are evident within 2000 ms of scene onset and remain throughout the scene duration.

    Saccade amplitudes

The other standard oculomotor measure known to show influences of viewing task during scene viewing is saccade amplitude. The average amplitude of saccades typically varies according to the distribution of objects within the visual array (Rayner, 1998). Saccades are generally largest immediately following scene onset and gradually decrease in amplitude over time (Antes, 1974; Buswell, 1935; Mills et al., 2011; Unema et al., 2005) but have been shown to be significantly shorter during search of a scene compared with memorization or free viewing of the same scene (Mills et al., 2011). To look for stimulus and task influences on saccade amplitudes in the present study, saccades were parsed from the raw data, and the mean (Figure 4) and change over time (Figure 5) of saccade amplitudes were calculated for each condition.

As can be seen from Table 1 and Figure 4, mean saccade amplitudes during dynamic scene viewing are slightly larger (4.21°) than during static scene viewing (4.07°) in free viewing, and likewise during the spot-the-location task (5.07° vs. 4.91°, respectively). A mixed ANOVA reveals that these slight differences are significant, F(1, 54) = 6.076, p < 0.05, ηp² = 0.101, as is the effect of task, F(1, 54) = 31.036, p < 0.001, ηp² = 0.365, but there is no interaction (F < 1). Changing the task from free viewing to spot-the-location results in significantly larger saccades (mean difference = 0.847°), t(55) = 5.264, p < 0.001, indicating that the previously observed influence of searchlike tasks on saccade amplitudes (Mills et al., 2011) is independent of whether the stimulus is static or dynamic.

To ensure that the task influence on saccade amplitudes cannot be attributed to differences in when saccades were sampled, a time-course analysis was performed across the full trial duration (Figure 5). Dividing mean saccade amplitudes into time bins relative to scene onset reveals a clear decrease in saccade amplitude over the first few seconds of scene presentation for all scene types and viewing tasks. The relative differences between tasks are also preserved, with larger-amplitude saccades at the start of spot-the-location trials and throughout most of the trial duration. This pattern is confirmed by a mixed ANOVA on mean saccade amplitude that reveals a significant main effect of time, F(19, 950) = 14.114, p < 0.001, ηp² = 0.22; a main effect of scene type, F(1, 50) = 7.949, p < 0.01, ηp² = 0.137; and of viewing task, F(1, 50) = 8.492, p < 0.01, ηp² = 0.145, but no interactions. The lack of any interactions indicates that the main effects observed in Figure 4 are not due to sampling different time windows in the two tasks.


Figure 4. Mean saccade amplitudes (degrees of visual angle) across the two stimulus types (static vs. dynamic scenes; x-axis) and viewing tasks (lines): free view versus spot-the-location. Error bars represent ±1 SE.


Collapsing across scene type and task, post hoc comparisons (Bonferroni corrected) between mean saccade amplitudes at different time points reveal that saccades within the first two time bins (0-1000 ms and 1000-2000 ms) are significantly larger than all others, probably due to the random fixation cross and initial orienting to points of high interest within the scene. What these points are will be explored in subsequent analyses. By the fourth time bin (3000-4000 ms), amplitudes have reached asymptote, and although a slight numerical decrease continues throughout the trial for most conditions, the differences are no longer significant.

    Attentional synchrony

To identify if the degree of attentional synchrony differs across scene type and viewing task, we fit a single spherical Gaussian model around the gaze locations of all viewers for each frame of every scene and viewing task (static scene presentations were divided into 40-ms frames). The cluster covariance expresses the smallest circle enclosing 68% of all gaze locations for that frame (i.e., a radius of 1 standard deviation). Lower clustering values represent a higher degree of attentional synchrony.

Mean clustering across all scenes is presented in Figure 6 for the entire free-viewing duration and up to the response during spot-the-location. A mixed ANOVA of mean cluster covariance across the two stimulus types and tasks reveals an effect of scene type, F(1, 25) = 54.96, p < 0.001, ηp² = 0.687; an effect of task, F(1, 25) = 18.43, p < 0.001, ηp² = 0.424; and a significant interaction, F(1, 25) = 26.72, p < 0.001. Examining the means (Figure 6), the interaction can be attributed to a decrease in the difference between static and dynamic scenes during the spot-the-location task compared with free viewing. Clusters are smaller during free viewing of dynamic scenes (6.40°, SD = 3.99°) than of static scenes (7.83°, SD = 3.68°), t(25) = 8.802, p < 0.001, and this difference decreases during the spot-the-location task: dynamic (7.40°, SD = 4.06°) compared with static (7.81°, SD = 3.73°), t(25) = 2.812, p < 0.01.

Figure 5. Mean saccade amplitudes (degrees) over time (1000-ms time bins) for each scene type (static = green lines, dynamic = blue) and viewing task (free view = solid lines, spot-the-location = dashed lines). Saccades are analyzed for the full 20-s viewing duration in both tasks and not just up to the response in spot-the-location (as in Figure 4). Error bars represent ±1 SE.

Figure 6. Mean gaze clustering (degrees of visual angle, as described by a spherical Gaussian kernel with a radius of 1 standard deviation) across all scenes for the two stimulus types (static vs. dynamic scenes; x-axis) and viewing tasks (lines): free view versus spot-the-location. Lower values indicate tighter clustering (i.e., greater attentional synchrony). Error bars represent ±1 SE.


There is no impact of task on gaze clustering during static scenes (t < 1) but a highly significant increase in cluster covariance for dynamic scenes, t(25) = 7.762, p < 0.001. The viewing task appears to have a significant impact on gaze clustering during dynamic scenes but does not completely eradicate the difference between static and dynamic scenes.

To examine attentional synchrony over the duration of scene presentation, the cluster covariance for each scene in each condition was averaged into 500-ms bins (Figure 7). The resulting time course of attentional synchrony shows the now well-known tendency for greater attentional synchrony (i.e., low cluster covariance) immediately following scene onset (Dorr et al., 2010; Mital et al., 2011; Tseng et al., 2009; Wang, Freeman, Merriam, Hasson, & Heeger, 2012). In our data, the initial large cluster covariance (average across all conditions = 10.48°) is due to the randomly located fixation cross prior to scene onset, used to remove the center bias caused by a central fixation cross. However, rather than ameliorate the central tendency, this appears to have caused all viewers to direct their initial saccade toward the screen center irrespective of viewing task or scene type, creating a dip in cluster size 1000 ms following scene onset (mean across all conditions = 6.72°). This central bias can clearly be seen when gaze across the two viewing conditions and scene types is visualized as dynamic heat maps (Videos 1, 2, and 3). Analysis of the distance of gaze from screen center indicates a decrease from an initial distance of 206 pixels (SD = 88.67) at scene onset to 106 pixels (SD = 65.2) by 1000 ms. By 2000 ms, the distance to screen center reaches asymptote for each condition: static free view = 169.0 pixels (SD = 81.0), dynamic free view = 153.1 pixels (SD = 81.1), static spot-the-location = 173.6 (SD = 77.8), and dynamic spot-the-location = 171.4 (SD = 75.4).

The asymptote in the distance to screen center mirrors the asymptote in cluster covariance (Figure 7). Differences between conditions in cluster covariance emerge by 1500 ms. In free viewing, the clustering for static scenes by 1500 ms is significantly larger (M = 6.92°, SD = 4.46°) than for dynamic scenes (M = 6.08°, SD = 4.31°), t(25) = 2.05, p < 0.05, and remains at that level or higher throughout the presentation time. During spot-the-location, there is initially no difference in cluster covariance between static and dynamic scenes (at 1500 ms: static = 7.44°, SD = 4.16°; dynamic = 7.20°, SD = 4.3°; t < 1), but the difference reemerges by 6000 ms (static = 8.17°, SD = 4.45°; dynamic = 7.08°, SD = 4.43°), t(23) = 2.917, p < 0.01, suggesting a return to normal free-viewing dynamic scene behavior. Although only frames from the spot-the-location task occurring before the button press were entered into this analysis, this return to free-viewing-like behavior suggests that some participants failed to identify the location but did not indicate this failure with a button press as instructed. Some evidence of this return to free-viewing gaze behavior can be seen in the similarity between the dynamic heat maps for the spot-the-location and free-view conditions prior to the button response (e.g., Videos 1 and 2, before gaze points turn from blue to red).

Figure 7. Gaze cluster covariance over time (divided into 500-ms bins). Scene type is represented by line color (green = static, blue = dynamic) and viewing task by line type (solid = free view, dashed = spot-the-location). Error bars represent ±1 SE for the mean across all scenes.


This return to free viewing may explain the remaining difference in overall cluster covariance between dynamic and static stimuli during the spot-the-location task (Figure 6). Irrespective of this failure to follow the task instructions, the initial eradication of attentional synchrony suggests that instructing participants to identify the location enabled viewers to change the way they view the dynamic scene.

Given the large cluster covariances identified in all conditions other than dynamic free viewing (Figures 6 and 7), the variance in gaze position may be best fit by a model with multiple focal points. Thus, instead of using a single Gaussian cluster to model the distribution of gaze, we use a GMM with Bayesian information criterion model selection to discover the number of Gaussian clusters required to explain the gaze locations in each frame. This approach removes the assumption, intrinsic to other measures such as BCEA (Kosnik et al., 1986), that all gaze will be directed toward a single part of the image.

Examining the number of gaze clusters detected by model selection across scene type and viewing task (Figure 8) reveals a shift from a predominance of a large number of clusters in both static viewing tasks to a smaller number of clusters during dynamic spot-the-location and free viewing. Neither scene type nor task has an influence on the proportion of each trial in which gaze is optimally described by a single cluster (1 cluster: static free viewing = 15.22%, SD = 7.67; static spot-the-location = 13.83%, SD = 7.40; dynamic free viewing = 11.5%, SD = 8.52; dynamic spot-the-location = 11.54%, SD = 7.82), F(3) = 1.34, p = 0.266, n.s., although they do when the number of clusters is greater than 1. In most viewing conditions, the percentage of each trial spent with eight clusters is significantly greater than with any other number of clusters (all ps < 0.001) except for dynamic free viewing, in which the time spent in eight clusters (M = 24.42%, SD = 9.22) does not differ significantly from two clusters (M = 15.27%, SD = 9.05), t(24) = 0.840, p = 0.409, n.s. The greater predominance of a small number of gaze clusters (fewer than six) during dynamic free viewing confirms the greater attentional synchrony previously reported in this condition (Figures 6 and 7) without assuming that such instances occur only when all gaze is directed to a single part of the scene.

Instructing participants to spot the location does not change the distribution of gaze clusters in static scenes but does during dynamic scenes (compare the difference between the green lines to the blue lines; Figure 8). The number of clusters increases in dynamic spot-the-location compared with free viewing; the frequency of six to eight clusters increases and that of two to four clusters decreases (all ps < 0.05), resulting in a higher mean number of clusters per scene (dynamic spot-the-location: mean number of clusters = 5.40, SD = 0.51; dynamic free viewing: mean = 4.68, SD = 0.89), t(48) = 3.498, p < 0.001. However, the average number of clusters per scene within the dynamic spot-the-location condition remains lower than in static spot-the-location (mean = 5.88, SD = 0.37), t(48) = 3.824, p < 0.001. Instructing participants to spot the location in dynamic scenes significantly decreases attentional synchrony, represented as a larger number of gaze clusters, but does not fully eradicate the differences between static and dynamic scenes.

    Flicker

So far, we have demonstrated significant influences of viewing task on fixation durations, saccade amplitudes, and attentional synchrony in dynamic scenes. This last difference suggests that gaze is much more synchronized during free viewing of dynamic scenes than during spot-the-location. What might be causing this?

Figure 8. The percentage of each trial in which the spatial distribution of gaze was optimally described as a specific number of clusters (x-axis = number of clusters). Scene type is represented by line color (green = static, blue = dynamic) and viewing task by line type (solid = free view, dashed = spot-the-location). Error bars represent ±1 SE for the mean percentage across all scenes.


Several dynamic scene-viewing studies have reported a bias toward dynamic elements of the scene, for example, changing pixels (i.e., flicker), motion, and optic flow (Berg et al., 2009; Carmi & Itti, 2006a, 2006b; Le Meur et al., 2007; Mital et al., 2011; 't Hart et al., 2009; Vig et al., 2009). If, as suggested by the results so far, participants are successfully directing their attention away from the dynamic features during the spot-the-location task, there should be a significant decrease in the amount of flicker around fixation compared with free viewing.

    The amount of flicker around fixation during freeviewing of dynamic scenes is significantly greater (M0.074, SD 0.008) than would be predicted by chance(control0.037,SD0.011),t(54) 11.196,p , 0.001.During the spot-the-location task (Figure 9; spot-the-location before), this difference disappears (M0.034, SD 0.022), t(54) 0.625, p 0.534, n.s.,suggesting that gaze is not directed toward changing

elements of the scene.

Given that previous analyses have revealed what

appeared to be a return to free-viewing-like behavior during spot-the-location trials in which participants did not respond (Figure 7), it was hypothesized that similar behavior might be observed after the spot-the-location response. To examine this behavior, the flicker around fixation after the spot-the-location response was identified. As can be seen from Figure 9 (right-most bar), the amount of flicker around fixation increases significantly (M = 0.056, SD = 0.024) relative to before the response, t(13) = 2.749, p < 0.05, and is significantly greater than the flicker around control locations, t(54) = 4.417, p < 0.001. This suggests that participants are successfully directing their gaze away from the dynamic elements of the scene while attempting to recognize the location but returning to a default, free-viewing-like behavior that involves attending to changing elements of the scene after they have made a response.
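A sketch of the flicker-at-fixation measure and its control baseline (illustrative names, not the authors' code; grayscale frames are assumed, and the 2° window must be converted to pixels for a given display and viewing distance):

    import numpy as np

    def flicker_at(frame, prev_frame, x, y, radius_px):
        """Mean absolute luminance change between consecutive grayscale
        frames inside a square window (~2 deg) centered on (x, y)."""
        h, w = frame.shape
        x0, x1 = max(0, x - radius_px), min(w, x + radius_px)
        y0, y1 = max(0, y - radius_px), min(h, y + radius_px)
        diff = np.abs(frame[y0:y1, x0:x1].astype(float) -
                      prev_frame[y0:y1, x0:x1].astype(float))
        return diff.mean() / 255.0  # normalize to 0..1

    def control_flicker(frames, fixations, radius_px, rng):
        """Baseline flicker: re-pair each fixation location with a randomly
        chosen different frame, preserving gaze biases such as screen center."""
        vals = []
        for (t, x, y) in fixations:  # fixations: (frame_idx, x, y)
            t2 = int(rng.integers(1, len(frames)))  # random frame index >= 1
            vals.append(flicker_at(frames[t2], frames[t2 - 1], x, y, radius_px))
        return float(np.mean(vals))

Flicker at actual fixations uses the same window function with each fixation's own frame; comparing the two means gives tests of the kind reported above.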

    Probability of fixating people

But what are the dynamic objects that participants are attending to? Given that the scenes are filmed from a stationary viewpoint, the only elements that will change are people moving past the camera, traffic, animals, and objects acted upon by these agents or the weather (e.g., leaves blown by the wind). People have been shown to be the target of a significant amount of gaze in static scenes irrespective of viewing task (Birmingham et al., 2008; Buswell, 1935; Castelhano et al., 2007; Yarbus, 1967). Some evidence suggests this is also the case when viewing close-up dynamic scenes of people (Kuhn, Tatler, Findlay, & Cole, 2008; Laidlaw, Foulsham, Kuhn, & Kingstone, 2011; Risko, Laidlaw, Freeth, Foulsham, & Kingstone, 2012; Vo, Smith, Mital, & Henderson, 2012). The attentional synchrony observed during free viewing of our dynamic scenes may be due to all participants attending to the same people at the same time. However, during the spot-the-location task, attending to people would be detrimental to recognizing the location.

The probability of fixating people was calculated for each condition by identifying whether each fixation fell within a person dROI and scoring a hit (1) or a miss (0). Averaging this probability across all fixations produced an overall probability of fixating people in each condition. The overall mean probability that a person is fixated in each scene type (static vs. dynamic) and task (free view, spot-the-location before response, and spot-the-location after response) is presented in Figure 10. A mixed ANOVA of fixation probability comparing static/dynamic and free viewing to spot-the-location before the response reveals a significant effect of scene type, F(1, 54) = 32.709, p < 0.001, ηp² = 0.377; task, F(1, 54) = 130.7, p < 0.001, ηp² = 0.708; and a significant interaction, F(1, 54) = 19.026, p < 0.001, ηp² = 0.261. The interaction can be attributed to the absence of a difference between static and dynamic scenes in the spot-the-location task before the response compared with free viewing (Figure 10). During free viewing of dynamic scenes, there is a more than 50% chance of gaze being allocated to people (M = 0.544, SD = 0.072). Gaze also demonstrates a large bias toward people in static scene free viewing (M = 0.420, SD = 0.08), although the probability is significantly lower than in dynamic scenes, t(27) = 7.394, p < 0.001.

Figure 9. Mean flicker in a 2° region around fixation during free viewing of dynamic scenes (Free view = green bar) and during the spot-the-location task before a response was made (STL Before = middle blue bar) and after (STL After = right blue bar). A baseline rate of flicker around control locations is presented (solid red line = mean flicker, dashed lines = ±1 SE). The control locations control for gaze biases (such as screen center) by being randomly sampled from different frames in the same scene. Error bars represent ±1 SE.


This difference is absent before the spot-the-location response (static = 0.289, SD = 0.073; dynamic = 0.306, SD = 0.078; t < 1), even though the probabilities still remain significantly greater than chance (chance = 0.235, SD = 0.027; all ps < 0.001; Figure 10, solid red line).
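The hit/miss scoring itself is simple to express. A sketch, assuming person dROIs are stored per frame as bounding rectangles (this representation and the names are simplifications of the hand-annotated regions, not the study's actual data format):

    def fixation_hits_person(x, y, frame_idx, person_rois):
        """1 if the fixation at (x, y) falls inside any person region (dROI)
        on its frame, else 0. person_rois: {frame_idx: [(x0, y0, x1, y1), ...]}."""
        return int(any(x0 <= x <= x1 and y0 <= y <= y1
                       for (x0, y0, x1, y1) in person_rois.get(frame_idx, [])))

    def prob_fixating_people(fixations, person_rois):
        """Overall probability = mean hit rate across all fixations.
        fixations: list of (frame_idx, x, y) tuples."""
        hits = [fixation_hits_person(x, y, t, person_rois) for (t, x, y) in fixations]
        return sum(hits) / len(hits)

The chance level shown in Figure 10 reflects the proportion of the screen covered by people regions, so applying the same hit test to area-matched random locations would approximate it.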

The difference between static and dynamic scenes returns following the spot-the-location response (Figure 10; right bars): main effect of before versus after response, F(1, 27) = 58.03, p < 0.001, ηp² = 0.682; static versus dynamic, F(1, 27) = 8.617, p < 0.01, ηp² = 0.242; and a significant interaction, F(1, 27) = 10.532, p < 0.01, ηp² = 0.281. The re-emergence of the greater bias toward people in dynamic scenes compared with static scenes after the response can be seen clearly in the mean fixation probabilities: static = 0.34 (SD = 0.066), dynamic = 0.409 (SD = 0.070), t(27) = 4.521, p < 0.001. However, fixation probabilities remain significantly lower than during free viewing: static free viewing versus static spot-the-location after response, t(54) = 4.081, p < 0.001; dynamic free viewing versus dynamic spot-the-location after response, t(54) = 7.122, p < 0.001.

These results confirm that gaze is predominantly allocated to people during free viewing of dynamic and static scenes, but removing the utility of these fixations via task instruction can result in less gaze to people, even though the scenes remain the same. The influence of viewing task on allocation of gaze to people becomes

very clear once time is taken into account (Figure 11). During free viewing, more than half of all first saccades after scene onset are directed to people (1000-ms time bin; Figure 11, top). This is irrespective of whether the scene is dynamic or static. By 2500 ms after scene onset, the probability of fixating people has decreased for static scenes (0.5, SD = 0.178) but remains high for dynamic scenes (0.58, SD = 0.15), t(27) = 2.147, p < 0.05. This difference increases over the duration of each scene as the fixation probability continues to decrease for static scenes but remains high for dynamic scenes. By comparison, gaze during the spot-the-location task (before the response) exhibits the same initial tendency toward people in both static and dynamic scenes (1000-ms time bin; Figure 11, middle), but this bias is immediately eradicated as the fixation probability decreases to the chance baseline in the following time bin (1500-ms time bin; Figure 11, middle). This initial tendency toward people cannot be attributed to involuntary capture by the motion associated with people, as the same tendency is observed in static and dynamic scenes (e.g., compare the dynamic gaze heat maps for Frame 19 of Video 2 in both static and dynamic free viewing). It is possible that this initial peak in the probability of fixating people may be due to the collocation of people with the screen center and the tendency for first saccades to be directed toward the screen center after the randomly located fixation cross. This central bias and its impact on the probability of fixating people may be responsible for the greater-than-chance mean probability previously reported before the spot-the-location response (Figure 10). This possibility will be explored in subsequent

studies without a random fixation cross.

As the duration of scene presentation increases, the number of participants who have not yet responded to the spot-the-location task decreases, introducing a high degree of variance in the fixation probabilities (Figure 11, middle). This has the opposite effect on the number of participants contributing to the after-the-response group (Figure 11, bottom) and allows the return to default viewing behavior to be observed. After the spot-the-location response, the probability of fixating people increases and the difference between static and dynamic scenes returns (Figure 11, bottom), for example, 7000-ms time bin, t(27) = 2.014, p < 0.05. Thus, participants appear to modulate their allocation of gaze to people as a function of its utility to their viewing task.
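The time-binned version of the same measure, reusing fixation_hits_person from the sketch above (bin width taken from Figure 11; names and the tuple layout are illustrative):

    import numpy as np

    def binned_person_probability(fixations, person_rois, bin_ms=500, n_bins=20):
        """Mean probability of fixating a person in successive time bins after
        scene onset. fixations: list of (onset_ms, frame_idx, x, y) tuples."""
        probs = []
        for b in range(n_bins):
            lo, hi = b * bin_ms, (b + 1) * bin_ms
            hits = [fixation_hits_person(x, y, t, person_rois)
                    for (ms, t, x, y) in fixations if lo <= ms < hi]
            probs.append(np.mean(hits) if hits else np.nan)  # NaN for empty bins
        return np.array(probs)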

    Discussion

Attending to a dynamic naturalistic scene entails reconciling the temporal demands of scene features with the intentions of the viewer.

Figure 10. Mean probability of each fixation falling within a dynamic person region across all scenes for the two stimulus types (static = green vs. dynamic = blue; bars) and viewing tasks (x-axis): free view (left), spot-the-location before response (middle), spot-the-location after response (right). Solid red line indicates the chance fixation probability given the size of the people regions. Error bars and dashed red lines represent ±1 SE.


Previous investigations into the impact of viewing task on gaze behavior in dynamic scenes have failed to demonstrate any differences in gaze behavior (Taya et al., 2012), lending support to the view that low-level visual features such as motion and flicker are dominant in influencing gaze allocation in dynamic scenes (Carmi & Itti, 2006a, 2006b; Itti, 2005). However, the problem with prior studies investigating the influences on gaze in dynamic scenes is that the semantic features that may be significant to viewers by default during free viewing, such as people, are also sources of low-level salience (e.g., motion). In the present study, we aimed to decorrelate areas of high salience (i.e., motion) from areas of optimal relevance to the viewing task (i.e., static indices of scene location) and to compare gaze behavior in this extreme situation. The role of scene dynamics was also identified by comparing gaze behavior between static and dynamic versions of the same scenes.

The results of the present study revealed a significant impact of viewing task and scene type (static vs. dynamic) across a series of measures of gaze behavior. Basic oculomotor measures showed significant differences between the two viewing tasks for dynamic and static scenes. Across both scene types, fixation durations were longer and saccade amplitudes shorter during free viewing compared with spot-the-location. Both measures showed the characteristic change over time previously reported in static scenes (fixation durations increase and saccade amplitudes decrease;

Figure 11. Mean probability of fixating people over time (averaged into 500-ms bins; x-axis) across all scenes for the two stimulus types (static = green dashed lines, dynamic = blue solid lines) and viewing tasks (free view = top, before the spot-the-location response = middle, after the spot-the-location response = bottom). Solid red line indicates chance given the size of the people regions. Error bars and dashed red lines represent ±1 SE.


are specific enough to allow the default viewing strategy of prioritizing dynamic/social features to be overcome.

    The influence of scene dynamics

Our results provide further evidence of interesting differences between gaze behavior in dynamic and static scenes (see Dorr et al., 2010, for preference comparisons). During free viewing, saccade amplitudes were significantly longer in dynamic compared with static scenes, but gaze was more distributed in static scenes. This suggests that gaze spent longer in each location in dynamic scenes after making large saccades to get there. Longer mean fixation durations in dynamic compared with static scenes confirm this, as do the greater attentional synchrony, lower number of gaze clusters, and higher probability of fixating people. The pattern of behavior during free viewing of dynamic scenes can be described as a tendency to quickly identify a small number of areas of interest in a scene and devote a large proportion of viewing time to fixating and pursuing objects at these locations. In the naturalistic interior and exterior scenes used in this study, more than half of all fixations coincide with people (54.4%), a significantly greater proportion than would be predicted by chance. A similar bias toward people is observed during free viewing of static scenes (42%), but the shorter fixations and greater screen exploration suggest either that attention is held less by the people or that it takes less time for interesting information to be encoded for each static person compared with dynamic people.

At first glance, finding longer fixations in scenes containing motion may seem counterintuitive, as dynamic scenes should contain more centers of interest, which change over time, resulting in more competition for gaze. Dorr et al. (2010) also reported similarly longer fixations in dynamic versions of naturalistic scenes compared with static versions. However, the influence of scene content was not controlled for, as the presentation of scenes across stimulus types was not counterbalanced across participants, meaning that differences in fixation durations for particular scene content and individual viewer differences in mean fixation durations (Castelhano & Henderson, 2008) cannot be eliminated as the reasons for differences across stimulus types. Our replication of longer fixation durations when such individual scene and viewer differences are controlled suggests that the simple competition hypothesis of gaze control in dynamic scenes does not hold. In a fixate-move model of eye movement control (e.g., Findlay & Walker, 1999), greater competition would result in greater "move" signals, pulling the eyes to peripheral locations, and shorter fixations. However, the video stimuli used in this study updated only every 40 ms (25 fps) and were filmed from a static, human-like vantage point, meaning that most naturally occurring motion (e.g., people, animals, traffic) could not move far across the screen between frames.

Video 2. Dynamic heat maps representing the distribution of gaze over time for Scene 19 (Meadows playground). The four videos depict the viewing conditions (free view = left column; spot-the-location = right column) and scene types (static = top row; dynamic = bottom row). During free viewing, hotter colors indicate greater attentional synchrony between the gaze of multiple viewers. During spot-the-location, attentional synchrony is initially represented as a cold-hot heat map (i.e., blue to greenish yellow); then, as participants identify the location (i.e., press the button), they switch to a red-yellow heat map.

Video 3. Dynamic heat maps representing the distribution of gaze over time for Scene 12 (National Museum of Scotland). The four videos depict the viewing conditions (free view = left column; spot-the-location = right column) and scene types (static = top row; dynamic = bottom row). During free viewing, hotter colors indicate greater attentional synchrony between the gaze of multiple viewers. During spot-the-location, attentional synchrony is initially represented as a cold-hot heat map (i.e., blue to greenish yellow); then, as participants identify the location (i.e., press the button), they switch to a red-yellow heat map.


Unless new objects entered at the screen edge, most scene motion involved the semipredictable continuation of prior motion (Vig et al., 2009), and once fixated, this motion may remain within the foveal region for several frames. Once a fixated object has moved away from the foveal region, compensatory eye movements such as pursuit or catch-up saccades will occur. Many of the scenes used in this study also involved relatively stationary people engaging in tasks such as folding laundry (Scene 23; Figure 1), playing on a swing (Scene 19), playing bagpipes (Scene 20), dancing (Scene 4), or milling around at such a distance from the camera that it takes them several frames to change screen location (Scenes 2, 4, 6, 9, 10, 12, 13, 15, 17, 20). Fixating such regions of low dynamic but high social significance may result in greater top-down signals to hold fixation and acquire as much information as possible (Findlay & Walker, 1999; Nuthmann, Smith, Engbert, & Henderson, 2010). This hypothesis is supported by the observation of a greater probability of fixating people and more flicker at fixation. The change in gaze behavior during spot-the-location also confirms this hypothesis, as gaze is directed away from people to the static elements of the scene, resulting in shorter fixations.

    Attentional synchrony

The presence of attentional synchrony during free viewing of dynamic scenes and its near eradication by viewing instructions provide insight into what may cause the spontaneous clustering of gaze across multiple viewers. Given that attentional synchrony requires all viewers to make the same decisions about where and when to allocate gaze, it might be assumed that as scene complexity increases and the number of sources of information available for making these decisions increases, so should the variation in the outcome decided by multiple individuals. However, the exact opposite appears to be true. Attentional synchrony increases with realism: static scenes < dynamic scenes. Interobserver agreement during viewing of static scenes is very low (Mannan et al., 1997; Tatler et al., 2005), even though the same objects in a scene tend to be prioritized by multiple viewers (Buswell, 1935; Yarbus, 1967). In dynamic scenes, previous studies have shown a high degree of attentional synchrony in edited and highly composed videos (Carmi & Itti, 2006a, 2006b; Goldstein et al., 2007; Mital et al., 2011; Ross & Kowler, 2013; Sawahata et al., 2008; Smith & Henderson, 2008; 't Hart et al., 2009; Tosi et al., 1997). A large factor in the shaping of attentional synchrony may be shot and sequence composition (Smith, 2012a, 2013). However, here we show a greater degree of attentional synchrony in minimally composed and unedited naturalistic dynamic scenes compared with content-matched static scenes. Averaging across different dynamic scenes reveals a consistently high level of attentional synchrony throughout presentation time. However, examining the attentional synchrony for an individual scene reveals fluctuations as scene content changes (also see figure 6 in Dorr et al., 2010). For example, Video 1 (Scene 20 in Figure 1) depicts a street scene in Edinburgh (Princes Street, by the Sir Walter Scott monument). During free viewing of the scene, there are several moments of tight gaze clustering, such as when a group of boys enters the scene from screen left (Frame 63), when a woman enters the screen to give money to the bagpiper (Frame 204), and when the woman in red walks toward the camera (Frame 330). Similar moments are evident in all dynamic scenes (also see Videos 2 and 3). Each of these moments is signified by a sudden change in the scene, either the appearance of a person or a change in a person's direction.

Relative to the static background, the moments when an object suddenly appears in the frame can be identified by a high degree of flicker (Mital et al., 2011). The significance of the event is temporally defined and, as such, absent from static presentations of the same scene. Although the addition of motion increases the potential sources of information in the scene compared with static presentations, the time constraints on when information is available simplify the viewer's task of identifying which elements of the scene are significant at a particular moment. This results in attentional synchrony, as all viewers default to the same strategy of prioritizing points of high novelty during free viewing. Such a predilection toward novelty has been a factor in basic attention research for decades (Treisman & Gelade, 1980; Wolfe, 2007) and has recently been applied to gaze prediction in static scenes (Najemnik & Geisler, 2005) and TV clips (Itti & Baldi, 2009). Itti and Baldi (2009) used a Bayesian definition of surprise to capture subjective impressions of the continuity of low-level visual features in TV clips and could predict 72% of all gaze shifts and 84% of the moments when all viewers gazed at the same location.
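For reference, Itti and Baldi's surprise is defined information-theoretically as the Kullback-Leibler divergence between an observer's posterior and prior beliefs over a space of models M after observing data D (this restates their published definition rather than any analysis from the present study):

\[
S(D, \mathcal{M}) = \mathrm{KL}\big(P(M \mid D) \,\|\, P(M)\big) = \int_{\mathcal{M}} P(M \mid D)\, \log \frac{P(M \mid D)}{P(M)}\, dM
\]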


However, in that study, viewers were instructed to follow the main actors and actions, so that their gaze shifts reflected "an active search for non-specific information of subjective interest" (p. 1299), which would bias them toward the moving elements of the scene, the elements most likely to exhibit surprising changes. During our free viewing of dynamic scenes, we observe similar spontaneous tracking of moving objects and people, suggesting that this behavior may be a default for most viewers when attending to naturalistic dynamic social scenes. However, as has rightly been noted before, instructing participants to free view a scene does not create a systematic default viewing task across all viewers but instead creates ambiguity about which task each individual is choosing to engage in (Tatler et al., 2005; Tatler et al., 2011). For example, during free viewing of Video 1, the majority of viewers decide to watch the bagpiper until new people appear in the scene. Similarly, in Video 2, all viewers initially look at the boys playing on the swing until one boy runs off and a seagull flies across the top of the scene, creating three areas of interest and dividing the viewers into three gaze clusters. By gazing at areas of greatest change, viewers are allocating their processing resources and area of greatest visual acuity to the areas of the screen that are likely to hold the greatest amount of information, creating an optimal viewing pattern for an unspecified viewing task (Najemnik & Geisler, 2005).

Once a viewing task is specified that allows viewers to preset their sensitivity to visual features or locations of a scene (Folk, Remington, & Johnston, 1992), the utility of attending to motion or areas of change may be negated. Such presetting can explain the change in gaze behavior observed across our two viewing tasks. It has been proposed that computational models of gaze allocation in dynamic scenes should combine bottom-up salience computations with top-down factors to account for this effect (Itti, 2005). Models adopting this approach attempt to learn the task relevance of scene features (Navalpakkam & Itti, 2005) or are hard-coded with contextual cues that modulate the strength of the bottom-up salience map (Torralba et al., 2006). Such models might be able to account for the differences in gaze behavior observed across our free-viewing and spot-the-location tasks. The indirect instruction to attend to static features of a scene could decrease sensitivity to dynamic features in the salience map, essentially performing foreground subtraction on the scene so that moving features effectively disappear from the map. Such an approach would be blind to scene semantics.
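As an illustration of such task-dependent presetting (a minimal sketch, not any published model's implementation; the channel names and weight values are invented for the example):

    import numpy as np

    def combined_salience(feature_maps, task_weights):
        """Weighted combination of normalized feature maps (e.g., 'edges',
        'color', 'flicker', 'motion') into a single salience map. Setting the
        dynamic channels' weights near zero approximates the 'foreground
        subtraction' described in the text."""
        salience = None
        for name, fmap in feature_maps.items():
            norm = (fmap - fmap.min()) / (np.ptp(fmap) + 1e-9)  # rescale to 0..1
            weighted = task_weights.get(name, 1.0) * norm
            salience = weighted if salience is None else salience + weighted
        return salience / max(sum(task_weights.values()), 1e-9)

    # Hypothetical presets: free viewing weights all channels equally;
    # spot-the-location down-weights the dynamic channels.
    free_view_weights = {"edges": 1.0, "color": 1.0, "flicker": 1.0, "motion": 1.0}
    spot_location_weights = {"edges": 1.0, "color": 1.0, "flicker": 0.1, "motion": 0.1}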

Two pieces of evidence argue against a simple motion-presetting hypothesis: (a) attentional synchrony is still greater in dynamic spot-the-location than in static, and (b) the probability of fixating people remains greater than chance. These results may be spurious, evidence of participants failing to follow the instruction to respond whether or not they knew the location and instead reverting back to free-viewing-like behavior before responding (see Figures 7 and 11). However, looking at individual examples of gaze behavior in dynamic spot-the-location trials, it seems that the first couple of saccades are always directed first to the screen center and then to nearby people before heading into the periphery to locate indexical scene features such as a clock (Video 3), the Sir Walter Scott monument (Video 1), or a tower block (Video 2). This initial bias toward central people is confirmed by Figure 11. This bias could be evidence of motion cutting through top-down presetting, of an initial default bias toward people that is hard to override, or simply of the coincidence of motion and people with the screen center, as is also observed in static scenes (Tatler, 2007). Further experiments dissociating motion from people are required to test these hypotheses.

Finally, our results further demonstrate online endogenous control of gaze in naturalistic visual scenes (Ballard, Hayhoe, & Pelz, 1995; Hayhoe & Ballard, 2005; Tatler et al., 2011) but extend the findings to situations in which salience has been thought to be the strongest predictor of gaze (i.e., noninteractive dynamic scenes). These findings mirror similar recent findings using dynamic scenes. Gaze behavior has been shown to change during the viewing of videos depicting people talking to a camera as the availability of audio information changes (Lansing & McConkie, 2003; Vo et al., 2012); gaze is tightly linked to the social cues of a magician (Kuhn et al., 2008) and impervious to distraction by sudden onsets during a magic trick (Smith, Lamont, & Henderson, in press). Recent evidence of gaze behavior during the perception of humans engaged in tasks also suggests that viewers distribute their attention to areas of the scene deemed relevant by their expectations about future events (Smith, 2012b). Such behaviors go far beyond the predictions of current saliency-based models of gaze and highlight the need for further work on the computational modeling of higher-order factors such as viewing task, social features, and event dynamics.

    Conclusion

The present study demonstrates how the degree of attentional synchrony during free viewing of prerecorded naturalistic dynamic scenes can be decreased by a viewing task prioritizing scene semantics defined by static, background features. Top-down control modulates bottom-up biases toward moving objects and changes which parts of a dynamic scene are fixated and the parameters of the eye movements used to do so (e.g., fixation durations and saccade amplitudes). During free viewing, a default tendency toward dynamic features leads to a high degree of attentional synchrony across multiple viewers. However, the similarity in the initial bias toward people during free viewing of both static and dynamic scenes suggests that attentional synchrony may not be due only to exogenous control of attention by motion but also to a correlation between low-level features and default


cognitive relevance (e.g., interest in people). Such interactions between exogenous and endogenous influences on gaze behavior in dynamic scenes must be explored in future studies.

Keywords: eye tracking, naturalistic dynamic scenes, attention, social attention, motion, saccadic eye movements, fixations, attentional synchrony

    Acknowledgments

Thanks to John Henderson, George Malcolm, Antje Nuthmann, Annabelle Goujon, and the rest of the Edinburgh University Visual Cognition lab for their feedback during early presentations of these data. These data were presented at the Eyetracking in Dynamic Scenes workshop (Edinburgh, UK; September 7, 2007), the Vision Science Society annual meetings (Naples, FL; May 9-14, 2008, and May 6-11, 2011), and the European Conference on Eye Movements (Marseille, France; August 21-25, 2011).

Commercial relationships: none.
Corresponding author: Tim J. Smith.
Email: [email protected]
Address: Department of Psychological Sciences, Birkbeck, University of London, London, UK.

    References

Antes, J. R. (1974). Time course of picture viewing. Journal of Experimental Psychology, 103(1), 62-70.

Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824-2833.

Ballard, D. H., Hayhoe, M. M., & Pelz, J. B. (1995). Memory representations in natural tasks. Journal of Cognitive Neuroscience, 7(1), 66-80.

Berg, D. J., Boehnke, S. E., Marino, R. A., Munoz, D. P., & Itti, L. (2009). Free viewing of dynamic stimuli by humans and monkeys. Journal of Vision, 9(5):19, 1-15, http://www.journalofvision.org/content/9/5/19, doi:10.1167/9.5.19.

Birmingham, E., Bischof, W. F., & Kingstone, A. (2008). Gaze selection in complex social scenes. Visual Cognition, 16, 341-355.

Bishop, C. M. (2007). Pattern recognition and machine learning. New York: Springer.

Buswell, G. T. (1935). How people look at pictures: A study of the psychology of perception in art. Chicago: The University of Chicago Press.

Carmi, R., & Itti, L. (2006a). The role of memory in guiding attention during natural vision. Journal of Vision, 6(9):4, 898-914, http://www.journalofvision.org/content/6/9/4, doi:10.1167/6.9.4.

Carmi, R., & Itti, L. (2006b). Visual causes versus correlates of attention selection in dynamic scenes. Vision Research, 46, 4333-4345.

Castelhano, M. S., & Henderson, J. M. (2008). Stable individual differences across images in human saccadic eye movements. Canadian Journal of Experimental Psychology, 62(1), 1-14.

Castelhano, M. S., Mack, M., & Henderson, J. M. (2009). Viewing task influences eye movement control during active scene perception. Journal of Vision, 9(3):6, 1-15, http://www.journalofvision.org/content/9/3/6, doi:10.1167/9.3.6.

Castelhano, M. S., Wieth, M. S., & Henderson, J. M. (2007). I see what you see: Eye movements in real-world scenes are affected by perceived direction of gaze. In L. Paletta & E. Rome (Eds.), Attention in cognitive systems (pp. 252-262). Berlin: Springer.

Cerf, M., Frady, E. P., & Koch, C. (2009). Faces and text attract gaze independent of the task: Experimental data and computer model. Journal of Vision, 9(12):10, 1-15, http://www.journalofvision.org/content/9/12/10, doi:10.1167/9.12.10.

Cristino, F., & Baddeley, R. J. (2009). The nature of visual representations involved in eye movements when walking down the street. Visual Cognition, 17, 880-903.

Dorr, M., Martinetz, T., Gegenfurtner, K. R., & Barth, E. (2010). Variability of eye movements when viewing dynamic natural scenes. Journal of Vision, 10(10):28, 1-17, http://www.journalofvision.org/content/10/10/28, doi:10.1167/10.10.28.

Einhauser, W., Spain, M., & Perona, P. (2008). Objects predict fixations better than early saliency. Journal of Vision, 8(14):18, 1-26, http://www.journalofvision.org/content/8/14/18, doi:10.1167/8.14.18.

Elazary, L., & Itti, L. (2008). Interesting objects are visually salient. Journal of Vision, 8(3):3, 1-15, http://www.journalofvision.org/content/8/3/3, doi:10.1167/8.3.3.

Findlay, J. M., & Walker, R. (1999). A model of saccade generation based on parallel processing and competitive inhibition. Behavioral and Brain Sciences, 22(4), 661-674.
