
How saliency, faces, and sound influence gaze in dynamic social scenes

Antoine Coutrot
Gipsa-lab, CNRS, & Grenoble-Alpes University, Grenoble, France

Nathalie Guyader
Gipsa-lab, CNRS, & Grenoble-Alpes University, Grenoble, France

Conversation scenes are a typical example in which classical models of visual attention dramatically fail to predict eye positions. Indeed, these models rarely consider faces as particular gaze attractors and never take into account the important auditory information that always accompanies dynamic social scenes. We recorded the eye movements of participants viewing dynamic conversations taking place in various contexts. Conversations were seen either with their original soundtracks or with unrelated soundtracks (unrelated speech and abrupt or continuous natural sounds). First, we analyze how auditory conditions influence the eye movement parameters of participants. Then, we model the probability distribution of eye positions across each video frame with a statistical method (Expectation-Maximization), allowing the relative contribution of different visual features such as static low-level visual saliency (based on luminance contrast), dynamic low-level visual saliency (based on motion amplitude), faces, and center bias to be quantified. Through experimental and modeling results, we show that regardless of the auditory condition, participants look more at faces, and especially at talking faces. Hearing the original soundtrack makes participants follow the speech turn-taking more closely. However, we do not find any difference between the different types of unrelated soundtracks. These eye-tracking results are confirmed by our model that shows that faces, and particularly talking faces, are the features that best explain the gazes recorded, especially in the original soundtrack condition. Low-level saliency is not a relevant feature to explain eye positions made on social scenes, even dynamic ones. Finally, we propose groundwork for an audiovisual saliency model.

Introduction

From the beginning of eye tracking, we know that faces attract gaze and capture visual attention more than any other visual feature (Buswell, 1935; Yarbus, 1967). When present in a scene, faces invariably draw gazes, even if observers are explicitly asked to look at a competing object (Bindemann, Burton, Hooge, Jenkins, & de Haan, 2005; Theeuwes & Van der Stigchel, 2006). Many studies have established that face perception is holistic (Boremanse, Norcia, & Rossion, 2013; Farah, Wilson, Drain, & Tanaka, 1998; Hershler & Hochstein, 2005) and pre-attentive (Bindemann, Burton, Langton, Schweinberger, & Doherty, 2007; Crouzet, Kirchner, & Thorpe, 2010), and the brain structures specifically involved in face perception have been pointed out (Haxby, Hoffman, & Gobbini, 2000; Kanwisher, McDermott, & Chun, 1997). Despite their leading role in attention allocation, faces have rarely been considered in visual attention modeling. Over the past 30 years, numerous computational saliency models have been proposed to predict where gaze lands (see Borji & Itti, 2012, for a taxonomy of 65 models). Most of them are based on Treisman and Gelade's (1980) Feature Integration Theory, stating that low-level features (edges, intensity, color, etc.) are extracted from the visual scene and combined to direct visual attention (Itti, Koch, & Niebur, 1998; Koch & Ullman, 1985; Le Meur, Le Callet, & Barba, 2007; Marat et al., 2009). However, these models cannot be generalized to many experimental contexts, since the dynamic and social nature of visual perception is not taken into account (Tatler, Hayhoe, Land, & Ballard, 2011). Typical examples in which they fail dramatically are visual scenes involving faces (Birmingham & Kingstone, 2009). More recently, visual saliency models combining face detection with classical low-level feature extraction have been developed and have significantly outperformed the classical ones (Cerf, Harel, Einhäuser, & Koch, 2008; Marat, Rahman, Pellerin, Guyader, & Houzet, 2013).

Citation: Coutrot, A., & Guyader, N. (2014). How saliency, faces, and sound influence gaze in dynamic social scenes. Journal of Vision, 14(8):5, 1–17, http://www.journalofvision.org/content/14/8/5, doi:10.1167/14.8.5.

Journal of Vision (2014) 14(8):5, 1–17. doi: 10.1167/14.8.5. ISSN 1534-7362. © 2014 ARVO. Received October 24, 2013; published July 3, 2014.


Despite these significant efforts focused on attention modeling, auditory attention in general, and audiovisual attention in particular, has been left aside. Visual saliency models do not consider sound, even when dealing with dynamic scenes. When running eye-tracking experiments with videos, authors never mention the soundtracks or explicitly remove them, making participants look at silent movies, which is, of course, not an ecologically valid situation. Indeed, we live in a multimodal world and our attention is constantly guided by the fusion between auditory and visual information. Film directors offer a good illustration of this by using soundtrack to strengthen their hold on the audience. They manipulate the score to modulate the tension and tempo in scenes or to highlight important events in the story (Branigan, 2010; Zeppelzauer, Mitrovic, & Breiteneder, 2011; Chion, 1994). Research confirms that music may, in some cases, exert a significant impact upon the perception, interpretation, and remembering of film information (Boltz, 2004; Cohen, 2005). Not only music, but auditory information in general affects eye movements. In a previous study, we showed that removing the original soundtrack from videos featuring various visual content impacts eye positions, increasing the dispersion between the eye positions of different observers and shortening saccade amplitudes (Coutrot, Guyader, Ionescu, & Caplier, 2012).

Thus, what we hear has an impact on what we see. This is particularly true for speech and faces, which are known to strongly interact, as evidenced by the huge literature on audiovisual speech integration (Bailly, Perrier, & Vatikiotis-Bateson, 2012; Schwartz, Robert-Ribes, & Escudier, 1998; Summerfield, 1987). To investigate audiovisual integration, most of these studies presented talking faces to observers and measured how visual or auditory modifications impacted observers' eye movements or speech comprehension (Bailly, Raidt, & Elisei, 2010; Lansing & McConkie, 2003; Vatikiotis-Bateson, Eigsti, Yano, & Munhall, 1998). They identified the eyes and the mouth as two strong gaze attractors during audiovisual speech processing, and showed that the degree to which gaze is directed toward the mouth depends on the difficulty of the speech identification task. Yet, results emanating from experimental set-ups using isolated close-ups of faces might not be generally applied to the real world, where everything is continuously moving and embedded in a complex social and dynamic context. To address this issue, Vo, Smith, Mital, and Henderson (2012) eye-tracked participants watching videos of a pedestrian engaged in an interview. They showed that observers' gazes were dynamically directed to the eyes, the nose, or the mouth of the interviewee, according to events depicted (speech onsets, eye contact with the camera, quick movement of the head). The authors also found that removing the speech signal decreased the number of fixations on the pedestrian's face in favor of the scene background.

Nevertheless, in daily life, conversations often involve several speakers embedded in a complex scene (objects, background), who are not only listening to what is being said but also interacting dynamically. Thus, Foulsham and colleagues eye-tracked observers viewing video clips of people taking part in a decision-making task (Foulsham, Cheng, Tracy, Henrich, & Kingstone, 2010). These authors showed that gazes followed the speech turn-taking, especially when the speaker had high social status. These results indicate that during dynamic face viewing, our visual system operates in a functional, information-seeking fashion. A few very recent papers quantified how turn-taking affects the gaze of a noninvolved viewer of natural conversations (Foulsham & Sanderson, 2013; Hirvenkari et al., 2013). These studies presented conversations to participants either with the related speech soundtracks or without any sound. They both showed that sound changed the timing of looks. With the related speech soundtracks, speakers were fixated more often and more quickly after they took the floor, leading to a greater attentional synchrony.

All the previously reviewed studies reported behavioral and eye movement analyses, but did not quantify the relative contributions of faces (mute or talking) and of classical visual features to guide eye movements. Birmingham and Kingstone (2009) showed static social scenes to observers and compared their eye positions to the corresponding low-level saliency maps (within the meaning of Itti & Koch, 2000). The authors showed that saliency did not predict fixations better than chance. They noticed that classical low-level saliency models do not account for the bias of observers to look at the eyes within static social scenes. But what about dynamic scenes, where motion is known to be highly predictive of fixations, much more than static visual features (Mital, Smith, Hill, & Henderson, 2010)? What are the relative powers of classical visual features to attract gaze? How is their attractiveness modulated by auditory information? In this study, we first quantified temporally how different visual features explain the gaze behavior of noninvolved viewers looking at natural conversations embedded in complex natural scenes. Five classical visual features were compared: the faces of the conversation partners, the low-level static saliency, the low-level dynamic saliency, the center area, and chance (a uniform spatial distribution). We chose these features because they are often pointed out by the visual exploration literature. The center area reflects the center bias, i.e., the tendency one has to gaze more often at the center of the image than at the edges (Tseng, Carmi, Cameron, Munoz, & Itti, 2009). Then, we measured the influence of auditory information on these features.


Previous studies showed that different types of sounds interact differently with visual information when viewing videos (Vroomen & Stekelenburg, 2011; Song, Pellerin, & Granjon, 2013). Other studies dealing with static images and lateralized natural sounds showed that eye positions are biased toward sound sources, depending on the saliencies of both auditory and visual stimuli (Onat, Libertus, & Konig, 2007). Like visual saliency, auditory saliency is measured by how much an auditory event stands out from the surrounding scene (Kayser, Petkov, Lippert, & Logothetis, 2005). Thus, one can legitimately hypothesize that different auditory scenes (with different auditory saliency profiles) would have different impacts on the way one listens in on a conversation. For instance, an abrupt auditory event, with local saliency peaks, may not influence gaze in the same way as a continuous auditory stream. We extracted conversation scenes from Hollywood-like movies. We recorded the eye movements of participants watching the movies either with the original speech soundtrack, with an unrelated speech soundtrack, with the noise of moving objects (abrupt onsets, e.g., falling cutlery), or with continuous landscape sound (slowly changing components, e.g., wind blowing). We modeled the different recorded gaze patterns with the expectation-maximization (EM) algorithm, a statistical method widely used in statistics and machine learning, and recently successfully applied to visual attention modeling (Gautier & Le Meur, 2012; Ho-Phuoc, Guyader, & Guerin-Dugue, 2010; Vincent, Baddeley, Correani, Troscianko, & Leonards, 2009). This method is a mixture model approach that uses participants' eye positions to estimate the relative contribution of different potential gaze-guiding features. In the following, we first study the impact of sound on classical (saccade amplitudes, fixation durations, dispersion between eye positions) and less classical (distance between scanpaths) eye movement parameters. Then, thanks to the EM algorithm, we analyze how auditory information modulates the relative predictive power of different visual items (faces, low-level static and dynamic visual saliencies, center bias).

Methods

The experiment described in the following is part of a broader study (Coutrot & Guyader, 2013). The stimuli and the eye-tracking data described below are available at http://www.gipsa-lab.fr/~antoine.coutrot/.

Participants

Seventy-two participants took part in the experiment: 30 women and 42 men, from 20 to 35 years old (M = 23.5; SD = 2.1). Participants were not aware of the purpose of the experiment and gave their informed consent to participate. This study was approved by the local ethics committee. All were French native speakers, had normal or corrected-to-normal vision, and reported normal hearing.

Stimuli

The visual material consisted of 15 one-shot conversation scenes extracted from French Hollywood-like movies. Videos featured two to four conversation partners embedded in a natural environment. Videos lasted from 12 to 30 s (M = 19.6; SD = 4.9), had a resolution of 720 × 576 pixels (28 × 22.5 squared degrees of visual angle), and a frame rate of 25 frames per second. We chose stimuli featuring conversation partners embedded in complex scenes (cafe, streets, corridor, office, etc.) involving different moving objects (glasses, spoons, cigarettes, papers, etc.). Faces occupied an area of 3.3 ± 0.4 × 5.2 ± 0.9 deg² and were separated from each other by 10° ± 2°. Thus, on average, each face only occupied (3.3 × 5.2) / (28 × 22.5) = 2.7% of the frame area. The auditory material consisted of 45 monophonic soundtracks: a first set of 15 soundtracks extracted from the conversation scenes (dialogues), a second set of 15 soundtracks made up of noises from moving objects (short abrupt onsets, e.g., falling cutlery), and a third set of 15 soundtracks extracted from landscape scenes (continuous auditory stream, e.g., wind blowing).

To investigate the effect of auditory information on gaze allocation during a conversation, we created four auditory versions of the same visual scene, each one corresponding to an auditory condition: the Original version, in which visual scenes were accompanied by their original soundtracks; the Unrelated Speech version, in which the original soundtrack was replaced by another speech soundtrack from the first set; the Abrupt Sounds version, in which the original soundtrack was replaced by a soundtrack from the second set; and the Continuous Sound version, in which the original soundtrack was replaced by a soundtrack from the third set. In the following, the Unrelated Speech, Abrupt Sounds, and Continuous Sound conditions will be referred to as the Nonoriginal conditions. A soundtrack was associated with a particular visual scene only once. The soundtracks were monophonic and sampled at 48,000 Hz. All dialogues were in French.

Apparatus

Participants were seated 57 cm away from a 21-in. CRT monitor with a spatial resolution of 1024 × 768 pixels and a refresh rate of 75 Hz.


The head was stabilized with a chin rest, forehead rest, and headband. The audio signal was presented via headphones (HD280 Pro, 64 Ω, Sennheiser, Wedemark, Germany). Eye movements were recorded using an eye tracker (EyeLink 1000, SR Research, Ottawa, Canada) with a sampling rate of 1000 Hz and a nominal spatial resolution of 0.01 degree of visual angle. We recorded the eye movements of the dominant eye in monocular pupil–corneal reflection tracking mode.

Procedure

Each participant viewed the 15 different conversation scenes. The different auditory versions were balanced (e.g., four scenes in Original condition, four in Unrelated Speech condition, four in Abrupt Sounds condition, and three in Continuous Sound condition). Participants were told to carefully look at each video. Each experiment was preceded by a calibration procedure, during which participants focused their gaze on nine separate targets in a 3 × 3 grid that occupied the entire display. A drift correction was carried out between each video, and a new calibration procedure was performed if the drift error was above 0.5°. Before each video, a fixation cross was displayed in the center of the screen for 1 s. After that time, and only if the participant looked at the center of the screen (gaze-contingent display), the video was played on a mean gray level background. Between two consecutive videos, a gray screen was displayed for 1 s. To avoid any order effect, videos were randomly displayed. Each visual scene was seen in each auditory condition by 18 different participants.

Data extraction

Eye positions

The eye-tracker system sampled eye positions at 1000 Hz. Since videos had a frame rate of 25 frames per second, 40 eye positions were recorded per frame and per participant. In the following, an eye position is the median of the 40 raw eye positions: there is one eye position per frame and per subject. We discarded from analysis the eye positions landing outside the video area.
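As an illustration, here is a minimal sketch (in Python/NumPy; the array layout and function names are hypothetical, not the authors' code) of how the raw 1000 Hz samples can be collapsed into one eye position per frame and restricted to the video area:

```python
import numpy as np

def eye_position_per_frame(raw_xy, sample_rate=1000, frame_rate=25):
    """Collapse raw gaze samples into one (x, y) position per video frame.

    raw_xy: array of shape (n_samples, 2), gaze coordinates in pixels.
    With a 1000 Hz tracker and 25 fps videos, 40 samples fall within each
    frame; the per-frame position is the median of those samples.
    """
    samples_per_frame = sample_rate // frame_rate            # 40 samples
    n_frames = len(raw_xy) // samples_per_frame
    chunks = raw_xy[:n_frames * samples_per_frame].reshape(
        n_frames, samples_per_frame, 2)
    return np.median(chunks, axis=1)                         # (n_frames, 2)

def inside_video(positions, width=720, height=576):
    """Boolean mask of per-frame positions landing inside the video area."""
    x, y = positions[:, 0], positions[:, 1]
    return (x >= 0) & (x < width) & (y >= 0) & (y < height)
```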

Saccades

Saccades were automatically detected by the Eyelink software using three thresholds: velocity (30°/s), acceleration (8000°/s²), and saccadic motion (0.15°).

Fixations

Fixations were detected as long as the pupil was visible and as long as there was no saccade in progress.

Face labeling

The face of each conversation partner was marked by an oval mask. Since faces were moving, the coordinates of each mask were defined dynamically for each frame of each video. We used Sensarea, an in-house authoring tool allowing spatio-temporal segmentation of video objects to be performed automatically or semi-automatically (Bertolino, 2012).

Eye-tracking results

How does sound influence eye movements when viewing other people having a conversation? In this section, we characterize how some general eye movement parameters such as saccade amplitudes and fixation durations are affected by the auditory content. We also analyze the variability of eye movements between participants. Then, we perform a temporal analysis to describe how a given soundtrack influences observers' sequence of fixations across the exploration (scanpaths).

Global analysis

Saccade amplitudes

For each participant, we computed the mean saccade amplitude in each auditory condition (see Table 1).

Auditory conditions:
                              Original    Unrelated speech   Abrupt sounds   Continuous sound
Saccade amplitudes (degree)   4.5 ± 0.2   4.9 ± 0.2          5.0 ± 0.2       4.9 ± 0.2
Fixation durations (ms)       430 ± 23    423 ± 21           412 ± 21        419 ± 22
Dispersions (degree)          4.8 ± 0.5   5.3 ± 0.5          5.6 ± 0.6       5.5 ± 0.6

Table 1. General eye movement parameters in each auditory condition. Notes: Saccade amplitudes and fixation durations are averaged over participants, whereas dispersions are averaged over stimuli (M ± SE).


A one-way repeated measures ANOVA with mean saccade amplitude per subject as a dependent variable and auditory condition (Original, Unrelated Speech, Abrupt Sounds, and Continuous Sound) as within-subject factor was performed. A main effect of the auditory condition was found, F(3, 213) = 7.72; p < 0.001, and Bonferroni posthoc pairwise comparisons revealed that saccade amplitudes are higher for the three Nonoriginal conditions compared to the Original condition (all ps < 0.01). No difference was found between Nonoriginal auditory conditions (all ps = 1).

Saccade amplitudes follow a bimodal distribution, with modes around 1° and 7°, as shown in Figure 1a. We can notice that the first mode of the Original distribution is significantly higher than the first mode of the Unrelated Speech, Abrupt Sounds, and Continuous Sound distributions (three two-sample Kolmogorov-Smirnov tests between the Original condition and the three other conditions, all ps < 0.001). To further understand this bimodal distribution, we split the saccades into two groups: short (<3°) saccades, corresponding to the first mode, and large (>3°) saccades, corresponding to the second mode. In each group, we compared the proportion of saccades (a) starting from one face and landing on another one (Inter); (b) starting from one face and landing on the same one (Intra); and (c) starting from or landing on the background (Other; see Figure 1b). There are no Inter saccades in the first mode and almost no Intra saccades in the second mode. Thus it is reasonable to assume that the first mode represents the saccades made within a given face (from eyes to mouth, to nose, etc.) and that the second mode represents the saccades made between faces.

Fixation durations

We conducted a one-way repeated measures ANOVA with mean fixation duration per subject as a dependent variable and auditory condition as within-subject factor. We did not find any effect of the auditory condition, F(3, 213) = 0.39; p = 0.76. Fixation durations follow a classical positively skewed, long-tailed distribution.

Dispersion

To estimate the variability of eye positions between observers, we used a dispersion metric. For a frame and n observers, with p = (x_i, y_i), i ∈ [1..n], the eye position coordinates, the dispersion D is defined as follows:

D(p) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}   (1)

The dispersion is the mean Euclidean distance between the eye positions of different observers for a given frame. Small dispersion values reflect clustered eye positions.
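A direct translation of Equation 1 into code could look as follows (a Python/NumPy sketch; `positions` is assumed to hold one eye position per observer for the frame):

```python
import numpy as np

def dispersion(positions):
    """Mean pairwise Euclidean distance between the eye positions of the
    n observers on one frame (Equation 1). `positions` has shape (n, 2)."""
    n = len(positions)
    if n < 2:
        return 0.0
    diff = positions[:, None, :] - positions[None, :, :]   # (n, n, 2) pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))                # pairwise distances
    return dist.sum() / (n * (n - 1))                       # diagonal terms are zero
```

The frame-level values can then be averaged over all the frames of a video, as done for Table 1.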

We averaged dispersion values over all frames and compared the results obtained for the 15 videos in each auditory condition (Table 1). We conducted a one-way repeated measures ANOVA with mean dispersion per video as a dependent variable and auditory condition as within-subject factor. A main effect of the auditory condition was found, F(3, 42) = 17.97; p < 0.001, and Bonferroni posthoc pairwise comparisons revealed that dispersion is higher in the three Nonoriginal conditions compared to the Original condition (all ps < 0.001).

Figure 1. (a) Probability density estimate of saccade amplitudes in each auditory condition. The density is evaluated at 100 equally spaced points covering the range of data (ksdensity Matlab function). (b) Proportion of saccades starting from one face and landing on another one (Interfaces); starting from one face and landing on the same face (Intrafaces); and starting from or landing on the background (Other). Saccades are separated into two groups: <3° saccades (corresponding to the first mode of Figure 1a) and >3° saccades (corresponding to the second mode of Figure 1a).


We found no difference between Nonoriginal conditions (Abrupt Sounds vs. Unrelated Speech: p = 0.07; Continuous Sound vs. Abrupt Sounds: p = 1; Continuous Sound vs. Unrelated Speech: p = 0.79).

We showed that in Nonoriginal auditory conditions, the dispersion between the eye positions of different subjects is higher and saccade amplitudes are larger. These results reflect a greater attentional synchrony in the Original condition: eye positions are more clustered in a few regions of interest. To better understand these global results, we looked at the temporal evolution of gaze behavior and compared subjects' scanpaths in each auditory condition.

Temporal analysis

In this section, we first look at the temporal evolution of the variability between observers' eye positions (dispersion) and of their distance from the screen center (distance to center [DtC]). For the sake of clarity, only the evolution along the first 80 frames was plotted, but analyses were carried out over whole videos. Then, for each auditory condition, we compare the number of fixations and the fixation sequences (scanpaths) landing on talking and mute faces. In the following, by talking face, we mean a face that talks in the Original auditory condition.

Dispersion

We represented the frame-by-frame evolution of dispersion (Figure 2). During the five first frames, dispersion remains small (around 0.5°), regardless of the auditory condition. Then, it increases sharply and reaches a plateau after the first second (around 25 frames) of visual exploration. During the first second, all dispersion curves are superimposed. But once the plateau has been reached, the dispersion curve in the Original condition stays below the others, as we found in the global analysis.

Distance to Center

DtC is defined, for a given frame, as the mean distance between observers' eye positions and the screen center. A small DtC value corresponds to a strong center bias, and can be seen as an indicator of the type of exploration strategy (active or passive). The center bias reflects the tendency one has to gaze more often at the center of the image than at the edges (see the Modeling section below). The DtC (not represented) follows the same pattern as dispersion. It stays small (around 0.5°) during the five first frames, then it increases sharply and reaches a plateau after the twentieth frame (around 6.5°). Contrary to dispersion, DtC curves do not differ significantly between auditory conditions during the whole experiment.

Fixation ratio

We matched the eye positions to the frame-by-frame labeled faces previously defined. We also manually spotted the time periods during which each face was speaking. Speaking and mute time periods were defined in the Original auditory condition, i.e., when the face was actually articulating. Thus, we were able to spatio-temporally distinguish talking faces from mute faces. For each of the 33 faces present in our stimuli and for each frame, we computed a fixation ratio, i.e., the number of fixations landing on the face divided by the total number of fixations. We then averaged these ratios over the speaking and the mute periods of time (28 faces talked at least once and 27 faces were silent at least once; see Table 2). We found that talking faces attracted gaze around twice as much as mute faces, regardless of the auditory condition. A one-way repeated measures ANOVA with fixation ratio on talking faces as a dependent variable and auditory condition as within factor was performed. A main effect of the auditory condition was found, F(3, 81) = 8.9; p < 0.001, and Bonferroni posthoc pairwise comparisons revealed that talking faces were more fixated in the Original than in the three Nonoriginal conditions (all ps < 0.001), but that there was no difference between Nonoriginal conditions (all ps = 1).
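The fixation ratio computation can be sketched as follows (Python; the array layout, names, and use of per-frame eye positions as a stand-in for fixations are assumptions, not the authors' code):

```python
import numpy as np

def face_fixation_ratios(eye_xy, face_mask, speaking):
    """Proportion of eye positions landing on one labeled face, averaged
    separately over its speaking and its mute frames.

    eye_xy:    (n_frames, n_obs, 2) integer eye positions (x, y).
    face_mask: (n_frames, height, width) boolean mask of the face.
    speaking:  (n_frames,) boolean, True when the face talks (defined in
               the Original auditory condition).
    """
    n_frames = eye_xy.shape[0]
    on_face = np.zeros(n_frames)
    for f in range(n_frames):
        x, y = eye_xy[f, :, 0], eye_xy[f, :, 1]
        on_face[f] = face_mask[f, y, x].mean()   # fraction of observers on the face
    talking_ratio = on_face[speaking].mean() if speaking.any() else np.nan
    mute_ratio = on_face[~speaking].mean() if (~speaking).any() else np.nan
    return talking_ratio, mute_ratio
```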

The same analysis was performed with mute faces. We did not find any effect of the auditory condition, F(3, 78) = 1.5; p = 0.21. These ratios might seem low compared with the literature.

Figure 2. Temporal evolution of the dispersion between observers' eye positions. Dispersions are computed frame-by-frame and averaged over the 15 videos of each auditory condition. Values are given in degrees of visual angle with error bars corresponding to the standard errors.


This is understandable since we used stimuli featuring conversation partners embedded in complex natural environments, with many objects that could also attract observers' gaze. To further understand how soundtracks impact the timing of looks on talking and mute faces, we used a string edit distance to directly compare observers' scanpaths.

Scanpath comparison

To compare scanpaths, a classical method is to use the Levenshtein distance, a string edit distance measuring the number of differences between two sequences (Levenshtein, 1966). This distance gives the minimum number of operations needed to transform one sequence into the other (insertion, deletion, or substitution of a single character), and has been widely used to compare scanpaths. In this case, the compared sequence is the sequence of successive fixations made by an observer across visual exploration (see Le Meur & Baccino, 2013, for a review). Here, we used quite a simple approach, since we only intended to compare the observer fixation patterns in regions of interest (faces), without considering the distance between them. For a given video, we sampled the eye movement sequence of each subject frame by frame. To each frame, we assigned a character corresponding to the area of the scene currently looked at (face a, face b, . . ., background; see Figure 3). We also defined the ground truth sequence, or GT, of each video. If a video lasts m frames, GT is an array of length m, such that if face a speaks at frame i, then GT(i) = a. If no face speaks at frame j, then GT(j) = background. This choice is quite conservative since even when no one is speaking, observers usually continue looking at faces. For each subject, we computed the Levenshtein distance between the fixation sequence recorded on each video and GT, normalized by the length m of the video.
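This comparison can be reproduced with a standard Levenshtein implementation (a Python sketch; the two sequences below are hypothetical, with one character per frame as described above):

```python
def levenshtein(seq_a, seq_b):
    """Minimum number of insertions, deletions, and substitutions needed to
    transform seq_a into seq_b (classic dynamic programming, two rows)."""
    m, n = len(seq_a), len(seq_b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

# One character per frame: the region looked at ('a', 'b', ... for faces,
# 'e' for the background), compared with the ground truth turn-taking.
scanpath     = "aaaabbbbeeeebbbb"   # hypothetical observer sequence
ground_truth = "aaabbbbbeeeeebbb"   # hypothetical GT sequence (same length m)
normalized_distance = levenshtein(scanpath, ground_truth) / len(ground_truth)
```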

Auditory conditions:
                                Original   Unrelated speech   Abrupt sounds   Continuous sound
Fixation in talking faces (%)   48 ± 5     40 ± 4             38 ± 4          38 ± 5
Fixation in mute faces (%)      20 ± 4     23 ± 3             22 ± 3          22 ± 3

Table 2. Fixation ratios (number of fixations landing on faces divided by the total number of fixations). Notes: Fixation ratios are computed for each face in each video. By averaging these ratios over speaking and silent time periods, we obtain fixation ratios for talking and mute faces (M ± SE).

Figure 3. Left: Frames are split into five regions of interest (face a, face b, face c, face d, and background e). At the bottom, each line represents the scanpath of a subject: each letter stands for the region the subject was looking at during each frame. Right: Mean normalized Levenshtein distance between the scanpaths and the ground truth sequence, in each auditory condition. Error bars correspond to the standard errors.


We conducted a one-way repeated measures ANOVA with mean-normalized Levenshtein distance per subject as a dependent variable and auditory condition as within-subject factor. A main effect of the auditory condition was found, F(3, 213) = 17.6; p < 0.001, and Bonferroni posthoc pairwise comparisons revealed that the Levenshtein distance was smaller between GT and the eye movement sequences recorded in the Original than in the three Nonoriginal conditions (all ps < 0.001). No difference between Nonoriginal conditions was found (Abrupt Sounds vs. Unrelated Speech: p = 0.12; Continuous Sound vs. Abrupt Sounds: p = 1; Unrelated Speech vs. Continuous Sound: p = 0.64). Thus, we found a greater similarity between scanpaths and the ground truth sequences in Original than in Nonoriginal conditions.

Interim summary

We show that the presence of faces deeply impacts visual exploration by attracting most fixations toward them. In particular, talking faces attract gaze around twice as much as mute faces, regardless of the auditory condition. In the Original auditory condition, eye positions are more clustered within face areas, leading to smaller saccade amplitudes. Temporal analysis reveals that, in contrast to mute faces, talking faces attract more observers' gazes in the Original condition. We find no significant difference between Nonoriginal conditions. These results are confirmed by the comparison between scanpaths and speech turn-taking, pointing out that in the Original condition, participants' gaze follows the speech turn-taking (GT) more closely than in Nonoriginal conditions.

To better characterize the differences between exploration strategies in each auditory condition, we quantify the importance of different visual features likely to drive gaze when viewing conversations. To do so, we model the probability distribution of eye positions by a mixture of different causes and separate their contributions with a statistical method, the EM algorithm.

Modeling

In this section, we quantify how soundtracks modulate the strength of potential gaze-guiding features such as static and dynamic low-level visual saliencies, faces, and center bias (see below). To separate and quantify the contribution of the different gaze-guiding features, we used the EM algorithm, a statistical method using observations (the recorded eye positions) to estimate the relative importance of each feature in order to maximize the global likelihood of the mixture model (Dempster, Laird, & Rubin, 1977). The EM algorithm is widely used in statistics and machine learning, and some recent studies have successfully applied it to visual attention modeling in static scenes (Gautier & Le Meur, 2012; Ho-Phuoc et al., 2010; Vincent et al., 2009). To our knowledge, EM has never been used on dynamic scenes. In order to represent the dynamic turn-taking of conversations, we computed the weights of the different features for each frame of each video.

Let P(w | f, v) be the probability distribution of the n eye positions with coordinates w = (x_i, y_i), i ∈ [1..n], made by n different observers on frame f of video v. To break this probability distribution down into m different gaze-guiding features, a classical method is to express P as a mixture of different causes U_k, each associated with a weight α_k:

P(w | f, v) = \sum_{k=1}^{m} \alpha_k(f, v) \, U_k(x, y, f, v), \qquad \text{with} \quad \sum_{k=1}^{m} \alpha_k(f, v) = 1

P and U have the same dimensions as frames (720 × 576). The EM algorithm converges toward the most likely combination of weights, i.e., the one that optimizes the maximum likelihood of the data, given the eye position probability distribution P and the features U. The first step (expectation) takes all the visual features modeling the data (low-level static and dynamic saliencies, center bias, uniform distribution, and face masks) and converts them into two-dimensional (2-D) spatial probability distributions. Assuming that the current model (i.e., the weight combination) is correct, the algorithm labels each eye position with the corresponding probability of each 2-D spatial distribution. The second step (maximization) assumes that these probabilities are correct and sets the weights of the different features to their maximum likelihood values. These two steps are iterated until a convergence threshold is reached. Finally, the best weight combination is found for each frame of each video in each auditory condition. This allows the frame-by-frame evolution of the relative importance of each feature to be followed. Below, we describe the features we chose for the mixture model: static and dynamic low-level saliencies, center bias, and faces (Figure 4).
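The weight estimation itself can be sketched as follows (a Python/NumPy sketch, not the authors' implementation: a generic EM for a mixture with fixed components, where each feature map is first converted into a 2-D spatial probability distribution):

```python
import numpy as np

def em_feature_weights(eye_xy, feature_maps, n_iter=100, tol=1e-6):
    """Estimate the mixture weights alpha_k of one frame with EM.

    eye_xy:       (n, 2) integer eye positions (x, y) of the n observers.
    feature_maps: list of m nonnegative maps (height, width); each one is
                  normalized here into a 2-D spatial distribution U_k.
    Returns the m weights (summing to one) maximizing the likelihood of the
    eye positions under P(w) = sum_k alpha_k * U_k(x, y).
    """
    U = np.stack([fm / fm.sum() for fm in feature_maps])      # (m, H, W)
    x, y = eye_xy[:, 0], eye_xy[:, 1]
    lik = U[:, y, x] + 1e-12                                  # (m, n): U_k at each eye position
    m = len(feature_maps)
    alpha = np.full(m, 1.0 / m)                               # uniform initialization
    for _ in range(n_iter):
        resp = alpha[:, None] * lik                           # E-step: responsibilities
        resp /= resp.sum(axis=0, keepdims=True)
        new_alpha = resp.mean(axis=1)                         # M-step: updated weights
        if np.abs(new_alpha - alpha).max() < tol:
            alpha = new_alpha
            break
        alpha = new_alpha
    return alpha
```

Running such a procedure on every frame of a video yields frame-by-frame weight curves; averaging over frames then gives per-video weights, as reported in Figure 5a.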

Low-level saliency

To compute the saliency of video frames, we used the spatio-temporal saliency model proposed by Marat et al. (2009). This biologically inspired model, only based on luminance information, is divided into two main steps: a retina-like and a cortical-like stage. Before the retina stage, camera motion compensation is performed to extract only the moving areas relative to the background. The retina-like stage does not model the photoreceptor distribution. It extracts, on one hand, low spatial frequencies further processed in the dynamic pathway to extract moving areas in the video frame, and on the other hand, high spatial frequencies further processed in the static pathway to extract luminance orientation and frequency contrast.


Then, the cortical-like stage processes these two pathways with a bank of Gabor filters.

Static saliency: The Gabor filter outputs are normalized to strengthen the filtered frames having spatially distributed maxima. Then, they are added up, yielding a static saliency map (Figure 4b). This map emphasizes the high luminance contrast.

Dynamic saliency: Through the assumption of luminance constancy between two successive frames, motion estimation is performed for each spatial frequency of the bank of Gabor filters. Finally, a temporal median filter is applied over five successive frames to remove potential noise from the dynamic saliency map (Figure 4c). This map emphasizes the moving areas, returning the amplitude of the motion.
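For illustration only, the following is a crude stand-in for the static pathway (it is not the Marat et al. model, which includes retinal filtering, camera motion compensation, and a specific normalization): Gabor energy maps computed on a grayscale frame at a few frequencies and orientations, normalized and summed (Python, using scikit-image; the parameter values are arbitrary assumptions):

```python
import numpy as np
from skimage.filters import gabor

def static_saliency_sketch(gray_frame, frequencies=(0.05, 0.1, 0.2),
                           n_orientations=4):
    """Very simplified static saliency map: sum of normalized Gabor energy
    maps computed on a grayscale frame."""
    saliency = np.zeros_like(gray_frame, dtype=float)
    for freq in frequencies:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            real, imag = gabor(gray_frame, frequency=freq, theta=theta)
            energy = np.hypot(real, imag)          # Gabor filter energy
            if energy.max() > 0:
                energy /= energy.max()             # naive normalization
            saliency += energy
    return saliency / saliency.max()
```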

Center bias

Most eye-tracking studies reported that subjects tend to gaze more often at the center of the image than at the edges. Several hypotheses have been proposed to explain this bias. Some are stimuli-related, like the photographer bias (one often places regions of interest at the center of the picture); others are inherent to the oculomotor system (motor bias) or to the observers' viewing strategy (Marat et al., 2013; Tatler, 2007; Tseng et al., 2009). As in Gautier and Le Meur (2012), the center bias is modeled by a time-independent bidimensional Gaussian function, centered at the screen center, as N(0, Σ), with

\Sigma = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix}

the covariance matrix and σ_x², σ_y² the variances. We chose σ_x and σ_y proportional to the frame size (28° × 22.5°), and ran the algorithm with several values ranging from σ_x = 2° to σ_x = 3.5° and σ_y = 1.6° to σ_y = 2.8°. Changing these values did not significantly change the outputs. The results presented in this study were obtained with σ_x = 2.3° and σ_y = 1.9° (Figure 4d).
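Such a center bias map can be generated as below (a Python sketch using the values reported above; the degree-to-pixel conversion assumes the 720 × 576 frame spans 28° × 22.5°). The resulting map can be fed as one of the feature maps of the EM sketch given earlier.

```python
import numpy as np

def center_bias_map(height=576, width=720, deg_w=28.0, deg_h=22.5,
                    sigma_x_deg=2.3, sigma_y_deg=1.9):
    """Time-independent 2-D Gaussian centered on the screen, built in pixels
    from standard deviations given in degrees of visual angle."""
    sigma_x = sigma_x_deg * width / deg_w     # degrees -> pixels
    sigma_y = sigma_y_deg * height / deg_h
    x = np.arange(width) - (width - 1) / 2.0
    y = np.arange(height) - (height - 1) / 2.0
    xx, yy = np.meshgrid(x, y)
    g = np.exp(-(xx**2 / (2 * sigma_x**2) + yy**2 / (2 * sigma_y**2)))
    return g / g.sum()                        # normalized to a probability map
```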

Uniform distribution

Fixations occur at all positions with equal probability (Figure 4e). This feature is a catch-all hypothesis that stands for any fixations that are not explained by other features. The lower the weight of this feature is, the better the other features will explain the data.

Faces

For a given frame, we created as many face maps as faces present in the frame. Face maps were made up of the corresponding face binary masks described in the Method section (Figure 4f, g). In Figure 5a, the All Faces weight corresponds to the sum of the weights of the different face maps in the frame.

For each video, the weight of each feature was averaged over time. We compared the weights of the different features for each video, regardless of the auditory condition, as well as the weights of each feature in the different auditory conditions (Figure 5a). Faces were by far the most important features explaining gaze behavior, regardless of the auditory condition (weights ≥ 0.6). This result matches the fixation ratios reported in Table 2: the fixation ratio in all faces (i.e., mute + talking) is around 60%.

We performed repeated measures ANOVA with Feature Weights (Static Saliency, Dynamic Saliency, Center Bias, Uniform, and Faces) and Auditory Conditions (Original, Unrelated Speech, Continuous Sound, and Abrupt Sounds) as within-subject factors.

Figure 4. Features chosen to model the probability distribution of eye positions on each frame.


The main effect of Feature Weights yielded an F ratio of F(4, 56) = 145.95, p < 0.001. Bonferroni pairwise comparisons revealed that the weight of Faces was significantly higher than the others (all ps < 0.001). There was no significant difference between Static Saliency, Dynamic Saliency, and Center Bias (Static vs. Dynamic: p = 1; Static vs. Center Bias: p = 0.67; Dynamic vs. Center Bias: p = 1). Uniform distribution was lower than Static and Dynamic Saliencies (Uniform vs. Static: p < 0.001; Uniform vs. Dynamic: p = 0.06), but was not significantly different from Center Bias (Uniform vs. Center Bias: p = 0.18). The main effect of Auditory Conditions yielded an F ratio of F(3, 42) = 74.39, p < 0.001. The interaction effect was also significant, with an F ratio of F(12, 168) = 6.43, p < 0.001. Bonferroni pairwise comparisons between each auditory condition for each feature were calculated.

Static Saliency, Dynamic Saliency, Center Bias, and Uniform features: No significant difference between auditory conditions (all ps = 1).

All Faces: We found that the All Faces weight is higher in the Original auditory condition than in the other conditions (Original vs. Unrelated Speech: p = 0.019; Original vs. Abrupt Sounds: p < 0.001; Original vs. Continuous Sound: p < 0.001). We found no significant difference between Unrelated Speech, Abrupt Sounds, and Continuous Sound (Abrupt Sounds vs. Unrelated Speech: p = 0.32; Continuous Sound vs. Abrupt Sounds: p = 1; Unrelated Speech vs. Continuous Sound: p = 1).

Talking and mute faces

We manually tagged the periods of time during which each face was speaking or mute (in the Original auditory condition), as was done to calculate the fixation ratios. By averaging the weights of face maps over these periods of time, we were able to separate the contribution of talking faces from that of mute faces. The weights shown in Figure 5b nicely match the fixation ratios reported in Table 2: around 20% for mute faces regardless of the auditory condition, around 50% for talking faces in the Original condition, and around 40% in the Nonoriginal conditions.

We conducted repeated measures ANOVA with the face category (mute and talking) and the auditory condition (Original, Unrelated Speech, Continuous Sound, and Abrupt Sounds) as within-subject factors. The main effect of the face category yielded an F ratio of F(1, 14) = 106.75, p < 0.001. The maps containing the talking faces had a mean weight of 0.45 and the maps containing the mute faces had a mean weight of 0.2. The main effect of auditory conditions yielded an F ratio of F(3, 42) = 5.16, p = 0.004. The interaction effect was also significant, with an F ratio of F(3, 42) = 20.14, p < 0.001.

Bonferroni pairwise posthoc comparisons revealed that talking face weights were higher in the Original auditory condition than in the other conditions (all ps < 0.001).

Figure 5. (a) Weights of the features chosen to model the probability distribution of eye positions (the sum of the five features equals one). (b) Contributions of talking and mute faces to the All Faces weight (the sum of the two equals the All Faces weight). For each video, weights are averaged over all frames. Results are then averaged over all videos and error bars correspond to the standard errors. *Marks a significant difference between auditory conditions for the corresponding feature (Bonferroni pairwise posthoc comparisons, see below for further details).


For the weights of the mute face map, we found no difference between the auditory conditions (Original vs. Unrelated Speech: p = 0.70; Original vs. Abrupt Sounds: p = 1; Original vs. Continuous Sound: p = 1; Unrelated Speech vs. Continuous Sound: p = 1; Abrupt Sounds vs. Unrelated Speech: p = 1; Continuous Sound vs. Abrupt Sounds: p = 1).

Interim summary

We show that in dynamic conversation scenes, low-level saliencies (both static and dynamic) and center bias are poor gaze-guiding features compared to faces, and especially to talking faces. Even if the related speech enhances the talking face weight by 10%, gaze is mostly driven toward talking faces by visual information. Indeed, even with unrelated auditory information, the weight of talking faces is still twice the weight of mute faces. We found no difference between unrelated auditory conditions.

Discussion

Gaze attraction toward faces is widely agreed upon. However, when trying to model visual attention, authors rarely take faces into account and never consider the auditory information that is usually part of dynamic scenes. In this paper, we quantify how auditory information influences gaze when viewing a conversation. For this purpose we eye-tracked participants viewing conversation scenes in different auditory conditions (original speech, unrelated speech, noises of moving objects, and continuous landscape sound), and we compared their gaze behaviors. First, we comment on our results with reference to previous studies on faces and visual attention. Then we discuss how speech and other sounds modulate gaze behavior when viewing conversations. Finally, we propose groundwork for an audiovisual saliency model.

Faces: Strong gaze attractors

We found that in every auditory condition, faces attract the most fixations (>60%). This central role of faces in visual exploration is reflected by the saccade amplitude distribution. Classically, saccades made during the free exploration of natural scenes follow a positively skewed, long-tailed distribution (Bahill, Adler, & Stark, 1975; Coutrot et al., 2012; Tatler, Baddeley, & Vincent, 2006). In contrast, here we found a bimodal distribution, with modes around 1° and 7°. An interpretation is that when viewing scenes including faces, participants make at least two kinds of saccades: intraface (from eyes to mouth to nose, etc.) and interface (from one conversation partner to another) saccades. We tested this hypothesis by comparing the proportion of intraface and interface saccades and their amplitudes. We found that almost all intraface saccades were concentrated within the first mode, while all interface saccades were concentrated within the second one. This result is confirmed by the mean face area (around 3° × 5°, matching the first mode) and the mean distance between conversation partners (around 10°, matching the second mode) present in our stimuli. Moreover, fixation durations were longer (around 420 ms) than usually reported in the literature (250–350 ms), which supports the idea of long explorations of a few regions of interest, like faces (Pannasch, Helmert, Herbold, Roth, & Henrik, 2008; Smith & Mital, 2013).

Studies have long established the specificity of faces in visual perception (Yarbus, 1967), but the use of static images made the generalization of their results to the real world problematic. Recently, some social gaze studies used dynamic stimuli to get as close as possible to ecological situations and confirmed that observers spend most of the time looking at faces (Foulsham et al., 2010; Frank, Vul, & Johnson, 2009; Hirvenkari et al., 2013; Vo et al., 2012). This exploration strategy leads eye positions to cluster on faces (Mital et al., 2010), and more generally induces a decrease in eye position dispersion, as compared to natural scenes without semantically rich regions (e.g., landscapes; Coutrot & Guyader, 2013). Our results are consistent with a very strong impact of faces on gaze behavior when exploring natural dynamic scenes. They extend previous findings by highlighting that the presence of faces in natural scenes leads to a bimodal saccade amplitude distribution corresponding to the saccades made within a same face and between two different faces. This strong impact of faces occurred even though the stimuli we chose featured conversation partners who were embedded in complex natural environments (cafe, office, street, corridor) and many objects that could also attract observers' gaze.

We also quantified and compared the strength of different gaze-guiding features such as static and dynamic low-level visual saliencies, faces, and center bias. Our results show that after a short predominance of the center bias (during the first five frames), faces are by far the most pertinent features to explain gaze allocation. This five-frame delay is classically reported for reflexive saccades toward peripheral targets (latency around 150–250 ms; Carpenter, 1988; Yang, Bucci, & Kapoula, 2002). Then, we found that although the weights are globally high for every face, they are even higher for talking faces, regardless of the auditory condition. This indicates that visual cues are sufficient to efficiently drive gaze toward speakers. Yet, the quite low weights we found for both static and dynamic low-level saliencies suggest that their contribution to gaze guiding is slight.


This result reinforces previous studies claiming that classical visual attention models do not account for human eye fixations when looking at static images involving complex social scenes (Birmingham & Kingstone, 2009). Thus, to explain the attractiveness of speakers even without their related speech, higher-level visual cues might be invoked, such as expressions or body language (Richardson, Dale, & Shockley, 2008). However, these are more difficult to model.

Influence of related speech

We found that if the fixation ratio is globally high for every face, it is even higher for talking faces, regardless of the auditory condition. As stated in the previous paragraph, this result suggests that since observers are able to follow speech turn-taking without the related speech soundtrack, visual and auditory information are in part redundant in guiding the viewers' gaze (as was also reported in Hirvenkari et al., 2013). So, what is the added value of sound? A body of consistent evidence shows that with the related speech, observers follow the speech turn-taking even more closely. First, the dispersion between eye positions made with the related speech was found to be smaller than without it (as was also reported in Foulsham & Sanderson, 2013). Second, when we modeled the gaze-attracting power of different visual features, the weights of talking faces were found to be significantly higher with than without the related speech. Third, the first mode of the saccade amplitude distribution (corresponding to the intraface type of saccade) was found to be much greater with than without the related speech. These results show that without the related speech soundtrack, observers were less clustered on talking faces, making fewer small saccades (from eyes to mouth to nose), which are usually made to better understand a speaker's momentary emotional state, or to support speech perception by sampling mouth movements and other facial nonverbal cues (Buchan, Pare, & Munhall, 2007; Vatikiotis-Bateson et al., 1998; Vo et al., 2012). Finally, we compared the scanpaths between subjects in each auditory condition to a ground truth sequence representing speech turn-taking. We found a greater similarity between subjects' scanpaths and the ground truth sequence in the original auditory condition. This result is consistent with the recent studies of Hirvenkari et al. (2013) and Foulsham and Sanderson (2013), which noted the temporal relationship between speech onsets and the deployment of visual attention. Both studies reported that with the related speech soundtrack, fixations on the speaker increased right after speech onset, peaking about 800 ms to 1 s later. Removing the sound did not affect the general gaze pattern, but it did change the speed at which fixations moved to the speaker. This is consistent with speech acting as an alerting signal that another conversation partner has taken the floor. Without related speech, observers have to realize that the speakership has shifted and seek the new speaker, which could explain the lower similarity between their scanpaths and the speech turn-taking. So far, we have discussed gaze behavior in the Original and Nonoriginal conditions; but what about the differences between the Nonoriginal conditions?
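
As an illustration of the dispersion measure mentioned above, here is a minimal sketch, assuming that the eye positions recorded for one video frame are stored as an (N, 2) array with one (x, y) row per observer; defining dispersion as the mean pairwise Euclidean distance is a common convention adopted here for illustration, not necessarily the exact metric used in our analysis.

    import numpy as np
    from itertools import combinations

    def dispersion(eye_positions):
        # Inter-observer dispersion for one video frame: mean Euclidean
        # distance (in pixels) between all pairs of observers' eye positions.
        # eye_positions: (N, 2) array, one (x, y) position per observer.
        pairs = combinations(np.asarray(eye_positions, dtype=float), 2)
        distances = [np.linalg.norm(p - q) for p, q in pairs]
        return float(np.mean(distances))

Averaging this quantity over frames and observers within each auditory condition gives a single dispersion value per condition, which can then be compared as described above.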

Influence of other soundtracks

Our results show an effect of the related speech on eye movements while watching conversations. But what about unrelated sounds? Studies showed that presenting natural images and lateralized natural sounds biased observers' gazes toward the part of the image corresponding to the sound source (Onat et al., 2007; Quigley, Onat, Harding, Cooke, & Konig, 2008). Moreover, this spatial bias depends on the saliency of the image presented without any sound, meaning that gaze behavior is the result of an audiovisual integration process. Yet, our study is quite different from these, since we used unspatialized soundtracks and dynamic stimuli. Other studies investigated the perception of audiovisual synchrony for complex events by presenting speech versus object-action video clips at a range of stimulus onset asynchronies (Vatakis & Spence, 2006). Participants were significantly better at judging the temporal order of the streams (auditory or visual) for the object actions than for the speech video clips, meaning that cross-modal temporal discrimination performance is better for audiovisual stimuli of lower complexity than for stimuli having continuously varying properties. Indeed, the authors argued that since speech presents a fine temporal correlation between sound and vision (phoneme and viseme), judging temporal order in audiovisual speech may be more difficult than for abrupt noises like moving-object sounds (Vroomen & Stekelenburg, 2011). Thus, audiovisual integration seems to be linked to the abrupt or slowly changing nature of the audiovisual component signals, and to their correlation. That is why we chose to investigate how visual exploration is influenced by unrelated speech soundtracks (is speech special?), sounds of moving objects (abrupt sound onsets), and landscape sounds (slowly varying components).

Surprisingly, we found no difference between the three Nonoriginal auditory conditions, whether in terms of dispersion between eye positions, saccade amplitudes, fixation durations, scanpath comparisons, fixation ratios in faces (mute or talking), or the weights of any of the features computed by the EM algorithm. A reason for this absence of effect might be found in the notion of audiovisual binding.


A classical view of audiovisual integration is that audio and visual streams are separately processed before interaction automatically occurs, leading to an integrated percept (Calvert, Spence, & Stein, 2004). Other studies suggested that audiovisual fusion could be conceived as a two-stage process, beginning by binding together pieces of audio and video that present a certain amount of spatio-temporal correlation, before the actual integration (Berthommier, 2004). A recent study reinforced this idea by showing that it is possible to unbind visual and auditory streams (Nahorna, Berthommier, & Schwartz, 2012). To do so, the authors used the McGurk effect as a marker of audiovisual integration: The more it occurs, the more visual and auditory information the participants integrate. Results showed that if a given McGurk stimulus (visual /ga/ dubbed onto an acoustic /ba/) is preceded by an incoherent audiovisual context, the amount of McGurk effect (perception of /da/; McGurk & MacDonald, 1976), and thus the audiovisual integration, is largely reduced. The authors showed that even a very short incoherent audiovisual context (one syllable) is enough to cause unbinding.

In our study, there might be no difference in gaze behavior between Nonoriginal auditory conditions simply because unrelated speech, object noise, and landscape sound soundtracks are not temporally correlated enough with the visual information to pass through the binding stage, preventing any further integration. In the three Nonoriginal auditory conditions, observers might just filter out the unbound audio information and focus on the sole visual stream. Thus, any unrelated soundtrack, or no soundtrack at all, might result in the same gaze behavior, driven only by visual information. This interpretation is confirmed by the results of two recent papers that compared the gaze behavior of participants watching videos with or without their original soundtrack (Coutrot et al., 2012; Foulsham & Sanderson, 2013). Foulsham and Sanderson (2013) used dynamic conversations as stimuli and found higher dispersion between eye positions and larger saccade amplitudes in the visual condition than in the audiovisual condition, which is coherent with our previous results (Coutrot et al., 2012). In fact, we also found higher dispersion in the visual condition than in audiovisual conditions, but without larger saccade amplitudes. Since we used various videos as stimuli (not specifically involving faces), these results corroborate the idea developed at the beginning of this Discussion: that the presence of faces induces an intraface and interface type of saccade. As explained, removing the original soundtrack increases the inter/intraface saccade ratio, resulting in an increase in saccade amplitude. On the contrary, when the visual scenes do not involve faces, removing the original soundtrack yields smaller saccades: Observers might become less active and make fewer goal-directed saccades.

It is interesting to note that this binding phenomenon has been understood and used by filmmakers for a long time. For instance, the French composer and film theorist Michel Chion (1994, p. 40) denies the very notion of a soundtrack as a coherent unity:

By stating that there is no soundtrack I mean first of all that the sounds of a film, taken separately from the image, do not form an internally coherent entity on equal footing with the image track. Second, I mean that each audio element enters into simultaneous vertical relationship with narrative elements contained in the image (characters, actions) and visual elements of texture and setting. These relationships are much more direct and salient than any relations the audio element could have with other sounds. It's like a recipe: Even if you mix the audio ingredients separately before pouring them into the image, a chemical reaction will occur to separate out the sounds and make each react on its own with the field of vision.

Chion (1994), Nahorna et al. (2012), and this study agree on the necessity for sound to "enter into simultaneous vertical relationship" (i.e., to be correlated) with visual information so as to be bound and integrated with it, or, using Chion's words, to "react" with it.

Toward an audiovisual saliency model

In many situations, low-level visual saliency models fail to predict fixation locations (Tatler et al., 2011). For scenes involving semantically interesting regions (Nystrom & Holmqvist, 2008; Rudoy, Goldman, Shechtman, & Zelnik-Manor, 2013) and faces (Birmingham & Kingstone, 2009), it has been shown that high-level factors override low-level factors to guide gaze. In this paper, we modeled the probability distribution of eye positions across each video with the EM algorithm, a statistical method allowing the contribution of different gaze-guiding features such as faces, low-level visual saliency, and center bias to be separated and quantified. Regardless of the auditory condition, the weight associated with faces exceeded by far the weight associated with any other feature. We found that the weight of low-level saliency is at the same level as center bias or chance. This supports the idea that low-level factors are not pertinent to explain gaze behavior when faces are present, and extends it to dynamic scenes. We also found that even if the related speech enhances talking faces' weight by 10%, gaze is mostly driven toward talking faces by visual information. Indeed, even with unrelated auditory information, the talking face weight is still twice the mute faces' weight.
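
To make this analysis concrete, the following is a minimal sketch of the kind of mixture fitting involved, in the spirit of Vincent et al. (2009): the eye positions recorded on one frame are modeled as draws from a weighted sum of normalized feature maps (e.g., static saliency, dynamic saliency, faces, center bias), and EM estimates the mixture weights. The function name and the fixed-component formulation are ours for illustration and do not reproduce the exact implementation used here.

    import numpy as np

    def em_feature_weights(eye_positions, feature_maps, n_iter=50):
        # eye_positions: (N, 2) integer array of (row, col) gaze points.
        # feature_maps:  list of K non-negative (H, W) maps.
        # Returns the K mixture weights (they sum to 1).
        maps = [m / m.sum() for m in feature_maps]   # maps as probability distributions
        K = len(maps)
        w = np.full(K, 1.0 / K)                      # start from uniform weights
        rows, cols = eye_positions[:, 0], eye_positions[:, 1]
        lik = np.stack([m[rows, cols] for m in maps], axis=1) + 1e-12  # (N, K) likelihoods
        for _ in range(n_iter):
            resp = w * lik                           # E-step: responsibility of each map
            resp /= resp.sum(axis=1, keepdims=True)  # for each eye position
            w = resp.mean(axis=0)                    # M-step: update the mixture weights
        return w

The resulting weight of each map quantifies how much that feature contributes to explaining the recorded eye positions, which is how the feature comparisons reported above can be read.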

Thus, in addition to already existing face detectors (Cerf et al., 2008; Marat et al., 2013), future audiovisual saliency models should include visual or audiovisual speaker diarization algorithms. Distinguishing silence from speech situations, and identifying the location of the active speaker in the latter case, remains a challenge, particularly in ecological (and thus noisy) environments. Yet many recent studies try to address this issue, for instance by exploiting the coherence between the speech acoustic signal and the speaker's lip movements (Blauth, Minotto, Jung, Lee, & Kalker, 2012; Noulas, Englebienne, & Krose, 2012; see Anguera et al., 2012, for a review).
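
As a toy example of the audiovisual cue such algorithms exploit, the sketch below picks the active speaker by correlating the audio amplitude envelope with the motion energy measured in each face's mouth region; the per-frame representation and the 0.3 decision threshold are illustrative assumptions, not a description of the cited diarization systems.

    import numpy as np

    def active_speaker(audio_envelope, lip_motion, threshold=0.3):
        # audio_envelope: (T,) audio energy per video frame.
        # lip_motion:     (F, T) motion energy in each face's mouth region.
        # Returns the index of the best-correlated face, or None if no
        # correlation is high enough (treated as a silence situation).
        audio = np.asarray(audio_envelope, dtype=float)
        corrs = [np.corrcoef(audio, np.asarray(m, dtype=float))[0, 1]
                 for m in lip_motion]
        best = int(np.argmax(corrs))
        return best if corrs[best] > threshold else None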

To sum up, to predict eye positions made while viewing conversation scenes, we think that future saliency models should detect talking and silent faces. If the scene comes with its related speech soundtrack, 50% of the total saliency should be attributed to talking faces and 20% to mute faces. The remainder should be shared between center bias (mainly during the first five frames) and low-level saliency. If the scene comes with an unrelated soundtrack, the weight of talking faces should be slightly lowered to the benefit of the other features. Nevertheless, talking faces should remain the most attractive feature.
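
A minimal sketch of such a combination is given below, assuming each feature is available as an (H, W) map normalized to sum to one. The 50%/20% split follows the proposal above; the equal split of the remainder between center bias and low-level saliency, and the down-weighted values used for the unrelated-soundtrack case, are illustrative assumptions only.

    import numpy as np

    def audiovisual_master_map(talking_faces, mute_faces, center_bias,
                               low_level, related_speech=True):
        # All inputs are (H, W) maps normalized to sum to 1.
        if related_speech:
            w_talk, w_mute, w_center, w_low = 0.50, 0.20, 0.15, 0.15
        else:
            # Unrelated soundtrack: talking faces slightly down-weighted
            # but still dominant (values chosen for illustration).
            w_talk, w_mute, w_center, w_low = 0.40, 0.25, 0.175, 0.175
        master = (w_talk * talking_faces + w_mute * mute_faces
                  + w_center * center_bias + w_low * low_level)
        return master / master.sum()

In a full model, the weight given to the center bias would additionally be boosted during the first frames, as discussed above.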

Conclusion

We find that when viewing ecological conversations in complex natural environments, participants look more at faces in general and at talking faces in particular, regardless of the auditory information. This result suggests that although auditory information does influence viewers' gaze, visual information is still the leading factor. We do not find any difference between the different types of unrelated soundtracks (unrelated speech, abrupt moving-object noises, and continuous landscape sounds). We hypothesize that unrelated soundtracks are not correlated enough with the visual information to be bound to it, preventing any further integration. However, hearing the original speech soundtrack makes participants follow the speech turn-taking more closely. This behavior increases the number of small intraface saccades and reduces the variability between eye positions. Using a statistical method, we quantify the propensity of several classical visual features to drive gaze (faces, center bias, and static and dynamic low-level saliencies). We find that classical low-level saliency globally fails to predict eye positions, whereas faces (and especially talking faces) are good predictors. Therefore, we suggest the joint use of face detection and speaker diarization algorithms to distinguish talking from mute faces and label them with appropriate weights.

Keywords: faces, speech, gaze, scanpath, saliency, audiovisual integration, expectation-maximization, database

Acknowledgments

The authors would like to thank Jean-Luc Schwartz, Jonas Chatel-Goldman, and two anonymous referees for their enlightening comments on the manuscript.

Commercial relationships: none.
Corresponding author: Antoine Coutrot.
Email: [email protected]
Address: Gipsa-lab, CNRS & Grenoble-Alpes University.

References

Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O. (2012). Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, & Language Processing, 20(2), 356–370.

Bahill, T., Adler, D., & Stark, L. (1975, June). Most naturally occurring human saccades have magnitudes of 15 degrees or less. Investigative Ophthalmology & Visual Science, 14(6), 468–469, http://www.iovs.org/content/14/6/468. [PubMed] [Article]

Bailly, G., Perrier, P., & Vatikiotis-Bateson, E. (2012). Audiovisual speech processing. Cambridge, UK: Cambridge University Press.

Bailly, G., Raidt, S., & Elisei, F. (2010). Gaze, conversational agents, and face-to-face communication. Speech Communication, 52, 598–612.

Berthommier, F. (2004). A phonetically neutral model of the low-level audiovisual interaction. Speech Communication, 44(1–4), 31–41.

Bertolino, P. (2012). Sensarea: An authoring tool to create accurate clickable videos. In 10th workshop on content-based multimedia indexing (pp. 1–4). Annecy, France.

Bindemann, M., Burton, A. M., Hooge, I. T. C., Jenkins, R., & de Haan, E. H. F. (2005). Faces retain attention. Psychonomic Bulletin & Review, 12(6), 1048–1053.

Bindemann, M., Burton, A. M., Langton, S. R. H., Schweinberger, S. R., & Doherty, M. J. (2007). The control of attention to faces. Journal of Vision, 7(10):15, 1–8, http://www.journalofvision.org/content/7/10/15, doi:10.1167/7.10.15. [PubMed] [Article]

Birmingham, E., & Kingstone, A. (2009). Saliency does not account for fixations to eyes within social scenes. Vision Research, 49, 2992–3000.

Blauth, D. A., Minotto, V. P., Jung, C. R., Lee, B., & Kalker, T. (2012). Voice activity detection and speaker localization using audiovisual cues. Pattern Recognition Letters, 33(4), 373–380.

Boltz, M. G. (2004). The cognitive processing of film and musical soundtracks. Memory & Cognition, 32(7), 1194–1205.

Boremanse, A., Norcia, A., & Rossion, B. (2013). An objective signature for visual binding of face parts in the human brain. Journal of Vision, 13(11):6, 1–18, http://www.journalofvision.org/content/13/11/6, doi:10.1167/13.11.6. [PubMed] [Article]

Borji, A., & Itti, L. (2012). State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 185–207.

Branigan, E. (2010). Soundtrack in mind. Projections, 4(1), 41–67.

Buchan, J. N., Pare, M., & Munhall, K. G. (2007). Spatial statistics of gaze fixations during dynamic face processing. Social Neuroscience, 2(1), 1–13.

Buswell, G. T. (1935). How people look at pictures: A study of the psychology of perception in art. Chicago: University of Chicago Press.

Calvert, G., Spence, C., & Stein, B. E. (2004). Handbook of multisensory processes. Cambridge, MA: MIT Press.

Carpenter, R. H. S. (1988). Movements of the eyes (2nd rev. & enlarged ed.). London, England: Pion Limited.

Cerf, M., Harel, J., Einhäuser, W., & Koch, C. (2008). Predicting human gaze using low-level saliency combined with face detection. In Advances in neural information processing systems (pp. 241–248).

Chion, M. (1994). Audio-vision: Sound on screen. New York: Columbia University Press.

Cohen, A. J. (2005). How music influences the interpretation of film and video: Approaches from experimental psychology. In R. A. Kendall & R. W. H. Savage (Eds.), Selected reports in ethnomusicology: Perspectives in systematic musicology (pp. 15–36). Los Angeles: Department of Ethnomusicology, University of California.

Coutrot, A., & Guyader, N. (2013). Toward the introduction of auditory information in dynamic visual attention models. In IEEE international workshop on image analysis for multimedia interactive services (WIAMIS) (pp. 1–4). Paris, France.

Coutrot, A., Guyader, N., Ionescu, G., & Caplier, A. (2012). Influence of soundtrack on eye movements during video exploration. Journal of Eye Movement Research, 5(4), 1–10.

Crouzet, S. M., Kirchner, H., & Thorpe, S. J. (2010). Fast saccades toward faces: Face detection in just 100 ms. Journal of Vision, 10(4):16, 1–17, http://www.journalofvision.org/content/10/4/16, doi:10.1167/10.4.16. [PubMed] [Article]

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–38.

Farah, M. J., Wilson, K. D., Drain, M., & Tanaka, J. N. (1998). What is "special" about face perception? Psychological Review, 105(3), 482–498.

Foulsham, T., Cheng, J. T., Tracy, J. L., Henrich, J., & Kingstone, A. (2010). Gaze allocation in a dynamic situation: Effects of social status and speaking. Cognition, 117(3), 319–331.

Foulsham, T., & Sanderson, L. A. (2013). Look who's talking? Sound changes gaze behaviour in a dynamic social scene. Visual Cognition, 21(7), 922–944.

Frank, M. C., Vul, E., & Johnson, S. P. (2009). Development of infants' attention to faces during the first year. Cognition, 110, 160–170.

Gautier, J., & Le Meur, O. (2012). A time-dependent saliency model combining center and depth biases for 2D and 3D viewing conditions. Cognitive Computation, 4, 1–16.

Haxby, J. V., Hoffman, E. A., & Gobbini, M. I. (2000). The distributed human neural system for face perception. Trends in Cognitive Sciences, 4(6), 223–233.

Hershler, O., & Hochstein, S. (2005). At first sight: A high-level pop out effect for faces. Vision Research, 45, 1707–1724.

Hirvenkari, L., Ruusuvori, J., Saarinen, V.-M., Kivioja, M., Perakyla, A., & Hari, R. (2013). Influence of turn-taking in a two-person conversation on the gaze of a viewer. PLoS ONE, 8(8), 1–6.

Ho-Phuoc, T., Guyader, N., & Guerin-Dugue, A. (2010). A functional and statistical bottom-up saliency model to reveal the relative contributions of low-level visual guiding factors. Cognitive Computation, 2(4), 344–359.

Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489–1506.

Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.

Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform face area: A module in human extrastriate cortex specialized for face perception. Journal of Neuroscience, 17(11), 4302–4311.

Kayser, C., Petkov, C. I., Lippert, M., & Logothetis, N. K. (2005). Mechanisms for allocating auditory attention: An auditory saliency map. Current Biology, 15, 1943–1947.

Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.

Lansing, C. R., & McConkie, G. W. (2003). Word identification and eye fixation locations in visual and visual-plus-auditory presentations of spoken sentences. Perception & Psychophysics, 65(4), 536–552.

Le Meur, O., & Baccino, T. (2013). Methods for comparing scanpaths and saliency maps: Strengths and weaknesses. Behavior Research Methods, 45(1), 251–266.

Le Meur, O., Le Callet, P., & Barba, D. (2007). Predicting visual fixations on video based on low-level visual features. Vision Research, 47, 2483–2498.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.

Marat, S., Ho-Phuoc, T., Granjon, L., Guyader, N., Pellerin, D., & Guerin-Dugue, A. (2009). Modelling spatio-temporal saliency to predict gaze direction for short videos. International Journal of Computer Vision, 82(3), 231–243.

Marat, S., Rahman, A., Pellerin, D., Guyader, N., & Houzet, D. (2013). Improving visual saliency by adding 'face feature map' and 'center bias'. Cognitive Computation, 5(1), 63–75.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.

Mital, P. K., Smith, T. J., Hill, R. L., & Henderson, J. M. (2010). Clustering of gaze during dynamic scene viewing is predicted by motion. Cognitive Computation, 3(1), 5–24.

Nahorna, O., Berthommier, F., & Schwartz, J.-L. (2012). Binding and unbinding the auditory and visual streams in the McGurk effect. Journal of the Acoustical Society of America, 132(2), 1061–1077.

Noulas, A. K., Englebienne, G., & Krose, B. J. A. (2012). Multimodal speaker diarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1), 79–93.

Nystrom, M., & Holmqvist, K. (2008). Semantic override of low-level features in image viewing – both initially and overall. Journal of Eye Movement Research, 2(2), 1–11.

Onat, S., Libertus, K., & Konig, P. (2007). Integrating audiovisual information for the control of overt attention. Journal of Vision, 7(10):11, 1–16, http://www.journalofvision.org/content/7/10/11, doi:10.1167/7.10.11. [PubMed] [Article]

Pannasch, S., Helmert, J. R., Herbold, A.-K., Roth, K., & Henrik, W. (2008). Visual fixation durations and saccade amplitudes: Shifting relationship in a variety of conditions. Journal of Eye Movement Research, 2(4), 1–19.

Quigley, C., Onat, S., Harding, S., Cooke, M., & Konig, P. (2008). Audio-visual integration during overt visual attention. Journal of Eye Movement Research, 1(2), 1–17.

Richardson, D., Dale, R., & Shockley, K. (2008). Synchrony and swing in conversation: Coordination, temporal dynamics, and communication. In I. Wachsmuth, M. Lenzen, & G. Knoblich (Eds.), Embodied communication (pp. 75–94). New York: Oxford University Press.

Rudoy, D., Goldman, D. B., Shechtman, E., & Zelnik-Manor, L. (2013). Learning video saliency from human gaze using candidate selection. In Conference on computer vision and pattern recognition (pp. 4321–4328). Portland, OR.

Schwartz, J.-L., Robert-Ribes, J., & Escudier, P. (1998). Ten years after Summerfield: A taxonomy of models of audiovisual fusion in speech perception. In R. Campbell, B. Dodd, & D. K. Burnham (Eds.), Hearing by eye II: Advances in the psychology of speechreading and auditory-visual speech (pp. 85–108). Hove, UK: Psychology Press.

Smith, T. J., & Mital, P. K. (2013). Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes. Journal of Vision, 13(8):16, 1–24, http://www.journalofvision.org/content/13/8/16, doi:10.1167/13.8.16. [PubMed] [Article]

Song, G., Pellerin, D., & Granjon, L. (2013). Different types of sounds influence gaze differently in videos. Journal of Eye Movement Research, 6(4), 1–13.

Summerfield, Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lipreading (pp. 3–51). New York: Lawrence Erlbaum.

Tatler, B. W. (2007). The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision, 7(14):4, 1–17, http://www.journalofvision.org/content/7/14/4, doi:10.1167/7.14.4. [PubMed] [Article]

Tatler, B. W., Baddeley, R. J., & Vincent, B. T. (2006). The long and the short of it: Spatial statistics at fixation vary with saccade amplitude and task. Vision Research, 46, 1857–1862.

Tatler, B. W., Hayhoe, M. M., Land, M. F., & Ballard, D. H. (2011). Eye guidance in natural vision: Reinterpreting salience. Journal of Vision, 11(5):5, 1–23, http://www.journalofvision.org/content/11/5/5, doi:10.1167/11.5.5. [PubMed] [Article]

Theeuwes, J., & Van der Stigchel, S. (2006). Faces capture attention: Evidence from inhibition of return. Visual Cognition, 13(6), 657–665.

Treisman, A. M., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12, 97–136.

Tseng, P.-H., Carmi, R., Cameron, I. G. M., Munoz, D. P., & Itti, L. (2009). Quantifying center bias of observers in free viewing of dynamic natural scenes. Journal of Vision, 9(7):4, 1–16, http://www.journalofvision.org/content/9/7/4, doi:10.1167/9.7.4. [PubMed] [Article]

Vatakis, A., & Spence, C. (2006). Audiovisual synchrony perception for music, speech, and object actions. Brain Research, 1111, 134–142.

Vatikiotis-Bateson, E., Eigsti, I.-M., Yano, S., & Munhall, K. G. (1998). Eye movement of perceivers during audiovisual speech perception. Perception & Psychophysics, 60(6), 926–940.

Vincent, B. T., Baddeley, R. J., Correani, A., Troscianko, T., & Leonards, U. (2009). Do we look at lights? Using mixture modelling to distinguish between low- and high-level factors in natural image viewing. Visual Cognition, 17(6–7), 856–879.

Vo, M. L. H., Smith, T. J., Mital, P. K., & Henderson, J. M. (2012). Do the eyes really have it? Dynamic allocation of attention when viewing moving faces. Journal of Vision, 12(13):3, 1–14, http://www.journalofvision.org/content/12/13/3, doi:10.1167/12.13.3. [PubMed] [Article]

Vroomen, J., & Stekelenburg, J. J. (2011). Perception of intersensory synchrony in audiovisual speech: Not that special. Cognition, 118(1), 75–83.

Yang, Q., Bucci, M. P., & Kapoula, Z. (2002). The latency of saccades, vergence, and combined eye movements in children and in adults. Investigative Ophthalmology & Visual Science, 43(9), 2939–2949, http://www.iovs.org/content/43/9/2939. [PubMed] [Article]

Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum.

Zeppelzauer, M., Mitrovic, D., & Breiteneder, C. (2011). Cross-modal analysis of audio-visual film montage. In International conference on computer communications and networks (pp. 1–6). Maui, Hawaii.
