
Looking Beyond a Clever Narrative: Visual Context and Attention are Primary Drivers of Affect in Video Advertisements

Abhinav Shukla
International Institute of Information Technology, Hyderabad, India
[email protected]

Harish Katti
Centre for Neuroscience, Indian Institute of Science, Bangalore, India
[email protected]

Mohan Kankanhalli
School of Computing, National University of Singapore, Singapore, Singapore
[email protected]

Ramanathan Subramanian
Advanced Digital Sciences Center, University of Illinois, Singapore, Singapore
[email protected]

ABSTRACT
Emotion evoked by an advertisement plays a key role in influencing brand recall and eventual consumer choices. Automatic ad affect recognition has several useful applications. However, the use of content-based feature representations does not give insights into how affect is modulated by aspects such as the ad scene setting, salient object attributes and their interactions. Neither do such approaches inform us on how humans prioritize visual information for ad understanding. Our work addresses these lacunae by decomposing video content into detected objects, coarse scene structure, object statistics and actively attended objects identified via eye-gaze. We measure the importance of each of these information channels by systematically incorporating related information into ad affect prediction models. Contrary to the popular notion that ad affect hinges on the narrative and the clever use of linguistic and social cues, we find that actively attended objects and the coarse scene structure better encode affective information as compared to individual scene objects or conspicuous background elements.

KEYWORDS
Affect Analysis; Computer Vision; Visual Attention; Eye Tracking; Scene Context; Valence; Arousal

1 INTRODUCTION
We encounter video-based advertising on a daily basis on television and websites like YouTube®. Advertising is a multi-billion dollar industry worldwide that seeks to capture user attention and maximize short- and long-term brand recall, ad monetization and click-based user interactions [12]. A prevalent notion in advertisement (ad) design [24] is that the ad narrative is key [56]. From an analysis viewpoint, narrative analytics would require one to parse out the story line, and subtle forms of communication such as motivation [21], sarcasm, humor and sentiments in an ad [24]. Such data often do not accompany video content in popular annotated databases [5], and are often difficult to analyze reliably.

Contrary to this view, advertisement evaluation and impact [7, 25] have been attributed to evoked emotions of pleasantness (valence) and arousal [53].

Accepted for publication in the Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA. DOI: 10.1145/3242969.3242988.

Past research has also shown that affect-based video ad analytics compares favorably [58] to visual similarity and text-based analysis [38]. Affect-based content analysis works well in combination with behavioural read-outs of cognitive load and arousal [27] as well as user attention [48]. Computational work on affect-based ad analysis has shown that analyzing even 10-second segments can give meaningful predictions of induced emotional states [47, 48, 58]. Overall, affect-based analysis appears much more robust than semantics-based approaches, which would require the narrative of nearly the entire ad for a reliable evaluation to be made.

Ad affect prediction becomes important in the context of computational advertising, as online ads are often interrupted or skipped by users because they bear little similarity/relevance to the streamed videos within which they are embedded. Despite the success of content-based affect prediction, there is little or no understanding of which aspects of ad video information best inform the induced valence and arousal. For example, is the setting or mood of an advertisement more important than objects that are part of the story line? Are all objects equally important? Lack of knowledge on these questions is a conceptual bottleneck to designing principled methods for video advertisement analysis.

In this paper, we set out to address this exact question by parceling out video advertisement content into independent channels capturing various aspects of the scene content. Exemplar channels include (i) the coarse scene structure that captures the overall mood or ambiance of the video scene, (ii) the collection of (detected) scene objects, (iii) scene objects that are interesting from the user's perspective, and (iv) features encoding the list of scene objects and their co-occurrence (object context). Scene objects capturing overt visual attention during a given time interval in the video [30] are inferred from eye-gaze recordings of users viewing the ad videos.

These channels have the added advantage that they are easily interpretable, and insights derived therefrom can be incorporated into computational frameworks as well as into ad design. This method of parceling media content into distinct channels is inspired by recent approaches in visual neuroscience [15, 28, 29] that examine various behavioral and neural read-outs of human perception. We then train predictive models for valence and arousal upon processing each of these visual channels. Analysis of these models not only conveys the effectiveness of each channel in predicting affect; our experiments also reveal the unexpected and critical role played by the coarse-grained scene structure in conveying valence and arousal.


The main contributions of our paper are:

(1) We demonstrate a novel approach for advertisement video analysis by decomposing video information into multiple, independent information channels.

(2) We show that a model incorporating information from both the user and the scene content best predicts the ad emotion.

(3) We bring forth the critical role played by the coarse scene structure towards conveying the ad affect.

(4) We crystallize insights from our analysis to propose rules suitable for the design and creation of video ads, and for computational advertising frameworks.

Our paper is organized as follows. Section 2 discusses related work and establishes the novelty of our work with respect to existing literature. Section 3 describes the ad dataset used in our experiments. Section 4 describes the various video channels created from the original ad content, while Section 5 discusses experiments and related findings. Section 6 discusses the implications of our results, and Section 7 concludes the paper.

2 RELATED WORK
To position our work with respect to existing literature and to establish its novelty, we examine related work on: (i) Advertisements and Emotions, (ii) Visual Content Analysis (especially for affect recognition and understanding scene structure and semantics) and (iii) Human Centered Affect Recognition.

2.1 Advertisements and Emotions
Advertisements are inherently emotional, and have long been shown to modulate user behaviour towards products in response to the emotions they induce [18, 19, 42]. Holbrook and Batra [18] observe correlations between emotions induced by ads and users' brand attitude. Pham et al. [42] note that ad-evoked feelings impact users both explicitly and implicitly. Recent work has aimed at exploiting the affective information present in ads for computational advertising, i.e., inserting the most emotionally relevant ads at scene transition points within programs. Shukla et al. [47, 48] show that improved audiovisual affect recognition can enhance ad recall and improve viewer experience when inserting ads in videos. Products deployed in the industry, such as Creative Testing and Media Planning by RealEyes [1], provide second-by-second emotion traces and predict how people react to media content, thereby revealing what works and where the pitfalls lie. These methods exploit the emotional content and structure of advertisements to perform precise ad characterization, which underlines the importance of emotions in ads.

2.2 Visual Content Analysis for Affect Recognition

A very important aspect of our work is understanding the importance of various visual scene elements for affect prediction. Classical works in visual affect recognition such as [16, 17, 59] focus on low-level handcrafted features such as motion activity, color histograms, etc. Recent video analytic works have leveraged the advances in deep learning [8, 47]. Combined with the breakthroughs in deep learning since AlexNet [34], the Places database [60] enables training of deep learning models for scene attribute recognition on a large scale. There have also been notable community challenges [10, 46] that focus especially on affect recognition from visual content analysis in various settings. Although all of these works focus on affect prediction from content, they do not attempt to measure the affective content encoded by the various scene elements. Attempts to decompose visual information and study user behavioural measures have arisen lately. Katti et al. [28, 29] have explored human response times for object detection and human beliefs about object presence from various kinds of visual information. Groen et al. [15] have evaluated the distinct contributions of functional and deep neural network features to representational similarity of scenes in human brain and behaviour. Karpathy et al. [26] utilize a foveated stream and a low-resolution context stream for large-scale video classification. Xu et al. [57] utilize a neural network to model human visual attention (computerized gaze) for image caption generation. Fan et al. [11] use eye tracking as well as saliency modeling to study how image sentiment and visual attention are linked. Our work builds on the foundations laid by [11, 26, 57] and specifically analyzes the contributions of various scene elements and channels towards capturing affective information in the confined field of advertisement videos.

2.3 Human Centered Affect Recognition
Human physiological signals have been widely employed for implicit tagging of multimedia content in recent years. Implicit tagging has been defined as using non-verbal spontaneous behavioral responses to find relevant tags for multimedia content [40]. These responses include Electroencephalography (EEG) and Magnetoencephalography (MEG) along with physiological responses such as heart rate and body temperature [2, 31, 48, 49, 54], and gazing behavior [13, 41, 44, 50]. Determining the affective dimensions of valence and arousal using these methods has been shown to be very effective. Affective attributes have also been shown to be useful for retrieval [52]. These methods have the advantage of not requiring additional human effort for annotation, and are based on the analysis of spontaneous user reactions or behavioral signals. We employ eye gaze as the modality of interest in our experiments.

2.4 Analysis of Related Work
On examining the related work, we find that: (1) even though emotions are very important in visual advertising, there have rarely been attempts to computationally analyze how they are represented in the scene structure and setting; (2) characterizing affective visual information has been achieved primarily via computational cues alone and not by human-driven cues; (3) despite the fact that decomposition of visual information has proved to be useful for determining user behavioural traits, there has been no such work in the context of affect recognition.

In this regard, we present the first work to study the contribution of different types of visual information to affect recognition, and to combine content with human-driven information from eye gaze. We evaluate the performance of features extracted from various information channels on an affective dataset of advertisements described below.

Figure 1: (left) Scatter plot of mean arousal, valence ratings obtained from 23 users. Color-coding is based on expert labels. (middle) Arousal and (right) valence score distributions (view under zoom).

3 DATASET
This section presents details regarding the ad dataset used in this study along with the protocol employed for collecting eye tracking data, which is used in some affect recognition experiments.

3.1 Dataset Description
Defining valence as the feeling of pleasantness/unpleasantness and arousal as the intensity of emotional feeling while viewing an advertisement, five experts compiled a dataset of 100 roughly 1-minute-long commercial advertisements (ads) in [47]. These ads are publicly available¹ and roughly uniformly distributed over the arousal–valence plane defined by Greenwald et al. [14] (Figure 1 (left)) as per the first impressions of 23 naive annotators. The authors of [47] chose the ads based on consensus among five experts on valence and arousal labels (either high (H)/low (L)). Labels provided by [47] were considered as ground-truth, and used for affect recognition experiments in this work.

To evaluate the effectiveness of these ads as affective control stimuli, we examined how consistently they could evoke target emotions across viewers. To this end, the ads were independently rated by 23 annotators for valence and arousal in this work². All ads were rated on a 5-point scale, which ranged from -2 (very unpleasant) to 2 (very pleasant) for valence and 0 (calm) to 4 (highly aroused) for arousal. Table 1 presents summary statistics for ads over the four quadrants. Evidently, low valence ads are longer and are perceived as more arousing than high valence ads, implying that they had a stronger emotional influence among viewers.

We computed agreement among raters in terms of (i) Krippendorff's α and (ii) Fleiss' κ scores. The α coefficient is applicable when multiple raters supply ordinal scores to items. We obtained α = 0.62 and 0.36 respectively for valence and arousal, implying that valence impressions were more consistent across raters. On a coarse-grained scale, we then computed the Fleiss κ agreement across annotators; the Fleiss Kappa statistic is a generalization of Cohen's Kappa, and is applicable when multiple raters assign categorical labels (high/low in our case) to a fixed number of items. Upon thresholding each rater's arousal and valence scores by their mean rating to assign high/low labels for each ad, we observed a Fleiss κ of 0.56 (moderate agreement) for valence and 0.27 (fair agreement) for arousal across raters.

¹ On video hosting websites such as YouTube.
² [47] employs 14 annotators.

Table 1: Summary statistics compiled from 23 users.

Quadrant               Mean length (s)   Mean arousal   Mean valence
H arousal, H valence   48.16             2.19           0.98
L arousal, H valence   44.18             1.47           0.89
L arousal, L valence   60.24             1.72           -0.74
H arousal, L valence   64.16             2.95           -1.10

Computing Fleiss κ upon thresholding each rater's scores with respect to the grand mean, we obtained a Fleiss κ of 0.64 (substantial agreement) for valence, and 0.30 (fair agreement) for arousal. Clearly, there is reasonable agreement among the novice raters on affective impressions, with considerably higher concordance for valence. These results led us to confirm that the 100 ads compiled by [47] are appropriate for affective studies.
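As a minimal illustration, the agreement statistics above can be computed from a raters-by-ads matrix of scores roughly as follows (a sketch assuming the third-party krippendorff package and statsmodels; the rating matrix here is a random placeholder rather than the actual annotations):

# Sketch: inter-rater agreement from a (raters x ads) matrix of ordinal scores.
import numpy as np
import krippendorff                                   # third-party package (assumed available)
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.random.randint(-2, 3, size=(23, 100))    # placeholder for the real valence ratings

# Krippendorff's alpha on the ordinal scores (rows = raters, columns = ads).
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement='ordinal')

# Fleiss' kappa on high/low labels obtained by thresholding each rater at their own mean.
high_low = (ratings > ratings.mean(axis=1, keepdims=True)).astype(int)
table, _ = aggregate_raters(high_low.T)               # rows = ads, columns = label categories
kappa = fleiss_kappa(table)
print(alpha, kappa)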

To examine if the arousal and valence dimensions are independent based on user ratings, we (i) examined scatter plots of the annotator ratings, and (ii) computed correlations amongst those ratings. The scatter plot of the mean arousal and valence annotator ratings, and the distributions of arousal and valence ratings, are presented in Fig. 1. The scatter plot is color-coded based on expert labels, and is more spread out than the classical 'C' shape observed with affective datasets [35], music videos [32] and movie clips [3]. Arousal and valence scores are spread across the rating scale, with a greater variance observable for valence ratings. Wilcoxon rank sum tests on annotator ratings revealed significantly different mean arousal ratings for high and low arousal ads (p < 0.0001), and distinctive mean valence scores for high and low valence ads (p < 0.0001), consistent with expectation.

Pearson correlation was computed between the arousal and valence dimensions by limiting the false discovery rate to within 5% [6]. This procedure revealed a weak and insignificant negative correlation of 0.16, implying that ad arousal and valence scores were largely uncorrelated. Overall, the observed results suggest that (a) the compiled ads evoked a wide range of emotions across the arousal and valence dimensions, and (b) there was considerable concordance among annotator impressions, especially for valence, and they provided distinctive ratings for emotionally dissimilar ads.
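These checks can be reproduced along the following lines (a minimal SciPy sketch with placeholder data; when several such tests are run, the false discovery rate control mentioned above would be applied over the resulting p-values):

# Sketch: rank-sum test between high/low arousal ads and arousal-valence correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mean_arousal = rng.uniform(0, 4, 100)        # placeholder per-ad mean arousal ratings
mean_valence = rng.uniform(-2, 2, 100)       # placeholder per-ad mean valence ratings
high_arousal = rng.random(100) > 0.5         # placeholder expert high/low arousal labels

stat, p = stats.ranksums(mean_arousal[high_arousal], mean_arousal[~high_arousal])
r, p_corr = stats.pearsonr(mean_arousal, mean_valence)
print(p, r)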

3.2 Eye Tracking Data Acquisition Protocol
As the annotators rated the ads for arousal and valence, we acquired their eye tracking data using the Tobii EyeX eye tracker. The eye tracker has a sampling rate of 60 Hz and an advertised operating range of 50–90 cm. It was placed in a fixed position under the screen on which the ads were presented. Before each recording session, the eye tracker was calibrated using its sample stimulus presentation for each subject. To minimize tracking errors and fatigue effects, we allowed each annotator to take comfort breaks after viewing 20 ads. Before they began with the next set of 20 ads, we recalibrated the eye tracker to ensure signal quality. We presented the videos on an LED screen with a resolution of 1366 × 768 pixels. Each ad was preceded by a 1 s fixation cross to orient user attention. We recorded a raw stream of the x and y pixel coordinates of the gaze, along with the gaze timestamps in milliseconds.

4 VISUAL CHANNELS FOR AFFECT RECOGNITION

This section discusses the various channels of visual information that we employed for affect recognition, including content-driven and human-driven (eye gaze-based) channels. For ad affect recognition, we represent the visual modality using representative frame images following [47]. We fine-tuned Places205 on the LIRIS-ACCEDE [5] dataset, and used the penultimate fully connected layer (FC7) descriptors for our analysis. From each video in the ad and LIRIS-ACCEDE datasets, we sample one frame every three seconds, which enables extraction of a continuous video profile for affect recognition. This process generated 1793 representative frames for the 100 advertisement videos.

4.1 Visual Channels and Features
A description of each visual channel and the type of feature employed for affect recognition follows. Each method title contains in parentheses an abbreviation for easier reference in subsequent sections.

4.1.1 Raw Frames (Video). The base channel for our analysis is the raw video frame. The visual content of the raw frames is the exact and complete information available to the viewer when an ad is presented. Theoretically, this representation should carry all the affective information that is meant to be conveyed via the visual modality; we therefore wanted to explore the utility of this (entire) information for affect recognition. Furthermore, we synthesized a number of visual channels derived from the raw frames, encoding different kinds of information. Figure 2 shows two example frames from an advertisement which serve as references to compare with the synthesized channels.

Figure 2: Example ad video frames.

4.1.2 Frames Blurred by a Gaussian filter with constant variance (Constant Blur). Prior studies have observed that scene emotion [51] and memorability [22] are more correlated with the scene gist than with finer details. In order to create a channel that mainly encodes coarse-grained scene structure while removing finer details, i.e., to primarily retain the scene gist, ambience or setting, we performed Gaussian blurring on the raw frames with a filter of width σ = 0.2w, where w denotes the video frame width. The resultant images lose many of the sharp details while retaining the coarse scene structure, which we hypothesize contains significant affective information. Figure 3 shows two exemplar frames corresponding to the Constant Blur channel for the raw frames in Fig. 2.
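As an illustration, the Constant Blur channel can be synthesized along the following lines (a minimal sketch assuming OpenCV; we read σ = 0.2w as the Gaussian standard deviation, and the file names are placeholders):

# Sketch of the Constant Blur channel: Gaussian blur with sigma = 0.2 * frame width.
import cv2

def constant_blur(frame):
    sigma = 0.2 * frame.shape[1]
    # ksize=(0, 0) lets OpenCV derive the kernel size from sigma; for very wide frames
    # the kernel becomes large, so a capped kernel size may be preferable in practice.
    return cv2.GaussianBlur(frame, (0, 0), sigmaX=sigma, sigmaY=sigma)

frame = cv2.imread('ad_frame.jpg')                    # placeholder representative frame
cv2.imwrite('ad_frame_constant_blur.jpg', constant_blur(frame))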

Figure 3: Example images for the Constant Blur channel.

4.1.3 Frames Repeatedly Gaussian Blurred until no objects are visually discernible (Adaptive Blur). This channel is synthesized so as to have no objects visually discernible to the human eye (so that only a 'very coarse' scene structure is retained). To synthesize this channel, we use a state-of-the-art object detector. The YOLOv3 [45] object detector is employed to automatically locate salient scene objects belonging to one of the 80 object categories of the MS-COCO [37] dataset. We run YOLOv3 on the frames on which Constant Blur has been applied, and examine whether objects are detected with a confidence greater than 0.25 (the default threshold of YOLOv3). We then repeatedly blur the video frames with the same parameters as in Constant Blur until no scene object is detected.

In practice, we needed no more than three repeated blurs with the chosen parameters to ensure zero object detections in any of the 1793 examined frames. Figure 4 illustrates the output of a second blurring on the frames shown in Figure 3. In both cases, one person was detected on the left side of the image after the first stage of blurring, but there were no detections following the second blur. Differences between the two pairs of images are best perceived under zoom.
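The adaptive blurring loop can be sketched as follows; yolo_detect is a hypothetical wrapper around any YOLOv3 implementation that returns detections above a confidence threshold, and the iteration cap is our own safeguard:

# Sketch of the Adaptive Blur channel: re-blur until the detector finds no objects.
import cv2

def adaptive_blur(frame, yolo_detect, conf_thresh=0.25, max_iters=5):
    blurred = frame
    for _ in range(max_iters):
        sigma = 0.2 * blurred.shape[1]
        blurred = cv2.GaussianBlur(blurred, (0, 0), sigmaX=sigma, sigmaY=sigma)
        if not yolo_detect(blurred, conf_thresh):     # empty list -> nothing detected
            break
    return blurred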

Figure 4: Example images for the Adaptive Blur channel

4.1.4 Crops of salient objects detected via YOLOv3 (Object Crops). A complementary argument to the scene gist influencing its emotion is that the scene affect is a function of the constituent objects and their interactions [44]. To measure the contribution of individual scene objects towards the scene affect, we attempted affect prediction from purely object information. To this end, upon running YOLOv3 [45] to perform object detection on the raw frames, we obtained bounding boxes of objects belonging to the 80 MS-COCO [37] categories (in some instances we get no detections, e.g., for credit or text-only screens or those with uncommon object classes). We then cropped these bounding boxes and fed them independently as inputs to the affect recognition model. Figure 5 shows a set of Object Crops for the two reference examples.
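The cropping step can be sketched as below, again with yolo_detect as a hypothetical wrapper returning (x, y, w, h) bounding boxes for the MS-COCO classes:

# Sketch of the Object Crops channel: each detected box becomes a separate input sample.
def object_crops(frame, yolo_detect, conf_thresh=0.25):
    crops = []
    for (x, y, w, h) in yolo_detect(frame, conf_thresh):
        crops.append(frame[y:y + h, x:x + w])
    return crops                                      # may be empty, e.g. for text-only screens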

Figure 5: Example outputs for the Object Crops channel. Each object crop is fed independently to the affect recognition system.

4.1.5 Removal of non-object regions from the frames (Object Retained). As an alternative to Object Crops, where object information is supplied individually to the affect recognition model, we synthesized another channel where the video frame is composed of multiple (detectable) objects. To this end, we performed object detection as with Object Crops, but instead of using each crop as a separate sample, we blackened (set to 0) those video frame pixels that do not correspond to object detections. This process essentially removes all non-object regions from the video frames.
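A sketch of this masking operation, under the same assumptions as above:

# Sketch of the Object Retained channel: blacken all pixels outside detected boxes.
import numpy as np

def object_retained(frame, yolo_detect, conf_thresh=0.25):
    mask = np.zeros(frame.shape[:2], dtype=bool)
    for (x, y, w, h) in yolo_detect(frame, conf_thresh):
        mask[y:y + h, x:x + w] = True
    out = frame.copy()
    out[~mask] = 0                                    # non-object regions set to black
    return out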

4.1.6 ImageNet Softmax Posterior Probabilities (AlexNet FC8). The two prior channels are limited in the event of spurious detections or no detections at all. In order to have a channel that encodes object-related information and also serves as a generic visual descriptor, we used the FC8 layer feature vector of an AlexNet [34] network trained on the ImageNet dataset. The FC8 feature vector is 1000-dimensional, with each dimension corresponding to the softmax posterior probability of the input video frame containing a particular ImageNet class (out of 1000 possible classes). We extracted these AlexNet FC8 features from the raw frames and employed them for affect recognition. The salient aspect of the FC8 features is that they encode object-specific information as well as context in the form of co-occurrence.
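The following sketch shows how such a descriptor can be obtained; it uses torchvision's ImageNet-trained AlexNet as a stand-in for the original Caffe model, and the frame path is a placeholder:

# Sketch of the AlexNet FC8 descriptor: 1000-d softmax posterior over ImageNet classes.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open('ad_frame.jpg').convert('RGB')       # placeholder frame path
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))      # output of the final (FC8) layer
    fc8 = torch.softmax(logits, dim=1).squeeze(0)     # class posterior used as the feature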

4.1.7 Gist Features (Gist). We employed Gist features as defined in [39] as a baseline. These are a classical set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. These dimensions may be reliably estimated using spectral and coarsely localized information. The model generates a multidimensional space in which scenes sharing membership in semantic categories (e.g., streets, highways, coasts) are projected close together. We hypothesized a relationship between these dimensions and the affective dimensions of valence and arousal. We compute the Gist descriptor using 4 blocks and 8 orientations per scale.

4.1.8 Removal of non-gazed regions from video frames (Eye ROI). As discussed in Section 3.2, eye tracking data was recorded for the annotators as they viewed the ads. From the eye gaze data, we extracted heatmaps per video frame using the EyeMMV toolbox [33]. For the heatmap generation, we used an interval density/spacing of 200 pixels and a Gaussian kernel of size 5 with σ = 3. We normalized gaze frequencies to values between 0 and 255, where 255 represents the image region that was maximally gazed at. For each video frame, the gaze points are extracted from a time window of two seconds (from one second prior to frame onset to one second post frame presentation). Figure 6 shows the gaze heatmaps for the two example reference frames.
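A simplified stand-in for this heatmap computation (not the EyeMMV implementation itself) is sketched below; the gaze-sample format and screen size follow Section 3.2, while the remaining choices are assumptions:

# Sketch: accumulate gaze points from a +/- 1 s window, smooth, and rescale to 0-255.
import numpy as np
from scipy.ndimage import gaussian_filter

def gaze_heatmap(gaze_samples, t_frame_ms, width=1366, height=768, sigma=3.0):
    # gaze_samples: iterable of (timestamp_ms, x_px, y_px) pooled over annotators.
    acc = np.zeros((height, width), dtype=np.float32)
    for t, x, y in gaze_samples:
        if abs(t - t_frame_ms) <= 1000 and 0 <= int(y) < height and 0 <= int(x) < width:
            acc[int(y), int(x)] += 1.0
    acc = gaussian_filter(acc, sigma=sigma)
    if acc.max() > 0:
        acc = 255.0 * acc / acc.max()                 # 255 marks the maximally gazed region
    return acc.astype(np.uint8)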

Figure 6: Example eye gaze heatmaps, with warmer colours representing higher gazing frequency (or highly fixated image regions).

Based on these gaze heatmaps, we determined the actively attended image regions. To specifically zero in on the most fixated-upon region, we considered only the warmer pixels in the heatmap (yellow and red regions). This channel is similar to the Object Retained channel, but the area to be retained here is determined by human gaze. As with Object Retained, we blackened all non-fixated pixels in the video frames. Figure 7 shows two example outputs for this channel.

Figure 7: Example images for the Eye ROI channel

4.1.9 Combination of Gaze Regions with Blurred Context (Eye ROI + Context Blur). In this channel, we wanted to combine the human cues from eye gaze with the coarse-grained contextual information provided by the Constant Blur channel. The hypothesis behind this channel being useful for affect prediction is the following: it carries contextual information without the 'noise' induced by finer visual details from the non-salient image regions, and it also preserves the most relevant/attended image region as determined by the gaze data. In essence, this channel can be interpreted as a combination of the Constant Blur and the Eye ROI channels. We simply take the image from the Constant Blur channel and then copy the non-zero pixels from the Eye ROI channel to the corresponding locations. This yields a 'focus' image in which the salient region is clearly visible, while the surrounding context is encoded at low resolution (similar to foveated rendering in virtual reality applications). Figure 8 presents exemplar outputs for this channel.
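The two gaze-based frame channels can then be sketched as follows; the numeric threshold separating 'warm' from colder heatmap values is our assumption:

# Sketch of the Eye ROI and Eye ROI + Context Blur channels.
def eye_roi(frame, heatmap, warm_thresh=170):
    out = frame.copy()
    out[heatmap < warm_thresh] = 0                    # keep only actively attended pixels
    return out

def eye_roi_context_blur(frame, blurred_frame, heatmap, warm_thresh=170):
    out = blurred_frame.copy()                        # Constant Blur frame as the backdrop
    mask = heatmap >= warm_thresh
    out[mask] = frame[mask]                           # paste the high-resolution gazed region
    return out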

Figure 8: Example images for the Eye ROI + Context Blur channel

4.1.10 Eye Tracking Histogram Features (Eye Hist). We also considered purely eye gaze-based statistics as features for affect recognition, as proposed by [43]. Eye movements are composed of fixations and saccades. A fixation is defined as continuously gazing at a particular location for a specified minimum duration, while a saccade denotes the transition from one fixation to another. Using the EyeMMV toolbox [33], we extracted fixations with a threshold duration of 100 ms. From these fixations, we extracted various histograms for quantities such as saccade length (50-bin histogram spaced between minimum and maximum lengths), saccade slope (30-bin histogram spaced between minimum and maximum slopes), saccade duration (60-bin histogram spaced between minimum and maximum durations), saccade velocity (50-bin histogram spaced between minimum and maximum velocities), saccade orientation (36-bin histogram spaced at 10 degrees), fixation duration (60-bin histogram spaced between minimum and maximum durations) and spatial fixations (with patches/bins of size 20×40 spread over the whole screen). We then concatenated each of these features (per annotator) to form a single feature vector for each video. We then used this feature vector characterizing eye movement histograms for affect recognition.
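A minimal sketch of such histogram features is given below for a subset of the listed quantities; the fixation-tuple format is an assumption:

# Sketch of Eye Hist features: histograms of saccade/fixation statistics per annotator.
import numpy as np

def eye_hist(fixations):
    fx = np.asarray(fixations, dtype=float)           # rows: (x_px, y_px, duration_ms)
    dx, dy = np.diff(fx[:, 0]), np.diff(fx[:, 1])
    lengths = np.hypot(dx, dy)                        # saccade lengths
    orient = np.degrees(np.arctan2(dy, dx)) % 360.0   # saccade orientations
    feats = [
        np.histogram(lengths, bins=50)[0],            # 50-bin saccade length histogram
        np.histogram(orient, bins=36, range=(0, 360))[0],  # 36 bins of 10 degrees each
        np.histogram(fx[:, 2], bins=60)[0],           # 60-bin fixation duration histogram
    ]
    return np.concatenate(feats)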

4.2 Feature Extraction from Image Channels
This section describes the datasets and methods used to create affect recognition models from the different visual channels.

4.2.1 Datasets. Due to subjective variance in emotion perception, careful affective labeling is imperative for effectively learning content- and user-centric affective correlates [17, 55], which is why we analyze ads that evoked perfect consensus among experts. Of late, convolutional neural networks (CNNs) have become extremely popular for visual [34] and audio [20] recognition, but these models require large amounts of training data. Given the small size of our ad dataset, we fine-tune the pre-trained Places205 [60] model using the affective LIRIS-ACCEDE movie dataset [5], and employ this fine-tuned model for encoding ad emotions, a process known as domain adaptation or transfer learning in the machine learning literature.

To learn deep visual features characterizing ad affect, we employed the Places205 CNN [60] intended for image classification. Places205 is trained on the Places-205 dataset comprising 2.5 million images and 205 scene categories. The Places-205 dataset contains a wide variety of scenes with varying illumination, viewpoint and field of view, and we expected a strong correlation between scene perspective, lighting and the scene mood. LIRIS-ACCEDE contains arousal and valence ratings for ≈10 s long movie snippets, whereas the ads are about one minute long (ranging from 30–120 s).

4.2.2 CNN Training. For every channel described above, we synthesize the corresponding outputs from the LIRIS-ACCEDE dataset and use this large collection to fine-tune Places205 for the specific channel. For instance, to fine-tune for the Constant Blur channel, we create a large set of Constant Blur images from the approximately 10000 videos in LIRIS-ACCEDE, and use these images to fine-tune Places205. This model is subsequently used to extract FC7 features for the Constant Blur images synthesized from the ad dataset. This is not possible for the eye tracking-based channels because LIRIS-ACCEDE does not have eye tracking annotations. For channels involving eye tracking data (Eye ROI, Eye ROI + Context Blur), we use the model fine-tuned on the raw image data for feature extraction.

We use the Caffe [23] deep learning framework for fine-tuning Places205, with a momentum of 0.9, a weight decay of 0.0005, and a base learning rate of 0.0001 reduced by a factor of 10 every 20000 iterations. We train multiple binary affect prediction networks this way to recognize high and low arousal and valence from the FC7 features of each channel (excepting FC8 features and eye movement histograms, for which the features/histograms themselves are employed as features). To fine-tune Places205, we use only the top and bottom 1/3rd of LIRIS-ACCEDE videos in terms of arousal and valence rankings, under the assumption that descriptors learned from the extreme-rated clips will more effectively represent affective concepts than the neutral-rated clips. 4096-dimensional FC7 features extracted from the penultimate fully connected layer of the fine-tuned networks are used for ad affect recognition.
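A minimal sketch of this fine-tuning and feature-extraction step with pycaffe follows; the prototxt and model file names are placeholders, and the solver settings mirror the values quoted above:

# Sketch: fine-tune Places205 for one channel, then extract FC7 features for ad frames.
import numpy as np
import caffe

caffe.set_mode_gpu()

# Solver prototxt (placeholder path) carries: base_lr: 0.0001, momentum: 0.9,
# weight_decay: 0.0005, lr_policy: "step", gamma: 0.1, stepsize: 20000.
solver = caffe.SGDSolver('channel_solver.prototxt')
solver.net.copy_from('places205.caffemodel')          # initialize from pre-trained Places205
solver.solve()

net = caffe.Net('channel_deploy.prototxt', 'channel_finetuned.caffemodel', caffe.TEST)
net.blobs['data'].data[...] = np.zeros(net.blobs['data'].data.shape, dtype=np.float32)  # placeholder input
net.forward()
fc7 = net.blobs['fc7'].data[0].copy()                 # 4096-d descriptor used downstream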

5 EXPERIMENTS AND RESULTS
We first provide a brief description of the classifiers used and settings employed for binary affect prediction (i.e., an ad is predicted as invoking high or low valence/arousal). Ground truth labels are provided by the experts, and these labels are in agreement with the user ratings as described in Section 3.

Classifiers: We employed the Linear Discriminant Analysis (LDA), linear SVM (LSVM) and Radial Basis SVM (RSVM) classifiers in our affect recognition experiments. LDA and LSVM separate H/L-labeled training data with a hyperplane, while RSVM enables non-linear separation between the H and L classes via transformation onto a high-dimensional feature space. We used these three classifiers on input FC7 features from the fine-tuned networks described in the previous section. For AlexNet FC8 [34], we used the FC8 features denoting the output of an object classification network. Similarly, Eye Hist features were directly input to the LDA, linear SVM and radial basis SVM classifiers.

Metrics and Experimental Settings: We used the F1-score (F1), defined as the harmonic mean of precision and recall, as our performance metric, due to the unbalanced distribution of positive and negative samples. As we evaluate affect recognition performance on a small dataset, affect recognition results obtained over 10 repetitions of 5-fold cross validation (CV) (a total of 50 runs) are presented. CV is typically used to overcome the overfitting problem on small datasets, and the optimal SVM parameters are determined from the range [10⁻³, 10³] via an inner five-fold CV on the training set. Finally, in order to examine the temporal variance in affect recognition performance, we present F1-scores obtained over (a) all ad frames ('All'), (b) the last 30 s (L30) and (c) the last 10 s (L10) for affect recognition from all visual modalities.
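A minimal sketch of this evaluation protocol with scikit-learn is given below; the feature matrix and labels are random placeholders standing in for the per-ad FC7 descriptors and expert labels:

# Sketch: 10 x 5-fold CV with an inner 5-fold search for the SVM C parameter, F1-scored.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4096))                  # placeholder per-ad FC7 features
y = rng.integers(0, 2, 100)                           # placeholder high/low labels

inner = GridSearchCV(SVC(kernel='rbf'),
                     param_grid={'C': np.logspace(-3, 3, 7)},
                     cv=5, scoring='f1')
scores = []
for rep in range(10):                                 # 10 repetitions of 5-fold CV (50 runs)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
    scores.extend(cross_val_score(inner, X, y, cv=outer, scoring='f1'))
print(np.mean(scores), np.std(scores))                # reported as mu +/- sigma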

5.1 Results Overview
Table 2 contains the performance of each method for both valence and arousal recognition in all the experimental settings. The highest F1 score obtained over all the temporal windows is denoted in bold for each method. Based on the performance of the methods, we observe the following.

Focusing on all classifiers that operate on deep learning-based features, valence (peak F1 = 0.74) is generally recognized better than arousal (peak F1 = 0.73). The only two methods for which peak arousal recognition is better than valence are the Gist features (peak arousal F1 = 0.58, peak valence F1 = 0.57) and the Eye ROI method (peak valence F1 = 0.68, peak arousal F1 = 0.70). This shows that visual information-based modalities contain more information related to valence than arousal, which is consistent with expectation and with prior works [47, 48].

Relatively small F1-score variances are noted for the 'All' condition with the adopted cross validation procedure in Table 2, especially with FC7 features, indicating that the affect recognition results are minimally impacted by overfitting. The σ values for the 'L30' and 'L10' settings are uniformly and progressively greater for all methods involving deep features.

Table 2: Advertisement Affect Recognition. F1 scores are presented in the form µ ± σ .

Method | Valence F1 (all) | Valence F1 (L30) | Valence F1 (L10) | Arousal F1 (all) | Arousal F1 (L30) | Arousal F1 (L10)
Video FC7 + LDA | 0.69±0.03 | 0.68±0.11 | 0.60±0.19 | 0.66±0.03 | 0.55±0.09 | 0.42±0.14
Video FC7 + LSVM | 0.66±0.02 | 0.64±0.10 | 0.57±0.19 | 0.63±0.02 | 0.58±0.10 | 0.47±0.15
Video FC7 + RSVM | 0.71±0.02 | 0.70±0.10 | 0.61±0.18 | 0.68±0.02 | 0.57±0.11 | 0.42±0.14
Constant Blur FC7 + LDA | 0.70±0.02 | 0.67±0.10 | 0.63±0.21 | 0.69±0.03 | 0.59±0.10 | 0.46±0.16
Constant Blur FC7 + LSVM | 0.67±0.02 | 0.68±0.10 | 0.62±0.23 | 0.61±0.03 | 0.55±0.11 | 0.45±0.20
Constant Blur FC7 + RSVM | 0.73±0.02 | 0.70±0.09 | 0.65±0.21 | 0.70±0.02 | 0.59±0.09 | 0.45±0.16
Adaptive Blur FC7 + LDA | 0.71±0.02 | 0.67±0.10 | 0.63±0.21 | 0.70±0.02 | 0.59±0.11 | 0.47±0.15
Adaptive Blur FC7 + LSVM | 0.66±0.02 | 0.69±0.10 | 0.62±0.23 | 0.62±0.03 | 0.55±0.11 | 0.46±0.19
Adaptive Blur FC7 + RSVM | 0.72±0.02 | 0.70±0.09 | 0.64±0.21 | 0.69±0.02 | 0.59±0.09 | 0.48±0.14
Object Crops FC7 + LDA | 0.59±0.04 | 0.60±0.10 | 0.54±0.17 | 0.58±0.03 | 0.55±0.07 | 0.42±0.14
Object Crops FC7 + LSVM | 0.59±0.03 | 0.59±0.10 | 0.54±0.17 | 0.56±0.03 | 0.54±0.10 | 0.45±0.15
Object Crops FC7 + RSVM | 0.63±0.04 | 0.63±0.10 | 0.58±0.17 | 0.61±0.03 | 0.59±0.08 | 0.47±0.15
Object Retained FC7 + LDA | 0.59±0.03 | 0.56±0.08 | 0.57±0.21 | 0.55±0.03 | 0.51±0.10 | 0.56±0.20
Object Retained FC7 + LSVM | 0.57±0.03 | 0.57±0.10 | 0.53±0.20 | 0.56±0.04 | 0.54±0.09 | 0.58±0.17
Object Retained FC7 + RSVM | 0.61±0.03 | 0.58±0.09 | 0.56±0.21 | 0.55±0.03 | 0.50±0.11 | 0.55±0.19
AlexNet FC8 + LDA | 0.71±0.02 | 0.68±0.11 | 0.71±0.18 | 0.67±0.03 | 0.58±0.10 | 0.51±0.15
AlexNet FC8 + LSVM | 0.70±0.02 | 0.70±0.09 | 0.72±0.19 | 0.66±0.02 | 0.58±0.11 | 0.51±0.17
AlexNet FC8 + RSVM | 0.71±0.02 | 0.69±0.09 | 0.73±0.17 | 0.66±0.02 | 0.55±0.10 | 0.50±0.17
Gist + LDA | 0.57±0.03 | 0.52±0.10 | 0.43±0.16 | 0.58±0.02 | 0.52±0.09 | 0.46±0.19
Gist + LSVM | 0.57±0.03 | 0.53±0.10 | 0.42±0.16 | 0.58±0.02 | 0.52±0.08 | 0.46±0.17
Gist + RSVM | 0.39±0.02 | 0.36±0.06 | 0.35±0.09 | 0.56±0.02 | 0.54±0.09 | 0.52±0.17
Eye ROI FC7 + LDA | 0.65±0.02 | 0.61±0.10 | 0.60±0.21 | 0.68±0.02 | 0.56±0.09 | 0.53±0.19
Eye ROI FC7 + LSVM | 0.62±0.03 | 0.58±0.07 | 0.55±0.21 | 0.66±0.02 | 0.59±0.09 | 0.59±0.18
Eye ROI FC7 + RSVM | 0.68±0.02 | 0.63±0.10 | 0.62±0.20 | 0.70±0.02 | 0.56±0.08 | 0.54±0.18
Eye Hist + LDA | 0.53±0.04 | 0.57±0.04 | 0.50±0.05 | 0.52±0.03 | 0.54±0.05 | 0.56±0.02
Eye Hist + LSVM | 0.52±0.04 | 0.54±0.03 | 0.52±0.05 | 0.53±0.04 | 0.53±0.06 | 0.55±0.04
Eye Hist + RSVM | 0.53±0.04 | 0.58±0.04 | 0.52±0.05 | 0.52±0.04 | 0.56±0.06 | 0.56±0.02
Eye ROI + Context Blur FC7 + LDA | 0.69±0.02 | 0.71±0.09 | 0.65±0.16 | 0.69±0.02 | 0.61±0.10 | 0.63±0.18
Eye ROI + Context Blur FC7 + LSVM | 0.67±0.02 | 0.67±0.11 | 0.59±0.18 | 0.69±0.02 | 0.61±0.09 | 0.67±0.18
Eye ROI + Context Blur FC7 + RSVM | 0.74±0.02 | 0.74±0.08 | 0.65±0.19 | 0.73±0.02 | 0.64±0.09 | 0.61±0.16

A general degradation in F1 scores can also be seen across the three settings. The lower affect recognition performance in the L30 and L10 conditions highlights both a lack of heterogeneity in affective information towards the end of the ad, and the limitation of using a single affective label (as opposed to continuous labeling, e.g., using Feeltrace [9]) over time. The generally lower F1-scores achieved for arousal with all methods suggest that arousal is more difficult to model than valence, and that coherency between visual features and valence labels is sustained over time.

The deep learning-based FC7/FC8 features, even though they are not interpretable, considerably outperform the handcrafted Gist and Eye Hist features. This can be attributed to a variety of reasons. Even though Tavakoli et al. [43] show that eye tracking features correlate well with valence, their approach is validated on images, whereas we focus on ad videos, which have different eye gaze dynamics influenced by engaging ad content. The coarse visual scene structure (Constant and Adaptive Blur conditions) in itself proves to be considerably more informative for affect prediction than fixation statistics. Similarly, while Gist features may be able to effectively encode characteristics of simple scenes (e.g., indoor vs. outdoor), ad videos may contain a wide variety of scene content (e.g., a sudden transition from indoor to outdoor), which renders them ineffective for describing visual ad content.

Comparing the various channels, we notice a sharp difference in performance between channels that only encode object-related information and those that encode contextual information. The best performing computationally-driven object channel is Object Crops (peak valence F1 = 0.63, peak arousal F1 = 0.61), which performs significantly worse than the best performing contextual channel, Constant Blur (peak valence F1 = 0.73, peak arousal F1 = 0.70). The AlexNet FC8 channel, which contains information regarding the existence and co-occurrence of 1000 object classes in a video frame (as compared to the YOLOv3 detector that only handles 80 classes), performs much better than the Object Crops and Object Retained channels. Among object-based channels, AlexNet FC8 performs the best and is only marginally worse than the context-based blur channels for valence recognition. However, the blurred video channels appear to encode information more pertinent to arousal recognition than AlexNet FC8.

Interestingly, we also observe that the performance of the Constant Blur and Adaptive Blur channels is slightly better than that of the Video channel (peak valence F1 = 0.71, peak arousal F1 = 0.68) containing the complete video information. Their performances are very comparable, largely because the two channels encode similar information (for many video frames, no objects could be detected after the first blurring step). The Eye ROI channel clearly outperforms the Object Retained channel, which points to human-driven salient regions containing more affective information than computationally detected objects.

Lastly, the best arousal and valence recognition is achieved with the Eye ROI + Context Blur channel (peak valence F1 = 0.74, peak arousal F1 = 0.73), which encodes high-resolution information corresponding to the gazed (or interesting) content, along with the coarse-grained scene context (as the rest of the scene is blurred). This observation implies that eliminating 'noise' in the form of finer details pertaining to the scene context is beneficial for affect recognition. Context noise is also known to affect (low-level) visual tasks such as object classification; e.g., noise arising from partial matches of high-resolution background information has been shown to confound object categorization in [28].

6 DISCUSSION

A long-standing debate exists regarding what influences emotions while viewing visual content: individual scene objects and their interactions central to the narrative, or the overall scene structure (that we call the gist or context). Particularly with reference to ad design and creation, the ad director has to decide whether to (a) focus on the key elements conveying the ad emotion, or (b) also attend to the context surrounding these elements. In this regard, the findings of this study provide some interesting takeaways. Our results suggest that both the central scene objects (captured by user gaze and encoded in high resolution) and the scene context play a critical role towards conveying the scene affect.

The most striking result that we have obtained is that coarse-grained scene structure conveys affect better than object-specific information, eye-gaze related measures and even the original video. Although we know that coarse scene structure can influence scene description [36] and object perception [4] irrespective of the presence [4, 29] or absence of target objects [28, 29], it is indeed surprising that coarse scene descriptors can effectively predict valence and arousal in video advertisements. Also, the obtained results point to the importance of capturing user-specific information (via gaze tracking), and highlight the limitations of content descriptors (including deep features) for affect recognition.

How are our findings useful in the context of ad affect recognition and computational advertising, where the objective is to insert the most appropriate ads at optimal locations within a streamed video? Past work has shown that such affective information can significantly influence ad evaluation [53] and consumer behavior [7]. In terms of affect mining, we find that affective information can be reliably modeled even from short ad segments, given the reasonable recognition achieved with the L30 and L10 ad segments.

Also, emotions invoked by ads can be identified well using only primitive or blurred scene descriptors, which suggests that a computational advertising framework can be developed with relatively low complexity. Overall, the observed results translate into the following ad design principles: (a) Target valence and arousal should preferably be conveyed by the scene background in addition to being embedded in the ad narrative or object information. (b) The fact that models trained with information from all detectable objects do not do as well as ones trained with a few actively attended objects suggests that cluttered scenes can hamper affective processing and should be avoided.

Can we leverage our insights to achieve more effective computational advertising? Content-based features extracted using image statistics [58] and deep neural network representations [47, 48] have been used for affect-driven placement of advertisements within streamed video. Our work suggests that optimal ad-video scene matching at the content plus emotional levels can be achieved by using traditional object annotation methods for content matching, and coarse scene structure matching for emotional indexing. From a methods perspective, embedding high-resolution objects into coarse backgrounds is compatible with existing deep neural architectures, and represents an easily implementable pipeline.

7 CONCLUSIONS AND FUTURE WORK
Analyses of video-based information channels confirm that interesting scene objects, captured via eye gaze and encoded at high resolution, along with coarse information regarding the remaining scene structure (or context), best capture the scene emotion. The different channels explored in this work can be synthesized with minimal human intervention (only eye tracking information is required), and the proposed affect recognition models are highly scalable owing to the nature of the computer vision and machine learning models used. These channels can be used to annotate large video datasets for a variety of scene understanding applications.

We aim to fully automate our analytical pipeline in the future by replacing human eye-gaze with state-of-the-art visual saliency prediction algorithms. The proposed ad affect recognition models can be combined with a computational advertising framework like CAVVA [58] to maximize engagement, immediate and long-term brand recall, and click-through rates, as validated by previous work [47, 48].

In the future, we would also like to explore decompositions of information in the audio, speech and text modalities, and how they can be useful for affect prediction. Another interesting line of future research is the development of a system that learns and recommends impactful scene structures for the creation of emotional advertisement videos.

ACKNOWLEDGMENTS
This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its International Research Centre in Singapore Funding Initiative.

REFERENCES
[1] 2018. RealEyes Emotional Intelligence. https://www.realeyesit.com/.
[2] M. K. Abadi, S. M. Kia, R. Subramanian, P. Avesani, and N. Sebe. 2013. User-centric Affective Video Tagging from MEG and Peripheral Physiological Responses. In Affective Computing and Intelligent Interaction. 582–587.
[3] M. K. Abadi, R. Subramanian, S. M. Kia, P. Avesani, I. Patras, and N. Sebe. 2015. DECAF: MEG-Based Multimodal Database for Decoding Affective Physiological Responses. IEEE Trans. Affective Computing 6, 3 (2015), 209–222.
[4] M. Bar. 2004. Visual objects in context. Nat. Rev. Neurosci. 5, 8 (Aug 2004), 617–629.
[5] Yoann Baveye, Emmanuel Dellandrea, Christel Chamaret, and Liming Chen. 2015. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Trans. Affective Computing 6, 1 (2015), 43–55.
[6] Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. Series B (Methodological) 57, 1 (1995), 289–300.
[7] J. Broach, V. Carter, J. Page, J. Thomas, and R. D. Wilson. 1995. Television programming and its influence on viewers' perceptions of commercials: The role of program arousal and pleasantness. Journal of Advertising 24, 4 (1995), 45–54.

[8] Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal Sentiment Analysis with Word-level Fusion and Reinforcement Learning. In ACM International Conference on Multimodal Interaction. 163–171.
[9] Roddy Cowie, Ellen Douglas-Cowie, Susie Savvidou, Edelle McMahon, Martin Sawey, and Marc Schröder. 2000. 'FEELTRACE': An instrument for recording perceived emotion in real time. In ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion.
[10] Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon. 2017. From Individual to Group-level Emotion Recognition: EmotiW 5.0. In Proceedings of the 19th ACM International Conference on Multimodal Interaction.
[11] Shaojing Fan, Zhiqi Shen, Ming Jiang, Bryan L. Koenig, Juan Xu, Mohan S. Kankanhalli, and Qi Zhao. 2018. Emotional Attention: A Study of Image Sentiment and Visual Attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] P. W. Farris, N. T. Bendle, P. E. Pfeifer, and D. J. Reibstein. 2010. Marketing Metrics: The Definitive Guide to Measuring Marketing Performance (2 ed.). Wharton School Publishing.
[13] S. O. Gilani, R. Subramanian, Y. Yan, D. Melcher, N. Sebe, and S. Winkler. 2015. PET: An eye-tracking dataset for animal-centric Pascal object classes. In International Conference on Multimedia and Expo. 1–6.
[14] M. K. Greenwald, E. W. Cook, and P. J. Lang. 1989. Affective judgement and psychophysiological response: dimensional covariation in the evaluation of pictorial stimuli. Journal of Psychophysiology 3 (1989), 51–64.
[15] Iris IA Groen, Michelle R Greene, Christopher Baldassano, Li Fei-Fei, Diane M Beck, and Chris I Baker. 2018. Distinct contributions of functional and deep neural network features to representational similarity of scenes in human brain and behavior. Elife 7 (2018), e32962.
[16] Hatice Gunes and Maja Pantic. [n. d.]. Automatic, Dimensional and Continuous Emotion Recognition. International Journal of Synthetic Emotions 1, 1 ([n. d.]), 68–99.
[17] Alan Hanjalic and Li-Quan Xu. 2005. Affective Video Content Representation. IEEE Trans. Multimedia 7, 1 (2005), 143–154.

[18] Morris B Holbrook and Rajeev Batra. 1987. Assessing the Role of Emotions as Mediators of Consumer Responses to Advertising. Journal of Consumer Research 14, 3 (1987), 404–420.
[19] Morris B Holbrook and John O'Shaughnessy. 1984. The role of emotion in advertising. Psychology & Marketing 1, 2 (1984), 45–64.
[20] Zhengwei Huang, Ming Dong, Qirong Mao, and Yongzhao Zhan. 2014. Speech Emotion Recognition Using CNN. In ACM Multimedia. 801–804.
[21] Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. 2017. Automatic Understanding of Image and Video Advertisements. (2017), 1100–1110.
[22] P. Isola, Jianxiong Xiao, D. Parikh, A. Torralba, and A. Oliva. 2014. What Makes a Photograph Memorable? Pattern Analysis and Machine Intelligence, IEEE Transactions on 36, 7 (July 2014), 1469–1482.
[23] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In ACM Int'l Conference on Multimedia. 675–678.
[24] Adriana K. and James H. 2018. Towards Automatic Understanding of Visual Advertisements (ADS). (2018).
[25] M. A. Kamins, L. J. Marks, and D. Skinner. 1991. Television commercial evaluation in the context of program induced mood: Congruency versus consistency effects. Journal of Advertising 20, 1 (1991), 1–14.
[26] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1725–1732.
[27] Karthik Yadati, Harish Katti, and Mohan Kankanhalli. 2013. Interactive Video Advertising: A Multimodal Affective Approach. In Multimedia Modeling. 106–117.

[28] Harish Katti, Marius V Peelen, and SP Arun. 2016. Object detection can be improved using human-derived contextual expectations. arXiv preprint arXiv:1611.07218 (2016).
[29] Harish Katti, Marius V Peelen, and SP Arun. 2017. How do targets, nontargets, and scene context influence real-world object detection? Attention, Perception, & Psychophysics 79, 7 (2017), 2021–2036.
[30] Harish Katti, Anoop Kolar Rajagopal, Kalapathi Ramakrishnan, Mohan Kankanhalli, and Tat-Seng Chua. 2013. Online estimation of evolving human visual interest. ACM Transactions on Multimedia 11, 1 (2013), 8:1–8:21.
[31] Sander Koelstra, Christian Mühl, and Ioannis Patras. 2009. EEG analysis for implicit tagging of video data. In Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on. IEEE, 1–6.
[32] Sander Koelstra, Christian Mühl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. 2012. DEAP: A Database for Emotion Analysis Using Physiological Signals. IEEE Trans. Affective Computing 3, 1 (2012), 18–31.
[33] Vassilios Krassanakis, Vassiliki Filippakopoulou, and Byron Nakos. 2014. EyeMMV toolbox: An eye movement post-analysis tool based on a two-step spatial dispersion threshold for fixation identification. Journal of Eye Movement Research 7, 1 (2014), 1–10.
[34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Neural Information Processing Systems. 1097–1105.
[35] Peter J. Lang, Margaret M. Bradley, and B. N. Cuthbert. 2008. International affective picture system (IAPS): Affective ratings of pictures and instruction manual. Technical Report A-8. The Center for Research in Psychophysiology, University of Florida, Gainesville, FL.
[36] F. F. Li, R. VanRullen, C. Koch, and P. Perona. 2002. Rapid natural scene categorization in the near absence of attention. Proc. Natl. Acad. Sci. U.S.A. 99, 14 (Jul 2002), 9596–9601.

[37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.
[38] Tao Mei, Xian-Sheng Hua, Linjun Yang, and Shipeng Li. 2007. VideoSense: Towards Effective Online Video Advertising. In ACM Int'l Conference on Multimedia. 1075–1084.
[39] Aude Oliva and Antonio Torralba. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42, 3 (2001), 145–175.
[40] Maja Pantic and Alessandro Vinciarelli. 2009. Implicit human-centered tagging [Social Sciences]. IEEE Signal Processing Magazine 26, 6 (2009), 173–180.
[41] Viral Parekh, Ramanathan Subramanian, Dipanjan Roy, and C. V. Jawahar. 2018. An EEG-Based Image Annotation System. In Computer Vision, Pattern Recognition, Image Processing, and Graphics. 303–313.
[42] Michel Tuan Pham, Maggie Geuens, and Patrick De Pelsmacker. 2013. The influence of ad-evoked feelings on brand evaluations: Empirical generalizations from consumer responses to more than 1000 TV commercials. International Journal of Research in Marketing 30, 4 (2013), 383–394.
[43] Hamed R.-Tavakoli, Adham Atyabi, Antti Rantanen, Seppo J. Laukka, Samia Nefti-Meziani, and Janne Heikkila. 2015. Predicting the Valence of a Scene from Observers' Eye Movements. PLoS ONE 10, 9 (2015), 1–19.
[44] Subramanian Ramanathan, Harish Katti, Raymond Huang, Tat-Seng Chua, and Mohan Kankanhalli. 2009. Automated Localization of Affective Objects and Actions in Images via Caption Text-cum-eye Gaze Analysis. In ACM International Conference on Multimedia. 729–732.
[45] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. arXiv (2018).
[46] Fabien Ringeval, Björn Schuller, Michel Valstar, Jonathan Gratch, Roddy Cowie, Stefan Scherer, Sharon Mozgai, Nicholas Cummins, Maximilian Schmitt, and Maja Pantic. 2017. AVEC 2017: Real-life Depression, and Affect Recognition Workshop and Challenge. 3–9.

[47] Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli, and Ramanathan Subramanian. 2017. Affect Recognition in Ads with Application to Computational Advertising. In ACM Multimedia. New York, NY, USA, 1148–1156.
[48] Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli, and Ramanathan Subramanian. 2017. Evaluating Content-centric vs. User-centric Ad Affect Recognition. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. 402–410.
[49] Mohammad Soleymani, Jeroen Lichtenauer, Thierry Pun, and Maja Pantic. 2012. A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing 3, 1 (2012), 42–55.
[50] Mohammad Soleymani, Maja Pantic, and Thierry Pun. 2012. Multimodal emotion recognition in response to videos. IEEE Transactions on Affective Computing 3, 2 (2012), 211–223.
[51] Ramanathan Subramanian, Divya Shankar, Nicu Sebe, and David Melcher. 2014. Emotion modulates eye movement patterns and subsequent memory for the gist and details of movie scenes. Journal of Vision 14, 3 (2014), 31.
[52] Marko Tkalčič, Urban Burnik, and Andrej Košir. 2010. Using affective parameters in a content-based recommender system for images. User Modeling and User-Adapted Interaction 20, 4 (2010), 279–311.
[53] Mark Truss. 2006. The Advertised Mind: Ground-Breaking Insights into How Our Brains Respond to Advertising. Journal of Advertising Research 46, 1 (2006), 132–134.
[54] Julia Wache, Ramanathan Subramanian, Mojtaba Khomami Abadi, Radu-Laurentiu Vieriu, Nicu Sebe, and Stefan Winkler. 2015. Implicit User-centric Personality Recognition Based on Physiological Responses to Emotional Videos. In International Conference on Multimodal Interaction. 239–246.

[55] Hee Lin Wang and Loong-Fah Cheong. 2006. Affective understanding in film. IEEE Trans. Circ. Syst. V. Tech. 16, 6 (2006), 689–704.
[56] Alex W. White. 2006. Advertising Design and Typography. Allworth Press.
[57] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. 2048–2057.
[58] Karthik Yadati, Harish Katti, and Mohan Kankanhalli. 2014. CAVVA: Computational affective video-in-video advertising. IEEE Trans. Multimedia 16, 1 (2014), 15–23.
[59] Zhihong Zeng, Maja Pantic, Glenn I Roisman, and Thomas S Huang. 2009. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 1 (2009), 39–58.
[60] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems. 487–495.