
SIViP (2008) 2:289–306  DOI 10.1007/s11760-008-0087-y

ORIGINAL PAPER

Community based feedback techniques to improve video search

David Vallet · Frank Hopfgartner · Martin Halvey · Joemon M. Jose

Received: 2 June 2008 / Revised: 26 July 2008 / Accepted: 12 September 2008 / Published online: 21 October 2008
© Springer-Verlag London Limited 2008

Abstract In this paper, we present a novel approach to aid users in the difficult task of video search. We use a graph-based model built on implicit feedback mined from the interactions of previous users of our video search system to provide recommendations that support users in their search tasks. This approach means that users are not burdened with providing explicit feedback, while still getting the benefits of recommendations. The goal of this approach is to improve the quality of the results that users find and, in doing so, to help users explore a large and difficult information space. In particular, we wish to make the challenging task of video search much easier for users. The results of our evaluation indicate that we achieved these goals: the performance of users in retrieving relevant videos improved, and users were able to explore the collection to a greater extent.

Keywords Video · Search · Collaborative · Implicit · Feedback · User study

D. Vallet (B)
Universidad Autónoma de Madrid, Escuela Politécnica Superior,
Ciudad Universitaria de Cantoblanco, 28049 Madrid, Spain
e-mail: [email protected]

F. Hopfgartner · M. Halvey · J. M. Jose
Department of Computing Science, University of Glasgow,
Glasgow G12 8QQ, UK
e-mail: [email protected]

M. Halvey
e-mail: [email protected]

J. M. Jose
e-mail: [email protected]

1 Introduction

With the growing capabilities and falling prices of current hardware systems, there are ever increasing possibilities to store and manipulate videos in a digital format. With ever increasing broadband capabilities, it is also now possible to view video online as easily as text-based pages were viewed when the Web first appeared. People are now creating their own digital libraries from materials created with digital cameras and camcorders, and use a number of systems to place this material on the Web, as well as store it in their own individual collections [20]. However, the systems currently used to manage and retrieve these videos are insufficient for dealing with such large and swiftly increasing volumes of video. Current state of the art systems rely either on annotations provided by users or on methods that use the visual low-level features available in the videos. As experimental results in state of the art video retrieval show, neither of these approaches is sufficient to overcome the difficulties associated with video search (see Sect. 2 for more details).

We believe that many of these problems associated with browsing and searching large collections of video can be alleviated through the use of recommendation techniques. Recommendation techniques based on implicit actions do not require users to alter their normal behaviour, while all of the actions that users carry out can be used to improve their retrieval results. To test these assertions, we have developed a graph-based model that utilises the implicit actions involved in previous user searches. This model can provide recommendations to support users in completing their search tasks. The approach is flexible, as it allows us to use a combination of explicit and implicit feedback, or implicit feedback alone, to help users. We believe that this approach can result in a number of desirable outcomes. We achieved improved user performance in terms of task completion, showing that we can aid user exploration of the collection and can also increase user satisfaction with their search and their search results.

An evaluative study was performed in order to examine and validate these assumptions. Two systems were compared. The first is a baseline system that provides no recommendations. The second provides recommendations based on our model of implicit user actions. The two systems and their respective performances were evaluated both qualitatively and quantitatively. It was found that our approach increases the accuracy of the videos retrieved by users, allows users to navigate a video collection to a greater extent, and that users were more at ease using our recommendation system in comparison with the baseline. The remainder of this paper is organised as follows: In Sect. 2, we provide a rationale for our work and describe the state of the art in a number of areas that have inspired this work. Subsequently, in Sect. 3 we describe our approach for using implicit feedback to provide recommendations. Section 4 describes the two systems used in our study. In Sect. 5 we describe our experimental methodology, which is followed by the results of our experiments in Sect. 6. Finally, in Sect. 7, we provide a discussion of our work and some conclusions.

2 Background and motivation

The work that we have undertaken is multi-faceted and draws inspiration from various research areas. We will discuss a number of these research areas and how they relate to our work. We begin by discussing video retrieval, and in particular interactive video retrieval.

2.1 Interactive video retrieval

Interactive video retrieval refers to the process of users formulating and carrying out video searches, and subsequently reformulating queries based on the previously retrieved results to retrieve more results. Most state of the art video retrieval engines, interactive or otherwise, can be divided into two major components [32]: an indexing engine and a retrieval engine, which accesses the index through queries. The indexing engine is responsible for providing a quickly accessible index of the information available in a corpus of video. Indexing begins with shot segmentation, in which a longer video is segmented into a sequence of shorter video shots. A shot is a visually related sequence of the video; the length of a shot can vary depending on the video content. Boundaries between shots are typically marked by a cut or fade in the video sequence. For most video retrieval systems, the shot is the element of retrieval, and each shot is indexed separately by the video retrieval system. The shots may have text associated with them and can be represented by individual frames. In addition, for each shot, a number of example frames which provide the best representation of the shot are calculated; these are called keyframes. When a search is carried out, the results are normally presented as a list of shots, which are represented by keyframes.

In addition to this basic representation of videos, there are additional pieces of metadata or additional information that can be extracted from video to aid search and be added to the index, e.g. speech, closed captions, etc. A study of a number of state of the art video retrieval systems [15] concludes that the availability of these additional resources varies between systems. For example, Heesch et al. [13] include Closed Caption transcripts (CC), Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) output in their index, whereas Foley et al. [7] index speech recognition output only. As these additional resources are not always available, they cannot always be relied upon to aid user searches. Despite this uncertainty and lack of resources, a great deal of research has been carried out into their use. For example, Huang [16] has argued that speech contains a great deal of semantic information that can be used to help search. Further research from Chang et al. [3] found that text extracted from speech data can be an important feature for extracting additional information from a video; it can be used for named entity extraction, annotating concepts in video, and detecting topic changes in a video. The indexing procedure, together with the retrieval engine, constitutes the "backend" of a video retrieval system.

The main component that allows the interactivity in interactive video retrieval is the "front end"; this is the interface between the computer and the user. The interface allows a user to compose queries and interact with the results. The users can then re-formulate queries if they wish and get a new set of results. There are a number of different ways in which a user can query a video retrieval system; these include query by text, query by example and query by concept. None of these approaches has as yet provided an adequate solution for facilitating video search. Query by text is one of the most popular methods of searching for video. It is simple, and users are familiar with this paradigm from text-based searches. However, query by text relies on the availability of sufficient textual descriptions. Textual descriptions may be extracted from closed captions or through ASR; however, the availability of these additional resources varies for different systems [15]. More recent online state of the art systems rely on using annotations from users to provide textual descriptions for videos, as well as for other media, e.g. images, Web pages, podcasts, etc. However, annotations also present a number of problems. Users can have different perceptions of the same video and tag that video differently. This can result in synonymy, polysemy and homonymy, which makes it difficult for other users to retrieve the same video. It has also been found that users are reluctant to provide a large number of tags [11]. Query by example allows the users to provide sample images or video clips as examples to retrieve more results. This approach uses the low-level features that are available in images and videos, such as colour, texture and shape, to retrieve results. The approach is also inadequate for video retrieval because of the difference between the low-level data representation of videos and the higher-level concepts users associate with video, commonly known as the semantic gap [17]. Bridging the semantic gap is one of the most challenging research issues in multimedia information retrieval today. In an attempt to bridge this gap, a great deal of interest in the multimedia search community has been invested in search by concept. Semantic concepts such as "vehicle" or "person" can be used to aid retrieval; an example of this semantic application has been explored by the Large-Scale Concept Ontology for Multimedia (LSCOM) initiative [23]. Search by concept requires a large number of concepts to be represented and a number of training examples to represent those concepts. While each of these methods alone is inadequate, they have been used in conjunction with each other in a number of systems, including MediaMill [31] and Informedia [4]. MediaMill and Informedia have been amongst the most successful systems at recent TRECVID interactive search evaluations [24]. However, these results are for "expert" users, who can be assumed to achieve an idealised performance compared to the common user [5].

2.2 Personal multimedia search and sharing

In recent years there has been a rapid increase in the number of photographs and videos that individuals store in personal collections and share online. This has led to the emergence of some interesting research investigating user interaction with multimedia. Kirk et al. present their research on "photowork" [19] and "videowork" [20]; photowork and videowork are the activities that users engage in with their digital photos or videos prior to sharing. These user activities form some of the context for the browsing and searching that users carry out. Kirk et al. find that, for the design of new digital photo tools, search as we know it may have much less relevance than new paradigms and ways of browsing. Ethnographic-style studies [2] have also found that there are strong similarities between the ways in which participants used personally captured photos and commercially purchased music. Halvey and Keane [11] provide analyses of people's linking and search behaviour on YouTube. Initial results show that page views in the video context deviate from the typical power-law relationships seen on the Web. There are also clear indications that tagging and textual descriptions play a key role in making some video pages more popular than others. However, a number of studies have shown that users cannot be relied upon to provide large amounts of textual annotation [10,11]; thus we need another form of information to bridge the semantic gap. We believe that using collaborative or community based methods to aid users' video search is one of the best solutions to the problems associated with video search.

2.3 Collaborative information access

Many of the earliest collaborative techniques emerged online in the 1990s [9,25,28] and focused on the notion of collaborative filtering. Collaborative filtering was first developed in the Tapestry system to recommend e-mails to users of online newsgroups [9]. Collaborative filtering aims to group users with similar interests, with a view to treating them similarly in the future: if two users have consistently liked or disliked the same resources, then chances are that they will like or dislike future resources of that type. Since those early days, collaborative or community based methods have evolved and been used to aid browsing [36] and e-learning [8], and in collaborative search engines [30]. More recently there has also been some initial research into collaborative video search [1]. This work, however, concentrated on two users carrying out a search simultaneously, which is not the focus of our research, rather than on using the implicit interactions from previous searches to improve future searches.

Traditionally, explicit relevance feedback has been used in a number of collaborative systems; however, there are several problems with this approach. Providing explicit feedback can be a cognitively taxing process. Users are forced to update their information need constantly, and this can be difficult when their information need is vague [34] or when they are unfamiliar with the document collection [27]. Also, previous evaluations have found that users of explicit feedback systems often do not provide sufficient levels of feedback for adaptive retrieval algorithms to work [12]. With this in mind, we use the implicit actions of users as the basis for our system. Implicit feedback has been shown to be a good indicator of interest in a number of areas [18]. Hopfgartner et al. [14] have suggested that implicit relevance feedback can aid users searching in digital video library systems. Using click-through data as implicit feedback, White et al. [37] use the concept of "search trails", meaning the search queries and document interaction sequences performed by users during a search session, to enhance Web search. Craswell and Szummer [6] apply a random walk on a graph of user click-through data to help retrieve relevant documents for user searches. Specifically for multimedia search, relevance feedback based on the content of video has also been used in conjunction with related information, e.g. tags, to provide video search recommendations to users [21,38]. However, we believe that such techniques are insufficient where there is a lack of associated information, and they will also suffer from the problems associated with the semantic gap [17]. In response to a number of the problems outlined in the previous sections, we have developed our own graph based model of implicit actions, which we use to provide recommendations. This approach uses the previous work of White et al. [37] and Craswell and Szummer [6] as a guide. This model and the recommendation techniques that we use are described in the following section.

3 Implicit feedback: a graph based approach

For our recommendation model based on user actions, there are two main desired properties of the model for storing action information. The first property is the representation of all of the user interactions with the system, including the search trails for each interaction. This allows us to fully exploit all of the interactions to provide richer recommendations. The second property is the aggregation of implicit information from multiple sessions and users into a single representation, thus facilitating the analysis and exploitation of past implicit information. To achieve these properties we opt for a graph based representation of the users' implicit information. We take the concept of trails from White et al. [37]; however, unlike White et al., we do not limit the possible recommended documents to those documents that are at the end of the search trail. We believe that during an interactive search the documents that most of the users with similar interaction sequences interacted with are the documents that could be most relevant for recommendation, not just the final document in the search trail. Similar to Craswell and Szummer [6], our approach represents queries and documents in the same graph; however, we represent the whole interaction sequence, unlike their approach, where the clicked documents are linked directly to the query node. Once again, we want to recommend potentially important documents that are part of the interaction sequence and not just the final document of this interaction. Another difference between our approach and previous work is that we take into consideration other types of implicit feedback actions related to multimedia search, e.g. length of play time, browsing keyframes, etc., as well as click-through data. This additional data allows us to provide a richer representation of user actions and potentially better recommendations. Overall, our representation exploits a greater range of user interactions in comparison with other approaches [6,21,37]. This results in a more complete representation of a wide range of user actions that may facilitate better recommendations. These properties and this approach result in two graph-based representations of user actions. The first representation utilises a Labelled Directed Multigraph (LDM) for the detailed and full representation of implicit information. The second is a Weighted Directed Graph (WDG), which interprets the information in the LDM and represents it in a way that is exploitable by a recommendation algorithm. The recommendations that are provided are based on three different techniques that operate on the WDG. The two graph representations and the recommendation techniques are described in detail in the following sections.

3.1 Labelled directed multigraph

A user session s can be represented as a set of queries $Q_s$, which were input by the user u, and a set of multimedia documents $D_s$ that the user interacted with during the search session. Queries and documents are represented as nodes $N_s = Q_s \cup D_s$ of our graph representation, $G_s = (N_s, A_s)$. The interactions of the user during the search session are represented as a set of action arcs $A_s(G) = \{(n_i, n_j, a, u, t)\}$. Each action arc indicates that, at a time t, the user u performed an action of type a that led the user from the query or document node $n_i$ to node $n_j$, with $n_i, n_j \in N_s$. Note that $n_j$ is the object of the action and that actions can be reflexive, for instance, when a user clicks to view a video and then navigates through it. Action types depend on the kinds of actions recorded by the implicit feedback system. In our system we recorded playing a video, navigating through a video, highlighting a video to get additional metadata and selecting a video. Links can contain extra associated metadata as type-specific attributes, e.g. length of play in a play-type action. The graph is multilinked, as different actions can have the same source and destination nodes. The session graph $G_s = (N_s, A_s)$ is then constructed from all the accessed nodes and linking actions, and represents the whole interaction process for the user's session s. Finally, all session-based graphs can be aggregated into a single graph $G = (N, A)$, with $N = \bigcup_s N_s$ and $A = \bigcup_s A_s$, which represents the overall pool of implicit information. Quite simply, all of the nodes from the individual graphs are mapped to one large graph, and then all of the action edges are mapped onto the same graph. This graph may not be fully connected, as it is possible, for instance, that users selected different paths through the data or entered a query and took no further actions. While the LDM gives a detailed representation of user interaction with the collection, it is extremely difficult to use directly for recommendations. The multiple links make the graph extremely complex. In addition, all of the actions are weighted equally, which is not always a true representation; some actions may be more important than others and should be weighted differently.
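To make this representation concrete, the following minimal sketch (our own illustration; the paper gives no implementation, and all names here are assumptions) stores a session LDM as a set of nodes plus a list of typed action arcs, and aggregates session graphs into a single pool:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ActionArc:
    """One labelled arc: user `user` performed an action of type `action`
    at time `time`, leading from node `src` to node `dst` (the object)."""
    src: str            # query or document node id
    dst: str            # node the action leads to (may equal src: reflexive)
    action: str         # e.g. "play", "view", "navigate", "browse", "tooltip"
    user: str
    time: float
    meta: tuple = ()    # type-specific attributes, e.g. ("length", 42.0)

@dataclass
class SessionLDM:
    """Labelled directed multigraph G_s = (N_s, A_s) for one search session."""
    nodes: set = field(default_factory=set)    # N_s = Q_s ∪ D_s
    arcs: list = field(default_factory=list)   # multigraph: parallel arcs allowed

    def record(self, arc: ActionArc) -> None:
        self.nodes.update((arc.src, arc.dst))
        self.arcs.append(arc)

def aggregate(sessions):
    """Union of all session graphs: the overall pool of implicit information."""
    pool = SessionLDM()
    for s in sessions:
        pool.nodes |= s.nodes
        pool.arcs += s.arcs
    return pool
```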

3.2 Weighted directed graph

In order to allow our recommendation algorithm to exploit the LDM representation of user actions, we convert the LDM to a WDG by collapsing all links interconnecting two nodes into one single weighted edge. This process is carried out as follows. Given the detailed LDM graph of a session s, $G_s = (N_s, A_s)$, we compute its corresponding weighted graph $G_s = (N_s, W_s)$. A link $(n_i, n_j, w_s) \in W_s$ indicates that at least one action led the user from the query or document node $n_i$ to $n_j$. The weight value $w_s$ represents the probability that node $n_j$ was relevant to the user for the given session; this value is either given explicitly by the user, or calculated by means of the implicit evidence obtained from the interactions of the user with that node:

$$
W_s(n_i, n_j) =
\begin{cases}
1, & \text{if explicit relevance for } n_j \\
-1, & \text{if explicit irrelevance for } n_j \\
lr(n_j) \in [0, 1], & \text{otherwise (i.e. implicit relevance)}
\end{cases}
$$

In the case that there is only implicit evidence for a node n, the probability value is given by the local relevance $lr(n)$. $lr(n)$ returns a value between 0 and 1 that approximates the probability that node n was relevant to the user, given the different interactions that the user had with the node. For instance, if the user opened a video and played it for the whole of its duration, this can be enough evidence that the video has a high chance of being relevant to the user. Following this idea, and based on previous work on the impact of implicit feedback importance weights [14], the local relevance function is defined as

$$lr(n) = 1 - \frac{1}{x(n)}$$

where $x(n)$ is the sum of the weights associated with each action of which node n is the object. This subset of actions is defined as $A_s(G_s, n) = \{(n_i, n_j, a, u, t) \mid n_j = n\}$, $n \in N_s$. These weights are natural positive values returned by a function $f(a): A \to \mathbb{N}$, which maps each type of action to a number. The weights are higher for actions that are understood to give more evidence of relevance to the user. In this way, $lr(n)$ is closer to 1 the more actions involving n are observed and the higher the weights associated with each action type. In our weighting model, some of the implicit actions are weighted nearly as highly as explicit feedback. The accumulation of implicit relevance weights can thus be calculated as

$$x(n) = \sum_{a \in A_s(G_s, n)} f(a)$$

Table 1 shows an example of the function f used during our evaluation process; all of these actions are part of the system described in Sect. 4. This system considers the following actions: (1) playing a video during a given interval of time (Play); (2) clicking a search result in order to view its contents (View); (3) navigating through the contents of a video (Navigate); (4) browsing to the next or previous video keyframe (Browse R/L); and (5) tooltipping a search result by leaving the mouse pointer over the search result. The weights assigned to these actions are based on previous work by Hopfgartner et al. [14], who carried out a study which simulated users carrying out interactive video searches. As part of this work, a number of different weighting schemes were evaluated in order to determine which was the most effective for implicit feedback in interactive video search. The weights that we use are based on the results of this previous work. Figure 1 shows an example of an LDM and its corresponding WDG for a given session.

Table 1 Values for function f() used during the experiment

Action a                 f(a)
Play                     3
View                     10
Navigate, Browse R/L     2
Tooltip                  1

Fig. 1 Correspondence between the LDM (left) and WDG (right) models


Fig. 2 Graph illustrating implicit relevance pool

Similarly to the detailed LDM graph, the session-based WDGs can be aggregated into a single overall graph $G = (N, W)$, which we call the implicit relevance pool, as it collects all the implicit relevance evidence of all users across all sessions. The nodes of the implicit pool are all the nodes involved in any past interaction, $N = \bigcup_s N_s$, whereas the weighted links combine all of the session-based values. In our approach we opted for a simple aggregation of these probabilities: $W = \{(n_i, n_j, w)\}$, with $w = \sum_s w_s$. Each link represents the overall implicit (or explicit, if available) relevance that all users whose actions led from node $n_i$ to $n_j$ gave to node $n_j$. Figure 2 shows an example of the implicit relevance pool.
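A minimal sketch of this aggregation (our illustration, assuming each session WDG is stored as a mapping from $(n_i, n_j)$ links to the session weight $w_s$):

```python
from collections import defaultdict

def build_pool(session_wdgs):
    """Implicit relevance pool: w = sum over sessions of w_s for each link."""
    pool = defaultdict(float)
    for wdg in session_wdgs:                  # wdg: dict (n_i, n_j) -> w_s
        for link, w_s in wdg.items():
            pool[link] += w_s
    return dict(pool)
```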

3.3 Implicit relevance pool recommendation techniques

In our system we recommend both queries and documents to the users. These recommendations are based on the status of the current user session. As the user interacts with the system, a session-based WDG is constructed. The current user's session is thus represented by $G'_s = (N'_s, W'_s)$. This graph is the basis of the recommendation algorithm, which has three components; each component uses the implicit relevance pool in order to retrieve similar nodes that were somehow relevant to other users. The first two components are neighbourhood based. A neighbourhood approach is a way of obtaining related nodes: quite simply, we define the node neighbourhood of a given node n as the nodes that are within a distance $D_{MAX}$ of n, without taking link directionality into consideration. These nodes are related to n by the actions of the users, either because users interacted with n after interacting with the neighbour nodes, or because they are the nodes users interacted with after interacting with n. More formally, we define the node neighbourhood of a given node n as:

$$NH(n) = \{m \in N \mid \delta(n, m) < D_{MAX}\}$$

where $\delta(n, m)$ is the shortest path distance between nodes n and m, and $D_{MAX}$ is the maximum distance at which a node is still considered a neighbour. The best performing setting for this value in our experiments was $D_{MAX} = 3$.
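This neighbourhood can be computed with a breadth-first search over an undirected view of the pool, as in the following sketch (our illustration; `adj` is an assumed adjacency mapping built from the pool's links, ignoring direction):

```python
from collections import deque

def neighbourhood(adj, n, d_max=3):
    """NH(n): nodes m with shortest-path distance delta(n, m) < d_max,
    link directionality ignored."""
    seen = {n: 0}
    queue = deque([n])
    while queue:
        m = queue.popleft()
        if seen[m] + 1 >= d_max:       # children would lie at distance >= d_max
            continue
        for nb in adj.get(m, ()):
            if nb not in seen:
                seen[nb] = seen[m] + 1
                queue.append(nb)
    return {m for m in seen if m != n}
```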


Using the properties derived from the implicit relevance pool, we can calculate the overall relevance value for a given node. This value represents the aggregated implicit relevance that users historically gave to n when n was involved in their interactions. Given all the incident weighted links of n, defined by the subset $W_s(G_s, n) = \{(n_i, n_j, w) \mid n_j = n\}$, $n \in N_s$, the overall relevance value for n is calculated as follows:

$$or(n) = \sum_{w \in W_s(G_s, n)} w$$

Given the ongoing user session s and the implicit relevance pool, we can then define the node recommendation value as:

$$nr(n, N_s) = \sum_{n_i \in N_s,\; n \in NH(n_i)} lr'(n_i) \cdot or(n)$$

where $lr'(n_i)$ is the local relevance computed for the current session of the user, $G'_s$, so that the relevance of the node to the current session is taken into consideration. We can then define the first recommendation value as $r_1(n, N'_s) = nr(n, Q'_s)$, with $Q'_s \subseteq N'_s$, i.e. the node recommendation value for the queries related to the current session. Similarly, we can define the second recommendation value $r_2(n, N'_s) = nr(n, D'_s)$, with $D'_s \subseteq N'_s$, which exploits the session-related documents instead.

The last recommendation component is based on the user's interaction sequence. The interaction sequence recommendation approach tries to take into consideration the interaction process of the user, with the aim of recommending those nodes that follow this sequence of interactions. For instance, if a user has opened a video of news highlights, the recommendation could contain the more in-depth stories that previous users found interesting to view next. The recommendation value $r_3(n, N'_s)$, called the interactive recommendation, can thus be defined as follows:

$$r_3(n, N'_s) = \sum_{\substack{n_i \in N'_s \\ p = n_i \rightsquigarrow n_j \to n \\ length(p) < L_{MAX}}} lr'(n_i) \cdot \xi^{length(p)-1} \cdot w(n_j, n)$$

where $n_i \rightsquigarrow n_j$ denotes the existence of a path from $n_i$ to $n_j$ in the graph, $n_j \to n$ means that n is adjacent to $n_j$, and the same notation is used as a shorthand to define p as any path between $n_i$ and n, taking link directionality into consideration. $length(p)$ is counted as the number of links in path p, which must be less than a maximum length $L_{MAX}$. Finally, $\xi$ is a length reduction factor, set to 0.8 in our experiments, a value which allowed weight to be propagated significantly up to six degrees of separation. This length reduction factor gives more importance to those documents that directly follow the interaction sequence, though a document with high levels of interaction that occurs two or three steps away may still be recommended.

In a final step, we obtain the three recommendation lists, one from each recommendation component, and merge them into a single final recommendation list. For this we use a rank-based aggregation approach: the scores of the final recommendations are the sum of the rank-based normalised scores of each recommendation list, i.e. using a score $\frac{1}{r(n)}$, where $r(n)$ is the position of n in the recommended list. The final list is then split into recommended queries and recommended documents; these are then presented to the user.
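The sketch below (again our own illustration, reusing `neighbourhood` and the pool structures from the earlier sketches; the data layouts are assumptions) shows one way the three components and the rank-based merge could be computed:

```python
from collections import defaultdict

def overall_relevance(in_weights, n):
    """or(n): sum of the weights on all pool links whose object is n."""
    return sum(in_weights.get(n, []))

def node_recommendation(adj, in_weights, session_lr, nodes, d_max=3):
    """nr(n, .): accumulate lr'(n_i) * or(n) for each n in NH(n_i).
    Used with session queries for r1 and session documents for r2."""
    scores = defaultdict(float)
    for n_i in nodes:
        for n in neighbourhood(adj, n_i, d_max):
            scores[n] += session_lr.get(n_i, 0.0) * overall_relevance(in_weights, n)
    return scores

def interactive_recommendation(out_weights, session_lr, nodes, xi=0.8, l_max=6):
    """r3: for every directed path p = n_i ~> n_j -> n with length(p) < l_max,
    add lr'(n_i) * xi^(length(p)-1) * w(n_j, n). frontier[m] holds the sum of
    xi^k over all length-k paths from n_i to m."""
    scores = defaultdict(float)
    for n_i in nodes:
        lr_i = session_lr.get(n_i, 0.0)
        frontier = {n_i: 1.0}                     # xi^0 for the empty prefix
        for _ in range(l_max - 1):                # path lengths 1 .. l_max-1
            nxt = defaultdict(float)
            for n_j, damp in frontier.items():
                for n, w in out_weights.get(n_j, {}).items():
                    scores[n] += lr_i * damp * w  # last link contributes w(n_j, n)
                    nxt[n] += damp * xi           # extend every path one step
            frontier = nxt
    return scores

def merge(score_lists):
    """Rank-based aggregation: final score(n) = sum over lists of 1/rank(n)."""
    final = defaultdict(float)
    for scores in score_lists:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, n in enumerate(ranked, start=1):
            final[n] += 1.0 / rank
    return sorted(final, key=final.get, reverse=True)
```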

4 System description

Our implicit feedback approach has been implemented in an interactive video retrieval system. This allows us to have actual end users test our system and approach. The keyframes in our index were indexed based on ASR transcript and machine translation output. The Okapi BM25 retrieval model [26] was used to rank retrieval results. In addition to the ranked list of search results, the system provides users with additional recommendations of video shots that might match their search criteria, based on our recommendation graph (see Sect. 3 for details on the recommendation graph).
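For reference, a standard formulation of the Okapi BM25 score is sketched below (a textbook version with the usual parameter defaults k1 = 1.2, b = 0.75; the paper does not report its own parameter settings):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_dl, df, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag-of-words query.
    doc_tf: term -> frequency in the document; df: term -> document frequency."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        norm = k1 * (1 - b + b * doc_len / avg_dl)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score
```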

Figure 3 shows a screenshot of the recommendation system. The interface can be divided into three main panels: the search panel (A), the result panel (B) and the playback panel (C). The search panel (A) is where users formulate and carry out their searches. Users can enter a text based query in the search panel (A) to begin their search. The users are presented with text based recommendations for search queries that they can use to enhance their search (b). The users are also presented with recommendations of video shots that might match their search criteria (a). Each recommendation is only presented once, but may be retrieved by the user at a later stage if they wish to do so. The result panel (B) is where users can view the search results. This panel is divided into five tabs: the results for the current search, a list of results that the user has marked as relevant, a list of results that the user has marked as maybe being relevant, a list of results that the user has marked as irrelevant, and a list of recommendations that the user has been presented with previously. Users can mark results in these tabs as being relevant by using a sliding bar (c). Additional information about each video shot can be retrieved by hovering the mouse tip over a video keyframe; that keyframe will be highlighted, along with neighbouring keyframes and any text associated with the highlighted keyframe (d). The playback panel (C) is for viewing video shots (g). As a video is playing it is possible to view the current keyframe for that shot (e), any text associated with that keyframe (f) and the neighbouring keyframes. Users can play, pause, stop and navigate through the video as they can in a normal media player, and also make relevance judgements about the keyframe (h). Some of the tools in the interface allow users of the system to provide explicit and implicit feedback, which is then used to provide recommendations to future users. Explicit feedback is given by users marking video shots as being either relevant or irrelevant (c, h). Implicit feedback is given by users playing a video (g), highlighting a video keyframe (d), navigating through video keyframes (e) and selecting a video keyframe (e).

Fig. 3 Interface of the video retrieval system

In order to provide a comparison with our recommendation system, we also implemented a baseline system that provides no recommendations to users. The baseline system had previously been used for the interactive search task at TRECVID 2006 [35]; the performance of this system was average when compared with other systems at TRECVID that year. A tooltip feature which shows neighbouring keyframes and the transcript of a shot was added to this system to improve its performance. Overall, the only difference between the baseline and the recommendation system is the provision of keyframe recommendations (a).

5 Experimental methodology

In order to determine the usefulness of our approach, we carried out a user-centred evaluation of our system and approach. In this section we give details of our hypotheses, experimental methodology, and the video collection and tasks that were used.

5.1 Hypothesis

The goal of our evaluation was to investigate the effect of using community based feedback to aid search in a video search system. There are a number of potential benefits of our approach:

• The performance of the users of the system, in terms of precision of retrieved videos, will improve with the use of recommendations based on feedback.

• The users will be able to explore the collection to a greater extent, and also discover aspects of the topic that they may not have considered.

• The users will be more satisfied with the system that provides feedback, and also be more satisfied with the results of their search.

5.2 Collection and tasks

In order to determine the effects of implicit feedback, users were required to carry out a number of video search tasks based on the TRECVID 2006 collection and tasks [29]. The TRECVID evaluation meetings are an ongoing series of workshops focusing on a range of information retrieval research areas in content based retrieval of video. For our evaluation we focus on the interactive search task. Interactive search tasks involve the use of low level content based search techniques and feedback from users of the video search system.

For the interactive search task, users are given a specific query and a maximum of 15 min to find shots relevant to that query. Figure 4 shows an example query from TRECVID 2006. In interactive tasks the users can use a combination of text and shots to form an initial search for videos relevant to the query. Users can interact with the systems and examine the results, and can also re-formulate their queries and retrieve new results to continue their search. Users then mark as many shots as possible as being relevant to the topic. The shots that the participants or system marked as relevant are then compared with a set of relevant shots created based on pooling [33], to determine the accuracy and recall of the results. In 2006 there were 79,848 shots in the TRECVID test collection, and a total of 24 tasks.

Fig. 4 TRECVID 2006 task example

For our evaluation we limited the number of tasks that the users carried out to four. Although 24 tasks are carried out in the TRECVID evaluations, it was not practical for our evaluation to do this. User evaluations are very expensive in terms of time and cost. In our case, each user required 15 min per topic plus another 5–10 min to complete questionnaires, so it would not have been possible to carry out a systematic study with a large number of topics. In addition, the goals of this evaluation were not the same as those of TRECVID, and we felt it was not necessary to use all 24 tasks. For this evaluation we chose the four tasks for which the average precision at the 2006 TRECVID workshop was the worst; in essence, the most difficult tasks. These four tasks were chosen because, in general, they are tasks for which there are fewer relevant documents. Indeed, the mean average precision (MAP) values show that it is extremely difficult to find these documents. We feel that any improvement gained on these difficult tasks with few relevant documents will be reflected in less difficult tasks with larger numbers of relevant documents; the same cannot be said about gains made on easier tasks being borne out in more difficult ones. Moreover, due to the difficult nature of these topics, different users had to use different search queries, which ensures that users do not just follow other users' search trails (this is shown in subsequent sections). As can be seen below, there were very few relevant shots in the collection for these tasks, 98 out of 79,848 shots for one of the tasks. In addition, not all of the relevant shots have text associated with them. As the most popular form of search is search by textual query [5], finding these shots becomes even more difficult. The four tasks that were used for this evaluation were:

1. Find shots with a view of one or more tall buildings (more than 4 stories) and the top story visible (142 relevant shots, 53 with associated text)

2. Find shots with one or more soldiers, police, or guards escorting a prisoner (204 relevant shots, 106 with associated text)

3. Find shots of a group including at least four people dressed in suits, seated, and with at least one flag (446 relevant shots, 287 with associated text)

4. Find shots of a greeting by at least one kiss on the cheek (98 relevant shots, 74 with associated text).

The users were given the topic and a maximum of 15 min to find shots relevant to the topic. The users could only carry out text based queries, as this is the normal method of search in most online and desktop video retrieval systems and also the most popular search method at TRECVID [5]. The shots that were marked as relevant were then compared with the ground truth in the TRECVID collection.

5.3 Experimental design

For our evaluation we adopted 2-searcher-by-2-topic Latin Square designs. Each participant carried out two tasks using the baseline system and two tasks using the recommendation system. The order of system usage was varied, as was the order of the tasks; this was to avoid any order effect associated with the tasks or the systems. To determine the effect of adding more implicit actions to the implicit pool, participants in the experiment were placed in groups of four. For each group, the recommendation system used the implicit feedback from all of the previous users. At the beginning of the evaluation there was no pool of implicit actions; therefore the first group of four users received no recommendations, and their interactions formed the training set for the initial evaluations. Using this experimental model we can evaluate the effect of the implicit feedback within a group of participants, and also the effect of additional implicit feedback across the entire group of participants. In addition, the ground truth provided in the TRECVID 2006 collection allowed us to carry out analyses that we may not have been able to do with other collections. Each participant was given 5 min of training on each system and carried out a training task with each system. These training tasks were the tasks for which participants had performed the best at TRECVID 2006. For each participant, their interaction with the system was logged, the videos they marked as relevant were stored, and they also filled out a number of questionnaires at different stages of the experiment.
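A counterbalanced allocation of this kind could be generated as in the following sketch (a hypothetical illustration of the rotation idea; the paper does not give its actual assignment procedure):

```python
SYSTEMS = ["baseline", "recommendation"]
TASKS = ["T1", "T2", "T3", "T4"]

def assignment(participant_idx):
    """Alternate which system comes first and rotate the task order, so that
    systems and tasks are balanced across a group of four participants."""
    first = SYSTEMS[participant_idx % 2]
    second = SYSTEMS[(participant_idx + 1) % 2]
    rot = participant_idx % 4
    tasks = TASKS[rot:] + TASKS[:rot]
    return [(first, tasks[0]), (first, tasks[1]),
            (second, tasks[2]), (second, tasks[3])]

for p in range(4):    # one group of four participants
    print(p, assignment(p))
```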

5.4 Participants

24 participants took part in our evaluation and interacted with our two systems. The participants were mostly postgraduate students and research assistants. They consisted of 18 males and 6 females, with an average age of 25.2 years (median: 24.5) and an advanced proficiency in English. Participants were paid £10 for taking part in the experiment. Prior to the experiment, the participants were asked to fill out a questionnaire so that we could ascertain their proficiency with and experience of dealing with multimedia. We also asked participants about their knowledge of news stories, as the video collection which the participants would be dealing with consists mainly of news videos. It transpired that the participants follow news stories/events once or twice a week and also watch news stories online. The majority of participants deal with multimedia regularly (once or twice a day) and are quite familiar with creating multimedia data (images, videos). The participants also had a great deal of experience of searching for various types of multimedia. These activities were mainly carried out online, with Flickr, Google or YouTube cited as the most commonly used online services. The most common search strategy that users mentioned was searching for data using initial keywords and then adapting the query terms to narrow down the search results based on the initial results they get. Using the recommendations provided by some of these services was also mentioned by a number of users. Although the participants often searched for multimedia data, they stated that they rarely use multimedia management tools to organise their personal multimedia collections. The most common practice amongst the participants is creating directories and files on their own personal computer. Categorising videos and images according to their content and the time when the data was produced is the most popular method of managing media. However, when asked how a system could support their own search strategy, many participants mentioned that it would be helpful to sort or retrieve multimedia based on its semantic content. The following section outlines the results of our evaluation.

6 Results

6.1 Task performance

As we were using the TRECVID collection and tasks, we were able to calculate precision and recall (PR) values for all of the tasks. Figure 5 shows the P@N for the baseline and recommendation systems for varying values of N. P@N is the ratio between the number of relevant documents in the first N retrieved documents and N. The P@N value focuses on the quality of the top results, with less consideration of the recall of the system.

Figure 6 shows the MAP for the baseline and recommendation systems for different groups of users. Each group of four users had additional feedback from previous participants, which the preceding group of four users did not have. MAP is the average of the 11 fixed precision values of the PR metric, and is normally used as a simple and convenient comparison of system performance. The values in Figs. 5 and 6 were calculated using the evaluation tools provided by the organisers of TRECVID.
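For clarity, the two measures can be sketched as follows (our illustration, consistent with the descriptions above; the reported numbers were produced by the TRECVID tools, not by this code):

```python
def precision_at_n(ranked, relevant, n):
    """P@N: fraction of the first n retrieved shots that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n

def average_precision_11pt(ranked, relevant):
    """11-point interpolated average precision: the mean of the maximum
    precision attained at or beyond recall levels 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return 0.0
    hits, prec_at_recall = 0, []
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            prec_at_recall.append((hits / len(relevant), hits / i))
    return sum(max((p for r, p in prec_at_recall if r >= lvl), default=0.0)
               for lvl in (l / 10 for l in range(11))) / 11
```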

Figure 7 shows the average time in seconds that it takes a user to find the first relevant shot for both the baseline and the recommendation systems.

The results indicate that the system that uses recommendations outperforms the baseline system in terms of precision. It can be seen quite clearly from Fig. 5 that the shots returned by the recommendation system have a much higher precision over the first 5–30 shots than those returned by the baseline system. We verified that the difference between the two P@N values for values of N between 5 and 100 was statistically significant using a pairwise t-test (p = 0.0214, t = 3.3045). Over the next 100–2,000 shots the difference is negligible; however, it is unlikely that a user would view that number of shots, given that, over the entire trial, our 24 participants viewed a total of 3,034 shots (see Table 2), amounting to 24 h of video viewing. This demonstrates that the use of implicit feedback can improve the retrieval results of the system, and thus be of greater assistance to users.

Fig. 5 P@N for the baseline and recommendation systems for varying values of N

Fig. 6 Mean average precision (MAP) for baseline and recommendation systems for different groups of users

Fig. 7 Average time in seconds to find the first relevant shot for the baseline and recommendation systems

Table 2 Action type and the number of occurrences during the experiment

Action type                  Occurrences
Query                        1,083
Mark relevant                1,343
Mark maybe relevant          176
Mark not relevant            922
View                         3,034
Play (for more than 3 s)     7,598
Browse keyframes             814
Navigate within a video      3,794
Tooltip                      4,795
Total actions                23,559

Figure 6 shows that the MAP values of the shots the participants selected using the recommendation system are higher than the MAP values of the shots selected using the baseline system. We verified that the difference between the two sets of results was statistically significant using a pairwise t-test (p = 0.0028, t = 6.5623). The general trend is that the MAP values of the shots found using the recommendation system increase with the amount of training data used to populate the graph-based model. There is a slight dip for one group; however, this may be due to the small sample groups that we are using. The MAP value over all sessions for the baseline system was 0.0057, whereas for the recommendation system it was 0.0082. These results show that participants are at the same time finding related, new and diverse relevant shots in the data set.

However, these findings are not quite borne out by the recall values for the tasks. The recall for the tasks is quite low; in general, recall was low for all of the systems for all of the tasks at TRECVID 2006, where the main focus is on the precision values. While recall is an important aspect, we feel that it is more important that the users found accurate results and perceived that they had explored the collection, as they had found a heterogeneous set of results. While the results in Figs. 5 and 6 show that the users are seeing and finding more accurate results, this does not tell the full story. In a number of scenarios users will want to find just one result to satisfy their information need. Figure 7 shows that for three of the four tasks, users of the recommendation system found their first relevant result more quickly than users of the baseline system. The one task for which the baseline system outperforms the recommendation system is due to the actions of two users who did not use the recommendations. We do not know why these two users did not use the recommendations, as they did utilise the recommendations for the other task which they carried out using the recommendation system. A closer examination of the users who did use the recommendations found that three users found relevant shots in less than 1 min; none of the users of the baseline system managed to find relevant shots in less than a minute. Overall, the difference in values is not statistically significant, but a definite trend can be seen.

The results presented so far have shown that users do achieve more accurate results using the system that provides recommendations. We measured P@N and MAP values; it has been shown that the recommendation system outperforms the baseline system, and that this difference is statistically significant. It can be seen that overall the system that provides recommendations returns more accurate results to the user. As a result, the users interact with more relevant videos and find more accurate results. In addition, users find relevant videos more quickly using the recommendation system (see Fig. 7). This demonstrates the validity of our first assumption: that the performance of the users of the system, in terms of precision of retrieved videos, improves with the use of recommendations based on implicit feedback. In the following subsection we discuss user exploration of the collection in more detail.

6.2 User exploration

6.2.1 User interactions

As was outlined in our system description (Sect. 4), there are a number of ways in which participants could interact with our system. Once a participant enters an initial query in our system, all of the available tools may be used to browse or search the video collection. We begin our investigation of user exploration by briefly analysing these interactions.


Table 3 Number of graph elements in the graph after each group of four users

Users  Number of nodes (%)  Number of queries (%)  Number of edges (%)  Total graph elements (%)
1–4    1,001 (28.31)        115 (18.51)            2,505 (23.09)        3,621 (24.13)
1–8    1,752 (49.56)        258 (41.54)            4,645 (42.81)        6,655 (44.35)
1–12   2,488 (70.38)        388 (62.48)            7,013 (64.63)        9,989 (66.57)
1–16   3,009 (85.12)        452 (72.79)            8,463 (78)           11,924 (79.46)
1–20   3,313 (93.72)        550 (88.57)            9,868 (90.95)        13,731 (91.5)
1–24   3,535 (100)          621 (100)              10,850 (100)         15,006 (100)

Table 2 outlines how many times each available action was used across the entire experimental group. During the experiments, the participants entered 1,083 queries; many of these queries were unique. This indicates that the participants took a number of different approaches to the tasks, and that their actions were not determined by carrying out the same tasks. The figures in Table 2 also show that participants played shots quite often. However, if a video shot is selected it plays automatically in our system, which makes it more difficult to determine whether participants are playing the videos for additional information or whether the system is doing so automatically. To compensate for this, we only count a play action if a video plays for more than 3 s. Another feature that was widely used in our system was the tooltip feature. The tooltip highlighting functionality allowed the users to view neighbouring keyframes and associated text when moving the mouse over a keyframe. This meant that the participants could get context and a feel for the shot without actually having to play it. This feature was used on average 42.3 times (median: 38) per participant per task when viewing a static shot.

6.2.2 Analysis of interaction graph

In order to gain further insight into the user interactions, a number of different aspects of the interaction graph were analysed. In particular, we were interested in investigating changes in the graph structure as additional users used our system. These aspects include the number of nodes, the number of unique queries and the number of links present in the graph. Table 3 shows the results of this analysis. It can quite clearly be seen in Table 3 that the number of new interactions increases as the number of participants increases. The majority of nodes in our graph are video shots (the rest are query nodes); as the number of participants increases, so does the number of unique shots that have been viewed. On further investigation of the graph and logs it was found that, overall, 49% of documents selected by users 1–12 were selected by at least one user among users 13–24. Users 1–12 clicked 1,050 unique documents, whereas users 13–24 clicked 596 unique documents. Also, users 1–12 produced 1,737 clicks, whereas users 13–24 produced 1,024. This can be interpreted as indicating that users 13–24 were satisfied more quickly than users 1–12. It was also found that the number of unique queries increases with additional users. These results give an indication that later participants are not just using the recommendations to mark relevant videos, but are also interacting with further new and unique shots.

6.2.3 Top retrieved videos

Figure 8 shows the probability that a particular shot is relevant plotted against a relevance value that is assigned to that document from our graph representation. The relevance value on the x-axis thus represents the sum of the weights of all of the edges leading to a particular node. The average interaction value was just 1.23, with irrelevant documents having an average value of 1.13 and relevant documents having an average of 2.94. This result is encouraging as it shows that relevant documents do receive more interaction from the users of the system. It can be seen that, up until a certain point, as the interactions from previous users increase so does the probability of the document being relevant. It was also found that for some of the documents with higher relevance values the probability tails off slightly. Further investigation found that there were two main reasons that a number of irrelevant documents had high relevance values. Firstly, there were shots that

Fig. 8 Probability of a document being relevant given a certain level of interaction. The y-axis represents the probability that the video is relevant and the x-axis represents the assigned interaction value in our graph




seemed relevant at first glance but upon further investigation were not relevant; however, for participants to investigate this required some interaction with the shot, thus giving it a high interaction value. Secondly, there were a number of shots that appeared at the top of the most common queries before any recommendations were given, thus increasing the chances of participants interacting with those videos. It should also be noted that on average only 5.49% of nodes in the graph relate to relevant shots. This indicates that users are exploring and interacting with large portions of the collection that are not relevant in order to help them find relevant shots. However, even with this kind of sparse and difficult data, the performance of the users is improved by the recommendations presented to them. It was found that as the amount of information in the graph increased so did the proportion of recommendations selected by users; users 5–8 selected 9.77% of the recommendations, whereas users 21–24 selected 18.67% of the recommendations.
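As a hedged illustration of how the x-axis value of Fig. 8 can be derived, the sketch below sums the weights of all edges entering a node, assuming the interaction graph is stored as a weighted edge list (the storage format is an assumption, not the paper's implementation):

from collections import defaultdict

def node_interaction_values(edges):
    """Sum incoming edge weights per node: the plotted relevance value."""
    value = defaultdict(float)
    for _source, target, weight in edges:
        value[target] += weight
    return value

# Hypothetical edges: query and shot nodes linked by weighted interactions.
edges = [("q:kiss", "shot_7", 1.0), ("shot_3", "shot_7", 0.5),
         ("q:kiss", "shot_9", 1.0)]
print(dict(node_interaction_values(edges)))  # {'shot_7': 1.5, 'shot_9': 1.0}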

6.2.4 Text queries

In both the baseline and recommendation systems the participants were presented with query expansion terms that they could use to enhance their queries. We found, however, that the majority of participants chose not to use the query expansion terms provided by the baseline system as they found them confusing. The query terms returned by the baseline system were stemmed and normalised, and hence were not in the written form that users expected them to be, whereas the queries recommended by the recommendation system were queries that previous users had used. One participant stated that “The query expansion terms did not have any meaning.” Another participant said that the “query expansion did not focus on real search task”. This can be explained in part by specificities of some of the chosen topics; for example, in Task 1, when a user enters the name of a city (“New York”) to get a shot of the city's skyline, the query expansion terms did not help to specify the search query. The top five queries for each topic are presented in Table 4. The top 15 unique terms across all four tasks are shown in Table 5. Across all 24 users a number of terms were repeated extensively. There were 130 unique terms, and combinations of these were used to create a number of unique queries; on average the participants used 2.21 terms per query. However, it can be seen that across the 24 users and 4 topics there is relatively little repetition of the exact same queries: there were 621 unique queries out of 1,083 total queries (57%). In fact only 4 queries occurred 10 times or more, and they were all for the same task. This task had fewer facets than the others, and thus there was less scope for the users to use different search terms. This shows that despite the fact that users were carrying out the same tasks they were searching in differing ways, as the search tasks are multi-faceted and the participants are providing their own context.

Table 4 Five most popular queries for each topic

Task 1: City (9), Building (8), New York (8), Tall buildings, Tower (7)
Task 2: Jail (5), Prisoner guards (4), Prisoner (4), Police (4), Prisoner escorted (3)
Task 3: Flag (8), Meeting flag (7), Conference (5), Meeting (5), Group flag (5)
Task 4: Kiss (22), Greeting kiss (20), Greeting (10), Kiss cheek (10), Cheek (6)

Table 5 Fifteen most commonly used keywords across all four tasks

Kiss (175), Flag (101), Prisoner (100), Greeting (72), Building (62), Police (60), People (50), Tall (48), Cheek (45), Meeting (39), City (37), Soldier (36), Greet (34), Four (31), Buildings (31)

The results in this section indicate that the users explore the collection to a greater extent using the recommendations. Users of the system did not merely interact with videos that the previous users had interacted with, but instead could see what previous users had done and explore new video shots. Nodes were added to the graph of implicit actions throughout the evaluation (see Table 3). Also there was very little query repetition, and newer users used new and




diverse query terms. These results give an indication that further participants are not just using the recommendations to mark relevant videos, but are also interacting with further shots. These results also give an indication that we are achieving the second benefit of our approach: that users will be able to explore the collection to a greater extent, and also discover aspects of the topic that they may not have considered. However, this finding has not been fully validated. In order to do this we must analyse the users' perceptions of the tasks; this analysis is presented in the following section.
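The query-diversity statistics reported above (unique-query ratio and mean terms per query) can be reproduced from a raw query log with a few lines of Python; this is an illustrative sketch, not the evaluation scripts:

def query_stats(queries):
    """Return (unique-query ratio, mean terms per query)."""
    unique = {q.strip().lower() for q in queries}
    mean_terms = sum(len(q.split()) for q in queries) / len(queries)
    return len(unique) / len(queries), mean_terms

# Hypothetical log fragment.
log = ["kiss", "greeting kiss", "kiss cheek", "greeting kiss"]
ratio, terms = query_stats(log)
print(f"unique-query ratio {ratio:.0%}, {terms:.2f} terms per query")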

6.3 User perceptions

In order to provide further validation for our second hypothesis, that “the users will be able to explore the collection to a greater extent, and also discover aspects of the topic that they may not have considered”, and to validate our third hypothesis, that “the users will be more satisfied with the system that provides feedback, and also be more satisfied with the results of their search”, we analysed the post-task and post-experiment questionnaires that our participants filled out.

6.3.1 Search tasks

To begin with, we wished to gain insight into the participants' perceptions of the two systems and also of the tasks that they had carried out. In the post-search questionnaires, we asked subjects to complete four 5-point semantic differentials indicating their responses to the attitude statement: “The search we asked you to perform was”. The paired stimuli offered as responses were: “relaxing”/“stressful”, “interesting”/“boring”, “restful”/“tiring” and “easy”/“difficult”. The average obtained differential values are shown in Table 6 for each system. Each cell in Table 6 represents the responses of 20 participants (the four participants in the initial training set were not included as they did not use the recommendation system). The most positive response across all system and task combinations for each differential pair is marked with an asterisk.

The trends in Table 6 indicate that the users gave more positive responses for the recommendation system. It was found that the participants perceived some tasks as more easy, relaxing, restful and interesting than others. It can also be seen in Table 6 that there is a slight preference towards the system that provides recommendations amongst the participants. We applied two-way analysis of variance (ANOVA) to each differential across both systems and the four tasks. We found that how easy and relaxing the participants found the tasks was more system dependent (p < 0.14 and p < 0.134 respectively for the significance of the system), whereas the user interest in the task was more dependent on the task that they were carrying out (p < 0.194 for the significance of the system).
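For readers wishing to replicate this analysis, a two-way ANOVA of a differential against system and task can be run as sketched below; the data frame layout and column names are assumptions, and the snippet requires pandas and statsmodels:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical responses: one row per participant/task/system combination.
df = pd.DataFrame({
    "response": [2, 3, 4, 3, 2, 4, 3, 5],
    "system":   ["baseline", "recommend"] * 4,
    "task":     ["t1", "t1", "t2", "t2", "t3", "t3", "t4", "t4"],
})
model = ols("response ~ C(system) + C(task)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p values per factor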

Table 6 Perceptions of search process for each system (higher = better; most positive response marked with *)

Differential   Baseline   Recommendation
Easy           1.9        2.65*
Restful        2.7*       2.575
Relaxing       2.725      3.175*
Interesting    2.325      2.75*

6.3.2 Retrieved videos

In the post-task questionnaires we also solicited subjects' opinions on the videos that were returned by the system. We wanted to discover whether participants explored the video collection more based on the recommendations, or whether the recommendations in fact narrowed their focus in the achievement of their tasks. The following 5-point Likert scales and semantic differentials were used [21]:

• “During the search I have discovered more aspects of the topic than initially anticipated” (Change 1)
• “The video(s) I chose in the end match what I had in mind before starting the search” (Change 2)
• “My idea of what videos and terms were relevant changed throughout the task” (Change 3)
• “I believe I have seen all possible videos that satisfy my requirement” (Breadth)
• “I am satisfied with my search results” (Satisfaction)
• Semantic differentials: The videos I have received through the searches were: “relevant”/“irrelevant”, “appropriate”/“inappropriate”, “complete”/“incomplete”, “surprising”/“expected”.

Table 7 shows the average responses for each of these scales and differentials, using the labels given after each of the Likert scales in the bulleted list above, for each system. The values for the four semantic differentials are included at the bottom of the table. The most positive response across all system and task combinations is marked with an asterisk.

The general trends that can be seen in Table 7 show that the users gave more positive responses for the recommendation system. It appears that participants have a better perception of the video shots that they found during their tasks using the recommendation system. It also appears that the participants believe more strongly that this system changed their perception of the task and presented them with more options. This would back up the findings in the previous section that the participants explored the collection to a greater extent when presented with the recommendations. We applied two-way analysis of variance (ANOVA) to each differential across both systems and the four tasks to test these assertions. The initial ideas that the participants had about




Table 7 Perceptions of the retrieval tasks for each system (higher = better; most positive response marked with *)

Differential   Baseline   Recommendation
Change 1       3.1        3.5*
Change 2       3.475      3.725*
Change 3       2.725      3.05*
Breadth        2.625      3.075*
Satisfaction   2.95       3.4*
Relevant       1.925      2.55*
Appropriate    3.125      3.775*
Complete       2.225      2.5*
Surprising     1.55       1.725*

relevant shots were dependent on the task (p < 0.019 for the significance of the task). The changes in their perceptions were more dependent on the system that they used rather than the task, as was the participants' belief that they had found relevant shots through the searches (p < 0.217 for the significance of the system). This demonstrates that the recommendation system helped the users to explore the collection to a greater extent, and also indicates that the users have a preference for the recommendation system. This finding strengthens the argument that our recommendation approach is providing benefits in terms of exploration and user perception.

6.3.3 System support

We also wanted to determine the participants' opinions about the systems' support of their retrieval tasks. Firstly we asked them if the system had helped them to complete their task (Satisfied). Participants were then asked to complete a further five 5-point Likert scales indicating their responses to the following statement: “The system helped me to complete my task because…”. The criteria of the responses were:

• “explore the collection” (Explore)
• “find relevant videos” (Relevant)
• “detect and express different aspects of the task” (Different)
• “focus my search” (Focus)
• “find videos that I would not have otherwise considered” (Consider).

Table 8 presents the average responses for each of these scales, using the labels given after each of the Likert scales above, for each system. The most positive response is marked with an asterisk. Some of the scales were inverted in the questionnaires to reduce bias; these were recoded so that higher values are always more positive.
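The recoding of inverted items can be done as in the brief sketch below (assuming 5-point scales; this illustrates the standard reverse-scoring step, not the authors' scripts):

def recode(score, inverted, scale_max=5):
    """Reverse-score inverted Likert items so higher is always better."""
    return (scale_max + 1 - score) if inverted else score

# Hypothetical responses as (raw score, item was inverted?) pairs.
responses = [(4, False), (2, True), (5, False)]
normalised = [recode(s, inv) for s, inv in responses]
print(sum(normalised) / len(normalised))  # mean on a consistent scale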

Once again it appears that participants have a better perception of the video shots that they found during their tasks

Table 8 Perceptions of system support for each system (higher = better; most positive response marked with *)

Differential   Baseline   Recommendation
Satisfied      3.3        3.6*
Explore        3.775      3.9*
Relevant       3          3.4*
Different      2.925      3.275*
Focus          2.625      3.25*
Consider       3.075      3.375*

using the system with recommendations, and that they believe this system helped them to explore the collection of shots more thoroughly. We applied two-way analysis of variance (ANOVA) to each differential across both systems and the four tasks to test our hypotheses; none of the dependencies were significant. From our analysis of the results, however, there is a trend that the focus of the search, the ability to express different aspects of the task and the change in videos considered are more dependent on the task rather than on the system.

6.3.4 Ranking of systems

After completing all of the tasks and having used both systems, we attempted to discover whether the participants preferred the system that provided recommendations or the system that did not. The participants were asked to complete an exit questionnaire in which they indicated which system they preferred for particular aspects of the task; they could also indicate that they found no difference between the systems. The participants were asked, “Which of the systems did you…”:

• “find best overall” (Best)
• “find easier to learn to use” (Learn)
• “find easier to use” (Easier)
• “prefer” (Prefer)
• “find changed your perception of the task” (Perception)
• “find more effective for the tasks you performed” (Effective).

The users were also given some space where they could provide any feedback on the system that they felt might be useful.

It can be seen clearly in Table 9 that the participants had a preference for the system that provided the recommendations. It is also encouraging that the participants found there to be no major difference in the effort and time required to learn how to use the recommendations provided by the recommendation system. This indicates that users were more satisfied with the system that provides recommendations, thus realising the third goal of our hypothesis: that users will be more satisfied with the system that provides feedback. Users




Table 9 Users' preferences for the two different systems

Differential   Baseline   Recommendation   No difference
Best           2          16               1
Learn          2          7                11
Easier         2          5                13
Prefer         1          17               2
Perception     3          11               6
Effective      3          14               3

have a definite preference for the recommendation system: 17 out of 20 users preferred the recommendation system, with only one user preferring the baseline system. The participants also indicated in their post-task questionnaires that the system that provided recommendations helped them to explore the task and find aspects of the task that they otherwise would not have considered, in comparison with the baseline system.

The results of our analysis have addressed all of the points of our hypotheses and have demonstrated that we have achieved our goals. The following section provides a discussion of some follow-up evaluations that were carried out to help validate and expand our results and findings.

6.4 Follow-up evaluation

In order to expand on some of our results and to alleviate any doubts surrounding the validity of our approach, we performed a follow-up evaluation. The goal of this evaluation was to validate our approach using related but not identical tasks. For this evaluation we used the same two systems that have been described earlier in this paper, the same dataset and the same experimental methodology; however, we used four different tasks. Two of these tasks were related to tasks that had been carried out in the first evaluation, whereas the two further tasks were not. The four tasks were:

• Find shots of one or more people entering or leaving a vehicle (Task 5, not related)

• Find shots of a natural scene – with, for example, fields, trees, sky, lake, mountain, rocks, rivers, beach, ocean, grass, sunset, waterfall, animals or people; but no buildings, no roads, no vehicles (Task 6, not related)

• Find shots of a skyline with tall buildings visible (Task 7, related)

• Find shots of a greeting between two or more people (Task 8, related).

Some of these tasks were from TRECVID 2006 and some were not, so we cannot perform all of the same evaluations that we carried out for the main user-centred evaluation presented in this paper, as we do not have the ground truth data that was available for the initial evaluation. However, we can get an indication of user task performance and perceptions when users are not repeating the same tasks. The pool of implicit actions from the previous experiment was used to provide recommendations for this evaluation. Three independent assessors judged the shots that were marked as relevant, so that we could perform some analysis. Even though we have a set of relevant shots for the tasks that were from TRECVID, the assessments of shots were performed on all four tasks, in order to maintain consistency. Four users carried out the new evaluation, as this evaluation was intended to validate the findings to date and not to re-test the hypotheses.

After the experiment was completed it was found that for the two related tasks the users retrieved more video shots using the recommendation system than with the baseline system. For the unrelated tasks the participants retrieved slightly fewer videos with the recommendation system; however, the difference was not significant. In terms of precision, for one of the related tasks the precision of the results increased three-fold using the recommendation system; for the second related task the precision was slightly lower, and in this case the difference was not significant. In terms of the unrelated tasks, the precision was greater for one of the tasks with the recommendation system and lower for the other; again, the differences were not significant. The participants indicated in their post-task questionnaires that the system that provided recommendations helped them to explore the task and find aspects of the task that they otherwise would not have considered. All of the participants had a preference for the recommendation system.

Some of the variations in these results may be due to using such a small sample of users, but overall the trends support the conclusions found in the first evaluation, without repeating the same tasks. It appears that overall the use of recommendations does not hinder performance on unrelated tasks, while still helping users with related but not identical tasks. These findings once again support the hypotheses that were made at the beginning of this paper.

7 Conclusions

In this paper, we have presented a novel approach for combining implicit and explicit feedback from previous users to inform and help users of a video search system. The recommendations provided are based on user actions and on the previous interaction pool. This approach has been realised in a video retrieval system, and this system was tested by a pool of users on complex and difficult search tasks. There are a number of conclusions that can be made about using community based implicit feedback to provide recommendations. For the results of task performance, i.e. whether users retrieve more videos that match their search task, we measured P@N




and MAP values. It was shown that the recommendation system outperforms the baseline system, in that the users of the recommendation system retrieve more accurate results overall, and that this difference is statistically significant. It was also seen that users find relevant results more quickly using the recommendation system. These results validate our first hypothesis, that the performance of users of the recommendation system will improve with the use of recommendations based on implicit feedback. The statistics presented on user exploration show that the users pursued the tasks in sufficiently different ways. They were able to explore the collection to a greater extent and find more relevant videos. Nodes were added to the graph of implicit actions throughout the evaluation, indicating that users are not just using the same queries and marking the same results, but are exploring new parts of the collection (see Table 3). These results give an indication that further participants are not just using the recommendations to mark relevant videos, but are also interacting with further shots. This also indicates that we have achieved our second goal: that users will be able to explore the collection to a greater extent, and also discover aspects of the topic that they may not have considered. In addition to demonstrating the validity of our second hypothesis, these findings also illustrate the validity of our approach and experimental methodology. The tasks that were chosen for the experiment were multi-faceted and ambiguous. As the tasks are multi-faceted, we believed that participants would carry out their searches in differing ways and use a number of different query terms and methodologies, thus providing their own context. This belief has been demonstrated by these findings. The second hypothesis was further validated by our analysis of user perceptions of the system, where the users gave an indication that the recommendation system helped them to explore the collection. The participants indicated in their post-task questionnaires that the system that provided recommendations helped them to explore the task and find aspects of the task that they otherwise would not have considered, in comparison with the baseline system. It was also shown that the users have a definite preference for the recommendation system: 17 out of 20 users preferred the recommendation system, while one user preferred the baseline system. These findings indicate that we have achieved the third goal of our hypothesis: that users will be more satisfied with the system that provides feedback, and also be more satisfied with the results of their search. These results successfully demonstrate the potential of using this implicit feedback to aid multimedia search, and show that this area deserves further investigation to be fully developed. In addition to our main evaluation, a follow-up evaluation was conducted. In this evaluation we asked users to once again use our baseline and recommendation systems, using the same graph of implicit actions, but performing two tasks related to the tasks from the original evaluation and two unrelated tasks. It was found that even for the unrelated tasks users still preferred the recommendation system, and that the recommendations could still potentially aid users in their tasks.
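For completeness, the two task-performance measures named above can be computed as in the following sketch (standard definitions of P@N and average precision, whose mean over tasks gives MAP; this is not the evaluation code used in the study):

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked documents that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n

def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranked = ["d1", "d2", "d3", "d4"]   # hypothetical ranked shot list
relevant = {"d1", "d3"}             # hypothetical ground truth
print(precision_at_n(ranked, relevant, 3))  # 2/3 ≈ 0.667
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 2 ≈ 0.833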

In conclusion, our results have demonstrated the huge potential of using a collection of implicit actions from a community to help relieve some of the major problems that ordinary users have while searching for multimedia, thus presenting a potentially significant stride towards bridging the semantic gap [17].

Acknowledgments This research work was partially supported by the EC under contracts: K-Space (FP6-027026) and SEMEDIA (FP6-045032).

References

1. Adcock, J., Pickens, J., Cooper, M., Anthony, L., Chen, F., Qvarfordt, P.: FXPAL interactive search experiments for TRECVID 2007. Paper presented at the 5th TREC Video Retrieval Evaluation (TRECVID) workshop, Gaithersburg, Maryland, 5–6 November 2007

2. Bentley, F., Metcalf, C., Harboe, G.: Personal vs. commercial content: the similarities between consumer use of photos and music. In: Proc. SIGCHI Conference on Human Factors in Computing Systems, Montréal, Québec, Canada, pp. 667–676 (2006)

3. Chang, S.-F., Manmatha, R., Chua, T.-S.: Combining text and audio-visual features in video indexing. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, USA, pp. 1005–1008 (2005)

4. Christel, M.G., Conescu, R.M.: Mining novice user activity in TRECVID interactive retrieval tasks. In: Proc. International Conference on Image and Video Retrieval, Tempe, Arizona, USA, pp. 21–30 (2006)

5. Christel, M.G.: Establishing the utility of non-text search for news video retrieval with real world users. In: Proc. ACM Multimedia, University of Augsburg, Germany, pp. 707–716 (2007)

6. Craswell, N., Szummer, M.: Random walks on the click graph. In: Proc. Annual International ACM SIGIR Conference, Amsterdam, pp. 239–246 (2007)

7. Foley, E., Gurrin, C., Jones, G., Lee, H., McGivney, S., O'Connor, N.E., Sav, S., Smeaton, A.F., Wilkins, P.: TRECVid 2005 experiments at Dublin City University. Paper presented at the 3rd TREC Video Retrieval Evaluation (TRECVID) workshop, Gaithersburg, Maryland, 14–15 November 2005

8. Freyne, J., Farzan, R., Brusilovsky, P., Smyth, B., Coyle, M.: Collecting community wisdom: integrating social search and social browsing. In: Proc. International Conference on Intelligent User Interfaces, Honolulu, Hawaii, pp. 52–61 (2007)

9. Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to weave an information tapestry. Commun. ACM 35(12), 61–70 (1992)

10. Golder, S.A., Huberman, B.A.: Usage patterns in collaborative tagging systems. J. Inform. Sci. 32(2), 198–208 (2006)

11. Halvey, M., Keane, M.T.: Analysis of online video search and sharing. In: Proc. ACM Conference on Hypertext and Hypermedia, Manchester, UK, pp. 217–226 (2007)

12. Hancock-Beaulieu, M., Walker, S.: An evaluation of automatic query expansion in an online library catalogue. J. Doc. 48(4), 406–421 (1992)

13. Heesch, D., Howarth, P., Magalhaes, J., May, A., Pickering, M., Yavlinski, A., Rueger, S.: Video retrieval using search and browsing. Paper presented at the 13th Text REtrieval Conference, Gaithersburg, Maryland, 16–19 November 2004

14. Hopfgartner, F., Urban, J., Villa, R., Jose, J.: Simulated testing of an adaptive multimedia information retrieval system. In: Proc. International Workshop on Content-Based Multimedia Indexing, Bordeaux, France, pp. 328–335 (2007)

15. Hopfgartner, F.: Understanding Video Retrieval. VDM Verlag (2007)

16. Huang, C.-W.: Automatic closed caption alignment based on speech recognition transcripts. Technical Report, Columbia University (2003)

17. Jaimes, A., Christel, M., Gilles, S., Ramesh, S., Ma, W.-Y.: Multimedia information retrieval: what is it, and why isn't anyone using it? In: Proc. ACM SIGMM International Workshop on Multimedia Information Retrieval, Singapore, pp. 3–8 (2005)

18. Kelly, D., Teevan, J.: Implicit feedback for inferring user preference: a bibliography. SIGIR Forum 37(2), 18–28 (2003)

19. Kirk, D., Sellen, A., Rother, C., Wood, K.: Understanding photowork. In: Proc. SIGCHI Conference on Human Factors in Computing Systems, Montréal, Québec, Canada, pp. 761–770 (2006)

20. Kirk, D., Sellen, A., Harper, R., Wood, K.: Understanding videowork. In: Proc. SIGCHI Conference on Human Factors in Computing Systems, San Jose, California, USA, pp. 61–70 (2007)

21. Likert, R.: A technique for the measurement of attitudes. Arch. Psychol. 140, 1–55 (1932)

22. Mei, T., Hua, X.-S., Yang, L., Yang, S.-Q., Li, S.: VideoReach: an online video recommendation system. In: Proc. Annual International ACM SIGIR Conference, Amsterdam, pp. 767–768 (2007)

23. Naphade, M., Smith, J.R., Tesic, J., Chang, S.-F., Hsu, W., Kennedy, L., Hauptmann, A., Curtis, J.: Large-scale concept ontology for multimedia. IEEE MultiMed. 13(3), 86–91 (2006)

24. National Institute of Standards and Technology: NIST TREC Video Retrieval Evaluation online proceedings. http://www-nlpir.nist.gov/projects/tvpubs/tv/pubs.org.html

25. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: an open architecture for collaborative filtering of netnews. In: Proc. ACM Conference on Computer Supported Cooperative Work, pp. 175–186 (1994)

26. Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M., Gatford, M.: Okapi at TREC-3. Paper presented at the 3rd Text REtrieval Conference, Gaithersburg, Maryland (1994)

27. Salton, G., Buckley, C.: Improving retrieval performance by relevance feedback. In: Readings in Information Retrieval, pp. 355–364. Morgan Kaufmann, San Francisco (1997)

28. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating word of mouth. In: Proc. SIGCHI Conference on Human Factors in Computing Systems, Denver, Colorado, USA, pp. 210–217 (1995)

29. Smeaton, A.F., Over, P., Kraaij, W.: Evaluation campaigns and TRECVid. In: Proc. ACM SIGMM International Workshop on Multimedia Information Retrieval, Santa Barbara, California, USA, pp. 321–330 (2006)

30. Smyth, B., Balfe, E., Freyne, J., Briggs, P., Coyle, M., Boydell, O.: Exploiting query repetition and regularity in an adaptive community-based web search engine. User Model. User-Adapt. Interact. 14(5), 383–423 (2004)

31. Snoek, C., Worring, M., Koelma, D., Smeulders, A.: Learned lexicon-driven interactive video retrieval. In: Proc. International Conference on Image and Video Retrieval, Tempe, Arizona, USA, pp. 11–20 (2006)

32. Snoek, C.G., Worring, M., Koelma, D.C., Smeulders, A.W.M.: A learned lexicon-driven paradigm for interactive video retrieval. IEEE Trans. Multimed. 9(2), 280–292 (2007)

33. Sparck Jones, K., Van Rijsbergen, C.: Report on the need for and provision of an “ideal” information retrieval test collection. British Library Research and Development Report 5266, Computer Laboratory, University of Cambridge (1975)

34. Spink, A., Greisdorf, H., Bateman, J.: From highly relevant to not relevant: examining different regions of relevance. Inf. Process. Manage. 34(5), 599–621 (1998)

35. Urban, J., Hilaire, X., Hopfgartner, F., Villa, R., Jose, J., Chantamunee, S., Gotoh, Y.: Glasgow University at TRECVid 2006. Paper presented at the 4th TREC Video Retrieval Evaluation (TRECVID) workshop, Gaithersburg, Maryland, 13–14 November 2006

36. Wexelblat, A., Maes, P.: Footprints: history-rich tools for information foraging. In: Proc. SIGCHI Conference on Human Factors in Computing Systems, Pittsburgh, PA, USA, pp. 270–277 (1999)

37. White, R., Bilenko, M., Cucerzan, S.: Studying the use of popular destinations to enhance Web search interaction. In: Proc. Annual International ACM SIGIR Conference, Amsterdam, pp. 159–166 (2007)

38. Yang, B., Mei, T., Hua, X.-S., Yang, L., Yang, S.-Q., Li, M.: Online video recommendation based on multimodal fusion and relevance feedback. In: Proc. ACM International Conference on Image and Video Retrieval, Amsterdam, pp. 73–80 (2007)
