When textual and visual information join forces for multimedia retrieval

Bahjat Safadi, Mathilde Sahuguet, Benoit Huet
EURECOM, Multimedia Department, Sophia Antipolis, France

description

Popular search engines currently retrieve documents on the basis of text information, and integrating visual information with text-based search for video and image retrieval is still an active research topic. In this paper, we propose and evaluate a video search framework that uses visual information to enrich classic text-based search for video retrieval. The framework extends conventional text-based search by fusing text and visual scores, obtained from video subtitles (or automatic speech recognition) and visual concept detectors respectively. We attempt to overcome the so-called semantic gap by automatically mapping query text to semantic concepts. With the proposed framework, we show experimentally, on a set of real-world scenarios, that visual cues can effectively improve the quality of video retrieval. Experimental results show that mapping text-based queries to visual concepts improves the performance of the search system. Moreover, when the relevant visual concepts for a query are appropriately selected, a very significant improvement in the system's performance is achieved.

Transcript of When textual and visual information join forces for multimedia retrieval

Page 1: When textual and visual information join forces for multimedia retrieval

When Textual and Visual Information Join Forces for MultiMedia Retrieval

Bahjat Safadi, Mathilde Sahuguet, Benoit Huet
EURECOM, Multimedia Department
Sophia Antipolis, France

Page 2: When textual and visual information join forces for multimedia retrieval

Introduction

• The EU alone hosts 500+ online video platforms
• 42.7m hrs of footage in online archives of broadcasters and producers (61% of archive footage is online)
• User-generated content (UGC) is on the rise:
  – YouTube receives 60 hrs of video per minute
  – Vine and Instagram video
• Internet video is now 40 percent of consumer Internet traffic, and will reach 62 percent by the end of 2015 and 75 percent in 2017 (source: Cisco)
• How to make the content accessible?
  – Browsing, Searching, Hyperlinking

B Huet - Eurecom - BAMMF - p 2 - 20/06/2014

Page 3: When textual and visual information join forces for multimedia retrieval

Objectives and Contributions

• We propose and evaluate a video search framework that uses visual information to enrich classic text-based search for video retrieval, operating at the fragment level.
• We investigate two questions:
  – To what extent can visual concepts contribute information when retrieving videos?
  – How can we cope with the confidence in visual concept detection?
• The framework extends conventional text-based search by fusing textual and visual scores.
• We address both the semantic and intention gaps:
  – by automatically mapping the query text to semantic concepts;
  – with the addition of "visual cues".

Page 4: When textual and visual information join forces for multimedia retrieval

MediaEval Search & Hyperlinking

• Information seeking in a video dataset: retrieving media fragments/anchors

Page 5: When textual and visual information join forces for multimedia retrieval

The Video Archive

2323 BBC videos of different genres (440 programs)
• ~1697 h of video + audio
• Subtitles (manual)
• Two ASR transcripts (LIMSI, LIUM)
• Metadata (title, cast, description, ...)
• Shot boundaries and key-frames
• Search: 50 queries from 29 users
  – Textual query + visual cues
• Face detection
• Concept detection

Page 6: When textual and visual information join forces for multimedia retrieval

The Video Archive


Text query: Medieval history of why castles were first built
Visual cues: Castle

Text query: Best players of all time; Embarrassing England performances; Wake up call for English football; Wembley massacre
Visual cues: Poor camera quality; heavy looking football; unusual goal celebrations; unusual crowd reactions; dark; grey; overcast; black and white

Page 7: When textual and visual information join forces for multimedia retrieval

The proposed Framework

[Figure: framework diagram. Off-line: the collection (videos, scenes and subtitles) goes through content-based indexing with visual semantic concepts to produce concept indexing scores, and scenes + subtitles go through Lucene indexing to produce text-based scores. On-line: user querying with a textual/visual query (textual query and visual cues); the selected concepts give visual-based scores; fusion and ranking produce the ranked list.]

Page 8: When textual and visual information join forces for multimedia retrieval

The proposed Framework

[Figure: off-line part of the framework diagram. The collection (videos, scenes and subtitles) goes through content-based indexing with visual semantic concepts to produce concept indexing scores for the scenes.]

• No training data for visual concepts
• Use 151 visual concept detectors trained on TRECVID 2012 data
• Unknown performance

Page 9: When textual and visual information join forces for multimedia retrieval

Visual concept detector confidence (w)

• 100 top images for the concept "Animal"
• 58 out of 100 are manually evaluated as valid
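One plausible reading of this slide is that the confidence w of a detector is the fraction of its top-ranked images that a human judged valid, i.e. precision over the inspected list. The slides do not state the formula, so the sketch below is an assumption, and the function name `detector_confidence` is hypothetical.

```python
def detector_confidence(judgements):
    """Fraction of inspected top-ranked images judged valid.

    Assumption: the slide's confidence w is precision over the inspected list.
    """
    if not judgements:
        raise ValueError("need at least one judged image")
    return sum(1 for valid in judgements if valid) / len(judgements)


# Slide example: 58 of the 100 top images for "Animal" were judged valid.
animal_judgements = [True] * 58 + [False] * 42
w_animal = detector_confidence(animal_judgements)
print(w_animal)  # 0.58
```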


Page 10: When textual and visual information join forces for multimedia retrieval

The proposed Framework

[Figure: framework diagram, highlighting the user querying step, where the user submits a textual/visual query.]

<queryText>Children out on poetry trip Exploration of poetry by school children Poem writing</queryText> <visualCues>House memories Farm exploration A poem on animal and shells </visualCues>

Users are not aware of visual concepts
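As a rough illustration of how such a query could be split into its textual part and its visual cues, the sketch below parses the fragment shown on the slide with Python's standard `xml.etree.ElementTree`. The wrapping `<query>` root element and the function name `parse_query` are assumptions made so that the two sibling elements form well-formed XML; the actual file layout used in the task is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Query fragment copied from the slide.
fragment = (
    "<queryText>Children out on poetry trip Exploration of poetry by school "
    "children Poem writing</queryText> "
    "<visualCues>House memories Farm exploration A poem on animal and shells "
    "</visualCues>"
)


def parse_query(xml_fragment):
    """Return (query_text, visual_cues) from a queryText/visualCues fragment."""
    # Hypothetical wrapper element: the fragment alone has two root elements.
    root = ET.fromstring(f"<query>{xml_fragment}</query>")
    query_text = (root.findtext("queryText") or "").strip()
    visual_cues = (root.findtext("visualCues") or "").strip()
    return query_text, visual_cues


query_text, visual_cues = parse_query(fragment)
print(visual_cues)
```

The free-text cues still have to be mapped to detector names, since users do not know which visual concepts exist.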

Page 11: When textual and visual information join forces for multimedia retrieval

Mapping visual cues to visual concepts

• <queryText>Children out on poetry trip Exploration of poetry by school children Poem writing</queryText> <visualCues>House memories Farm exploration A poem on animal and shells </visualCues>

[Figure: WordNet mapping from keywords (Farm, Shells, Exploration, Poem, Animal, House, Memories) to visual concepts (Animal, Birds, Insect, Cattle, Dogs, Building, School, Church, Flags, Mountain).]

Page 12: When textual and visual information join forces for multimedia retrieval

Mapping visual cues to visual concepts

• Concepts mapped to the visual query "Castle"
• Semantic similarity computed using the "Lin" measure

Concept   Windows   Plant    Court    Church   Building
β         0.4533    0.4582   0.5115   0.6123   0.701
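The selection step can be sketched as a simple threshold on these similarity values. The snippet assumes that the β values are the similarities between the cue "Castle" and each concept, and that a concept is kept when its similarity is at least the threshold θ used in the evaluation slides; the function name `select_concepts` is hypothetical and the similarities are hard-coded from the table rather than computed from WordNet.

```python
def select_concepts(similarities, theta):
    """Keep the concepts whose similarity to the visual cue is at least theta."""
    return {c: s for c, s in similarities.items() if s >= theta}


# Similarity values for the cue "Castle", copied from the slide's table.
castle = {
    "Windows": 0.4533,
    "Plant": 0.4582,
    "Court": 0.5115,
    "Church": 0.6123,
    "Building": 0.701,
}

selected = select_concepts(castle, theta=0.5)
print(sorted(selected))  # ['Building', 'Church', 'Court']
```

Raising θ keeps fewer but more closely related concepts; with θ = 0.8 no concept from this table would remain for "Castle".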

Page 13: When textual and visual information join forces for multimedia retrieval

The proposed Framework

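[Figure: framework diagram, highlighting the scoring path. Lucene indexing gives text-based scores; WordNet similarity gives the selected concepts and the visual-based scores; fusion and ranking follow.]

One text-based score for each scene (t) and one visual-based score for each scene (v) are fused as:

$$f_i = t_i^{\alpha} + v_i^{1-\alpha}$$

The visual score is computed from the scores of the selected concepts for each scene:

$$v_i^{q} = \sum_{c \in C'_q} w_c \times vs_i^{c}$$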
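A minimal sketch of this scoring step, reading the slide's fusion rule as f_i = t_i^α + v_i^(1−α) and the visual score as a sum of w_c × vs_i^c over the selected concepts C'_q. The scene names, scores, weights and the value of α below are all made up for illustration, and the scores are assumed to lie in [0, 1]; none of these values come from the slides.

```python
def visual_score(detector_scores, weights):
    """v_i^q: weighted sum of detector scores over the selected concepts C'_q."""
    return sum(w * detector_scores.get(c, 0.0) for c, w in weights.items())


def fused_score(t, v, alpha):
    """f_i = t_i^alpha + v_i^(1 - alpha); assumes t and v lie in [0, 1]."""
    return t ** alpha + v ** (1.0 - alpha)


# Hypothetical data: text score and per-concept detector scores for two scenes.
scenes = {
    "scene_a": (0.81, {"Building": 0.04, "Church": 0.0}),
    "scene_b": (0.64, {"Building": 0.5, "Church": 0.5}),
}
weights = {"Building": 1.0, "Church": 1.0}  # w_c = 1.0 for every selected concept
alpha = 0.5  # illustrative value only

fused = {
    name: fused_score(t, visual_score(vs, weights), alpha)
    for name, (t, vs) in scenes.items()
}
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)  # ['scene_b', 'scene_a']
```

In this toy example scene_a has the higher text score, but scene_b overtakes it once the visual evidence is fused in. Setting every weight to 1.0 corresponds to the w = 1.0 configuration; replacing the weights with per-concept confidences corresponds to w = confidence(c).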

Page 14: When textual and visual information join forces for multimedia retrieval

Evaluation

• To what extent can visual concepts contribute information when retrieving videos?
• How can we cope with the confidence in visual concept detection?
• BBC Archive subset provided by the MediaEval 2013 Search and Hyperlinking task.
• Evaluation measures:
  – Mean Reciprocal Rank (MRR): assesses the rank of the relevant segment
  – Mean Generalized Average Precision (mGAP): takes into account the starting time of the segment
  – Mean Average Segment Precision (MASP): measures both ranking and segmentation of relevant segments
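For reference, the first of these measures can be sketched with the standard definition of MRR: the mean, over queries, of the reciprocal of the rank at which the relevant segment is found. This sketch does not model how the benchmark decides that a returned segment matches the relevant one (for example any time-window tolerance), and the function name and rank values are illustrative only.

```python
def mean_reciprocal_rank(ranks):
    """Standard MRR over queries.

    ranks: 1-based rank of the relevant segment for each query,
    or None when it was not retrieved (contributes 0).
    """
    if not ranks:
        raise ValueError("need at least one query")
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)


# Hypothetical ranks for three queries: found at rank 1, rank 2 and rank 4.
mrr = mean_reciprocal_rank([1, 2, 4])
print(round(mrr, 4))  # 0.5833
```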


Page 15: When textual and visual information join forces for multimedia retrieval

Retrieval Performance (50 queries)

• Low impact of the visual concept detector confidence (w).
• Significant improvement can be achieved by combining only mapped concepts with θ ≥ 0.3.
• Best performance is obtained when θ ≥ 0.8 (gain ≈ 11-12%).

[Figure: retrieval performance plots, with panels w = 1.0 and w = confidence(c).]

Page 16: When textual and visual information join forces for multimedia retrieval

Visual concepts and Query association

• The number of concepts associated with queries for different values of the threshold θ.

θ            0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
Min            5    5    5    2    0    0    0    0    0    0
Max           45   45   41   37   25   19   19   12    6    2
Mean          20   19   18   15   11    7    5    3    1    1
#Q(#c’q>0)    50   50   50   50   49   49   48   44   29   21

Page 17: When textual and visual information join forces for multimedia retrieval

Retrieval on queries with visual concepts (21)

• Concept mapping significantly improves the performance of the text-based search task on these queries.
• The best performance was achieved with θ ≥ 0.7 (gain ≈ 32-33%).

[Figure: retrieval performance plots on the 21 queries with visual concepts, with panels w = 1.0 and w = confidence(c).]

Page 18: When textual and visual information join forces for multimedia retrieval

Conclusion

• A novel video search framework using visual information to enrich text-based search for video retrieval has been presented.
• We conducted our evaluation on the MediaEval 2013 Search and Hyperlinking task, where we ranked 2nd on Search and 1st on Hyperlinking.
• Experimental results show that mapping text-based queries to visual concepts significantly improves the search system.
• When the relevant visual concepts are appropriately selected, a very significant improvement is achieved (gain ≈ 33%).

Page 19: When textual and visual information join forces for multimedia retrieval

Related Publications

• B. Safadi, M. Sahuguet and B. Huet, "When textual and visual information join forces for multimedia retrieval", ICMR 2014, ACM International Conference on Multimedia Retrieval, April 1-4, 2014, Glasgow, Scotland.
• M. Sahuguet and B. Huet, "Mining the Web for Multimedia-based Enriching", MMM 2014, 20th International Conference on MultiMedia Modeling, January 8-10, 2014, Dublin, Ireland.
• M. Sahuguet, B. Huet, B. Cervenkova, E. Apostolidis, V. Mezaris, D. Stein, S. Eickeler, J-L. Redondo Garcia, R. Troncy and L. Pikora, "LinkedTV at MediaEval 2013 search and hyperlinking task", MediaEval 2013, Multimedia Benchmark Workshop, October 18-19, 2013, Barcelona, Spain.
• D. Stein, A. Öktem, E. Apostolidis, V. Mezaris, J. L. Redondo García, R. Troncy, M. Sahuguet and B. Huet, "From raw data to semantically enriched hyperlinking: Recent advances in the LinkedTV analysis workflow", NEM Summit 2013, Networked & Electronic Media, October 28-30, 2013, Nantes, France.
• V. Mezaris and B. Huet, "Video Hyperlinking", tutorial accepted at ICIP 2014, October 2014, Paris.
• B. Safadi, M. Sahuguet and B. Huet, "Linking text and visual concepts semantically for cross modal multimedia search", ICIP 2014, Paris, 2014.

Page 20: When textual and visual information join forces for multimedia retrieval

Questions?

http://www.slideshare.net/huetbenoit/

• Thank you.
