When Textual and Visual Information Join Forces
for MultiMedia Retrieval
Bahjat Safadi, Mathilde Sahuguet, Benoit Huet
EURECOM, Multimedia Department
Sophia Antipolis, France
Introduction
• The EU alone hosts 500+ online video platforms
• 42.7m hrs of footage in online archives of broadcasters and producers (61% of archive footage is online)
• UGC on the advance:
  – YouTube receives 60 hrs of video/minute
  – Vine and Instagram video
• Internet video is now 40 percent of consumer Internet traffic, and will reach 62 percent by the end of 2015 and 75 percent in 2017 (source: CISCO)
• How to make the content accessible?
  – Browsing, Searching, Hyperlinking
B Huet - Eurecom - BAMMF - p 2 - 20/06/2014
Objectives and Contributions
• We propose and evaluate a video search framework that uses visual information to enrich classic text-based search for video retrieval at the fragment level.
• We investigate the following two questions:
  – To what extent can visual concepts contribute information when retrieving videos?
  – How can we cope with the confidence in visual concept detection?
• The framework extends conventional text-based search by fusing textual and visual scores.
• We address both the semantic and intention gaps:
  – By automatically mapping the query text to semantic concepts.
  – With the addition of “visual cues”.
MediaEval Search & Hyperlinking
• Information seeking in a video dataset: retrieving media fragments/anchors
The Video Archive
2323 BBC videos of different genres (440 programs)
• ~1697h of video + audio
• Subtitles (manual)
• Two ASR transcripts (LIMSI, LIUM)
• Metadata (Title, Cast, Description, ...)
• Shot boundaries and key-frames
• Search: 50 queries from 29 users
  – Textual query + visual cues
• Face detection
• Concept detection
Text query: Medieval history of why castles were first built
Visual cues: Castle
Text query: Best players of all time; Embarrassing England performances; Wake up call for English football; Wembley massacre
Visual cues: Poor camera quality; heavy looking football; unusual goal celebrations; unusual crowd reactions; dark; grey; overcast; black and white
The proposed Framework
Off-line:
• Collection (videos, scenes and subtitles) → Scenes
• Content-based indexing: Scenes → Visual semantic concepts → Concept indexing scores
• Lucene indexing: Scenes + subtitles
On-line:
• User querying: textual/visual query
• Textual query → Text-based scores
• Visual cues → Selected concepts → Visual-based scores
• Fusion → Ranking → Ranked list
The proposed Framework
Content-based indexing (off-line): Collection (videos, scenes and subtitles) → Scenes → Visual semantic concepts → Concept indexing scores
• No training data for visual concepts
• Use 151 visual concept detectors trained on TrecVid 2012 data
• Unknown performance
Visual concept detector confidence (w)
• Top 100 images for the concept “Animal”
• 58 out of 100 are manually evaluated as valid
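One way to read the “Animal” example is as a precision estimate over the top-ranked images. The sketch below assumes that the confidence w of a detector is simply the fraction of its top results judged valid; the slides do not spell out the exact definition, and the function name `detector_confidence` is hypothetical.

```python
def detector_confidence(judgments):
    """Fraction of top-ranked images manually judged valid (precision at N).

    Assumption: this fraction is used directly as the confidence w of a
    concept detector; the slides do not state the exact definition.
    """
    if not judgments:
        raise ValueError("need at least one judged image")
    return sum(1 for ok in judgments if ok) / len(judgments)


# 58 valid images among the top 100 for "Animal", as reported on the slide.
animal_judgments = [True] * 58 + [False] * 42
w_animal = detector_confidence(animal_judgments)
print(w_animal)  # 0.58
```

Under this reading, the “Animal” detector would get w = 0.58, while the baseline setting simply uses w = 1.0 for every concept.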
The proposed Framework
User querying (on-line): textual/visual query
<queryText>Children out on poetry trip Exploration of poetry by school children Poem writing</queryText>
<visualCues>House memories Farm exploration A poem on animal and shells</visualCues>
Users are not aware of visual concepts
Mapping visual cues to visual concepts
• <queryText>Children out on poetry trip Exploration of poetry by school children Poem writing</queryText>
  <visualCues>House memories Farm exploration A poem on animal and shells</visualCues>
• Keywords: Farm, Shells, Exploration, Poem, Animal, House, Memories
• WordNet mapping: keywords → visual concepts
• Visual concepts: Animal, Birds, Insect, Cattle, Dogs, Building, School, Church, Flags, Mountain
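The first step of the mapping, going from the free-text visual cues to keywords, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the stop-word list and the function name `extract_keywords` are assumptions, and the WordNet step that maps keywords to visual concepts is not shown.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stop-word list, just large enough for this example.
STOP_WORDS = {"a", "an", "and", "on", "of", "the", "by", "in"}


def extract_keywords(query_xml):
    """Return lower-cased, de-duplicated keywords from the <visualCues> field."""
    # The two query fields have no common root, so wrap them before parsing.
    root = ET.fromstring(f"<query>{query_xml}</query>")
    cues = root.findtext("visualCues", default="")
    keywords = []
    for token in re.findall(r"[A-Za-z]+", cues.lower()):
        if token not in STOP_WORDS and token not in keywords:
            keywords.append(token)
    return keywords


query = (
    "<queryText>Children out on poetry trip Exploration of poetry by "
    "school children Poem writing</queryText>"
    "<visualCues>House memories Farm exploration A poem on animal and shells"
    "</visualCues>"
)
print(extract_keywords(query))
# ['house', 'memories', 'farm', 'exploration', 'poem', 'animal', 'shells']
```

The seven keywords match the ones shown on the slide; each would then be compared against the names of the available visual concepts.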
Mapping visual cues to visual concepts
• Concepts mapped to the visual query “Castle”
• Semantic similarity computed using the “Lin” distance
Concept   Windows   Plant    Court    Church   Building
β         0.4533    0.4582   0.5115   0.6123   0.701
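The “Castle” table can be used to illustrate how a similarity threshold selects concepts. The sketch below assumes that β is the Lin similarity between the query keyword and each concept name, and that concepts are kept when β reaches the threshold θ used in the evaluation slides. `lin_similarity` states the standard Lin formula from information-content (IC) values; both function names are hypothetical.

```python
def lin_similarity(ic_a, ic_b, ic_lcs):
    """Lin measure: 2 * IC(lcs) / (IC(a) + IC(b)).

    ic_lcs is the information content of the least common subsumer of the
    two WordNet synsets; the IC values are assumed to be supplied by the caller.
    """
    denom = ic_a + ic_b
    return 0.0 if denom == 0 else 2.0 * ic_lcs / denom


def select_concepts(similarities, theta):
    """Keep the concepts whose similarity to the query is at least theta."""
    return {c: b for c, b in similarities.items() if b >= theta}


# Similarity values for the visual query "Castle", copied from the slide.
castle = {
    "Windows": 0.4533,
    "Plant": 0.4582,
    "Court": 0.5115,
    "Church": 0.6123,
    "Building": 0.701,
}
print(select_concepts(castle, 0.5))
# {'Court': 0.5115, 'Church': 0.6123, 'Building': 0.701}
print(select_concepts(castle, 0.8))
# {}
```

With θ = 0.8 no concept survives for “Castle”, so under this reading the query would fall back to text-only search.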
The proposed Framework
• Text-based scores (Lucene indexing): one score for each scene (t)
• Visual-based scores (WordNet similarity, selected concepts): one score for each scene (v), computed for a query q from the scores of the selected concepts for each scene:
  $v_i^q = \sum_{c \in C'_q} w_c \times vs_i^c$
• Fusion and ranking:
  $f_i = t_i^{\alpha} + v_i^{1-\alpha}$
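The two formulas can be put together in a few lines. This is a minimal sketch that assumes non-negative scores and follows the fusion formula as it appears on the slide; the function names, the scene identifiers and all numeric values in the example are hypothetical.

```python
def visual_score(concept_scores, selected, confidence=None):
    """v_i^q: weighted sum of detector scores vs_i^c over the selected concepts C'_q.

    With confidence=None every concept gets w_c = 1.0; otherwise w_c is
    looked up per concept (defaulting to 1.0 when missing).
    """
    total = 0.0
    for c in selected:
        w = 1.0 if confidence is None else confidence.get(c, 1.0)
        total += w * concept_scores.get(c, 0.0)
    return total


def fuse(t, v, alpha):
    """f_i = t_i^alpha + v_i^(1 - alpha), as written on the slide."""
    return t ** alpha + v ** (1.0 - alpha)


def rank_scenes(text_scores, scene_concepts, selected, alpha, confidence=None):
    """Return scene ids sorted by decreasing fused score."""
    fused = {}
    for scene, t in text_scores.items():
        v = visual_score(scene_concepts.get(scene, {}), selected, confidence)
        fused[scene] = fuse(t, v, alpha)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical scores for three scenes.
text_scores = {"s1": 0.6, "s2": 0.6, "s3": 0.2}
scene_concepts = {
    "s1": {"Building": 0.9, "Church": 0.1},
    "s2": {"Building": 0.1},
    "s3": {"Building": 0.8},
}
ranking = rank_scenes(text_scores, scene_concepts, ["Building", "Church"], alpha=0.5)
print(ranking)  # ['s1', 's3', 's2']
```

In the example, s1 and s2 tie on text score, and the visual score breaks the tie; s3 overtakes s2 because its strong “Building” response outweighs its weaker text match.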
Evaluation
• To what extent can visual concepts contribute information when retrieving videos?
• How can we cope with the confidence in visual concept detection?
• BBC Archive subset provided by the MediaEval 2013 Search and Hyperlinking task.
• Evaluation Measures:
  – Mean Reciprocal Rank (MRR): assesses the rank of the relevant segment
  – Mean Generalized Average Precision (mGAP): takes into account the starting time of the segment
  – Mean Average Segment Precision (MASP): measures both ranking and segmentation of relevant segments
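Of the three measures, MRR is the simplest to illustrate. The sketch below assumes one relevant segment per query and exact matching of segment identifiers; the official task scoring may additionally apply tolerance on segment times, which is not modelled here. The function name and the toy data are hypothetical.

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Average of 1/rank of the relevant segment over all queries.

    A query whose relevant segment is not retrieved contributes 0.
    """
    total = 0.0
    for query, ranking in ranked_lists.items():
        for rank, segment in enumerate(ranking, start=1):
            if segment == relevant[query]:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Hypothetical runs: relevant segment at rank 1, at rank 2, and not retrieved.
ranked_lists = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
}
relevant = {"q1": "a", "q2": "e", "q3": "z"}
mrr = mean_reciprocal_rank(ranked_lists, relevant)
print(mrr)  # 0.5
```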
Retrieval Performance (50 queries)
• Low impact of visual concept detector confidence (w)
• Significant improvement can be achieved by combining only mapped concepts with θ ≥ 0.3.
• Best performance is obtained when θ ≥ 0.8 (gain ≈ 11-12%).
[Figure: retrieval performance, two panels: w=1.0 and w=confidence(c)]
Visual concepts and Query association
• The number of concepts associated with queries for different thresholds θ.
θ            0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
Min            5    5    5    2    0    0    0    0    0    0
Max           45   45   41   37   25   19   19   12    6    2
Mean          20   19   18   15   11    7    5    3    1    1
#Q(#c'q>0)    50   50   50   50   49   49   48   44   29   21
(#Q(#c'q>0): number of queries with at least one associated concept)
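The rows of this table can be derived from per-query similarity values. The sketch below shows the counting under the assumption that a concept is associated with a query when its similarity is at least θ; the function name is hypothetical, and only the "castle" values come from the slides, the other two queries being invented for illustration.

```python
def concept_stats(query_similarities, thetas):
    """For each threshold, count the concepts kept per query and summarise."""
    rows = {}
    for theta in thetas:
        counts = [
            sum(1 for b in sims.values() if b >= theta)
            for sims in query_similarities.values()
        ]
        rows[theta] = {
            "min": min(counts),
            "max": max(counts),
            "mean": sum(counts) / len(counts),
            # Number of queries with at least one associated concept.
            "queries_with_concepts": sum(1 for n in counts if n > 0),
        }
    return rows


query_similarities = {
    # Values for "Castle" as given on the slide.
    "castle": {"Windows": 0.4533, "Plant": 0.4582, "Court": 0.5115,
               "Church": 0.6123, "Building": 0.701},
    # Hypothetical queries, for illustration only.
    "pets": {"Animal": 0.9, "Dogs": 0.35},
    "parade": {"Flags": 0.2},
}
stats = concept_stats(query_similarities, [0.0, 0.5, 0.8])
for theta, row in stats.items():
    print(theta, row["min"], row["max"], row["queries_with_concepts"])
```

As θ grows, fewer concepts survive and some queries end up with none, which is the trend in the last row of the table.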
Retrieval on queries with visual concepts (21)
• Concept mapping significantly improves the performance of the text-based search task on these queries.
• The best performance was achieved with θ ≥ 0.7 (gain ≈ 32-33%).
[Figure: retrieval performance on queries with visual concepts, two panels: w=1.0 and w=confidence(c)]
Conclusion
• A novel video search framework using visual information to enrich text-based search for video retrieval has been presented.
• We conducted our evaluations on MediaEval 2013, where we ranked 2nd on Search and 1st on Hyperlinking.
• Experimental results show that mapping text-based queries to visual concepts significantly improves the search system.
• When the relevant visual concepts are appropriately selected, a very significant improvement is achieved (gain ≈ 33%).
Related Publications
• B. Safadi, M. Sahuguet and B. Huet. When textual and visual information join forces for multimedia retrieval. ICMR 2014, ACM International Conference on Multimedia Retrieval, April 1-4, 2014, Glasgow, Scotland
• M. Sahuguet and B. Huet. Mining the Web for Multimedia-based Enriching. MMM 2014, 20th International Conference on MultiMedia Modeling, 8-10 January 2014, Dublin, Ireland
• M. Sahuguet, B. Huet, B. Cervenkova, E. Apostolidis, V. Mezaris, D. Stein, S. Eickeler, J-L. Redondo Garcia, R. Troncy, L. Pikora. LinkedTV at MediaEval 2013 search and hyperlinking task. MEDIAEVAL 2013, Multimedia Benchmark Workshop, October 18-19, 2013, Barcelona, Spain
• D. Stein, A. Öktem, E. Apostolidis, V. Mezaris, J. L. Redondo García, R. Troncy, M. Sahuguet and B. Huet. From raw data to semantically enriched hyperlinking: Recent advances in the LinkedTV analysis workflow. NEM Summit 2013, Networked & Electronic Media, 28-30 October 2013, Nantes, France
• V. Mezaris and B. Huet. Video Hyperlinking. Tutorial accepted at ICIP 2014 (Oct), Paris
• B. Safadi, M. Sahuguet and B. Huet. Linking text and visual concepts semantically for cross modal multimedia search. ICIP 2014, Paris, 2014
Questions?
http://www.slideshare.net/huetbenoit/
• Thank you.