KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association Computer Vision for Human-Computer Interaction Lab www.kit.edu Story-based Video Retrieval in TV series using Plot Synopses Makarand Tapaswi, Martin Bäuml, Rainer Stiefelhagen Karlsruhe Institute of Technology, Germany 03 April, ACM ICMR 2014


Page 1

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Computer Vision for Human-Computer Interaction Lab

www.kit.edu

Story-based Video Retrieval in TV series using Plot Synopses

Makarand Tapaswi, Martin Bäuml, Rainer Stiefelhagen

Karlsruhe Institute of Technology, Germany

03 April, ACM ICMR 2014

Page 2

Gandalf falls to a Balrog of Moria

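Obi-Wan cuts Darth Maul in two with his lightsaber

[Figure: "Story" timelines marking where these events occur in each film; the first spans 0:00:00 to 2:58:00, the second 0:00:00 to 2:16:00]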

Page 3

Goal

Page 4

Idea

Names

Places

Talking verbs

Action verbs

Objects

Page 5

Related Work

Crowd-sourcing

Freiburg et al. 2011: Concert concepts with user feedback

Wang et al. 2013: Joint latent space for images and text

Text (transcripts) to video

Everingham et al. 2006: Person identification

Laptev et al. 2008: Action recognition

Xu et al. 2008: Event detection in sports

Describing images and videos

Farhadi et al. 2010: <object, action, scene> triplets to describe images

Habibian et al. 2013: Video2Sentence, Sentence2Video

Page 6

Text – Video Alignment

Pre-processing

Character identification

Alignment

Page 7

Pre-processing

Buffy awakens to find Dracula in her bedroom. She is helpless against his powers and unable to stop him from biting her. When she wakes the next morning …

Part-of-speech tagging

Coreference resolution

Buffy/NNP awakens/VBZ to/TO find/VBP Dracula/NNP in/IN her/PRP bedroom/NN ./. She/PRP is/VBZ helpless/JJ against/IN his/PRP powers/NNS …

Original sentence

Shot boundary detection

Buffy awakens to find Dracula in her bedroom. She is helpless against his powers and unable to stop him from biting her. When she wakes the next morning …

Names

Places
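The tagged sentence above uses `word/TAG` tokens with Penn Treebank tags. As a minimal sketch (not the authors' pipeline), proper nouns (`NNP`/`NNPS`) can be collected as candidate character names. The helpers `parse_tagged` and `proper_nouns` are hypothetical; matching against a cast list and resolving pronouns such as "She" are left out.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        if sep:  # keep only tokens that contain a '/' separator
            pairs.append((word, tag))
    return pairs


def proper_nouns(pairs):
    """Return proper nouns (NNP/NNPS) in order of first appearance."""
    names = []
    for word, tag in pairs:
        if tag in ("NNP", "NNPS") and word not in names:
            names.append(word)
    return names


tagged = ("Buffy/NNP awakens/VBZ to/TO find/VBP Dracula/NNP in/IN her/PRP "
          "bedroom/NN ./. She/PRP is/VBZ helpless/JJ against/IN his/PRP powers/NNS")
print(proper_nouns(parse_tagged(tagged)))  # ['Buffy', 'Dracula']
```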

Page 8

00:10:01,933 --> 00:10:04,447
So I won't be taking drama with you.

00:10:04,533 --> 00:10:08,811
- What? You have to. You promised!
- I know, but Giles said that it was

00:10:08,893 --> 00:10:11,407
- The hell with Giles.
- I can hear you, Willow.

Buffy: So I won't be taking drama with you.

Willow: What? You have to, you promised!

Buffy: Well, I know, but Giles said that it just was-

Willow: The hell with Giles.

Giles: I can hear you, Willow.

Align (fan) transcripts to subtitles

Transcripts: who speaks what? Subtitles: what is spoken when?

Weak character labels

Weakly Labeled Data

speaking: Riley?

speaking: Willow?

Bäuml et al. 2013
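Transcripts say who speaks but not when; subtitles say when but not who. The sketch below labels each subtitle line with the speaker of the transcript line whose word set overlaps it most (Jaccard overlap), using the dialogue on this slide. This greedy matching is an assumption for illustration, not the alignment used by Bäuml et al. 2013; subtitle entries with two speakers are split at the leading "- " by hand.

```python
import re


def words(text):
    """Lower-cased word set, ignoring punctuation."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def label_subtitles(subtitles, transcript):
    """Give each (start, end, text) subtitle the speaker of its best-matching transcript line."""
    labeled = []
    for start, end, text in subtitles:
        best = max(transcript, key=lambda line: jaccard(words(text), words(line[1])))
        labeled.append((start, end, best[0], text))
    return labeled


subtitles = [
    ("00:10:01,933", "00:10:04,447", "So I won't be taking drama with you."),
    ("00:10:04,533", "00:10:08,811", "What? You have to. You promised!"),
    ("00:10:04,533", "00:10:08,811", "I know, but Giles said that it was"),
    ("00:10:08,893", "00:10:11,407", "The hell with Giles."),
    ("00:10:08,893", "00:10:11,407", "I can hear you, Willow."),
]
transcript = [
    ("Buffy", "So I won't be taking drama with you."),
    ("Willow", "What? You have to, you promised!"),
    ("Buffy", "Well, I know, but Giles said that it just was-"),
    ("Willow", "The hell with Giles."),
    ("Giles", "I can hear you, Willow."),
]
for start, end, speaker, text in label_subtitles(subtitles, transcript):
    print(start, speaker, text)
```

Each timed line now carries a (weak) speaker name, which is what the next step uses as a training label.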

Page 9

Person identification in video

Weakly Labeled Data

speaking: Riley?

speaking: Willow?

Automatically identify all tracks

Train classifiers

Bäuml et al. 2013
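To illustrate the "train classifiers, then identify all tracks" step, here is a nearest-class-mean sketch over hypothetical face-track feature vectors: tracks with a weak speaker label define one centroid per name, and every track is assigned to its closest centroid. This is a deliberately simple stand-in; the classifier of Bäuml et al. 2013 is not reproduced, and the 2-D features are made up.

```python
import numpy as np


def train_centroids(features, labels):
    """Mean feature vector per weak label (tracks labeled None are ignored)."""
    names = sorted({label for label in labels if label is not None})
    centroids = np.stack([
        features[[i for i, label in enumerate(labels) if label == name]].mean(axis=0)
        for name in names
    ])
    return names, centroids


def identify(features, names, centroids):
    """Assign every track to the name of its nearest centroid (Euclidean distance)."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return [names[k] for k in dists.argmin(axis=1)]


# Hypothetical 2-D track features: four weakly labeled tracks, two unlabeled ones.
features = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1],
                     [0.2, 0.8], [0.8, 0.3]])
weak = ["Riley", "Riley", "Willow", "Willow", None, None]
names, centroids = train_centroids(features, weak)
print(identify(features, names, centroids))
```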

Page 10

Alignment

• Compute the similarity matrix

• Find the alignment which maximizes similarity*

[Figure: similarity matrix with shots on one axis and sentences on the other]

Page 11

A simple prior

Distribute shots equally to sentences

[Figure panels: Prior, Similarity, Similarity]
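A minimal sketch of the prior, assuming "distribute shots equally" means that shot j of N goes to sentence ⌊j·M/N⌋ of M; the slide does not give the exact formula, so the rounding rule here is an assumption.

```python
import numpy as np


def prior_assignment(num_shots, num_sentences):
    """Spread shots uniformly: shot j goes to sentence floor(j * M / N)."""
    return [j * num_sentences // num_shots for j in range(num_shots)]


def prior_matrix(num_shots, num_sentences):
    """Binary (sentences x shots) matrix marking the uniform prior assignment."""
    P = np.zeros((num_sentences, num_shots))
    for j, i in enumerate(prior_assignment(num_shots, num_sentences)):
        P[i, j] = 1.0
    return P


print(prior_assignment(10, 4))  # [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
```

The prior ignores the video content entirely, which is why it serves only as a baseline.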

Page 12

Similarity – Identities

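Riley asks Spike about Dracula, but the former commando is warned.

Buffy awakens to find Dracula in her bedroom.

Matrix of similarity scores (rows: the two sentences above; columns: shots 130 to 134)

Sentence | Shot 130 | Shot 131 | Shot 132 | Shot 133 | Shot 134
Riley asks Spike … | +w_Riley | +w_Spike + w_Riley | +w_Spike | +w_Dracula | 0
Buffy awakens … | 0 | 0 | 0 | +w_Dracula | +w_Buffy

Note: w_A represents the IDF, i.e. the importance, of character A in the episode.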
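The identity similarity can be sketched as follows: S[i, j] sums w_A over the characters A that are named in sentence i and identified in shot j. The IDF form w_A = log(#shots / #shots containing A) is an assumption; here it is computed over only the five shots of the slide's example (each shot's characters read off the matrix), whereas the slide computes it per episode.

```python
import math

import numpy as np


def idf_weights(shot_characters):
    """Hypothetical IDF: w_A = log(#shots / #shots in which character A appears)."""
    n = len(shot_characters)
    names = set().union(*shot_characters)
    return {a: math.log(n / sum(a in s for s in shot_characters)) for a in names}


def identity_similarity(sentence_names, shot_characters, w):
    """S[i, j] = sum of w_A over characters A named in sentence i and present in shot j."""
    S = np.zeros((len(sentence_names), len(shot_characters)))
    for i, names in enumerate(sentence_names):
        for j, chars in enumerate(shot_characters):
            S[i, j] = sum(w[a] for a in names & chars)
    return S


# Toy version of the slide's example: shots 130-134 and the two sentences.
shots = [{"Riley"}, {"Spike", "Riley"}, {"Spike"}, {"Dracula"}, {"Buffy"}]
sentence_names = [{"Riley", "Spike", "Dracula"}, {"Buffy", "Dracula"}]
w = idf_weights(shots)
S = identity_similarity(sentence_names, shots, w)
print(S.round(2))
```

Rare characters get larger weights, so a shared appearance of a minor character counts more than one of the main character.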

Page 13

Similarity – Subtitles

Giles has Willow start scanning books into a computer so there can be resources for the gang to use

He then tells her that he’s going to England because it seems he’s no longer needed by Buffy or the Scoobies

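Matrix of similarity scores (columns: shots 24 to 27)

Sentence | Shot 24 | Shot 25 | Shot 26 | Shot 27
Giles has Willow start scanning books … | +1 | +1 | 0 | 0
He then tells her that he’s going to England … | 0 | 0 | 0 | +2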
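For the subtitle cue, one plausible reading of the +1/+2 scores, assumed here, is the number of content words a sentence shares with the subtitles spoken during a shot. The stop-word list and the example subtitle strings below are hypothetical, and any stemming or weighting the paper may apply is omitted.

```python
import re

import numpy as np

# Hypothetical stop-word list; a real system would use a fuller one.
STOP = {"a", "an", "the", "to", "of", "and", "or", "in", "into", "for", "so",
        "he", "she", "it", "her", "his", "that", "there", "be", "can", "is",
        "has", "by", "no", "then", "all", "these", "will", "here"}


def content_words(text):
    """Lower-cased words of `text` that are not stop words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP}


def subtitle_similarity(sentences, shot_subtitles):
    """S[i, j] = number of content words shared by sentence i and shot j's subtitles."""
    S = np.zeros((len(sentences), len(shot_subtitles)))
    for i, sentence in enumerate(sentences):
        sentence_words = content_words(sentence)
        for j, subtitle in enumerate(shot_subtitles):
            S[i, j] = len(sentence_words & content_words(subtitle))
    return S


sentences = ["Giles has Willow start scanning books into a computer"]
shots = ["Scanning all these books will take weeks.", "Nothing relevant here."]  # invented
print(subtitle_similarity(sentences, shots))  # [[2. 0.]]
```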

Page 14

Max Similarity

Maximize the joint similarity over all shot-to-sentence assignments, such that each shot is assigned to ONE sentence

Properties

Pro: maximizes similarity

Con: breaks the story structure, causing jumpy assignments
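The MAX strategy reduces to a column-wise argmax of the similarity matrix S (sentences × shots): t*(j) = argmax_i S(i, j). The toy matrix below is made up; it shows the jumpiness noted above, with shot 2 falling back to sentence 0 after shot 1 moved to sentence 1.

```python
import numpy as np


def max_similarity_alignment(S):
    """Assign each shot (column) independently to its most similar sentence (row)."""
    return S.argmax(axis=0).tolist()


S = np.array([[0.9, 0.1, 0.8, 0.0],
              [0.2, 0.7, 0.2, 0.3],
              [0.0, 0.2, 0.1, 0.9]])
print(max_similarity_alignment(S))  # [0, 1, 0, 2]
```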

Page 15

DTW2

Consecutive shots are likely to be assigned to the same (or the next) sentence

Properties

Pro: maximizes similarity with temporal consistency

Pro: efficient computation

Con: can assign too many shots to one sentence

Con: unable to handle plot non-linearity

maximize similarity + each shot to ONE sentence
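A sketch of the DTW2 idea as a two-dimensional dynamic program, under assumptions the slide does not state: the first shot belongs to the first sentence, the last shot to the last sentence, there are at least as many shots as sentences, and each shot either keeps the previous shot's sentence or advances by one. The recurrence is D(i, j) = S(i, j) + max(D(i, j-1), D(i-1, j-1)), computed in O(MN) for M sentences and N shots.

```python
import numpy as np


def dtw2(S):
    """Monotonic alignment: each shot keeps the previous shot's sentence or moves to the next one."""
    M, N = S.shape                        # M sentences x N shots, assumes N >= M
    D = np.full((M, N), -np.inf)          # best cumulative score ending at (i, j)
    back = np.zeros((M, N), dtype=int)    # 1 if (i, j) was reached from sentence i-1
    D[0, 0] = S[0, 0]                     # first shot belongs to the first sentence
    for j in range(1, N):
        for i in range(min(j + 1, M)):
            stay = D[i, j - 1]
            advance = D[i - 1, j - 1] if i > 0 else -np.inf
            back[i, j] = int(advance > stay)
            D[i, j] = S[i, j] + max(stay, advance)
    # Backtrack from the last sentence at the last shot.
    path, i = [], M - 1
    for j in range(N - 1, -1, -1):
        path.append(i)
        i -= int(back[i, j])
    return path[::-1]


S = np.array([[0.9, 0.1, 0.8, 0.0],
              [0.2, 0.7, 0.2, 0.3],
              [0.0, 0.2, 0.1, 0.9]])
print(dtw2(S))  # [0, 1, 1, 2]
```

On the same toy matrix where a per-shot argmax jumps back and forth, the monotonic path stays ordered.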

Page 16

DTW3

Regularize the number of shots assigned to one sentence

Properties

Pro: maximizes similarity with temporal consistency

Pro: automatically controls the number of shots assigned to a sentence

Pro: efficient computation

Con: unable to handle plot non-linearity

maximize similarity + each shot to ONE sentence + temporal consistency
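One way to regularize the number of shots per sentence, assumed here rather than taken from the slide, is a third DP dimension k that counts how many consecutive shots the current sentence already holds. The sketch caps runs at a hypothetical `max_run` and subtracts a hypothetical linear penalty `lam * (k - 1)` for each extra shot; the paper's actual DTW3 regularizer may differ. The cost is O(MNK) with K = `max_run`.

```python
import numpy as np


def dtw3(S, max_run, lam=0.0):
    """3-D DP over (sentence i, shot j, run length k).

    k = number of consecutive shots (including shot j) assigned to sentence i.
    Hypothetical regularizer: the k-th shot of a sentence earns S[i, j] - lam * (k - 1),
    and no sentence may take more than max_run shots.
    """
    M, N = S.shape
    NEG = -np.inf
    D = np.full((M, N, max_run + 1), NEG)  # index k runs over 1..max_run; k = 0 is unused
    D[0, 0, 1] = S[0, 0]
    for j in range(1, N):
        for i in range(min(j + 1, M)):
            # Stay on sentence i: the run grows from k - 1 to k.
            for k in range(2, max_run + 1):
                prev = D[i, j - 1, k - 1]
                if prev > NEG:
                    D[i, j, k] = prev + S[i, j] - lam * (k - 1)
            # Advance from sentence i - 1 (any run length): a new run starts at k = 1.
            if i > 0:
                prev = D[i - 1, j - 1].max()
                if prev > NEG:
                    D[i, j, 1] = prev + S[i, j]
    # Backtrack from the best run length at the last sentence and shot.
    i, k = M - 1, int(D[M - 1, N - 1].argmax())
    if D[i, N - 1, k] == NEG:
        raise ValueError("no feasible alignment; increase max_run")
    path = []
    for j in range(N - 1, -1, -1):
        path.append(i)
        if j == 0:
            break
        if k > 1:
            k -= 1                                 # came from the same sentence
        else:
            i -= 1                                 # came from the previous sentence
            k = int(D[i, j - 1].argmax())
    return path[::-1]


S = np.array([[1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.5, 1.0]])
print(dtw3(S, max_run=3))           # [0, 0, 0, 1]
print(dtw3(S, max_run=2))           # [0, 0, 1, 1]
print(dtw3(S, max_run=3, lam=0.6))  # [0, 0, 1, 1]
```

On this toy matrix an unregularized monotonic alignment gives sentence 0 three shots; either the cap or the penalty moves the third shot to sentence 1.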

Page 17

Evaluation

Data set

Quantitative results

Qualitative results


Page 18

Data set

• Buffy the Vampire Slayer (season 5)

• Plot synopses from Wikipedia

– 22 episodes, 15+ hours of video

– 15700 shots

– 800 sentences

– 21000 face tracks

• Per episode:

– #shots: 540 to 940; avg. ~720

– #sentences: 22 to 54; avg. ~36

Page 19

Alignment accuracy (%)

Method | Buffy E01 | Buffy E02 | Buffy E03 | Buffy E04 | Average E01 to E22
Human | 81.5 | 86.4 | 77.5 | 72.8 | –
Prior | 2.9 | 23.8 | 27.9 | 8.8 | 10.11
Character ID MAX | 11.6 | 30.9 | 23.6 | 19.1 | –
Character ID DTW2 | 9.4 | 35.0 | 18.8 | 28.4 | –
Character ID DTW3 | 42.2 | 43.8 | 40.4 | 40.3 | 41.17
Subtitles DTW3 | 20.4 | 48.4 | 35.3 | 30.1 | 37.00
Char-ID + Subt. DTW3 | 40.8 | 51.3 | 41.4 | 47.6 | 49.16

$$\text{Accuracy} = \frac{\text{number of correctly assigned shots}}{\text{total number of shots}} \times 100\,\%$$
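The accuracy measure above as a small helper, a sketch that assumes `predicted` and `ground_truth` each hold one sentence index per shot:

```python
def alignment_accuracy(predicted, ground_truth):
    """Percentage of shots whose predicted sentence equals the ground-truth sentence."""
    if len(predicted) != len(ground_truth):
        raise ValueError("expected one label per shot in both lists")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(ground_truth)


print(alignment_accuracy([0, 1, 0, 2], [0, 1, 1, 2]))  # 75.0
```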

Page 20

Alignment result

Page 21

Application

Story-based Retrieval

Demo

Page 22

Retrieval

Text query → Plot synopsis retrieval → Alignment → Results → Play video
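A minimal sketch of the retrieval pipeline: rank synopsis sentences against the text query, then use the shot-to-sentence alignment to turn the best sentences into time spans. Bag-of-words cosine similarity is a stand-in (the slide does not name a retrieval model), and the alignment and shot times (in seconds) are invented for the example.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag of lower-cased words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, sentences, alignment, shot_times, top_k=5):
    """Rank sentences by similarity to the query; map each to the span of its aligned shots."""
    q = bow(query)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(q, bow(sentences[i])), reverse=True)
    results = []
    for i in ranked[:top_k]:
        shots = [j for j, s in enumerate(alignment) if s == i]
        if shots:  # sentences with no aligned shot cannot be played back
            results.append((i, shot_times[shots[0]][0], shot_times[shots[-1]][1]))
    return results


sentences = ["Buffy awakens to find Dracula in her bedroom.",
             "Buffy and Dracula fight in a vicious battle."]
alignment = [0, 0, 1, 1, 1]                                   # sentence index per shot
shot_times = [(0, 4), (4, 9), (9, 15), (15, 20), (20, 26)]    # (start, end) per shot
print(retrieve("Dracula fight", sentences, alignment, shot_times, top_k=2))
```

Without stemming, a query such as "Buffy fights Dracula" would not match "fight", which is one reason a practical system normalizes word forms.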

Page 23

Retrieval performance

62 queries

Ground truth gives the episode, the minutes (m) in the episode, and the synopsis sentence number in parentheses.

Query | Ground truth: time and sentence | top 5? / Time
Buffy fights Dracula | E01: m35–36 (33) Buffy and Dracula fight in a vicious battle | Overlap
Toth’s spell splits Xander into two personalities | E03: m11–12 (7) The demon hits Xander with light from a rod … (8) … but then we see another Xander | ×
Willow teleports Glory away | E13: m39 (34) … before Willow and Tara perform a spell to teleport Glory somewhere else | Overlap
Glory sucks Tara’s mind | E19: m24–27 (15) Protecting Dawn, Tara refuses, and Glory drains Tara’s mind of sanity. | Overlap
Xander proposes Anya | E22: m24–27 (6) Xander proposes Anya | 2m44s

Page 24

Reaching the goal…

Conclusion

Story-based retrieval in TV series

Alignment of human-written descriptions to shots in video

Efficient solution based on dynamic programming

15+ hours of annotated video data

Page 25

Thank you!

Story-based Video Retrieval in TV series using Plot Synopses

Makarand Tapaswi

http://cvhci.anthropomatik.kit.edu/~mtapaswi

Downloads: https://cvhci.anthropomatik.kit.edu/projects/mma