Joint Summarization of Large-scale Collections of Web Images and Videos for Storyline Reconstruction. Gunhee Kim, Leonid Sigal, Eric P. Xing. June 16, 2014

Page 1:

Joint Summarization of Large-scale Collections of Web Images and Videos for Storyline Reconstruction

Gunhee Kim, Leonid Sigal, Eric P. Xing

June 16, 2014

Page 2:

Outline

• Problem Statement

• Algorithm

  - Video summarization

  - Storyline reconstruction

• Experiments

• Conclusion

Page 3:

Background

Online photo and video sharing has become extremely popular

Information overload problem in visual data

• An average of 3,000 pictures are uploaded per minute

• 100 hours of video are uploaded per minute

Is there an efficient and comprehensive way to summarize them?

Page 4:

Our Objective

Jointly summarize large sets of online images and videos

• The characteristics of the two media are complementary

Videos: much redundant and noisy information

• Figure: a user video, with example frames labeled "backlit subjects", "full of trivial background", and "overexposure"

Images: more carefully taken, from canonical viewpoints

• Figure: a set of photo streams

Collections of images → video summarization

Page 5:

Our Objective

Jointly summarize large sets of online images and videos

• The characteristics of the two media are complementary

Images: sequential structure is often missing

• Figure: a photo stream

Videos: motion pictures

• Figure: a set of user videos

Collections of videos → image summarization

Page 6:

Problem Statement

(Input) A set of photo streams and user videos for a topic of interest

(Output 1) Video summary: keyframe-based summarization

(Output 2) Image summary as a storyline graph

• Vertices: dominant image clusters

• Edges: chronological or causal relations (i.e., transitions that recur in many photo streams)

Page 7:

Flickr and YouTube Dataset

20 outdoor recreational classes

• Surfing Beach, Horse Riding, Rafting, Yacht, Air Ballooning

• Rowing, Scuba Diving, Formula One, Snowboarding, Safari Park

• Mountain Camping, Rock Climbing, Tour de France, London Marathon, Fly Fishing

• Independence Day, Chinese New Year, Memorial Day, St. Patrick's Day, Wimbledon

• # images / photo streams: 2,769,504 / 35,545

• # videos: 15,912

Page 8:

Outline

• Problem Statement

• Algorithm

  - Video summarization

  - Storyline reconstruction

• Experiments

• Conclusion

Page 9:

Algorithm for Video Summarization

1. For each video, find the K nearest photo streams

• Extreme diversity even with the same keywords

• Use the Naive-Bayes Nearest Neighbor (NBNN) method

• Figure: a user video and a set of photo streams

2. Build a similarity graph between video frames and images
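
Step 1 above can be sketched as follows. This is a minimal NumPy illustration of Naive-Bayes Nearest Neighbor ranking, assuming each video and each photo stream is already represented as an array of local descriptors. The function name `nbnn_nearest_streams`, the brute-force distance computation, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def nbnn_nearest_streams(video_desc, stream_descs, k):
    """Rank photo streams for one video by NBNN distance; return the k best indices.

    video_desc:   (n, d) array of descriptors taken from the video's frames.
    stream_descs: list of (m_j, d) arrays, one per photo stream.
    """
    scores = []
    for desc in stream_descs:
        # Squared Euclidean distance from every video descriptor to every stream descriptor.
        diff = video_desc[:, None, :] - desc[None, :, :]
        sq_dist = (diff ** 2).sum(axis=2)
        # NBNN score: sum, over video descriptors, of the distance to the nearest stream descriptor.
        scores.append(sq_dist.min(axis=1).sum())
    order = np.argsort(scores)
    return [int(j) for j in order[:k]]

# Toy data (hypothetical): the video lies closest to stream 1, then stream 2.
rng = np.random.default_rng(0)
video = rng.normal(0.0, 0.1, size=(20, 8))
streams = [rng.normal(c, 0.1, size=(30, 8)) for c in (5.0, 0.0, 2.0)]
nearest = nbnn_nearest_streams(video, streams, k=2)
```

The brute-force distance matrix is only practical for small inputs; at the scale of the dataset, an approximate nearest-neighbor index would be the more likely choice.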

Page 10:

Algorithm for Video Summarization

1. For each video, find the K nearest photo streams

• Extreme diversity even with the same keywords

• Use the Naive-Bayes Nearest Neighbor (NBNN) method

• Figure: a user video and a set of photo streams

2. Build a similarity graph between video frames and images

• k-th order Markov chain between frames

• Each image casts m similarity votes
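
Step 2 can be sketched as below, assuming frames and images share one feature space. The Gaussian similarity, the function name `build_similarity_graph`, and the default values of `k` and `m` are assumptions for illustration; the slides do not specify the features or the similarity measure.

```python
import numpy as np

def build_similarity_graph(frame_feats, image_feats, k=2, m=3):
    """Return a symmetric weight matrix over the nodes [frames..., images...]."""
    n_f = len(frame_feats)
    feats = np.vstack([frame_feats, image_feats])
    n = len(feats)
    W = np.zeros((n, n))

    def sim(a, b):
        # Gaussian similarity on feature distance (an assumed choice).
        return float(np.exp(-np.sum((feats[a] - feats[b]) ** 2)))

    # k-th order Markov chain: link each frame to the next k frames.
    for i in range(n_f):
        for j in range(i + 1, min(i + k, n_f - 1) + 1):
            W[i, j] = W[j, i] = sim(i, j)

    # Each image casts m votes: link it to its m most similar frames.
    for img in range(n_f, n):
        sims = np.array([sim(img, f) for f in range(n_f)])
        for f in np.argsort(-sims)[:m]:
            W[img, f] = W[f, img] = sims[f]
    return W

# Toy data (hypothetical): 6 frames and 2 images with 4-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(scale=0.5, size=(6, 4))
images = rng.normal(scale=0.5, size=(2, 4))
W = build_similarity_graph(frames, images, k=2, m=3)
```

In this sketch images are never linked to each other; they only add weight to the frames they resemble, which is how carefully taken photos can steer keyframe selection.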

Page 11:

Algorithm for Video Summarization

3. Solve the following diversity-ranking optimization problem

• Choose the nodes at which to place heat sources so as to maximize the temperature

• Sources should be (i) densely connected nodes and (ii) distant from one another

• The objective is submodular [Kim et al. ICCV 2011]

• A simple greedy algorithm achieves a constant-factor approximation

• Figure: a user video and a set of photo streams
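
The heat-source intuition can be illustrated with a small greedy sketch. The exact objective on the slide did not survive extraction, so the formulation below is an assumption: sources are held at temperature 1, every node leaks heat to a ground node at temperature 0 with conductance `z`, and the greedy step adds whichever node raises the total temperature most. The names `temperatures` and `greedy_sources` are hypothetical.

```python
import numpy as np

def temperatures(W, sources, z=0.1):
    """Steady-state temperatures with sources fixed at 1 and a ground node at 0."""
    n = W.shape[0]
    src = sorted(sources)
    rest = [i for i in range(n) if i not in sources]
    u = np.zeros(n)
    u[src] = 1.0
    if src and rest:
        degree = W.sum(axis=1)
        # Heat balance at each non-source node: (degree_i + z) * u_i = sum_j W_ij * u_j.
        A = np.diag(degree[rest] + z) - W[np.ix_(rest, rest)]
        b = W[np.ix_(rest, src)].sum(axis=1)
        u[rest] = np.linalg.solve(A, b)
    return u

def greedy_sources(W, k, z=0.1):
    """Greedily pick k source nodes, each maximizing the total temperature."""
    chosen = []
    for _ in range(k):
        candidates = [i for i in range(W.shape[0]) if i not in chosen]
        totals = [temperatures(W, set(chosen) | {c}, z).sum() for c in candidates]
        chosen.append(candidates[int(np.argmax(totals))])
    return chosen

# Toy graph (hypothetical): two tight clusters {0,1,2} and {3,4,5}, weakly linked.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
picked = greedy_sources(W, k=2)
```

On the toy graph the two selected sources fall in different clusters: a second source next to the first adds little heat, which is the "distant from one another" behavior described above.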

Page 12:

Outline

• Problem Statement

• Algorithm

  - Video summarization

  - Image summarization (Storyline reconstruction)

• Experiments

• Conclusion

Page 13:

Definition of Storyline Graphs

A storyline graph consists of a vertex set and an edge set

• Vertex set: the set of codewords (i.e., image clusters)

  - There are too many images, and many of them are largely redundant

• Edge set: popular transitions recurring across many photo streams

Edges should be sparse and time-varying [Song et al. 09, Kolar et al. 10]

Sparsity: only a small number of branching stories per node

• Only a few nonzero elements in the adjacency matrix

Page 14:

Definition of Storyline Graphs

A storyline graph consists of a vertex set and an edge set

• Vertex set: the set of codewords (i.e., image clusters)

  - There are too many images, and many of them are largely redundant

• Edge set: popular transitions recurring across many photo streams

Edges should be sparse and time-varying [Song et al. 09, Kolar et al. 10]

Time-varying: popular transitions change over time

• Figure: a timeline (t = 10 AM, 12 PM, 2 PM) with transitions among clusters 10, 25, and 44, and example graphs at 1 PM and at 7 PM

Page 15:

Directed Tree Derived from Photo Stream

1. For each photo stream, find the K nearest videos

• Use the Naive-Bayes Nearest Neighbor (NBNN) method

2. Build a k-th order Markov chain between images in the photo stream

3. Detect keyframes in each neighbor video

4. Connect additional links based on one-to-one correspondences

Page 16:

Directed Tree Derived from Photo Stream

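5. Replace the vee structure (an impractical artifact) with two parallel edges

• Intended meaning: each of the two parent images is followed by the same child image.

• ✗ Artifact of the vee structure: both parent images must occur in order for the child image to appear.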

Page 17:

Inferring Photo Storyline Graphs (1/3)

Input: A set of photo streams

Output: A set of adjacency matrices A^t, one for each time t

Objective: Derive the likelihood of an observed set of photo streams with reasonable assumptions

(A1) All photo streams are taken independently

Likelihood of a single photo stream

(A2) k-th order Markovian assumption between consecutive images in a photo stream (e.g., k = 1)

(A3) The codewords of x^l_i are conditionally independent of one another given x^l_{i-1}

Transition model
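
The slide's equations did not survive extraction. The block below is a sketch of the factorization that assumptions (A1) to (A3) imply, not a copy of the original formulas. The notation is assumed: P^l is the l-th photo stream, L the number of streams, N_l the length of stream l, D the number of codewords, and x^l_{i,d} the d-th entry of x^l_i.

```latex
\begin{align}
P(\mathcal{P}^1,\dots,\mathcal{P}^L) &= \prod_{l=1}^{L} P(\mathcal{P}^l) && \text{(A1)}\\
P(\mathcal{P}^l) &= P(x^l_1)\prod_{i=2}^{N_l} P(x^l_i \mid x^l_{i-1}) && \text{(A2, } k=1\text{)}\\
P(x^l_i \mid x^l_{i-1}) &= \prod_{d=1}^{D} P(x^l_{i,d} \mid x^l_{i-1}) && \text{(A3)}
\end{align}
```

The last factor is the transition model, so the whole likelihood reduces to a product of per-codeword transition terms.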

Page 18:

Inferring Photo Storyline Graphs (2/3)

Objective: Derive the likelihood of an observed set of photo streams with reasonable assumptions

Transition model: use a linear dynamic model with Gaussian noise

• 1st-order Markovian assumption

• k-th order Markovian assumption

A transition from x to y is very unlikely!

Page 19:

Inferring Photo Storyline Graphs (3/3)

Objective: Derive the likelihood of an observed set of photo streams with reasonable assumptions

Transition model: use a linear dynamic model with Gaussian noise

• 1st-order Markovian assumption

• The transition model can be written per dimension d, using the d-th row of the adjacency matrix

The log-likelihood
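
Because the formulas on these slides were lost, the following is a sketch of what a linear dynamic model with Gaussian noise looks like under the 1st-order assumption. The isotropic noise variance σ² and the notation A_{d*} for the d-th row of the adjacency matrix A are assumptions.

```latex
\begin{align}
x^l_i &= A\,x^l_{i-1} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)\\
x^l_{i,d} &= A_{d*}\,x^l_{i-1} + \varepsilon_d, \qquad \varepsilon_d \sim \mathcal{N}(0, \sigma^2)\\
\log P(x^l_{i,d} \mid x^l_{i-1}) &= -\frac{1}{2\sigma^2}\bigl(x^l_{i,d} - A_{d*}\,x^l_{i-1}\bigr)^2 - \frac{1}{2}\log(2\pi\sigma^2)
\end{align}
```

Under this form, maximizing the log-likelihood over row A_{d*} is a least-squares problem, and each row can be estimated separately.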

Page 20:

Optimization (1/2)

• (A4) Graphs vary smoothly over time.

For each t, estimate A^t by maximizing the log-likelihood

• Gaussian kernel weighting of the data (i.e., images) along the timeline
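
The kernel weighting can be sketched as follows: observations taken close to the target time t receive large weights, so the estimate of A^t changes smoothly as t moves. The function name `gaussian_kernel_weights`, the bandwidth value, and the timestamps in hours are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel_weights(timestamps, t, bandwidth):
    """Normalized Gaussian weights that emphasize observations made near time t."""
    timestamps = np.asarray(timestamps, dtype=float)
    w = np.exp(-0.5 * ((timestamps - t) / bandwidth) ** 2)
    return w / w.sum()

# Toy timestamps in hours of the day (hypothetical).
hours = [10.0, 12.0, 12.5, 14.0, 19.0]
weights = gaussian_kernel_weights(hours, t=12.0, bandwidth=1.0)
```

This sketch treats time as a plain number line and does not wrap around midnight; a smaller bandwidth gives graphs that vary faster over time.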

Page 21:

Optimization (2/2)

In summary, graph inference amounts to iteratively solving a weighted L1-regularized least-squares problem

• The L1 term enforces sparsity

• Trivially parallelizable (for each d)

• Linear-time algorithm (e.g., coordinate descent)

• Important in our problem (i.e., handling millions of images)
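
A weighted L1-regularized least-squares problem for a single row d can be solved by coordinate descent, sketched below. The objective scaling, the function names `soft_threshold` and `weighted_lasso_row`, the fixed iteration count, and the toy data are assumptions; the slide's exact formulation was not recoverable.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink v toward zero by lam; values within [-lam, lam] become exactly zero."""
    return np.sign(v) * max(abs(v) - lam, 0.0)

def weighted_lasso_row(X, y, w, lam, n_iter=200):
    """Coordinate descent for min_a 0.5 * sum_n w_n (y_n - X_n a)^2 + lam * ||a||_1.

    X: (n, p) previous-image codeword vectors, y: (n,) next-image values for
    dimension d, w: (n,) nonnegative observation weights (e.g., kernel weights).
    """
    n, p = X.shape
    a = np.zeros(p)
    residual = y - X @ a
    for _ in range(n_iter):
        for j in range(p):
            col = X[:, j]
            denom = np.sum(w * col * col)
            if denom == 0.0:
                continue
            # Weighted correlation of column j with the residual that excludes its own term.
            rho = np.sum(w * col * (residual + col * a[j]))
            new_aj = soft_threshold(rho, lam) / denom
            residual += col * (a[j] - new_aj)
            a[j] = new_aj
    return a

# Toy problem (hypothetical): only two of five coefficients are truly nonzero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
a_true = np.array([1.5, 0.0, 0.0, -2.0, 0.0])
y = X @ a_true + 0.01 * rng.normal(size=200)
a_hat = weighted_lasso_row(X, y, w=np.ones(200), lam=5.0)
```

Each row d is solved independently, which is why the inference parallelizes trivially, and the soft-thresholding step is what drives most entries of the row to exactly zero.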

Page 22:

Outline

• Problem Statement

• Algorithm

  - Video summarization

  - Storyline reconstruction

• Experiments

• Conclusion

Page 23:

Evaluation of Video Summarization via AMT

Methods compared

• (OursV): our method with videos only

• (OursIV): our method with videos and images

• (Unif): uniform sampling

• (Spect), (Kmeans): spectral clustering and K-means

• (RankT): keyframe extraction using the rank-tracing technique

Ground truth for video summarization via Amazon Mechanical Turk

• (1) For each of 100 test videos, each algorithm selects K keyframes

• (2) At least five turkers are asked to choose GT keyframes

• (3) Compare the GT keyframes with the ones chosen by the algorithm

Page 24:

Comparison of Video Summarization

Figure: keyframes for air+ballooning and fly+fishing chosen by AMT, (OursIV), (OursV), (Kmeans), and (Unif)

• (OursIV): benefits from the votes cast by more carefully taken images

• (OursV): suffers from the limitations of using low-level features only

• (Kmeans): hard to know the best K

• (Unif): cannot correctly handle different lengths of subshots

Page 25:

Evaluation on Storyline Graphs via AMT

Main difficulty of quantitative evaluation

• No ground truth is available.

• For a human subject, there are too many images and the graphs are too big

Crowdsourcing-based evaluation via Amazon Mechanical Turk

• Figure: two storyline graphs for fly+fishing, with the question "Which is better?"

Page 26:

Evaluation on Storyline Graphs via AMT

1. Each algorithm creates a storyline graph per topic.

2. Sample 100 important images as test images.

3. Each algorithm predicts the most likely next image after the test image.

4. A pairwise preference test

• Given the test image, which of A and B is more likely to come next?

• Get responses from at least 3 turkers per test image

• Figure: a test image and two candidate next images, A and B; labels: ✔ Our method, Baseline 2

A crowd of human subjects evaluates only a basic unit (i.e., important edges of the storyline).

Page 27:

Quantitative Evaluation of Storyline Graphs via AMT

Results of pairwise preference tests

• The numbers indicate the percentage of responses judging that our prediction is more likely to occur next.

• The number should be higher than 50% to validate the superiority of our algorithm.

Methods compared

• (OursV): our method with videos only

• (OursIV): our method with videos and images

• (NET): network-based topic models [Kim et al. 2008]

• (HMM): hidden Markov models

• (Page): PageRank-based image retrieval (no structural information)
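
The percentage reported in the pairwise preference tests can be computed as in this small sketch. The response labels `"ours"` and `"baseline"`, the function name `preference_percentage`, and the sample responses are hypothetical and do not reproduce any result from the talk.

```python
def preference_percentage(responses):
    """Percentage of turker responses that prefer our prediction over the baseline's."""
    if not responses:
        raise ValueError("no responses")
    ours = sum(1 for r in responses if r == "ours")
    return 100.0 * ours / len(responses)

# Hypothetical responses for one baseline, pooled over test images.
responses = ["ours", "ours", "baseline", "ours", "baseline", "ours"]
pct = preference_percentage(responses)
```

A value above 50 means turkers preferred our prediction more often than the baseline's.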

Page 28:

Qualitative Evaluation on Storyline Graphs

Given a pair of images in a novel photo stream, predict 10 images that are likely to occur between them using its storyline graph

• (HMM) retrieves reasonably good but highly redundant images, with no branching structure.

• (PageRank) retrieves high-quality images but with no sequential structure.

• Figure: image rows for GT, Ours, (HMM), and (PageRank)

Page 29:

Qualitative Evaluation on Storyline Graphs

Given a pair of images in a novel photo stream, predict 10 images that are likely to occur between them using its storyline graph

• Figure: image rows for GT and Ours, with a downsized storyline graph

Page 30:

Outline

• Problem Statement

• Algorithm

  - Video summarization

  - Storyline reconstruction

• Experiments

• Conclusion

Page 31:

Conclusion

Joint summarization of Flickr images and YouTube videos

• 2.7M Flickr images and 17K YouTube videos for 20 classes

• The characteristics of the two media are complementary

  - Images: more carefully taken, from canonical viewpoints

  - Videos: motion pictures

• Semantic summary even with simple feature similarity

• Structural summary with branching narratives

Inference algorithm for sparse time-varying directed graphs

• Global optimality, linear complexity, and easy parallelization

Page 32:

Thank you!