

Gaze-enabled Egocentric Video Summarization via Constrained Submodular Maximization

Jia Xu†, Lopamudra Mukherjee§, Yin Li‡, Jamieson Warner†, James M. Rehg‡, Vikas Singh†
†University of Wisconsin-Madison, §University of Wisconsin-Whitewater, ‡Georgia Institute of Technology

MOTIVATION

Egocentric video in real life:
- Life-logging with wearable cameras: GoPro, Google Glass
- Memory aid for the aging population (Alzheimer's disease)

Need for summarization:
- Personalization ⇐ Gaze
- Efficient algorithms ⇐ Submodular optimization

A submodular summarization model for egocentric videos, capturing common-sense properties of a good summary: relevance, diversity, compactness, and personalization.

PROBLEM FORMULATION

Relevance and Diversity Measurement:
- Mutual information:

  $$M(V \setminus S; S) = H(V \setminus S) - H(V \setminus S \mid S) = H(V \setminus S) + H(S) - H(V)$$

- Entropy:

  $$H(S) = \frac{1 + \log(2\pi)}{2}\,|S| + \frac{1}{2}\log\det(L_S)$$

  where $L_S$ is the principal submatrix of $L$ indexed by $S$.
- Maximizing $M(V \setminus S; S)$ is equivalent to maximizing

  $$M(S) = \frac{1}{2}\log\det(L_{V \setminus S}) + \frac{1}{2}\log\det(L_S)$$

  since $|S| + |V \setminus S| = n$ and $H(V)$ is constant.

Relation to Determinantal Point Process:
Positive semidefinite kernel matrix $L$ indexed by elements of $V$:

$$L_{ij} = \frac{v_i^{T} v_j}{\|v_i\|\,\|v_j\|}$$

Diversity score for $S \subseteq V$:

$$D(S) = \log\det(L_S)$$

Partition Matroid Constraint:

- High-level supervision: timeline
- Partition the video into $b$ disjoint blocks $P_1, P_2, \ldots, P_b$
- Compactness: an allocation bound for each block:

  $$\mathcal{I} = \{A : |A \cap P_m| \le f_m,\ m = 1, 2, \ldots, b\}$$
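The quantities in this section can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function names, the optional `ridge` term (added so the log-determinants stay finite when the kernel is singular), and the toy data are all assumptions made for this example.

```python
import numpy as np

def cosine_kernel(features, ridge=0.0):
    """L_ij = v_i^T v_j / (||v_i|| ||v_j||) for row-vector features.

    `ridge` is an assumption of this sketch: it adds ridge * identity
    so that L is positive definite even when rows are linearly dependent.
    """
    features = np.asarray(features, dtype=float)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit.T + ridge * np.eye(len(unit))

def log_det(L, idx):
    """log det of the principal submatrix of L indexed by idx (0 for the empty set)."""
    idx = sorted(idx)
    if not idx:
        return 0.0
    sign, value = np.linalg.slogdet(L[np.ix_(idx, idx)])
    return float(value) if sign > 0 else float("-inf")

def relevance_diversity(L, S):
    """M(S) = 1/2 log det(L_{V\\S}) + 1/2 log det(L_S)."""
    rest = set(range(L.shape[0])) - set(S)
    return 0.5 * log_det(L, rest) + 0.5 * log_det(L, S)

def is_independent(S, blocks, bounds):
    """Partition matroid test: |S ∩ P_m| <= f_m for every block P_m."""
    return all(len(set(S) & set(P)) <= f for P, f in zip(blocks, bounds))

# Toy usage: 6 hypothetical subshots with 8-dimensional features, two blocks.
rng = np.random.default_rng(0)
L = cosine_kernel(rng.normal(size=(6, 8)))
blocks, bounds = [{0, 1, 2}, {3, 4, 5}], [1, 2]
print(relevance_diversity(L, {0, 3}), is_independent({0, 3}, blocks, bounds))
```

Note that $M(S) = M(V \setminus S)$ by construction, so the objective is symmetric and in general not monotone in $S$.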

WHAT CAN HELP THE SUMMARIZATION?

Gaze in egocentric video summarization:
- Better temporal segmentation: egocentric video is continuous, but gaze is discrete
- Personalization: fixation tells intention

[Figure: gaze timeline labeled fixation, fixation, saccade, saccade, saccade, fixation, fixation, with κ values 0.91, 0.85, 0.5, 0.89, 0.81; a threshold marker separates subshot 1 from subshot 2.]

Attention Measurement using Gaze Fixations:

$$I(S) = \sum_{i \in S} c_i$$
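As a small illustration of the attention term, the sketch below assumes that $c_i$ is the number of fixation frames falling inside subshot $i$, which is how the pipeline's "aggregate fixation counts for each subshot" step reads. The function names and the frame-index representation are hypothetical.

```python
from bisect import bisect_right

def fixation_counts(fixation_frames, subshot_starts):
    """Count fixation frames per subshot.

    subshot_starts: sorted first-frame index of each subshot (assumed layout).
    Frames before the first subshot are ignored.
    """
    counts = [0] * len(subshot_starts)
    for frame in fixation_frames:
        if frame < subshot_starts[0]:
            continue
        # Index of the last subshot whose start is <= frame.
        counts[bisect_right(subshot_starts, frame) - 1] += 1
    return counts

def attention(S, c):
    """I(S) = sum of c_i over the selected subshots; modular (additive) in S."""
    return sum(c[i] for i in S)

# Toy usage: three subshots starting at frames 0, 10 and 20.
c = fixation_counts([0, 3, 10, 11, 25], [0, 10, 20])
print(c, attention({0, 2}, c))
```

Because $I(S)$ is a plain sum, the attention of a union of disjoint sets is the sum of their attentions.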

SUBMODULAR SUMMARIZATION

[Figure: pipeline. Input → Subshot Segmentation with Gaze → Subshots → Building Subshot Representation (extract R-CNN feature for each subshot keyframe; compute covariance matrix for subshots; aggregate fixation counts for each subshot) → Data Representation → Constrained Submodular Maximization → Final Summary, shown on a timeline from 1:00PM to 5:00PM.]

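Main Model:

$$\max_{S}\ F(S) = M(S) + \lambda I(S) \quad \text{s.t.}\ S \in \mathcal{I}$$

Corollary: $F(S)$ is submodular.

Algorithm 1: Local Search for Constrained Submodular Maximization

    Input: $\mathcal{M} = (V, \mathcal{I})$, $F$, $\varepsilon \ge 0$
    Initialize $S \leftarrow \emptyset$
    while any of the following local operations applies, update $S$ accordingly:
        Add operation. If there is $e \in V \setminus S$ such that $S \cup \{e\} \in \mathcal{I}$ and $F(S \cup \{e\}) - F(S) > \varepsilon$, then $S = S \cup \{e\}$.
        Swap operation. If there are $e_i \in S$ and $e_j \in V \setminus S$ such that $(S \setminus \{e_i\}) \cup \{e_j\} \in \mathcal{I}$ and $F((S \setminus \{e_i\}) \cup \{e_j\}) - F(S) > \varepsilon$, then $S = (S \setminus \{e_i\}) \cup \{e_j\}$.
        Delete operation. If there is $e \in S$ such that $F(S \setminus \{e\}) - F(S) > \varepsilon$, then $S = S \setminus \{e\}$.
    end while
    return $S$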
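The add, swap, and delete moves of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode under stated assumptions, not the authors' code: it takes the first improving move it finds, assumes the kernel `L` is positive definite so every log-determinant is finite, and uses hypothetical helper names (`objective`, `feasible`, `neighbors`).

```python
import numpy as np

def log_det(L, idx):
    """log det of the principal submatrix of L indexed by idx (0 for the empty set)."""
    idx = sorted(idx)
    if not idx:
        return 0.0
    sign, value = np.linalg.slogdet(L[np.ix_(idx, idx)])
    return float(value) if sign > 0 else float("-inf")

def objective(L, c, lam, S):
    """F(S) = M(S) + lam * I(S), with M and I as defined on the poster."""
    rest = set(range(L.shape[0])) - set(S)
    m = 0.5 * log_det(L, rest) + 0.5 * log_det(L, S)
    return m + lam * sum(c[i] for i in S)

def feasible(S, blocks, bounds):
    """Partition matroid test: at most f_m selected elements in block P_m."""
    return all(len(set(S) & set(P)) <= f for P, f in zip(blocks, bounds))

def neighbors(S, n):
    """All sets reachable from S by one add, swap, or delete."""
    outside = [e for e in range(n) if e not in S]
    for e in outside:                      # add
        yield S | {e}
    for i in sorted(S):                    # swap
        for j in outside:
            yield (S - {i}) | {j}
    for e in sorted(S):                    # delete
        yield S - {e}

def local_search(L, c, lam, blocks, bounds, eps=1e-6):
    """Apply improving local moves until none gains more than eps."""
    n = L.shape[0]
    S = set()
    current = objective(L, c, lam, S)
    improved = True
    while improved:
        improved = False
        for T in neighbors(S, n):
            if not feasible(T, blocks, bounds):
                continue
            value = objective(L, c, lam, T)
            if value - current > eps:      # strict gain, so the loop terminates
                S, current, improved = T, value, True
                break
    return S

# Toy usage with 6 hypothetical subshots in two blocks, one pick per block.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8))
unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
summary = local_search(unit @ unit.T, [3, 0, 1, 2, 0, 4], 0.5,
                       [{0, 1, 2}, {3, 4, 5}], [1, 1])
print(sorted(summary))
```

Each accepted move raises $F$ by more than `eps` and there are finitely many subsets, so the loop always stops. The sketch returns a local optimum only and makes no claim about reproducing the approximation guarantee.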

Proposition: Greedy local search achieves a 1/4-approximation factor for our constrained submodular maximization problem.
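The guarantee relies on the corollary that $F$ is submodular. One way to see this, sketched here from standard facts rather than taken from the poster, assumes $L$ is positive definite so that every log-determinant is finite and takes $\lambda \ge 0$:

```latex
\begin{align*}
f(S) &= \log\det(L_S)
  && \text{submodular (Gaussian entropy up to a modular term)} \\
g(S) &= f(V \setminus S)
  && \text{submodular (complement of a submodular function)} \\
I(S) &= \sum_{i \in S} c_i
  && \text{modular} \\
F(S) &= \tfrac{1}{2}\, g(S) + \tfrac{1}{2}\, f(S) + \lambda I(S)
  && \text{submodular (non-negative sum of the above)}
\end{align*}
```

Since $M(S) = M(V \setminus S)$, the objective is generally not monotone, which is consistent with the algorithm including swap and delete moves in addition to add.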

DATA COLLECTION

- 5 subjects recorded their daily lives
- 21 videos with gaze
- 15 hours in total

Annotation: subjects group subshots into events.

EXPERIMENTAL EVALUATION

Systematic Evaluation:

    Method                      F-measure
    uniform                     0.161
    k-means                     0.215 ± 0.016
    uniform (our subshots)      0.526
    k-means (our subshots)      0.475 ± 0.026
    ours                        0.621

Table 1: Comparison of average F-measure on GTEA-GAZE+.

    Method                      F-measure
    uniform                     0.080
    k-means                     0.095 ± 0.030
    uniform (our subshots)      0.476
    k-means (our subshots)      0.509 ± 0.025
    ours                        0.585

Table 2: Comparison of average F-measure on our new EgoSum+gaze dataset.

Visual Results:

Figure 1: Results from GTEA-gaze+: pizza preparation. Rows: uniform, k-means, uniform (our subshots), k-means (our subshots), ours.

Figure 2: Results from our new EgoSum+gaze dataset: our subject mixes a shake, drinks it, washes his cup, plays chess and texts a friend. Rows: uniform, k-means, uniform (our subshots), k-means (our subshots), ours.

KEY TAKEAWAYS

- A first study of the role of gaze in egocentric video summarization
- An efficient submodular summarization model for egocentric videos

Computer Vision and Pattern Recognition, 2015. Acknowledgments: NSF RI 1116584, NSF CGV 1219016, NIH BD2K 1U54AI117924 and 1U54EB020404.