
Learning event representations in image sequences

by dynamic graph embedding

Mariella Dimiccoli and Herwig Wendt ∗

August 10, 2019

Abstract

Recently, self-supervised learning has proved to be effective to learn representations of events in image sequences, where events are understood as sets of temporally adjacent images that are semantically perceived as a whole. However, although this approach does not require expensive manual annotations, it is data hungry and suffers from domain adaptation problems. As an alternative, in this work, we propose a novel approach for learning event representations named Dynamic Graph Embedding (DGE). The assumption underlying our model is that a sequence of images can be represented by a graph that encodes both semantic and temporal similarity. The key novelty of DGE is to learn jointly the graph and its graph embedding. At its core, DGE works by iterating over two steps: 1) updating the graph representing the semantic and temporal structure of the data based on the current data representation, and 2) updating the data representation to take into account the current data graph structure. The main advantage of DGE over state-of-the-art self-supervised approaches is that it does not require any training set, but instead learns iteratively from the data itself a low-dimensional embedding that reflects its temporal and semantic structure. Experimental results on two benchmark datasets of real image sequences captured at regular intervals demonstrate that the proposed DGE leads to effective event representations. In particular, it achieves robust temporal segmentation on the EDUBSeg and EDUBSeg-Desc benchmark datasets, outperforming the state of the art.

∗ M. Dimiccoli is with the Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain ([email protected]). H. Wendt is with the Institut de Recherche en Informatique de Toulouse, CNRS, University of Toulouse, France ([email protected]).

Figure 1: Temporally adjacent frames in two events in first person image sequences.


Index Terms: clustering, graph embedding, geometric learning, temporal context prediction, temporal segmentation, event representations

1 Introduction

Temporal segmentation of videos and image sequences has a long history of research, since it is crucial not only to video understanding but also to video browsing, indexing and summarization [20, 32, 38]. With the proliferation of wearable cameras in recent years, the field is facing new challenges.


Indeed, wearable cameras allow capturing, from a first-person (egocentric) perspective, and "in the wild", long unconstrained videos (≈35 fps) and image sequences (aka photostreams, ≈2 fpm). Due to their low temporal resolution, the segmentation of first-person image sequences is particularly challenging. Indeed, abrupt changes in appearance may arise within an event due to sudden camera movements, and event transitions that are smooth at higher sampling rates due to continuous recording are reduced to a few frames that are difficult to detect. While for a human observer it is relatively easy to segment egocentric image sequences into discrete units, this poses serious difficulties for automated temporal segmentation (see Fig. 1 for an illustration).

Given the limited amount of annotated data, current state-of-the-art approaches for the temporal segmentation of first-person image sequences aim at obtaining event representations by encoding the temporal context of each frame of a sequence in an unsupervised fashion [13, 19]. These methods rely on neural or recurrent neural networks and are generally based on the idea of learning event representations by training the network to predict past and future frame representations. Recurrent neural networks have proved to be more efficient than simple neural networks for the temporal prediction task. The main limitation of these approaches is that they must rely on large training datasets to yield state-of-the-art performance. Even if, in this case, the training data do not require manual annotations, they can still introduce a bias and suffer from the domain adaptation problem. For instance, in the case of temporal segmentation of image sequences, the results will be difficult to generalize to data acquired with a camera with a different field of view or for people having different lifestyles.

In this paper, we aim at overcoming this limitation with a novel approach that is able to unveil a representation encoding the temporal structure of an image sequence from the single sequence itself. With this goal in mind, we propose to learn event representations as an embedding on a graph. Our model is based on the assumption that each event belongs to a community structure of semantically similar events on an unknown underlying graph, where communities are understood here as sets of nodes that are interconnected by edges with large weights.

Figure 2: The assumption underlying our learning approach is that image sequences captured at regular intervals (2 fpm) can be organized into a graph structure, where each community structure in the graph corresponds to a particular temporal context that captures both temporal and semantic relatedness. Points of the same color in the figure are related semantically and may be more or less related at the temporal level. The arrows indicate temporal transitions between semantic clusters. They serve visualization purposes only, since temporal transitions are between pairs of points.

Moreover, the graph weights reflect not only temporal proximity, but also semantic similarity. This is motivated by neuroscientific findings showing that neural representations of events arise from temporal community structures [46]. In other words, frames that share the temporal context are grouped together in the representational space. In Fig. 2 we illustrate the idea by means of an egocentric image sequence capturing the full day of a person: going from home to work using public transport, having a lunch break in a restaurant, going back home after doing some shopping, etc. Each point cloud corresponds to images similar in appearance, and most of them are visited multiple times. This means that every pair of images in a point cloud is related semantically, but they may or may not be related at the temporal level.

Based on this model, the proposed solution consists in learning simultaneously the graph structure (encoded by its weights) and the data representation. This is achieved by iterating over two alternate steps: 1) the update of the graph structure as a function of the current data representation, where the graph structure is assumed to encode a finite number of community structures, and 2) the update of the data representation as a function of the current graph structure in a low-dimensional embedding space. We term this solution dynamic graph embedding (DGE). We provide illustrative experiments on synthetic data, and we validate the proposed approach on two real-world benchmark datasets for first-person image sequence temporal segmentation. Our framework is the first attempt to learn simultaneously graph structure and data representations for temporal segmentation of image sequences.

Our main contributions are: (i) we re-frame the event learning problem as the problem of learning a graph embedding, (ii) we introduce an original graph initialization approach based on the concept of temporal self-similarity, (iii) we propose a novel technical approach to solve the graph embedding problem when the underlying graph structure is unknown, (iv) we demonstrate that the learnt graph embedding is suitable for the task of temporal segmentation, achieving state-of-the-art results on two challenging reference benchmark datasets [4, 14], without relying on any training set for learning the temporal segmentation model.

The structure of the paper is as follows. Section 2 highlights related work on data representation learning on graphs and on the temporal segmentation of videos and image sequences. In Section 3.1 we introduce our problem formulation, while in Sections 3.2 to 3.5 we detail the proposed graph embedding model. The performance of our algorithm on real-world data is evaluated in Section 4. In Section 5, we conclude on our contributions and results.

2 Related work

2.1 Geometric learning

The proposed approach lies in the field of geometric learning, which is an umbrella term for techniques that work in non-Euclidean domains such as graphs and manifolds. Following [5], geometric learning approaches either address the problem of characterizing the structure of the data, or deal with analyzing functions defined on a given non-Euclidean domain. In the former case, which is more closely related to the method proposed in this paper, the goal is to learn an embedding of the data in a low-dimensional space, such that the geometric relations in the embedding space reflect the graph structure. These methods are commonly referred to as node embedding and can be understood from an encoder-decoder perspective [25]. Given a graph G = (V, E), where V and E represent the set of nodes and edges of the graph respectively, the encoder maps each node v ∈ V of G into a low-dimensional space. The decoder is a function defined in the embedding space that acts on node pairs to compute a similarity S between the nodes. Therefore, the graph embedding problem can be formulated as the problem of optimizing decoder and encoder mappings to minimize the discrepancy between similarity values in the embedding and original feature space. Within the general encoder-decoder framework, node embedding algorithms can be roughly classified into two classes: shallow embeddings [1, 7, 10, 22, 41, 43, 50] and generalized encoder-decoder architectures [8, 24, 27, 52]. In shallow embedding approaches, the encoder function acts simply as a lookup function and the input nodes v_i ∈ V are represented as one-hot vectors, so that they cannot leverage node attributes during encoding.
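To make the encoder-decoder view concrete, the following minimal sketch (our illustration, not code from any of the cited works) fits a shallow, lookup-table embedding so that a sigmoid dot-product decoder reproduces the adjacency structure of a toy graph:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: two triangles joined by one edge; A is the target similarity.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
N, d = A.shape[0], 2

# Shallow encoder: a plain lookup table, one free d-dimensional vector
# per node (equivalently, a linear map applied to one-hot node codes).
Z = 0.1 * rng.standard_normal((N, d))

def decode(Z):
    """Decoder: sigmoid of the dot product of two node embeddings."""
    return 1.0 / (1.0 + np.exp(-Z @ Z.T))

# Minimize the squared discrepancy between decoded and target similarity
# by gradient descent on the lookup table (off-diagonal entries only).
mask = 1.0 - np.eye(N)
for _ in range(2000):
    S = decode(Z)
    E = 2.0 * (S - A) * S * (1.0 - S) * mask  # dLoss/d(Z Z^T)
    Z -= 0.1 * (E + E.T) @ Z                  # chain rule through Z Z^T

print(np.round(decode(Z), 1))  # approximately reproduces A off-diagonal
```

Here the encoder is nothing but the matrix Z itself, which is exactly why shallow methods cannot exploit node attributes: there is no function of the node content to learn, only one free vector per node.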

Instead, in generalized encoder-decoder architectures [8, 27, 52], the encoders depend more generally on the structure and attributes of the graph. In particular, convolutional encoders [24] generate embeddings for a node by aggregating information from its local neighborhood, in a manner similar to the receptive field of a convolutional kernel in computer vision. They rely on node features to generate embeddings and, as the process iterates, the node embedding contains information aggregated from further and further reaches of the graph. Closely related to convolutional encoders are Graph Neural Networks (GNNs). The main difference is that GNNs capture the dependence of graphs via message passing between the nodes of graphs and utilize node attributes and node labels to train model parameters end-to-end for a specific task in a semi-supervised fashion [12, 18, 26, 39].

In all these methods, the graph structure is assumed to be given by the problem domain. For instance, in social networks, the structure of the graph is given by the connections between people. However, in the case of temporal segmentation considered here, the problem is non-structural, since the graph structure is not given by the problem domain but needs to be determined together with the node embedding.

2.2 Event segmentation

Extensive research has been conducted to temporally segment videos and image sequences into events. Early approaches aimed at segmenting edited videos such as TV programs and movies [9, 31, 36, 37, 56] into commercial, news-related or movie events. More recently, with the advent of wearable cameras and camera-equipped smartphones, there has been an increasing interest in segmenting untrimmed videos or image sequences captured by nonprofessionals into semantically homogeneous units [28, 30, 51]. In particular, videos or image sequences captured by a wearable camera are typically long and unconstrained [3]; therefore it is important for the user to have them segmented into semantically meaningful chapters. In addition to appearance-based features [2, 54], motion features have been extensively used for temporal segmentation of both third-person videos [32, 51] and first-person videos [29, 33, 44, 47]. In [47], motion cues from a wearable inertial sensor are leveraged for temporal segmentation of human motion into actions. Lee and Grauman [33] used temporally constrained clustering of motion and visual features to determine whether the differences in appearance correspond to event boundaries or just to abrupt head movements. Poleg et al. [44] proposed to use integrated motion vectors to segment egocentric videos into a hierarchy of long-term activities, whose first level corresponds to static/transit activities.

However, motion information is not available in first-person image sequences, which are the main focus of this paper. In addition, given the limited amount of annotated data, event segmentation is very often performed by using a clustering algorithm relying on hand-crafted visual features [16, 17, 34, 35, 55]. Talavera et al. [49] proposed to combine agglomerative clustering with a change detection method within a graph-cut energy minimization framework. Later on, [14] extended this framework by improving the feature representations through building a vocabulary of concepts for representing each frame. Paci et al. [42] proposed a Siamese ConvNets based approach that aims at learning a similarity function between low-resolution egocentric images. Recently, del Molino et al. [19] proposed to learn event representations by training, in an unsupervised fashion, an LSTM-based model to predict the temporal context of each frame, which is initially represented by the output of the pre-pooling layer of InceptionV3. This simple approach has shown outstanding results on reference benchmark datasets [4, 14], as long as the model is trained on a large training dataset (over 1.2 million images). The method in [19] is similar to the chronologically earlier approach reported in [13], which proposed to train the model in a fully self-supervised fashion instead. Specifically, starting from a concept vector representation of each frame, the authors proposed a neural network based model and an LSTM-based model performing a self-supervised pretext task consisting in predicting the concept vectors of neighbor frames given the concept vector of the current frame. Consequently, unlike [19], the single image sequence itself is used to learn the features encoding the temporal context, without need for a training dataset. The performance achieved in [13] was less impressive than in [19].

Here, we propose a new model that, like [13], does not make use of any training set for learning the temporal event representation, but achieves state-of-the-art results. In particular, it outperforms [19] on the EDUBSeg and EDUBSeg-Desc benchmarks [4, 14].


3 Dynamic Graph Embedding (DGE)

3.1 Problem formulation and proposed model

We formulate the event learning problem as a geometric learning problem. More specifically, given a set of data points (the frames of the image sequence) embedded into a high-dimensional Euclidean space (the initial data representation), we assume that these data points are organized in an underlying low-dimensional graph structure. Our prior on the structure of the graph in the underlying low-dimensional space is that it consists of a finite number of community structures corresponding to different temporal contexts. Since, along an image sequence, the same temporal context can be visited several times at different time intervals, we assume that edges between nodes belonging to different communities correspond to transitions between different temporal contexts, whereas edges between nodes belonging to the same community correspond to transitions between nodes sharing the same temporal context, whether they are temporally adjacent or not. This structure implicitly assumes that the graph topology models jointly temporal and semantic similarity relations.

More formally, given a sequence of N images and their feature vectors X ∈ R^{N×n}, we aim at finding a fully connected, weighted graph G̃ = (X̃, E, W) with node embedding X̃ ∈ R^{N×d} in a low-dimensional space, d ≪ n, and edge weights W given by the entries of an affinity matrix G̃ = S(X̃), where S is a similarity kernel, such that the similarity G̃_kj between any pair (k, j) of nodes of the graph reflects both semantic relatedness and temporal adjacency between the images. Semantic relatedness is captured by a similarity function between semantic image descriptors, whereas temporal adjacency is imposed through temporal constraints injected in the graph structure. The above constraints lead to easily grouping the graph nodes into a finite number of clusters, each corresponding to a different temporal context.

As seen in the previous section, in classical node embedding the low-dimensional representation of each node encodes information about the position and the structure of the local neighborhood in the graph. Since all these methods incorporate graph structure in some way, the construction of the underlying graph is extremely important, but relatively little explored. In our problem at hand, the graph structure is initially unknown since it arises from events, which are themselves unknown. Therefore, we aim at learning jointly the structure of the underlying graph and the node embedding.

3.2 Graph initialization by nonlocal self-similarity

Temporal nonlocal self-similarity. Let X ∈ R^{N×n} be the set of N given data points in a high-dimensional space R^n, normalized to the interval [−1, 1]. To obtain a first coarse estimate of the graph, we apply a nonlocal self-similarity algorithm in the temporal domain to the initial data X [15]. The nonlocal self-similarity filtering creates temporal neighborhoods of frames that are likely to be in the same event. Let X(k) ∈ R^n denote the k-th row of X, that is, the vector of n image features at time k, k = 1, ..., N. Further, let N^M_k = {k−M, ..., k−1, k+1, ..., k+M} and N^L_k = {k−L, ..., k−1, k+1, ..., k+L} denote the indices of the 2M and 2L neighboring feature vectors of X(k), respectively, with L > M. In analogy with 2D data (images) [6], the self-similarity function of X(k) in a temporal sequence, conditioned on its temporal neighborhood j ∈ N^M_k, is given by the quantity [15]

S^{NL}(k,j) = \frac{1}{Z(k)} \exp\left( \frac{-\mathrm{dist}\big(X(N^M_k), X(N^M_j)\big)}{h} \right), \qquad (1)

where Z(k) is a normalizing factor such that \sum_{j \in N^L_k} S^{NL}_{kj} = 1, ensuring that S^{NL}_{kj} can be interpreted as a conditional probability of X(j) given X(N_k), as detailed in [6]; \mathrm{dist}(X(N_k), X(N_j)) = \sum_{i=1}^{2M} \|X(N_k(i)) - X(N_j(i))\|_{\ell_1} is the sum of the \ell_1 distances of the vectors in the neighborhoods of k and j; and h is the parameter that tunes the decay of the exponential function. The key idea of our graph initialization is to model each frame k by its denoised version, obtained as

\bar{X}(k) = \mathrm{NLmeans1D}(X(k)) = \sum_{j \in N^L_k} S^{NL}_{kj} \, X(j), \qquad \bar{X}_0 = \big(\bar{X}(1)^T \ \bar{X}(2)^T \ \cdots \ \bar{X}(N)^T\big)^T. \qquad (2)
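In code, Eqs. (1)-(2) amount to a one-dimensional nonlocal-means filter over the frame axis. The following numpy sketch is our transcription; border handling (here, clamping patch indices to the valid range) is our choice, as the paper does not specify it, and h must be scaled to the range of the features:

```python
import numpy as np

def nlmeans_1d(X, M=1, L=3, h=0.25):
    """Temporal nonlocal-means denoising of a feature sequence, Eqs. (1)-(2).

    X : (N, n) array, one n-dimensional feature vector per frame.
    Each frame k is replaced by a weighted average of the frames in its
    search window N^L_k, with weights driven by the l1 distance between
    the temporal patches (half-width M) surrounding the two frames.
    """
    N = X.shape[0]
    Xd = np.empty_like(X)
    for k in range(N):
        # candidate frames j in the search window N^L_k (k itself excluded)
        js = [j for j in range(max(0, k - L), min(N, k + L + 1)) if j != k]
        w = np.empty(len(js))
        for m, j in enumerate(js):
            d = 0.0
            for i in range(-M, M + 1):          # patch comparison, Eq. (1)
                if i == 0:
                    continue
                ki = min(max(k + i, 0), N - 1)  # clamp indices at the borders
                ji = min(max(j + i, 0), N - 1)
                d += np.abs(X[ki] - X[ji]).sum()
            w[m] = np.exp(-d / h)
        w /= w.sum()                            # normalization Z(k)
        Xd[k] = w @ X[js]                       # weighted average, Eq. (2)
    return Xd
```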

A numerical illustration on real data is provided in Fig. 4.

Initial graph and initial embedding. An initial graph G_0 is obtained by computing the R^{N×N} affinity matrix G_0 = S_l(X̄_0) of X̄_0, defined element-wise as the pairwise similarity

(S_l(X))_{kj} = \exp\left( \frac{-\mathrm{cdist}\big(X(j), X(k)\big)}{l} \right), \qquad (3)

where cdist(·, ·) is the cosine distance and l the filtering parameter of the exponential function. In the following, we will no longer distinguish between a graph G and its representation by its affinity matrix G, and will use both symbols synonymously. In our model, G_0 represents the initial data structure in the original high-dimensional space as a fully connected graph, from which we would like to learn the graph in the embedding space, say G̃, that better encodes temporal constraints. To obtain an initial embedding X̃_0 for the graph, we apply PCA on X̄_0, keep the d major principal components X̃, and minimize the cross-entropy loss L between the affinity matrices S_l̃(X̃) and G_0,

\tilde{X}_0 = \operatorname{argmin}_{\tilde{X}} L\big(S_{\tilde{l}}(\tilde{X}), G_0\big),

where the different filtering parameters l and l̃ account for the different dimensionality of X̄_0 and X̃_0. Even if PCA is a linear operator, and for small sets of high-dimensional vectors dual PCA could be more appropriate [21], we found it sufficient here for initializing the algorithm. The initial graph G̃_0 in the embedding space is then given by G̃_0 = S_l̃(X̃_0).
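The pieces of this initialization are compact enough to state directly; the sketch below (ours) implements the kernel of Eq. (3), the PCA step, and the cross-entropy loss L used to fit the initial embedding:

```python
import numpy as np

def cosine_affinity(X, l):
    """Affinity matrix of Eq. (3): exp(-cdist(X(j), X(k)) / l), element-wise."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    return np.exp(-(1.0 - Xn @ Xn.T) / l)  # cosine distance = 1 - cosine similarity

def pca(X, d):
    """Keep the d major principal components of X (the embedding initialization)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

def cross_entropy(P, Q, eps=1e-12):
    """Element-wise cross-entropy L(P, Q) between two affinity matrices in (0, 1]."""
    P = np.clip(P, eps, 1.0 - eps)
    return -(Q * np.log(P) + (1.0 - Q) * np.log(1.0 - P)).sum()
```

Starting from pca(X̄_0, d), any gradient-based optimizer can then decrease cross_entropy(cosine_affinity(X̃, l̃), G_0) over X̃; a dependency-free (if slow) variant is sketched in Section 3.3 below.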

3.3 DGE core alternating steps

Given the initial embedding X̃_0 and graphs G_0 and G̃_0, the main loop of our DGE algorithm alternates over the following two steps:

1. Assuming that G̃_{i−1} is fixed, update the node representations X̃_i.

2. Assuming that X̃_i is fixed, update the graph G̃_i.

ALGORITHM 1: Dynamic Graph Embedding (DGE)

N − length of the image sequence
n − original feature dimension
d − embedding feature dimension (d ≪ n)

Input:  X ∈ R^{N×n}, initial feature matrix
Output: X̃ ∈ R^{N×d}, graph-embedded feature matrix

/* Reference graph initialization, Eqs. (1)-(3) */
X̄_0 = NLmeans1D(X) ∈ R^{N×n}          // denoise initial features
G_0 = S_l(X̄_0) ∈ R^{N×N}              // initialize graph in original space

/* Graph embedding initialization, Eq. (3) */
X̃ = PCA_d(X̄_0) ∈ R^{N×d}
X̃_0 = argmin_{X̃} L(S_l̃(X̃), G_0)      // initialize embedding features
G̃_0 = S_l̃(X̃_0) ∈ R^{N×N}             // initialize graph in embedding space

/* DGE core loop, Eqs. (4)-(6) */
for i ← 1 to K do
    /* graph embedding update */
    X̃_i = argmin_{X̃} (1 − α) L_1(S_l̃(X̃), G̃_{i−1}) + α L_2(S_l̃(X̃), G_0)
                                       // update embedding features given current graph G̃_{i−1}
    /* graph structure update: temporal prior */
    G̃_i ← S_l̃(X̃_i) ∗ K_p              // local average of the weights of the graph of X̃_i
    G̃_i ← f_η(G̃_i)                    // strengthen temporally adjacent edges
    /* graph structure update: semantic prior */
    C = kmeans(X̃_i; N_C)              // estimate semantic communities
    G̃_i = g_µ(G̃_i, C; µ)              // encode semantic similarity in graph
end

Step (1) is inspired by graph embedding methods, such as the ones reviewed in Section 2.1, which have proved to be very good at encoding a given graph structure. Step (2) aims at enforcing temporal constraints and at fostering semantic similarity in the graph structure.

Graph embedding update. To estimate the graph embedding X̃_i at iteration i, assuming that G̃_{i−1} is given, we solve

\tilde{X}_i = \operatorname{argmin}_{\tilde{X}} \, (1-\alpha)\, L_1\big(S_{\tilde{l}}(\tilde{X}), \tilde{G}_{i-1}\big) + \alpha\, L_2\big(S_{\tilde{l}}(\tilde{X}), G_0\big). \qquad (4)

Here L_1 and L_2 are cross-entropy losses and S_l̃(·) is the cosine-based similarity defined in (3). The first loss term controls the fit of the representation X̃ with the learnt graph G̃_{i−1} in the low-dimensional embedding space. The second loss term quantifies the fit of the representation X̃ with the fixed initial graph G_0 in the high-dimensional space and is reminiscent of shallow graph embedding; α ∈ [0, 1] is a regularization parameter that controls the weight of each loss. Standard gradient descent can be used to solve (4).
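As a schematic illustration of this update, the following self-contained sketch minimizes the objective of Eq. (4) with plain gradient descent; the function names are ours, and the finite-difference gradient is only a dependency-free stand-in for the analytic or autodiff gradient one would use in practice:

```python
import numpy as np

def dge_embedding_update(X_emb, G_prev, G0, alpha=0.1, l_emb=0.3,
                         lr=0.05, n_steps=30, fd_eps=1e-4):
    """Embedding update of Eq. (4) by plain gradient descent.

    Minimizes (1 - alpha) * L1(S(X), G_prev) + alpha * L2(S(X), G0)
    over the embedded features X.
    """
    def affinity(X):                           # cosine kernel of Eq. (3)
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
        return np.exp(-(1.0 - Xn @ Xn.T) / l_emb)

    def xent(P, Q, eps=1e-12):                 # cross-entropy loss
        P = np.clip(P, eps, 1.0 - eps)
        return -(Q * np.log(P) + (1.0 - Q) * np.log(1.0 - P)).sum()

    def loss(X):
        S = affinity(X)
        return (1.0 - alpha) * xent(S, G_prev) + alpha * xent(S, G0)

    X = X_emb.copy()
    for _ in range(n_steps):
        grad, base = np.zeros_like(X), loss(X)
        for idx in np.ndindex(*X.shape):       # finite-difference gradient
            Xp = X.copy()
            Xp[idx] += fd_eps
            grad[idx] = (loss(Xp) - base) / fd_eps
        X -= lr * grad
    return X
```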

Graph structure update. To obtain an estimate of the graph structure at the i-th iteration, say G̃_i, assuming that X̃_i is given, we start from an initial estimate G̃_i = S_l̃(X̃_i), and then make use of the model assumptions described in Section 3.1 to modify the graph: temporal adjacency, and semantic similarity.

i) To foster similarity for temporally adjacent nodes, we apply two operations. First, local averaging of the edge weights, G̃_i ← G̃_i ∗ K_p, where ∗ is the 2D convolution operator and K_p a p × p kernel, here simply the normalized bump function. Second, and more importantly, we apply a nonlinear operation to G̃_i that scales down by a factor η, 0 < η < 1, the weights of edges between nodes k and j that are not direct temporal neighbors (i.e., for which |k − j| > 1), while leaving the similarities of directly temporally adjacent nodes unchanged:

(\tilde{G}_i)_{kj} \leftarrow (f_\eta(\tilde{G}_i))_{kj} =
\begin{cases}
(1-\eta)\,(\tilde{G}_i)_{kj} & \text{if } |k-j| > 1\\
(\tilde{G}_i)_{kj} & \text{otherwise,}
\end{cases} \qquad (5)

thus strengthening the temporal adjacency of the graph.
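A direct implementation of this temporal prior could read as follows (our sketch; a uniform box filter stands in for the normalized bump kernel K_p):

```python
import numpy as np

def temporal_prior(G, p=3, eta=0.3):
    """Temporal-prior graph update: p x p local averaging, then Eq. (5)."""
    N = G.shape[0]
    r = p // 2
    Gpad = np.pad(G, r, mode='edge')
    Gs = np.empty_like(G)
    for k in range(N):                # 2D convolution with a uniform p x p
        for j in range(N):            # kernel (box filter as a stand-in for
            Gs[k, j] = Gpad[k:k + p, j:j + p].mean()  # the bump function)
    # Eq. (5): damp edges between frames that are not direct temporal neighbors
    kk, jj = np.meshgrid(np.arange(N), np.arange(N), indexing='ij')
    return np.where(np.abs(kk - jj) > 1, (1.0 - eta) * Gs, Gs)
```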

ii) To reinforce the semantic similarity of X̃_i, we first obtain a coarse estimate of the community structure of the graph G̃_i. To this end, we apply a clustering algorithm to X̃_i, which yields estimated cluster labels C = (c_j)_{j=1}^{N}, c_j ∈ {1, ..., N_C}, for each frame. Then we update G̃_i by applying to it a nonlinear function g_µ that reduces the similarity between nodes k and j that do not belong to the same cluster, c_j ≠ c_k, by a factor µ, 0 < µ < 1, and does not change similarities within clusters, i.e.,

(\tilde{G}_i)_{kj} \leftarrow (g_\mu(\tilde{G}_i, C))_{kj} =
\begin{cases}
(1-\mu)\,(\tilde{G}_i)_{kj} & \text{if } c_j \neq c_k\\
(\tilde{G}_i)_{kj} & \text{otherwise.}
\end{cases} \qquad (6)

For 0 < µ < 1, this operation hence reinforces within-event semantic similarity.
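The semantic prior is equally short; in the sketch below (ours), k-means from scikit-learn provides the coarse community labels:

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_prior(G, X_emb, n_clusters=10, mu=0.1, seed=0):
    """Semantic-prior graph update of Eq. (6).

    Coarse community labels are estimated by k-means on the current
    embedding; every edge whose endpoints fall into different clusters
    is scaled down by (1 - mu), within-cluster edges are left unchanged.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_emb)
    across = labels[:, None] != labels[None, :]   # mask where c_j != c_k
    return np.where(across, (1.0 - mu) * G, G)
```

One DGE iteration then chains the three sketches above: an embedding update (Eq. (4)), followed by temporal_prior (Eq. (5)) and semantic_prior (Eq. (6)), as in the loop of Algorithm 1.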

DGE aims at revealing the temporal and semantic relatedness for each pair of data vectors, and therefore the estimated graphs G̃_i, i = 0, ..., K, are fully connected at each stage. A high-level overview of our DGE approach can be found in Algorithm 1.

3.4 Graph post-processing: Event boundary detection

Depending on the problem, applicative context and objective, different standard and graph signal processing tools can be applied to the estimated graph G̃_K in order to extract the desired information, or to transform G̃_K [40, 53]. In this work, the focus is on event detection in image sequences, and we thus directly use the learnt representation X̃_K corresponding to G̃_K in a contextual event boundary detector.

Boundary detector. Since the focus of this paper is on how to improve the event representation and not on the temporal segmentation algorithm itself, we used the same boundary detector as in [19]. This boundary detector is based on the idea that the larger the distance between the predicted visual contexts for a frame k, computed once forward (from past frames j < k) and once backward (from future frames j > k), the more likely this frame corresponds to an event boundary. Hence, the boundary prediction function is defined as the distance between the past and future contexts of frame k: pred(k) = dist(ctx_P, ctx_F), where dist(·, ·) is the cosine distance and the temporal contexts ctx_F and ctx_P are defined as the average of the representation vectors X̃ of a small number of future (respectively past) frames. Those frames for which the value of the boundary prediction function exceeds a threshold are the detected event boundaries; see [19] for details. We use the same parameter values and thresholds as in [19].
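For concreteness, a simplified version of this detector can be written as follows (our sketch; the window size and threshold are placeholders rather than the tuned values of [19]):

```python
import numpy as np

def detect_boundaries(X_emb, w=5, thresh=0.5):
    """Contextual event boundary detector in the spirit of [19] (simplified).

    pred(k) is the cosine distance between the mean representation of the
    w frames before k (past context) and of the w frames after k (future
    context). Frames where pred(k) exceeds the threshold are boundaries.
    """
    N = X_emb.shape[0]
    pred = np.zeros(N)
    for k in range(1, N - 1):
        past = X_emb[max(0, k - w):k].mean(axis=0)
        futr = X_emb[k + 1:k + 1 + w].mean(axis=0)
        cos = past @ futr / (np.linalg.norm(past) * np.linalg.norm(futr) + 1e-12)
        pred[k] = 1.0 - cos
    return np.flatnonzero(pred > thresh)
```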


Figure 3: Illustration of DGE on synthetic data (Gaussian mixture with 4 components, d = 2): (a) scatter plot of features X̃_i for iterations i = (0, 1, 3, 10) (left to right column, respectively); mixture components are indicated with color, solid lines indicate temporal adjacency; (b) the features X̃_i plotted as time series, with estimated segment boundaries (vertical dashed lines) and component number ground truth (solid black); (c) the corresponding graphs G̃_i and (d) G̃_i after pruning edges with weight smaller than 0.7 times that of the strongest edge (yellow and blue correspond to strong and weak similarity, respectively); (e) F-score, precision and recall for the estimated segmentation as a function of iteration number i.


Hereafter we call our temporal segmentation model, relying on the features learnt with the proposed DGE approach, CES-DGE, in analogy with CES-VCP in [19] (where CES stands for contextual event segmentation and VCP for visual context prediction).


3.5 Numerical illustration of CES-DGE on synthetic data

To illustrate the main idea behind our modeling, in Fig. 3 we show with a synthetic example in n = 2 dimensions how the original data X_0 and the associated initial graph G_0 change over 10 iterations of the DGE algorithm (here, d = n = 2, and we assume X̃_0 = X̄_0 = X_0, i.e., no preprocessing). The data consist of a temporal sequence of N = 350 feature vectors drawn from a Gaussian mixture with 4 equi-probable components with mean vectors (−7.8, 5.1), (−1.4, 2.8), (−2.9, 12.1), (2.5, 3.4) and diagonal covariance matrix σI with σ = 3.7. To model the community structures in the data, after each switch to a new state the state is maintained with a probability that decreases from 1 at an exponential rate with time. In Fig. 3, panel (a) shows the learnt representations X̃_i as scatter plots for iterations i = (0, 1, 3, 10), where colors indicate the true cluster labels and connections between data points temporal adjacency; panel (b) plots X̃_i as time series, together with event boundary detections as dashed bars and ground truth labels as solid bars; panel (c) shows the corresponding graphs G̃_i, and panel (e) summarizes boundary detection performance as a function of iteration number i; in addition, to highlight the estimated event graph structure, panel (d) plots a pruned version of G̃_i in which all edges with weight below 70% of the weight of the strongest edge are removed. Clearly, it is difficult to obtain a good segmentation for the initial, cleaned data (iteration i = 0, left column). It can be appreciated how, with a few DGE iterations, the embedded data points X̃_i belonging to the four different components are pushed apart, and the distance between temporally adjacent points belonging to the same components is reduced, increasing the similarity within each temporal context and revealing the underlying data generating mechanism (Fig. 3 (a-b), columns 2 to 4, respectively). Moreover, the learnt representation remains faithful to the temporal structure within each segment (e.g., the position and relative size of local maxima and minima is not changed). An alternative view is provided by the corresponding learnt graphs G̃_i in Fig. 3 (c-d), for which increasingly homogeneous diagonal and off-diagonal blocks with sharp boundaries emerge with an increasing number of iterations i, reflecting the temporal community structure underlying the data. This improved representation leads in turn to significantly better event segmentation results, with the F-score increasing from initially 0.59 to 0.62, 0.64, and 0.67 for iterations i = 1, 3, and 4−10, respectively.
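The synthetic sequence can be reproduced along the following lines (our sketch; the decay rate of the dwell probability is our choice, as the paper does not report it):

```python
import numpy as np

rng = np.random.default_rng(42)

means = np.array([[-7.8, 5.1], [-1.4, 2.8], [-2.9, 12.1], [2.5, 3.4]])
sigma, N = 3.7, 350

# Sticky state sequence: right after a switch the probability of staying
# equals 1 and then decays exponentially with the dwell time (decay rate
# 0.05 is an assumption).
states, s, dwell = [], int(rng.integers(4)), 0
for _ in range(N):
    states.append(s)
    if rng.random() > np.exp(-0.05 * dwell):      # leave the current state
        s = int(rng.choice([c for c in range(4) if c != s]))
        dwell = 0
    else:
        dwell += 1
states = np.asarray(states)

# Gaussian mixture sample with diagonal covariance sigma * I
X = means[states] + np.sqrt(sigma) * rng.standard_normal((N, 2))
```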

Parameter (meaning)               | Values : F-score
K (DGE iterations)                | 1: 0.655, 2: 0.698*, 3: 0.682, 4: 0.671, 5: 0.662
α (initial graph memory)          | 0.05: 0.679, 0.1: 0.698*, 0.2: 0.690, 0.3: 0.685, 0.4: 0.682
p (2D local average size)         | 2: 0.691, 3: 0.698*, 4: 0.691, 5: 0.687, 6: 0.679
η (graph temporal regularization) | 0.01: 0.685, 0.1: 0.688, 0.3: 0.698*, 0.4: 0.686, 0.6: 0.672
µ (extra-cluster penalty)         | 0.03: 0.675, 0.05: 0.697, 0.1: 0.698*, 0.2: 0.687, 0.3: 0.670
N_C (cluster number)              | 3: 0.650, 4: 0.656, 6: 0.666, 8: 0.682, 10: 0.698*, 12: 0.686, 14: 0.685, 16: 0.677, 18: 0.674, 20: 0.676

Table 1: Robustness of CES-DGE with respect to hyperparameter values (EDUBSeg, tolerance τ = 5): F-scores obtained when varying the DGE hyperparameter indicated in the first column, with all others held fixed (best results marked with *).



4 Performance evaluation

4.1 Datasets and experimental setup

Datasets. We used two temporal segmentation benchmark datasets for performance evaluation. The EDUBSeg dataset, introduced in [14], consists of 20 first-person image sequences acquired by seven people, with a total of 18,735 images. The dataset has been used to validate image sequence temporal segmentation


Method              | F-score | Rec  | Prec
k-Means smoothed    | 0.51    | 0.39 | 0.82
AC-color            | 0.38    | 0.25 | 0.90
SR-Clustering CNN   | 0.53    | 0.68 | 0.49
KTS                 | 0.53    | 0.40 | 0.87
CES-VCP             | 0.69    | 0.77 | 0.66
CES-DGE             | 0.70    | 0.70 | 0.72
Manual segmentation | 0.72    | 0.68 | 0.80

Table 2: Comparison of CES-DGE with state-of-the-art methods and manual segmentation for EDUBSeg and tolerance τ = 5.

          | CES-VCP            | CES-DGE
Tolerance | F-score  Rec  Prec | F-score  Rec  Prec
τ = 5     | 0.69  0.77  0.66   | 0.70  0.70  0.72
τ = 4     | 0.67  0.75  0.63   | 0.68  0.69  0.70
τ = 3     | 0.64  0.62  0.71   | 0.65  0.64  0.68
τ = 2     | 0.59  0.67  0.56   | 0.59  0.59  0.61
τ = 1     | 0.44  0.44  0.49   | 0.48  0.48  0.50

Table 3: Comparison of CES-DGE with state-of-the-art CES-VCP for different values of tolerance for EDUBSeg.

in several recent works [13, 14, 19, 42]. For each image sequence, 3 manual segmentations have been obtained by 3 different persons; in line with previous works, the first segmentation was used here as ground truth. The second dataset is the larger and more recent EDUBSeg-Desc dataset [4], with 46 sequences (42,947 images) acquired by eight people. Every frame of the egocentric image sequences was described here using the output of the pre-pooling layer of InceptionV3 [48] pretrained on ImageNet, as in [19], resulting in n = 2048 features.

On average, the sequences of both datasets contain roughly the same number of ground truth event boundaries (28 for the former vs. 26 for the latter), but those of the EDUBSeg-Desc dataset consist of 25% longer continuously recorded segments than EDUBSeg (3h46m25s vs. 3h1m29s continuous "camera on" time, and 3.0 vs. 3.55 continuously recorded segments per sequence). Because of this increased continuity, with a larger number of harder to detect continuous event transitions, EDUBSeg-Desc is considered more difficult.

Other publicly available datasets of egocentric image sequences, such as CLEF [11], NTCIR [23] and the more recent R3 [19], do not have ground truth event segmentations. They can therefore not be used for performance evaluation. In [19], these datasets, with more than 1.2 million images, were used as training sets to learn the temporal context representation. We emphasize that, in contrast, our algorithm operates fully self-supervised, without a training dataset.

Performance evaluation. Following previous work, we use F-score, Precision (Prec) and Recall (Rec) to evaluate the performance of our approach. In previous works [13, 14, 19, 42], results have been reported for a single level of tolerance τ = 5 (i.e., a time interval of length 2τ within which a detected event boundary is considered correct). Here, we report results for several values of the tolerance level, τ ∈ [1, T] with T = 5.

DGE hyperparameters. The hyperparameter values for the graph initialization and embedding (i.e., for the nonlocal self-similarity kernel (1) and for the similarity (3)) have been chosen a priori based on visual inspection of the similarity matrices of X̄_0 and X̃_0 for a couple of sequences of EDUBSeg. The values are fixed to L = 3, M = 1, h = 0.25 for Eq. (1), and l = 0.001, l̃ = 0.3 for Eq. (3). The embedding dimension is set to d = 15, which is found sufficient for the representation to faithfully reproduce the graph topology underlying the data (larger values of d were tested but did not yield performance gains). The DGE core loop hyperparameters are set to α = 0.1 (graph regularization), p = 3, η = 0.3 (temporal prior), and µ = 0.1, N_C = 10 (semantic prior; a k-means algorithm is used to estimate cluster labels); the robustness of this choice is investigated in the next section.
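For reference, these fixed values gathered in one place (a plain Python dict; the key names are ours):

```python
# All fixed DGE hyperparameters in one place (key names are ours).
DGE_PARAMS = dict(
    L=3, M=1, h=0.25,      # nonlocal self-similarity, Eq. (1)
    l=0.001, l_tilde=0.3,  # similarity kernel filtering parameters, Eq. (3)
    d=15,                  # embedding dimension
    alpha=0.1,             # graph regularization, Eq. (4)
    p=3, eta=0.3,          # temporal prior, Eq. (5)
    mu=0.1, N_C=10,        # semantic prior, Eq. (6); number of k-means clusters
    K=2,                   # DGE iterations (best value in Tab. 1; used in the ablation study)
)
```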

4.2 Robustness to changes in hyperparameter values

Tab. 1 reports F-score values obtained on the EDUBSeg dataset when the number of DGE iterations K and the DGE core parameters α, p, η, µ and N_C are varied individually. It can be appreciated that the performance of CES-DGE is overall very robust w.r.t. the precise hyperparameter values. The highest sensitivity is observed for the DGE iteration number K, which should be chosen as 1 < K < 5 so that F-score values are at most 3% below the best observed F-score. Results are very robust w.r.t. temporal regularization for reasonably small parameter values p ≤ 5 and η ≤ 0.4; for larger values the learnt representation is over-smoothed. Similarly, F-score values vary little when changing the semantic similarity parameter as long as 0 < µ ≤ 0.2 (note that these variations are significant since η, µ ∈ (0, 1)). Finally, the number of clusters used to estimate semantic similarity can be chosen within a reasonable range (F-score values drop by less than 3% for N_C ∈ (6, ..., 20)). Overall, these results suggest that CES-DGE is quite insensitive to hyperparameter tuning and yields robust segmentation for a wide range of parameter values.

4.3 Comparative results for EDUBSeg dataset

Tab. 2 reports comparisons with five state-of-the-art methods for a fixed value of tolerance (τ = 5, results reproduced from [19]). The first four (k-Means smoothed with k = 30, AC-color [33], SR-Clustering [14], KTS [45]) are standard/generic approaches and achieve modest performance, with F-scores no better than 0.53. CES-VCP of [19] yields a significantly better F-score of 0.69, thanks to the use of a large training set for learning the event representation. The proposed CES-DGE approach further improves on this state-of-the-art result and yields an F-score of 0.70. Interestingly, CES-DGE also achieves more balanced Rec and Prec values of 0.70 and 0.72, as compared to 0.77 and 0.66 for CES-VCP. Finally, even when compared to average manual segmentation performance, estimated by averaging the performance of the two remaining manual segmentations for EDUBSeg against the selected ground truth, our results are only 2% below.

In Table 3, we provide comparisons between our proposed approach and CES-VCP for different values of tolerance. We can observe that CES-DGE achieves systematically better results than CES-VCP in terms of F-score for all values of tolerance. These improvements with respect to the state of the art reach up to 4%. Besides, the Rec/Prec values of CES-DGE are more balanced and within ±3% of the F-score values for all tolerance levels (±8% of the F-score for CES-VCP).

Overall, this leads us to conclude that the proposed CES-DGE approach is effective in learning event representations for image sequences and yields robust segmentation results. These results are all the more remarkable considering that CES-DGE learns feature representations from the sequence itself, without relying on any training dataset.

4.4 Comparative results for EDUBSeg-Desc dataset

Tab. 4 summarizes the event boundary detection performance of the proposed CES-DGE approach and of CES-VCP for the larger EDUBSeg-Desc dataset; since CES-VCP reported state-of-the-art results (with F-score values 16% above other methods, cf. Tab. 2), we omit comparisons with other methods in what follows for space reasons. The same (hyper)parameter values as for EDUBSeg are used. It can be observed that the performance of both CES-VCP and CES-DGE is inferior to that reported for the EDUBSeg dataset in the previous section, for all tolerance values, indicating that EDUBSeg-Desc contains more difficult image sequences. Interestingly, while the F-scores achieved by CES-VCP are up to 12% (and more than 11% on average) below those reported for EDUBSeg, the F-scores of the proposed CES-DGE approach are at worst 5% smaller. In other words, CES-DGE yields up to 8% (on average 6%) better F-score values than the state of the art on the EDUBSeg-Desc dataset. Our CES-DGE also yields systematically better Rec and Prec values, for all levels of tolerance. Overall, these findings corroborate those obtained for the EDUBSeg dataset and confirm the excellent practical performance of the proposed approach. The results further suggest that CES-DGE, by relying only on the image sequence itself for learning event representations, can benefit from improved adaptability and robustness as compared to approaches that rely on training datasets.


          | CES-VCP            | CES-DGE
Tolerance | F-score  Rec  Prec | F-score  Rec  Prec
τ = 5     | 0.57  0.59  0.60   | 0.65  0.67  0.65
τ = 4     | 0.56  0.58  0.58   | 0.63  0.66  0.63
τ = 3     | 0.52  0.54  0.54   | 0.60  0.62  0.60
τ = 2     | 0.49  0.50  0.50   | 0.54  0.56  0.54
τ = 1     | 0.43  0.44  0.45   | 0.45  0.46  0.45

Table 4: Comparison of CES-DGE with state-of-the-art CES-VCP for different values of tolerance for EDUBSeg-Desc.

4.5 Ablation study

Graph initialization. We investigate the performance obtained by applying the boundary detector to features obtained at different stages of our method: the original features X (denoted CES-raw) and the features X̄_0 obtained by applying NLmeans along the temporal dimension (CES-NLmeans 1D), both of dimension n = 2048; the initial embedded features X̃_0 (CES-Embedding) and the features obtained after running the DGE main loop for K = 2 iterations (CES-DGE), both of dimension d = 15. The results, obtained for the EDUBSeg dataset, are reported in Tab. 5. They indicate that CES-NLmeans 1D increases the F-score by 8% w.r.t. CES-raw, and CES-Embedding adds another 1%. CES-DGE gains an additional 9%, hence significantly improving upon this initial embedding. This confirms that the graph initialization and the reduction of the dimension of the graph representation are beneficial.

An illustration of the effect of the graph initialization and of the DGE steps for EDUBSeg Subject 1 Day 3 is provided in Fig. 4. It can be observed how boundaries between temporally adjacent frames along the diagonal in the graph are successively enhanced when the original features X_0 (column 1) are replaced with the denoised version X̄_0 (column 2), with the embedded features X̃_0 (column 3), and finally with the DGE representation estimates (columns 4 & 5, for DGE iterations i = 1, 2, respectively). Moreover, the boundaries of off-diagonal blocks that correspond to segments of frames that resemble segments at other temporal locations, presumably of the same event, are sharpened, revealing the community structures.

Method          | F-score | Rec  | Prec
CES-raw         | 0.52    | 0.56 | 0.56
CES-NLmeans 1D  | 0.60    | 0.63 | 0.61
CES-Embedding   | 0.61    | 0.61 | 0.65
CES-DGE         | 0.70    | 0.70 | 0.72

Table 5: CES-DGE ablation study for EDUBSeg and tolerance τ = 5.

Deactivated DGE parameter        | α = 0  | p = 0  | η = 0  | µ = 0
F-score                          | 0.685  | 0.694  | 0.683  | 0.666
Difference with full DGE (0.698) | −0.013 | −0.004 | −0.015 | −0.032

Table 6: F-scores obtained when single core steps are removed from DGE (indicated by a zero value for the respective parameter).


DGE core operations. In Tab. 6, we report the performance obtained on the EDUBSeg dataset when the different operations in the DGE core iterations are removed one by one by setting the respective parameter to zero: graph regularization (α), edge local averaging (p), temporal edge weighting (η), and extra-cluster penalization (µ). It is observed that the overall DGE F-score drops by 0.4% to 3.2% when a single one of these operations is deactivated (versus a drop of 8.6%, from 0.698 to 0.612, when no DGE operation is performed at all, as discussed above). The fact that removing any of the operations individually does not lead to a knock-out of the DGE loop suggests that the associated individual (temporal & semantic) model assumptions are all, and independently, important. Among all operations, the largest individual F-score drop (3.2%) corresponds to deactivating the extra-cluster penalization. This points to the essential role of semantic similarity in the graph model. The smallest F-score difference is associated with edge local averaging; to improve this temporal regularization step, future work could use nonlocal instead of local averaging, or learnt kernels. However, the graph temporal edge weighting is effective in encoding the temporal prior (1.5% F-score drop if deactivated).


Figure 4: Illustration of DGE on Subject 1 Day 3 of EDUBSeg. Panel (a): data and learnt representations with boundary estimates (red dashed vertical bars), ground truth (blue solid vertical bars) and resulting F-score values; from left to right: initial features X_0 (1st column), denoised features X̄_0 (2nd column), initial embedded features X̃_{i=0} (3rd column), learnt representation X̃_i after a few iterations i (4th & 5th columns). Panel (b) shows the corresponding graphs G_0, G̃_0 and G̃_i, and panel (c) the graphs after pruning edges with weight smaller than 0.3 times that of the strongest edge (dark blue corresponds to small weights, yellow to large weights).

5 Conclusion

This paper proposed a novel approach for learning representations of events in low temporal resolution image sequences, named Dynamic Graph Embedding (DGE). Unlike state-of-the-art work, which requires (large) datasets for training the model, DGE operates in a fully self-supervised way and learns the temporal event representation for an image sequence directly from the sequence itself. To this end, we introduced an original model based on the assumption that the sequence can be represented as a graph that captures both the temporal and the semantic similarity of the images. The key novelty of our DGE approach is then to learn the structure of this unknown underlying data generating graph, jointly with a low-dimensional graph embedding. This is, to the best of our knowledge, one of the first attempts to learn simultaneously graph structure and data representations for image sequences. Experimental results have shown that DGE yields robust and effective event representations and outperforms the state of the art in terms of event boundary detection precision,


improving F-score values by 1% and 8% on the EDUBSeg and EDUBSeg-Desc event segmentation benchmark datasets, respectively. Future work will include exploring the use of more sophisticated methods than the k-means algorithm in the semantic similarity estimation step, and other applications in the field of multimedia, such as the temporal segmentation of wearable sensor data streams.

References

[1] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J. Smola. Distributed large-scale natural graph factorization. In Proc. 22nd Int. Conf. on World Wide Web, pages 37–48. ACM, 2013.

[2] Vinay Bettadapura, Daniel Castro, and Irfan Essa. Discovering picturesque highlights from egocentric vacation videos. In Proc. IEEE Winter Conf. on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.

[3] Marc Bolanos, Mariella Dimiccoli, and Petia Radeva. Toward storytelling from visual lifelogging: An overview. IEEE Transactions on Human-Machine Systems, 47(1):77–90, 2017.

[4] Marc Bolanos, Alvaro Peris, Francisco Casacuberta, Sergi Soler, and Petia Radeva. Egocentric video description based on temporally-linked sequences. Journal of Visual Communication and Image Representation, 50:205–216, 2018.

[5] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[6] Antoni Buades, Bartomeu Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), volume 2, pages 60–65, 2005.

[7] Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representations with global structural information. In Proc. 24th ACM Int. Conf. on Information and Knowledge Management, pages 891–900, 2015.

[8] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Deep neural networks for learning graph representations. In Proc. 30th AAAI Conf. on Artificial Intelligence, 2016.

[9] Vasileios Chasanis, Argyris Kalogeratos, and Aristidis Likas. Movie segmentation into scenes and chapters using locally weighted bag of visual words. In Proc. ACM Int. Conf. on Image and Video Retrieval, page 35, 2009.

[10] Haochen Chen, Bryan Perozzi, Yifan Hu, and Steven Skiena. HARP: Hierarchical representation learning for networks. In Proc. 32nd AAAI Conf. on Artificial Intelligence, 2018.

[11] Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Giulia Boato, Liting Zhou, and Cathal Gurrin. Overview of ImageCLEFlifelog 2017: Lifelog retrieval and summarization. In CLEF (Working Notes), 2017.

[12] Michael Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 3844–3852, 2016.

[13] Catarina Dias and Mariella Dimiccoli. Learning event representations by encoding the temporal context. In Proc. IEEE European Conf. on Computer Vision Workshops (ECCVW), pages 587–596, 2018.

[14] Mariella Dimiccoli, Marc Bolanos, Estefania Talavera, Maedeh Aghaei, Stavri G. Nikolov, and Petia Radeva. SR-clustering: Semantic regularized clustering for egocentric photo streams segmentation. Computer Vision and Image Understanding, 155:55–69, 2017.

[15] Mariella Dimiccoli and Herwig Wendt. Enhancing temporal segmentation by nonlocal self-similarity. In Proc. IEEE Int. Conf. on Image Processing (ICIP), in press, 2019.



[16] Aiden R Doherty, Daragh Byrne, Alan F Smeaton, Gareth JF Jones, and Mark Hughes. Investigating keyframe selection methods in the novel domain of passively captured visual lifelogs. In Proc. Int. Conf. on Content-based Image and Video Retrieval, pages 259–268, 2008.

[17] Aiden R Doherty, Ciaran O Conaire, Michael Blighe, Alan F Smeaton, and Noel E O'Connor. Combining image descriptors to effectively retrieve events from visual lifelogs. In Proc. 1st ACM Int. Conf. on Multimedia Information Retrieval, pages 10–17, 2008.

[18] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. In Proc. Int. Conf. on Learning Representations (ICLR), 2018.

[19] Ana Garcia del Molino, Joo-Hwee Lim, and Ah-Hwee Tan. Predicting visual context for unsupervised event segmentation in continuous photo-streams. In Proc. 26th ACM Int. Conf. on Multimedia, pages 10–17. ACM, 2018.

[20] Ana Garcia del Molino, Cheston Tan, Joo-Hwee Lim, and Ah-Hwee Tan. Summarization of egocentric videos: A comprehensive survey. IEEE Transactions on Human-Machine Systems, 47(1):65–76, 2017.

[21] Georgios B Giannakis, Yanning Shen, and Georgios Vasileios Karanikolas. Topology identification and learning over graphs: Accounting for nonlinearities and dynamics. Proceedings of the IEEE, 106(5):787–807, 2018.

[22] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 855–864, 2016.

[23] Cathal Gurrin, Hideo Joho, Frank Hopfgartner, Liting Zhou, Rashmi Gupta, Rami Albatal, and Duc-Tien Dang-Nguyen. Overview of NTCIR-13 lifelog-2 task. In Proc. 13th Conf. on Evaluation of Information Access Technology, 2017.

[24] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 1024–1034, 2017.

[25] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017.

[26] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

[27] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[28] Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng, and Stephen Maybank. A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 41(6):797–819, 2011.

[29] Shao Huang, Weiqiang Wang, Shengfeng He, and Rynson WH Lau. Egocentric temporal action proposals. IEEE Transactions on Image Processing, 27(2):764–777, 2017.

[30] Lukman H Iwan and James A Thom. Temporal video segmentation: detecting the end-of-act in circus performance videos. Multimedia Tools and Applications, 76(1):1379–1401, 2017.

[31] Ewa Kijak, Guillaume Gravier, Lionel Oisel, and Patrick Gros. Audiovisual integration for tennis broadcast structuring. Multimedia Tools and Applications, 30(3):289–311, 2006.

[32] Irena Koprinska and Sergio Carrato. Temporal video segmentation: A survey. Signal Processing: Image Communication, 16(5):477–500, 2001.



[33] Yong Jae Lee and Kristen Grauman. Predicting important objects for egocentric video summarization. International Journal of Computer Vision, 114(1):38–55, 2015.

[34] Jie Lin, Ana Garcia del Molino, Qianli Xu, Fen Fang, V Subbaraju, and Joo-Hwee Lim. VCI2R at the NTCIR-13 lifelog semantic access task. Proc. NTCIR-13, Tokyo, Japan, 2017.

[35] Wei-Hao Lin and Alexander Hauptmann. Structuring continuous video recordings of everyday life using time-constrained clustering. In Multimedia Content Analysis, Management, and Retrieval 2006, volume 6073, page 60730D. International Society for Optics and Photonics, 2006.

[36] Cailiang Liu, Dong Wang, Jun Zhu, and Bo Zhang. Learning a contextual multi-thread model for movie/TV scene segmentation. IEEE Transactions on Multimedia, 15(4):884–897, 2013.

[37] Nan Liu, Yao Zhao, Zhenfeng Zhu, and Hanqing Lu. Exploiting visual-audio-textual characteristics for automatic TV commercial block detection and segmentation. IEEE Transactions on Multimedia, 13(5):961–973, 2011.

[38] Arthur G Money and Harry Agius. Video summarisation: A conceptual framework and survey of the state of the art. Journal of Visual Communication and Image Representation, 19(2):121–143, 2008.

[39] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proc. Int. Conf. on Machine Learning (ICML), pages 2014–2023, 2016.

[40] Antonio Ortega, Pascal Frossard, Jelena Kovacevic, Jose MF Moura, and Pierre Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808–828, 2018.

[41] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. In Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 1105–1114, 2016.

[42] Francesco Paci, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara, and Luca Benini. Context change detection for an ultra-low power low-resolution ego-vision imager. In Proc. IEEE European Conf. Computer Vision Workshops (ECCVW), pages 589–602, 2016.

[43] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proc. 20th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 701–710, 2014.

[44] Yair Poleg, Chetan Arora, and Shmuel Peleg. Temporal segmentation of egocentric videos. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), pages 2537–2544, 2014.

[45] Danila Potapov, Matthijs Douze, Zaid Harchaoui, and Cordelia Schmid. Category-specific video summarization. In Proc. European Conf. on Computer Vision (ECCV), pages 540–555, 2014.

[46] Anna C Schapiro, Timothy T Rogers, Natalia I Cordova, Nicholas B Turk-Browne, and Matthew M Botvinick. Neural representations of events arise from temporal community structure. Nature Neuroscience, 16(4):486, 2013.

[47] Ekaterina H Spriggs, Fernando De La Torre, and Martial Hebert. Temporal segmentation and activity classification from first-person sensing. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pages 17–24, 2009.

[48] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.



[49] Estefania Talavera, Mariella Dimiccoli, Marc Bolanos, Maedeh Aghaei, and Petia Radeva. R-clustering for egocentric video segmentation. In Proc. Iberian Conf. Pattern Recognition and Image Analysis, pages 327–336, 2015.

[50] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proc. 24th Int. Conf. on World Wide Web, pages 1067–1077, 2015.

[51] Kevin Tang, Li Fei-Fei, and Daphne Koller. Learning latent temporal structure for complex event detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1250–1257. IEEE, 2012.

[52] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 1225–1234, 2016.

[53] Xiaofang Wang, Yuxing Tang, Simon Masnou, and Liming Chen. A global/local affinity graph for image segmentation. IEEE Transactions on Image Processing, 24(4):1399–1411, 2015.

[54] Jia Xu, Lopamudra Mukherjee, Yin Li, Jamieson Warner, James M Rehg, and Vikas Singh. Gaze-enabled egocentric video summarization via constrained submodular maximization. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2235–2244, 2015.

[55] Shuhei Yamamoto, Takuya Nishimura, Yasunori Akagi, Yoshiaki Takimoto, Takafumi Inoue, and Hiroyuki Toda. PBG at the NTCIR-13 lifelog-2 LAT, LSAT, and LEST tasks. Proc. NTCIR-13, Tokyo, Japan, 2017.

[56] Jinhui Yuan, Huiyi Wang, Lan Xiao, Wujie Zheng, Jianmin Li, Fuzong Lin, and Bo Zhang. A formal study of shot boundary detection. IEEE Transactions on Circuits and Systems for Video Technology, 17(2):168–186, 2007.
