ECO: Efficient Convolutional Network for Online...

ECO Efficient Convolutional Network for

Online Video Understanding

Mohammadreza Zolfaghari[0000minus0002minus2973minus4302] Kamaljeet Singh and ThomasBrox

University of Freiburgzolfaghasinghkbroxcsuni-freiburgde

Abstract The state of the art in video understanding suffers from twoproblems (1) The major part of reasoning is performed locally in thevideo therefore it misses important relationships within actions thatspan several seconds (2) While there are local methods with fast per-frame processing the processing of the whole video is not efficient andhampers fast video retrieval or online classification of long-term activi-ties In this paper we introduce a network architecture1 that takes long-term content into account and enables fast per-video processing at thesame time The architecture is based on merging long-term content al-ready in the network rather than in a post-hoc fusion Together witha sampling strategy which exploits that neighboring frames are largelyredundant this yields high-quality action classification and video cap-tioning at up to 230 videos per second where each video can consist ofa few hundred frames The approach achieves competitive performanceacross all datasets while being 10x to 80x faster than state-of-the-artmethods

Keywords Online video understanding Real-time Action recognitionVideo captioning

1 Introduction

Video understanding and specifically action classification have benefited a lotfrom deep learning and the larger datasets that have been created in recent yearsThe new datasets such as Kinetics [20] ActivityNet [13] and SomethingSome-thing [10] have contributed more diversity and realism to the field Deep learningprovides powerful classifiers at interactive frame rates enabling applications likereal-time action detection [30]

While action detection which quickly decides on the present action withina short time window is fast enough to run in real-time activity understandingwhich is concerned with longer-term activities that can span several secondsrequires the integration of the long-term context to achieve full accuracy Sev-eral 3D CNN architectures have been proposed to capture temporal relations

1 httpsgithubcommzolfaghariECO-efficient-video-understanding

2 M Zolfaghari K Singh and T Brox

between frames but they are computationally expensive and thus can coveronly comparatively small windows rather than the entire video Existing meth-ods typically use some post-hoc integration of window-based scores which issuboptimal for exploiting the temporal relationships between the windows

In this paper we introduce a straightforward end-to-end trainable archi-tecture that exploits two important principles to avoid the above-mentioneddilemma Firstly a good initial classification of an action can already be ob-tained from just a single frame The temporal neighborhood of this frame com-prises mostly redundant information and is almost useless for improving thebelief about the present action2 Therefore we process only a single frame of atemporal neighborhood efficiently with a 2D convolutional architecture in orderto capture appearance features of such frame Secondly to capture the con-textual relationships between distant frames a simple aggregation of scores isinsufficient Therefore we feed the feature representations of distant frames intoa 3D network that learns the temporal context between these frames and so canimprove significantly over the belief obtained from a single frame ndash especiallyfor complex long-term activities This principle is much related to the so-calledearly or late fusion used for combining the RGB stream and the optical flowstream in two-stream architectures [8] However this principle has been mostlyignored so far for aggregation over time and is not part of the state-of-the-artapproaches

Consequent implementation of these two principles together leads to a highclassification accuracy without bells and whistles The long temporal context ofcomplex actions can be fully captured whereas the fact that the method onlylooks at a very small fraction of all frames in the video leads to extremely fastprocessing of entire videos This is very beneficial especially in video retrievalapplications

Additionally this approach opens the possibility for online video understand-ing In this paper we also present a way to use our architecture in an onlinesetting where we provide a fast first guess on the action and refine it using thelonger term context as a more complex activity establishes In contrast to onlineaction detection which has been enabled recently [30] the approach providesnot only fast reaction times but also takes the longer term context into account

We conducted experiments on various video understanding problems includ-ing action recognition and video captioning Although we just use RGB imagesas input we obtain on-par or favorable performance compared to state-of-the-artapproaches on most datasets The runtime-accuracy trade-off is superior on alldatasets

2 An exception is the use of two frames for capturing motion which could be achievedby optionally feeding optical flow together with the RGB image In this paper weonly provide RGB images but an extension with optical flow eg a fast variant ofFlowNet [16] would be straightforward

ECO Efficient Convolutional Network for Online Video Understanding 3

2 Related Work

Video classification with deep learning Most recent works on video clas-sification are based on deep learning [619293448] To explore the temporalcontext of a video 3D convolutional networks are on obvious option Tran etal [33] introduced a 3D architecture with 3D kernels to learn spatio-temporalfeatures from a sequence of frames In a later work they studied the use of aResnet architecture with 3D convolutions and showed the improvements overtheir earlier c3d architecture [34] An alternative way to model the temporalrelation between frames is by using recurrent networks [62425] Donahue etal [6] employed a LSTM to integrate features from a CNN over time Howeverthe performance of recurrent networks on action recognition currently lags be-hind that of recent CNN-based methods which may indicate that they do notsufficiently model long-term dynamics [2425] Recently several works utilized3D architectures for action recognition [5483935] These approaches model theshort-term temporal context of the input video based on a sliding window Atinference time these methods must compute the average score over multiplewindows which is quite time consuming For example ARTNet [39] requires onaverage 250 samples to classify one video

All these approaches do not sufficiently use the comprehensive informationfrom the entire video during training and inference Partial observation not onlycauses confusion in action prediction but also requires an extra post-processingstep to fuse scores Extra featurescore aggregation reduces the speed of videoprocessing and disables the method to work in a real-time setting

Long-term representation learning To cope with the problem of partialobservation some methods increased the temporal resolution of the sliding win-dow [364] However expanding the temporal length of the input has two majordrawbacks (1) It is computationally expensive and (2) still fails to cover thevisual information of the entire video especially for longer videos

Some works proposed encoding methods [422826] to learn a video repre-sentation from samples In these approaches features are usually calculated foreach frame independently and are aggregated across time to make a video-levelrepresentation This ignores the relationship between the frames

To capture long-term information recent works [32740] employed a sparseand global temporal sampling method to choose frames from the entire videoduring training In the TSN model [40] as in the aggregation methods aboveframes are processed independently at inference time and their scores are aggre-gated only in the end Consequently the performance in their experiments staysthe same when they change the number of samples which indicates that theirmodel does not really benefit from the long-range temporal information

Our work is different from these previous approaches in three main aspects(1) Similar to TSN we sample a fixed number of frames from the entire videoto cover long-range temporal structure for understanding of video In this waythe sampled frames span the entire video independent of the length of the video(2) In contrast to TSN we use a 3D-network to learn the relationship betweenthe frames throughout the video The network is trained end-to-end to learn


this relationship (3) The network directly provides video-level scores withoutpost-hoc feature aggregation Therefore it can be run in online mode and inreal-time even on small computing devices

Video Captioning Video captioning is a widely studied problem in com-puter vision [1445439] Most approaches use a CNN pre-trained on image clas-sification or action recognition to generate features [94345] These methods likethe video understanding methods described above utilize a frame-based featureaggregation (eg Resnet or TSN) or a sliding window over the whole video (egC3D) to generate video-level features The features are then passed to a recur-rent neural network (eg LSTM) to generate the video captions via a learnedlanguage model The extracted visual features should represent both the tempo-ral structure of the video and the static semantics of the scene However mostapproaches suffer from the problem that the temporal context is not properlyextracted With the network model in this work we address this problem andcan consequently improve video captioning results

Real-time and online video understanding Deep learning acceleratedimage classification but video classification remains challenging in terms ofspeed A few works dealt with real-time video understanding [18443031] EMV[44] introduced an approach for fast calculation of motion vectors Despite thisimprovement video processing is still slow Kantorov [18] introduced a fast densetrajectory method The other works used frame-based hand-crafted features foronline action recognition [1522] Both accuracy and speed of feature extrac-tion in these methods are far from that of deep learning methods Soomro etal [31] proposed an online action localization approach Their model utilizesan expensive segmentation method which therefore cannot work in real-timeMore recently Singh et al [30] proposed an online detection approach basedon frame-level detections at 40fps We compare to the last two approaches inSection 5

3 Long-term Spatio-temporal Architecture

The network architecture is shown in Fig 1 A whole video with a variablenumber of frames is provided as input to the network The video is split intoN subsections Si i = 1 N of equal size and in each subsection exactlyone frame is sampled randomly Each of these frames is processed by a single2D convolutional network (weight sharing) which yields a feature representa-tion encoding the framersquos appearance By jointly processing frames from timesegments that cover the whole video we make sure that we capture the mostrelevant parts of an action over time and the relationship among these parts

Randomly sampling the position of the frame is advantageous over alwaysusing the same position because it leads to more diversity during training andmakes the network adapt to variations in the instantiation of an action Notethat this kind of processing exploits all frames of the video during training tomodel the variation At the same time the network must only process N framesat runtime which makes the approach very fast We also considered more clever


Input to 3D Net

K x [N x 28 x 28]

M2 M1

K x 28 x 28

Weight Sharing

Weight Sharing

M2 M1

Video (V)

K x 28 x 28

2D Net (H2d)

K x 28 x 28

3D NetAction

s1

s2

SN

Mϕ = [M1 M1] Mϕ = [MK MK]

MK

1K

(H3d)

111

MK

222

M2 M1MK

NNN

1 NN1

Temporal stacking of feature

maps

Feature Maps

Fig 1 Architecture overview of ECO Lite Each video is split into N sub-sections of equal size From each subsection a single frame is randomly sampledThe samples are processed by a regular 2D convolutional network to yield a rep-resentation for each sampled frame These representations are stacked and fedinto a 3D convolutional network which classifies the action taking into accountthe temporal relationship

Action

Pooling

(N X 1024)

Action

Video

2D Net

2D-Net

3D Net3D Net

Video

2D Net 3D Net3D Net

1024

512

Concat

N frames N frames

S

(A) (B)

Fig 2 (A) ECO Lite architecture as shown in more detail in Fig 1 (B) FullECO architecture with a parallel 2D and 3D stream

partitioning strategies that take the content of the subsections into accountHowever this comes with the drawback that each frame of the video must beprocessed at runtime to obtain the partitioning and the actual improvementby such smarter partitioning is limited since most of the variation is alreadycaptured by the random sampling during training

Up to this point the different frames in the video are processed indepen-dently In order to learn how actions are made up of the different appearancesof the scene over time we stack the representations of all frames and feed theminto a 3D convolutional network This network yields the final action class label

The architecture is very straightforward and it is obvious that it can betrained efficiently end-to-end directly on the action class label and on largedatasets It is also an architecture that can be easily adapted to other videounderstanding tasks as we show later in the video captioning section 54

31 ECO Lite and ECO Full

The 3D architecture in ECO Lite is optimized for learning relationships betweenthe frames but it tends to waste capacity in case of simple short-term actionsthat can be recognized just from the static image content Therefore we suggest


an extension of the architecture by using a 2D network in parallel see Fig2(B)For the simple actions this 2D network architecture can simplify processing andensure that the static image features receive the necessary importance whereasthe 3D network architecture takes care of the more complex actions that dependon the relationship between frames

The 2D network receives feature maps of all samples and produces N featurerepresentations Afterwards we apply average pooling to get a feature vectorthat is a representative for static scene semantics We call the full architectureECO and the simpler architecture in Fig 2(A) ECO Lite

32 Network details

2D-Net For the 2D network (H2D) that analyzes the single frames we use thefirst part of the BN-Inception architecture (until inception-3c layer) [17] Detailsare given in the supplemental material It has 2D filters and pooling kernels withbatch normalization We chose this architecture due to its efficiency The outputof H2D for each single frame consist of 96 feature maps with size of 28times 28

3D-Net For the 3D network H3D we adopt several layers of 3D-Resnet18 [34]which is an efficient architecture used in many video classification works [3439]Details on the architecture are provided in the supplemental material The out-put of H3D is a one-hot vector for the different class labels

2D-NetS In the ECO full design we use 2D-Nets in parallel with 3D-netto directly providing static visual semantics of video For this network we usethe BN-Inception architecture from inception-4a layer until last pooling layer[17] The last pooling layer will produce 1024 dimensional feature vector foreach frame We apply average pooling to generate video-level feature and thenconcatenate with features obtained from 3D-net

33 Training details

We train our networks using mini-batch SGD with Nesterov momentum and uti-lize dropout in each fully connected layer We split each video into N segmentsand randomly select one frame from each segment This sampling provides ro-bustness to variations and enables the network to fully exploit all frames Inaddition we apply the data augmentation techniques introduced in [41] we re-size the input frames to 240 times 320 and employ fixed-corner cropping and scalejittering with horizontal flipping (temporal jittering provided by sampling) Af-terwards we run per-pixel mean subtraction and resize the cropped regions to224times 224

The initial learning rate is 0001 and decreases by a factor of 10 when vali-dation error saturates for 4 epochs We train the network with a momentum of09 a weight decay of 00005 and mini-batches of size 32


SN - Working memory of N frames Q - N Incoming Frames

ECO Predictions

N2N2

N frames Update SN

Fig 3 Scheme of our sampling strategy for online video understanding Half ofthe frames are sampled uniformly from the working memory in the previous timestep the other half from the queue (Q) of incoming frames

We initialize the weights of the 2D-Net weights with the BN-Inception archi-tecture [17] pre-trained on Kinetics as provided by [41] In the same way we usethe pre-trained model of 3D-Resnet-18 as provided by [39] for initializing theweights of our 3D-Net Afterwards we train ECO and ECO Lite on the Kineticsdataset for 10 epochs

For other datasets we finetune the above ECOECO Lite models on thenew datasets Due to the complexity of the Something-Something dataset wefinetune the network for 25 epochs reducing the learning rate every 10 epochsby a factor of 10 For the rest we finetune for 4k iterations and the learningrate drops by a factor of 10 as soons as the validation loss saturates The wholetraining process on UCF101 and HMDB51 takes around 3 hours on one TeslaP100 GPU for the ECO architecture We adjusted the dropout rate and thenumber of iterations based on the dataset size

34 Test time inference

Most state-of-the-art methods run some post-processing on the network resultFor instance TSN and ARTNet [4139] collect 25 independent framesvolumesper video and for each framevolume sample 10 regions by corner and cen-ter cropping and their horizontal flipping The final prediction is obtained byaveraging the scores of all 250 samples This kind of inference at test time iscomputationally expensive and thus unsuitable for real-time setting

In contrast our network produces action labels for the whole video directlywithout any additional aggregation We sample N frames from the video applyonly center cropping then feed them directly to the network which provides theprediction for the whole video with a single pass

4 Online video understanding

Most works on video understanding process in batch mode ie they assumethat the whole video is available when processing starts However in severalapplication scenarios the video will be provided as a stream and the currentbelief is supposed to be available at any time Such online processing is possible


Algorithm 1 Online video understanding

Input Live video stream (V ) ECO pretrained model (ECONF ) Numberof Samples =Sampling window (N)

Output PredictionsInitialize an empty queue Q to queue N incoming framesInitialize working memory SN Initialize average predictions PAwhile new frames available from V do

Add frame fi from V to queue Qif i N then

SN = Sample 50 frames Q and 50 from SN Empty queue QFeed SN to model ECONF to get output probabilities P PA = Average P and PA Output average predictions PA

end

end

with a sliding window approach yet this comes with restrictions regarding thesize of the window ie long-term context is missing or with a very long delay

In this section we show how ECO can be adapted to run very efficiently inonline mode too The modification only affects the sampling part and keeps thenetwork architecture unchanged To this end we partition the incoming videocontent into segments of N frames where N is also the number of frames thatgo into the network We use a working memory SN which always comprises theN samples that are fed to the network together with a time stamp When avideo starts ie only N frames are available all N frames are sampled denselyand are stored in the working memory SN With each new time segment Nadditional frames come in and we replace half of the samples in SN by samplesfrom this time segment and update the prediction of the network see Fig 3When we replace samples from SN we uniformly replace samples from previoustime segments This ensures that changes can be anticipated in real time whilethe temporal context is taken into account and slowly fades out via the workingmemory Details on the update of SN are shown in Algorithm 1

The online approach with ECO runs at 675 fps (and at 970 fps with ECOLite) on a Tesla P100 GPU In addition the model is memory efficient by justkeeping exactly N frames This enables the implementation also on much smallerhardware such as mobile devices The video in the supplemental material showsthe recorded performance of the online version of ECO in real-time

5 Experiments

We evaluate our approach on different video understanding problems to showthe generalization ability of approach We evaluated the network architecture on


Table 1 Comparison to the state-of-the-art on UCF101 and HMDB51 datasets(over all three splits) using just RGB modality

DatasetMethod Pre-training

UCF101 () HMDB51 ()

I3D [4] ImageNet 845 498TSN [41] ImageNet 864 537DTPP [47] ImageNet 897 611

Res3D [34] Sports-1M 858 549

TSN [41] ImageNet + Kinetics 911 -I3D [4] ImageNet + Kinetics 956 748

ResNeXt-101 [12] Kinetics 945 702ARTNet [39] Kinetics 935 676T3D[5] Kinetics 917 611

ECOEn Kinetics 948 724

Table 2 Comparing performanceof ECO with state-of-the-artmethods on the Kinetics dataset

Val () Test ()

Methods Top-1 Avg Avg

ResNeXt-101 [12] 651 754 784Res3D [34] 656 757 744I3D-RGB [4] minus minus 782ARTNet [39] 692 787 773T3D [5] 622 minus 715

ECOEn 700 797 763

Table 3 Comparison withstate-of-the-arts on Something-Something dataset Last rowshows the results using bothFlow and RGB

Methods Val () Test ()

I3D by [10] - 2723M-TRN [46] 3444 3360

ECOEnLite 464 423ECOEnLite

RGB

Flow 495 439

the most common action classification datasets in order to compare its perfor-mance against the state-of-the-art approaches This includes the older but stillvery popular datasets UCF101 [32] and HMDB51 [21] but also the more recentdatasets Kinetics [20] and Something-Something [10] Moreover we applied thearchitecture to video captioning and tested it on the widely used Youtube2textdataset [11] For all of these datasets we use the standard evaluation protocolprovided by the authors

The comparison is restricted to approaches that take the raw RGB videos asinput without further pre-processing for instance by providing optical flow orhuman pose The term ECONF describes a network that gets N sampled framesas input The term ECOEn refers to average scores obtained from an ensembleof networks with 16 20 24 32 number of frames


51 Benchmark Comparison on Action Classification

The results obtained with ECO on the different datasets are shown in Tables 12 and 3 and compare them to the state of the art For UCF-101 HMDB-51 andKinetics ECO outperforms all existing methods except I3D which uses a muchheavier network On Something-Something it outperforms the other methodswith a large margin This shows the strong performance of the comparativelysimple and small ECO architecture

Table 4 Runtime comparison with state-of-the-art approaches using Tesla P100GPU on UCF101 and HMDB51 datasets (over all splits) For other approacheswe just consider one crop per sample to calculate the runtime We reportedruntime without considering IO

Method Inference speed (VPS) UCF101 () HMDB51 ()

Res3D [34] lt2 858 549TSN [41] 21 877 51EMV [44] 156 864 -I3D [4] 09 956 748ARTNet [39] 29 935 676

ECOLiteminus4F 2373 874 581ECO4F 1634 903 617ECO12F 526 924 683ECO20F 329 930 690ECO24F 282 936 684

Table 5 Accuracy and runtime of ECO and ECO Lite for different numbers ofsampled frames The reported runtime is without considering IO

ModelSampled Speed (VPS) Accuracy ()Frames Titan X Tesla P100 UCF101 HMDB51 Kinetics Someth

ECO

4 992 1634 903 617 662 minus8 495 815 917 656 678 39616 245 417 928 685 690 41432 123 208 933 687 678 minus

ECOLite

4 1429 2373 874 581 579 minus8 711 1159 902 633 minus 38716 353 610 916 682 644 42232 182 302 931 683 minus 413

52 Accuracy-Runtime Comparison

The advantages of the ECO architectures becomes even more prominent as welook at the accuracy-runtime trade-off shown in Table 4 and Fig 4 The ECOarchitectures yield the same accuracy as other approaches at much faster rates


101

100

101

102

Inference Speed (VPS)

84

86

88

90

92

94

96

98

100

Accu

racy

16

4

812

202432

16

8

32

4

16

64

25

16

ECOECO LITERes3DI3DEMVARTNet

Fig 4 Accuracy-runtime trade-off on UCF101 for various versions of ECO andother state-of-the-art approaches ECO is much closer to the top right corner

Moving something away from something Dropping something into something Squeezing something Hitting something with something

Fig 5 Examples from the Something-Something dataset In this dataset thetemporal context plays an even bigger role than on other datasets since thesame action is done with different objects ie the appearance of the object orbackground gives almost no cues about the action

Previous works typically measure the speed of an approach in frames persecond (fps) Our model with ECO runs at 675 fps (and at 970 fps with ECOLite) on a Tesla P100 GPU However this does not reflect the time needed toprocess a whole video This becomes relevant for methods like TSN and ourswhich do not look at every frame of the video and motivates us to report videosper second (vps) to compare the speed of video understanding methods

ECO can process videos at least an order of magnitude faster than the otherapproaches making it an excellent architecture for video retrieval applications

Number of sampled frames Table 5 compares the two architecture vari-ants ECO and ECO Lite and evaluates the influence on the number of sampledframes N As expected the accuracy drops when sampling fewer frames as thesubsections get longer and important parts of the action can be missed Thisis especially true for fast actions such as rdquothrow discusrdquo However even with


Top3 - Decreasing Top3 - Stable Top3 - Increasing

dive drink fencing kiss pick ride_horsedraw_s

wordeat flic_flac

Acc

ura

cy


YoYo Biking Lunges SurfingUneven

Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus

Dataset HMDB51 Dataset UCF101

Fig 6 Effect of the complexity of an action on the need for denser samplingWhile simple short-term actions (leftmost group) even suffer from more samplescomplex actions (rightmost group) clearly benefit from a denser sampling

just 4 samples the accuracy of ECO is still much better than most approachesin literature since ECO takes into account the relationship between these 4 in-stants in the video even if they are far apart Fig 6 even shows that for simpleshort-term actions the performance decreases when using more samples Thisis surprising on first glance but could be explained by the better use of thenetworkrsquos capacity for simple actions when there are fewer channels being fed tothe 3D network

ECO vs ECO Lite The full ECO architecture yields slightly better resultsthan the plain ECO Lite architecture but is also a little slower The differences inaccuracy and runtime between the two architectures can usually be compensatedby using more or fewer samples On the Something-Something dataset where thetemporal context plays a much bigger role than on other datasets (see Figure 5)ECO Lite performs equally well as the full ECO architecture even with the samenumber of input samples since the raw processing of single image cues has littlerelevance on this dataset

53 Early Action Recognition in Online Mode

Figure 7 evaluates our approach in online mode and shows how many framesthe method needs to achieve its full accuracy We ran this experiment on theJ-HMDB dataset due to the availability of results from other online methods onthis dataset Compared to these existing methods ECO reaches a good accuracyfaster and also saturates at a higher absolute accuracy

54 Video Captioning

To show the wide applicability of the ECO architecture we also combine it witha video captioning network To this end we use ECO pre-trained on Kinetics toanalyze the video content and train the state-of-the-art Semantic Compositional


10 20 30 40 50 60 70 80 90 100Video Observation

20

40

60

80

100

Accu

racy

ECO_4FECO_8FGSinghSoomro

Fig 7 Early action classification results of ECO in comparison to existing onlinemethods [3130] on the J-HMDB dataset The online version of ECO yields ahigh accuracy already after seeing a short part of the video Singh et al [30] usesboth RGB and optical flow

Network [9] for captioning We evaluated on the the Youtube2Text (MSVD)dataset [11] which consists of 1970 video clips with an average duration of 9seconds and covers various types of videos such as sports landscapes animalscooking and human activities The dataset contains 80839 sentences and eachvideo is annotated with around 40 sentences

Table 6 shows that ECO compares favorably to previous approaches across allpopular evaluation metrics (BLEU[27] METEOR[23] CIDEr[37]) Even ECOLite is already on-par with a ResNet architecture pre-trained on ImageNet Con-catenating the features from ECO with those of ResNet improves results furtherQualitative examples that correspond to the improved numbers are shown in Ta-ble 7

6 Conclusions

In this paper we have presented a simple and very efficient network architecturethat looks only at a small subset of frames from a video and learns to exploit thetemporal context between these frames This principle can be used in variousvideo understanding tasks We demonstrate excellent results on action classi-fication online action classification and video captioning The computationalload and the memory footprint makes an implementation on mobile devices aviable future option The approaches runs 10x to 80x faster than state-of-the-artmethods

Acknowledgements

We thank Facebook for providing us a GPU server with Tesla P100 processorsfor this research work


Table 6 Captioning results on Youtube2Text (MSVD) dataset

Metrics

Methods B-3 B-4 METEOR CIDEr

S2VT [38] - - 0292 -GRU-RCN [1] - 0479 0311 0678h-RNN [43] - 0499 0326 0658TDDF [45] - 0458 0333 0730AF [14] - 0524 0320 0688SCN-c3d [9] 0587 0482 0330 0692SCN-resnet [9] 0602 0506 0336 0755SCN-Ensemble of 5 [9] minus 0511 0335 0777

ECOLiteminus16F 0601 0504 0339 0833ECO32F 0616 0521 0345 0857ECO32F + resnet 0626 0535 0350 0858

Table 7 Qualitative Results on MSVD First row corresponds to the ex-amples where ECO improved over SCN and the second row shows the ex-amples where ECO decreased the quality compared to SCN ECOL refers toECOLiteminus16F ECO to ECO32F and ECOR to ECO32F+resnet

SCN a woman is cookingECOL the woman isseasoning the meatECO a woman is seasoningsome meatECOR a woman is seasoningsome meat

SCN a man is playing a fluteECO a man is playing a violinECOR a man is playing aviolinECOR a man is playing aviolin

SCN a man is cookingECOL a man is pouring waterinto a containerECO a man is putting alid on a plastic containerECOR a man is drainingpasta

SCN a man is riding a horseECOL a woman is riding amotorcycleECO a man is riding a horseECOR a man is riding a boat

SCN a girl is sitting on a couchECOL a baby is sittingon the bedECO a woman is playing witha toyECOR a woman is sleepingon a bed

SCN two elephants arewalkingECOL a rhino is walkingECO a group of elephantsare walkingECOR a penguin is walking


References

1 Ballas N Yao L Pal C Courville AC Delving deeper into convolutionalnetworks for learning video representations ICLR (2016) 14

2 Bilen H Fernando B Gavves E Vedaldi A Action recognition with dynamicimage networks IEEE Transactions on Pattern Analysis and Machine Intelligencepp 1ndash1 (2017) httpsdoiorg101109TPAMI20172769085 3

3 Bilen H Fernando B Gavves E Vedaldi A Gould S Dynamic im-age networks for action recognition In 2016 IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) pp 3034ndash3042 (June 2016)httpsdoiorg101109CVPR2016331 3

4 Carreira J Zisserman A Quo vadis action recognition A new model and thekinetics dataset CoRR abs170507750 (2017) httparxivorgabs170507750 3 9 10

5 Diba A Fayyaz M Sharma V Karami AH Arzani MM YousefzadehR Gool LV Temporal 3d convnets New architecture and transfer learningfor video classification CoRR abs171108200 (2017) httparxivorgabs171108200 3 9

6 Donahue J Hendricks LA Guadarrama S Rohrbach M Venugopalan SSaenko K Darrell T Long-term recurrent convolutional networks for visualrecognition and description In CVPR (2015) 3

7 Feichtenhofer C Pinz A Wildes RP Spatiotemporal multiplier net-works for video action recognition In 2017 IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) pp 7445ndash7454 (July 2017)httpsdoiorg101109CVPR2017787 3

8 Feichtenhofer C Pinz A Zisserman A Convolutional two-stream network fu-sion for video action recognition CoRR abs160406573 (2016) httparxivorgabs160406573 2

9 Gan Z Gan C He X Pu Y Tran K Gao J Carin L Deng L Semanticcompositional networks for visual captioning In CVPR (2017) 4 13 14

10 Goyal R Kahou SE Michalski V Materzynska J Westphal S Kim HHaenel V Frund I Yianilos P Mueller-Freitag M Hoppe F Thurau CBax I Memisevic R The rdquosomething somethingrdquo video database for learningand evaluating visual common sense CoRR abs170604261 (2017) httparxivorgabs170604261 1 9

11 Guadarrama S Krishnamoorthy N Malkarnenkar G Venugopalan S MooneyR Darrell T Saenko K Youtube2text Recognizing and describing arbi-trary activities using semantic hierarchies and zero-shot recognition In 2013IEEE International Conference on Computer Vision pp 2712ndash2719 (Dec 2013)httpsdoiorg101109ICCV2013337 9 13

12 Hara K Kataoka H Satoh Y Can spatiotemporal 3d cnns retrace the historyof 2d cnns and imagenet CoRR abs171109577 (2017) httparxivorgabs171109577 9

13 Heilbron FC Escorcia V Ghanem B Niebles JC Activitynet A large-scalevideo benchmark for human activity understanding In CVPR pp 961ndash970 IEEEComputer Society (2015) httpdblpuni-trierdedbconfcvprcvpr2015htmlHeilbronEGN15 1

14 Hori C Hori T Lee TY Zhang Z Harsham B Hershey JR Marks TKSumi K Attention-based multimodal fusion for video description In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp 4203ndash4212 (Oct 2017)httpsdoiorg101109ICCV2017450 4 14


15 Hu B Yuan J Wu Y Discriminative action states discovery for online ac-tion recognition IEEE Signal Processing Letters 23(10) 1374ndash1378 (Oct 2016)httpsdoiorg101109LSP20162598878 4

16 Ilg E Mayer N Saikia T Keuper M Dosovitskiy A Brox T Flownet20 Evolution of optical flow estimation with deep networks In IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR) (Jul 2017) httplmbinformatikuni-freiburgdePublications2017IMKDB17 2

17 Ioffe S Szegedy C Batch normalization Accelerating deep network train-ing by reducing internal covariate shift In Proceedings of the 32Nd Interna-tional Conference on International Conference on Machine Learning - Volume 37pp 448ndash456 ICMLrsquo15 JMLRorg (2015) httpdlacmorgcitationcfmid=30451183045167 6 7

18 Kantorov V Laptev I Efficient feature extraction encoding and classificationfor action recognition In 2014 IEEE Conference on Computer Vision and PatternRecognition pp 2593ndash2600 (June 2014) httpsdoiorg101109CVPR20143324

19 Karpathy A Toderici G Shetty S Leung T Sukthankar R Fei-Fei LLarge-scale video classification with convolutional neural networks In Proceed-ings of the 2014 IEEE Conference on Computer Vision and Pattern Recog-nition pp 1725ndash1732 CVPR rsquo14 IEEE Computer Society Washington DCUSA (2014) httpsdoiorg101109CVPR2014223 httpdxdoiorg10

1109CVPR2014223 320 Kay W Carreira J Simonyan K Zhang B Hillier C Vijayanarasimhan

S Viola F Green T Back T Natsev P Suleyman M Zisserman A Thekinetics human action video dataset CoRR abs170506950 (2017) http

arxivorgabs170506950 1 921 Kuehne H Jhuang H Garrote E Poggio T Serre T HMDB a large video

database for human motion recognition In Proceedings of the International Con-ference on Computer Vision (ICCV) (2011) 9

22 Kviatkovsky I Rivlin E Shimshoni I Online action recognition using covari-ance of shape and motion Comput Vis Image Underst 129(C) 15ndash26 (Dec2014) httpsdoiorg101016jcviu201408001 httpdxdoiorg101016jcviu201408001 4

23 Lavie A Agarwal A Meteor An automatic metric for mt evaluation withhigh levels of correlation with human judgments In Proceedings of the Sec-ond Workshop on Statistical Machine Translation pp 228ndash231 StatMT rsquo07 As-sociation for Computational Linguistics Stroudsburg PA USA (2007) http

dlacmorgcitationcfmid=16263551626389 1324 Lev G Sadeh G Klein B Wolf L Rnn fisher vectors for action recognition and

image annotation In Leibe B Matas J Sebe N Welling M (eds) ComputerVision ndash ECCV 2016 pp 833ndash850 Springer International Publishing Cham (2016)3

25 Li Z Gavrilyuk K Gavves E Jain M Snoek CG Videolstm convolvesattends and flows for action recognition Comput Vis Image Underst 166(C)41ndash50 (Jan 2018) httpsdoiorg101016jcviu201710011 httpsdoiorg101016jcviu201710011 3

26 Ng JYH Hausknecht MJ Vijayanarasimhan S Vinyals O Monga RToderici G Beyond short snippets Deep networks for video classification InCVPR pp 4694ndash4702 IEEE Computer Society (2015) httpdblpuni-trierdedbconfcvprcvpr2015htmlNgHVVMT15 3


27 Papineni K Roukos S Ward T Zhu WJ Bleu A method for au-tomatic evaluation of machine translation In Proceedings of the 40thAnnual Meeting on Association for Computational Linguistics pp 311ndash318 ACL rsquo02 Association for Computational Linguistics Stroudsburg PAUSA (2002) httpsdoiorg10311510730831073135 httpsdoiorg10

311510730831073135 1328 Qiu Z Yao T Mei T Deep quantization Encoding convolutional activations

with deep generative model In CVPR (2017) 329 Simonyan K Zisserman A Two-stream convolutional networks for action recog-

nition in videos In Proceedings of the 27th International Conference on NeuralInformation Processing Systems - Volume 1 pp 568ndash576 NIPSrsquo14 MIT PressCambridge MA USA (2014) httpdlacmorgcitationcfmid=2968826

2968890 330 Singh G Saha S Sapienza M Torr PHS Cuzzolin F Online real-time

multiple spatiotemporal action localisation and prediction In IEEE InternationalConference on Computer Vision ICCV 2017 Venice Italy October 22-29 2017pp 3657ndash3666 (2017) httpsdoiorg101109ICCV2017393 httpsdoiorg101109ICCV2017393 1 2 4 13

31 Soomro K Idrees H Shah M Predicting the where and what of actors andactions through online action localization In The IEEE Conference on ComputerVision and Pattern Recognition (CVPR) (June 2016) 4 13

32 Soomro K Zamir AR Shah M UCF101 A dataset of 101 human actionsclasses from videos in the wild CoRR abs12120402 (2012) httparxivorgabs12120402 9

33 Tran D Bourdev LD Fergus R Torresani L Paluri M C3D generic featuresfor video analysis CoRR abs14120767 (2014) httparxivorgabs14120767 3

34 Tran D Ray J Shou Z Chang S Paluri M Convnet architecture search forspatiotemporal feature learning CoRR abs170805038 (2017) httparxivorgabs170805038 3 6 9 10

35 Tran D Wang H Torresani L Ray J LeCun Y Paluri M A closer lookat spatiotemporal convolutions for action recognition CoRR abs171111248(2017) httparxivorgabs171111248 3

36 Varol G Laptev I Schmid C Long-term temporal convolutions for actionrecognition CoRR abs160404494 (2016) httparxivorgabs1604044943

37 Vedantam R Zitnick CL Parikh D Cider Consensus-based image descriptionevaluation In CVPR pp 4566ndash4575 IEEE Computer Society (2015) httpdblpuni-trierdedbconfcvprcvpr2015htmlVedantamZP15 13

38 Venugopalan S Rohrbach M Donahue J Mooney R Darrell T Saenko KSequence to sequence ndash video to text In Proceedings of the IEEE InternationalConference on Computer Vision (ICCV) (2015) 14

39 Wang L Li W Li W Gool LV Appearance-and-relation networks forvideo classification CoRR abs171109125 (2017) httparxivorgabs

171109125 3 6 7 9 1040 Wang L Xiong Y Wang Z Qiao Y Lin D Tang X Gool LV Temporal

segment networks for action recognition in videos CoRR abs170502953 (2017)httparxivorgabs170502953 3

41 Wang L Xiong Y Wang Z Qiao Y Lin D Tang X Val Gool L Temporalsegment networks Towards good practices for deep action recognition In ECCV(2016) 6 7 9 10


42 Xu Z Yang Y Hauptmann AG A discriminative CNN video representationfor event detection CoRR abs14114006 (2014) httparxivorgabs14114006 3

43 Yu H Wang J Huang Z Yang Y Xu W Video paragraph captioning usinghierarchical recurrent neural networks 2016 IEEE Conference on Computer Visionand Pattern Recognition (CVPR) pp 4584ndash4593 (2016) 4 14

44 Zhang B Wang L Wang Z Qiao Y Wang H Real-time action recogni-tion with enhanced motion vector cnns CoRR abs160407669 (2016) httparxivorgabs160407669 4 10

45 Zhang X Gao K Zhang Y Zhang D Li J Tian Q Task-driven dynamicfusion Reducing ambiguity in video description In 2017 IEEE Conference onComputer Vision and Pattern Recognition (CVPR) pp 6250ndash6258 (July 2017)httpsdoiorg101109CVPR2017662 4 14

46 Zhou B Andonian A Torralba A Temporal relational reasoning in videosCoRR abs171108496 (2017) httparxivorgabs171108496 9

47 Zhu J Zou W Zhu Z Li L End-to-end video-level representation learningfor action recognition CoRR abs171104161 (2017) httparxivorgabs171104161 9

48 Zolfaghari M Oliveira GL Sedaghat N Brox T Chained multi-stream net-works exploiting pose motion and appearance for action classification and de-tection In IEEE International Conference on Computer Vision (ICCV) (2017)httplmbinformatikuni-freiburgdePublications2017ZOSB17a 3


between frames but they are computationally expensive and thus can coveronly comparatively small windows rather than the entire video Existing meth-ods typically use some post-hoc integration of window-based scores which issuboptimal for exploiting the temporal relationships between the windows

In this paper we introduce a straightforward end-to-end trainable archi-tecture that exploits two important principles to avoid the above-mentioneddilemma Firstly a good initial classification of an action can already be ob-tained from just a single frame The temporal neighborhood of this frame com-prises mostly redundant information and is almost useless for improving thebelief about the present action2 Therefore we process only a single frame of atemporal neighborhood efficiently with a 2D convolutional architecture in orderto capture appearance features of such frame Secondly to capture the con-textual relationships between distant frames a simple aggregation of scores isinsufficient Therefore we feed the feature representations of distant frames intoa 3D network that learns the temporal context between these frames and so canimprove significantly over the belief obtained from a single frame ndash especiallyfor complex long-term activities This principle is much related to the so-calledearly or late fusion used for combining the RGB stream and the optical flowstream in two-stream architectures [8] However this principle has been mostlyignored so far for aggregation over time and is not part of the state-of-the-artapproaches

Consequent implementation of these two principles together leads to a highclassification accuracy without bells and whistles The long temporal context ofcomplex actions can be fully captured whereas the fact that the method onlylooks at a very small fraction of all frames in the video leads to extremely fastprocessing of entire videos This is very beneficial especially in video retrievalapplications

Additionally this approach opens the possibility for online video understand-ing In this paper we also present a way to use our architecture in an onlinesetting where we provide a fast first guess on the action and refine it using thelonger term context as a more complex activity establishes In contrast to onlineaction detection which has been enabled recently [30] the approach providesnot only fast reaction times but also takes the longer term context into account

We conducted experiments on various video understanding problems includ-ing action recognition and video captioning Although we just use RGB imagesas input we obtain on-par or favorable performance compared to state-of-the-artapproaches on most datasets The runtime-accuracy trade-off is superior on alldatasets

2 An exception is the use of two frames for capturing motion which could be achievedby optionally feeding optical flow together with the RGB image In this paper weonly provide RGB images but an extension with optical flow eg a fast variant ofFlowNet [16] would be straightforward


2 Related Work















Input to 3D Net

K x [N x 28 x 28]

M2 M1

K x 28 x 28

Weight Sharing

Weight Sharing

M2 M1

Video (V)

K x 28 x 28

2D Net (H2d)

K x 28 x 28

3D NetAction

s1

s2

SN


MK

1K

(H3d)

111

MK

222

M2 M1MK

NNN

1 NN1


maps

Feature Maps


Action

Pooling

(N X 1024)

Action

Video

2D Net

2D-Net

3D Net3D Net

Video

2D Net 3D Net3D Net

1024

512

Concat

N frames N frames

S

(A) (B)










32 Network details




33 Training details





ECO Predictions

N2N2

N frames Update SN















end

end




5 Experiments





UCF101 () HMDB51 ()


Res3D [34] Sports-1M 858 549





Val () Test ()



ECOEn 700 797 763



I3D by [10] - 2723M-TRN [46] 3444 3360


RGB

Flow 495 439












ECO

4 992 1634 903 617 662 minus8 495 815 917 656 678 39616 245 417 928 685 690 41432 123 208 933 687 678 minus

ECOLite





101

100

101

102


84

86

88

90

92

94

96

98

100

Accu

racy

16

4

812

202432

16

8

32

4

16

64

25

16











wordeat flic_flac

Acc

ura

cy



Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus







54 Video Captioning




20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References



























































2 Related Work















Input to 3D Net

K x [N x 28 x 28]

M2 M1

K x 28 x 28

Weight Sharing

Weight Sharing

M2 M1

Video (V)

K x 28 x 28

2D Net (H2d)

K x 28 x 28

3D NetAction

s1

s2

SN


MK

1K

(H3d)

111

MK

222

M2 M1MK

NNN

1 NN1


maps

Feature Maps


Action

Pooling

(N X 1024)

Action

Video

2D Net

2D-Net

3D Net3D Net

Video

2D Net 3D Net3D Net

1024

512

Concat

N frames N frames

S

(A) (B)










32 Network details




33 Training details





ECO Predictions

N2N2

N frames Update SN















end

end




5 Experiments





UCF101 () HMDB51 ()


Res3D [34] Sports-1M 858 549





Val () Test ()



ECOEn 700 797 763



I3D by [10] - 2723M-TRN [46] 3444 3360


RGB

Flow 495 439












ECO

4 992 1634 903 617 662 minus8 495 815 917 656 678 39616 245 417 928 685 690 41432 123 208 933 687 678 minus

ECOLite





101

100

101

102


84

86

88

90

92

94

96

98

100

Accu

racy

16

4

812

202432

16

8

32

4

16

64

25

16











wordeat flic_flac

Acc

ura

cy



Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus







54 Video Captioning




20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References


































































Input to 3D Net

K x [N x 28 x 28]

M2 M1

K x 28 x 28

Weight Sharing

Weight Sharing

M2 M1

Video (V)

K x 28 x 28

2D Net (H2d)

K x 28 x 28

3D NetAction

s1

s2

SN


MK

1K

(H3d)

111

MK

222

M2 M1MK

NNN

1 NN1


maps

Feature Maps


Action

Pooling

(N X 1024)

Action

Video

2D Net

2D-Net

3D Net3D Net

Video

2D Net 3D Net3D Net

1024

512

Concat

N frames N frames

S

(A) (B)










32 Network details




33 Training details





ECO Predictions

N2N2

N frames Update SN















end

end




5 Experiments





UCF101 () HMDB51 ()


Res3D [34] Sports-1M 858 549





Val () Test ()



ECOEn 700 797 763



I3D by [10] - 2723M-TRN [46] 3444 3360


RGB

Flow 495 439












ECO

4 992 1634 903 617 662 minus8 495 815 917 656 678 39616 245 417 928 685 690 41432 123 208 933 687 678 minus

ECOLite





101

100

101

102


84

86

88

90

92

94

96

98

100

Accu

racy

16

4

812

202432

16

8

32

4

16

64

25

16











wordeat flic_flac

Acc

ura

cy



Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus







54 Video Captioning




20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References





























































32 Network details




33 Training details





ECO Predictions

N2N2

N frames Update SN















end

end




5 Experiments





UCF101 () HMDB51 ()


Res3D [34] Sports-1M 858 549





Val () Test ()



ECOEn 700 797 763



I3D by [10] - 2723M-TRN [46] 3444 3360


RGB

Flow 495 439












ECO

4 992 1634 903 617 662 minus8 495 815 917 656 678 39616 245 417 928 685 690 41432 123 208 933 687 678 minus

ECOLite





101

100

101

102


84

86

88

90

92

94

96

98

100

Accu

racy

16

4

812

202432

16

8

32

4

16

64

25

16











wordeat flic_flac

Acc

ura

cy



Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus







54 Video Captioning




20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References




























































ECO Predictions

N2N2

N frames Update SN















end

end




5 Experiments





UCF101 () HMDB51 ()


Res3D [34] Sports-1M 858 549





Val () Test ()



ECOEn 700 797 763



I3D by [10] - 2723M-TRN [46] 3444 3360


RGB

Flow 495 439












ECO

4 992 1634 903 617 662 minus8 495 815 917 656 678 39616 245 417 928 685 690 41432 123 208 933 687 678 minus

ECOLite





101

100

101

102


84

86

88

90

92

94

96

98

100

Accu

racy

16

4

812

202432

16

8

32

4

16

64

25

16











wordeat flic_flac

Acc

ura

cy



Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus







54 Video Captioning




20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References
































































end

end




5 Experiments





UCF101 () HMDB51 ()


Res3D [34] Sports-1M 858 549





Val () Test ()



ECOEn 700 797 763



I3D by [10] - 2723M-TRN [46] 3444 3360


RGB

Flow 495 439












ECO

4 992 1634 903 617 662 minus8 495 815 917 656 678 39616 245 417 928 685 690 41432 123 208 933 687 678 minus

ECOLite





101

100

101

102


84

86

88

90

92

94

96

98

100

Accu

racy

16

4

812

202432

16

8

32

4

16

64

25

16











wordeat flic_flac

Acc

ura

cy



Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus







54 Video Captioning




20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References





























































UCF101 () HMDB51 ()


Res3D [34] Sports-1M 858 549





Val () Test ()



ECOEn 700 797 763



I3D by [10] - 2723M-TRN [46] 3444 3360


RGB

Flow 495 439












ECO

4 992 1634 903 617 662 minus8 495 815 917 656 678 39616 245 417 928 685 690 41432 123 208 933 687 678 minus

ECOLite





101

100

101

102


84

86

88

90

92

94

96

98

100

Accu

racy

16

4

812

202432

16

8

32

4

16

64

25

16











wordeat flic_flac

Acc

ura

cy



Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus







54 Video Captioning




20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References



































































ECO

4 992 1634 903 617 662 minus8 495 815 917 656 678 39616 245 417 928 685 690 41432 123 208 933 687 678 minus

ECOLite





101

100

101

102


84

86

88

90

92

94

96

98

100

Accu

racy

16

4

812

202432

16

8

32

4

16

64

25

16











wordeat flic_flac

Acc

ura

cy



Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus







54 Video Captioning




20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References



























































101

100

101

102


84

86

88

90

92

94

96

98

100

Accu

racy

16

4

812

202432

16

8

32

4

16

64

25

16











wordeat flic_flac

Acc

ura

cy



Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus







54 Video Captioning




20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References





























































wordeat flic_flac

Acc

ura

cy



Bars

CuttingI

nKitchen

CricketB

owling

Hula

Hoop

ThrowD

iscus







54 Video Captioning




20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References




























































20

40

60

80

100

Accu

racy





6 Conclusions


Acknowledgements




Metrics












References




























































Metrics












References



























































References


























































ECO: Efficient Convolutional Network for Online...

Documents

Transcript of ECO: Efficient Convolutional Network for Online...