
Supplementary Material for “Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation”

Min-Hung Chen1*   Baopu Li2   Yingze Bao2   Ghassan AlRegib1   Zsolt Kira1
1Georgia Institute of Technology   2Baidu USA

In this supplementary material, we provide more details about the technical approach, implementation, and experiments.

1. Technical Approach Details

1.1. Domain Attentive Temporal Pooling (DATP)

Temporal pooling is one of the most common methods to aggregate frame-level features into a video-level feature for each video. However, not all frame-level features contribute equally to the overall domain discrepancy. Therefore, inspired by [2, 3], we assign larger attention weights to the features with larger domain discrepancy so that we focus more on aligning those features, achieving more effective domain adaptation.

More specifically, we utilize the entropy criterion to generate the domain attention value for each frame-level feature $f_j$ as below:

$$ \hat{w}_j = 1 - H(\hat{d}_j) \quad (1) $$

where $\hat{d}_j$ is the output of the learned domain classifier $G_{ld}$ used in local SSTDA, and $H(p) = -\sum_k p_k \log p_k$ is the entropy function that measures uncertainty. $\hat{w}_j$ increases when $H(\hat{d}_j)$ decreases, which means the domains can be distinguished well. We also add a residual connection for more stable optimization. Finally, we aggregate the attended frame-level features with temporal pooling to generate the video-level feature $v$, which is noted as Domain Attentive Temporal Pooling (DATP), as illustrated in the left part of Figure 1 and can be expressed as:

$$ v = \frac{1}{T'} \sum_{j=1}^{T'} (\hat{w}_j + 1) \cdot f_j \quad (2) $$

where the $+1$ refers to the residual connection, and $\hat{w}_j + 1$ is equal to $w_j$ in the main paper. $T'$ is the number of frames used to generate a video-level feature.
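To make Eqs. (1)–(2) concrete, below is a minimal PyTorch sketch of how DATP could be implemented. It is an illustrative sketch only, not the authors' code: the function and argument names are ours, and the frame-level features and local domain-classifier outputs are assumed to have shapes (T', C) and (T', 2), respectively.

import torch

def domain_attentive_temporal_pooling(frame_feats, domain_logits, eps=1e-8):
    # frame_feats:   (T', C) frame-level features f_j
    # domain_logits: (T', 2) outputs of the local domain classifier G_ld
    d_hat = torch.softmax(domain_logits, dim=1)                  # domain predictions d_hat_j
    entropy = -(d_hat * torch.log(d_hat + eps)).sum(dim=1)       # H(d_hat_j)
    w_hat = 1.0 - entropy                                        # Eq. (1)
    # residual connection: attention weight is w_hat_j + 1 (= w_j in the main paper)
    v = ((w_hat + 1.0).unsqueeze(1) * frame_feats).mean(dim=0)   # Eq. (2): video-level feature
    return v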

Local SSTDA is necessary to calculate the attention weights for DATP. Without this mechanism, frames would be aggregated in the same way as plain temporal pooling, without cross-domain consideration, which has already been demonstrated to be sub-optimal for cross-domain video tasks [2, 3].

∗Work done during an internship at Baidu USA


Figure 1: The details of DATP (left) and DAE (right). Both modules take the domain entropy $H(\hat{d})$, which is calculated from the domain prediction $\hat{d}$, to calculate the attention weights. With the residual connection, DATP attends to the frame-level features when aggregating them into the final video-level feature $v$ (arrow thickness represents the assigned attention values), and DAE attends to the class entropy $H(\hat{y})$ to obtain the attentive entropy loss $\mathcal{L}_{ae}$.


1.2. Domain Attentive Entropy (DAE)

Minimum entropy regularization is a common strategy to perform more refined classifier adaptation. However, we only want to minimize the class entropy for the frames that are similar across domains. Therefore, inspired by [14], we attend to the frames which have low domain discrepancy, corresponding to high domain entropy $H(\hat{d}_j)$. More specifically, we adopt the Domain Attentive Entropy (DAE) module to calculate the attentive entropy loss $\mathcal{L}_{ae}$, which can be expressed as follows:

$$ \mathcal{L}_{ae} = \frac{1}{T} \sum_{j=1}^{T} \left( H(\hat{d}_j) + 1 \right) \cdot H(\hat{y}_j) \quad (3) $$

where $\hat{d}$ and $\hat{y}$ are the outputs of $G_{ld}$ and $G_y$, respectively, and $T$ is the total number of frames of a video. We also apply the residual connection for stability, as shown in the right part of Figure 1.
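Analogously, a minimal PyTorch sketch of the attentive entropy loss in Eq. (3) is given below. Again, this is only an illustrative sketch under the assumption that the local domain classifier and the action classifier output per-frame logits of shapes (T, 2) and (T, K); the names are not taken from the authors' code.

import torch

def attentive_entropy_loss(domain_logits, class_logits, eps=1e-8):
    # domain_logits: (T, 2) outputs of the local domain classifier G_ld
    # class_logits:  (T, K) outputs of the action classifier G_y
    def entropy(logits):
        p = torch.softmax(logits, dim=1)
        return -(p * torch.log(p + eps)).sum(dim=1)

    h_d = entropy(domain_logits)       # domain entropy H(d_hat_j)
    h_y = entropy(class_logits)        # class entropy  H(y_hat_j)
    # residual connection: weight each frame's class entropy by H(d_hat_j) + 1
    return ((h_d + 1.0) * h_y).mean()  # Eq. (3)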


                       GTEA    50Salads    Breakfast
subject #                 4          25           52
class #                  11          17           48
video #                  28          50         1712
avg. length (min.)        1         6.4          2.7
avg. action #/video      20          20            6
cross-validation     4-fold      5-fold       4-fold
leave-#-subject-out       1           5           13

Table 1: The statistics of action segmentation datasets.

1.3. Full Architecture

Our method is built upon the state-of-the-art action segmentation model, MS-TCN [4], which takes frame-level feature representations as input and generates the corresponding frame-level class predictions through four stages of SS-TCN. In our implementation, we convert the second and third stages into Domain Adaptive TCN (DA-TCN) by integrating each SS-TCN with the following three parts: 1) $G_{ld}$ (for binary domain prediction), 2) DATP and $G_{gd}$ (for sequential domain prediction), and 3) DAE, bringing three corresponding loss functions, $\mathcal{L}_{ld}$, $\mathcal{L}_{gd}$ and $\mathcal{L}_{ae}$, respectively, as illustrated in Figure 2. The final loss function can be formulated as below:

$$ \mathcal{L} = \sum_{s=1}^{4} \mathcal{L}_{y}^{(s)} - \sum_{s=2}^{3} \left( \beta_l \mathcal{L}_{ld}^{(s)} + \beta_g \mathcal{L}_{gd}^{(s)} - \mu \mathcal{L}_{ae}^{(s)} \right) \quad (4) $$

where $\beta_l$, $\beta_g$ and $\mu$ are the weights for $\mathcal{L}_{ld}$, $\mathcal{L}_{gd}$ and $\mathcal{L}_{ae}$, respectively, obtained by the methods described in Section 2.2, and $s$ is the stage index in MS-TCN.
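As a schematic illustration of Eq. (4), the overall objective could be assembled as below. This is a sketch only; the per-stage losses are assumed to be already computed, and the dictionary-based bookkeeping is ours rather than the authors' implementation.

def total_loss(L_y, L_ld, L_gd, L_ae, beta_l, beta_g, mu):
    # L_y: {stage: prediction loss} for stages 1..4
    # L_ld, L_gd, L_ae: {stage: loss} for the DA-TCN stages 2 and 3
    loss = sum(L_y[s] for s in (1, 2, 3, 4))
    for s in (2, 3):
        loss = loss - (beta_l * L_ld[s] + beta_g * L_gd[s] - mu * L_ae[s])
    return loss

Note that the signs follow Eq. (4) directly; in practice, the adversarial behavior of $\mathcal{L}_{ld}$ and $\mathcal{L}_{gd}$ is realized with the gradient reversal layers described in Figure 2.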

2. Experiments

2.1. Datasets and Evaluation Metrics

The detailed statistics and the evaluation protocols of the three datasets are listed in Table 1. We follow [7] to use the following three metrics for evaluation:

1. Frame-wise accuracy (Acc): Acc is one of the most common evaluation metrics for action segmentation, but it does not consider the temporal dependencies of the prediction, which causes inconsistency between qualitative assessment and frame-wise accuracy. In addition, long action classes have a higher impact on this metric than shorter ones, so it cannot reflect over-segmentation errors.

2. Segmental edit score (Edit): The edit score penalizes over-segmentation errors by measuring the ordering of predicted action segments independently of slight temporal shifts.

3. Segmental F1 score at the IoU threshold k% (F1@k): F1@k also penalizes over-segmentation errors while ignoring minor temporal shifts between the predictions and the ground truth. The scores are determined by the total number of actions but do not depend on the duration of each action instance, which is similar to mean average precision (mAP) with intersection-over-union (IoU) overlap criteria. F1@k has recently become popular since it better reflects the qualitative results. A simplified sketch of this metric is given below.
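To make F1@k concrete, the following simplified Python sketch follows the description above: frame-wise label sequences are collapsed into segments, and a predicted segment counts as a true positive if it overlaps an unmatched ground-truth segment of the same class with IoU at or above the threshold. Details such as background handling and tie-breaking may differ from the official evaluation code.

def get_segments(labels):
    # collapse a frame-wise label sequence into (label, start, end) segments; end is exclusive
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((labels[start], start, t))
            start = t
    return segments

def f1_at_k(pred, gt, iou_threshold=0.1):
    # e.g. iou_threshold = 0.1 / 0.25 / 0.5 for F1@{10, 25, 50}
    pred_segs, gt_segs = get_segments(pred), get_segments(gt)
    matched = [False] * len(gt_segs)
    tp = fp = 0
    for p_label, p_start, p_end in pred_segs:
        best_iou, best_idx = 0.0, -1
        for i, (g_label, g_start, g_end) in enumerate(gt_segs):
            if g_label != p_label or matched[i]:
                continue
            inter = max(0, min(p_end, g_end) - max(p_start, g_start))
            union = max(p_end, g_end) - min(p_start, g_start)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx >= 0 and best_iou >= iou_threshold:
            tp, matched[best_idx] = tp + 1, True
        else:
            fp += 1
    fn = matched.count(False)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)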

2.2. Implementation and Optimization

Our implementation is based on the PyTorch [10, 13] framework. We extract I3D [1] features for the video frames and use these features as inputs to our model. The video frame rates are the same as in [4]: for the GTEA and Breakfast datasets we use a temporal resolution of 15 frames per second (fps), while for 50Salads we downsample the features from 30 fps to 15 fps to be consistent with the other datasets. For fair comparison, we adopt the same architecture design choices as MS-TCN [4], our baseline model. The whole model consists of four stages, where each stage contains ten dilated convolution layers. We set the number of filters to 64 in all layers of the model, with a filter size of 3. For optimization, we utilize the Adam optimizer with a batch size of 1, following the official implementation of MS-TCN [4]. Since the target data size is smaller than the source data size, each target sample is loaded randomly multiple times in each epoch during training. For the weighting of the loss functions, we follow the common strategy of [5, 6] to gradually increase $\beta_l$ and $\beta_g$ from 0 to 1. The weight $\alpha$ for the smoothness loss is 0.15 as in [4], and $\mu$ is chosen as $1 \times 10^{-2}$ via grid search.
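For reference, the ramp-up of $\beta_l$ and $\beta_g$ can be realized with the common DANN-style schedule from [5, 6], sketched below; the exact schedule and hyperparameters used in our experiments may differ, so this is illustrative only.

import math

def dann_weight(progress, gamma=10.0):
    # progress: fraction of training completed, in [0, 1]
    # beta(p) = 2 / (1 + exp(-gamma * p)) - 1 ramps smoothly from 0 to 1
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

# example values at the start, middle, and end of training
print([round(dann_weight(p), 3) for p in (0.0, 0.5, 1.0)])  # [0.0, 0.987, 1.0]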

2.3. Less Training Labeled Data

To investigate the potential to train with fewer labeled frames using SSTDA, we drop labeled frames from the source domain with uniform sampling during training and evaluate on the same validation data. Our experiment on the 50Salads dataset shows that, with SSTDA, the performance does not drop significantly as the amount of labeled training data decreases, indicating an alleviated reliance on labeled training data. Only 65% of the labeled training data are required to achieve performance comparable to MS-TCN, as shown in Table 2. We then evaluate the proposed SSTDA on GTEA and Breakfast with the same percentage of labeled training data, and also obtain comparable or better performance.

Table 2 also shows the results without using additional labeled training data; labeled data contain discriminative information that can directly boost action segmentation performance. The additional training data here are all unlabeled, so they cannot be exploited directly with the standard prediction loss. Therefore, we propose SSTDA to exploit unlabeled data to: 1) further improve the strong baseline, MS-TCN, without additional training labels, and 2) achieve comparable performance with this strong baseline using only 65% of the labels for training.



Figure 2: The overall architecture of the proposed SSTDA. By equipping the network with a local adversarial domain classifier $G_{ld}$, a global adversarial domain classifier $G_{gd}$, a domain attentive temporal pooling (DATP) module, and a domain attentive entropy (DAE) module, we convert an SS-TCN into a DA-TCN, and stack multiple SS-TCNs and DA-TCNs to build the final architecture. $\mathcal{L}_{ld}$ and $\mathcal{L}_{gd}$ are the local and global domain losses, respectively, $\mathcal{L}_y$ is the prediction loss, and $\mathcal{L}_{ae}$ is the attentive entropy loss. The domain entropy $H(\hat{d})$ is used to calculate the attention weights for DATP and DAE. An adversarial domain classifier $G$ refers to a domain classifier $G$ equipped with a gradient reversal layer (GRL).
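Since the adversarial domain classifiers rely on a gradient reversal layer, a minimal PyTorch sketch of the standard GRL construction [5, 6] is given below for reference; it is an illustrative sketch, not the authors' exact code.

import torch

class GradReverse(torch.autograd.Function):
    # identity in the forward pass; multiplies gradients by -beta in the backward pass
    @staticmethod
    def forward(ctx, x, beta):
        ctx.beta = beta
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.beta * grad_output, None

def grad_reverse(x, beta=1.0):
    # features passed through grad_reverse are fed to the domain classifier,
    # so the domain classifier is trained normally while the feature extractor
    # receives reversed (adversarial) gradients scaled by beta
    return GradReverse.apply(x, beta)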

50Salads      m%     F1@{10, 25, 50}      Edit    Acc
SSTDA        100%    83.0  81.5  73.8     75.8    83.2
              95%    81.6  80.0  73.1     75.6    83.2
              85%    81.0  78.9  70.9     73.8    82.1
              75%    78.9  76.5  68.6     71.7    81.1
              65%    77.7  75.0  66.2     69.3    80.7
MS-TCN       100%    75.4  73.4  65.2     68.9    82.1

GTEA          m%     F1@{10, 25, 50}      Edit    Acc
SSTDA        100%    90.0  89.1  78.0     86.2    79.8
              65%    85.2  82.6  69.3     79.6    75.7
MS-TCN       100%    86.5  83.6  71.9     81.3    76.5

Breakfast     m%     F1@{10, 25, 50}      Edit    Acc
SSTDA        100%    75.0  69.1  55.2     73.7    70.2
              65%    69.3  62.9  49.4     69.0    65.8
MS-TCN       100%    65.3  59.6  47.2     65.7    64.7

Table 2: The comparison of SSTDA trained with less labeled training data. $m$ in the first row indicates the percentage of labeled training data used to train a model.


2.4. Comparison with Other Approaches

We compare our proposed SSTDA with other approaches by integrating the same baseline architecture with other popular DA methods [6, 9, 11, 15, 12, 8] and a state-of-the-art video-based self-supervised approach [16]. For fair comparison, all the methods are integrated with the second and third stages, as in our proposed SSTDA, where the single-stage integration methods are described as follows:

1. DANN [6]: We add one discriminator, which is the same as $G_{ld}$, equipped with a gradient reversal layer (GRL), to the final frame-level features $f$.

2. JAN [9]: We apply Joint Maximum Mean Discrepancy (JMMD) to the final frame-level features $f$ and the class prediction $\hat{y}$.

3. MADA [11]: Instead of a single discriminator, we add multiple discriminators, one per class, to calculate the domain loss for each class. All the class-based domain losses are weighted by the prediction probabilities and then summed up to obtain the final domain loss.


50Salads                 F1@{10, 25, 50}      Edit
Source only (MS-TCN)     75.4  73.4  65.2     68.9
VCOP [16]                75.8  73.8  65.9     68.4
DANN [6]                 79.2  77.8  70.3     72.0
JAN [9]                  80.9  79.4  72.4     73.5
MADA [11]                79.6  77.4  70.0     72.4
MSTN [15]                79.3  77.6  71.5     72.1
MCD [12]                 78.2  75.5  67.1     70.8
SWD [8]                  78.2  76.2  67.4     71.6
SSTDA                    83.0  81.5  73.8     75.8

Breakfast                F1@{10, 25, 50}      Edit
Source only (MS-TCN)     65.3  59.6  47.2     65.7
VCOP [16]                68.5  62.9  50.1     67.9
DANN [6]                 72.8  67.8  55.1     71.7
JAN [9]                  70.2  64.7  52.0     70.0
MADA [11]                71.0  65.4  52.8     71.2
MSTN [15]                69.6  63.6  51.5     69.2
MCD [12]                 70.4  65.1  52.4     69.7
SWD [8]                  68.6  63.2  50.6     69.1
SSTDA                    75.0  69.1  55.2     73.7

Table 3: The comparison of different methods that can learn information from unlabeled target videos (on 50Salads and Breakfast). All the methods are integrated with the same baseline model MS-TCN for fair comparison.

4. MSTN [15]: We utilize pseudo-labels to cluster the data from the source and target domains, and calculate the class centroids for the source and target domains separately. Then we compute the semantic loss as the mean squared error (MSE) between the source and target centroids. The final loss contains the prediction loss, the semantic loss, and the domain loss as in DANN [6].

5. MCD [12]: We apply another classifier $G'_y$ and follow the adversarial training procedure of Maximum Classifier Discrepancy to iteratively optimize the generator ($G_f$ in our case) and the classifier ($G_y$). The L1-distance is used as the discrepancy loss.

6. SWD [8]: The framework is similar to MCD, but we replace the L1-distance with the sliced Wasserstein distance as the discrepancy loss.

7. VCOP [16]: We divide $f$ into three segments and compute segment-level features with temporal pooling. After temporally shuffling the segment-level features, pairwise features are computed and concatenated into a final feature representing the video clip order. This final feature is then fed into a shallow classifier to predict the order.

The experimental results on 50Salads and Breakfast both indicate that our proposed SSTDA outperforms all these methods, as shown in Table 3.

The performance of the most recent video-based self-supervised learning method [16] on 50Salads and Breakfast also shows that temporal shuffling within a single domain, without considering the relation across domains, does not effectively benefit cross-domain action segmentation, resulting in even worse performance than the other DA methods. Instead, our proposed self-supervised auxiliary tasks make predictions on cross-domain data, leading to cross-domain temporal relation reasoning rather than within-domain temporal order prediction, and achieving significant improvement on our main task, action segmentation.

3. Segmentation Visualization

Here we show more qualitative segmentation results from all three datasets to compare our methods with the baseline model, MS-TCN [4]. All the results (Figure 3 for GTEA, Figure 4 for 50Salads, and Figure 5 for Breakfast) demonstrate that the improvement over the baseline from local SSTDA alone is sometimes limited. For example, local SSTDA falsely detects the pour action in Figure 3b, falsely classifies cheese-related actions as cucumber-related actions in Figure 4b, and falsely detects the stir milk action in Figure 5b. However, by jointly aligning local and global temporal dynamics with SSTDA, the model is effectively adapted to the target domain, reducing the above-mentioned incorrect predictions and achieving better segmentation.

References

[1] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[2] Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, Jaekwon Woo, Ruxin Chen, and Jian Zheng. Temporal attentive alignment for large-scale video domain adaptation. In IEEE International Conference on Computer Vision (ICCV), 2019.
[3] Min-Hung Chen, Baopu Li, Yingze Bao, and Ghassan AlRegib. Action segmentation with mixed temporal domain adaptation. In The IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.
[4] Yazan Abu Farha and Jurgen Gall. MS-TCN: Multi-stage temporal convolutional network for action segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[5] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015.
[6] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research (JMLR), 17(1):2096–2030, 2016.


[Figure 3: color-coded segmentation results on GTEA. Rows: Ground Truth, MS-TCN, Local SSTDA, SSTDA. Panels: (a) Make coffee, (b) Make honey coffee.]

Figure 3: The visualization of temporal action segmentation for our methods with color-coding on GTEA. The video snapshots and the segmentation visualization are in the same temporal order (from left to right). We only highlight the action segments that are significantly different from the ground truth for clear comparison. “MS-TCN” represents the baseline model trained with only source data.

[7] Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. Temporal convolutional networks for action segmentation and detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[8] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[9] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning (ICML), 2017.
[10] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems Workshop (NeurIPSW), 2017.
[11] Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. In AAAI Conference on Artificial Intelligence (AAAI), 2018.


[Figure 4: color-coded segmentation results on 50Salads. Rows: Ground Truth, MS-TCN, Local SSTDA, SSTDA. Panels: (a) Subject 02, (b) Subject 04.]

Figure 4: The visualization of temporal action segmentation for our methods with color-coding on 50Salads. The video snapshots and the segmentation visualization are in the same temporal order (from left to right). We only highlight the action segments that are different from the ground truth for clear comparison. Both examples correspond to the same activity Make salad, but they are performed by different subjects, i.e., people.

[12] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.


[Figure 5: color-coded segmentation results on Breakfast. Rows: Ground Truth, MS-TCN, Local SSTDA, SSTDA. Panels: (a) Make Cereal, (b) Make milk.]

Figure 5: The visualization of temporal action segmentation for our methods with color-coding on Breakfast. The video snapshots and the segmentation visualization are in the same temporal order (from left to right). We only highlight the action segments that are different from the ground truth for clear comparison.

[13] Benoit Steiner, Zachary DeVito, Soumith Chintala, Sam Gross, Adam Paszke, Francisco Massa, Adam Lerer, Gregory Chanan, Zeming Lin, Edward Yang, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[14] Ximei Wang, Liang Li, Weirui Ye, Mingsheng Long, and Jianmin Wang. Transferable attention for domain adaptation. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
[15] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), 2018.
[16] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
