
Joint Fine-Tuning in Deep Neural Networks for Facial Expression Recognition
Heechul Jung, Sihaeng Lee, Junho Yim, Sunjeong Park, Junmo Kim (ICCV 2015)

Gundeep Arora, Kranti Kumar Parida, Vinay Kumar Verma
{gundeep,kranti,vkverma}@iitk.ac.in

Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur

1 Introduction and problem description:

Facial expression recognition (FER) is a long-standing problem in computer vision, with applications ranging from advertising and augmented reality to human-computer interaction and human response analysis. It has similarities to the action recognition problem, but the actions here are subtle and fine-grained, so a different approach is needed. We wish to try different approaches to transfer learning, with different loss functions and fine-tuning techniques, for generalized performance on expression, age, and gender recognition. Another objective is to use the obtained deep embedding of the face for the task of image reconstruction/inpainting.

2 Related Work:

Though facial analysis has been widely researched, the problem of inpainting is fairly new.

2.1 Facial Analysis

While there have been many approaches that extract facial features and use ensembles of CNNs, we discuss the two papers from which we drew most of our inspiration.

Joint Fine-Tuning in DNN (JFT) [4]: The objective of this paper is facial expression recognition. Large networks are prone to overfitting when training data is small, so instead of a single big model, several smaller networks are used, each capturing a different kind of feature without overfitting. Combining the different features is itself a problem: concatenation or weighted feature combinations are used most often, but they increase the feature dimension, are frequently not very discriminative, and the higher dimensionality again raises the risk of overfitting. This paper instead uses two small networks, a deep temporal appearance network (DTAN) and a deep temporal geometry network (DTGN). The DTAN uses a 3D-CNN for visual feature extraction, while the DTGN takes the facial key points and extracts geometric information. The two networks are jointly fine-tuned into a smaller-dimensional feature space (a schematic sketch is given at the end of this subsection). The paper reports state-of-the-art results among image sequence based approaches.

Age and Gender Recognition [2]: This paper shows that learning representations with deep convolutional neural networks (CNNs) gives a significant increase in performance on age and gender classification. A simple convolutional architecture is proposed that can be used even when the amount of training data is limited. Evaluated on the recent Adience benchmark for age and gender estimation, it dramatically outperforms the previous state-of-the-art methods.
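To make the joint fine-tuning scheme concrete, here is a schematic PyTorch-style sketch in Python (illustrative only: the module names, the unweighted sum of the two streams, and the exact loss combination are our simplifications, not the precise formulation of [4]):

```python
import torch.nn as nn
import torch.nn.functional as F

class JointFineTune(nn.Module):
    """Schematic joint fine-tuning of two pre-trained expression networks.

    `dtan` and `dtgn` are assumed to be pre-trained modules mapping frame
    volumes and landmark trajectories, respectively, to class logits.
    """
    def __init__(self, dtan, dtgn):
        super().__init__()
        self.dtan, self.dtgn = dtan, dtgn

    def forward(self, frames, landmarks):
        logits_a = self.dtan(frames)      # appearance stream (3D-CNN)
        logits_g = self.dtgn(landmarks)   # geometry stream (landmarks)
        return logits_a, logits_g, logits_a + logits_g  # combined prediction

def joint_loss(logits_a, logits_g, logits_joint, target):
    # Each stream keeps its own supervision while a third term pushes
    # the combined prediction to be discriminative.
    return (F.cross_entropy(logits_a, target)
            + F.cross_entropy(logits_g, target)
            + F.cross_entropy(logits_joint, target))
```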

2.2 Image Reconstruction and Inpainting

The problem of reconstructing partial images, being relatively new, has not been studied extensively. However, generative models have been tried and proved effective. Generative Adversarial Networks (GANs) generate meaningful image content through competition between a generative model, which captures the data distribution, and a discriminative model, which penalizes flaws in the generative model. The two models iteratively improve each other during training. Context encoders [7] can also be viewed as generative models: while the input to a GAN is typically a random noise vector, the input to a context encoder is the uncorrupted pixels, and the network is trained to predict the missing content.
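Concretely, the context encoder of [7] minimizes a weighted sum of a masked reconstruction loss and an adversarial loss. In the notation of that paper, with $x$ the image, $\hat{M}$ the binary mask of dropped pixels, and $F$ the encoder-decoder:

```latex
\mathcal{L}_{rec}(x) = \left\| \hat{M} \odot \left( x - F\!\big((1-\hat{M}) \odot x\big) \right) \right\|_2^2,
\qquad
\mathcal{L} = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{adv}\,\mathcal{L}_{adv}
```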

3 Dataset(s) used:

We have independently trained three networks, for age estimation, gender estimation, and facial expression recognition.

Expression Recognition: CK+ [5]: The CK+ dataset consists of 527 videos (image frame sequences) covering 7 expressions from 100+ subjects, of which 324 are annotated. The expressions are Disgust, Happy, Surprise, Fear, Angry, Contempt, and Sadness. The face and the landmark points are extracted from each of the sampled frames using OpenFace [1]. Challenge: the dataset is small for our training, and the annotations are biased towards a few classes, so the training set over-represents some classes and has very few samples for others. We augmented the dataset 15 times by rotating each frame by 7 different angles, both clockwise and anti-clockwise, and also flipping the frame images; a sketch is given below. This gave us improved training, as shown in the results table.
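A minimal sketch of this augmentation, assuming the face crops of one video are stacked as a (T, H, W) array (NumPy/SciPy; the exact rotation angles are not fixed in this report, so the values below are placeholders):

```python
import numpy as np
from scipy.ndimage import rotate

ANGLES = (2, 4, 6, 8, 10, 12, 14)  # degrees; 7 placeholder angles

def augment(frames):
    """Return the original frame stack plus rotated and flipped copies.

    frames: (T, H, W) array of face crops for one video. One rotation per
    angle in each direction plus a horizontal flip; the report's exact
    15x accounting may group these slightly differently.
    """
    out = [frames]
    for a in ANGLES:
        out.append(rotate(frames, a, axes=(1, 2), reshape=False))   # anti-clockwise
        out.append(rotate(frames, -a, axes=(1, 2), reshape=False))  # clockwise
    out.append(frames[:, :, ::-1].copy())                           # horizontal flip
    return out
```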

Oulu-CASIA [3]: The Oulu-CASIA dataset is another standard video dataset used for reporting expression recognition performance. It consists of 480 videos of 160 subjects, each with six expressions: Anger, Disgust, Fear, Happiness, Sadness, and Surprise. Each expression is captured under three illumination conditions (strong, weak, and dark); we use the strong-illumination section for training. Seven frames are sampled per video in a temporal-locality-aware manner, and the face and landmark points are extracted from each sampled frame using OpenFace. The dataset is fairly sized, and we augmented it to 10x the original training set by rotating the images by 4 different angles, both clockwise and anti-clockwise, and also adding flipped images to the training set.

Age & Gender :Adience benchmark dataset [2] : The main purpose behind this type

of collection of data is to capture the maximum variations possible inreal-life situation. In particular, it attempts to capture all the variationsin appearance, noise, pose, lighting and more, that can be expected ofimages taken without careful preparation or posing. The faces have beenextracted from it by running the Viola and Jones face detector [8], andthen detecting facial feature points using a modified version of the codeprovided by the authors of [9]. We have used the frontal set only forour purpose. The dataset contains a total no. of 26,580 images from2,284 subjects. It has 8(0−2,4−6,8−13,15−20,25−32,38−43,48−53,60−) age groups/labels and respective gender labels. Out of all theimages present there are some images, mostly in the age group of (0−2)and (4−6), whose gender label is hard to label and those are mentionedas undefined in the dataset. We have ignored these datapoints.

Context Encoder: FaceScrub [6]: This dataset was made primarily for face and facial attribute recognition. It consists of over 100,000 images with a fair division across gender and ethnicity, and is one of the largest publicly available face databases.

4 Existing code and libraries used:

We are basically working on reproducing two papers; for one of them the code was not available, while for the other it was.

4.1 Facial Expression Recognition:

For this task we use the Torch deep learning library with cuDNN and the nngraph and nn packages. We are not using any existing code from any repository; the complete code has been written by us. For the architecture we take the idea from the paper [4], but since the authors have neither shared details of the architecture nor responded to mails requesting them, the architecture has been designed by us. Because the architecture is our own, we cannot use any trained model; there is no fine-tuning, and the whole training is done from random initialization. To extract the face and the landmark points from the frames of each video, we use OpenFace [1], which returns 68 landmark points per frame. These points were normalized with respect to their centroid, as sketched below. Challenge: for face and landmark extraction we had two options, IntraFace and OpenFace; we chose OpenFace because it is easily available, open-source, and has decent support from online blogs and GitHub for usage and installation. However, OpenFace is very slow at returning results at scale, so data preprocessing became a bottleneck for our training.
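A minimal sketch of the centroid normalization, assuming OpenFace's 68 (x, y) landmarks for one frame (NumPy; illustrative):

```python
import numpy as np

def normalize_landmarks(points):
    """Centre 68 facial landmarks on their centroid.

    points: (68, 2) array of (x, y) coordinates from OpenFace. Subtracting
    the centroid removes the face's position in the frame; the flattened
    result serves as the geometry feature for the landmark network.
    """
    centered = points - points.mean(axis=0)  # translate centroid to origin
    return centered.flatten()                # 136-dim geometry feature
```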

4.2 Age and Gender Recognition:

For the gender and age estimation networks we used the Caffe deep learning framework. The architecture is the one described in [2], and we used the publicly available code released by the authors of that paper, retraining the whole model on CPU. The method in [2] uses five-fold cross-validation to measure accuracy for both age and gender, but we used only one of the five folds, since training the network on CPU takes a very long time; we plan to extend this to five-fold cross-validation if we get access to GPUs. Our final goal is to design a single network for both gender and age estimation and to rewrite the code for the whole network in Lua, which is required for the final integration with the expression network.

4.3 Context Encoder:

The code for the paper [7] was publicly available, although it had some bugs and needed refactoring. It is written in Torch and uses the cutorch and cunn libraries. The architecture is inspired by the one described in that paper. An additional embedding, a representation of the face obtained from the age, gender, and expression networks, is concatenated with the one produced by the context encoder. The result is then passed into the decoder network, which competes with the discriminator network of the architecture.

5 Our Approach

Our approach can broadly be divided based on the three subtasks we performed.

5.1 Expression Recognition

For this task, we have followed the approach of [4]. In this architecture two small networks are combined: one extracts CNN features, the other uses the landmark points, and the loss function jointly optimizes both networks. Since the exact architecture of the 3D-CNN was not available, we designed our own network for it. A 3D-CNN requires a fixed-size input video, so every video is divided into 7 non-overlapping segments, one frame is selected from each segment, and landmark points are extracted from these frames (a sketch of the sampling follows). Instead of the 27 landmark points used in [4], we use all 68 points and normalize them to the respective frame mean.
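A sketch of the temporal sampling, assuming the video is stored as a (T, H, W) array (picking the middle frame of each segment is our illustrative choice; any locality-aware pick per segment would do):

```python
import numpy as np

def sample_frames(video, n_segments=7):
    """Pick one frame from each of n_segments non-overlapping segments."""
    T = video.shape[0]
    bounds = np.linspace(0, T, n_segments + 1, dtype=int)  # segment borders
    idx = [(lo + hi) // 2 for lo, hi in zip(bounds[:-1], bounds[1:])]
    return video[idx]  # (7, H, W): fixed-size input for the 3D-CNN
```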

Figure 1: Joint architecture of the 3D-CNN and landmark-point networks.

We implemented the code in Torch; no pre-trained initializer is used. We experimented with different filter sizes, numbers of filters, dropout rates at different layers, and gradient updates, and chose the best configuration. We tried this approach with both the non-augmented and the augmented data. Our complete architecture is shown in Figure 2. We optimized the model and obtained better accuracy than that reported by the authors of [4].

Figure 2: Architecture for the joint learning

5.2 Age and gender classification

For the task of age and gender classification we used the network architecture proposed in [2], in which two separate CNNs are trained for the two tasks. Initially we used the existing code provided by the authors, written in the Caffe framework, and reproduced the results. To integrate this network with the expression network we had to re-implement the model in Torch. To generalize the technique of [4], we implemented the joint model in Torch and also tried fine-tuning the network to improve the results.

5.3 Face Reconstruction with side information

We modified the network of [7] for this. The final fc layer from each of the age, gender, and expression networks is reduced to a 1000-unit embedding and concatenated with the 4000-unit encoded representation of the input image. We add another fc layer to reduce the concatenated vector back to 4000 units and then pass it through the decoder component of the network; a sketch follows. The network is very difficult and slow to train, which made hyperparameter optimization difficult. The network could not converge, and intermediate results have been reported.
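A PyTorch-style sketch of this fusion step (illustrative; our actual implementation is in Torch/Lua, and the module and variable names below are ours):

```python
import torch
import torch.nn as nn

class SideInfoFusion(nn.Module):
    """Fuse the context-encoder code with the side-information embeddings."""
    def __init__(self, code_dim=4000, side_dims=(1000, 1000, 1000)):
        super().__init__()
        # fc layer mapping the concatenation back to the decoder's input size
        self.reduce = nn.Linear(code_dim + sum(side_dims), code_dim)

    def forward(self, code, age_emb, gender_emb, expr_emb):
        # code: (B, 4000) from the encoder; each embedding: (B, 1000)
        fused = torch.cat([code, age_emb, gender_emb, expr_emb], dim=1)
        return self.reduce(fused)  # (B, 4000), passed to the decoder
```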

Figure 3: The architecture for face reconstruction using side information


Method                        CK+       CASIA
Joint Fine-Tuning [4]         97.25%    81.46%
Our Network (Before Midterm)  93.61%    70.68%
Our Network (After Midterm)   98.38%    74.68%

Table 1: Results on CK+ and Oulu-CASIA.

Network   Reported   Ours     Ours [JFT]
Gender    86.8%      91.17%   56.3%
Age       56.7%      73.18%   36.2%

Table 2: Age and gender results.

6 Results

We report the performance of our network for FER on two datasets, CK+ and Oulu-CASIA (Table 1). The first row is the result reported in the paper; the others are results from our network, of which multiple variants were tried, and we report the accuracy of the best-performing one. Experiments varied the dropout percentage at different layers, the number of filters, etc.

[Plots: % mean class accuracy vs. training iterations, for the train set and the test set.]

Figure 4: Convergence plots of train and test accuracy on CK+.

For Age and Gender Classification: The performance of the age and gender networks on the Adience benchmark dataset is given in Table 2. The first column is the accuracy reported in the paper, and the next column is the accuracy obtained by us. The convergence plots for both networks are shown in Figure 5.

The difference between our numbers and the reported ones arises because the authors used 5-fold cross-validation to measure accuracy, whereas we used only one fold for both training and testing.

Figure 5: Convergence plots for the age network (top) and the gender network (bottom).

Face Reconstruction with side information: The network was trained with two different masks, one at the center of the image and the other shifted to cover the lower part of the face (a sketch of the mask construction follows). The images below show the actual image, the input image, and the reconstructed image. Due to lack of time, output from the non-converged network (central mask) has been reported; the output for the shifted mask will be reported as an addendum.
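For reference, a minimal sketch of how such masks can be built (NumPy; the hole size and the downward shift are illustrative values, not the exact ones we used):

```python
import numpy as np

def make_mask(h, w, size=64, shift_down=0):
    """Binary mask of the dropped region: 1 inside the hole, 0 elsewhere.

    shift_down moves the square from the image centre towards the lower
    part of the face, as in our second experiment.
    """
    mask = np.zeros((h, w), dtype=np.uint8)
    top = (h - size) // 2 + shift_down
    left = (w - size) // 2
    mask[top:top + size, left:left + size] = 1
    return mask
```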

7 Discussions

• The use of augmented data did not improve the performance of the network much; the rotations seemed to confuse the network with respect to the location of the landmark points.

• We attempted to generalize the architecture (using landmark points of the face) to age and gender, but could not achieve good results. A landmark is only a location descriptor of a face and carries little age or gender information. Another possible reason is the variance in the subject's expression from image to image.

• The context encoders are slow and hard to train. However, the addition of side information should result in better reconstruction.

• The side information can further be extended to natural language input.

8 Comparison:

8.1 Proposed work

Our project proposal was to reconstruct faces using an adversarial network with the help of side information, where the side information is the facial expression, age, and gender of the given image. Our complete objective is therefore to estimate the expression, age, and gender of an image and to use this information to reconstruct the face.

8.2 Before Midterm

• We have implemented joint fine-tuning [4] for facial expression recognition and achieved results comparable to the state of the art, as reported in Table 1.


• We have fine-tuned the trained model of the age and gender network [2] and achieved a significant improvement in accuracy, as reported in Table 2.

8.3 After Midterm

• We have optimized our joint fine-tuning network by adding more layers and further tuning the hyperparameters. This boosted the results (Table 1) beyond the state of the art reported in the paper [4].

• We have re-implemented the existing age and gender network in Torch, as it was only available in Caffe and a Torch model was required for the final integration of all the networks.

• We have also attempted to generalize all three networks (age, gender, and expression) into a unified framework.

• An embedding obtained from the age, gender, and expression networks has been added to the encoded layer of the context encoder architecture. An experiment for this was run, and we have initial results, as the model has not yet converged properly.

Qualitative Results

The top three images below are correctly classified, while the ones beneath them are misclassified.

Figure 6: Age and gender: the top three images are correctly classified, while the others are misclassified.

The following images show the faces reconstructed by one of our networks by the end of 260 epochs.

Figure 7: Reconstructed faces when the mask is in the center of the image. This is the result at the end of 260 epochs.

9 References

[1] Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. OpenFace: A general-purpose face recognition library with mobile applications. Technical report, CMU-CS-16-118, CMU School of Computer Science, 2016.

[2] Eran Eidinger, Roee Enbar, and Tal Hassner. Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security, 9(12):2170–2179, 2014.

[3] Guoying Zhao, Xiaohua Huang, Matti Taini, Stan Z. Li, and Matti Pietikäinen. Facial expression recognition from near-infrared videos. Image and Vision Computing, 29(9):607–619, 2011. URL http://www.sciencedirect.com/science/article/pii/S0262885611000515.

[4] Heechul Jung, Sihaeng Lee, Junho Yim, Sunjeong Park, and Junmo Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2983–2991, 2015.

[5] Patrick Lucey, Jeffrey F. Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 94–101. IEEE, 2010.

[6] Hong-Wei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), pages 343–347. IEEE, 2014.

[7] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. arXiv preprint arXiv:1604.07379, 2016.

[8] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

[9] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.
