
1. Sparse CNN model

In early April of this year, we proposed to learn a compact CNN model, named SPCNN, to reduce computation and memory cost.

Motivation: Since the 4K*1K (4M) connections between the fc8 layer and the fc7 layer form one of the densest connections between CNN layers, we focus on removing most of the small connections between them, because small connections are often unstable and may introduce noise. In fact, a given category (a neuron in the fc8 layer) is naturally related to only a small number of category bases (neurons in the fc7 layer); i.e., the former can be modelled as a sparse linear combination of a small number of the latter.

We therefore propose to select only a small number of connections with large weights between a given category and the category bases for retraining, and to fix the other small, unstable weights at zero without retraining.

Fig.1 Illustration of Sparse CNN model

Experiments on the validation dataset show that keeping and retraining only a small fraction (about 9.12%) of the initial 4M connections still yields a 0.32% gain in top-5 accuracy (92.87% vs. 92.55% for the initial VGG model), even after removing the remaining 90.88% of small, unstable connections.
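The connection selection described above can be sketched as a per-category magnitude mask over the fc8 weight matrix (a minimal NumPy sketch; the function name, keep_ratio and the toy shapes are illustrative, not the exact training code):

```python
import numpy as np

def sparsify_fc_weights(W, keep_ratio=0.0912):
    """Keep only the largest-magnitude weights per category (row) of a
    fully connected layer and zero the rest.  For fc8 on top of fc7,
    W has shape (num_categories, num_bases), e.g. (1000, 4096)."""
    k = max(1, int(round(W.shape[1] * keep_ratio)))
    mask = np.zeros_like(W, dtype=bool)
    # column indices of the k largest |w| in each row
    top = np.argsort(-np.abs(W), axis=1)[:, :k]
    np.put_along_axis(mask, top, True, axis=1)
    return W * mask, mask
```

In the full SPCNN procedure the surviving large weights are then retrained while the masked entries stay fixed at zero.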

2. Sparse Ensemble Learning (SEL)

To overcome the low efficiency and poor scalability of classical methods based on global classification, we proposed sparse ensemble learning (SEL) for visual concept detection in [1].

As shown in Fig.2, SEL leverages sparse codes of image features both to partition the large-scale training set into small localities for efficient training and to coordinate the individual classifiers in each locality for the final classification.

The sparsity ensures that only a small number of classifiers in the ensemble fire on a test sample, which not only gives better AP than other fusion methods such as average fusion and AdaBoost, but also enables highly efficient online detection.

Fig.2 SEL System Framework

Experiments on the validation dataset show that, compared with CaffeNet, SEL with CNN features improves top-1 and top-5 accuracies by 3.6% and 1.0%, respectively.
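SEL inference can be sketched as follows (an illustrative NumPy sketch, not the paper's exact algorithm: the crude correlation-based sparse code, the one-classifier-per-atom setup and all names are assumptions). The feature is sparse-coded over a dictionary, and only the classifiers attached to the few active atoms vote, weighted by their code magnitudes:

```python
import numpy as np

def sel_predict(x, dictionary, classifiers, n_nonzero=3):
    """Sparse Ensemble Learning inference sketch: sparse-code x over the
    dictionary rows, then fuse only the classifiers of the active atoms."""
    # crude sparse code: correlations with atoms, keep the largest few
    corr = dictionary @ x
    active = np.argsort(-np.abs(corr))[:n_nonzero]
    weights = np.abs(corr[active])
    total = weights.sum()
    if total > 0:
        weights = weights / total
    # weighted fusion of the active locality classifiers only
    return sum(w * classifiers[i](x) for i, w in zip(active, weights))
```

Because most code coefficients are zero, the vast majority of locality classifiers are never evaluated on a given test sample, which is the source of the efficiency claim above.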

3. Ordered weighted averaging (OWA) Fusion

Experiments on both the validation and test datasets show that OWA fusion outperforms average fusion.
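OWA fusion can be sketched as follows (a minimal NumPy sketch; the weight vectors in the example are illustrative, since the poster does not give the weights actually used). For each class, the models' scores are sorted in descending order and combined with a fixed weight vector, so the weights attach to score ranks rather than to particular models:

```python
import numpy as np

def owa_fuse(model_scores, owa_weights):
    """Ordered Weighted Averaging: sort each class's scores across models
    in descending order, then dot with a fixed (normalized) weight vector."""
    S = np.asarray(model_scores, dtype=float)   # (n_models, n_classes)
    w = np.asarray(owa_weights, dtype=float)
    w = w / w.sum()
    ordered = np.sort(S, axis=0)[::-1]          # per-class descending scores
    return w @ ordered                          # fused scores, (n_classes,)
```

Average fusion is the special case of uniform weights; putting more mass on the leading ranks makes the fusion behave closer to a per-class max.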

4. Object localization


Fig.3 Illustration of our object localization: (a) Training; (b) Testing

5. Object detection

An overwhelming majority of existing object detection methods focus on reducing the number of region proposals while keeping high object recall, without considering category information; this may produce many false positives due to interference between categories, especially when the number of categories is very large. To eliminate such interference, we propose a novel category aggregation approach based on our observation that categories detected more frequently around an object are more likely to be present in the image. By further exploiting the co-occurrence relationships between categories, we can determine the most probable categories for an image in advance.

As illustrated in Fig.4, our object detection strategy can be summarized as the following steps:

Region proposal extraction: We employ region proposal methods such as Selective Search, Edge Boxes and RPN to generate category-independent candidate bounding boxes.

Category aggregation and co-occurrence refinement: To filter out false positives among the region proposals, we apply our category aggregation together with co-occurrence refinement to determine the most probable categories present in the image.

Region classification: We classify the proposals with classifiers trained with CNN features.

Region fusion: We fuse the detection results of the different region proposals to select high-quality bounding boxes.
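The category aggregation and co-occurrence refinement steps above can be sketched as a voting scheme (an illustrative sketch; the scoring rule, the `cooccur` dictionary and all names are assumptions, as the poster does not specify the exact formulation): categories detected on many proposals score highly, and each category's score is boosted by its co-occurrence with the other frequently detected categories.

```python
from collections import Counter

def aggregate_categories(proposal_labels, cooccur, top_k=5):
    """Score each category by its detection frequency over the proposals,
    boosted by co-occurrence with other detected categories; return the
    top_k most probable categories.  cooccur[(a, b)] is a symmetric
    co-occurrence weight between categories a and b."""
    freq = Counter(proposal_labels)
    scores = {}
    for c, n in freq.items():
        boost = sum(cooccur.get((c, d), cooccur.get((d, c), 0.0)) * m
                    for d, m in freq.items() if d != c)
        scores[c] = n + boost
    ranked = sorted(scores, key=lambda c: -scores[c])
    return ranked[:top_k]
```

Detections whose category falls outside this shortlist can then be suppressed before region classification and fusion, which is how the approach removes cross-category false positives.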

Fig.4 Illustration of our object detection framework

6. ILSVRC 2015 Results

Fig.5 Our DET results (Rank 5th/20 teams) Fig.6 Our CLS-LOC results (Rank 4th/23 teams)

References

[1] Sheng Tang, Yan-Tao Zheng, Yu Wang, Tat-Seng Chua, “Sparse Ensemble Learning for Concept Detection,” IEEE Transactions on Multimedia, 14(1): 43-54, February 2012.

MCG-ICT-CAS's Investigation of Model Sparsity and Category Information on Object Classification, Localization and Detection at ILSVRC 2015

Tang Sheng(唐胜), Zhang Yong-Dong, Zhang Rui, Li Ling-Hui, Wang Bin, Li Yu, Deng Li-Xi, Xiao Jun-Bin, Cao Zhi, Li Jin-Tao

Multimedia Computing Group (MCG), Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing 100190, China

{ts, zhyd, zhangrui, lilinghui, wangbin1, liyu, denglixi, xiaojunbin, caozhi, jtli}@ict.ac.cn


ILSVRC 2015, in conjunction with ICCV 2015, December 17th, 2015. ImageNet and MS COCO Visual Recognition Challenges Joint Workshop.

[Abstract] At ILSVRC 2015, we participated in two tasks, object classification/localization (CLS-LOC) and object detection (DET), without using any outside images or annotations. In the CLS-LOC task, our contributions are three-fold. (1) Sparse CNN model (SPCNN): we select only a small number of connections with large weights between a given category (fc8 layer) and the category bases (fc7 layer) for retraining, and fix the other small, unstable weights at zero without retraining. (2) Sparse Ensemble Learning (SEL): it leverages sparse codes of CNN features both to partition the large-scale training set into small localities for efficient training and to coordinate the individual classifiers in each locality for the final classification. (3) Ordered weighted averaging (OWA): we fuse all the models, including GoogLeNet, VGG, SPCNN and SEL, using OWA in terms of top-5 accuracy. In the DET task, we propose a novel category aggregation method that exploits the co-occurrence relationships between categories to filter out many false positives among the region proposals.

CLS-LOC results:

Rank  Team (with provided data)  Error (%)
1     MSRA                       9.01
2     Trimps-Soushen             12.29
3     Qualcomm Research          12.55
4     MCG-ICT-CAS                14.69
5     Lunit-KAIST                14.73
6     Tencent-Bestimage          15.55
7     ReCeption                  19.58
8     Cuimage                    21.21
…     …                          …

23 teams in total.

DET results:

Rank  Team (with provided data)   mAP (%)
1     MSRA                        62.07
2     Qualcomm Research           53.57
3     Cuimage                     52.71
4     The University of Adelaide  51.44
5     MCG-ICT-CAS                 45.36
6     DROPLET-CASIA               44.86
7     Trimps-Soushen              44.64
…     …                           …

20 teams in total.