
On the Influence of Region Mismatch at Training and Testing in Region-Related Concept Detection

Zan Gao, Zhicheng Zhao, Tao Liu, Xiaoming Nan, Anni Cai
School of Information and Communication Engineering
BUPT, Beijing, P.R. China

E-mail: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—A great number of region-related concept detection algorithms have been proposed so far, but few of them consider the problem of mismatched regions at the training and testing stages. In order to investigate this mismatch problem in region-related concept detection, we introduce three methods for annotating the datasets and then conduct experiments on differently annotated training and testing datasets. We find from these experiments that detection performance is best when the regions of a region-related concept are well defined and matched between training and testing; otherwise, detection performance decreases. Based on these observations, we propose a fusion scheme that combines the results of classifiers trained with datasets annotated by different methods. Experiments on the TRECVID-2007 test corpus show that the proposed fusion scheme obtains performance improvements of 6~12%.

Keywords-region mismatch; concept detection; fusion; annotation; SVM; TRECVID

I. INTRODUCTION

Semantic concept detection (high-level feature extraction) in natural images and video sequences is a fundamental operation in a number of fields, such as machine vision and video retrieval. Semantic concepts generally fall into two categories: concrete concepts and abstract concepts. Generally, one abstract concept may comprise several concrete concepts; for example, meeting room, which is an abstract concept, is composed of several concrete concepts such as desks and chairs, and the abstract concept cityscape is made up of sky and buildings, which are concrete concepts. In many works [1-7], concrete concepts are first detected, and abstract concepts are then inferred with a semantic ontology. Therefore, the extraction of concrete concepts plays an important role in semantic concept detection. If a concrete concept corresponds to a limited region of an image, such as dog, boat, car or telephone, we call it a region-related concept.

A great number of region-related concept detection algorithms have been proposed so far; however, only a few of them consider the problem of mismatched regions at the training and testing stages. At training time, one can annotate a region-related concept with a region of proper size, but it is difficult to automatically obtain the region of a region-related concept in a test image. For example, in [8], the authors segmented the key frame with JSEG [9], which is based on color and region growing, to obtain the region of a concept, but that algorithm often over-segments an object into several small regions. Therefore, in practice, features are often extracted from the corresponding regions of region-related concepts at training time, but from whole frames at testing time. As a result, the regions used at training and testing are mismatched, and performance degradation of the concept detection algorithm can be expected. Some researchers have tried to address this problem by finding scale-invariant features or by dividing the whole frame into several regions, but these approaches still have drawbacks. For example, in [10-12], the authors extracted key points from the whole frame, which are scale and view invariant, but many noisy key points from outside the concept region may be included in the feature set. Other researchers divided the key frame into a set of small blocks [13-16], but these blocks do not necessarily match the region of the concept.

In this paper, we concentrate on the problem of mismatched regions between training and testing. To this end, we introduce three methods to annotate the datasets and then conduct experiments on differently annotated training and testing datasets to investigate the influence of the mismatch problem on region-related concept detection. In addition, a new fusion scheme is proposed based on the different types of annotations.

The rest of the paper is organized as follows. Section II introduces the different types of annotations. The framework of semantic concept detection is given in section III. The influence of the different types of annotations on region-related concept detection is evaluated in section IV. A new fusion scheme is proposed and its performance tested in section V. Finally, we conclude the paper in section VI.

II. ANNOTATIONS

In order to study the problem of mismatched regions at training and testing, we take the TRECVID (TREC Video Retrieval Evaluation) dataset as an example. The semantic concept detection task of TRECVID-2008 defines twenty concepts; the concept names and indices are as follows.

01 Classroom, 02 Bridge, 03 Emergency_Vehicle, 04 Dog, 05 Kitchen, 06 Airplane_flying, 07 Two people, 08 Bus, 09 Driver, 10 Cityscape, 11 Harbor, 12 Telephone, 13 Street, 14 Demonstration_Or_Protest, 15 Hand, 16 Mountain, 17 Nighttime, 18 Boat_Ship, 19 Flower, 20 Singing.

Among these twenty concepts, some are region-related. We use the following three methods to annotate the above concepts in the key frames of a video sequence.

I) We first annotate the above concepts in the TRECVID-2007 corpus using the IBM MPEG-7 Annotation Tool v.1.5.1 [18]. The regions corresponding to this type of annotation are shown in Fig. 1(a): the whole frame that contains the concept to be detected is considered a positive sample of that concept.

II) An annotation of the TRECVID-2007 corpus has also been released by MCG-ICT-CAS [17], in which region-related concepts are annotated with tight regions. Fig. 1(b) shows the regions of this type of annotation.

III) The final way to annotate the concepts, also with the IBM MPEG-7 Annotation Tool v.1.5.1, is to segment a large region that contains one or more objects to be detected as well as some surroundings, and to consider it as one positive sample of that concept. The regions corresponding to this type of annotation are shown in Fig. 1(c). For example, in the second row of Fig. 1(c), two boats are segmented together in one large region to give one positive sample.

Figure 1. Examples of different types of annotations

III. SYSTEM DESCRIPTION

The system used in our investigation is shown in Fig. 2. We first introduce the features used in the system. These features are extracted at different granularities: point features, local features, global features, and features over a group of frames. The details are as follows.

A. Scale Invariant Feature Transform (SIFT)

It has been shown in [10-14] that SIFT gives good performance in video analysis, so it is chosen as the feature at the finest granularity in the system. First, we build a visual vocabulary from SIFT points detected with the Difference of Gaussians (DoG) detector [19] on the Y channel of key frames of the training sequences. Then SIFT points are extracted from every key frame of a video sequence, and each point is assigned to its nearest visual word. Finally, the number of occurrences of each visual word is recorded in a histogram, which is used as the feature representing that frame.
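For concreteness, the sketch below illustrates this bag-of-visual-words pipeline in Python. It assumes OpenCV's SIFT implementation and scikit-learn's k-means; the function names and every parameter except the 1,000-word vocabulary size (given in section IV) are ours, not the paper's.

```python
# Hedged sketch of the SIFT bag-of-visual-words feature described above.
# Assumes OpenCV (cv2) for DoG/SIFT and scikit-learn for k-means; everything
# except the 1000-word vocabulary size is an illustrative choice.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_sift(gray_frame):
    """DoG keypoints and SIFT descriptors on the Y (gray) channel of a key frame."""
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(gray_frame, None)
    return descriptors if descriptors is not None else np.empty((0, 128), np.float32)

def build_vocabulary(training_frames, n_words=1000):
    """Quantize SIFT points from training key frames into visual words."""
    all_desc = np.vstack([extract_sift(f) for f in training_frames])
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(all_desc)

def bow_histogram(frame, vocabulary):
    """Assign each SIFT point to its nearest visual word and count occurrences."""
    desc = extract_sift(frame)
    n_words = vocabulary.n_clusters
    if len(desc) == 0:
        return np.zeros(n_words)
    words = vocabulary.predict(desc)
    return np.bincount(words, minlength=n_words).astype(float)
```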

B. Gabor Wavelet

We use the Gabor wavelets proposed in [20], but choose $\nu\in\{0,2,4\}$, $\mu\in\{0,1,\ldots,5\}$, $\sigma=2.5\pi$, $k_{max}=\pi/2$ and $f=\sqrt{2}$. Let $I(x,y)$ be the gray-level distribution of an image; the convolution of image $I$ with a Gabor kernel $\Psi_{\mu,\nu}$ is calculated as

$$O_{\mu,\nu}(z)=I(z)\otimes\Psi_{\mu,\nu}(z) \quad (1)$$

where $\otimes$ denotes the convolution operator. $O_{\mu,\nu}(z)$ is then divided into $3\times3$ blocks, and the mean and standard deviation of each block are concatenated to form one Gabor feature vector.
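The following sketch shows one way to compute this feature, using the Gabor kernel form of [20]. The kernel window size, the orientation spacing of μπ/6, and the use of the response magnitude are assumptions on our part, not details given in the paper.

```python
# Hedged sketch of the block-wise Gabor feature: convolve the gray frame with
# Gabor kernels at the scales/orientations given above, split each response
# into 3x3 blocks, and cascade block means and standard deviations.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(mu, nu, sigma=2.5 * np.pi, k_max=np.pi / 2, f=np.sqrt(2), size=33):
    k = k_max / (f ** nu)
    phi = mu * np.pi / 6                       # assumed orientation spacing
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    k_sq = kx ** 2 + ky ** 2
    envelope = (k_sq / sigma ** 2) * np.exp(-k_sq * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier

def gabor_feature(gray, nus=(0, 2, 4), mus=range(6), grid=3):
    feats = []
    for nu in nus:
        for mu in mus:
            # magnitude of the complex response (a common choice; not specified above)
            resp = np.abs(fftconvolve(gray, gabor_kernel(mu, nu), mode='same'))
            h, w = resp.shape
            for by in range(grid):
                for bx in range(grid):
                    block = resp[by * h // grid:(by + 1) * h // grid,
                                 bx * w // grid:(bx + 1) * w // grid]
                    feats.extend([block.mean(), block.std()])
    return np.array(feats)                     # 3 scales * 6 orientations * 9 blocks * 2
```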

C. Edge Directional Histogram

In MPEG-7, the edge histogram descriptor represents the spatial distribution of five types of edges, namely four directional edges and one non-directional edge. The edge histogram feature is extracted according to MPEG-7.
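As a rough, simplified illustration of such a descriptor (the MPEG-7 standard fixes the exact procedure; block size and threshold below are illustrative choices of ours):

```python
# Hedged, simplified sketch of an MPEG-7-style edge directional histogram:
# the frame is split into 4x4 sub-images; 2x2-macroblock edge filters classify
# each block as one of five edge types (vertical, horizontal, 45, 135,
# non-directional), giving a 4*4*5 = 80-bin descriptor.
import numpy as np

EDGE_FILTERS = np.array([
    [1, -1, 1, -1],                      # vertical
    [1, 1, -1, -1],                      # horizontal
    [np.sqrt(2), 0, 0, -np.sqrt(2)],     # 45-degree diagonal
    [0, np.sqrt(2), -np.sqrt(2), 0],     # 135-degree diagonal
    [2, -2, -2, 2],                      # non-directional
])

def edge_histogram(gray, block=8, threshold=11.0):
    h, w = gray.shape
    hist = np.zeros((4, 4, 5))
    half = block // 2
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = gray[y:y + block, x:x + block].astype(float)
            means = np.array([patch[:half, :half].mean(), patch[:half, half:].mean(),
                              patch[half:, :half].mean(), patch[half:, half:].mean()])
            strengths = np.abs(EDGE_FILTERS @ means)
            if strengths.max() >= threshold:
                sub_y, sub_x = min(4 * y // h, 3), min(4 * x // w, 3)
                hist[sub_y, sub_x, strengths.argmax()] += 1
    return hist.reshape(-1)              # 80-dimensional edge directional histogram
```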

D. Color Feature

Color descriptors are suitable for representing local regions where a small number of colors is enough to characterize the color information of the regions of interest. Six color features are considered in our system: (1) RGB color moments (9 dim); (2) HSV color auto-correlogram (512 dim); (3) HSV color histogram (256 dim); (4) HSV group-of-frames histogram (256 dim); (5) block-based RGB histogram (576 dim); (6) average brightness (1 dim). Since these features are all related to color and brightness, we fuse them in the feature space, i.e., we concatenate them to form one feature vector.
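The sketch below illustrates this early fusion: each descriptor is computed separately and the vectors are simply concatenated. Only three of the six descriptors are shown (the others are appended in the same way), and the quantization choices beyond the stated dimensions are our assumptions.

```python
# Hedged sketch of the concatenation ("early fusion") of the color features.
import cv2
import numpy as np

def rgb_color_moments(bgr):
    """Mean, standard deviation and skewness per channel (9 dim)."""
    x = bgr.reshape(-1, 3).astype(float)
    mean, std = x.mean(axis=0), x.std(axis=0)
    skew = np.cbrt(((x - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])

def hsv_histogram(bgr, bins=(16, 4, 4)):
    """Joint HSV histogram with 16x4x4 quantization (256 dim)."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, None).flatten()

def average_brightness(bgr):
    """Mean of the V channel (1 dim)."""
    return np.array([cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)[:, :, 2].mean()])

def color_feature(bgr):
    """Concatenate the color descriptors into one feature vector."""
    return np.concatenate([rgb_color_moments(bgr), hsv_histogram(bgr),
                           average_brightness(bgr)])
```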

Figure 2. The framework of semantic concept detection

For each set of annotations, we adopt four basic SVM (support vector machine) classifiers based on the SIFT, Gabor, edge and color features respectively, and then fuse the results of the four basic classifiers to make the decision. Two fusion steps are shown in Fig. 2. The first is the one mentioned above, which fuses the output probabilities of the four basic classifiers of each concept, weighted by average precision (AP), within one type of annotation; the second fuses the output probabilities obtained with different types of annotations and is discussed in section V.

In the first fusion step, the output probabilities of the four basic classifiers (SIFT, Gabor, edge and color) are fused according to the following scheme:


$$prob_{apweight}(i,j)=\frac{prob^{ann\,j}_{sift}(i)\,apw_{sift}(i)+prob^{ann\,j}_{gabor}(i)\,apw_{gabor}(i)+prob^{ann\,j}_{edge}(i)\,apw_{edge}(i)+prob^{ann\,j}_{color}(i)\,apw_{color}(i)}{apw_{sift}(i)+apw_{gabor}(i)+apw_{edge}(i)+apw_{color}(i)} \quad (2)$$

where $i\in\{1,2,\ldots,20\}$ is the concept index and $j\in\{I,II,III\}$ is the annotation index. $prob^{ann\,j}_{sift}(i)$, $prob^{ann\,j}_{gabor}(i)$, $prob^{ann\,j}_{edge}(i)$ and $prob^{ann\,j}_{color}(i)$ are the output probabilities of the four classifiers of concept $i$ with annotation $j$. Equation (2) is a linear weighted fusion in which the weights are the average precision weights (Apweight); the average precision of each classifier is obtained by testing it on a validation corpus.
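For one concept and one test shot, this fusion can be implemented directly; the dictionary-based interface below is ours.

```python
# Direct implementation of the Apweight fusion in equation (2) for one concept
# and one test shot.
FEATURES = ('sift', 'gabor', 'edge', 'color')

def apweight_fusion(prob, apw):
    """prob[k]: output probability of basic classifier k for this concept/shot;
    apw[k]: its average precision measured on the validation corpus."""
    numerator = sum(prob[k] * apw[k] for k in FEATURES)
    denominator = sum(apw[k] for k in FEATURES)
    return numerator / denominator if denominator > 0 else 0.0
```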

IV. INFLUENCE OF REGION MISMATCH ON REGION-RELATED CONCEPT DETECTION

We show the influence of region mismatch on region-related concept detection by experiments on the TRECVID dataset.

A. Experiment Setup

We partition the TRECVID video collection into three corpora: I) the training corpus, which consists of the TRECVID-2007 development dataset; II) the validation corpus, which consists of one half of the TRECVID-2007 test dataset; and III) the testing corpus, which contains the other half of the TRECVID-2007 test dataset.

We name the training/validation/test dataset “Training/Validation/Test Dataset j”, j = I, II, III, when it is annotated by annotation method j (see section II). For concepts whose number of positive samples is small, we add pictures collected from the Internet. Table I gives the number of positive and negative samples in each dataset. In addition, in order to save computation time and to make a fair comparison, all samples in all datasets are down-sampled to the size of 88*72.

TABLE I. THE NUMBER OF POSITIVE AND NEGATIVE SAMPLES IN EACH DATASET

We choose 50 positive samples per concept from the TRECVID-2007 development corpus to generate the SIFT vocabulary. Although there are only 1,000 key frames, they yield about 270,000 SIFT points. With k-means, these points are quantized into 1,000 clusters, and each cluster represents a visual keyword. Then SIFT points are extracted from every key frame of a video sequence, and each point is assigned to its nearest cluster. The number of occurrences of each visual word is recorded in a histogram.

It has been shown in [1, 2, 13, 14] that SVMs (support vector machines) give good performance, so we use LibSVM [21] to train the four SVM classifiers with an RBF kernel for each concept.
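As an illustration, one per-concept, per-feature classifier could be trained as follows. We use scikit-learn's SVC (which wraps LIBSVM) rather than the LibSVM command-line tools, and the hyper-parameters are illustrative, not the paper's.

```python
# Hedged sketch of training one per-concept, per-feature SVM with an RBF kernel.
from sklearn.svm import SVC

def train_concept_classifier(features, labels, C=1.0, gamma='scale'):
    """features: (n_samples, dim) array for one feature type (e.g. Gabor);
    labels: 1 for positive samples of the concept, 0 for negative samples."""
    clf = SVC(kernel='rbf', C=C, gamma=gamma, probability=True)
    clf.fit(features, labels)
    return clf

def concept_probabilities(clf, features):
    """Output probability of the positive class for each test key frame."""
    positive_column = list(clf.classes_).index(1)
    return clf.predict_proba(features)[:, positive_column]
```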

The evaluation criteria in this paper are average precision (AP) and mean average precision (MAP). The AP for each concept is given in the following tables, and the MAP of each scheme is given in the last row of each table.
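For reference, a simplified (non-interpolated) AP over a ranked list can be computed as below; TRECVID's official scoring truncates the ranked list and uses its own tooling, so treat this as a sketch.

```python
# Simplified average precision over a ranked list; MAP is the mean of AP over
# all concepts.
import numpy as np

def average_precision(scores, relevance):
    """scores: detector outputs per shot; relevance: 1 if the shot contains the concept."""
    order = np.argsort(scores)[::-1]
    rel = np.asarray(relevance)[order]
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float(precision_at_k[rel == 1].mean())

def mean_average_precision(aps_per_concept):
    return float(np.mean(aps_per_concept))
```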

B. Experiment 1

In this experiment, we train all classifiers with “Training Dataset I” and evaluate them on “Test Dataset I” and “Test Dataset II” respectively. The performance in the two cases is shown in Table II: Table II(a) and Table II(b) give the performance of classifiers trained on “Training Dataset I” and evaluated on “Test Dataset I” and “Test Dataset II”, respectively.

TABLE II. THE PERFORMANCE OF CLASSIFIERS, TRAINED ON “TRAINING DATASET I” AND EVALUATED ON “TEST DATASET I AND II”.

As SIFT features are extracted from whole frames in all video sequences, the APs of SIFT for all concepts are the same in Table II(a) and Table II(b). In Table II(a), the Gabor classifier, whose MAP is 0.143719, is the best among the four basic classifiers, but the MAP of Apweight, which reaches 0.217377405, outperforms it by 51.25%. The same behavior can be observed in Table II(b), where the MAP of Apweight outperforms the MAP of Gabor by 24.599%. These facts demonstrate that the simple fusion scheme (the first fusion in Fig. 2) is effective and can improve the performance of the basic classifiers significantly.

Furthermore, it should be noted that the features in Table II(a) are extracted from whole frames at both the training and testing stages, so the regions are matched for region-related concepts. In Table II(b), however, the regions from which the features are extracted at training and testing are not matched. From Table II, we can observe that the APs of scene-related concepts, such as “classroom”, “kitchen” and “cityscape”, are the same in both tables, since there is no region-mismatch problem for this kind of concept, whereas the APs of 9 out of 14 region-related concepts are higher in Table II(a) than in Table II(b). Consequently, the MAP of Apweight in Table II(a) is 21.39% higher than the MAP of Apweight in Table II(b).

C. Experiment 2

In this experiment, we train all classifiers with “Training Dataset III” and evaluate them on “Test Dataset I” and “Test Dataset II” respectively. Table III gives the corresponding results: Table III(a) and Table III(b) give the performance of classifiers trained on “Training Dataset III” and evaluated on “Test Dataset I” and “Test Dataset II”, respectively.

TABLE III. THE PERFORMANCE OF CLASSIFIERS, TRAINED ON “TRAINING DATASET III” AND EVALUATED ON “TEST DATASET I AND II”.

In this experiment, since samples with relatively large regions are used at the training stage, we can expect the performance in Table III(a) to be better than that in Table III(b) for concepts of large size, such as “bridge” and “street”. On the contrary, the APs of “telephone” and “flower” in Table III(a) are lower than those in Table III(b), because these concepts have small sizes that differ significantly from the size of the whole frame used in Test Dataset I. Overall, however, the MAP of Apweight in Table III(a) outperforms the MAP of Apweight in Table III(b) by 5.8434%.

D. Experiment 3

In this experiment, we evaluate classifiers trained on “Training Dataset II” and tested on “Test Dataset I” and “Test Dataset II” respectively; the detailed performance is shown in Table IV. Table IV(a) and Table IV(b) give the performance of classifiers trained on “Training Dataset II” and evaluated on “Test Dataset I” and “Test Dataset II”, respectively.

TABLE IV. THE PERFORMANCE OF CLASSIFIERS, TRAINED ON “TRAINING DATASET II” AND EVALUATED ON “TEST DATASET I AND II”.

Since the regions in “Training Dataset II” match those in “Test Dataset II” better than those in “Test Dataset I”, the MAP of Apweight in Table IV(b) is better than that in Table IV(a), and the improvement reaches 27.94%. 10 out of 14 region-related concepts in Table IV(b) have higher APs than in Table IV(a), but 4 out of 14 region-related concepts still show no improvement in Table IV(b). The reason may be that quite a number of pictures collected from the Internet were added to the training dataset for these region-related concepts, such as “dog” and “bus”, and the sizes of these pictures are better matched to the whole frame.

From the above analysis, we can see that if the regions of region-related concepts are accurately defined and matched between training and testing, the performance of the classifiers is good. However, as mentioned before, it is difficult to automatically obtain an accurate region of a region-related concept in a test image, and the whole frame is commonly used at testing instead. In such circumstances, we may consider fusing the output probabilities of classifiers trained with different kinds of annotations, since each kind of annotation has its own advantages, such as matched regions, well-defined regions, or useful surrounding information.

V. FUSION WITH DIFFERENT TYPES OF ANNOTATIONS

Based on the above experiments and analysis, we propose to fuse the output probabilities of classifiers trained with different kinds of annotations, for example Annotations I and III, as shown in the second fusion in Fig. 2. The detailed operations are expressed by (3)~(8),

where $i\in\{1,2,\ldots,20\}$ is the concept index; $prob^{ann\,I}_{gabor}(i)$, $prob^{ann\,I}_{edge}(i)$ and $prob^{ann\,I}_{color}(i)$ are the output probabilities of the basic classifiers of concept $i$ under annotation I; $prob^{ann\,III}_{sift}(i)$, $prob^{ann\,III}_{gabor}(i)$, $prob^{ann\,III}_{edge}(i)$ and $prob^{ann\,III}_{color}(i)$ are the output probabilities of the basic classifiers of concept $i$ under annotation III; $prob^{new}_{sift}(i)$, $prob^{new}_{gabor}(i)$, $prob^{new}_{edge}(i)$ and $prob^{new}_{color}(i)$ are the new probabilities of the basic classifiers of concept $i$, obtained by the fusions in (3)~(5); $apw^{new}_{sift}(i)$, $apw^{new}_{gabor}(i)$, $apw^{new}_{edge}(i)$ and $apw^{new}_{color}(i)$ are calculated on the validation dataset; and $prob^{new}_{apweight}(i)$ is obtained by the fusion in (6). At the same time, $mapw^{ann\,III}_{apweight}$ and $mapw^{new}_{apweight}$ are also calculated on the validation dataset. Finally, $prob^{last}_{map}(i)$ is computed by (7).

In order to evaluate the validity of the APMAP fusion scheme expressed by the above equations, the output probabilities of classifiers trained on “Training Dataset I” and “Training Dataset III” and tested on “Test Dataset I” are fused in this experiment on the TRECVID dataset. The performance is shown in Table V.
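As a rough illustration of this two-stage fusion (not the paper's exact equations (3)~(8)), the sketch below assumes that (3)~(5) merge each feature's probabilities from the two annotations with their validation APs as weights, that (6) re-applies the Apweight fusion of equation (2) to the merged probabilities, and that (7) mixes the annotation-III score with the merged score using the validation MAPs as weights. Names and weighting choices are ours.

```python
# Hedged, illustrative sketch of the two-stage APMAP fusion for one concept
# and one test shot; the exact forms of (3)~(8) follow the paper.
FEATURES = ('sift', 'gabor', 'edge', 'color')

def apmap_fusion(prob_I, prob_III, apw_I, apw_III, apw_new,
                 prob_apweight_III, mapw_III, mapw_new):
    # assumed (3)~(5): AP-weighted merge of the two annotations, per feature type
    prob_new = {k: (prob_I[k] * apw_I[k] + prob_III[k] * apw_III[k])
                   / (apw_I[k] + apw_III[k]) for k in FEATURES}
    # assumed (6): Apweight fusion (equation (2)) of the merged probabilities
    prob_apweight_new = (sum(prob_new[k] * apw_new[k] for k in FEATURES)
                         / sum(apw_new[k] for k in FEATURES))
    # assumed (7): MAP-weighted combination of the annotation-III and merged scores
    return (prob_apweight_III * mapw_III + prob_apweight_new * mapw_new) \
           / (mapw_III + mapw_new)
```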

TABLE V. THE PERFORMANCE OF FUSION SCHEME

From the above table, we find that the MAPs of the basic classifiers in Table V are better than those in Table II(a) and Table III(a). In addition, the MAP of APMAP in Table V is 0.2307618, while the MAP of Apweight in Table II(a) is 0.217377405, an improvement of 6.1572%. Furthermore, leaving aside the 7 scene-related concepts, 10 out of 13 region-related concepts in Table V show performance improvements. The MAP of APMAP in Table V also outperforms the MAP of Apweight in Table III(a), which is 0.20456308, by 12.807%, and 11 out of 13 region-related concepts show performance improvements. These results show that the proposed APMAP fusion scheme is effective.

VI. CONCLUSION

A great number of region-related concept detection algorithms have been proposed so far, but few of them consider the problem of mismatched regions at training and testing. In this paper, we concentrated on this problem and showed by experiments that the annotation method affects the performance of region-related concept detection. We also proposed a fusion scheme based on different types of annotations. Experiments show that the proposed scheme is effective and stable.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China (60772114).

REFERENCES

[1] Xiangyang Xue, Hong Lu, Hui Yu, et al., “Fudan University at TRECVID 2006,” in Proceedings of the TRECVID 2006 Workshop.
[2] Xiangyang Xue, Hong Lu, Hui Yu, et al., “Fudan University at TRECVID 2007,” in Proceedings of the TRECVID 2007 Workshop.
[3] Zheng-Jun Zha, Tao Mei, Zengfu Wang, Xian-Sheng Hua, “Building a comprehensive ontology to refine video concept detection,” in Proceedings of the International Workshop on Multimedia Information Retrieval, Sep. 2007, pp. 227-236.
[4] M. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, “Large-scale concept ontology for multimedia,” IEEE MultiMedia, 13(3):86-91, 2006.
[5] C. G. M. Snoek, M. Worring, J. C. van Gemert, J.-M. Geusebroek, and A. W. M. Smeulders, “The challenge problem for automated detection of 101 semantic concepts in multimedia,” in MULTIMEDIA ’06: Proceedings of the 14th Annual ACM International Conference on Multimedia, pp. 421-430, New York, NY, USA, 2006. ACM Press.
[6] M. Bertini, A. D. Bimbo, and C. Torniai, “Automatic video annotation using ontologies extended with visual information,” in MULTIMEDIA ’05: Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 395-398, New York, NY, USA, 2005. ACM Press.
[7] H. Luo and J. Fan, “Building concept ontology for medical video annotation,” in MULTIMEDIA ’06: Proceedings of the 14th Annual ACM International Conference on Multimedia, pp. 57-60, New York, NY, USA, 2006. ACM Press.
[8] Shile Zhang, Jianping Fan, Hong Lu, Xiangyang Xue, “Salient object detection on large-scale video data,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’07), 17-22 June 2007, pp. 1-6.
[9] Y. Deng, B. S. Manjunath, “Unsupervised segmentation of color-texture regions in images and video,” IEEE Trans. PAMI, vol. 23, no. 8, pp. 800-810, 2001.
[10] Koen E. A. van de Sande, Theo Gevers, Cees G. M. Snoek, “A comparison of color features for visual concept classification,” in Proceedings of the 2008 International Conference on Content-Based Image and Video Retrieval, Jul. 2008.
[11] Jun Yang, Yu-Gang Jiang, Alexander G. Hauptmann, Chong-Wah Ngo, “Evaluating bag-of-visual-words representations in scene classification,” in Proceedings of the International Workshop on Multimedia Information Retrieval, Sep. 2007.
[12] Yu-Gang Jiang, Chong-Wah Ngo, Jun Yang, “Towards optimal bag-of-features for object categorization and semantic video retrieval,” in Proceedings of the 6th ACM International Conference on Image and Video Retrieval, Jul. 2007.
[13] J. Cao, et al., “Tsinghua University at TRECVID 2006,” in Proceedings of the TRECVID 2006 Workshop.
[14] Jinhui Yuan, Zhishan Guo, Li Lv, et al., “THU and ICRC at TRECVID 2007,” in Proceedings of the TRECVID 2007 Workshop.
[15] Duy-Dinh Le, Shin’ichi Satoh, Tomoko Matsui, “NII-ISM, Japan at TRECVID 2007: High Level Feature Extraction,” in Proceedings of the TRECVID 2007 Workshop.
[16] Sheng Tang, Yong-Dong Zhang, Jin-Tao Li, et al., “TRECVID 2007 High-Level Feature Extraction by MCG-ICT-CAS,” in Proceedings of the TRECVID 2007 Workshop.
[17] MCG-ICT-CAS, “Annotation of TRECVID 2008 Development Key Frames.”
[18] IBM VideoAnnEx MPEG-7 Video Annotation Tool, http://www.research.ibm.com/VideoAnnEx/
[19] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, 60(2):91-110, 2004.
[20] C. Liu and H. Wechsler, “A Gabor feature classifier for face recognition,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2, 2001, pp. 270-275.
[21] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
