

A New Framework for High-level Feature Extraction

Zan Gao

School of Information and Telecommunication Engineering

BUPT Beijing, P.R. China

[email protected]

Xiaoming Nan, Tao Liu, Zhicheng Zhao

School of Information and Telecommunication Engineering

BUPT Beijing, P.R. China

Anni Cai

School of Information and Telecommunication Engineering

BUPT Beijing, P.R. China

[email protected]

Abstract—A new framework for high-level feature extraction (or semantic concept detection) is proposed. In this system, features at different granularities are extracted, four classifiers with complementary features are employed for each concept, and their results are then fused. We evaluated 18 fusion schemes and chose the best one for each concept to form the final results. Experiments on the auto-test corpus and the TRECVID-2008 corpus show that the proposed system is effective and stable.

Index Terms—high-level feature extraction, semantic concept detection, TRECVID, video analysis

I. INTRODUCTION

High-level feature extraction (or semantic concept detection) from natural images and video sequences is a fundamental step in machine vision, video understanding and content-based video retrieval. A great many high-level feature extraction algorithms have been developed to date. In [1, 2], the authors trained 110 classifiers and chose the top 50 classifiers for each concept. MediaMill [3] extracted image features at the block, key-point and segmentation levels, and then trained a set of 572 concept detectors for 39 concepts. These methods are time-consuming in the training process, since many classifiers have to be trained for each concept. On the other hand, various low-level features are employed to detect high-level concepts in existing work. Xue et al. [4] focused on texture features and captured both the regional and the global characteristics of key frames. Le et al. [5] compared several features and concluded that color features are more effective than edge features and Gabor features. Tang et al. [6] considered visual, audio and motion features. Philbin et al. [7] considered only bag-of-words, face detection and pedestrian detection.

These algorithms share three main problems. First, how should suitable and stable low-level features be selected? Second, how many classifiers should be employed? Third, how should the results of different classifiers be fused?

We address these problems in this paper as follows.

(1) Features at different granularities: we consider visual features only, but we select features at different granularities, such as point features, block features, global features and group-of-frame features. These features reflect different characteristics of the same frame and are complementary.

(2) The framework of high-level feature extraction: in order to balance performance, framework complexity and computational time, only four classifiers are employed for each concept.

(3) Fusion scheme: since these features are complementary, we fuse the results of the different classifiers for each concept. We evaluate 18 fusion schemes and choose the best one for each concept.

The rest of the paper is organized as follows. Section II introduces the features with different granularities. Section III describes the system framework of high-level feature extraction. Section IV presents the fusion methods used, Section V gives the experimental results, and Section VI concludes the paper.

II. FEATURE SELECTION

Since no single feature can capture all the information contained in an image, and no single feature is suitable for all concepts, we extract a number of features at different granularities. For example, global features can represent scene-related concepts well, whereas local and point features are better suited to describing region-related concepts. In addition, to increase the stability of the features, we also extract features from groups of frames and from different color spaces. The features used in this paper are described as follows:

A. Scale Invariant Feature Transform (SIFT)

It has been shown in [5, 6, 7, 8, 9] that SIFT performs well in video analysis, and it is chosen as the feature at the finest granularity in our system. At the training stage, we build a visual vocabulary from the SIFT points detected with the Difference of Gaussian (DoG) detector [10] in the Y channel of the key frames of the training sequences. SIFT points are then extracted from every key frame of the test sequences, and every point in the frame is assigned to its nearest visual word. The number of occurrences of each visual word in a frame is recorded in a histogram that represents the frame. This feature is invariant to the scale and orientation of the image.
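As an illustration of this bag-of-visual-words representation, a minimal sketch follows, assuming OpenCV and NumPy and a precomputed vocabulary vocab (built as described in Section V); the function name and its parameters are hypothetical, not the exact implementation used in the paper.

```python
import cv2
import numpy as np

def sift_bow_histogram(gray_frame, vocab):
    """Represent one key frame as a histogram over a visual vocabulary.

    gray_frame : 2-D uint8 array (the Y channel of the key frame)
    vocab      : (K, 128) array of visual words (cluster centres)
    """
    sift = cv2.SIFT_create()                     # DoG detector + SIFT descriptor
    _, desc = sift.detectAndCompute(gray_frame, None)
    if desc is None:                             # frame with no detected points
        return np.zeros(len(vocab), dtype=np.float32)
    # assign every SIFT point to its nearest visual word
    dists = np.linalg.norm(desc[:, None, :] - vocab[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(vocab)).astype(np.float32)
    return hist / hist.sum()                     # normalised occurrence counts
```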

B. Gabor Wavelet

The Gabor wavelet kernel can be defined as follows [11]:

\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2} \exp\!\left(-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}\right) \left[\exp(i\, k_{\mu,\nu} \cdot z) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right]    (1)

where μ and ν define the orientation and scale of the Gabor kernels respectively, z = (x, y), ‖·‖ denotes the norm operator, and the wave vector k_{μ,ν} is defined as follows:



k_{\mu,\nu} = k_\nu e^{i\Phi_\mu}    (2)

where k_ν = k_max / f^ν, Φ_μ = πμ/8, k_max is the maximum frequency, and f is the spacing factor between kernels in the frequency domain. In our system, we choose ν ∈ {0, 2, 4}, μ ∈ {0, 1, ..., 5}, σ = 2.5π, k_max = π/2 and f = √2.

Let I(x, y) be the gray-level distribution of an image. The convolution of the image I with a Gabor kernel ψ_{μ,ν} is calculated as

O_{\mu,\nu}(z) = I(z) \otimes \psi_{\mu,\nu}(z)    (3)

where z = (x, y) and ⊗ denotes the convolution operator. O_{μ,ν}(z) is divided into 3*3 blocks, and the mean and standard deviation of each block are then concatenated to form one Gabor feature vector.
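The Gabor feature computation described above can be sketched as follows with NumPy and SciPy; the 31*31 kernel window and the use of the magnitude of the filter response are our assumptions, while the scales, orientations and the values of σ, k_max and f follow the parameters listed above.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(mu, nu, sigma=2.5*np.pi, k_max=np.pi/2, f=np.sqrt(2), size=31):
    """Gabor kernel psi_{mu,nu} of Eq. (1) sampled on a size x size grid."""
    k = (k_max / f**nu) * np.exp(1j * np.pi * mu / 8)      # wave vector, Eq. (2)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    z2 = x**2 + y**2
    kz = k.real * x + k.imag * y                           # k . z
    return (abs(k)**2 / sigma**2) * np.exp(-abs(k)**2 * z2 / (2 * sigma**2)) \
           * (np.exp(1j * kz) - np.exp(-sigma**2 / 2))

def gabor_block_features(gray, blocks=3):
    """Mean and std of |I * psi_{mu,nu}| over a blocks x blocks grid."""
    feats = []
    for nu in (0, 2, 4):                                   # 3 scales
        for mu in range(6):                                # 6 orientations
            mag = np.abs(fftconvolve(gray, gabor_kernel(mu, nu), mode='same'))
            h, w = mag.shape
            for by in range(blocks):
                for bx in range(blocks):
                    block = mag[by*h//blocks:(by+1)*h//blocks,
                                bx*w//blocks:(bx+1)*w//blocks]
                    feats += [block.mean(), block.std()]
    return np.asarray(feats)        # 3 * 6 * 9 blocks * 2 stats = 324 dimensions
```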

C. Edge Directional Histogram

In MPEG-7, the edge histogram descriptor is recommended to represent the spatial distribution of five types of edges, namely four directional edges and one non-directional edge. Since edges play an important role in image perception, this descriptor may offer semantic clues for images, especially for natural images with non-uniform edge distributions.
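The paper uses the MPEG-7 edge histogram descriptor, whose block-filter implementation is not reproduced in the text. The following is a simplified, gradient-based sketch that merely bins strong edge pixels into the same five categories (0°, 45°, 90°, 135° and non-directional); it is an illustrative approximation, not the MPEG-7 descriptor itself.

```python
import cv2
import numpy as np

def edge_direction_histogram(gray, mag_thresh=30):
    """Simplified 5-bin edge histogram: 0, 45, 90, 135 degrees + non-directional."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = np.hypot(gx, gy)
    ang = (np.degrees(np.arctan2(gy, gx)) + 180.0) % 180.0   # fold to [0, 180)
    strong = mag > mag_thresh                                # ignore flat regions
    hist = np.zeros(5, dtype=np.float32)
    # 30-degree windows centred on 0, 45, 90, 135; the rest is non-directional
    for i, centre in enumerate((0.0, 45.0, 90.0, 135.0)):
        diff = np.minimum(np.abs(ang - centre), 180.0 - np.abs(ang - centre))
        hist[i] = np.count_nonzero(strong & (diff < 15.0))
    hist[4] = strong.sum() - hist[:4].sum()                  # non-directional edges
    return hist / max(hist.sum(), 1.0)
```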

D. Color Feature

It has been shown in the literature that in many cases regions can be characterized by a small number of colors, for example regions of national flags or color trademarks. In addition, different color spaces embody complementary information: the RGB color space directly reflects the light intensities of the primary colors, while the HSV space best matches human color perception. Therefore, we choose a number of color features from different color spaces.

• RGB Color Moment (9 dim) --- in the RGB color space, we compute the first, second and third moments of each color channel, obtaining a feature vector with 9 dimensions.

• HSV Color Auto-Correlogram (512 dim) [12, 13, 14] --- the H, S and V components are first quantized into 16, 4 and 4 levels respectively. We then compute the auto-correlogram only at distances 1 and 3 to save computational time.

• HSV Color Histogram (256 dim) --- the H, S and V components are first quantized into 16, 4 and 4 levels respectively, and a 256-dimension histogram is then formed to represent the color distribution of a frame.

• HSV Histogram for Group of Pictures (256 dim) --- the H, S and V components are first quantized into 16, 4 and 4 levels respectively, and the HSV color histogram is calculated for every frame in a group of key frames. Finally, the average histogram, obtained by averaging the values of each bin across all histograms in the group of pictures, is taken as the feature. A similar descriptor is also recommended in MPEG-7.

• Block RGB Histogram (576 dim) --- we divide a key frame into 3*3 blocks. For each block, an RGB histogram is calculated in which only the two most significant bits of each R, G and B channel are retained (64 bins per block). The color quantization reduces computational time, while the block histogram takes some spatial information into account.

• Average Brightness (1 dim) --- the average brightness of a frame is important for some concepts such as nighttime, indoor and outdoor.

Since the above features are all related to color and brightness, we fuse them in the feature space, i.e., we concatenate them into one feature vector.
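A condensed sketch of how several of these descriptors can be computed and concatenated is given below (OpenCV/NumPy). The auto-correlogram, the group-of-pictures histogram and the block RGB histogram are omitted for brevity, and the quantization ranges are our assumptions.

```python
import cv2
import numpy as np

def color_feature_vector(bgr):
    """Concatenate some of the colour descriptors described above:
    RGB moments (9), 16x4x4 HSV histogram (256) and average brightness (1)."""
    # RGB colour moments: mean, standard deviation and third (central) moment
    chans = bgr.reshape(-1, 3).astype(np.float64)
    mean = chans.mean(axis=0)
    std = chans.std(axis=0)
    third = np.cbrt(((chans - mean) ** 3).mean(axis=0))
    moments = np.concatenate([mean, std, third])

    # HSV histogram: 16 x 4 x 4 levels for H, S and V
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hsv_hist = cv2.calcHist([hsv], [0, 1, 2], None, [16, 4, 4],
                            [0, 180, 0, 256, 0, 256]).flatten()
    hsv_hist /= max(hsv_hist.sum(), 1)

    # average brightness of the frame
    brightness = np.array([cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).mean()])

    return np.concatenate([moments, hsv_hist, brightness])   # 266 dims here
```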

III. THE FRAMEWORK OF THE HIGH-LEVEL FEATURE EXTRACTION

Classifiers are commonly used in high-level feature extraction. Some existing works employ one general classifier [15, 16, 17] to identify all concepts. However, this approach requires fusing various low-level features with different granularities, different characteristics and different units in the feature space, which is a difficult task. Another category of existing works [1, 2, 6] uses a large number of classifiers, for example 100, one for each low-level feature, to identify one concept. This approach is time-consuming at the training stage and requires complicated fusion strategies at the decision level.

Figure 1. The framework of high-level feature extraction

In order to balance performance and computational time, we adopt four SVM (support vector machine) classifiers for each concept, based on the SIFT, Gabor, Edge and Color features respectively, and then fuse the results of the four classifiers to make the decision. The proposed framework is shown in Fig. 1. In the figure, the dashed-line block depicts the training process, and the solid-line block shows the evaluation process, where for the auto-test dataset the inputs to the classifiers are key frames, while for the final test dataset the inputs are shots. At the evaluation stage, if the inputs are shots, we select three key frames from each shot, and in the



second fusion model we take the decision with the maximal probability among the three key frames.
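A tiny sketch of this second, shot-level fusion step, assuming a list of per-key-frame probabilities for one concept (the averaging variant used by our second run is included as well):

```python
def shot_score(keyframe_probs, mode="max"):
    """Second fusion model: combine the probabilities of a shot's key frames.

    keyframe_probs : per-key-frame probabilities for one concept (here, three)
    mode           : "max" keeps the most confident key frame; "avg" averages
    """
    if mode == "max":
        return max(keyframe_probs)
    return sum(keyframe_probs) / len(keyframe_probs)
```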

IV. FUSION

Fusion can be performed with non-heuristic methods or heuristic methods. Non-heuristic methods, including Max, Vote and Average, do not need training, while heuristic methods, including AdaBoost and linear weighted fusion, do. However, no single strategy yields performance improvements for all concepts; a method that is good for some concepts may be bad for others. For example, in [18] fusion improves performance for 15 concepts but degrades it for the other 11 concepts. The fusion strategy used in [19] achieves performance improvements for 21 concepts but worse results for the other 5 concepts. In [20], improvements are obtained on 29 out of 39 concepts. In this work, we try a number of fusion schemes and choose the best one for each concept.

In the framework, the outputs of the four basic classifiers for each concept --- SIFT, Gabor, Edge and Color, are fused according to (4), (5), (6), (7), (8) and (9).

where i ∈ {1, 2, ..., 20} is the concept index, and prob_sift(i), prob_gabor(i), prob_edge(i) and prob_color(i) are the output probabilities of the four classifiers for concept i. Equations (8) and (9) represent linear weighted fusion, where the pw's in (8) are precision weights (Pweight) and the apw's in (9) are average-precision weights (Apweight). The precision and average precision are obtained by testing the classifiers on a validation dataset. In order to account for the performance differences among the classifiers, we scale Pweight and Apweight up or down with different parameters. The zoom function [21] is as follows.

where C is the zoom scale, C ∈ {2, 3, 4, 6, 8, 10}. We magnify Pweight and Apweight by (10) and (11) to obtain the new weights, and perform fusion by (12) for each concept again. In total, we have 18 fusion schemes, and we choose the best result among all fusion schemes for each concept.
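Equations (4)-(12) are not reproduced here, so the sketch below only illustrates the kinds of schemes described in the text: Max, Average and Vote fusion, plus precision- or average-precision-weighted linear fusion with an optional zoom scale C. The vote threshold, the weight normalization and the power form of the zoom are our assumptions, not the paper's exact formulas.

```python
import numpy as np

def fuse(probs, scheme="average", weights=None, C=None):
    """Fuse the four per-concept classifier outputs (SIFT, Gabor, Edge, Colour).

    probs   : four probabilities for one concept
    weights : Pweight or Apweight values for the linear weighted schemes
    C       : zoom scale in {2, 3, 4, 6, 8, 10}; when given, the weights are
              sharpened before fusing (a guess at the zoom function of [21])
    """
    probs = np.asarray(probs, dtype=float)
    if scheme == "max":
        return probs.max()
    if scheme == "vote":                       # majority vote on thresholded outputs
        return float((probs > 0.5).sum() >= 2)
    if scheme == "average":
        return probs.mean()
    w = np.asarray(weights, dtype=float)
    if C is not None:                          # exaggerate the weight differences
        w = w ** C
    w = w / w.sum()
    return float(w @ probs)                    # linear weighted fusion
```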

V. EXPERIMENTAL RESULTS

In order to evaluate the performance of the proposed system, we conduct experiments on TRECVID (Text Retrieval Conference Video Retrieval Evaluation) datasets. NIST (the National Institute of Standards and Technology) holds TRECVID every year to promote progress in content-based video retrieval. We partition the TRECVID datasets into four corpora: I) the development corpus, which includes the TRECVID-2007 development data; II) the validation corpus, which includes one half of the TRECVID-2007 test data; III) the auto-test corpus, which contains the other half of the TRECVID-2007 test data; and IV) the test corpus, which covers all the TRECVID-2008 test data, given in shots. The development corpus is used to train the classifiers for each concept. The validation corpus is used to obtain the fusion weights and to find the best fusion scheme for each concept. Finally, the auto-test and test corpora are used to evaluate the performance of the system. For annotation, we annotated the TRECVID-2007 development corpus and the TRECVID-2007 test corpus with the IBM MPEG-7 Annotation Tool v.1.5.1 [22]. Table I gives the numbers of positive and negative samples in each dataset. In



the test corpus, there are 42,461 shots, but the number of samples for each concept is not released. In addition, to save computational time, all samples are down-sampled to a size of 88*72, and all classifiers are trained with LibSVM [23] using the RBF kernel.
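A hedged sketch of this training step is shown below using scikit-learn's SVC, which wraps LIBSVM, rather than the LIBSVM command-line tools cited in the paper; the hyper-parameter values are placeholders, not the ones used by the authors.

```python
from sklearn.svm import SVC

def train_concept_classifier(features, labels):
    """Train one RBF-kernel SVM for one concept on one feature type.

    features : (n_samples, n_dims) low-level feature matrix
    labels   : (n_samples,) binary concept labels (1 = concept present)
    """
    clf = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
    clf.fit(features, labels)
    return clf

# One classifier per (concept, feature) pair, i.e. four classifiers per concept;
# at test time: prob = clf.predict_proba(test_features)[:, 1]
```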

TABLE I. THE NUMBER OF SAMPLES IN EACH DATASET

TABLE II. THE PERFORMANCE OF THE BASIC CLASSIFIERS AND DIFFERENT FUSION SCHEMES

We choose 50 positive samples from the TRECVID-2007 development corpus for every concept to generate the SIFT vocabulary. Although there are only 1,000 key frames, there are about 270,000 SIFT points. With K-means, these points are quantized into 1,000 clusters, and each cluster represents a visual keyword. At test time, SIFT points are extracted from every key frame, and every point is assigned to its nearest cluster. Then the number of occurrences of each visual word in the frame is recorded in a histogram.
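The vocabulary-building step can be sketched as follows with scikit-learn's MiniBatchKMeans; all_descriptors stands for the pooled ~270,000 x 128 matrix of SIFT descriptors and is an assumed variable name.

```python
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(all_descriptors, n_words=1000, seed=0):
    """Quantize pooled SIFT descriptors into a 1000-word visual vocabulary."""
    km = MiniBatchKMeans(n_clusters=n_words, batch_size=4096, random_state=seed)
    km.fit(all_descriptors)
    return km.cluster_centers_            # (1000, 128) visual words

# Each test-frame SIFT point is then assigned to its nearest centre, and the
# per-frame word counts form the histogram (see the sketch in Section II.A).
```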

First, we evaluate the system on the auto-test corpus. Table II gives the mean average precision (MAP) of the Sift, Gabor, Edge and Color classifiers and of the fusion schemes Average, Max, Vote, Pweight, Apweight, NewPw and NewApw expressed by (4), (5), (7), (8), (9) and (12) respectively. Fig. 2 shows the average precision of each concept, where the concept index corresponds to the first column of Table I.

From Table II, we can see that the MAP of the Gabor classifier, the best among the four basic classifiers, reaches 14.3719%. The performance of every fusion method other than Max is much better than that of the Gabor classifier. In addition, the MAP of Apweight outperforms that of Gabor by 51.2514%. When each concept takes its own best fusion scheme, the resulting set of schemes is called the Mixed fusion. The MAP of the Mixed fusion outperforms that of Gabor by 54.9690%.

Figure 2. The average precision (AP) of each concept for the Sift, Gabor, Edge, Color and Mixed schemes (x-axis: concept index 1-20; y-axis: average precision)

From Fig. 2 we can see that, for every concept, the AP of the Mixed fusion scheme is better than those of the four basic classifiers. The AP of Mixed for street outperforms the AP of Edge by 45.95%, and the AP of Mixed for mountain outperforms the AP of Edge by 45.69%. The AP of Mixed for cityscape improves on the AP of Gabor by 39.12%, and all concepts gain some improvement from fusion.

Figure 3. The comparison of BUPT_Sys1_1 with other systems

We participated in TRECVID-2008 with the proposed system. Since every team is allowed to submit no more than 6 runs, we submitted only runs based on the mixed fusion scheme. In addition, as we extract three key frames from each shot, in the second fusion model in Fig. 1 we take either the maximum or the average of the probabilities of the three key frames. The two runs we submitted are as follows:

BUPT_Sys1_1: The classifiers were trained, and the second fusion model took the maximal probability among three key frames of a shot.

BUPT_Sys2_2: The classifiers were trained, and the second fusion model took the average among three key frames of a shot.

BUPT_Sys1_1 performed better, with a MAP of 0.047; the MAP of BUPT_Sys2_2 was 0.043.

Fig. 3 compares the performance of the proposed system with that of other systems. In Fig. 3, the ordinate is the inferred average precision [24], which estimates the average precision (AP) from incomplete and imperfect judgments; in this way TRECVID avoids judging all the data. The abscissa is the



concept index. Although our system is simple, its performance is above the average level.

Finally, we analyze why the performance of our system on TRECVID-2008 differs considerably from that on the TRECVID-2007 auto-test corpus. We see three reasons:

I) In the TRECVID-2007 dataset, the positions of the key frames are given, and the concepts to be detected are actually present in those key frames. In the TRECVID-2008 dataset, however, only shots are given. We simply choose three key frames from each shot at a fixed interval, which may not coincide with those chosen by TRECVID and may not contain the concept to be detected.

II) The scale of the evaluation. There are about 6,609 samples in the TRECVID-2007 auto-test corpus, but 42,461 samples in TRECVID-2008. The larger the number of samples, the worse the performance may be.

III) The dataset domains are different. The training corpus comes from the same domain as the TRECVID-2007 auto-test corpus, but from a different domain than TRECVID-2008. Nevertheless, the trends of the results in Fig. 2 and Fig. 3 are consistent.

VI. CONCLUSION

In this paper, we proposed a new framework for high-level feature extraction. In this system, features at different granularities are extracted, four classifiers with complementary features are employed for each concept, and their results are fused. We evaluated 18 fusion schemes and chose the best one for each concept to form the final results. The experiments on the auto-test corpus and the TRECVID-2008 corpus show that the proposed system is effective and stable.

ACKNOWLEDGMENT

This work was supported by the China National Natural Science Foundation under Project 60772114.

REFERENCES

[1] Jie Cao, Yanxiang Lan, Jianmin Li et al., "Tsinghua University at TRECVID 2006," in Proceedings of the TRECVID 2006 Workshop.
[2] Jinhui Yuan, Zhishan Guo, Li Lv et al., "THU and ICRC at TRECVID 2007," in Proceedings of the TRECVID 2007 Workshop.
[3] C. G. M. Snoek, I. Everts, J. C. van Gemert et al., "The MediaMill TRECVID 2007 Semantic Video Search Engine," in Proceedings of the TRECVID 2007 Workshop.
[4] Xiangyang Xue, Hui Yu, Hong Lu et al., "Fudan University at TRECVID 2007," in Proceedings of the TRECVID 2007 Workshop.
[5] Duy-Dinh Le, Shin'ichi Satoh and Tomoko Matsui, "NII-ISM, Japan at TRECVID 2007: High Level Feature Extraction," in Proceedings of the TRECVID 2007 Workshop.
[6] Sheng Tang, Yong-Dong Zhang, Jin-Tao Li et al., "TRECVID 2007 High-Level Feature Extraction by MCG-ICT-CAS," in Proceedings of the TRECVID 2007 Workshop.
[7] James Philbin, Ondřej Chum, Josef Sivic et al., "Oxford TRECVID 2007 - Notebook Paper," in Proceedings of the TRECVID 2007 Workshop.
[8] Yu-Gang Jiang, Chong-Wah Ngo and Jun Yang, "Towards Optimal Bag-of-Features for Object Categorization and Semantic Video Retrieval," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR 2007), pp. 494-501, 2007.
[9] Koen E. A. van de Sande, Theo Gevers and Cees G. M. Snoek, "Evaluation of color descriptors for object and scene recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), Jun. 23-28, 2008.
[10] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[11] C. Liu and H. Wechsler, "A Gabor feature classifier for face recognition," in Proceedings of the IEEE International Conference on Computer Vision, vol. 2, pp. 270-275, Jul. 9-12, 2001.
[12] J. Huang, S. R. Kumar, M. Mitra and W. J. Zhu, "Image indexing using color correlograms," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 762-768, Jun. 17-19, 1997.
[13] Mika Rautiainen and David Doermann, "Temporal color correlograms for video retrieval," in Proceedings of the International Conference on Pattern Recognition, vol. 16, no. 1, pp. 267-270, 2002.
[14] Thurdsak Leauhatong, Kiyoaki Atsuta and Shozo Kondo, "A new content-based image retrieval using color correlogram and inner product metric," in 8th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2007), 2007.
[15] M. R. Naphade, J. R. Smith et al., "IBM Research TRECVID-2004 Video Retrieval System," in Proceedings of the TRECVID 2004 Workshop.
[16] M. R. Naphade, J. R. Smith et al., "IBM Research TRECVID-2005 Video Retrieval System," in Proceedings of the TRECVID 2005 Workshop.
[17] S.-F. Chang, W. Hsu et al., "Columbia University TRECVID-2005 Video Search and High-level Feature Extraction," in Proceedings of the TRECVID 2005 Workshop.
[18] J. Smith et al., "Multimedia semantic indexing using model vectors," in Proc. IEEE ICME, vol. 2, pp. 445-448, 2003.
[19] Wei Jiang, Shih-Fu Chang and A. C. Loui, "Context-Based Concept Fusion with Boosted Conditional Random Fields," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), vol. 1, pp. I-949 - I-952, Apr. 15-20, 2007.
[20] Y. Aytar, O. B. Orhan and M. Shah, "Improving Semantic Concept Detection and Retrieval using Contextual Estimates," in IEEE International Conference on Multimedia and Expo (ICME 2007), pp. 536-539, Jul. 2-5, 2007.
[21] Le Chen, Dayong Ding, Dong Wang, Fuzong Lin and Bo Zhang, "AP-based Borda Voting Method for Feature Extraction in TRECVID-2004," in Advances in Information Retrieval - 27th European Conference on IR Research (ECIR 2005), pp. 568-570, 2005.
[22] IBM VideoAnnEx MPEG-7 Video Annotation Tool, http://www.research.ibm.com/VideoAnnEx/
[23] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[24] Emine Yilmaz and Javed A. Aslam, "Estimating average precision with incomplete and imperfect judgments," in Proceedings of the Fifteenth ACM International Conference on Information and Knowledge Management (CIKM), November 2006.
