Multidimensional approach in cbmmirs full paper v4.0

A Multidimensional Approach in Content-based Multimedia Information Retrieval System

Indra Budi, Zainal A. Hasibuan, Gema P. Mindara
Faculty of Computer Science
University of Indonesia
Depok, Indonesia
[email protected], [email protected], [email protected]

Albaar Rubhasy
Department of Computer System
STMIK Indonesia
Jakarta, Indonesia
[email protected]

Abstract- In this digital era, the use of digital multimedia information is growing rapidly due to the development of the Internet. Users therefore demand more effective content-based multimedia information retrieval systems (CBMMIRS). The major challenge in this research area is that a multimedia document comprises more than one type of content (e.g. text, image, audio). To address this challenge, many works have focused on developing indexing techniques that can accommodate multiple multimedia object representations, also known as object features. However, most experiments use only one kind of collection, for example a collection of WWW pages, a video collection, or an image collection. In this paper, we propose a multidimensional approach that can accommodate semantic indexing of various multimedia contents in different multimedia collections, since different multimedia documents may share similar information. The architecture comprises three components: (1) a collection manager (which manages the multimedia document repository); (2) an indexer (which handles multimedia concept detection and indexing); and (3) a query processor (which deals with queries and search results). Our hypothesis is that the more complete a document is (that is, the more feature spaces in which it is indexed), the more relevant it is, and the higher it should be ranked in the search results.

Keywords- CBMMIRS, multimedia information retrieval, multidimensional approach

I. INTRODUCTION

With the development of the Internet, the use of digital multimedia information (including audio, video, images, and graphics) is growing rapidly and plays an important role in modern life. Most multimedia files are published and distributed in various formats via social media on the Internet, for instance Facebook¹, Flickr², and Youtube³. As a result, there is an explosion of digital multimedia objects, and users demand more efficient yet accurate content-based multimedia information retrieval systems (CBMMIRS).

Due to the large and varied digital multimedia collections, a text-based retrieval system is considered inefficient in terms of both the human labor required and the precision achieved. Therefore, in the early 1980s, content-based information retrieval (CBIR) was introduced to overcome these disadvantages. However, by nature a multimedia document may consist of more than one type of content, for example text, images, video, and audio. Thus, in the late 1990s, a novel approach emerged that combines text-based and content-based retrieval methods in order to boost CBMMIRS performance. Many authors describe this technique as multimodal information retrieval, in which the system indexes and retrieves using various object representations (modalities), such as text, color, and texture. Nevertheless, in many papers the authors used only one type of multimedia collection, such as TRECVID for video collections [2], MIRFLICKR for image collections [1], or WIKIPEDIA-MM for World Wide Web (WWW) page collections [3].

¹ http://www.facebook.com
² http://www.flickr.com
³ http://www.youtube.com

In this paper, we propose a multidimensional approach that accommodates heterogeneous multimedia collections and the variety of multimedia contents (i.e. textual, visual, and audio). The goal of this approach is to achieve completeness of information, meaning that the most relevant information should be available in many types of content. Although this approach may be fruitful, there is a constraint on the number of object features that can be applied. Excessive use of object features in indexing may lead to poor performance, due to the well-known ‘curse of dimensionality’ problem [4]. As the dimensionality of the feature space increases, the performance of indexing algorithms degrades. Research has shown that when the dimensionality is above 10, the performance is no better than a simple sequential scan [5].
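One intuition behind this degradation can be illustrated with a small experiment: as the dimensionality grows, the nearest and the farthest neighbor of a query tend to lie at almost the same distance, so a distance-based index has little to prune. The sketch below is only an illustration under assumed settings (uniform random points in the unit hypercube, Euclidean distance, a hypothetical helper `relative_contrast`); it is not the experiment reported in [5].

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query to n_points random points, all drawn uniformly from [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


# A small contrast means "nearest" and "farthest" are hard to tell apart.
contrast_by_dim = {dim: relative_contrast(dim) for dim in (2, 10, 100)}
```

Under these assumptions the contrast is expected to be far smaller at 100 dimensions than at 2, which is the effect that makes tree-based indexes approach a sequential scan.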

This paper explores a multidimensional approach in CBMMIRS. The rest of the paper is organized as follows. Section II reviews related work. Section III focuses on the multidimensional approach in CBMMIRS, which uses high-dimensional feature spaces with various types of collections. Section IV concludes the paper and discusses the future work to be conducted.

II. RELATED WORKS

The building blocks of a CBMMIRS comprise three essential processes: (1) multimedia feature extraction; (2) concept detection; and (3) indexing. Each of these processes is discussed in the following parts.

A. Multimedia Feature Extraction

Feature extraction is one of the major tasks that determine the performance of a CBMMIRS [6]. Many techniques are available to generate representations of multimedia content, which may comprise a combination of text, visual (i.e. image), and audio content. Next, we describe a few state-of-the-art feature extraction techniques for these three types of multimedia content.

1) Textual Feature Extraction

The fundamental text indexing scheme was proposed by Salton and McGill with the popular tf-idf scheme [7]. This technique chooses a basic vocabulary of “terms” or “words” and counts the number of occurrences of each term in each document. This term frequency count is then compared with an inverse document frequency count. As a result, the tf-idf scheme reduces documents of arbitrary length to fixed-length lists of numbers. However, the dimension reduction achieved by this scheme is considered insignificant. The most distinguished approach to tackle this issue is latent semantic indexing (LSI). LSI uses a singular value decomposition of the term-by-document tf-idf matrix X to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection [8]. Later, a major breakthrough was introduced by Hofmann with the probabilistic LSI (pLSI) model. This approach models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of “topics” [9]. However, both models (LSI and pLSI) are based on the “bag of words” assumption that the order of words in a document can be ignored. To obtain a mixture model that captures the exchangeability of both words and documents, the latent Dirichlet allocation (LDA) model was proposed [10]. To date, this model is widely used in IR research. Next, we discuss image feature extraction.
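To make the tf-idf step concrete, the following minimal sketch turns tokenized documents into fixed-length lists of numbers. The weighting (raw term count multiplied by log(N/df)) is one common variant chosen here for illustration, and the function name `tf_idf` and the toy documents are assumptions, not part of the scheme described in [7].

```python
import math
from collections import Counter


def tf_idf(documents):
    """Map each tokenized document to a fixed-length list of tf-idf weights.

    Illustrative weighting: tf = raw term count, idf = log(N / df).
    """
    vocabulary = sorted({term for doc in documents for term in doc})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


# Toy collection (made-up data).
docs = [
    "batik pattern image".split(),
    "batik audio recording".split(),
    "gamelan audio recording".split(),
]
vocab, vecs = tf_idf(docs)
```

Every document is now a vector whose length equals the vocabulary size, which also shows why the reduction is modest: the vector grows with every new term in the collection.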

2) Image Feature Extraction

There are many ways to represent an image as feature vectors. The traditional method uses the image histogram. This method was successfully implemented for large-scale galleries and museums in Europe [11]. However, it discards all information about the spatial distribution of color, which reduces the efficiency of the signature and is a major flaw [12]. Other techniques were then studied, using color, texture, shape, and many other features. Nevertheless, most of them could not overcome a central challenge in image feature extraction: describing an image consistently even when it is obstructed, rotated, and so forth. As a result, these image features are no longer ideal for CBMMIRS. Most recent works use the scale-invariant feature transform (SIFT), which is based on common grounds and has been successfully applied in many projects [11, 13]. SIFT detects points of interest in an image and provides descriptions of them, which yields more information than the other feature-based methods. Several criteria are used for detection: local contrast, local maxima/minima of certain functions (e.g. Laplacian, gradient), and a threshold over a curvature function (e.g. Harris, Hessian). Next, we briefly discuss audio feature extraction.
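The loss of spatial information in a histogram signature can be seen in a toy case. In the sketch below, two made-up 2x4 "images" use color labels instead of pixel values; the helper `color_histogram` is a hypothetical name and the data are purely illustrative.

```python
from collections import Counter


def color_histogram(image):
    """Count how often each pixel value occurs, ignoring pixel positions."""
    return Counter(pixel for row in image for pixel in row)


# Left half red, right half blue.
left_red = [
    ["red", "red", "blue", "blue"],
    ["red", "red", "blue", "blue"],
]
# Same colors in the same proportions, arranged as a checkerboard.
checkerboard = [
    ["red", "blue", "red", "blue"],
    ["blue", "red", "blue", "red"],
]

same_histogram = color_histogram(left_red) == color_histogram(checkerboard)
same_layout = left_red == checkerboard
```

The two images share one histogram although their layouts differ, so a histogram-only index cannot distinguish them.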

3) Audio Feature Extraction

Many works have focused on the analysis of structured audio such as speech or music. Only a few systems have been proposed to analyze unstructured audio. One of the popular feature models is the mel-frequency cepstral coefficient (MFCC). MFCC features are modeled on the shape of the overall spectrum, which makes them more suitable for modeling single sound sources. An environmental sound, on the other hand, comprises more than one sound source. To tackle this issue, the matching pursuit (MP) technique was proposed. MP provides an efficient way of selecting a small basis set that produces meaningful features as well as a flexible representation [14]. It is potentially invariant to background noise and can capture characteristics of the signal where MFCC fails. This ends our discussion of multimedia feature extraction for the three types of multimedia content. In the next part, we focus on audio visual concept detection techniques.
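The greedy selection at the core of MP can be sketched in a few lines. The example below assumes unit-norm atoms and uses a tiny Walsh-type dictionary of length-4 atoms purely for illustration; it is not the dictionary or the feature pipeline of [14], and the names `matching_pursuit` and `dot` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy matching pursuit over unit-norm atoms.

    Returns a list of (atom_index, coefficient) pairs and the final residual.
    """
    residual = list(signal)
    selected = []
    for _ in range(n_atoms):
        # Pick the atom most correlated with the current residual.
        correlations = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(correlations[i]))
        coeff = correlations[best]
        selected.append((best, coeff))
        # Subtract that atom's contribution from the residual.
        residual = [r - coeff * a for r, a in zip(residual, dictionary[best])]
    return selected, residual


# Toy dictionary: four unit-norm, mutually orthogonal atoms (assumed data).
atoms = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]
signal = [2.0, 1.0, -2.0, -1.0]  # equals 3 * atoms[1] + 1 * atoms[3]
selected, residual = matching_pursuit(signal, atoms, n_atoms=2)
```

Here two atoms describe the signal completely, so the selected indices and coefficients form a compact feature; practical dictionaries are usually overcomplete, and the residual then shrinks gradually instead of vanishing.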

B. Audio Visual Concept Detection

Multimedia concept detection is considered one way of reducing the semantic gap. Reference [15] provides an example of a detection model that links each topic with one or more visual concepts, known as the Visual Concept Detection Task (VCDT). However, most works have focused only on visual concepts, and few on audio visual concept detection. One example that uses both visual and audio content can be found in [16]. In that work, the authors provide an approach to semantically detect concepts in a video collection. However, the audio is only classified into speech and instrumental classes, and environmental sounds are not detected. This issue needs to be explored more thoroughly in order to improve the CBMMIRS's understanding of the concepts present in a multimedia document.

C. Multimedia Concept-based Indexing

The multimedia concept-based (or semantic-based) indexing approach depends on the fusion of concepts, for which many works use kernel-based classifiers (e.g. the support vector machine, or SVM). There are two basic fusion strategies: early fusion and late fusion. Early fusion integrates the modalities before searching: features from the different modalities are first fused into a single representation, and the search algorithm then runs on that fused representation. Late fusion, on the other hand, processes each feature separately and combines the results; under this scheme, the different rankings are combined, which is referred to as data fusion or rank aggregation. It is also possible, and promising, to merge the two schemes, so that early fusion operates on low- or intermediate-level features and late fusion merges the unimodal classification scores of high-level features [17].
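The difference between the two strategies can be sketched as follows. Early fusion is shown as plain concatenation of per-modality feature vectors, and late fusion as a Borda count over per-modality rankings; both are common textbook choices assumed here for illustration and are not necessarily the methods used in [17]. All names and data are hypothetical.

```python
def early_fusion(feature_vectors):
    """Concatenate per-modality feature vectors into one joint vector."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def late_fusion(rankings):
    """Combine per-modality rankings with a Borda count.

    Each ranking lists document ids from best to worst; a document earns
    (len(ranking) - position) points per ranking, and 0 if it is absent.
    """
    points = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking):
            points[doc_id] = points.get(doc_id, 0) + len(ranking) - position
    # Sort by points (descending), then by id for a deterministic order.
    return sorted(points, key=lambda d: (-points[d], d))


# Early fusion: one joint vector is indexed and searched.
text_vec, image_vec = [0.2, 0.8], [0.5, 0.1, 0.4]
joint = early_fusion([text_vec, image_vec])

# Late fusion: each modality is searched separately, then ranks are merged.
text_rank = ["d1", "d2", "d3"]
image_rank = ["d2", "d1", "d3"]
audio_rank = ["d2", "d3", "d1"]
combined = late_fusion([text_rank, image_rank, audio_rank])
```

Note that early fusion increases the dimensionality of the indexed vector, whereas late fusion keeps each feature space separate and only merges result lists.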

III. THE PROPOSED MULTIDIMENSIONAL APPROACH

We observe that many works in the CBMMIR research area involve just one type of multimedia collection, for example a video collection for a content-based video retrieval system or an image collection for a content-based image retrieval system. Here we propose a different approach that involves different kinds of features from various types of multimedia collections (e.g. video, image, WWW, and other collections) in order to achieve completeness of information. We are inspired by [18], which uses three different components of documents in order to improve retrieval performance; we propose a similar multidimensional strategy for multimedia documents, which also have several types of components. The proposed multidimensional approach is depicted in Figure 1.

[Figure 1: block diagram with the following labeled elements. Collection Manager: Digital Object Repository (Video Collection, Image Collection, WWW Collection, Other Collection) and Metadata Repository. Indexer: Text Feature Extraction, Image Feature Extraction, Audio Feature Extraction, Multimedia Concept Detection Process, Training Dataset, Multimedia Concept-based Indexing, Multimedia Concept-Based Index. Query Processor: User Interface, Multimedia Concept-based Query, Multimedia Concept-based Matching Process, Multimedia Search Results.]

Figure 1. Proposed multidimensional approach in CBMMIRS

Fig. 1 shows the proposed multidimensional CBMMIRS architecture, which is adapted from [19]. The system comprises the following three components:

• Collection Manager (CM): this component is in charge of collecting and managing multimedia documents from the various types of collections that are to be indexed, searched, and retrieved by the Indexer and the Query Processor. The documents from different types of collections, such as video, image, WWW, and other multimedia collections, are stored in a repository along with their metadata, which provide information about the documents. CM also includes an administrator user interface for administering the document collections.

• Indexer (IX): this component is responsible for generating and maintaining the data structures, called indexes, each of which represents one type of multimedia document feature (i.e. text, image, and audio), in order to provide searching capabilities. IX processes the documents collected by CM. The indexing process involves a feature extraction method for each type of feature: (1) for text feature extraction, we suggest using LDA; (2) for image feature extraction, we use SIFT; (3) for audio feature extraction, we intend to use the MP technique. In our system design, the indexing process involves multimedia concept-based indexing, which depends on the robustness of the multimedia concept detection method.

• Query Processor (QP): this component is responsible for handling queries and search results. QP provides a user interface for multimedia concept-based queries. The concept-based query interface differs from a search tool such as Google⁴ in that it allows users to resolve the naming heterogeneity that occurs when an identical concept is described using different terms.
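The interaction of the three components can be sketched as a minimal in-memory prototype. Everything in this sketch is an assumption made for illustration: the class and field names, the stubbed concept detection (each document already carries its detected concepts per feature type), and the term-to-concept table that stands in for resolving naming heterogeneity. It is not the design of [19].

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    collection: str  # e.g. "video", "image", "www"
    # Detected concepts per feature type; stands in for real concept detection.
    concepts: dict = field(default_factory=dict)


class CollectionManager:
    """Stores documents from several collections in one repository."""

    def __init__(self):
        self.repository = {}

    def add(self, document):
        self.repository[document.doc_id] = document


class Indexer:
    """Builds a concept-based index: (feature type, concept) -> doc ids."""

    def __init__(self):
        self.index = {}

    def build(self, manager):
        for doc in manager.repository.values():
            for feature_type, concepts in doc.concepts.items():
                for concept in concepts:
                    key = (feature_type, concept)
                    self.index.setdefault(key, set()).add(doc.doc_id)


class QueryProcessor:
    """Maps a query term to a concept, then looks the concept up in the index."""

    def __init__(self, indexer, term_to_concept):
        self.indexer = indexer
        self.term_to_concept = term_to_concept

    def search(self, term):
        concept = self.term_to_concept.get(term.lower(), term.lower())
        hits = {}
        for (feature_type, indexed), doc_ids in self.indexer.index.items():
            if indexed == concept:
                for doc_id in doc_ids:
                    hits.setdefault(doc_id, set()).add(feature_type)
        return hits


cm = CollectionManager()
cm.add(Document("v1", "video", {"image": ["gamelan"], "audio": ["gamelan"]}))
cm.add(Document("w1", "www", {"text": ["gamelan", "batik"]}))
ix = Indexer()
ix.build(cm)
qp = QueryProcessor(ix, {"javanese orchestra": "gamelan"})
result = qp.search("Javanese orchestra")
```

The search result records, for each matching document, the feature spaces in which the concept was found, which is the information a completeness-aware ranking would need.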

The research issues that may arise in our work are stated below:

• Feature extraction techniques. The extraction techniques mentioned earlier, such as LDA, SIFT, and MP, are state-of-the-art feature extraction methods. Nevertheless, finding the ‘right combination’ is one of the main problems. Which features of a multimedia object to choose, and which extraction technique to prefer for each feature in order to increase CBMMIRS performance, remains a challenging research question.

• Multimedia concept detection method. Much work has been done to detect multimedia concepts automatically. However, the generic concepts of a multimedia object, including audio visual collections, have not been explored comprehensively. Standardized visual concepts are available, such as Wiki concepts and Visual Concept Detection topics. In contrast, standardized concepts for audio are not yet in place. We therefore need to explore multimedia concept detection methods further in order to accommodate the audio visual features of a multimedia object.

• Multimedia concept-based matching process. In the matching process, we propose a different way of ranking retrieved documents. Our hypothesis is that the more complete a document is, that is, the more feature spaces in which it is available, the more relevant it is. Such documents should therefore be weighted more heavily in order to raise their rank. This hypothesis has to be tested in experiments that will be performed in the next phase of this work.

• Multimedia concept-based query interface. As stated earlier, a concept-based query interface differs from a general search tool. The issue in this research area is to minimize the ambiguity of different terms that refer to a similar concept.

⁴ http://www.google.com
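One possible reading of the matching-process hypothesis is sketched below. The scoring formula (mean similarity over the feature spaces in which a document is indexed, multiplied by a factor that grows with the fraction of spaces covered) and the `boost` parameter are assumptions introduced for illustration only; the paper leaves the actual weighting to future experiments.

```python
FEATURE_SPACES = ("text", "image", "audio")


def completeness_score(similarities, boost=0.5):
    """Score a document from its per-feature-space similarities.

    similarities maps a feature space to a similarity in [0, 1]; spaces in
    which the document is not indexed are simply absent. The base score is
    the mean similarity over the spaces present, multiplied by a factor
    that grows with the fraction of feature spaces covered.
    """
    if not similarities:
        return 0.0
    base = sum(similarities.values()) / len(similarities)
    completeness = len(similarities) / len(FEATURE_SPACES)
    return base * (1.0 + boost * completeness)


def rank(candidates, boost=0.5):
    """Return document ids sorted by descending completeness-weighted score."""
    return sorted(
        candidates,
        key=lambda d: (-completeness_score(candidates[d], boost), d),
    )


# Three made-up documents that match equally well in every space they cover.
candidates = {
    "text_only": {"text": 0.8},
    "text_image": {"text": 0.8, "image": 0.8},
    "all_three": {"text": 0.8, "image": 0.8, "audio": 0.8},
}
ordering = rank(candidates)
```

With equal per-space similarity, the document covered in all three feature spaces is ranked first and the text-only document last, which is the behavior the hypothesis calls for.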

IV. CONCLUSION AND FUTURE WORKS

This paper proposes a multidimensional approach in CBMMIRS that can accommodate various types of multimedia object features (i.e. text, image, and audio) across numerous multimedia document collections. In our design, the system comprises three components: (1) a collection manager (responsible for storing multimedia document collections); (2) an indexer (responsible for extracting and indexing document features so that they can be searched by users); and (3) a query processor (responsible for managing queries and search results). We also identify a few research issues in these three CBMMIRS components. Further experiments need to be conducted, not only to test retrieval performance but also to test our hypothesis that the more complete a document is (that is, the more feature spaces in which it is indexed), the more relevant it is compared with documents indexed in only one feature space. Such documents should therefore be placed at the top of the search results.

ACKNOWLEDGMENT

This paper was fully supported by the DRPM UI Research Grant under contract number 1198/SK/R/UI/2010 (research project on Indonesian e-Cultural Heritage and Natural History Framework).

REFERENCES

[1] M. J. Huiskes and M. S. Lew, “The MIR Flickr Retrieval Evaluation”, MIR ’08 Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, ACM New York, USA, 2008, ISBN: 978-1-60558-312-9, DOI: 10.1145/1460096.1460104.

[2] A. F. Smeaton, P. Over, and W. Kraaij, “Evaluation Campaigns and TRECVid”, MIR ’06 Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, ACM New York, USA, 2006, ISBN: 1-59593-495-2, DOI: 10.1145/1178677.1178722.

[3] A. Popescu, T. Tsikrika, and J. Kludas, “Overview of the Wikipedia Retrieval Task at ImageCLEF 2010”, Working Notes of the ImageCLEF 2010 Lab, Padua, Italy, 2010.

[4] N. Rasiwasia, J. C. Pereira, E. Coviello, and G. Doyle, “A New Approach to Cross-Modal Multimedia Retrieval”, Proceedings of the International Conference on Multimedia, October 25-29, 2010, ACM New York, USA, ISBN: 978-1-60558-933-6, DOI: 10.1145/1873951.18739870.

[5] R. Weber, H.-J. Schek, S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces”, Proceedings of the 24th VLDB Conference, New York, USA, 1998, pp. 194–205.

[6] M. M. Rahman, B. C. Desai, and P. Bhattacharya, “A Feature Level Fusion in Similarity Matching to Content-based Image Retrieval”, Information Fusion, 2006.

[7] G. Salton and M. McGill, “Introduction to Modern Information Retrieval”, McGraw-Hill, 1983.

[8] S. Deerwester, S. Dumais, T. K. Landauer, G. Furnas, and R. Harshman, “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science, 41(6):391-407, 1990.

[9] T. Hofmann, “Probabilistic Latent Semantic Indexing”, Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.

[10] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation”, Journal of Machine Learning Research 3, 2003, pp. 993-1022.

[11] P. H. Lewis, K. Martinez, F. S. Abas, M. Faizal, A. Fauzi, S. C. Y. Chan, M. J. Addis, M. J. Boniface, P. Grimwood, A. Stevenson, C. Lahanier, J. Stevenson, “An Integrated Content and Metadata Based Retrieval System for Art”, Journal IEEE Transactions on Image Processing, vol.13, March 2004, pp.302-313.

[12] E. Valle, M. Cord, and S. Philipp-Foliguet, “Content-based Retrieval of Images for Cultural Institutions using Local Descriptors”, Proceedings of Geometric Modelling and Imaging - New Trends (GMAI 2006), London, England, July 05–06, 2006, DOI: 10.1109/GMAI.2006.16.

[13] M. Kampel, R. Huber-Mörk, M. Zaharieva, “Image-Based Retrieval and Identification of Ancient Coins”, Journal IEEE Intelligent Systems, Vol. 24 Issue 2, March 2009, IEEE Educational Activities Department, Piscataway, NJ, USA, pp. 26-34, DOI: 10.1109/MIS.2009.29.

[14] S. Chu, S. Narayanan, and C.-C. J. Kuo, “Environmental Sound Recognition Using MP-based Features”, Proceedings of International Conference on Acoustics, Speech, and Signal Processing, 2008.

[15] Z. Zhao and H. Glotin, “Concept Content Based Wikipedia WEB Image Retrieval using CLEF VCDT 2008”.

[16] M. Rautiainen, T. Seppänen, J. Penttilä, and J. Peltola, “Detecting Semantic Concepts from Video Using Temporal Gradients and Audio Classification”.

[17] S. Ayache, G. Quénot, and J. Gensel, “Classifier Fusion for SVM-Based Multimedia Semantic Indexing”.

[18] Z. A. Hasibuan, “Multi Dimensions Concept-based Information Retrieval System”, Proceedings of ALLC/ACH 2000 Conference, Glasgow, UK, 2000.

[19] Z. A. Hasibuan, A. Kurniawan, and R. Budiarto, “Multi-Format Concept-Based Information Retrieval using Data Grid”, Journal of Advanced Computing and Applications Vol. 1 No. 1, 2009, pp. 1-11.
