

Signal Processing: Image Communication 19 (2004) 437–456


Using MPEG standards for multimedia customization

João Magalhães a,*, Fernando Pereira b

a Instituto Superior de Engenharia de Lisboa, Portugal
b Instituto Superior Técnico, Instituto de Telecomunicações, Lisboa, Portugal

Received 12 August 2003; received in revised form 24 December 2003; accepted 20 February 2004

Abstract

The multimedia content delivery chain today poses many challenges. The increasing terminal diversity, network heterogeneity and the pressure to satisfy user preferences raise the need for content to be customized in order to provide the user with the best possible experience. This paper addresses the problem of multimedia customization by (1) presenting the MPEG-7 multimedia content description standard and the MPEG-21 multimedia framework; (2) classifying multimedia customization processing algorithms; (3) discussing multimedia customization systems; and (4) presenting some customization experiments.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Universal Multimedia Access; Multimedia customization; Multimedia adaptation; Content description; Usage environment description; MPEG-7; MPEG-21

*Corresponding author.
E-mail address: [email protected] (J. Magalhães).

0923-5965/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.image.2004.02.004

1. Introduction

The exploding variety of multimedia information is nowadays a reality, since everyone has a camera, a scanner or another device that almost instantly generates multimedia content. Most often the content author wishes to share his/her masterpiece with everyone, but the variety of terminals and networks may be a problem if he/she wants everyone to see the work with the best possible quality. Typically, when a terminal accesses content it was not designed for, the user experience is rather poor. The scenario described misses a bridging element between all the components involved, which should take into account their characteristics and assure efficient and consistent inter-working: in other words, an efficient way to "access any information from any terminal", eventually after some content adaptation.

The access to multimedia information by any terminal through any network is a new concept referred to in the literature as Universal Multimedia Access (UMA) [30,31]. The objective of UMA technology is to make available different presentations of the same information, more or less complex, e.g., in terms of media types or bandwidth, suiting different terminals, networks and user preferences. In UMA scenarios, and in order to more easily and efficiently customize the desired content, it is essential to have available descriptions of the parts that have to be matched/bridged—the content and the usage environment:

* Content description: Information on the content features—e.g., resolution, bit-rate, motion, color, pitch, temporal structure, genre—which may be instrumental to perform an efficient customization of that content; if an adequate content description is available, there is no need to extract content features on-the-fly to perform an adequate adaptation.

* Usage environment description: Information on the usage conditions—e.g., terminal, network, user preferences, natural environment—which determine the quality of the experience to be provided through an adequate customized variation of the intended content. If no usage environment description is available, it is difficult to provide adapted content adequately fitting the consumption conditions.

These descriptions have to be matched by some module in the content delivery chain that will produce and implement a decision to perform a set of content customization operations, providing the user with the content for the best possible experience.

MPEG has dedicated large efforts to the standardization of tools for such application scenarios along three major dimensions: content coding (MPEG-1, MPEG-2 and MPEG-4), content description (MPEG-7 [17]) and usage environment description (MPEG-21 [20]). The major objective of this paper is to discuss the role of the various MPEG standards in the context of multimedia customization scenarios and to contribute to a better organization and understanding of the multimedia customization problem.

This paper is organized as follows: Section 2 presents the MPEG-21 vision of a multimedia framework aiming to enable the transparent and augmented use of multimedia resources across a wide range of networks, devices, and communities; a brief description of the MPEG-7 capabilities which may fit in the MPEG-21 framework is also made in this section. Section 3 organizes the several types of multimedia processing algorithms which may be used for matching the content to the usage environment characteristics. Section 4 overviews multimedia customization systems available in the literature and proposes a rather generic system architecture [14] fitting in the MPEG-21 multimedia framework. Finally, Sections 5 and 6 present some experiments and the final remarks regarding the presented work.

2. The MPEG-21 multimedia framework

Considering all the issues inherent to a content delivery chain (and not only from the UMA perspective), the MPEG-21 standard [20] proposes to define a multimedia framework for transparent multimedia delivery and consumption by all players in the delivery and consumption chain. In this framework, many standard technologies are needed to provide the various functionalities required, such as coding, multiplexing, synchronization, description, rights expression, rights management, etc. The MPEG-21 framework will make use of the relevant available standards providing efficient solutions for some of these functionalities, e.g. MPEG-4 for coding and MPEG-7 for description, and will develop new standards whenever required.

In the context of the MPEG-21 multimedia framework, the main entities are Users and Digital Items, see Fig. 1:

* An MPEG-21 User is any entity that interacts within the multimedia framework: it can be a content creator, a content distributor, or a content consumer (end user). Users include individuals, consumers, communities, organisations, corporations, consortia, and governments. Users are identified specifically by their relationships to other Users for a certain interaction. From a purely technical perspective, MPEG-21 makes no distinction between a "content provider" and a "consumer"—both are Users in the MPEG-21 multimedia framework.

* The Digital Item is the fundamental unit of distribution and transaction among Users in the MPEG-21 multimedia framework. It is a structured digital object with resources (the content), a unique identification and corresponding metadata (e.g. an MPEG-7 description). The structure relates to the relationships among the parts of the Digital Item, both resources and metadata.

Fig. 1. A representation of the MPEG-21 multimedia framework.

Once content (in the form of Digital Item resources) is exchanged in the defined framework, there will be entities offering content customization functionalities to achieve an optimal end user experience. Therefore, MPEG-21 sets the trail to create a complete UMA system, where such entities will play the role of the "bridging element between the parts that have to be matched/bridged"—the multimedia content and the usage environment.

2.1. Multimedia content description

To facilitate the development of powerful applications for multimedia information retrieval, customization, distribution, and manipulation, some knowledge about the multimedia content characteristics is essential in any content delivery chain [4]. The better the content is known, the more efficiently it may be processed, whatever the type of processing to be applied. Content description typically considers two types of features: features about the content that cannot be extracted directly from it, such as titles and names, and features conveying information that is present in the content, such as colors, melodies, or events. The second type of features may be low-level or high-level depending on their abstraction or semantic level [21]. Both low-level and high-level (semantic) features may be useful to decide on the best customizations to be performed, e.g. content segment semantics is useful to create a summary based on related user preferences, while motion activity may be important for the filtering of violent action segments.

MPEG-7 addresses the multimedia content description problem at different levels: it offers a wide range of description tools that consider both low-level features, such as color and pitch, as well as high-level features, such as the names of the characters in a scene or the title of a movie. MPEG-7 provides a set of description tools intended to characterize the audiovisual content in terms of the types of features listed above. The standard separates the descriptions from the content but provides linking mechanisms between the content and the descriptions. Also, more than one description may exist for the same content, depending on the needs the descriptions intend to address. The types of description tools specified by the MPEG-7 standard are [17]:

* Descriptors (D): Represent a feature, and define the syntax and semantics of the feature representation.

* Description Schemes (DS): Specify the structure and semantics of the relationships between their components, which may be both Descriptors and Description Schemes.

* Description Definition Language (DDL): Allows the creation of new Description Schemes, as well as the extension of existing Description Schemes.

* Systems Tools: Support the multiplexing of descriptions, synchronization of descriptions with the associated content, binary representation for efficient storage and transmission, management and protection of intellectual property, etc.
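As an illustration of what such a description can look like, the sketch below builds a skeletal MPEG-7-style media description with Python's standard xml.etree module; the element names are simplified stand-ins for MPEG-7 concepts, not the normative schema types.

```python
# Illustrative sketch only: element names approximate MPEG-7 concepts
# (media information, format, bit-rate) but are NOT the normative schema.
import xml.etree.ElementTree as ET

def build_media_description(uri, codec, width, height, bitrate):
    """Build a minimal MPEG-7-like content description."""
    root = ET.Element("Mpeg7Description")            # hypothetical wrapper
    media = ET.SubElement(root, "MediaInformation")
    ET.SubElement(media, "MediaUri").text = uri
    fmt = ET.SubElement(media, "MediaFormat")
    ET.SubElement(fmt, "Codec").text = codec
    ET.SubElement(fmt, "FrameSize", width=str(width), height=str(height))
    ET.SubElement(fmt, "BitRate").text = str(bitrate)
    return ET.tostring(root, encoding="unicode")

print(build_media_description("clip01.mp4", "MPEG-4 Simple", 352, 288, 256_000))
```

In a real deployment the description would be validated against the MPEG-7 schema and linked to the content it describes.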

MPEG has invested a great effort in the standardization of MPEG-7 description tools for UMA applications [33]. This makes MPEG-7 the most powerful content description solution available for UMA environments, and thus it has been chosen for describing content in the context of the UMA system later described in this paper. The MPEG-7 UMA related description tools are grouped into three categories:

* Media description tools: Describe the media; typical features include the storage support, the coding format, coding parameters, and the identification of the media. One of the most powerful tools regards transcoding hints, which improve the quality and reduce the complexity of transcoding applications (e.g. regions of interest, motion vectors).

* Content structure and semantic description tools: Describe the audiovisual content from the viewpoint of its structure and semantics; the content is organized in segments that represent its spatial, temporal or spatial–temporal structure.

* Content navigation and access tools: Facilitate the browsing and retrieval of audiovisual content and the management of different versions of the same content, notably summaries, views and variations. With MPEG-7 summarization tools, multiple summaries of the same content can be specified at different levels of detail, without the need for generating or storing multiple variations of the content.

These description tools are built upon the MPEG-7 Multimedia Description Schemes basic elements (schema tools, basic data types, links and media localization), which provide the foundation for the development of MPEG-7 description tools [17].

To create MPEG-7 descriptions, some amount of analysis is typically required to extract the feature values. Fig. 2 illustrates a rather generic diagram of the multimedia feature extraction process, where knowledge is learned from previous analysis to improve future feature extraction and classification (this analysis ranges from low-level to high-level).

Fig. 2. Low-level and high-level features extraction.

There are many papers in the literature describing the extraction of both low-level features and high-level features. Wang et al. [36] present an overview of multimedia content low-level analysis techniques: while the extraction of low-level characteristics typically poses no significant problem beyond the choice of what to extract to achieve the desired goals, the same is not true when semantic characteristics have to be extracted by means of automatic analysis. To create semantic descriptions, semi-automatic analysis tools allowing humans to complement and/or train the analysis algorithms may be essential. Snoek and Worring present in [29] a review of multimodal indexing techniques for high-level information extraction.

Several developments have been presented in terms of high-level analysis, the most notable using machine learning/pattern recognition algorithms to identify objects, and to detect concepts and conceptual relations in multimedia content. Vasconcelos et al. [34] and Naphade et al. [24] used Bayesian networks to characterize multimedia semantic features, e.g. indoor/outdoor, forest, sky, water, explosions, rocket launches; similar work has been done by Adams et al. [1], where features from the audio, video and text modalities are individually classified in an earlier stage, to be later combined for the detection of semantic concepts; Benitez et al. [3] proposed a way to discover and measure statistical relationships among concepts, from images and corresponding text annotations; finally, Tansley et al. [32] proposed a four-layer semantic representation framework, where each layer encodes information at an increasingly symbolic level.


All these works achieve valuable results by employing learning-based approaches to extract high-level information from multimedia content. Semantic content analysis is nowadays a very active research field, with promising advances expected in upcoming years—the reader is referred to [36,29,28], where the content analysis problem is further discussed.

2.2. Multimedia usage environment description

The previous section presented MPEG-7 as the standard solution to describe the multimedia content to be accessed. The other side of the UMA problem, the usage environment, may also be very heterogeneous due to the different terminals, networks and so forth that may be present. With a standard solution providing information on the key usage environment dimensions, it would be much easier for an application to customize its content and services to the usage environment conditions. Moreover, it would improve application portability, since an application could be developed so that it checks the (standard) conditions in which it is running and behaves accordingly.

Part 7 of the MPEG-21 standard—called Digital Item Adaptation (DIA) [19]—has been created mainly to address the usage environment description problem. A diversity of dimensions characterizing the usage environment may enter into the equation that rules the content adaptation/delivery strategy. MPEG-21 DIA considers four major dimensions for usage environment characterization, which have been proposed to MPEG by the authors of this paper [15]:

* Terminal characteristics: The most commonly used software with which a user accesses multimedia content is a Web browser and its plug-ins. The browser depends on the hardware and software characteristics on top of which it is running: multimedia decoding software, display resolution, display size, number of colors, audio capabilities, input capabilities (e.g. keyboard type), etc.

* Network characteristics: The access network can be the cause of very annoying effects for the user: delay, bandwidth shortage, channel errors, etc. The access network should be described as completely as possible to prevent, as much as possible, delays or pauses in the content rendering.

* Natural environment characteristics: This usage environment dimension includes the natural features of the surrounding usage environment that may influence the content adaptation: location, illumination, altitude, temperature, etc.

* User preferences: The last element in the content chain is the human user. This dimension holds information regarding his/her preferences, such as genre and advertisement tastes, but also about disabilities. More general information on preferences can also be used, such as food and accommodation preferences for advertising or retrieval.
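A minimal sketch of how such a usage environment description might look and be consumed follows; the element and attribute names are simplified, non-normative stand-ins for the four MPEG-21 DIA dimensions, not the actual DIA schema.

```python
# Sketch only: simplified, non-normative stand-ins for the MPEG-21 DIA
# usage environment dimensions (terminal, network, environment, user).
import xml.etree.ElementTree as ET

UED_EXAMPLE = """
<UsageEnvironment>
  <Terminal displayWidth="240" displayHeight="320" colorDepth="16"/>
  <Network maxBandwidth="64000" avgDelayMs="300"/>
  <NaturalEnvironment location="outdoor" illumination="bright"/>
  <User preferredGenre="news" textToSpeech="false"/>
</UsageEnvironment>
"""

def parse_ued(xml_text):
    """Extract the four usage environment dimensions into a dict."""
    root = ET.fromstring(xml_text)
    return {child.tag: dict(child.attrib) for child in root}

print(parse_ued(UED_EXAMPLE))
```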

In a broader sense, the term "usage environment" concerns all user related information that can be described and can also be used for purposes other than multimedia content customization. For example, credit card numbers are part of the usage environment, but this information is not that important from the point of view of content customization (although it is essential for billing purposes).

2.3. The MPEG wrapping

Starting with the more traditional coding tools such as MPEG-1 and MPEG-2, and the recent scalable video coding tools such as MPEG-4 Fine-Grain-Scalability (FGS), and passing through MPEG-7 content description, MPEG standardization culminates with the MPEG-21 multimedia framework, which offers a wrapper allowing all the pieces in a multimedia customization chain to integrate and interact with each other. Fig. 3 depicts a possible configuration of a multimedia customization chain using all MPEG standards. The Digital Item (DI) and the DIA Usage Environment Description are concrete representations of the two sides of this chain.

Fig. 3. MPEG wrapping for multimedia customization (in the form of retrieval, summarization, adaptation, or personalization).

At the server side, there is the DI with its resources (the content variations) and the corresponding descriptions. The content resources may exist in several resolutions and formats; for example, the same Digital Item may have a non-scalable resource variation (e.g. MPEG-1) and a scalable resource variation (e.g. MPEG-4 FGS).

At the end user side, the MPEG-21 DIA UED describes the environment (terminal, network, natural environment, and user preferences) where the content is to be consumed. When the user performs a query or makes a choice, the request is accompanied by the DIA UED, thus enabling a customizing application to explore this information to create the right content variation and provide the best possible user experience.

Finally, the application at the center of Fig. 3 is responsible for matching the user query (and the associated DIA UED) with the Digital Item, either by selecting the most adequate available variation, or by performing some adaptation. When processing a user query, the customizing application creates an adapted variation of the Digital Item to be sent to the user—the new variation and its corresponding description may also be added to the DI resources available at the server.

The user query response may be delivered through a network, eventually using a real-time connection. In this case, the streaming module will stream the scalable or non-scalable content to the user; if real-time transcoding is being performed, it may happen that real-time adjustments to the transcoding process are implemented using measures which characterize, for example, the network fluctuations.

3. Multimedia customization processing

Several content customization techniques may have to be combined to achieve the optimal result in terms of final user experience. Fig. 4 presents a possible categorization and a non-exhaustive list of adaptation operations that a content customization engine may perform on the basic media types: text, image, audio, speech, video and synthetic content. The content customization operations are divided into two major categories.

Fig. 4. Organization of multimedia content customization solutions.

* Selection: Supposing that several variations of the same content, or even several alternative content pieces addressing different kinds of user constraints, are available, content selection corresponds to the identification of the most adequate content asset among those available to be sent to the user. The selected variation may already be adequate enough or may need further adaptation, as explained below. Existing content variations may include the same or different data types (e.g. video replaced by an image, or text converted to speech using a cross-modal transformation). Content selection may involve sending different information to different users not only based on their technical capabilities, e.g. bit-rate and video decoder, but also on their location or preferences: for example, a different advertising banner depending on the local temperature for a higher impact (e.g. ice cream shop versus tea house). The several content variations may be organized as an MPEG-21 Digital Item, categorized as choices according to some description features, e.g., bit-rate or genre (a minimal selection sketch follows after this list).

* Adaptation: Involves the transformation of the content asset selected above (one of the resources in the context of an MPEG-21 Digital Item) according to some criteria, if the available variation is not already adequate enough. This process typically requires a lot more computational effort than simple customization by selection, since for some content transformations heavy signal processing algorithms may have to be performed, often with a low delay. It is here proposed to cluster the various customization by adaptation solutions into three major classes: terminal and network driven, perception driven, and semantic driven.

It must be pointed out that the adopted adaptation solution may be a combination of several adaptation techniques, e.g. a summary with the goals in a football match at a lower bit-rate, which combines semantic abstraction with transcoding.
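To make the selection category concrete, here is a minimal, hypothetical sketch that picks the most adequate variation of a Digital Item under terminal and network constraints; the variation records and the fitting rule are illustrative, not part of MPEG-21.

```python
# Hypothetical content selection: pick the best-fitting variation of a
# Digital Item given terminal/network constraints from the UED.
from dataclasses import dataclass

@dataclass
class Variation:
    modality: str      # "video", "image", "audio", "text"
    width: int
    height: int
    bitrate: int       # bit/s

def select_variation(variations, max_width, max_height, max_bitrate,
                     allowed_modalities):
    """Return the highest-bitrate variation that fits every constraint,
    or None if adaptation (rather than selection) is needed."""
    fitting = [v for v in variations
               if v.modality in allowed_modalities
               and v.width <= max_width and v.height <= max_height
               and v.bitrate <= max_bitrate]
    return max(fitting, key=lambda v: v.bitrate, default=None)

variants = [Variation("video", 352, 288, 256_000),
            Variation("video", 176, 144, 64_000),
            Variation("image", 320, 240, 8_000)]
# A WAP-like terminal: small screen, low bandwidth, no video decoder.
print(select_variation(variants, 320, 240, 20_000, {"image", "text"}))
```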

3.1. Terminal and network driven adaptation

The consumption device and the network capabilities may pose a serious barrier to content distribution and consumption. This category of adaptations aims to reduce the content consumption requirements by adjusting the source content to the device and network capabilities. Major examples of terminal and network driven adaptation solutions are transcoding, transmoding and scalable coding.

This type of adaptation may be achieved through signal processing operations (e.g. in time, space or frequency) applied in the compressed domain (e.g. the DCT domain) or, at least, decoding the bitstream as little as possible to decrease the complexity of the process—typically called transcoding. When the terminal does not have the capability to consume certain media types, transmoding or modality conversion may be the solution; for example, it may imply the conversion of video to text or video to images to match the terminal decoding capabilities. Scalable coding techniques organize the content bitstream into consumption layers, which are truncated depending on the terminal/network resources, thus avoiding any expensive real-time signal processing operations.

3.1.1. Transcoding

Transcoding is intended to decrease the required content resources and thus match the available network/terminal consumption capabilities, keeping the same content modality. Examples are format conversion, temporal/spatial resolution reduction, removal of the higher frequency DCT coefficients, or requantization with a coarser quantization step. The relevant signal-processing algorithms will require different resources depending on how the transcoding technique is implemented:

* Uncompressed domain adaptation: This type of adaptation technique typically requires large resources in terms of memory and CPU (assuming the available content is coded). Even though this approach entails a lower implementation cost, it is quite inefficient, since the content must be decoded to the uncompressed domain to be adapted and then recompressed. Besides the high computational costs (notably for real-time implementations), this solution also implies a quality reduction, e.g. due to the accumulation of coding (quantization) errors. Format conversion using this technique entails a certain loss of quality, due to the encode–decode–encode processing chain.

* Compressed domain adaptation: Compressed domain adaptation techniques offer faster processing, consume fewer resources, and in principle provide better quality; however, this type of processing may be more complex in terms of implementation, since adequate transcoding algorithms have to be developed [35,11,7]. Format conversions in the compressed domain can also be more easily achieved when using similar coding formats, e.g. H.263 to MPEG-4 Simple profile.

Transcoding adaptation requires signal-processing techniques that may consume large resources, e.g. in terms of computational power and memory. Thus it may become quite expensive to achieve a large-scale adaptation deployment when many on-the-fly adaptations are required.
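As a toy illustration of one of the compressed-domain operations named above, the sketch below removes the higher-frequency DCT coefficients of an 8×8 block; it builds an orthonormal DCT-II basis directly with NumPy and is not drawn from any particular codec implementation.

```python
# Toy compressed-domain transcoding step: zero out the higher-frequency
# DCT coefficients of an 8x8 block to cut the coded data size.
import numpy as np

N = 8
k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] = np.sqrt(1.0 / N)               # orthonormal DCT-II basis matrix

def drop_high_frequencies(block, keep):
    """Keep only coefficients (u, v) with u + v < keep (zig-zag-like cut)."""
    coeffs = C @ block @ C.T             # forward 2-D DCT
    u, v = np.indices(coeffs.shape)
    coeffs[u + v >= keep] = 0.0          # remove the higher frequencies
    return C.T @ coeffs @ C              # inverse 2-D DCT

block = np.arange(64, dtype=float).reshape(8, 8)
print(np.round(drop_high_frequencies(block, keep=4), 1))
```

In a real transcoder this step would operate directly on the entropy-decoded coefficients, avoiding the full decode–encode cycle.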

3.1.2. Transmoding

When the usage environment conditions do not allow the content to be consumed with its original media types, a modality transformation can be used to fit the consumption capabilities in terms of the media types that can be consumed; for example, to transform text into audio, to extract key-frames from a video, or to use only the audio part of a movie. Fig. 5 shows a case where only one key-frame would be used for each segment, or the speech track would be converted into text. While the conversion of video to images may be rather simple, e.g., using a key-frame selection algorithm, other conversions such as voice to text may be much more complex if acceptable experiences are to be provided.

Fig. 5. Modality conversion: video to images and speech to text.
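A minimal sketch of the simple video-to-image case above, assuming the frames of one temporal segment are already available as arrays; the "frame closest to the segment average" rule is just one plausible heuristic, not the paper's algorithm.

```python
# Sketch of video-to-image transmoding: choose one key-frame per segment.
# Heuristic (illustrative only): the frame closest to the segment average.
import numpy as np

def key_frame(frames):
    """frames: list of HxW(xC) arrays belonging to one temporal segment."""
    stack = np.stack([f.astype(float) for f in frames])
    mean = stack.mean(axis=0)
    errors = [np.abs(f - mean).sum() for f in stack]
    return frames[int(np.argmin(errors))]

segment = [np.full((4, 4), v, dtype=np.uint8) for v in (10, 60, 200)]
print(key_frame(segment))   # picks the frame nearest the segment average
```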

3.1.3. Scalable coding

Scalable coding intrinsically assumes that the content shall be distributed through heterogeneous networks and consumed by heterogeneous terminals. Thus, the content format already provides several consumption layers, more or fewer depending on the granularity adopted, to satisfy different consumption resources, decoupling the coding and adaptation processes. Content customization can use two major techniques to process scalable content:

* Scalable content truncation: A scalable coding technique compresses the data in question into multiple consumption layers with more or less granularity. Typically, one of the compressed layers is the base layer, which can be independently decoded and provides a basic quality. The other layers are enhancement layers, which can only be decoded together with the base layer, therefore successively providing better quality. The complete bitstream (i.e., the combination of all the layers) provides the highest quality. Therefore, scalable content does not require much adaptation processing when the access conditions vary; it is just a question of truncating the total bitstream depending on the constraints to be imposed, e.g. a bit-rate limitation (a truncation sketch is given after this list). In Fig. 6(a), the various layers can be dynamically used according to the quality that the user resources allow for. The types of scalability (e.g., spatial, temporal, SNR), their granularity, as well as the efficiency of scalable coding techniques are today a hot research topic [37,2]. Considering the importance of the scalability concept in the context of the MPEG-21 framework, MPEG is currently involved in the development of a new scalable video coding standard which should provide fine grain scalability at almost no cost in terms of coding efficiency with regard to the most advanced non-scalable solutions, e.g. H.264/AVC.

* Scalable content with a bitstream syntax description: This type of technique is based on the use of an XML based description of the (scalable) content bitstream syntax. The bitstream XML description may follow a generic bitstream syntax description language [26], which supplies structures that can describe the bitstream syntax of the content to be processed, see Fig. 6(b). Whenever an adaptation has to be performed, the XML bitstream description is adapted instead of directly adapting the bitstream. Next, the transformed XML description is used to generate the adapted version of the bitstream by simply parsing the bitstream. If the bitstream syntax description language is generic enough, the adaptation engine can process the content without knowing its coding format. The MPEG-21 DIA specification [19] already defines the so-called Bitstream Syntax Description Language (BSDL) and the generic Bitstream Syntax Schema (gBS Schema). BSDL is a normative language based on XML Schema that makes it possible to design specific Bitstream Syntax Schemas (BSs) describing the syntax of particular (scalable) media resource formats, while gBS Schema enables resource format independent bitstream syntax descriptions (gBSDs) to be constructed. As a consequence, it becomes possible for resource format agnostic adaptation engines to transform bitstreams and associated syntax descriptions.

Fig. 6. Scalable content adaptation: (a) scalable content truncation; (b) scalable content adaptation based on its binary syntax description.
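As referenced in the first item above, the sketch below shows the essence of both techniques under strongly simplified assumptions: the "syntax description" is reduced to a list of layer records, and truncation keeps the base layer plus as many enhancement layers as the byte budget allows.

```python
# Simplified scalable-content truncation driven by a toy bitstream
# syntax description: (layer_name, offset, length_in_bytes) records.
def truncate_layers(bitstream, syntax_description, byte_budget):
    """Keep the base layer and as many enhancement layers as fit."""
    kept, used = [], 0
    for name, offset, length in syntax_description:
        if name != "base" and used + length > byte_budget:
            break                        # layers must be cut in order
        kept.append(bitstream[offset:offset + length])
        used += length
    return b"".join(kept)

stream = bytes(range(100))
description = [("base", 0, 40), ("enh1", 40, 30), ("enh2", 70, 30)]
adapted = truncate_layers(stream, description, byte_budget=75)
print(len(adapted))                      # 70: base + enh1 fit, enh2 dropped
```

A BSDL/gBSD-based engine works analogously, except that the description is a standardized XML document that is transformed (e.g. with XSLT) before the adapted bitstream is regenerated.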

Nowadays, video distribution in heterogeneous networks mostly uses multiple stream transmission (simulcast) or scalable content techniques, either temporal or spatial (e.g., the PacketVideo streaming server [25]), since real-time processing is really minimal in these cases.

3.2. Perception driven adaptation

Perception driven adaptation regards transformations according to some user preferred sensation, or assisting a user with a certain perception limitation or condition. The user perception of the content will differ from that of its original version, for example, to address the needs of users with visual handicaps, e.g. color blindness, or specific preferences, e.g. in terms of visual temperature sensations. This class of customization operations considers all types of adaptations related to some human perceptual preference, characteristic or handicap:

* Sensation based: The audio and visual sensations conveyed by the content can be modified according to user preferences. For example, consider a color temperature adaptation providing different sensations in terms of warm or cold color temperatures [16], or a spoken track for which the gender or emotion can be modified, as in story telling applications.

* Handicap assistance: Certain user handicaps may be compensated or minimized in terms of content consumption by content transformations such as image color transformations, text color changes, and text size increases. For example, a color blind person may need a transformation of the colors in the content into different levels of luminance.

* Natural environment: The natural (physical) environment may limit the user perception capabilities, and thus the user may need specific adaptations to maximize his/her access conditions. For instance, if a user is too far from the screen, the text font size may have to be increased; if a user is driving, his/her attention (and his/her eyes) is focused on the road, and thus an adaptation producing audio-only content may have to be provided.

Most of the human perception related adaptations must be performed in the compressed or uncompressed domain using adequate signal-processing techniques. As such, this type of adaptation is rather difficult to implement on a large scale (in a server) in real time due to the significant amount of computational resources that would be needed.
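As a crude stand-in for the color temperature adaptation mentioned above [16], the sketch below warms or cools an RGB image by rescaling the red and blue channels; the MPEG-7-based transformation actually used later in this paper is considerably more elaborate.

```python
# Crude perception-driven adaptation sketch: shift an RGB image toward
# a warmer (amount > 0) or colder (amount < 0) appearance by rescaling
# the red and blue channels. This only approximates the idea; it is not
# the MPEG-7 color temperature transformation.
import numpy as np

def shift_temperature(rgb, amount):
    """rgb: HxWx3 uint8 array; amount in [-1, 1]."""
    out = rgb.astype(float)
    out[..., 0] *= 1.0 + 0.3 * amount   # red up for warm, down for cold
    out[..., 2] *= 1.0 - 0.3 * amount   # blue down for warm, up for cold
    return np.clip(out, 0, 255).astype(np.uint8)

image = np.full((2, 2, 3), 128, dtype=np.uint8)
print(shift_temperature(image, +0.5))   # warmer variant of the image
```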

3.3. Semantic driven adaptation

Semantic adaptations involve the temporal and/or spatial reduction of a certain multimedia asset, e.g. of its temporal duration or number of regions of interest, implying a certain degree of semantic knowledge. For example, the adaptation may create a shorter duration variation of the structured content, usually called a summary, by selecting the temporal (and maybe also spatial) segments that are more relevant according to some criteria (e.g. user preferences).

To accomplish such customizations, the (spatial and temporal) content structure and the content semantics are essential data (available in a description) to perform any type of semantic adaptation. Semantic customization may be achieved in several ways, notably:

* Temporal summarization: The content is reduced to a shorter duration variation of the source by selecting only some of the content segments according to some (semantic) criteria [24]; sometimes explicit duration constraints are also imposed, e.g. a summary with the most violent moments of a movie but shorter than 2 min. Considering segmented content as shown in Fig. 7, where the temporal segmentation may be described using MPEG-7 tools, the customization engine will select the relevant segments, matching the content description with the user preferences, to create the most adequate audiovisual summary of the source material.

* Spatial/scene summarization: Semantic adaptation may also be applied to spatially segmented content (a scene or layout) [8]. A scene is composed of several individual (spatial) content segments, each of them with a semantic description [23]. The summarization of a scene may involve the evaluation of each individual spatial segment in terms of its semantic relevance for the concerned user, and the consequent customization by removing some of the spatial segments or by attributing them a lower quality or resolution if spatial summarization is combined with some kind of transcoding. Fig. 8 illustrates a scene (spatial) summarization example where several spatial segments were removed from the scene, thus summarizing it into a lower complexity scene more adjusted to the user interests.

Fig. 7. Summarization of temporally segmented content.

Fig. 8. Spatial/scene summarization example.

The examples in both Figs. 7 and 8 illustrate summarization by segment selection, but some adaptation of each individual segment may also be considered; for example, it is rather common to combine summarization with adaptation techniques such as segment spatial resolution reduction, segment quality reduction, or segment temporal resolution reduction.

The creation of summaries depends very much on the richness and precision of the MPEG-7 descriptions in terms of content structure and content semantics. This is not such a major problem for other types of adaptations, such as transcoding, because they typically depend on features that are very objective and almost always available in the content description, such as the coding format or the video spatial resolution.
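A minimal sketch of description-driven temporal summarization follows: segments carry semantic labels (as an MPEG-7 description might provide), and the best-matching segments are kept greedily under a duration constraint. The data layout and scoring rule are illustrative only.

```python
# Sketch of semantic temporal summarization: select described segments
# matching the user's preferred concepts, under a duration constraint.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # seconds
    duration: float
    labels: set         # semantic labels from the content description

def summarize(segments, preferred, max_duration):
    """Greedily keep the segments with most preferred labels that fit."""
    ranked = sorted(segments,
                    key=lambda s: len(s.labels & preferred), reverse=True)
    summary, total = [], 0.0
    for seg in ranked:
        if seg.labels & preferred and total + seg.duration <= max_duration:
            summary.append(seg)
            total += seg.duration
    return sorted(summary, key=lambda s: s.start)   # restore play order

clips = [Segment(0, 40, {"goal"}), Segment(60, 50, {"crowd"}),
         Segment(200, 45, {"goal", "replay"})]
print(summarize(clips, preferred={"goal"}, max_duration=90))
```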

4. MPEG-based Universal Multimedia Access system

In principle, content customization can be performed in three different places along the multimedia chain: (1) at the content server, (2) at a proxy server, and (3) at the end user terminal.


The customization of multimedia data at the terminal is typically not feasible due to the limited (terminal and network) resources normally available on most terminals, notably mobile terminals. However, there are cases where some degree of customization is performed at the end user terminal, for example, due to the associated advantages in terms of user preferences privacy.

Multimedia customization systems proliferate in the literature in the form of retrieval, summarization, adaptation or personalization systems; they all use one or a combination of the processing techniques described in the previous section. Also, there are systems which use previously extracted features (descriptions or annotations), while others extract them in real time. Several systems for UMA have already been described in the literature: [1,10,5,6,8,23,22,18,13,12] provide examples of such systems, many of them also using standard tools for some of their modules.

The systems presented in [1,10,5,6] target semantic video retrieval/temporal summarization applications. For that purpose, Adams et al. [1] view multimedia semantic analysis as a pattern recognition/machine learning problem, where the semantic concept to be detected is learned by a probabilistic network which detects the concept in certain shots. Features from the individual modalities (audio, video and text) are individually classified in an earlier stage, to be later combined for the detection of semantic concepts defined in a restricted lexicon. The system uses MPEG-7 descriptions created by tools that the authors made publicly available. A query-by-keyword retrieval application is used to access the video database. Joyce et al. [10] present an architecture for content retrieval and navigation using a four-layer data representation, where trainable agents create links between the layers, linking the content to semantic concepts. A query allows finding similar objects, the concepts related to an object, and the objects matching a certain concept. Chang et al. [5] describe a real-time semantic summarization system where the produced bit-rate is content-based and varies dynamically according to the event importance. The system discriminates the video segments by streaming video for the important segments, and just audio and still pictures for the less important segments. Finally, the approach adopted by Ekin et al. [6] is a deterministic one, where a semantic concept is detected when a set of predetermined cinematic features meet together. The detected concepts include goals (finding a sequence of shots with certain characteristics), the referee (based on the referee's corresponding size invariant shape in a close-up shot) and the penalty-box (finding some box lines). The user can access three types of summaries: slow-motion shots, goals, and an extension of the two with object-based features (using the penalty-box and referee detection results).

The spatial/scene summarization or content structure customization is the objective of the systems described in [8,23,22,18]. Hwang et al. [8] present a content structure/scene aware system which employs heuristics to adapt the content to mobile terminals (terminal display driven adaptations). This system does not use previously extracted or annotated descriptions; user preferences are also ignored. The authors propose two new heuristics: the "generalized outlining transform" to detect repeated scene patterns in a multimedia document, and the "selective elision" transform to selectively eliminate parts of the repeated scene patterns. In practice, these detected and eliminated scene patterns correspond to menus, tables, lists, etc., in a Web page. Nagao et al. [23,22] present a system based on an adaptation server and a description server which stores the descriptions by URL. The authors use three types of descriptions: linguistic (to describe text, e.g. noun, noun phrase, verb, adnoun or adverb), annotations (to comment non-textual elements) and multimedia (to describe video content). HTML gives the scene structure to the document, where each element (text, voice, image, and video) is identified and described by an external XML description. Mohan et al. [18] present a system with a single content server which concentrates all functions (adaptation and description server). This work proposes a content organization scheme, the InfoPyramid, which structures content variations in terms of their modality and quality; MPEG-7 provides a similar description tool to manage content variations. Both the Nagao and Mohan works perform text summarization, image transcoding, and video summarization (based on manual annotations), and perform scene summarization by converting tables to lists, evaluating the importance of each content segment in the overall document, and consequently removing or relatively adapting each content segment.

Some multimedia customization systems [13,12] dedicate a strong emphasis to the usage environment description. Ma et al. [13] present a system where no descriptions are available and consequently the server (origin or proxy) must extract both content and usage environment characteristics on-the-fly (e.g. bandwidth measurement, terminal characteristics discovery through the HTTP query). The system is composed of an adaptation module, an adaptation decision engine (also responsible for the content analysis), and a user/client/network characteristics discovery module, which uses several heuristics, e.g., looking for patterns in the HTTP headers, and monitoring user behavior (e.g. if the time between page requests is too small, it may be an indication of a link bottleneck, which can be compensated by adapting the content to reduce the data size). Lum et al. [12] propose a content adaptation system which is quite centered on the user context. A proxy-based architecture similar to the one in [13] is used to implement a decision engine. The authors devise a method to quantitatively measure the QoS of a content piece as an n-dimensional value. In the preprocessing stage, the decision engine creates a search space consisting of all the possible adaptation decisions (each decision corresponds to a node in the search space). Then each search space node is scored according to the user's preferred QoS. During real-time operation, the optimal node is located by a negotiation algorithm that examines the search space nodes based on the user terminal capabilities, network parameters, and content metadata.

Inspired by the systems presented above, Fig. 9 shows a rather generic architecture of a UMA system. This system, implemented in the context of this paper, deploys the customization engine at the content server and at a proxy server; it also includes a multimedia customization server and other elements more associated with the access to content. The overall system follows the MPEG-21 vision, where multiple Users interact with the purpose of providing the end user with the best possible multimedia experience. Each system element performs precise functions; no function overlap is present, meaning that other architectures can be derived from this one by accumulating functions into a single element or by further splitting any of the modules.

Fig. 9. UMA system simplified architecture.

The function of each module in the generic UMA system shown in Fig. 9 is:

* Content server: Acts as the content source.

* MPEG-7 description tool: Analyzes the multimedia content available at the content server and generates MPEG-7 descriptions that will be stored in an MPEG-7 description server (local or remote to this tool).

* MPEG-7 description server: Stores the MPEG-7 descriptions received from the MPEG-7 description tool; in this database, at least one MPEG-7 description exists for each piece of content. The MPEG-7 description tool consists of a content analyzer that scans the multimedia content to build MPEG-7 descriptions that will be used later by the customization server. The generated descriptions can be saved locally or posted (through an HTTP POST command) to an MPEG-7 description server. The MPEG-7 description tool implements the media description tools presented in Section 2.1.

* UMA browser: The browser used to access content, which also allows the user to create and manage his/her usage environment description. The implemented UMA browser is a Web browser test application, which allows creating, changing and managing several usage environment descriptions (in all of their four dimensions), thus simulating several content consumption conditions. The UMA browser sends the MPEG-21 DIA usage environment description to the UED server, which keeps a database of UEDs to be used by the content customization engine when it needs information regarding a certain usage environment.

* UED server: Stores the MPEG-21 DIA usage environment descriptions (UEDs) received from the UMA browser; provides the customization server with the UED for the usage environment in question.

* Customization server: Includes the application implementing the content customization engine (and the needed interfaces) required to provide the best experience to the user for the content he/she asked for.

In the application developed, the content server, the MPEG-7 description server and the UED server are Apache Web servers, which implement the HTTP POST command, allowing other applications to store descriptions in the server. The MPEG-7 description tool, the customization server, and the UMA browser were implemented specifically for this UMA system.
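Since the description servers are plain Web servers accepting HTTP POST, storing a description can be sketched with Python's standard library; the server URL and payload below are placeholders, not part of the described system.

```python
# Sketch: post an MPEG-7 description to a description server via HTTP
# POST, as the description tool does. The URL is a placeholder.
import urllib.request

def post_description(server_url, description_xml):
    request = urllib.request.Request(
        server_url,
        data=description_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status

# post_description("http://description-server.example/store",
#                  "<Mpeg7Description>...</Mpeg7Description>")
```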

The customization server implements the content customization engine plus the modules required to interface with a network. Several modules compose the implemented customization server, as can be seen in the architecture presented in Fig. 10.

Fig. 10. Customization server architecture.

The major modules in the customization server are:

* Network interface manager: Responsible for the network communications between the customization engine and the other UMA system modules. It is also responsible for retrieving the descriptions, both for the content and for the usage environment. When the processing of a request is complete, this module updates the caching tables with the new customized content and the corresponding descriptions. The caching tables are useful to enable the repurposing of previous adaptations in future similar (or almost similar) requests. It is this module that retrieves the content, either from the local disk (in a server configuration) or from a URL (in a proxy configuration).

* User request processor: This module translates other description formats, such as WAP UAProf (Wireless Application Protocol—User Agent Profile) descriptions and PocketPC descriptions (PocketPC uses HTTP headers to carry terminal related information, e.g. display resolution, decoding capabilities), into MPEG-21 DIA usage environment descriptions.

* Multimedia content description analyzer: Parses and validates the MPEG-7 descriptions into the platform's internal data memory structures.

* Usage environment description analyzer: Parses and validates the MPEG-21 DIA UED descriptions into the platform's internal data memory structures.

* Customization action decision: Matches the content description with the usage environment description and decides which customization solution must be adopted for the content in question, taking into account the available content processing operations (a minimal matching sketch is given after this list).

* Content customization: Performs the content customization operations decided by the previous module. It has several content customization modules to be used depending on the type of adaptation to be performed (see Section 3).
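As referenced in the list above, a minimal sketch of the matching step follows: given parsed description structures, the decision module emits an ordered list of customization actions. The thresholds, field names and action labels are invented for illustration.

```python
# Hypothetical customization decision: match a parsed content description
# against a parsed usage environment description and list the actions the
# content customization modules should execute.
def decide_actions(content, environment):
    actions = []
    if content["codec"] not in environment["decoders"]:
        if "image" in environment["decoders"]:
            actions.append("transmode: video -> key-frame images")
        else:
            return ["reject: no suitable modality"]
    if content["width"] > environment["display_width"]:
        actions.append("transcode: reduce spatial resolution")
    if content["bitrate"] > environment["bandwidth"]:
        actions.append("transcode: reduce bit-rate")
    if environment.get("preferred_temperature"):
        actions.append("perceptual: color temperature transformation")
    return actions or ["deliver unchanged"]

description = {"codec": "MPEG-4", "width": 352, "bitrate": 256_000}
ued = {"decoders": {"MPEG-4"}, "display_width": 240,
       "bandwidth": 64_000, "preferred_temperature": "warm"}
print(decide_actions(description, ued))
```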

The Customization action decision and the Content customization modules are interdependent, since the first must be aware of the adaptation methods available in the second in order to know which decisions can be executed. The content customization modules implemented perform transcoding as well as perceptual transformations for images and video content; these adaptation tools are not presented here in detail because they just reproduce state-of-the-art technology. However, the customization server is prepared to receive other customization modules for other data types or with other adaptation capabilities. The perceptual transformation implemented regards a color temperature transformation in the uncompressed domain [38]; this customization capability is based on the corresponding MPEG-7 low-level visual descriptor, which expresses a human perceptual preference in terms of color temperature [16]. This is one of the first MPEG-based adaptation systems using MPEG-7 low-level descriptions, since most available systems are only based on textual descriptions, e.g. the format for the transcoding, or the preferred genre for the summarization.

5. Experiments and conclusions

To test the MPEG-based multimedia customization system, the server can be configured as a proxy server or as a content server. The experiments reported here consider transcoding and color temperature perceptual transformation of images and videos.

5.1. Test-bed architecture

Fig. 11. Experimental test-bed (left) and PDA connected to the customization server through a UMTS terminal (right).

Fig. 11 presents the test-bed used, in a content server configuration, operated with Universal Mobile Telecommunications System (UMTS) and General Packet Radio Service (GPRS) terminals. The two configurations used present the following characteristics:

* Content server configuration: Has the advantage of easing dynamic (on-the-fly) content adaptation, because the content is locally available. If the author knows that the content will be adapted, and he/she has access to the content server, he/she can control the adaptation results. Consequently, the author can tune the adaptation process, either by providing transcoding hints in the content description (MPEG-7 allows this) or by providing appropriate variations. It is also necessary to consider content rights management and its implications: since the content has to be changed, the rights owner must allow it and may even have to approve the "re-authored" content.

* Proxy-based configuration: Since the content is not locally available, this configuration poses more technical difficulties. From the content server to the usage environment, several content customization proxies, with distinct behaviors, may exist in the network. Therefore, the tuning that an author could perform with the previous configuration may become impossible to perform here, since the author may have little control over the proxies. Content rights management may become even more critical with a proxy configuration.

5.2. Test material and usage environment scenarios

In order to evaluate the influence of large content (high spatial resolution and data size) on the content customization processing, the selected image content ranged from 320×240 with 8 bit/pixel to 1600×1200 with 24 bit/pixel, and the video content ranged from 176×144 pixels/frame at 64 kbit/s to 352×288 pixels/frame at 256 kbit/s. On the user side, several UEDs were used to simulate (using the UMA browser) WAP terminals and Pocket PCs, representing terminals and connections with different characteristics.

5.3. Lessons from the experiments

The terminal- and network-driven adaptation tests performed with the Pocket PC, the WAP terminal and the UMA browser showed that the same content could be delivered, after adaptation, to different terminals, thus reducing the maintenance and storage costs of keeping one content variation for each terminal type. Fig. 12 illustrates the same high-resolution image delivered to a PocketPC and a GPRS WAP terminal. In the PocketPC case, the source image is too big to fit its display (see Fig. 12(a)), resulting in a poor user experience and a long data transfer time. The customized content clearly improved the user experience (see Fig. 12(b)), and only the required data was transferred. For the WAP terminal in Fig. 12(c), the source content could not even be consumed by the terminal, since it is only able to display black and white images; thus, customization was the only way to access that piece of content. Similar processing is performed with video: screen size and network bandwidth are the input parameters for content transcoding.
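A minimal sketch of this parameter derivation follows; the field names and the 20% bitrate headroom are assumptions for illustration, not the deployed policy of the customization server.

# Minimal sketch of terminal/network-driven transcoding parameters.
# Field names and the 20% bitrate headroom are illustrative assumptions.

def transcoding_parameters(content: dict, ued: dict) -> dict:
    """Derive target resolution (and bitrate for video) for transcoding."""
    disp = ued["terminal"]["display"]
    # Fit the source inside the display while preserving the aspect ratio;
    # never upscale (scale capped at 1.0).
    scale = min(1.0, disp["width"] / content["width"],
                disp["height"] / content["height"])
    params = {"width": int(content["width"] * scale),
              "height": int(content["height"] * scale)}
    net_kbps = ued["network"]["max_bitrate_kbps"]
    if content["type"] == "video" and net_kbps is not None:
        # Leave some headroom below the nominal link rate.
        params["bitrate_kbps"] = int(0.8 * net_kbps)
    return params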

Fig. 12. Examples: (a) image not customized for a Pocket PC display resolution (only part of the image is seen); (b) image customized for a Pocket PC display resolution; (c) image customized for a black & white WAP terminal.

Fig. 13. Content variations, from left to right: RGB color space, 24 bit per pixel; gray color space, 8 bit per pixel; gray color space, 2 bit per pixel.


When a big gap existed between high-resolution source content and the usage environment consumption capabilities, the CPU was occupied for long periods by content processing. In these situations, the customization effort may be reduced by making available strategic variations with characteristics closer to the most popular terminal/network capabilities. Also, the reuse of previous adaptations, inserted in the content variations list, allows the customization server to benefit more from previous adaptation efforts; this policy may decrease the CPU occupation at the cost of a larger storage space. For example, a customization server may have several content variations at its disposal (e.g., Fig. 13 illustrates a source image and its variations in terms of bit depth) and, depending on the terminal, the best variation is selected as the starting point for the remaining adaptation (if needed).
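The selection policy just described can be sketched as follows, with illustrative attributes: pick the least rich variation that still covers the terminal capabilities, so the remaining adaptation is minimal.

# Minimal sketch of the variation selection policy; attributes are
# illustrative. The least rich variation that still covers the terminal
# is preferred, so the remaining adaptation work is minimal.

def select_variation(variations: list, disp: dict) -> dict:
    rich_enough = [v for v in variations
                   if v["width"] >= disp["width"]
                   and v["bits_per_pixel"] >= disp["bits_per_pixel"]]
    if rich_enough:
        # Cheapest-to-adapt variation that still covers the terminal.
        return min(rich_enough, key=lambda v: v["width"] * v["bits_per_pixel"])
    # Nothing covers the terminal: start from the richest variation available.
    return max(variations, key=lambda v: v["width"] * v["bits_per_pixel"])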

As referred to in Section 3.2, customization can also be employed to offer different sensations to the user, notably in terms of hearing and sight. Fig. 14 presents some color temperature adaptations, which provide images able to stimulate different visual sensations, e.g., a warmer or a colder sensation. The user specifies his/her preferred color temperature (e.g., by selecting an example image), which will be used in later adaptations. Other perception-based adaptations can provide different sensations, or even help people with disabilities to better access the content.

Fig. 14. Adaptation of an image to different color temperatures, from left to right: T = 4135 K, T = 7924 K, T = 11711 K.


Looking at the relative costs of a bit, namely the (low) bit storage cost, the (relatively low) bit processing cost and the (relatively high) bit distribution cost, the above results take on a different shape. These relative costs show that the existence of several content variations, as well as off-line content processing, is not critical, since storage and computational power (notably off-line) are not that expensive. However, the distribution cost is relatively high, which means that a typical UMA system should intensively exploit its storage and processing capabilities to compensate for the distribution costs and for on-the-fly content processing, thus distributing only the essential content, ideally prepared off-line rather than generated in real time, and never content that cannot be adequately consumed.
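A back-of-the-envelope computation makes this argument concrete. The unit costs below are purely hypothetical, chosen only to respect the ordering storage < processing << distribution.

# Hypothetical unit costs: storage cheap, processing cheap, distribution dear.
STORE_PER_MB   = 0.01   # keeping a variation around          (assumed)
PROCESS_PER_MB = 0.05   # transcoding one megabyte            (assumed)
SEND_PER_MB    = 1.00   # distributing one megabyte           (assumed)

source_mb, adapted_mb, requests = 5.0, 0.5, 1000

# A: always ship the full source and let the terminal cope.
cost_a = requests * source_mb * SEND_PER_MB
# B: transcode on the fly for every request, ship only what is needed.
cost_b = requests * (source_mb * PROCESS_PER_MB + adapted_mb * SEND_PER_MB)
# C: transcode once off-line, store the variation, reuse it for every request.
cost_c = (source_mb * PROCESS_PER_MB + adapted_mb * STORE_PER_MB
          + requests * adapted_mb * SEND_PER_MB)

print(cost_a, cost_b, cost_c)   # 5000.0 750.0 500.255

Under these (assumed) numbers, distribution dominates everything else, so keeping pre-adapted variations (option C) is the cheapest policy, in line with the argument above.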

6. Final remarks

This paper navigated through the standards, state-of-the-art tools, algorithms and systems for multimedia customization, concluding with a rather generic system architecture fitting the MPEG-21 multimedia framework. To close, a short discussion follows on the evolution of the most important parts of multimedia customization: content coding tools, content description and usage environment description.

Content delivery in heterogeneous environments, notably through terminal- and network-driven customization, may greatly benefit from scalable audiovisual coding techniques. There is today a variety of scalable coding standards available, notably JPEG and JPEG 2000 for images, and the MPEG-4 scalable profiles for audio and video. The MPEG-4 fine granularity scalability (FGS) solution for video is especially relevant due to the very fine adaptation capabilities it provides. The recently initiated MPEG-21 Part 13, called Scalable Video Coding, targeting the development of an advanced scalable video coding solution where, unlike in past solutions, coding efficiency is not affected, confirms the importance that industry and MPEG keep giving to this functionality.

Content description shows a different situation: the MPEG-7 standard is available, providing a vast variety of description tools, and no significant changes are foreseen. The great technical challenge is now the automatic mapping from low-level to high-level descriptions; multimodal processing and semantic networks are also very promising topics. On another front, the standard is still missing licensing conditions, and maybe because of that the industry has not yet embraced this powerful and still largely unexplored standard.

And finally, the last link in the content chain: the usage environment and the human user. Technical consumption restrictions in the usage environment (network, hardware or software) may be overcome with content customization, although the usability and the subliminal message the author intended for his/her original creation may suffer. It is also true that physical limitations such as bandwidth and processing power may become less relevant as time goes by. The usage environment and its description define the limitations and boundaries of multimedia customization: the richer they are, the better the adaptation results that can be achieved.

Usage environments also affect the user's predisposition, patience and, most importantly, the time available for consuming multimedia. So, according to the user's surrounding environment, different usability/interactivity/information design rules are used in content/service creation.


These design rules take into account that the user looks for different experiences in different places: the author creates content with a certain quality of experience (QoE) in mind, content to be consumed in a certain experiential environment [9]. With time, the elements along the content chain will give more relevance to human features (semantic and sensory) than before. The content chain will evolve to a higher level of abstraction: the author-intended experience (provided through the content) and the experiential environment will be the two ends of the future multimedia chain.

In the content delivery chain, UMA will evolve to a new arena where human factors (expressed through a QoE) will be the most conditioning aspect. UMA will then evolve to Universal Multimedia Experiences (UME) [27], where customization has to create the best experience for each experiential environment. The final goal of multimedia customization will then be the maximum preservation of the intended QoE, going beyond physical access and entering the world of human experiences.

References

[1] W.H. Adams, G. Iyengar, C.-Y. Lin, M.R. Naphade, C. Neti, H.J. Nock, J.R. Smith, Semantic indexing of multimedia content using visual, audio and text cues, EURASIP J. Appl. Signal Process. 2003(2) (February 2003) 170–185.

[2] Y. Andreopoulos, A. Munteanu, G. Van der Auwera, P. Schelkens, J. Cornelis, Scalable wavelet video-coding with in-band prediction: implementation and experimental results, Proceedings of the IEEE International Conference on Image Processing, Vol. 3, Rochester, New York, September 2002, pp. 729–732.

[3] A.B. Benitez, S.F. Chang, Multimedia knowledge integration, summarization and evaluation, Proceedings of the 2002 International Workshop on Multimedia Data Mining, in conjunction with the International Conference on Knowledge Discovery & Data Mining, Edmonton, Alberta, Canada, July 2002.

[4] S.F. Chang, The holy grail of content-based media analysis, IEEE Multimedia Mag. 9(2) (April–June 2002) 6–10.

[5] S.-F. Chang, D. Zhong, R. Kumar, Real-time content-based adaptive streaming of sports videos, IEEE Workshop on Content-Based Access of Image/Video Libraries, Hawaii, USA, 2001.

[6] A. Ekin, A.M. Tekalp, R. Mehrotra, Automatic soccer video analysis and summarization, IEEE Trans. Image Process. 12(7) (July 2003) 796–807.

[7] K. Homayounfar, Rate adaptive speech coding for Universal Multimedia Access, IEEE Signal Process. Mag. 20(2) (March 2003) 30–39.

[8] Y. Hwang, J. Kim, E. Seo, Structure-aware Web transcoding for mobile devices, IEEE Internet Comput. 7(5) (September–October 2003) 14–21.

[9] R. Jain, Quality of experience, IEEE Multimedia 11(1) (January–March 2004) 95–96.

[10] D.W. Joyce, P.H. Lewis, R.H. Tansley, M.R. Dobie, W. Hall, Semiotics and agents for integrating and navigating through multimedia representations, Proc. Storage Retrieval Media Databases 3972 (January 2000) 120–131.

[11] Z. Lei, N.D. Georganas, A rate adaptation transcoding scheme for real-time video transmission over wireless channels, Signal Process.: Image Commun. 18(8) (September 2003) 641–658.

[12] W.Y. Lum, F.C.M. Lau, A context-aware decision engine for content adaptation, IEEE Pervasive Comput. 1(3) (July–September 2002) 41–49.

[13] W.Y. Ma, I. Bedner, G. Chang, A. Kuchinsky, H.J. Zhang, A framework for adaptive content delivery in heterogeneous network environments, Proceedings of the SPIE/ACM Conference on Multimedia Computing and Networking 2000, San Jose, USA, February 2000, pp. 86–100.

[14] J. Magalhães, Universal access to multimedia content based on the MPEG-7 standard, M.Sc. Thesis, Instituto Superior Técnico, Lisbon, Portugal, June 2002.

[15] J. Magalhães, F. Pereira, Enhancing user interaction in UMA applications, Doc. ISO/MPEG M7312, MPEG Sydney Meeting, Australia, July 2001.

[16] J. Magalhães, F. Pereira, MPEG-7 based color temperature customization, ConfTele 2003, Aveiro, Portugal, June 2003, pp. 481–484.

[17] B.S. Manjunath, P. Salembier, T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, Wiley, New York, 2002.

[18] R. Mohan, J. Smith, C.-S. Li, Adapting multimedia internet content for universal access, IEEE Trans. Multimedia 1(1) (March 1999) 104–114.

[19] MPEG MDS Group, Multimedia framework Part 7: Digital item adaptation, Final Draft International Standard, Doc. ISO/MPEG N6168, MPEG Waikoloa Meeting, USA, December 2003.

[20] MPEG Requirements Group, MPEG-21 Multimedia framework, Part 1: Vision, technologies and strategy, Proposed Draft Technical Report, 2nd Edition, Doc. ISO/MPEG N6269, MPEG Waikoloa Meeting, USA, December 2003.

[21] F. Nack, M. Windhouwer, L. Hardman, E. Pauwels, M. Huijberts, The role of high-level and low-level features in style-based retrieval and generation of multimedia presentations, New Rev. Hypermedia Multimedia 7 (2001) 7–37.

[22] K. Nagao, Digital Content Annotation and Transcoding, Artech House, 2003.

[23] K. Nagao, Y. Shirai, K. Squire, Semantic annotation and transcoding: making Web content more accessible, IEEE Multimedia 8(2) (April–June 2001) 69–81.

[24] M.R. Naphade, T.S. Huang, A probabilistic framework for semantic video indexing, filtering and retrieval, IEEE Trans. Multimedia 3(1) (March 2001) 141–151.

[25] PacketVideo streaming server, http://www.packetvideo.com.

[26] G. Panis, A. Hutter, J. Heuer, H. Hellwagner, H. Kosch, C. Timmerer, S. Devillers, M. Amielh, Binary multimedia resource adaptation using XML bitstream syntax description, Signal Processing: Image Communication, Special Issue on Multimedia Adaptation 18(8) (September 2003) 721–747.

[27] F. Pereira, I. Burnett, Universal multimedia experiences for tomorrow, IEEE Signal Process. Mag. 20(2) (March 2003) 63–73.

[28] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Machine Intell. 22(12) (December 2000) 1349–1380.

[29] C.G.M. Snoek, M. Worring, Multimodal video indexing: a review of the state-of-the-art, to appear in Multimedia Tools and Applications, 2004 (http://carol.wins.uva.nl/~worring/pub/papers/snoek-review-mmta.pdf).

[30] Special Issue on Universal Multimedia Access, IEEE Signal Processing Magazine 20(2) (March 2003).

[31] Special Issue on Multimedia Adaptation, Signal Processing: Image Communication 18(8) (September 2003).

[32] R. Tansley, C. Bird, W. Hall, P. Lewis, M. Weal, Automating the linking of content and concept, Proceedings of ACM Multimedia, Los Angeles, USA, November 2000.

[33] P. van Beek, J.R. Smith, T. Ebrahimi, T. Suzuki, J. Askelof, Metadata-driven multimedia access, IEEE Signal Process. Mag. 20(2) (March 2003) 40–52.

[34] N. Vasconcelos, A. Lippman, A Bayesian framework for semantic content characterization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, USA, June 1998, pp. 566–571.

[35] A. Vetro, C. Christopoulos, H. Sun, Video transcoding architectures and techniques: an overview, IEEE Signal Process. Mag. 20(2) (March 2003) 18–29.

[36] Y. Wang, Z. Liu, J.C. Huang, Multimedia content analysis using both audio and visual clues, IEEE Signal Process. Mag. 17(6) (November 2000) 12–36.

[37] D. Wu, Y.T. Hou, W. Zhu, Y.-Q. Zhang, J.M. Peha, Streaming video over the Internet: approaches and directions, IEEE Trans. Circuits Syst. Video Technol. 11(3) (March 2001) 1–20.

[38] G. Wyszecki, W.S. Stiles, Color Science: Concepts and Methods, Quantitative Data and Formulae, Wiley Series in Pure and Applied Optics, 1982.