Saying What it Means: Semi-Automated (News) Media Annotation


Multimedia Tools and Applications, 22, 263–302, 2004. © 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.


FRANK NACK [email protected]
CWI, Amsterdam, The Netherlands

WOLFGANG PUTZ [email protected]
FHG-IPSI∗, Darmstadt, Germany

Abstract. This paper considers the automated and semi-automated annotation of audiovisual media in a new type of production framework, A4SM (Authoring System for Syntactic, Semantic and Semiotic Modelling). We present the architecture of the framework, describe a prototypical camera, a handheld device for basic semantic annotation, and an editing suite to demonstrate how video material can be annotated in real time and how this information can be used not only for retrieval but also during the different phases of the production process itself. We then outline the underlying XML Schema based content description structures of A4SM and discuss the pros and cons of our approach of evolving semantic networks as the basis for audio-visual content description.

Keywords: media production, media annotation, news production, MPEG-7, XML-Schema, news production tools

1. Introduction

Since the beginning of the 1980's the digitisation of the media domain has been proceeding rapidly with respect to storage, reproduction, and transportation of information. At the same time the notion of the 'digital' as the capability to combine atomic information fragments became apparent. The idea of 'semantic and semiotic productivity' allowing an endless montage of signs inspired a great deal of research in computer environments that embody mechanisms to interpret, manipulate or generate visual media [1, 5, 10, 12, 30, 34, 38, 40, 45, 51, 57, 58]. Similar developments have taken place for audible information [6, 7, 20, 42, 44, 50, 53].

However, the provided technology still follows the strains of traditional written communication by supporting the linear representation of an argument, resulting in a final multimedia product of context-restricted content. Conversely, the deeper impact of digital media is to redefine the forms of media, blurring the boundaries between traditional categories like pre-production, production, and post-production, and radically altering the structure of information flow from producers to consumers.

A media infrastructure supporting such aims requires that a given component of media exists independently of its use in any given production. Making use of such an independent media item necessitates the extraction of the relationship between the signs of the audio-visual information unit and the idea they represent [3, 8, 14, 18].

∗Before June 2001 the FHG-IPSI was named GMD-IPSI. The renaming is part of the merger between FHG and GMD.

Systems are therefore needed to manage independent media objects and representations for use in many different productions, with a potentially wide range of applications such as search, filtering of information, media understanding (surveillance, intelligent vision, smart cameras, etc.), or media conversion (speech to text, picture to speech, visual transcoding, etc.). Systems are also required for authoring representations that enable the creation of particular productions, such as narrative films, documentaries, interactive games, and news items. In other words, we not only need tools which allow people to use their creativity in ways they are already accustomed to, but in addition, tools that utilize human actions to extract the significant syntactic, semantic and semiotic aspects of media content [9], so that descriptions based on a formal language can be constructed.

The emphasis in this article is on the provision of tools and technologies for the semi-automated authorship of linear and interactive media production, based on the example domain of news. We chose this domain because a great variety of media is constantly generated, manipulated, analysed, and commented on. Moreover, the news production process itself represents a typical example of cross-organizational, multidisciplinary, and cross-site teamwork, which typically exemplifies the description of audio-visual material as an ongoing task-specific process.

The major claim of this article is that the growing amount of various media and their combinatorial use requires the annotation of media during its production. This approach differs from the traditional, retrospective-oriented research pursued by news indexing, where automatic transcription of news stories, speech recognition, image processing, detection of scene boundaries, face recognition, etc., are performed on already existing material [4, 16, 23, 28, 52]. Thus, we do not argue against the use of these technologies for applications such as news on demand, but we hope to show that the suggested progressive-oriented approach utilizing the A4SM framework (pronounced APHORISM—Authoring System for Syntactic, Semantic and Semiotic Modelling) is better suited to cope with the future expansion of mixed media information. In fact, we believe that the suggested knowledge structures and related technologies for the creation, manipulation, and archiving/retrieval of media material extend the currently discussed ideas of the Semantic Web [47] to better recognize the dynamic nature of audio-visual media as well as the variety of data representations and their mixes.

In the following article we first outline our workflow model of the news production process and present requirements for semi-automated digital news production, which are based on a three-month investigation at WDR (Westdeutscher Rundfunk—Germany's largest public broadcasting station) and HR (Hessischer Rundfunk—the broadcasting station of the federal state of Hesse), where we talked with and observed the work of reporters, cameramen, and editors during actual productions. We then describe the A4SM architecture, its semantic network based approach for data storage and management, and address its XML Schema-based representational structures. Then we illustrate the use of the framework through a number of tools we implemented for the news environment. Finally, we assess our achievements thus far and conclude with an overview of further work.


2. News production—One type of media production

Media production, such as for news, documentaries, feature films, interactive games, edutainment applications or virtual environments, is a complex, resource-demanding, and distributed process with the aim to provide interesting and relevant information by composing a multi-dimensional network of relationships between different kinds of audio-visual information units. Though the produced item, e.g., a news clip or a film, itself represents linearity, the process of making it is rather of a circular and organic nature, e.g., improvisatory methods are important. Hence, it is more likely that a holistic approach is applied in media production than the unidirectional movement of the manufacturing line. Nevertheless, for convenience reasons media production is traditionally arranged in three parts, i.e., preproduction, production, and postproduction. The activities associated with these phases may vary according to the type of production. For example, the methodology involved in the creation of dramatic film, documentary, and news programs is very different. Commercial dramatic film production is typically a highly planned and linear process, while documentary is much more iterative, with story structure often being very vague until well into the editing process. News production is structured for very rapid assembly of material from extremely diverse sources. Media information systems must be able to accommodate all of these styles of production, providing a common framework for the storage of media content and assembly of presentations.

Within the news production process the different phases represent the following aspects. The preproduction phase covers the identification of events and the schedule planning. The production part includes shooting and transmission of the news feed to the studio. The postproduction is directed towards editorial decisions based on reviewing the material, editing, sound mixing, broadcasting, and archiving.

It must be emphasised that the interrelationships between different stages within the process (e.g., the influence of personal attributes on decisions or the comparing of different solutions) are complex and extremely collaborative. The nature of these decisions is important because, as feedback is collected from many people, it is the progression through the various types of reviewers that affects the nature of the work.

Consequently, each of the different phases of news production provides important information on a technical, structural, and descriptional level. However, today's news production is mainly a one-time application for design and production. This means that primarily oral discussions on the choice of events or paper-based descriptions of activities on the set, production schedules, editing lists with decision descriptions, and organisational information will be lost after the production is finished. Ironically, it is this kind of cognitive content and context-based information that today's researchers and archivists try to analyse and re-construct out of the final product—usually with limited success.

Current professional IT tools even exacerbate information loss during production. These applications assist in the processes of transforming ideas into scripts (e.g., a text editor, news ticker application, etc.), allow digital recording (e.g., Sony's 24p Camera, Hdreel from director's friend), assist in digital/analogue editing (Media 100, Media Composer, FAST silver, etc.), or support production management (Media-PPS, SAP R/2, SESAM, etc.). Most of the available tools are based on incompatible and closed proprietary architectures. It is not easy to establish an automatic information flow between different tools, or to support the information flow between distinct production phases. The net result is little or no intrinsic compatibility across systems from different providers, and poor support for broader reuse of media content.1

Due to the most critical constraint within news production, i.e., time, it is in particular the flow of information that is of paramount interest. As a result of a knowledge elicitation among TV reporters, cameramen, and editors for news over three months at different broadcasting sites (WDR and HR), we identified that the optimisation of the workflow within a news department requires that incoming material, such as news feeds or transmissions from field reporters, as well as already processed and archived material, should immediately be retrievable by all members of the editorial staff.

To facilitate direct access to incoming feed, it has to incorporate a certain set of descriptive metadata, such as location, title, topic, content description, camera work or duration. Current news feeds provide this information (at some point in time) via file transfer. An automatic link of this information to the video material itself is in most cases not possible because the delivery time of this information is usually unspecified.

Direct access to archived material requires additional information. Here the emphasis lies on the provision of dynamic structures and implicit connections to establish statements, context and discourse. For example, a news item showing Bill Clinton and Monica Lewinsky only becomes of interest after their relationship changed into the context of 'scandal'. As a result we are forced to reconstruct the relations between existing materials on two levels. Firstly, on a physical basis, by integrating an existing piece of video into a newly created piece of newscast, using it perhaps as a temporal reference, and secondly on a representational basis, by modifying an existing descriptional unit with an additional relation. For the underlying ontological representations this means that they must facilitate the description of the physical world and abstract mental and cultural concepts, though only a shallow descriptional level will be sufficient for the 'micro world' of news. At the same time, the content representations must be related to the intrinsic structures of the media unit to be described, so that the translation between one representation and the other does not result in the loss of salient features of either representation. The challenge is to provide an environment that integrates the instantiation and maintenance of these dynamic structures into the actual working process. However, no such environment exists to date.

To improve this situation we propose a news environment that enriches the content from an early production state with relevant metadata and carries this enriched material to the newsroom. The proposed requirements are:

1. A device that automatically stores the acquired video stream together with an associated stream of relevant metadata like co-ordinates, camera work or lens movement.

2. An annotation tool for the reporter to provide real-time semantic annotation during acquisition, such as the contextual quality of images and the soundtrack, and that includes this information time-coded in the stream of metadata.

3. An on-site editing tool for the reporter that provides means for stratified annotation on segment level and that incorporates the edit decision list and the annotations into the metadata stream.


4. An architecture for fast and secure transmission of the metadata stream to a news provider or TV broadcaster.

5. An import and logging tool for the recipient that provides automatic recording, extraction of metadata, key frame extraction and generation of frame-accurate low-res proxies for browsing and rough cut.

6. A content management system that accepts all this information and allows immediate access to incoming material with only a minimum delay time.

7. A news ticker application that notifies the newsroom editors when new material is coming in and that allows them to query the metadata, preview material, rough cut on the low-res proxy, and download the results of the rough cut process into professional editing suites as well as directly into the air play system.

The rest of the article focuses in particular on the first three and the sixth requirement. We will now introduce A4SM and show how it can support the news production model.

3. News production and A4SM

Research in industry and academia has opened inroads to characterise audio-visual information not only on a conceptual level by using keywords, but also on a perceptual level by using objective measurements based on image or sound processing, pattern recognition, etc. [2, 13, 17, 27, 29, 31, 32, 46]. The same research paradigm influenced the work on automated indexing of news material, as described in the introduction. However, the problem with these approaches is that they merely use low-level perceptual descriptors with limited semantics for content representation (approaches covering several semantic levels are described, among others, by [11, 21, 39]). The most striking difficulty with such retrospective-exclusive approaches is that they use the final media product, which does not provide important cognitive, content and context based information, such as editing lists with decision descriptions. This type of semantic information, which is required to enrich the automatically extracted information, would necessitate manual annotation—an expensive endeavour normally not covered by the production or archival budget.

The aim of A4SM, as described in figure 1, is to overcome these limitations by providing a distributed digital environment supporting the creation, manipulation, and archiving/retrieval of audio-visual material during and after its production. We intend to apply the framework to all sorts of media production, such as for news, documentaries, interactive games, edutainment applications or virtual environments, even though we relate it at the moment only to the domain of news.

The environment is based on a digital library of consistent data structures for media items and related content descriptions (described in Section 4), around which associated tools are grouped to support the distinct phases of the media production process (described in Section 5). Each of the available tools forms an independent object within the framework and can be designed to assist the particular needs of a specialised user. It is essential for our developments that the tools should not put extra workload on the user—she should concentrate on the creative aspects of the work in exactly the same way as she is used to.


Figure 1. The A4SM architecture.

Nevertheless, each tool relies on the consistent data structures, which guarantee the composition of a multi-dimensional network of relationships between different kinds of information units.

The flexibility of the A4SM framework becomes clearer when it is applied to the news domain. In news, for example, time constraints and the unpredictability of events do not allow scripting at the level known from the production of documentaries or feature films. Thus, complex scripting tools as suggested by [35] and [37] for commercial dramatic films or interactive stories are not applicable here. However, there are mental concepts being developed on the way to and on the set, and most of them are to some extent reflected in the gathered material. The established syntactic and semantic structures still need to be made accessible. If there is no material available from the preproduction, then we have to gather relevant meta-data during the production phase, which requires particularly tailored news tools providing the appropriate working interface by utilizing available schemata in the repository.

Based on the discussion with reporters, cameramen, and editors, we not only identified a set of required data structures and associated relations (see Section 4) but also developed tools for efficient IT support during production and post-production (see Section 5).


Before we introduce the particular tools, the interoperability between them, and their connection to the database, we would first like to describe the general concepts and assumptions of the repository and our representation schemata for news.

4. A4SM repository

Looking at the phases of news production it becomes apparent that the audio-visual material undergoes constant changes, e.g., from shooting to editing, where parts of the material usually become reshaped on a temporal and spatial basis. Annotating is a dynamic and iterative process that maps the not strictly pre-structured process of perceiving a concept. Every degree of cognition might be illuminative for other interests and should remain accessible.

This dynamic use of the material has a strong influence on the descriptions and annotations of the media data created during the production process. There will be temporal gaps in annotations, we will see overlaps, double or triple annotations, etc., or in other words, the annotations will be incomplete and change over time.

As a result it is important to provide semantic, episodic, and technical representation structures with the capability to change and grow. This also requires relations between the different types of structures with a flexible and dynamic ability for combination. To achieve this, media annotations cannot form a monolithic document but must rather be organised as a network of specialised content description documents.

4.1. General concepts in A4SM’s representation structures

To facilitate the dynamic use of audio-visual material, A4SM's general attempt at content description applies the strata-oriented approach [1] in combination with the setting concept [41]. The usefulness of combining these two approaches results in gaining the temporality of the multi-layered approach without the disadvantage of using keywords, as keywords have been replaced by a structured content representation, which comprises not only textual descriptions but also descriptions of image decomposition based on feature extraction.

The general structure of our content representation understands each description as a layer. The connection between the different layers and the data to be described (e.g., the actual audio, video or audio-visual stream) is realised by applying a triple identifier, which indicates the media identifier, the start time and the end time. For example, an actor may perform a number of actions in the same time span. The temporal relation between them can be identified using the start and end point with which those actions are associated. In this way, complex structured human behaviour or spatial concepts can be represented and hence the audio-visual material retrieved on this basis. Figure 2 shows a layered description of a shot consisting of a hundred frames, featuring the actions of a single character.

The horizontal lines in figure 2 denote actions, whereas the vertical lines delimit the various action-based content descriptions that can be extracted from this shot. Applying this schema to all descriptive units enables the retrieval of particular material with no restrictions on the complexity level of a query.


Figure 2. Actions annotated in layers in a 100 frame shot.

Figure 3. Relevant shot segment for a query for all three actions.

Take the simple example described in figure 2. If there is a need for a character who eats, sits and talks simultaneously, we are now in the position to isolate the essential part of the shot, as shown by the dotted area in figure 3.
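The following minimal sketch (ours, in Python, not part of the A4SM implementation) illustrates how the triple identifier makes such a query computable: the layered annotations are intersected on their frame ranges to isolate the sub-segment in which eating, sitting and talking overlap. The clip id and the frame ranges are invented for illustration.

    from typing import List, Optional, Tuple

    Annotation = Tuple[str, int, int, str]   # (media id, start frame, end frame, action)

    def intersect(layers: List[Annotation], media_id: str,
                  required: List[str]) -> Optional[Tuple[int, int]]:
        # Return the frame range in which all required actions overlap, or None.
        lo, hi = 0, 10**9
        for action in required:
            spans = [(s, e) for m, s, e, a in layers if m == media_id and a == action]
            if not spans:
                return None               # action never annotated for this clip
            s, e = spans[0]               # simplification: one span per action layer
            lo, hi = max(lo, s), min(hi, e)
        return (lo, hi) if lo <= hi else None

    # A 100-frame shot annotated in three layers:
    shot = [("clip01", 0, 100, "sit"), ("clip01", 10, 60, "eat"), ("clip01", 40, 80, "talk")]
    print(intersect(shot, "clip01", ["eat", "sit", "talk"]))   # -> (40, 60)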

The described flexible organisation of media content descriptions and the related media data, i.e., the video or audio stream, requires not a closed hierarchical structure, as we know it from document structures, but rather an adaptable construction in the form of a semantic network.

This concept emerged from research in knowledge representation [9, 19, 49] and features three significant functions, which make it suitable as a platform for supporting the needs of media production:

• It provides semantic, episodic, and technical memory structures (i.e., information nodes) with the capability to change and grow, allowing an ongoing task-specific process of inspection and interpretation of source material.

• It facilitates the dynamic use of audio-visual material using links, enabling the connection from multi-layered information nodes to data on a temporal, spatial and spatio-temporal level. Moreover, since the description of media content holds constant for the associated time interval, we are now in the position to handle multiple content descriptions for the same media unit and also to handle gaps.

• It enables the semantic connection between information nodes using typed relations, thus structuring the information space on a semantic as well as syntactic level.

The following sections explain in more detail how A4SM for News supports these different functions. We begin with nodes and then outline our handling of links and relations.


4.2. Nodes in A4SM for news

We distinguish in A4SM between three types of nodes, i.e., data nodes (D-nodes), content description nodes (CD-nodes), and conceptual annotation nodes (CA-nodes). In the following discussion we mainly focus on D- and CD-nodes.

A D-node represents physical audio-visual material of any media type, such as text, audio, video, 3D animation, 2D image, 3D image, and graphic. The size, duration, and technical format of the material is not restricted, nor are any limitations present with respect to the content, i.e., number of persons, actions and objects. A data node might contain a complete film or merely a scene. The identification of the node, required for the linking between descriptional data and media material, is realised via a URI.

A CD-node provides information about the physical level of the data, such as shape, colour, light, focus, distance (spatial interpretation) or movement. The content of such a node can be based either on natural language or on features or measurements. For example, a node might describe the lens movement, lens state, camera distance, movement, position, angle, and production date for one shot only.

A CA-node provides high-level descriptions, which are usually difficult to retrieve using automatic extraction methods. This type of node is usually created manually.

Besides the described functionality, each node is best understood as an instantiated schema. The available number of node schemata is restricted, thus indexing and classification can be performed in a controlled way, whereas the number of provided nodes in the descriptional information space might consist of just one node or up to n nodes.
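As a reading aid, the three node kinds can be pictured as instantiated schemata that share nothing beyond an identifier, which is what links and relations later refer to. The following Python sketch is our own illustration, not code from the framework:

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class Node:
        uri: str                                  # identifier used by links and relations

    @dataclass
    class DNode(Node):                            # physical audio-visual material
        media_type: str = "video"                 # text, audio, video, 2D/3D image, ...

    @dataclass
    class CDNode(Node):                           # physical-level content description
        features: Dict[str, str] = field(default_factory=dict)   # e.g. {"focus": "734"}

    @dataclass
    class CANode(Node):                           # high-level conceptual annotation
        text: str = ""                            # usually entered manually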

The obvious choice for representing CD- and CA-nodes, each of them describing audio-visual content, would have been to use the DDL of MPEG-7 [25] or schemata suggested by MPEG-7 [26].

The objective of the MPEG-7 group [24] is to standardise ways to describe different types of multimedia information. The emphasis is on audio-visual content, with the goal to extend the limited capabilities of proprietary solutions in identifying content by providing a set of description schemata (DS) and descriptors (D) for the purpose of making various types of multimedia content accessible. In this context a DS specifies the structure and semantics of the relationships between its components, which may be both descriptors and description schemata. A descriptor defines the syntax and the semantics of a distinctive characteristic of the media unit to be described, e.g., the colour of an image, pitch of a speech segment, rhythm of an audio segment, camera motion in a video, style of a video, the actors in a movie, etc. Descriptors and description schemata are represented in the MPEG-7 Description Definition Language (DDL). The current version of the DDL is based on XML Schema [56], providing means to describe temporal and spatial features of audio-visual media as well as to connect these descriptions on a temporal-spatial basis within the media. For more details on MPEG-7 see [36].

In the end we decided not to use any of the MPEG-7 description schemata. There are mainly four reasons for that decision. First, MPEG-7 is hierarchy centred. This means that a description of data in MPEG-7 is understood as one document that itself applies a tree structure.2 The schemata for this document type are fixed and cannot be altered. This linear approach is not astonishing, since efficient access and retrieval was and still is the driving development force.


However, this approach is far too restrictive, as any form of annotation is necessarily imperfect, incomplete and preliminary, because annotations accompany and document the progress of interpretation and understanding of a concept. Far better support for the described production and use of meta-data for audio-visual data in news is provided by graphs, which form the basis of semantic networks.

The second reason why we chose not to use the provided MPEG-7 schemata is the conceptual idea in MPEG-7 of two general description types: on the one hand there are complete descriptions (using MPEG-7 Main as root element) and on the other hand there are partial description units (using MPEG-7 Unit as root element). The distinction between a complete and a fragmental description is purely academic and adds an unnecessary extra level of descriptional complexity.

Thirdly, we opted against the use of the suggested MPEG-7 schemata due to the great number of MPEG-7 schemata, not so much because they are too many (this is unavoidable), but because of their interlocked nature, which makes the use of schemata in isolation difficult.

Finally, it has also become increasingly clear that there is a need for a machine-understandable representation of the semantics associated with MPEG-7 DSs and Ds to enable the interoperability and integration of MPEG-7 with metadata descriptions from other domains. We are aware of the fact that RDF Schema [43] would actually be an even better approach for describing relation-based semantics than XML Schema. However, at the time of writing this alternative is merely under discussion in MPEG-7 [22]. The authors would highly welcome the inclusion of RDFS but will, for conformity reasons, wait to adopt an implementation until decisions are made in the MPEG-7 standard.

For our news environment we developed a set of 18 schemata, which provide information on a denotative and technical level. The design of the technically oriented schemata fulfils two aspects: these schemata can be automatically or semi-automatically instantiated, and they describe technical information that in itself offers semantic value. For example, a zoom-in usually focuses the viewer's attention on something important. The description schemata we developed and use in the 'A4SM for News' implementation are described below:

Newscast: high-level organisation scheme of a newscast, containing references to all related news clips and moderations.
Newsclip: high-level organisation scheme of a news clip, containing all references, such as links to relevant annotations and relations to other clips.
Link: link structure describing the connection between a description scheme and the AV material to be described (data).
TSR: relative (to a given link) temporal or spatial reference to the data.
Relation: structure describing the relation between descriptions.
Formaldes: formal information about the news clip, such as broadcaster, origin, language, etc.
Bpinfo: production and broadcasting information, e.g., when the clip was broadcast (produced), on which channel, etc.
Subjective c: subjective description of an event, such as comments of the audience.
Mediadevice: media-specific technical information of the data, e.g. lens state, camera movement, etc.
Person: persons participating in the production of the clip, such as reporter, cameraman, technicians, producer.
Event: the event covered by the description.
Object: object existing or acting in the event.
Character: the relevant character.
Action: action of an object or character.
Dialogue: spoken dialogues and comments of the event.
Setting: setting information of an event, such as country, city, place, etc.
Archive: archiving value of the news clip according to its content and composition.
Access: access rights info, IPR, rights management of the clip.

The XML Schema representations of the above 18 schemata can be found in Appendix A at the end of this paper.

With these 18 structures it is possible to access material on the content level (e.g., search for a person, situation, or place in a newsclip) and on the organisational level (show clips from a particular reporter, or show clips related to the clip based on a relation type), as will be demonstrated in Section 5, when we describe tools that apply our schemata.
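To make the two access paths concrete, the following sketch (our illustration; the field names are invented, the normative structures are the XML schemata in Appendix A) shows content- and organisational-level queries over instantiated newsclip descriptions:

    from typing import Dict, List

    def clips_by_reporter(newsclips: List[Dict], reporter: str) -> List[str]:
        # Organisational level: clips whose person descriptions name the given reporter.
        hits = []
        for clip in newsclips:
            for person in clip.get("person", []):        # instantiated person descriptions
                if person.get("role") == "reporter" and person.get("name") == reporter:
                    hits.append(clip["id"])
                    break
        return hits

    def related_clips(newsclips: List[Dict], clip_id: str, relation_type: str) -> List[str]:
        # Follow typed relation entries from one clip to other clips.
        clip = next(c for c in newsclips if c["id"] == clip_id)
        return [r["target"] for r in clip.get("relation", []) if r.get("type") == relation_type]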

4.3. Links and relations in A4SM for news

As pointed out in Section 4.1, we organize all meta-information about the actual audio and video stream in the form of a semantic network. Figure 4 describes a possible network of A4SM descriptions for news clips, which are represented by the rectangular boxes (i.e., D-nodes).

Figure 4. A4SM description network.


Figure 4 also shows the two ways of annotating the data, either as part of a complete newscast (the upper media stream, represented in the form of a rectangle), or as single clips, as portrayed for the media stream on the right side. It is important to mention that different annotation networks can be related to each other (e.g., a newsclip about 'Clinton at a press conference' refers to another clip from an older newscast showing 'Mr. Clinton and Ms. Lewinsky').

Hence, we require two sorts of connections attached to nodes:

• Connection between data and descriptional nodes (link).
• Connection between the three different types of nodes themselves (relation).

These two elements form the syntax of the semantic network and will be explained below.

A Link enables the connection from a description to data on a temporal, spatial and spatio-temporal level. The structure of a link in A4SM looks like the one displayed in figure 5 (for the XML Schema definition see Appendix A).

Figure 5. Structure of a link.

A DS that provides links we call a 'hub'. The hub is actually the best potential entry point into the network. Moreover, it is the hub where 'absolute addresses' and 'absolute time' are determined. For our news environment we provide two types of description schemata that can hold links, i.e., the newscast-DS and the newsclip-DS. A newscast-DS always behaves as a hub. This means that all its temporal and spatial references are absolute, whereas the references in the associated clips, organised in newsclip-DSs, are referential. If a newsclip-DS acts as a hub (see the right clips within figure 4), then its temporal and spatial references toward the media are absolute (note that the annotation algorithm is aware when to use which temporal or spatial representation). As we will see later, the instantiation of links in A4SM for News is always performed automatically (see Section 5 on tools below).
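The division of labour between hub and referential descriptions can be summarised in a small sketch (ours; frame-based arithmetic is used for brevity, whereas A4SM stores SMPTE-style timecodes): a reference stored relative to a clip is resolved against the absolute entry point kept in the hub.

    def to_absolute(hub_start: int, relative_in: int, relative_out: int) -> tuple:
        # Map a clip-relative time range onto the absolute time of the hub's media stream.
        return hub_start + relative_in, hub_start + relative_out

    # A clip starting 1500 frames into the newscast: an annotation covering frames
    # 40..90 of the clip maps to frames 1540..1590 of the newscast stream.
    print(to_absolute(1500, 40, 90))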

A Relation enables the connection of descriptors within descriptions as well as connections between distinct descriptions. The structure of a relation in A4SM looks like the one displayed in figure 6 (for the XML Schema definition see Appendix A).


Figure 6. Structure of a relation.

There can be up to n relations between two nodes. To provide coherent semantics, it is necessary to define the domain-relevant relation types. As a result of our knowledge elicitation we defined the following for the news environment:

Events: follows, precedes, must include, supports, opposes, introduces, motivation, conflict-resolution, evidence, justification, interpretation, summary, opposition, emotional.

Character, Setting, Object: part-of, synonym, association, before, equal, meets, overlaps, during, starts, finishes.

Relations will be instantiated in a semi-automated way during the production process (see Section 5: Tools).
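As a hedged illustration of what such an instantiation has to carry (the actual relation-DS is an XML document, see Appendix A; the URIs below are invented), a typed relation records little more than its two endpoints and a type drawn from the controlled vocabulary:

    EVENT_RELATIONS = {"follows", "precedes", "must include", "supports", "opposes",
                       "introduces", "motivation", "conflict-resolution", "evidence",
                       "justification", "interpretation", "summary", "opposition", "emotional"}

    def make_relation(source_uri: str, target_uri: str, rel_type: str) -> dict:
        # Instantiate a relation record after checking the type against the vocabulary.
        if rel_type not in EVENT_RELATIONS:
            raise ValueError("undefined relation type: " + rel_type)
        return {"source": source_uri, "target": target_uri, "type": rel_type}

    rel = make_relation("newsclip01", "newsclip02", "follows")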

Our current implementation organises the annotation network in the form of a file system. This means that we have to control the data management (i.e., storage and retrieval of metadata as well as the connection between data and annotations) as well as the validation of documents ourselves.

The different XML schemata, which represent the structures of potential nodes in the network, are defined using XMLSpy Version 3.5. This tool supports the XML Schema version from October 2000 [55] and also converts older versions of XML Schema into the current version. Moreover, through a scripting interface this parser allows its adaptation to user-specific applications.

We would prefer having the structures stored in an XML-based object-oriented database (see, among others, Tamino from Software AG and dbXML from the dbXML Group, and the XML-enabled databases from IBM–XML Extender, Informix Internet Foundation, 2000, Oracle–XML Developer's Kit, etc.). At the moment, however, none of them performs storage and retrieval in an acceptable access time.

Based on the above discussion of the general structures of our semantic network based repository, we now describe how the structures are instantiated in A4SM. For that we will introduce the tools we have been developing over the last couple of years for the news domain.


5. The tools

Over a period of three months we observed and discussed with reporters, cameramen, and editors of two German broadcasting stations their way of producing audio-visual news, aiming to identify when and what sort of annotational data they produce and what kind of IT support could help them perform their work more efficiently.

As mentioned above, the most critical constraint within news production is time, and thus the aim was to support the actual production process in such a way that no extra burden is put on the involved specialists by adding the annotations during production. As outlined in Section 3, we don't find physical scripting in news due to time constraints and the unpredictability of events. Thus, we had to come up with solutions for merely two production phases, i.e., the production itself and the postproduction, which had to cover important pre-production information as well.

Within the production phase it is the acquisition of material that can be improved by supporting the collaboration between reporter and cameraman. The common procedure for this process is that the general concept for the news clip is designed on the way to the location of the event to be portrayed. Refinements of the concept might be performed at the location. Thus, there is a need for a set environment within which the reporter can annotate structural (e.g., scene id, etc.) and content information (e.g., importance of a shot with respect to audio or visual elements) while the cameraman is shooting. As a result we designed and developed a hard disk camera that automatically stores the acquired video stream together with relevant information, such as co-ordinates, camera work or lens movement, in the form of an associated network of instantiated XML Schema structures (see Section 5.1). Additionally we provided a mobile handheld annotation tool for the reporter to allow for real-time annotation during shooting on a basic semantic level, i.e., capturing in and out points for sound and images (see Section 5.2). The decision in favour of a handheld device instead of a voice-driven gadget derives from our observation that voice usually interrupts the flow of a reportage or interview. For example, if the reporter interviews a politician and finds the provided information mediocre, it would not be helpful to mumble that into a microphone. With the handheld, the semantic note can be placed without revealing the personal judgement to the conversation partner.

Post-production, in which the recorded material is made ready for telecasting, calls for support of the collaboration between reporter and editor. During the interviews at WDR and HR a great number of reporters mentioned that it would be excellent to have a simple editing suite in the form of a laptop-based application, so that they do not have to rely on an editor and can increase the topicality of their work. However, the tool should facilitate the provision of additional information so that an editor at the broadcasting station is able to adjust the material in an appropriate way, if required. Thus, we designed and prototyped an on-site editing suite for a reporter that allows editing of the material, provides means for stratified annotation on segment level, and incorporates the edit decision list as well as typed-in annotations into the XML Schema description structure (see Section 5.3).

We now describe the different tools we developed in our lab in more detail. We start with the camera and the handheld device and then show how the generated annotations of these tools are used during the post-production phase for finalising the news unit. The section will conclude with an evaluation of the gained insights.

5.1. The camera

The aim of the Digital Camera is to provide, besides recorded MPEG-1 or MPEG-2 video, metadata associated with the video that is helpful for machine-assisted postproduction (see Section 5.3). Figure 7 portrays a real-size design study of the camera.

Figure 7. Design study of the camera and MPEG-7 Handheld device.

The metadata describes image capture parameters, such as

• lens movement: a zoom-in usually portrays increasing importance, whereas a zoom-out usually manifests a decrease of importance

• camera distance: the image frame (e.g., extreme close-up, close-up, medium, medium long, long, extreme long) is required for supporting automated editing, as described in Section 5.3

• camera position: the point of view (e.g., left, middle, or right) is required for the automated editing as described in Section 5.3

• camera angle: the angle is usually an indicator of the importance of a scene

• shot colour: the colour in combination with a high-level concept (such as joy or sorrow) can help to set the emotional tone of a sequence.

It is this technology that manifests itself in the medium's unique expressiveness [8, 14]. Figure 8 demonstrates the interface in our lab setting for determining which camera settings are traced, i.e., data about camera movement (pan and tilt), lens action (zoom and focus), shutter (light condition), gain, and iris position. Moreover, we collect information about the spatial position of the camera. For the latter we use a magnetic tracker. The image in figure 8 portrays the current video being shot and thus represents what the cameraman would see in the viewfinder.


Figure 8. Interface for the camera handling.

The limited information units were chosen to demonstrate the general access and archival mechanisms. A complete set of camera descriptors might be closer to the parameters suggested by the Dynamic Data Dictionary [48] and by the GMD virtual studio [15].

The actual hardware components used in our demonstration environment are:

• Polhemus FastScan Tracker,
• Sony EVID-30 Camera,
• a Videodisc Photon MPEG-1 Encoder.

The software analysing the data provided by the camera setting, instantiating the relevant XML Schema, and finally establishing the position of the schema in the network (i.e., updating the reference list in the relevant newsclip-DS) is programmed in C++ (MS Visual C++ 5.0) and runs under Windows 98.

As mentioned in Section 4, it is the aim of the A4SM framework to provide as much automatic annotation as possible. The areas with the most potential are those of technical descriptions. We use the camera to show how the instantiation of those annotations, and of the spatio-temporal identification marks for audio-visual data, is performed, based on a linking mechanism using time-codes and region-codes. The advantage of the generic approach is that the expert, here the cameraman, can concentrate on his tasks, i.e., image composition, framing, etc., without being concerned about storage organisation or general presentation.


Before the cameraman starts shooting, he establishes the set of features that should be annotated during filming. Due to the particular requirements of productions it is important to provide a larger (e.g., for feature films) or smaller (e.g., for news) set of automated annotations. Moreover, he also has to specify his personal data, which will be added to every produced shot in the form of an instantiated access-DS. Finally, it is also required to provide the dates and location info, which will be exploited in the instantiations of bpinfo-DS, formal-DS, and setting-DS when required (for details on the description schemata see Appendix A).

For this information gathering task the cameraman can either use an interface provided by the camera, as suggested in figure 8, or, as recommended in the design study, a programmable code-card, which is installed in the camera before shooting starts. The code-card is the option preferred by cameramen, though we do not provide it in our laboratory setting.

Once the camera is activated, it first retrieves the scene id from the handheld (see Section 5.2). This information is required for establishing the link between the actual material produced (the MPEG-1 or MPEG-2 video) and the related hub structure (newsclip-DS). The actual link, adopting the X-Link structure [54], is represented in the link-DS, of which a reference is kept in the newsclip-DS. The camera also instantiates a formal-DS and a setting-DS. When the camera is recording the digital video in MPEG format, the annotation algorithm polls every 20 ms for changes on the relevant image capture parameters. In case a change is detected, a mediadevice-DS structure will be instantiated with the start and end time of the 'setting change' in SMPTE notation (hh:mm:ss:msms), the parameter type, e.g., zoom, and its description value. If the camera requires a longer time span than 20 ms to capture a setting change, the end time will be entered after the first unsuccessful poll (the algorithm corrects the temporal delay automatically). Once a node is completely instantiated, a reference to the schema will be made in the newsclip-DS.
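A minimal sketch of this polling cycle (our reconstruction in Python, not the C++ implementation; read_params, is_recording and timecode are assumed helpers) shows how a detected change opens a mediadevice-like record and how the record is closed on the first poll in which the parameter is stable again:

    import time

    def poll_capture_parameters(read_params, is_recording, timecode, newsclip_refs, poll_ms=20):
        # read_params() -> current parameter values, e.g. {"zoom": 120, "focus": 734}
        # is_recording() -> False once the camera stops; timecode() -> current SMPTE time
        previous = read_params()
        open_records = {}                  # parameter name -> record awaiting its end time
        while is_recording():
            time.sleep(poll_ms / 1000.0)
            now, current = timecode(), read_params()
            for name, value in current.items():
                if value != previous.get(name):
                    if name not in open_records:             # change detected: open a record
                        open_records[name] = {"parameter": name, "in": now, "value": value}
                    else:                                    # longer action, e.g. a zoom:
                        open_records[name]["value"] = value  # keep it in one instantiation
                elif name in open_records:                   # first unchanged poll: close it
                    record = open_records.pop(name)
                    record["out"] = now
                    newsclip_refs.append(record)             # reference kept in the newsclip-DS
            previous = current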

The instantiation of a mediadevice-DS for focus might look as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- edited with XML Spy v3.5 (http://www.xmlspy.com) by Wolfgang Putz (GMD-IPSI) -->
    <mediadevice xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="C:\EigeneDateien\mpeg7\newsitem v2\xmlschemaDSs\Media device.xsd"
      name="m02 ts200001202000">
      <ContextRelation>
        <Resource>
          <address>http://www.darmstadt.gmd.de/MPEG7/Tagesschau200001202000</address>
          <arcrole>part of</arcrole>
        </Resource>
      </ContextRelation>
      <ContextRelation>
        <Resource>
          <address>http://www.darmstadt.gmd.de/MPEG7/event01 inst</address>
          <arcrole>part of</arcrole>
        </Resource>
      </ContextRelation>
      <timerange>
        <in>00/00/02/200</in>
        <out>00/00/03/100</out>
      </timerange>
      <effects_type>closeup</effects_type>
      <camera_data>
        <Focus>734</Focus>
      </camera_data>
    </mediadevice>

In such a way the camera establishes a network, where each change will be represented in a single instantiated scheme (temporal actions longer than 20 ms, such as zooms, are collected in one instantiation only). Once the recording has finished, a comparison check is performed to ensure that all established nodes are referenced and that links are established between nodes and the audio-visual data.

The approach taken in this implementation of the camera goes beyond what current cameras offer. First of all, there are only a small number of professional digital cameras that are able to capture additional audio/visual meta-data, and not surprisingly all of them are developed for digital newsgathering (DNG). As suggested by our approach, these professional cameras use an external memory source (e.g., a 4 GByte hard disc recorder with AVID technology), where markers are set while the video is recorded (Ikegami). However, the particular markers are provided in a proprietary format, instead of the open XML-based format provided by A4SM.

5.2. The handheld

The handheld device allows the reporter to make real-time annotations during shooting on a basic semantic level, i.e., capturing in and out points for sound and images. Figure 9 demonstrates the interface simulation within our demonstration environment.

The device provides a monitor for screening the recorded material of the camera in real time (the transition rate is 13 images per second, since the reporter is only interested in gaining an idea of the framing and content). Furthermore, the handheld supplies a set of buttons:

• to determine the conceptual shot dependency via a scene id, such as 1-1, 1-2, 2-1, etc., where the first digit represents the scene (provided by the top left button in figure 9) and the second digit stands for the shot number (provided by the top right button in figure 9), and

• to mark the importance of sound and images (distinguished by the two middle buttons in figure 9) with in and out points (instantiated via the two lowest buttons in figure 9).


Figure 9. Simulation of a handheld device for the annotation of simple semantic information.

The reporter can adjust the scene id at any time if the camera is not recording. Once the camera is active, it first identifies the current scene id of the handheld (a handshake protocol on infrared basis is used), which will provide the id for all description schemata created for the current recording. If the scene id already exists, the camera automatically adjusts the version.

The synchronisation of the in and out points, which can be set by the reporter at any time during recording, with the audio or video is achieved via time codes (provided by the camera) and the scene/shot id. In fact, every use of an in or out button generates an event-DS where the required information is stored. We decided to use the event-DS rather than the mediadevice-DS, because in and out points operate on a higher semantic level.

All the reporter has to care about is to perform exactly the same information gathering as she is used to in the traditional pencil-oriented environment. However, the difference is that in our digital environment the synchronisation between the camera and important conceptual information, such as what is relevant and when it happened, is no longer based on manually aligning the camera time code and the reporter's watch but is carried out automatically. Consequently, the reporter can concentrate on the action before the camera, given that only a few buttons need to be pressed, instead of scribbling time codes and related notes on paper while listening to speeches or answers of interviewees.
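The essence of a button press can be sketched as follows (our illustration; the timecodes and field names are invented, the real record is an instantiated event-DS): each press couples the current scene/shot id with the camera's time code, so no manual synchronisation is required.

    def mark(kind: str, scene_id: str, camera_timecode: str, channel: str) -> dict:
        # kind is "in" or "out"; channel is "image" or "sound" (the two middle buttons).
        return {"scene_id": scene_id, "channel": channel, "point": kind, "time": camera_timecode}

    # The reporter marks an important sound bite during shot 2 of scene 1:
    in_point  = mark("in",  "1-2", "00:03:12:040", "sound")
    out_point = mark("out", "1-2", "00:03:41:200", "sound")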


The examples of the camera and the handheld demonstrate how extra semantics can be added to audiovisual material during production without increasing the workload or introducing drastic changes in the work procedures. The next section provides ideas on how information-enhanced media can improve the work process and information flow during postproduction.

5.3. The editing suite

This tool uses the metadata-augmented audio-visual media to automatically group and edit the material based on conceptual dependencies within news.

Based on the discussion with editors on how they group incoming material, we distilled the following rules:

• Groups of video material are based on scene ids and their versions.

• Short clips are more important than long ones.

• Annotations of in and out points increase the value of a shot, thus it should be presented more prominently.

• A shot with a zoom-in is important, thus it should be presented more prominently.

• A shot with a zoom-in might be relevant for the main message, thus it should be marked as message relevant.

• A shot with a zoom-out might be relevant for a conclusion, thus it should be marked as conclusion relevant.

• A shot with a close-up is relevant for the main message, thus it should be marked as message relevant.

Applying the rules to each shot of a scene cluster results in a presentation value for the shot in comparison with the other shots of the scene. The final presentation uses this value and presentation templates to provide the user with a visual representation of all scenes and an approximation of which shots are potential targets for the final clip (see figure 11(a)). If too much material is available to be presented in the working environment, the reporter can also search for a shot she remembers. Figure 10 displays the search engine of our test environment, which allows searching for annotation types and directly locating the related sequence in the video.
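A hedged sketch of this scoring step (our paraphrase of the rules above; the weights are invented and not those of the actual planner) might look as follows:

    def presentation_value(shot: dict) -> float:
        # A crude per-shot score derived from the grouping rules.
        score = 0.0
        score += 1.0 / max(shot["duration_frames"], 1)       # short clips rank higher
        score += 2.0 * len(shot.get("in_out_marks", []))     # reporter marks add value
        if "zoom-in" in shot.get("camera_effects", []):
            score += 1.5                                     # zoom-in: present prominently
        return score

    def rank_scene(shots: list) -> list:
        # Order the shots of one scene cluster for the overview display.
        return sorted(shots, key=presentation_value, reverse=True)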

Once the semi-automated editing suite provides the reporter with an instant overview of the available material and its intrinsic relations, represented through the spatial order of its presentation (see figure 11(a) as an example), the reporter is able to mark the relevant video clips for the newscast by pointing at the preferred clips. The order of pointing indicates the sequential appearance of the clips (see figure 11(b)) and the length of pointing their importance. The different shapes around the selected clips represent the role of the clip within the news unit, i.e., the square stands for the introduction, whereas the circle represents the important part. Finally the reporter provides the overall length of the clip in seconds (see figure 11(c)). Based on a simple planner, the editing suite then performs an automated composition of the news clip.

Figure 11. (a)–(d) The handling of a semi-automated editing suite.


Figure 10. Search engine of the A4SM test environment.

At the current stage of development our prototypical editing environment uses the acquired meta-information from the camera annotation process to support the editing process.

The simple planner for news composition is based on 22 rules for the automated clipping of a shot and the juxtaposition of shots. To give an idea of the functionality of these rules, we describe below those rules that are concerned with the rhythmical shape of a sequence.

Strategy A and Strategy B reflect the fact that a viewer perceives the image in a close-up shot in a relatively short time (≈2–3 seconds), whereas the full perception of a long shot requires more time.3 Moreover, the composition of shots may vary in the number of subjects, the number and speed of actions, and so on, which also influences the time taken to perceive the image in its entirety. Finally, the stage of a sequence in which a shot features also influences the time taken to perceive the entire image. For example, a long shot used in the motivation phase takes longer to appreciate, since the location and subjects need to be recognised, whereas in the resolution phase the same shot type may be shorter in duration, since in this case the viewer can orient himself or herself much more quickly.

Strategy A If camera distance of a shot = close-up
then
    clip it to a length = 60 Frames.


Figure 11. (a)–(d) The handling of a semi-automated editing suite.

Strategy B If camera distance of a shot > medium-long and
    sequence.kind = realisation or resolution
then
    clip it to a length = 136 Frames.

Both strategies indicate the need to trim a shot. However, the above strategies may cause problems, as the actions portrayed may simply require more time than suggested by the rules. The necessary steps to achieve the required chain of actions while still allowing trimming are to identify the start frame of the first relevant action and to detect any overlap with the second relevant action. If such an overlap exists, then it is possible to cut away the section of the shot in which the first action is performed in isolation. The same mechanism applies to the end of the shot. It is, of course, important that the established spatial and temporal continuity between the shot and its predecessor and successor remain valid. Figure 12 describes the application of edge trimming to a shot of 140 frames. The shot should portray an actor walking and then sitting, but should not be longer than 108 frames.

Strategy C represents an example of temporal clipping as discussed immediately above, focussing on frame elimination from a single shot that sequentially portrays a number of actions.


Figure 12. Trimming of a shot from 140 to 108 frames.

Strategy C If close-up < camera distance < long and
    sequence.action.tempform = equivalence and
    number of frames > as calculated in Strategy A or B and
    number of performed actions in the shot > 3
then
    verify the frame overlap of first and second action as X
    verify the overlap of last and last-1 action as Y
    cut away the pure frame numbers for the first action if X > 36
    cut away the pure frame numbers for the last action if Y > 36
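A compact sketch of how Strategies A–C and the edge trimming of figure 12 might be realised is given below; the frame values (60, 136, 36) are taken from the strategies above, while the data structures, the mapping of "> medium-long" onto distance labels, and the order in which head and tail are trimmed are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class Action:
    start: int   # first frame of the action within the shot
    end: int     # last frame of the action within the shot

def target_length(distance, phase):
    # Strategies A and B: target lengths in frames (editor estimates, cf. note 3).
    if distance == "close-up":
        return 60
    if distance in ("long", "very-long") and phase in ("realisation", "resolution"):
        return 136
    return None

def edge_trim(shot_frames, actions, max_frames, overlap_min=36):
    # Figure 12 style trimming: drop the frames in which the first (or last) action is
    # performed in isolation, but only while the shot is still too long and the overlap
    # with the neighbouring action exceeds overlap_min frames.
    start, end = 0, shot_frames - 1
    if len(actions) >= 2 and end - start + 1 > max_frames:
        x = actions[0].end - actions[1].start + 1        # overlap of first and second action
        if x > overlap_min:
            start = actions[1].start                     # cut the isolated head
    if len(actions) >= 2 and end - start + 1 > max_frames:
        y = actions[-2].end - actions[-1].start + 1      # overlap of last and second-to-last action
        if y > overlap_min:
            end = actions[-2].end                        # cut the isolated tail
    return start, end

# In the spirit of figure 12: a 140-frame shot (walking, then sitting)
# that should not be longer than 108 frames.
walking = Action(start=0, end=71)
sitting = Action(start=32, end=139)
print(edge_trim(140, [walking, sitting], max_frames=108))   # -> (32, 139): 108 frames remain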

For a more detailed theoretical discussion of automated film editing see [33]. Due to the information provided by the reporter's pointing, the system can also identify basic relations between different scenes (follows, precedes, supports, introduces, evidence, justification), which results in the generation of relation-DSs.
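As an illustration, such a relation-DS could be serialised following the ContextRelation pattern used in Appendix B. The helper below is a hypothetical sketch; the mapping from the pointing order to a 'precedes' relation is assumed rather than taken from the A4SM planner.

import xml.etree.ElementTree as ET

def relation_ds(name, target_address, arcrole):
    # Mirrors the ContextRelation pattern of the Appendix B instances.
    rel = ET.Element("ContextRelation", {"name": name})
    res = ET.SubElement(rel, "Resource")
    ET.SubElement(res, "address").text = target_address
    ET.SubElement(res, "arcrole").text = arcrole
    return rel

def chain_relations(ordered_event_addresses):
    # One 'eventchain' relation per selected clip, pointing to its successor
    # with the arcrole 'precedes' (as in the event DS of Appendix B).
    return {
        earlier: relation_ds("eventchain", later, "precedes")
        for earlier, later in zip(ordered_event_addresses, ordered_event_addresses[1:])
    }

for source, rel in chain_relations(
        ["http://www.darmstadt.gmd.de/MPEG7/event02 inst",
         "http://www.darmstadt.gmd.de/MPEG7/event03 inst"]).items():
    print(source, "->", ET.tostring(rel, encoding="unicode"))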

Even though the system performs the editing, the reporter can still attune the suggested video sequence, i.e., adjust it according to the voice over (see the time line at the top of the interface, figure 11(d)). Once the reporter is satisfied with the news clip, the material can be transferred to the broadcasting station, where it is integrated into the larger news programme (newscast-DS). It must be made clear that 'material' not only means the final clip and the associated semantic network, but also the unused material, which is stored under the same newscast-DS, only marked as unused.

It also needs mentioning that high-level annotations covering information about an object, character or person still require manual labour. We discussed with reporters various methods of adding this information during production. For example, we suggested an extended editing suite, where the reporter can point at an object or person and then type its name into an associated text field. The idea was that the system can identify the region in the frame and, based on object-tracking technology, can also classify the time period in which it appears. Both temporal and spatial information could then be utilised for the instantiation of a spatial-temporal descriptor. Later, the system could also establish relations between the identified objects in the video and matching concepts in the audio stream. The basic identification mechanism would have been an ontology-based analysis of the transcript. A similar technique could be used if the text for the voice over were composed with a text editor instead of being freely improvised. However, it turned out that none of these methods satisfied the needs of the reporters. Thus, at the current stage it seems easiest to let the reporter write the voice over first, as they are used to, synchronise the text with the voice over and then the voice over with the video, and finally relate the time-codes between the two media streams. In other words, to use established video and audio indexing technology during the production. Yet, our test environment does not yet provide this technology.

5.4. Evaluation of repository and tools

Broadcasters are extremely conservative organisations. It turned out that most people at broadcasting stations feel quite reserved towards new ways of production. Moreover, broadcasters rely on predictable, reliable and fast production environments, especially in news. Thus, they would not run the risk of producing media material for actual public transmission with a technology that is still at an experimental stage. We therefore had to depend on people coming to our lab in their free time, using the tools, and discussing with us their usefulness and agreeableness for the news production process.

The responses that reporters, cameramen, and editors gave regarding the tools, whether presented in our lab, at the TV fair in Berlin, or on the occasion of the 10th Anniversary of ERCIM (European Research Consortium for Informatics and Mathematics) in Amsterdam, were encouraging. The general statement was that the described technology would fit into the production process and would provide essential advantages and revenues. Still, the views of the three main target user groups, reporters, cameramen and editors, varied.

Reporters said that they would use tools such as those demonstrated, though it must be mentioned that mainly the younger ones (in their early thirties or younger) were exceptionally open to new technology. The overall idea of annotating material during the production process was accepted once they understood that the additional meta-data will not only help finding material on a simple keyword level but also support the search for high-level concepts, which is usually vague in its description. This is of particular importance for them once they work on a larger piece, such as a 30-minute documentary. From this perspective they would also annotate news material as suggested, because it will help them reuse material. In particular the handheld grabbed their attention, since it allowed them to get an idea of what video material they can expect for the composition of the news clip. Moreover, they enjoyed being freed from the burden of time keeping. The ease of use of the editing suite appealed to them, though they wished to see facilities for simple special effects, e.g. the adding of captions, as well as a better synchronisation with a scripting tool.

We had a similar experience with cameramen. Nearly all of them understood the extra value of annotating material during production, and all of them immediately saw the advantage of documenting their work, be it by annotating a new camera effect as their invention or by simply identifying those parts of a film done by them. They particularly liked the idea that no extra burden was placed on them, such as pressing extra buttons while shooting. However, they detested the idea of the visual screen on the handheld, which every one of them described as a control mechanism for the reporter and thus as a way of degrading their artistic freedom.

The most reserved reactions came from editors. They criticised that the automatic editing is far too crude and would not cover the intuitive and emotional aspects of editing. Even in news they would like to see the principles of the film language implemented, but they feared that the automation would encourage the current trend in news that images merely serve as a visual carpet for the audio. 'Visual news is not radio' was the usual statement. On the other hand, they would generally be pleased not to be tied up with the stressful yet monotonous work of news editing. Thus, they appreciate the editing tool for the news reporter, though they would like to see more stylistic aid. Any similar help in their own working environment they would, however, reject, especially for productions such as a documentary or feature film.

From a technical, implementation point of view we are partially satisfied. It could be shown that the described mechanism of generating description schemata semi-automatically in real time supports the idea of an iterative, ongoing description of audio-visual material. A description in the form of a semantic network (i.e., instantiated description schemata connected via relations) makes it easy to create new annotations and relate them to existing material. If new information structures are required, novel templates can be designed using XML Schema or, once finalised, MPEG-7's DDL. Moreover, the decomposition of the annotation into small temporal or spatial units supports the streaming aspect of the media units. If we wish to provide meta-information with the streamed data, we can use just those annotation units that are relevant for the temporal period and of interest to the application using the stream.
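A sketch of this selective delivery is shown below; it assumes, in line with the Appendix B examples, that every annotation unit carries an in/out time range, while the unit class itself is a hypothetical simplification.

from dataclasses import dataclass

@dataclass
class AnnotationUnit:
    ds_type: str      # e.g. "event", "setting", "dialogue"
    in_time: float    # seconds relative to the start of the clip
    out_time: float
    payload: dict

def units_for_stream_window(units, window_start, window_end, wanted_types=None):
    # Return only those annotation units that overlap the streamed period and,
    # optionally, only those DS types the consuming application cares about.
    return [
        u for u in units
        if u.out_time > window_start and u.in_time < window_end
        and (wanted_types is None or u.ds_type in wanted_types)
    ]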

However, the approach of relating small unit description schemes, each forming a document and thus a node of the network, also generates problems, mainly with respect to search and content validation.

Search: The complex structure of the semantic network does not allow easy detection of the required information units. It is not difficult to detect the right entry point for the search (usually a hub, but other nodes might also be applicable), but the traversal of the network is, compared to a simple hierarchical tree structure, more complex. Due to its flexibility it is rather problematic to generate orientation structures such as a table of contents for a newscast. A potential solution to this problem might be the introduction of a schema header (containing general information about the DS type, the links and relations, and other organisational information) and a schema body with the particular descriptive information.

Validation: For cases in which new nodes are added, established nodes are changed, or new relations between existing nodes are introduced, we have to validate that these operations and the created documents are valid. In our opinion this can only be achieved via partial parsing (see also figure 13). This means the parser validates only a particular part of the network (e.g., a number of hubs). In this way we avoid that the complete network needs to be parsed when only a tiny section is affected.
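A minimal sketch of such partial parsing is given below, using the lxml library's XML Schema validator. The neighbourhood traversal, the mappings from node ids to documents and schemata, and the depth bound are assumptions made for illustration, not part of the A4SM implementation.

from lxml import etree

def validate_partial(affected_hubs, neighbours, node_files, max_depth=1):
    # affected_hubs: node ids whose description documents were added or changed
    # neighbours:    node id -> list of related node ids (the relation links)
    # node_files:    node id -> (document path, XML Schema path)
    to_check, frontier = set(affected_hubs), set(affected_hubs)
    for _ in range(max_depth):                  # follow relations a bounded number of hops
        frontier = {n for node in frontier for n in neighbours.get(node, [])} - to_check
        to_check |= frontier

    errors = {}
    for node in to_check:
        doc_path, xsd_path = node_files[node]
        schema = etree.XMLSchema(etree.parse(xsd_path))
        if not schema.validate(etree.parse(doc_path)):
            errors[node] = [str(e) for e in schema.error_log]
    return errors                               # empty dict: the affected region is valid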

We understand that the flexibility of our approach increases the complexity of maintaining the description structure. Further research has to show whether this is acceptable.

In Appendix B of this article we include instantiated description schemata to show what real results look like.

6. Conclusion and future work

In this paper we presented A4SM, a framework for the creation, manipulation, and archiving/retrieval of media documents, applied to the domain of news. We demonstrated with a basic set of 18 syntactic and semantic description schemata how video material can be annotated in real time and how this information can not only be used for retrieval but also how it can be exploited during the different phases of the production process itself.

Figure 13. A4SM description network and partial parsing approach.

The emphasis in this work is on the provision of tools and technologies for the manual authorship of linear and interactive media production. In fact, there is ample evidence that a great deal of useful annotation can only be provided by manual labour, but also that there is no such thing as a single and all-inclusive content description. We see the need for collective sets of descriptions growing over time (i.e., no annotation will be overwritten; rather, extensions or new descriptions will appear in the form of new documents). Thus, there is not only the requirement for flexible formal annotation mechanisms and structures, but also for tools which, firstly, support human creativity in creating the best material for the required task and, secondly, use the creative act to extract the significant syntactic, semantic and semiotic aspects of the content description.

We are aware that the approach described in this article is but a small step towards the intelligent use and reuse of media production material. Nevertheless, we believe that the work undertaken will inform research into the generation of interactive media documents, in particular, and research into media representation, in general. At present, we are engaged in research on tools and techniques for semi-automated, interactive narrative generation, mainly for the domain of documentary. This research will in particular address the problems regarding the annotation of high-level concepts during script writing and their incorporation into the production process. It will also address the four remaining sections of the requirements for digital news production.

Appendix A: A4SM’s 18 schemata for describing content of news

XML Schema documentation generated with XML Spy Schema Editor www.xmlspy.com

complexType NewscastType

complexType NewsclipType


complexType LinkType

complexType TSRType

complexType RelationType

complexType ResourceType


complexType FormaldesType

complexType BpinfoType

complexType Subjective cType


complexType MediadeviceType

complexType PersonType

complexType EventType


complexType ObjectType

complexType CharacterType

complexType ActionType


complexType DialogueType

complexType SettingType

complexType LighteningType

complexType ArchiveType


complexType AccessType

Appendix B: Example of a news annotation

The following example of the description of a newsclip is based on our 18 different description schemata. What is shown below is not the complete description, since its volume would exceed the scope of this article. However, the example is detailed enough to show the approach and the possibilities of the underlying model.

The clip we describe is a real newsclip, broadcast by the ARD-Tagesschau on 20 January 2000. The clip consists of 7 events, of which the first four are shown below, represented by their corresponding key-frames. Note that the newsclip-DS in this example is used as a hub.

event 1 event 2 event 3 event 4

Instantiated newsclip DS

<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XML Spy v3.5 (http://www.xmlspy.com) by Wolfgang Putz (GMD-IPSI) -->
<newsclip xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="C:\EigeneDateien\mpeg7\newsitem v2\xmlschemaDSs\Newsclip.xsd"
          name="Tagesschau200001202000">
  <ContextRelation name="formaldescription">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/formaldes inst</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="toev1">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/event01 ins</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="toev2">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/event02 ins</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="toev3">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/event03 ins</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="toev4">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/event04 ins</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="toev5">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/event05 ins</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="toev6">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/event06 ins</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="toev7">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/event07 inst</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <media_format>MPEG-1</media_format>
  <media_type>video</media_type>
  <Ratio>4 3</Ratio>
  <video_data name="video data">
    <target>http://www.darmstadt.gmd.de/mpeg7/ts200001202000 v</target>
    <timeandspace>
      <in>00/13/22/840</in>
      <out>00/13/49/560</out>
    </timeandspace>
  </video_data>
  <audio_data name="audio data">
    <target>http://www.darmstadt.gmd.de/mpeg7/ts200001202000 a</target>
    <timeandspace>
      <in>00/13/22/840</in>
      <out>00/13/48/960</out>
    </timeandspace>
  </audio_data>
</newsclip>

Instantiated formaldes DS

<?xml version="1.0" encoding="UTF-8"?>
<formaldes xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="C:\EigeneDateien\mpeg7\newsitem v2\xmlschemaDSs\Formaldes.xsd"
           name="fts200001202000">
  <ContextRelation name="firstbroadcastedat">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/broadcast01</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="archive">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/archive01</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="speaker">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/person01</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <title>children's bus accident in Bavaria</title>
  <broadcaster>ARD</broadcaster>
  <category>news</category>
  <keyword>accident</keyword>
  <keyword>children</keyword>
  <keyword>Bavaria</keyword>
  <origin>own</origin>
  <language>DE</language>
</formaldes>

Instantiated event DS for event 2

<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XML Spy v3.5 (http://www.xmlspy.com) by Wolfgang Putz (GMD-IPSI) -->
<event xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="C:\EigeneDateien\mpeg7\newsitem v2\xmlschemaDSs\Event.xsd"
       name="e02 ts200001202000">
  <timeandspace>
    <in>00/00/04/300</in>
    <out>00/00/07/167</out>
  </timeandspace>
  <ContextRelation name="eventchain">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/event03 inst</address>
      <arcrole>precedes</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="bus">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/bus02 inst</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="setting">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/setting02 inst</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="dialogue">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/dial02 inst</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <ContextRelation name="media1">
    <Resource>
      <address>http://www.darmstadt.gmd.de/MPEG7/media02 inst</address>
      <arcrole>contains</arcrole>
    </Resource>
  </ContextRelation>
  <event_type>action</event_type>
  <abstract>the broken front window of the schoolbus</abstract>
</event>

Instantiated mediadevice DS used in the instantiation for event 2

<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XML Spy v3.5 (http://www.xmlspy.com) by Wolfgang Putz (GMD-IPSI) -->
<mediadevice xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
             xsi:noNamespaceSchemaLocation="C:\EigeneDateien\mpeg7\newsitem v2\xmlschemaDSs\Media device.xsd"
             name="m02 ts200001202000">
  <timerange>
    <in>00/00/04/360</in>
    <out>00/00/07/200</out>
  </timerange>
  <effects_type>closeup</effects_type>
  <!-- in this example no technical camera information is defined -->
</mediadevice>

Acknowledgments

We thank Michael Weigand and Angelo Barreiros for programming the MPEG-7 camera. We would also like to acknowledge the MPEG-7 experts, in particular Jane Hunter, Ernest Wan, Olivier Avaro and Arnd Steinmetz, who provided support in the development of the Description Schemata and useful discussion during the development of this work. We also thank the WDR (Westdeutscher Rundfunk, Köln) and the HR (Hessischer Rundfunk, Frankfurt) for supporting this work by offering access to their practical sessions.

This work was mainly funded by GMD-IPSI. The first author would like to thank CWI for its ongoing support in finishing the latest stages of the research.

Notes

1. The technology provided by Avstar Systems LLC, the market leader in newsroom computing systems, shows that the combination of closed proprietary architectures is possible. This technology allows automatic indexing that enables news journalists, editors, producers and directors to search, browse and pre-edit their incoming videos from the desktop. However, the system does not support progressive-oriented or individualised annotations, provides only a fairly limited search spectrum, and does not support later use of the descriptions for news on demand.

2. Please note that Part 5 of the standard, which is about Multimedia Description Schemata, provides a Graph DS [26]. This Graph DS is merely aimed at establishing semantic relations between parts of a description, thus relations within one, most likely complex, document.

3. The time values used in all the preceding examples of editing strategies are based on estimates provided by the editors at the WDR.

References

1. T.G. Aguierre Smith and G. Davenport, "The stratification system. A design environment for random access video," in ACM Workshop on Networking and Operating System Support for Digital Audio and Video, San Diego, California, 1992.
2. P. Aigrain, P. Joly, and V. Longueville, "Medium knowledge-based macro-segmentation of video into sequences," in IJCAI 95 Workshop on Intelligent Multimedia Information Retrieval, M. Maybury (Ed.), Montreal, 1995, pp. 5–16.
3. R. Arnheim, "Art and Visual Perception: A Psychology of the Creative Eye," Faber & Faber: London, 1956.
4. M. Bertini, A. Del Bimbo, and P. Pala, "Content based annotation and retrieval of news videos," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2000), New York City, NY, USA, 2000, pp. 483–488.
5. G.R. Bloch, "Elements d'une machine de montage pour l'audio-visuel," Ph.D. Thesis, Ecole Nationale Superieure des Telecommunications, 1986.
6. P.J. Bloom, "High-quality digital audio in the entertainment industry: An overview of achievements and challenges," IEEE Acoust. Speech Signal Process. Mag., Vol. 2, 1985, pp. 2–25.
7. J. Borchers and M. Muhlhauser, "Design patterns for interactive musical systems," IEEE Multimedia Magazine, Vol. 5, No. 3, 1998, pp. 36–46.
8. D. Bordwell, "Making Meaning: Inference and Rhetoric in the Interpretation of Cinema," Harvard University Press: Cambridge, MA.
9. R.J. Brachman and H.J. Levesque, "Readings in Knowledge Representation," Morgan Kaufmann Publishers: San Mateo, CA, 1983.
10. K.M. Brooks, "Metalinear cinematic narrative: Theory, process, and tool," Ph.D. Thesis, MIT, 1999.
11. C. Colombo, A. Del Bimbo, and P. Pala, "Semantics in visual information retrieval," IEEE Multimedia, Vol. 6, No. 3, 1999, pp. 38–53.
12. M. Davis, "Media streams: Representing video for retrieval and repurposing," Ph.D. Thesis, MIT, 1995.
13. A. Del Bimbo, "Visual Information Retrieval," Morgan Kaufmann: San Francisco, USA, 1999.
14. U. Eco, "A Theory of Semiotics," The Macmillan Press: London, 1997.
15. H. Fehlis, "Hybrides Trackingsystem für virtuelle Studios," Fernseh + Kinotechnik, Bd. 53, Nr. 5, 1999.


16. J-L. Gauvain, L. Lamel, and G. Adda, "Transcribing broadcast news for audio and video indexing," Communications of the ACM, Vol. 43, No. 2, 2000, pp. 64–70.
17. A. Gupta and R. Jain, "Visual information retrieval," Communications of the ACM, Vol. 40, 1997, pp. 71–79.
18. J. Greimas, "Structural Semantics: An Attempt at a Method," University of Nebraska Press: Lincoln, 1983.
19. F.G. Halasz, "Reflection on NoteCards: Seven issues for the next generation of hypermedia systems," Communications of the ACM, Vol. 31, No. 7, 1988.
20. K. Hirata, "Towards formalizing Jazz piano knowledge with a deductive object-oriented approach," in Proceedings of Artificial Intelligence and Music, IJCAI, Montreal, 1995, pp. 77–80.
21. J. Hunter and L. Armstrong, "A comparison of schemas for video metadata representation," in Proceedings of WWW8, Toronto, 1999.
22. J. Hunter and C. Lagoze, "Combining RDF and XML schemas to enhance interoperability between metadata application profiles," in The Tenth International World Wide Web Conference, Hong Kong, 2001, pp. 457–466.
23. I. Ide, R. Hamada, S. Sakai, and H. Tanaka, "An attribute based news video indexing," in Workshop Proceedings of the 9th ACM International Conference on Multimedia, Marina del Rey, California, 2000, pp. 195–200.
24. ISO MPEG-7, "Overview of the MPEG-7 standard (version 4.0)," Doc. ISO/MPEG N3752, MPEG La Baule Meeting, 2000.
25. ISO MPEG-7, Text of ISO/IEC FCD 15938-2 Information Technology—Multimedia Content Description Interface—Part 2: Description Definition Language, ISO/IEC JTC 1/SC 29/WG 11 N4288, 19/09/2001.
26. ISO MPEG-7, Text of ISO/IEC 15938-5/FCD Information Technology—Multimedia Content Description Interface—Part 5: Multimedia Description Schemes, ISO/IEC JTC 1/SC 29/WG 11 N4242, 23/10/2001.
27. S.E. Johnson, P. Jourlin, K. Sparck Jones, and P.C. Woodland, "Audio indexing and retrieval of complete broadcast news shows," in RIAO' 2000 Conference Proceedings, College de France, Paris, France, Vol. 43, No. 2, 2000, pp. 1163–1177.
28. F. Kubala, S. Colbath, D. Liu, A. Srivastava, and J. Makhoul, "Integrated technologies for indexing spoken language," Communications of the ACM, Vol. 43, No. 2, 2000, pp. 57–63.
29. K. Lemstrom and J. Tarhio, "Searching monophonic patterns within polyphonic sources," in RIAO' 2000 Conference Proceedings, College de France, Paris, France, Vol. 2, 2000, pp. 1163–1177.
30. C. Lindley, "A video annotation methodology for interactive video sequence generation," BCS Computer Graphics & Displays Group Conference on Digital Content Creation, Bradford, UK, 2000.
31. M. Melucci and N. Orio, "SMILE: A system for content-based musical information retrieval environments," in RIAO' 2000 Conference Proceedings, College de France, Paris, France, Vol. 2, 2000, pp. 1261–1279.
32. T.J. Mills, D. Pye, N.J. Hollinghurst, and K.R. Wood, "AT&TV: Broadcast television and radio retrieval," in RIAO' 2000 Conference Proceedings, College de France, Paris, France, Vol. 2, 2000, pp. 1135–1144.
33. F. Nack, "AUTEUR: The application of video semantics and theme representation in automated video editing," Ph.D. Thesis, Lancaster University, 1996.
34. F. Nack and A. Parkes, "The application of video semantics and theme representation in automated video editing," Multimedia Tools and Applications, H. Zhang (Ed.), Vol. 4, No. 1, 1997, pp. 57–83.
35. F. Nack and A. Steinmetz, "Approaches on intelligent video production," in Proceedings of the ECAI-98 Workshop on AI/Alife and Entertainment, Brighton, 1998.
36. F. Nack and A. Lindsay, "Everything you wanted to know about MPEG-7: Parts I & II," IEEE MultiMedia, IEEE Computer Society, 1999, pp. 65–77 and pp. 64–73.
37. F. Nack and C. Lindley, "Environments for the production and maintenance of interactive stories," Workshop on Digital Storytelling, Darmstadt, Germany, 15–16/6/2000.
38. A. Nagasaka and Y. Tanaka, "Automatic video indexing and full-search for video appearance," in Visual Database Systems, E. Knuth and I.M. Wegener (Eds.), Elsevier Science Publishers: Amsterdam, 1992, pp. 113–127.
39. F. Pachet and D. Cazaly, "A taxonomy of musical genres," in RIAO' 2000 Conference Proceedings, College de France, Paris, France, Vol. 2, 2000, pp. 1238–1245.
40. A.P. Parkes, "An artificial intelligence approach to the conceptual description of videodisc images," Ph.D. Thesis, Lancaster University, 1989.


41. A.P. Parkes, "Settings and the settings structure: The description and automated propagation of networks for perusing videodisk image states," in SIGIR '89, N.J. Belkin and C.J. van Rijsbergen (Eds.), Cambridge, MA, 1989, pp. 229–238.
42. S. Pfeiffer, S. Fischer, and W. Effelsberg, "Automatic audio content analysis," in Proceedings of ACM Multimedia 96, New York, 1996, pp. 21–30.
43. RDF Schema, 2001. http://www.w3.org/TR/rdf-schema/.
44. J. Robertson, A. De Quincey, T. Stapleford, and G. Wiggins, "Real-time music generation for a virtual environment," in Proceedings of the ECAI-98 Workshop on AI/Alife and Entertainment, Brighton, 1998.
45. W. Sack, "Coding news and popular culture," in The International Joint Conference on Artificial Intelligence (IJCAI-93) Workshop on Models of Teaching and Models of Learning, Chambery, Savoie, France, 1993.
46. S. Santini and R. Jain, "Integrated browsing and querying for image databases," IEEE MultiMedia, IEEE Computer Society, 2000, pp. 26–39.
47. Semantic Web, 2001. http://www.w3.org/2001/sw/.
48. SMPTE Dynamic Data Dictionary Structure, 6th Draft, 1999.
49. J.F. Sowa, "Conceptual Structures: Information Processing in Mind and Machine," Addison-Wesley Publishing Company: Reading, MA, 1984.
50. TALC, 1999. http://www.de.ibm.com/ide/solutions/dmsc/.
51. Y. Tonomura, A. Akutsu, Y. Taniguchi, and G. Suzuki, "Structured video computing," IEEE MultiMedia, Vol. 1, No. 3, 1994, pp. 34–43.
52. H.D. Wactlar, A.G. Hauptmann, M.G. Christel, R.G. Houghton, and A.M. Olligschlaeger, "Complementary video and audio analysis for broadcast news archives," Communications of the ACM, Vol. 43, No. 2, 2000, pp. 42–47.
53. E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based classification, search, and retrieval of audio," IEEE Multimedia Magazine, Vol. 3, No. 3, 1996, pp. 27–36.
54. X-Link, 2001. http://www.w3.org/TR/xlink/.
55. XML Schema, 2000. XML Schema Part 0: Primer, W3C Candidate Recommendation, 24 October 2000, http://www.w3.org/TR/2000/CR-xmlschema-1-20001024/; XML Schema Part 1: Structures, W3C Recommendation, 24 October 2000, http://www.w3.org/TR/2000/CR-xmlschema-1-20001024/; XML Schema Part 2: Datatypes, W3C Recommendation, 24 October 2000, http://www.w3.org/TR/2000/CR-xmlschema-1-20001024/.
56. XML Schema, 2001. XML Schema Part 0: Primer, W3C Recommendation, 2 May 2001, http://www.w3.org/TR/xmlschema-0/; XML Schema Part 1: Structures, W3C Recommendation, 2 May 2001, http://www.w3.org/TR/xmlschema-1/; XML Schema Part 2: Datatypes, W3C Recommendation, 2 May 2001, http://www.w3.org/TR/xmlschema-2/.
57. M.M. Yeung, B. Yeo, W. Wolf, and B. Liu, "Video browsing using clustering and scene transitions on compressed sequences," in Proceedings IS&T/SPIE '95 Multimedia Computing and Networking, San Jose, SPIE (2417), 1995, pp. 399–413.
58. H. Zhang, Y. Gong, and S.W. Smoliar, "Automated parsing of news video," in IEEE International Conference on Multimedia Computing and Systems, Boston: IEEE Computer Society Press, 1994, pp. 45–54.

Frank Nack is a senior researcher at CWI, currently working within the Multimedia and Human-Computer Interaction group. He obtained his Ph.D. on 'The Application of Video Semantics and Theme Representation for Automated Film Editing' at Lancaster University, UK. The main thrust of his research is on video representation, digital video production, multimedia systems that enhance human communication and creativity, interactive storytelling, and media-networked oriented agent technology. He is an active member of the MPEG-7 standardisation group, where he served as editor of the Context and Objectives Document and the Requirements Document, and chaired the MPEG-7 DDL development group. He is on the editorial board of IEEE Multimedia, where he edits the Media Impact column.

Wolfgang Putz received his diploma in computer science in 1975 from the University of Erlangen-Nürnberg, Germany. From 1975 to 1978 he was a member of the Institute for Information Systems of the GMD, now Fraunhofer Gesellschaft, involved in the implementation of a database system. From 1978 to 1983 he was a member of the Institute for Information Systems and Graphical Information Processing of the GMD, working in database research. From 1983 to 1989 he joined the Institute for Technology Transfer, working on research and implementation of report generation tools based on structured documents. Since 1989 he has been working at the Integrated Publication and Information Systems Institute (IPSI) in Darmstadt. His main research topic is the content and knowledge description of multimedia information. In this context he is contributing to the MPEG-7 standardisation.