
Networked Audiovisual Media Technologies

VISNET

IST-2003-506946

D40

Review of the work done in Audio-Video fusion


Document description

Document’s name: Review of the work done in Audio-Video fusion

Abstract: This deliverable presents a “state of the art” in multimodal analysis. The main objective of this document is to review the work already done in Audio-Video fusion in person location and identification and video indexing so that research areas can be identified.

Document Identifier: D40
Document Class: Deliverable
Version: V 1.0
Authors: EPFL: Yousri Abdeljaoued; INESC: Luis Gustavo; PdM: Marco Marcon, Augusto Sarti; TUB: Markus Schwab; UPC: Toni Rama, Francesc Tarrés
Edited by: UPC
Creation date: 15/04/2004
Last modification date: 31/05/2004
Status: Final
Destination: Consortium
WP n°: 4.3


TABLE OF CONTENTS

1. INTRODUCTION ................................................................................................................................................... 6

1.1 OVERVIEW OF MULTIMODAL ANALYSIS: PROBLEM STATEMENT ........................ 6
1.2 OVERVIEW OF THE DELIVERABLE ........................ 7
1.3 BIBLIOGRAPHY ........................ 8

2. MEASURING AUDIO FEATURES ...................................................................................................................... 9

2.1 INTRODUCTION ........................ 9
2.2 FRAME-LEVEL FEATURES ........................ 10
2.2.1 VOLUME – SHORT TIME ENERGY (STE) - LOUDNESS ........................ 10
2.2.2 ZERO CROSS RATE (ZCR) ........................ 10
2.2.3 BAND ENERGY (BE) AND BAND ENERGY RATIO (BER OR ERSB) ........................ 11
2.2.4 FREQUENCY CENTROID (FC) ........................ 11
2.2.5 BANDWIDTH (BW) ........................ 12
2.2.6 SPECTRAL ROLLOFF POINT ........................ 12
2.2.7 SPECTRAL FLATNESS MEASURES ........................ 12
2.2.8 CEPSTRAL COEFFICIENTS (CC) ........................ 13
2.2.9 MEL FREQUENCY CEPSTRUM COEFFICIENTS (MFCC) ........................ 13
2.2.10 PITCH OR FUNDAMENTAL FREQUENCY ........................ 14
2.3 CLIP-LEVEL FEATURES ........................ 15
2.3.1 VOLUME-BASED ........................ 15
2.3.2 ENERGY BASED ........................ 16
2.3.3 ZCR-BASED ........................ 17
2.3.4 NON-SILENCE RATIO (NSR) ........................ 17
2.3.5 NOISE FRAME RATIO (NFR) ........................ 17
2.3.6 PITCH-BASED ........................ 17
2.3.7 SPECTRUM FLUX (SF) ........................ 18
2.3.8 BAND PERIODICITY (BP) ........................ 18
2.3.9 LSP DISTANCE MEASURE ........................ 19
2.3.10 COMPRESSED DOMAIN AUDIO FEATURES ........................ 19
2.4 BIBLIOGRAPHY ........................ 20

3. MEASURING VIDEO FEATURES .................................................................................................................... 22

3.1 INTRODUCTION ........................ 22
3.2 COLOR ........................ 22
3.3 SHAPE ........................ 22
3.4 TEXTURE ........................ 23
3.5 MOTION ........................ 23
3.6 BIBLIOGRAPHY ........................ 23

4. STATISTICAL PATTERN RECOGNITION: A REVIEW ............................................................................. 25

4.1 INTRODUCTION ........................ 25
4.2 CLASSIFIERS ........................ 25
4.2.1 BAYESIAN APPROACH ........................ 25
4.2.2 DISCRIMINANT FUNCTIONS ........................ 26
4.2.3 LINEAR DISCRIMINANT FUNCTIONS ........................ 26


4.2.4 PIECEWISE LINEAR DISCRIMINANT FUNCTIONS ........................ 28
4.2.5 GENERALIZED LINEAR DISCRIMINANT FUNCTIONS ........................ 28
4.3 CLASSIFIER COMBINATION ........................ 29
4.3.1 COMBINATION SCHEMES ........................ 29
4.3.2 TRAINING METHODS OF INDIVIDUAL CLASSIFIERS TO ASSURE INDEPENDENCY ........................ 31
4.4 BIBLIOGRAPHY ........................ 31

5. FUNDAMENTALS OF INFORMATION FUSION .......................................................................................... 33

5.1 INTRODUCTION ........................ 33
5.2 PRE-MAPPING FUSION ........................ 34
5.2.1 SENSOR DATA LEVEL FUSION ........................ 34
5.2.2 FEATURE LEVEL FUSION ........................ 34
5.3 POST-MAPPING FUSION ........................ 35
5.3.1 DECISION FUSION ........................ 35
5.3.2 OPINION FUSION ........................ 36
5.4 ADAPTIVE FUSION ........................ 37
5.5 BIBLIOGRAPHY ........................ 39

6. PEOPLE LOCATION USING AUDIO VISUAL INFORMATION ................................................................ 41

6.1 INTRODUCTION ........................ 41
6.2 PEOPLE LOCATION ........................ 41
6.2.1 REVIEW OF MICROPHONE ARRAY SPEAKER LOCALIZATION ........................ 41
6.2.2 VIDEO PERSON LOCALIZATION ........................ 43
6.2.3 RECENT WORKS IN VIDEO AND AUDIO FUSION FOR PEOPLE LOCALIZATION ........................ 45
6.2.4 EXAMPLES AND APPLICATIONS FOR AUDIO-VISUAL LOCALIZATION ........................ 46
6.3 BIBLIOGRAPHY ........................ 47

7. AUDIOVISUAL PERSON RECOGNITION AND VERIFICATION ............................................................. 50

7.1 INTRODUCTION ........................ 50
7.2 SHOT SELECTION IN VIDEO SEQUENCES ........................ 51
7.2.1 ELEMENTS OF THE VIDEO SEQUENCE (NEWS SEQUENCE) ........................ 52
7.2.2 AUDIO AND VIDEO SEGMENTATION ........................ 53
7.2.3 MODALITIES CORRESPONDENCE ........................ 53
7.3 STATE-OF-THE-ART OF MULTIMODAL PERSON RECOGNITION ........................ 55
7.3.1 INTRODUCTION ........................ 55
7.3.2 NON-ADAPTIVE APPROACHES ........................ 55
7.3.3 ADAPTIVE APPROACHES ........................ 58
7.4 BIBLIOGRAPHY ........................ 60

8. MULTIMODAL VIDEO INDEXING ................................................................................................................. 62

8.1 INTRODUCTION ........................ 62
8.2 VIDEO DOCUMENT SEGMENTATION ........................ 62
8.2.1 LAYOUT RECONSTRUCTION ........................ 62
8.2.2 CONTENT SEGMENTATION ........................ 62
8.3 AUDIO DOCUMENT SEGMENTATION ........................ 63
8.4 MULTIMODAL INTEGRATION ........................ 64
8.5 BIBLIOGRAPHY ........................ 66


9. FUTURE RESEARCH AND RESEARCH ACTIVITIES IN VISNET ........................................................... 69

9.1 RESEARCH ACTIVITIES ........................................................................................................................................ 69


1. INTRODUCTION

1.1 Overview of multimodal analysis: Problem statement

Audiovisual scenes can be viewed as a composition of objects of different nature. These objects can be broadly classified as natural or synthetic audio and video signals, graphics, text, etc. In general, a complete scene understanding requires analyzing all the available information in order to identify not only the objects that compose the scene, but also their nature and their relationships. In many practical situations, the objects of the scene are closely related and provide the same kind of information. Consider, for instance, the problem of identifying a person in a news program. We can use a video-based face recognition system to identify the person in the scene, or a speaker recognizer that analyzes the audio stream. We could also try to find text areas in the video signal and apply an OCR to verify whether the speaker’s name is on the screen. All of the above approaches can provide the expected result, but each of them carries an inherent risk of failure. This risk depends not only on the quality of the recognizers but also on the characteristics of the scene, such as the illumination, the pose of the person, more than one person speaking, audio noise, etc.

It therefore seems natural that the best approach to obtain reliable results is to design a system that efficiently combines the available information in the audio and video domains. In fact, human beings are good evidence of a system that combines multiple information sources. Examples of information combined by humans include the use of both eyes, seeing and touching the same object, and seeing and listening to a person talking (which greatly increases intelligibility [Silsbee96]). Multimodal approaches deal with strategies for efficiently fusing the information provided by different sources in a single system.

Although the use of multimodal approaches for audiovisual scene analysis seems obvious, it is a relatively new field with growing interest and activity in the research community. There are several significant contributions to the field, covering the principles and theoretical basis of information fusion as well as a broad class of practical developments directed to different applications such as security, speech recognition, surveillance, people location, scene classification, person recognition, indexing and retrieval, segmentation, etc. Nevertheless, the field is still immature, and proposals are very focused on specific applications and scenarios, which makes the performance comparison of different strategies very difficult.

In general, multimodal analysis tries to combine information from several sources into one single system. For example, in biometric analysis, a single system could combine the 3D face of a person, his speech and the images obtained from fingerprint and iris sensors. In the context of audiovisual scene analysis, the nature of the sources of information is usually restricted to the audio and the video signals. The key idea behind this combined approach is that the audio and video information can complement each other, because the limitations that reduce the system performance in the two modalities are supposed to be uncorrelated.

While multimodal approaches seem to be the most natural path towards scene analysis and understanding, an obvious question arises about why this field is relatively new and has only attracted major research efforts during the last decade. At least two factors have to be considered as possible explanations. The first one is related to the computational burden associated with the simultaneous analysis of audio and video signals; nevertheless, the continuous improvement of the capabilities of digital systems has enabled the introduction of multimodal analysis solutions for audiovisual signals. The second reason, less obvious, lies in the expertise of the research community. Cooperation between the audio and video research communities has been quite limited through the decades and, generally, an engineer who is an expert in one of the areas has, at most, a general state-of-the-art knowledge of the other one. It is true that image and video researchers have borrowed solutions or approaches from the audio world to solve problems in the video domain, and vice versa. Nevertheless, a unified multimodal approach, taking the expertise from both worlds, has until recently been limited to a few examples.


In this context, the VISNET WP4.3 on Multimodal Analysis is considered an excellent opportunity for the cooperation of both communities in the research and development of new approaches that effectively combine the potential of using all the available information.

Techniques used in multimodal systems belong to a broader research field called information fusion. In general, information fusion includes any area that combines different information sources, either to generate one representational format or to reach a decision [Kittler98]. Examples of areas that use information fusion are consensus building, team decision theory, committee machines, integration of multiple sensors, multi-modal data fusion, combination of multiple experts/classifiers, distributed detection and distributed decision making.

There are many open questions related to multimodal analysis which have to be taken into account when designing or developing a multimodal system:

1. Which classifiers are the most efficient ones for a certain application?
2. Which features are the most appropriate for the classifiers?
3. How many classifiers are necessary?
4. Given a set of classifiers, which combination scheme improves the overall performance of the system?
5. How is the information, or the outputs of each individual classifier, fused?

Of the above questions, the first two are inherited from classical pattern recognition, whereas the last three are specific to multimodal analysis. The third question refers to the number of classifiers involved in the multimodal system. Normally it depends on the information available, and in some cases the insertion of a new classifier may not improve the performance of the system; this means that the classifier does not contribute complementary information or, in other words, that its opinion is correlated with the opinions of the other classifiers. The fourth question addresses how the different classifiers should be organized: the experts can be combined in a parallel, serial or hierarchical scheme. Moreover, it is possible to “activate” one classifier depending on the output of another one, or simply to activate only some of them depending on the system conditions. Although the last question may seem very similar to the previous one, there is a big difference between the two: while the fourth question is related to the architecture used to combine the individual classifiers, question 5 refers to the combination of the individual outputs in order to make the final decision. A correct answer to all these questions is the key to success in the multimodal analysis field.

1.2 Overview of the deliverable

The present deliverable, produced within the scope of workpackage 4.3 “Multimodal Analysis”, has the objective of presenting the state of the art in multimodal analysis of audiovisual signals. The document is organized as follows. Sections 2 and 3 cover the topic of feature extraction from audio and video signals, respectively. Measuring features from the audio and video streams is the first stage in any scene analysis system, and both sections review the main feature extraction methods from the two worlds. Section 4 is a short review of statistical pattern classification methodologies. The first part is dedicated to a general overview of the problem statement and the main solutions of pattern classification. The second part is devoted to the presentation of existing strategies for combining multiple classifiers into a single solution. This part is directly related to the multimodal approach, where one has to devise an architecture that combines the results of different partial classifiers into a single decision.


The approach is, however, very general, and presents the different combination strategies from the perspective of pattern classification, with no special emphasis on the audiovisual problem. Audio and video fusion is considered in detail in Section 5, where a broad classification of the existing approaches is discussed in the context of audiovisual scene analysis. Sections 6, 7 and 8 review the work done in audio and video fusion in the three applications to which the major efforts of this WP are devoted: People Location, Person Recognition and Video Indexing using multimodal approaches. Finally, Section 9 reviews the joint research efforts that the partners will develop in the context of the VISNET project.

1.3 Bibliography

[Kittler98] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226-239, March 1998.
[Silsbee96] P. Silsbee and A. Bovik, “Computer lip-reading for improved accuracy in automatic speech recognition”, IEEE Transactions on Speech and Audio Processing, 4(5):337-351, May 1996.


2. MEASURING AUDIO FEATURES

2.1 Introduction

Until recently, research in the field of audiovisual content analysis has mainly focused on using visual features for segmentation, classification and summarization. Researchers are now beginning to understand that audio characteristics are equally, or even more, important when it comes to understanding the semantic content of a video1 [Wang00]. This applies not only to speech content, but also to more general acoustic properties, which can in fact contain sufficient information to allow the semantic classification of a video signal without even looking at its image content. Furthermore, audio-based processing has the advantage of lower computational requirements than visual analysis techniques. On the other hand, audio analysis results can be used to guide additional visual processing, or be combined with the visual cues in order to resolve ambiguities in the individual modalities and thereby help to obtain more accurate answers.

This section addresses the extraction of some of the most well-known features that can be used to characterize audio signals and, by consequence, their corresponding video signals. Additional information about audio features can be found in VISNET deliverable D29 – “Audio and Speech Analysis Overview” [VISNET.D29]; in fact, some of the information in this section is based on the text of that document.

Audio features used for scene classification include typical measures of the audio waveform already used in other well-known problems (e.g. speech recognition), but also specific features that try to improve discrimination among different scene classes. The audio features can be extracted at two levels: the short-term frame level and the long-term clip level. A frame is defined as a group of neighbouring samples with a duration between 10 and 40 ms, so that a stationary signal can be assumed. For a feature to reveal the semantic meaning of an audio signal, analysis over a much longer period is necessary, usually from one second to several tens of seconds. This interval is called an audio clip, and it consists of a sequence of audio frames. The clip boundaries may be the result of audio segmentation such that the frame features within each clip are similar. In fact, some authors have proposed a direct segmentation of audio signals into regions, based on temporal changes of selected features, without trying to classify the content [Tzanetakis99a].

These two main divisions can be further divided, according to their processing domain, into time-domain features and frequency-domain features. Time-domain features are computed directly from the audio waveform and reflect temporal properties of the audio signal. However, some differences between distinct classes of audio signals become easier to identify in the frequency domain than in the time domain. The spectrum of an audio frame is a representation of its frequency content, but using the spectrum itself as a frame-level feature is impractical due to its high dimensionality. Consequently, more succinct descriptors may be computed from the spectrum, resulting in highly discriminative frequency-domain features. The following sections present a short review of the most common audio features used in the context of multimedia scene classification, according to the categorization described above.

1 In this chapter, the word “video” is used to refer to both the image frames and the audio waveform contained in an audiovisual signal.
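As a minimal illustration of the frame/clip organisation just described (a sketch added here, not part of the original deliverable; the 20 ms frame length, 10 ms hop and 1 s clip duration are assumptions taken from the ranges mentioned above), the following Python/NumPy fragment splits a PCM signal into overlapping short-term frames and groups the frames into clips:

import numpy as np

def frame_signal(x, fs, frame_ms=20.0, hop_ms=10.0):
    """Split a mono PCM signal into overlapping short-term frames."""
    frame_len = int(fs * frame_ms / 1000.0)
    hop_len = int(fs * hop_ms / 1000.0)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    return np.stack([x[i * hop_len:i * hop_len + frame_len]
                     for i in range(n_frames)])

def clip_indices(n_frames, hop_ms=10.0, clip_s=1.0):
    """Group consecutive frame indices into clip-level windows."""
    frames_per_clip = int(clip_s * 1000.0 / hop_ms)
    return [range(start, min(start + frames_per_clip, n_frames))
            for start in range(0, n_frames, frames_per_clip)]

# Example with 3 s of synthetic audio at 16 kHz.
fs = 16000
x = np.random.randn(3 * fs)
frames = frame_signal(x, fs)        # shape: (n_frames, frame_len)
clips = clip_indices(len(frames))   # list of frame-index ranges

The frame-level features of Section 2.2 would then operate on each row of frames, while the clip-level features of Section 2.3 would operate on each group of frame indices.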


2.2 Frame-Level Features

Audio features extracted at the frame level are those that capture the short-term characteristics of an audio signal.

2.2.1 VOLUME – SHORT TIME ENERGY (STE) - LOUDNESS

The easiest frame-level feature to compute is the volume, also known as loudness. Volume is a reliable indicator for silence detection, which may help to segment an audio sequence and to determine clip boundaries. It is approximated by the RMS (root mean square) value of the signal within a frame, i.e. the square root of the mean energy:

v(n) = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1} s_n^2(i)}

where N denotes the frame length and s_n(i) denotes the amplitude of the ith sample in the nth audio frame. The volume of an audio signal depends on the gain of the recording and digitizing devices. In order to eliminate the influence of such gain settings, the volume value in a frame may be normalized by the maximum volume of some previous frames. The volume feature, in its squared version, is also known as Short Time Energy (STE):

STE(n) = \frac{1}{N}\sum_{i=0}^{N-1} s_n^2(i) = v^2(n)

Similarly to the volume, STE provides, for speech signals, a basis for distinguishing voiced speech components from unvoiced speech components: the values of STE for unvoiced components are in general significantly smaller than those of the voiced components. STE can also be used to discriminate audible sounds from silence when the SNR is high. The duration of the analysis window is usually 20 ms, with 10 ms overlap. The volume feature can also be computed from the spectrum of the signal:

Vol(n) = \sqrt{\frac{1}{N}\sum_{k=0}^{N-1} \left|S_n(k)\right|^2}

where S_t is the continuous spectrum at a given time t, w is the continuous analysis window, S_n is the discrete spectrum at a given frame n, and N is the size of the discrete analysis window and the order of the DFT.
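The following NumPy sketch (an added illustration, not code from the deliverable) computes these quantities for a single frame; an extra 1/sqrt(N) factor is included in the spectral version so that, by Parseval's relation, it coincides with the time-domain RMS value:

import numpy as np

def volume(frame):
    """RMS volume v(n) of one audio frame (time domain)."""
    return np.sqrt(np.mean(frame ** 2))

def short_time_energy(frame):
    """STE(n): mean squared amplitude, i.e. the squared volume."""
    return np.mean(frame ** 2)

def volume_from_spectrum(frame):
    """Volume from the DFT magnitudes; the extra 1/sqrt(N) keeps it
    consistent with the time-domain definition (Parseval)."""
    S = np.fft.fft(frame)
    return np.sqrt(np.mean(np.abs(S) ** 2)) / np.sqrt(len(frame))

frame = np.random.randn(320)      # e.g. one 20 ms frame at 16 kHz
print(volume(frame), short_time_energy(frame), volume_from_spectrum(frame))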

2.2.2 ZERO CROSS RATE (ZCR)

Another time-domain feature is Zero Cross Rate (ZCR). The ZCR of a frame is computed by counting the number of times the audio waveform crosses the zero axis per time unit:

Z(n) = \frac{1}{2}\left(\sum_{i=1}^{N-1} \left|\,\mathrm{sign}\big(s_n(i)\big) - \mathrm{sign}\big(s_n(i-1)\big)\right|\right)\frac{f_s}{N}

where N denotes the frame length, sn(i) denotes the ith sample in the nth audio frame, fs represents the sampling rate and sign(.) is a sign function.
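A short NumPy sketch of this definition (an added example, not from the deliverable) counts the sign changes and scales them to crossings per second:

import numpy as np

def zero_cross_rate(frame, fs):
    """ZCR as defined above: half the sum of sign changes, scaled by fs/N."""
    signs = np.sign(frame)
    signs[signs == 0] = 1                   # treat exact zeros as positive
    crossings = 0.5 * np.sum(np.abs(np.diff(signs)))
    return crossings * fs / len(frame)

fs = 16000
t = np.arange(320) / fs
print(zero_cross_rate(np.sin(2 * np.pi * 440 * t), fs))   # roughly 2 * 440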


ZCR is a very useful measure to discern between voiced and unvoiced speech, because unvoiced speech typically has a low volume but a high ZCR. By combining ZCR and volume, it is possible to classify low-volume, unvoiced frames as silence. The ZCR is a correlate of the Frequency Centroid feature (described below in this text).

2.2.3 BAND ENERGY (BE) AND BAND ENERGY RATIO (BER OR ERSB)

The band energy is the energy content of a signal, at a given time, in a band of frequencies. It can be computed from the spectrogram as:

BE_{[f_0,f_1]}(t) = \frac{\int_{f_0}^{f_1} \left|S_t(f)\right|^2 \, df}{\int w(\tau)\, d\tau}

where f_0 and f_1 are the bounds of the frequency band, S_t is the spectrum at a given time t, and w is the analysis window. Considering that the energy distribution in different frequency bands varies quite significantly among different audio signals, it is also possible to use as frequency-domain features the ratios between the energies in the different subbands and the signal's total energy. This is known as the Band Energy Ratio (BER) or Energy Ratio Subband (ERSB) [Wang00], and can be defined as the proportion between the energy in each frequency band and the total volume of the signal:

ERSB_{[f_0,f_1]}(t) = BER_{[f_0,f_1]}(t) = \frac{\int_{f_0}^{f_1} \left|S_t(f)\right|^2 \, df}{\int w(\tau)\, d\tau \; Vol(t)} = \frac{BE_{[f_0,f_1]}(t)}{Vol(t)}

Having in mind the perceptual properties of the human ear, [Wang00] proposes that the entire frequency band be divided into four sub-bands, each consisting of the same number of critical bands (critical bands correspond to cochlear filters in the human auditory model): 0-630 Hz, 630-1720 Hz, 1720-4400 Hz and 4400-11025 Hz (considering a sampling rate of 22050 Hz). Because the sum of the four ERSBs is always one, only the first three are normally used as audio features; they are referred to by [Wang00] as ERSB1, ERSB2 and ERSB3, respectively.
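A simplified NumPy sketch of the band energy ratios (an added illustration; for brevity it normalizes by the total spectral energy of the frame rather than by the window-normalized volume used in the formula above, and it assumes the four [Wang00] sub-bands and a 22050 Hz sampling rate):

import numpy as np

def ersb(frame, fs=22050, edges=(0, 630, 1720, 4400, 11025)):
    """Energy ratio of each sub-band (ERSB1..ERSB4) for one frame."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    total = np.sum(power) + 1e-12                 # avoid division by zero
    ratios = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = power[(freqs >= lo) & (freqs < hi)]
        ratios.append(np.sum(band) / total)
    return ratios                                  # the four ratios sum to ~1

print(ersb(np.random.randn(512)))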

2.2.4 FREQUENCY CENTROID (FC)

This feature is the “balancing point” of the spectral power distribution. It is related to the human sensation of the brightness of a sound and gives discriminating results for music and speech, as well as for voiced and unvoiced speech. Its mathematical definition can be represented as

FC(n) = \frac{\int_{0}^{\infty} \omega\, \left|S_n(\omega)\right|^2 \, d\omega}{\int_{0}^{\infty} \left|S_n(\omega)\right|^2 \, d\omega}


where S_n(ω) denotes the spectrum of the nth frame (obtained from a Fourier Transform calculation) and ω represents the angular frequency. This feature has a high correlation with the Zero Crossing Rate (ZCR) feature.

2.2.5 BANDWIDTH (BW)

Directly related to the frequency centroid, and using the previously computed power spectral density, it is also possible to calculate its standard deviation, which provides a measure of the signal's effective bandwidth:

BW^2(n) = \frac{\int_{0}^{\infty} \big(\omega - FC(n)\big)^2\, \left|S_n(\omega)\right|^2 \, d\omega}{\int_{0}^{\infty} \left|S_n(\omega)\right|^2 \, d\omega}

where S_n(ω) denotes the spectrum of the nth frame (obtained from a Fourier Transform calculation), ω represents the angular frequency and FC(n) represents the frequency centroid value.
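Both quantities can be approximated on the discrete power spectrum of a frame; the NumPy sketch below (an added example, using frequency in Hz instead of angular frequency) returns the centroid and the bandwidth together:

import numpy as np

def frequency_centroid_and_bandwidth(frame, fs):
    """Discrete approximation of FC(n) and BW(n) from the power spectrum."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    total = np.sum(power) + 1e-12
    fc = np.sum(freqs * power) / total
    bw = np.sqrt(np.sum(((freqs - fc) ** 2) * power) / total)
    return fc, bw

fs = 16000
t = np.arange(1024) / fs
tone = np.sin(2 * np.pi * 1000 * t)                 # pure 1 kHz tone
print(frequency_centroid_and_bandwidth(tone, fs))   # FC near 1000 Hz, small BW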

2.2.6 SPECTRAL ROLLOFF POINT

The spectral roll-off point can be defined as the 95th percentile of the power spectrum, and provides a measure of the skewness of the spectral shape [Scheirer97]. A signal with a right-skewed spectral distribution presents a higher value of the Spectral Rolloff Point feature, which consequently provides a good capability to distinguish between voiced and unvoiced speech.

2.2.7 SPECTRAL FLATNESS MEASURES

Flatness oriented spectral features describe the flatness properties of the short-term power spectrum of an audio signal. This family of features expresses the deviation of the signal’s power spectrum over frequency from a flat shape (corresponding to a noise-like or impulse-like signal). A high deviation from a flat shape may indicate the presence of tonal components. Since the desired characteristics (tone vs noise-likeness) are attributed to specific frequency bands rather than the entire spectrum, these features will be applied on a frequency band basis.

Spectral Flatness Measure (SFM)

The Spectral Flatness Measure (SFM), a feature inherited from coding theory [Jayant84], can be computed as the ratio between the geometric and the arithmetic mean of the power spectrum in a specified frequency band. The SFM feature is also defined as a Low-Level Descriptor (LLD) in the audio sub-part of the MPEG7 standard (AudioSpectrumFlatness LLD [MPEG7.4]), and its computation is defined as:

SFM(b,n) = \frac{\left[\prod_{i=il(b)}^{ih(b)} S_n^2(i)\right]^{\frac{1}{ih(b)-il(b)+1}}}{\frac{1}{ih(b)-il(b)+1}\sum_{i=il(b)}^{ih(b)} S_n^2(i)}


where Sn is the spectrum at a given frame n, and il(b) and ih(b) are the lower and upper limits of a specified frequency band b. The MPEG7 standard also defines that this expression should return a value of 1 whenever there is no audio signal present (i.e. the mean power is zero) [MPEG7.4]. The MPEG7 standard also proposes a frequency band division rule, which follows a quarter octave relation to 1KHz, as described in the following equation [MPEG7.4]:

edge(m) = 2^{0.25\,m} \times 1\,\mathrm{kHz}

where m is an integer number.

Spectral Crest Factor (SCF)

Another very similar feature is the Spectral Crest Factor (SCF), which can be calculated as the ratio of the largest power spectrum density (PSD) coefficient and the mean PSD value in a frequency band (this can also be interpreted as the square of the spectral magnitude’s Crest Factor, i.e. maximum-to-RMS ratio):

SCF(b,n) = \frac{\max_{i \in [il(b),\, ih(b)]} S_n^2(i)}{\frac{1}{ih(b)-il(b)+1}\sum_{i=il(b)}^{ih(b)} S_n^2(i)}

Alternative versions of these basic features can be conceived, as well as non-linear variations (e.g. inverse values), which add up to the family of flatness oriented spectral features.
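Both flatness-oriented features can be sketched for a single frequency band as follows (an added NumPy illustration; the band is specified directly by DFT bin indices, and the small constant guarding the logarithm is an implementation assumption rather than part of the MPEG7 definition):

import numpy as np

def sfm_scf(frame, lo_bin, hi_bin):
    """Spectral flatness and crest factor over DFT bins lo_bin..hi_bin."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    band = power[lo_bin:hi_bin + 1] + 1e-12     # guard against log(0)
    geo_mean = np.exp(np.mean(np.log(band)))
    arith_mean = np.mean(band)
    sfm = geo_mean / arith_mean                 # near 1 for noise, near 0 for tones
    scf = np.max(band) / arith_mean
    return sfm, scf

print(sfm_scf(np.random.randn(1024), lo_bin=16, hi_bin=64))   # white noise: SFM close to 1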

2.2.8 CEPSTRAL COEFFICIENTS (CC)

In Speech Processing, the cepstral coefficients are used to obtain the formants from voiced phonemes. The information relevant to the formants is contained in the first coefficients of the cepstrum (i.e. for τ < τmax, where τmax is the period of the source). The cepstrum can be computed as the inverse Fourier transform of the logarithm of the spectrum:

Cep(t,\tau) = \frac{1}{2}\int \log S_t^2(f)\; e^{\,j 2\pi f \tau}\, df

where St is the spectrum of the signal in a frame located at time t.
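A short NumPy sketch of the real cepstrum of one frame (an added example; a small constant inside the logarithm avoids log(0)):

import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log power spectrum of the frame."""
    power = np.abs(np.fft.fft(frame)) ** 2 + 1e-12
    return np.real(np.fft.ifft(np.log(power)))

cep = real_cepstrum(np.random.randn(512))
print(cep[:8])    # the low-quefrency coefficients describe the spectral envelope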

Cepstrum Resynthesis Residual Magnitude

[Scheirer97] proposes a feature computed as the 2-norm of the residual vector after cepstral analysis, smoothing and resynthesis, known as the Cepstrum Resynthesis Residual Magnitude. By computing a real cepstral analysis, smoothing the spectrum, resynthesizing it and comparing the smoothed to the unsmoothed spectrum, a better fit is obtained for unvoiced speech than for voiced speech or music.

2.2.9 MEL FREQUENCY CEPSTRUM COEFFICIENTS (MFCC)

One of the most popular sets of features used to parameterize speech is the Mel-Frequency Cepstrum Coefficients (MFCC). They are based on the critical-band model of the human auditory system. Filters spaced linearly at low frequencies (below 1000 Hz) and logarithmically at high frequencies (above 1000 Hz) are used to capture the phonetically important characteristics of speech (mel-frequency scale). A block diagram of the structure of an MFCC processor is given in Figure 2.1.


Fig 2.1 Mel-Cepstral feature extraction

The speech signal is divided into frames of N samples, with adjacent frames separated by M (M < N) samples. A Hamming window is used to minimize the signal discontinuities at the borders of each frame. After this step the Fast Fourier Transform is applied and the absolute value is taken to obtain the magnitude spectrum. The signal is then processed by the Mel filter bank, as depicted in Figure 2.2.

Fig 2.2 Mel-spaced filter-bank of 12 Mel spectrum coefficients

The final step computes the cepstrum, in which the log mel spectrum is converted back to the time domain using the DCT.

2.2.10 PITCH OR FUNDAMENTAL FREQUENCY

An important parameter in the analysis of speech and music is the pitch. This feature is the fundamental frequency of an audio waveform, but normally only voiced speech and harmonic music have a well-defined pitch. The main strategies to obtain the pitch information of an audio signal are based on temporal and frequency estimation techniques.

Temporal estimation - Autocorrelation Function and AMDF

Temporal estimation is based on the Autocorrelation Function R_n(l) or the AMDF (Average Magnitude Difference Function) A_n(l), computed as:


R_n(l) = \sum_{i=0}^{N-1-l} s_n(i)\, s_n(i+l)

A_n(l) = \sum_{i=0}^{N-1-l} \left| s_n(i+l) - s_n(i) \right|

where N denotes the frame length and s_n(i) denotes the ith sample. The lag of the first peak of the autocorrelation function (its distance to the time origin) is a good estimate of the pitch period, and hence of the pitch frequency. In a similar way, the lag of the first valley of the AMDF function is a good indication of the pitch period.
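A hedged NumPy sketch of the autocorrelation approach (an added example; the 60-400 Hz search range is an assumption suitable for speech):

import numpy as np

def pitch_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch estimate from the strongest autocorrelation peak in a lag range."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags >= 0
    lag_min = int(fs / fmax)                     # shortest admissible period
    lag_max = min(int(fs / fmin), len(r) - 1)    # longest admissible period
    best_lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    return fs / best_lag                         # period -> fundamental frequency

fs = 16000
t = np.arange(int(0.03 * fs)) / fs
frame = np.sign(np.sin(2 * np.pi * 120 * t))     # crude 120 Hz "voiced" frame
print(pitch_autocorr(frame, fs))                 # expected near 120 Hz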

Frequency estimation

Frequency estimation, based on the Fourier transform or on a cepstral analysis of a frame, relies on examining the periodic structure of the magnitude of the coefficients. The pitch may be determined accurately by analysing the spectrum and finding the maximum common divisor of all the local magnitude peaks [Pfeiffer96], or by finding harmonic relations between the most prominent magnitude peaks [Ferreira2001].

2.3 Clip-Level Features

As stated earlier, frame level features are designed to capture the short-term characteristics of an audio signal. However, if a higher semantic content analysis is to be performed, it is necessary to observe the temporal variation of frame features during longer time intervals. This leads to the development of various clip-level features, which characterize how frame-level features change over a clip.

2.3.1 VOLUME-BASED

Standard Deviation normalized by the maximum volume in a clip (VSTD)

It is possible to compute statistical measures from the volume feature over the duration of a clip, such as the standard deviation normalized by the maximum volume in a clip (VSTD) or the mean value of the volume within a clip. Although the mean value of the volume does not seem to have a high discrimination power, its temporal variance may reflect the scene content [Wang00].

Volume Dynamic Range (VDR)

The Volume Dynamic Range (VDR) is another feature proposed in [Wang00], and is defined as (max(v) – min(v))/max(v), where min(v) and max(v) are the minimum and maximum volume within an audio clip. Obviously these two features are correlated, but they do carry some independent information about the scene content.

Volume Undulation (VU)

Another feature is Volume Undulation (VU), which is the accumulation of the difference of neighbouring peaks and valleys of the volume contour within a clip.

4-Hz modulation energy (4ME or FCVC4)

The volume contour of a speech waveform typically peaks at 4 Hz, so a parameter called 4-Hz modulation energy (4ME) can be calculated based on the energy distribution in 40 subbands (this feature is also known as the Frequency Component of the Volume Contour around 4 Hz – FCVC4). A way to compute this parameter can be derived directly from the volume envelope, and it is defined as:


4ME = FCVC4 = \frac{\int_{0}^{\infty} W(\omega)\, \left|C(\omega)\right|^2 \, d\omega}{\int_{0}^{\infty} \left|C(\omega)\right|^2 \, d\omega}

where C(ω) is the Fourier transform of the volume contour of a given clip and W(ω) is a triangular window function centred at 4 Hz. Speech clips usually have higher values of 4ME than music or noise clips.

Pulse Metric

Pulse Metric is a feature proposed by [Scheirer97], which uses long-time band-passed autocorrelations to determine the amount of “rhythmicness” in a 5-second window. This feature is able to identify a strong, driving beat (e.g. techno, salsa, straight-ahead rock-and-roll) in the signal, but cannot detect a rhythmic pulse in signals with rubato or other tempo changes. The computation of this feature is based on the observation that a strong beat leads to broadband rhythmic modulation in the signal as a whole (i.e. no matter what band of the signal is considered, it is always possible to detect the same rhythmic regularities). Consequently, the algorithm divides the signal into six bands and finds the peaks in the envelopes of each band. These peaks correspond roughly to perceptual onsets, and the algorithm looks for rhythmic modulation in each onset track using autocorrelations. It then selects the autocorrelation peaks as a description of all the frequencies at which rhythmic modulations were found in that band. It is then necessary to compare band by band to understand how often it is possible to find the same pattern of autocorrelation peaks in each. If many peaks are present at similar modulation frequencies across all bands, a high value of the pulse metric is given [Scheirer97].

2.3.2 ENERGY BASED

Low Short Time Energy Ratio (LSTER)

The Low Short Time Energy Ratio (LSTER) is defined as the ratio of the number of frames whose STE is less than 50% of the average short time energy in a 1 s window to the total number of frames:

LSTER = \frac{1}{2N}\sum_{n=0}^{N-1} \left[\mathrm{sgn}\big(0.5\, avSTE - STE(n)\big) + 1\right]

where N is the total number of frames, STE(n) is the short-time energy at the nth frame, and avSTE is the average STE in a 1 second window. LSTER is an effective measure to distinguish between speech and music. Since there are more silence frames in speech than in music, the LSTER measure of speech will be much higher than that of music.
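Given the per-frame STE values of a 1 s window, LSTER can be sketched directly from the formula above (an added NumPy example):

import numpy as np

def lster(frame_energies):
    """Low short-time energy ratio over one window of frame STE values."""
    ste = np.asarray(frame_energies, dtype=float)
    av_ste = np.mean(ste)
    # sgn(0.5*avSTE - STE(n)) + 1 equals 2 for low-energy frames, 0 otherwise
    return np.mean(np.sign(0.5 * av_ste - ste) + 1.0) / 2.0

energies = np.concatenate([np.full(60, 1.0), np.full(40, 0.01)])   # speech-like
print(lster(energies))   # fraction of frames below half the average STE (0.4 here)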

Energy Entropy

The Energy Entropy is another energy-based feature, computed by dividing each audio frame into segments of K samples each. The signal energy is computed over each of these segments and normalized by the overall frame energy. Then, the energy entropy I of the clip is defined as follows:

I = -\sum_{i=1}^{J} \sigma_i^2 \log_2 \sigma_i^2


where J is the total number of segments in the clip and σ_i^2 is the normalized energy of the ith segment. This parameter is useful to detect audio clips containing burst sounds (violent scene detection).
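A simplified clip-level NumPy sketch of this entropy (an added example; the number of segments J is an assumption):

import numpy as np

def energy_entropy(clip, n_segments=32):
    """Entropy of the normalized segment energies within one clip."""
    clip = clip[: (len(clip) // n_segments) * n_segments]
    segments = clip.reshape(n_segments, -1)
    energies = np.sum(segments ** 2, axis=1)
    sigma2 = energies / (np.sum(energies) + 1e-12)    # normalized energies
    return -np.sum(sigma2 * np.log2(sigma2 + 1e-12))

steady = np.random.randn(16000)                       # energy spread evenly
burst = np.zeros(16000)
burst[8000:8400] = np.random.randn(400)               # energy in one segment only
print(energy_entropy(steady), energy_entropy(burst))  # the burst clip scores lower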

2.3.3 ZCR-BASED

Standard Deviation of the ZCR (ZSTD)

Some researchers have reported the usefulness of the standard deviation of the ZCR (ZSTD) to differentiate between TV program categories. According to [Saunders96], statistics of the ZCR can be used to discriminate between speech and music audio segments with a high classification accuracy.

High Zero Crossing Rate Ratio (HZCRR)

[Lu02] proposes a more discriminative feature based on the ZCR, known as the High Zero Crossing Rate Ratio (HZCRR). This feature is defined as the ratio of the number of frames whose ZCR is above 1.5 times the average zero crossing rate in a 1-second window:

HZCRR = \frac{1}{2N}\sum_{n=0}^{N-1} \left[\mathrm{sgn}\big(ZCR(n) - 1.5\, avZCR\big) + 1\right]

HZCRR can be used for speech/music discrimination, since the variation of the zero crossing rate is in general greater for speech signals than for music.

2.3.4 NON-SILENCE RATIO (NSR)

The Non-Silence Ratio (NSR) is another feature, proposed in [Liu98], which reflects the ratio of the number of non-silent frames to the total number of frames in a clip, where silence detection is normally based on both Volume and ZCR.

2.3.5 NOISE FRAME RATIO (NFR)

NFR describes the ratio of noise frames in a given audio segment. A frame is considered as a noise frame if the maximum local peak of its normalized correlation function is lower than a preset threshold. The NFR values of environmental sound are higher than those of music.

2.3.6 PITCH-BASED

It is not easy to derive the scene content directly from the pitch level of isolated frames, but the dynamics of the pitch envelope over successive frames appear to be more revealing of the scene content. It is therefore possible to compute the following clip-level features to capture the variation of the pitch:

Pitch Standard Deviation (PSTD)

Instead of using the absolute value of the pitch information in each frame, better discriminative power can be achieved by using the standard deviation of the pitch values of adjacent frames.


Smooth Pitch Ratio (SPR)

SPR is the percentage of frames in a clip that have a pitch similar to that of the previous frames. This feature is used to measure the percentage of voiced or music frames within a clip, since only voiced speech and music have a smooth pitch.

Non Pitch Ratio (NPR)

NPR is the percentage of frames without pitch, and it is used to measure the percentage of unvoiced speech or noise within a clip.

2.3.7 SPECTRUM FLUX (SF)

Spectrum flux (SF) is defined as the average variation of the spectrum between two adjacent frames in a window:

SF = \frac{1}{(N-1)(K-1)}\sum_{n=1}^{N-1}\sum_{k=1}^{K-1} \Big[\log\big(A(n,k)+\delta\big) - \log\big(A(n-1,k)+\delta\big)\Big]^2

where A(n,k) is the discrete Fourier transform of the nth frame of the input signal, N is the total number of frames, K is the order of the DFT and δ is a very small value used to avoid calculation overflow. According to [Lu02], the SF values of speech are higher than those of music, and environmental sounds present the highest values, with more changes. Therefore, SF is a good feature to discriminate speech, environmental sound and music, and it can be used both for speech/non-speech classification and for music/environmental-sound classification.

2.3.8 BAND PERIODICITY (BP)

Band periodicity (BP) describes the periodicity of chosen sub-bands. It can be derived by sub-band correlation analysis. The normalized correlation function is calculated from the current and previous frame:

r_{i,j}(k) = \frac{\sum_{m=0}^{M-1} s(m-k)\, s(m)}{\sqrt{\sum_{m=0}^{M-1} s^2(m-k)\; \sum_{m=0}^{M-1} s^2(m)}}

where r_{i,j}(k) is the normalized correlation function, i is the band index and j is the frame index. s(m) is the sub-band digital signal spanning the current and the previous frame: when m >= 0 the data come from the current frame, otherwise the data come from the previous frame. M is the total length of a frame. The band periodicity of each sub-band i can be represented by the maximum local peak of the normalized correlation function, and is calculated as:

bp_i = \frac{1}{N}\sum_{j=1}^{N} r_{i,j}(k_p), \qquad i = 1, 2, \ldots, 4

where r_{i,j}(k_p) is the maximum local peak, k_p is the index of the maximum local peak, i is the band index and j is the frame index. In other words, r_{i,j}(k_p) is the band periodicity of the ith sub-band of the jth frame.
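For one sub-band of one frame, the maximum of the normalized correlation can be sketched as below (an added NumPy example; in the full feature this value is computed per sub-band and then averaged over the frames of a clip, and the maximum lag searched is an assumption):

import numpy as np

def band_periodicity_frame(prev_frame, cur_frame, max_lag=400):
    """Maximum of the normalized correlation between the current sub-band
    frame and data reaching back into the previous frame."""
    M = len(cur_frame)
    both = np.concatenate([prev_frame, cur_frame])   # so that index m - k < 0 is defined
    best = 0.0
    for k in range(1, min(max_lag, M)):
        shifted = both[M - k:2 * M - k]              # s(m - k) for m = 0..M-1
        num = np.sum(shifted * cur_frame)
        den = np.sqrt(np.sum(shifted ** 2) * np.sum(cur_frame ** 2)) + 1e-12
        best = max(best, num / den)
    return best

fs = 16000
t = np.arange(512) / fs
tone = np.sin(2 * np.pi * 500 * t)
print(band_periodicity_frame(tone, tone))                                   # close to 1
print(band_periodicity_frame(np.random.randn(512), np.random.randn(512)))   # much lower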


BPs of music are in general much higher than those of environmental sound, due to the fact that music is more harmonic while environmental sound is more random. Band periodicity is therefore an effective feature to discriminate between music and environmental sound.

2.3.9 LSP DISTANCE MEASURE

Linear spectral pairs (LSPs) are derived from the linear predictive coefficients (LPC), and have been shown to have good discriminative power and to be highly robust in noisy conditions [Lu02]. To measure the LSP dissimilarity between two 1-second audio clips, an abbreviated K-L distance measure may be used:

D = \frac{1}{2}\,\mathrm{tr}\!\left[(C_{LSP} - C_{SP})(C_{SP}^{-1} - C_{LSP}^{-1})\right] + \frac{1}{2}\,\mathrm{tr}\!\left[(C_{LSP}^{-1} + C_{SP}^{-1})(u_{LSP} - u_{SP})(u_{LSP} - u_{SP})^T\right]

where CLSP and CSP are the estimated LSP covariance matrices and uLSP and uSP are the estimated mean vectors, from each of the two audio clips, respectively. The first part of this distance measure is determined by the covariance of the two segments, and the second part is determined by the covariance and the mean. Because the mean is easily biased by distinct environmental conditions, the second part is normally discarded, and only the first part of the above expression is used to represent the distance [Lu02]:

D = \frac{1}{2}\,\mathrm{tr}\!\left[(C_{LSP} - C_{SP})(C_{SP}^{-1} - C_{LSP}^{-1})\right]

This abbreviated distance measure is normally called the divergence shape distance, and is similar to the cepstral mean subtraction (CMS) method used in speaker recognition to compensate for the effect of environment conditions and transmission channels. This dissimilarity measure provides good results when used to discriminate speech and noisy speech from music. Furthermore, the LSP divergence shape has also proved to be a good feature for the discrimination of different speakers [Lu02].
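Given per-frame LSP vectors for two clips, the divergence shape distance reduces to a few NumPy operations (an added sketch; the LSP extraction itself, from the LPC of each frame, is not shown):

import numpy as np

def divergence_shape(lsp_a, lsp_b):
    """Divergence shape distance between two clips of LSP feature vectors.
    lsp_a, lsp_b: arrays of shape (n_frames, lsp_order)."""
    ca = np.cov(lsp_a, rowvar=False)
    cb = np.cov(lsp_b, rowvar=False)
    return 0.5 * np.trace((ca - cb) @ (np.linalg.inv(cb) - np.linalg.inv(ca)))

rng = np.random.default_rng(0)
clip_a = rng.normal(size=(100, 10))
clip_b = 0.5 * rng.normal(size=(100, 10))       # different spread -> larger distance
print(divergence_shape(clip_a, clip_a[::-1]))   # identical statistics: 0
print(divergence_shape(clip_a, clip_b))         # clearly positive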

2.3.10 COMPRESSED DOMAIN AUDIO FEATURES

The previous paragraphs presented a description of the most well-known audio features used for the content analysis of digital audio signals represented in the PCM (Pulse Code Modulation) format. However, most of the audio content available today is in a compressed format, such as MPEG1-Layer3 (i.e. MP3), or is embedded in a compressed video format which includes compressed audio and image signals. There are basically two approaches to analyzing audio signals in the compressed domain. The first one consists in decompressing the audio signal to the PCM format and directly using the techniques listed previously in this text. This approach, although straightforward, suffers from the inefficiency of always having to completely decode the compressed audio to the PCM format. Furthermore, such an approach does not take advantage of some of the complex analysis processing that is already performed by the most recent and advanced audio coders during audio compression (e.g. MPEG1-Layer3 or AAC). Many of the stages required for the extraction of some of the features presented in this section (e.g. the spectral and multi-band analysis steps) are already performed during the compression of audio signals, and their results are in fact what constitutes the compressed representation of the original audio signals. Consequently, a much more efficient approach is to try to extract relevant audio features (the same or very similar to the ones presented previously in this chapter) directly from the compressed domain, avoiding the computational cost of completely decoding the audio signal.

VISNET/WP4.3/D40/V1.0 Page 19/70

Page 20: Networked Audiovisual Media Technologies VISNET IST …ee98235/Files/d40_final.pdfNetworked Audiovisual Media Technologies VISNET IST-2003-506946 D40 Review of the work done in Audio-Video

VISNET – NoE IST-2003-5006946 D40 – Review of the work done in Audio-Video fusion

normally fine-tuned to the compression task, and consequently may not be the most adequate to the extraction of discriminating features. Since most of the audio features to be extracted are in fact the same or very similar to the ones presented in this section, the description of the particularities and specific problems encountered when trying to extract relevant audio features from the compressed domain is considered to be out of the scope of this document. The interested reader can obtain detailed information about this topic in [Pfeiffer00, Pfeiffer01, Tzanetakis00]. 2.4 Bibliography

[Ferreira2001] Aníbal J. S. Ferreira, “Accurate Estimation in the ODFT Domain of the Frequency, Phase and Magnitude of Stationary Sinusoids”, to be presented at the 2001 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 21-24 October 2001, New Paltz, New York.
[Herre01] J. Herre, E. Allamanche, and O. Helmuth, “Robust Matching of Audio Signals Using Spectral Flatness Features”, in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.
[Jayant84] N. Jayant and P. Noll, “Digital Coding of Waveforms”, Prentice-Hall, Englewood Cliffs, NJ, 1984.
[Kimber96] D. Kimber and L. Wilcox, “Acoustic Segmentation for Audio Browsers”, in Proceedings of the Interface Conference, Sydney, Australia, July 1996.
[Lambrou98] T. Lambrou, P. Kudumakis, R. Speller, M. Sandler and Al Linney, “Classification of audio signals using statistical features on time and wavelet transform domains”, in Proc. IEEE ICASSP 98, May 1998, Seattle, USA.
[Lienhart99] R. Lienhart, S. Pfeiffer and W. Effelsberg, “Scene determination based on video and audio features”, Multimedia Tools and Applications, 15(1):59-81, 2001.
[Liu98] Z. Liu, Y. Wang, T. Chen, “Audio feature extraction and analysis for scene segmentation and classification”, J. VLSI Signal Processing Syst. Signal, Image, Video Technol., vol. 20, pp. 61-79, Oct. 1998.
[Lu02] L. Lu, H.-J. Zhang, and H. Jiang, “Content analysis for audio classification and segmentation”, IEEE Transactions on Speech and Audio Processing, October 2002.
[MPEG7.4] “Information Technology – Multimedia Content Description Interface – Part 4: Audio”, ISO/IEC CD 15938-4.
[Peltonen02] V. Peltonen, J. Tuomi and A. Klapuri, “Computational auditory scene recognition”, in Proc. International Conference on Acoustics, Speech, and Signal Processing, Orlando, Florida, May 2002.
[Pfeiffer96] S. Pfeiffer, S. Fischer and W. Effelsberg, “Automatic audio content analysis”, in Proc. ACM Multimedia, 1996, pp. 21-30.
[Pfeiffer00] S. Pfeiffer, J. Robert-Ribes, D. Kim, “Audio Content Extraction from MPEG-encoded sequences”, First International Workshop on Intelligent Multimedia Computing and Networking, Proc. Fifth Joint Conference on Information Sciences, JCIS 2000, Atlantic City, New Jersey, February 27 to March 3, 2000, pp. 513-516.
[Pfeiffer01] S. Pfeiffer, T. Vincent, “Formalisation of MPEG-1 compressed domain audio features”, CSIRO Mathematical and Information Sciences Technical Report nr. 01/196, December 2001.


[Saraceno98] C. Saraceno, R. Leonardi, “Identification of story units in audio-visual sequences by joint audio and video processing”, in Proc. Int. Conf. Image Processing (ICIP-98), vol. 1, pp. 363-367, Chicago, IL, Oct. 4-7, 1998.
[Saunders96] J. Saunders, “Real-time discrimination of broadcast speech/music”, Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing (Atlanta, GA), pp. 993-996, May 1996.
[Scheirer97] E. Scheirer and M. Slaney, “Construction and Evaluation of a robust multifeature speech/music discriminator”, in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1997.
[Tzanetakis99a] G. Tzanetakis and P. Cook, “A Framework for Audio Analysis based on Classification and Temporal Segmentation”, in Proc. Euromicro, Workshop on Music Technology and Audio Processing, Milan, Italy, 1999.
[Tzanetakis99b] G. Tzanetakis and P. Cook, “Multi-feature audio segmentation for browsing and annotation”, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 103-106, New Paltz, NY, USA, October 1999.
[Tzanetakis00] G. Tzanetakis, P. Cook, “Sound analysis using MPEG compressed audio”, in Proc. Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Istanbul, Turkey, 2000.
[VISNET.D29] VISNET Deliverable D29 – “Audio and Speech Analysis System Overview”, VISNET FP6 NoE - WP4.1, 2004.
[Wang97] Y. Wang, J. Huang and Z. Liu, “Multimedia content classification using motion and audio information”, IEEE International Symposium on Circuits and Systems, Hong Kong, June 9-12, 1997.
[Wang00] Y. Wang, Z. Liu and J. C. Huang, “Multimedia content analysis using both audio and visual clues”, IEEE Signal Processing Magazine, 2000.
[Wold96] E. Wold, T. Blum, D. Keislar and J. Wheaton, “Content-based classification, search and retrieval of audio”, IEEE Multimedia 3(3), pp. 27-36, Fall 1996.
[Zhang01] T. Zhang and C.-C. J. Kuo, “Hierarchical classification of audio data for archiving and retrieving”, in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Vol. 6, Phoenix, 1999.


3. MEASURING VIDEO FEATURES

3.1 Introduction

The goal of measuring video features is to transform the data in a way that is more appropriate for indexing. These transformations include reducing the number of variables required to capture the structure within the data and removing redundant or irrelevant information. The transformed data may be processed further to extract features.

Many researchers have proposed representing a video by a set of key-frames. In this way, the techniques developed for image database indexing can be used for video indexing [Zhang97], [Zhong95], [Hampapur98], [Bolle98]. However, motion-based indexing has been shown to significantly improve the performance of still-image-based indexing [Chang98]. In this section we present and briefly describe some of the most representative low-level video features that are currently used to index video, together with the techniques available to extract and measure those features.

3.2 Color

Color is one of the most discriminative visual features. Due to variations in the recording or the perception of color, an efficient color representation is required. RGB representations are widely used; however, they are not well suited to capturing invariant features. For image indexing, the HSV representation is often selected because it conforms better to human perceptual similarity of colors (i.e. two colors that are perceived as similar by humans are closer in the HSV space than two dissimilar colors). In addition, hue is invariant to the orientation of the object with respect to the illumination and camera direction.

Color constancy is a feature of the human color-perception system which ensures that the perceived color of objects remains (almost) constant under varying light conditions. In [Funt95] color constancy is used to make the color representation invariant to illumination.

Once the color is represented in an efficient way, features are extracted. A simple but very effective approach is to use the histogram. The original idea of using histograms for indexing comes from Swain and Ballard [Swain91], who showed that color is more discriminative than gray-level information. Swain and Ballard also argue that color histograms are robust to changes in viewpoint and scale and to occlusion.
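As an illustration of histogram-based color indexing in the spirit of [Swain91], the sketch below builds a coarse HSV histogram and compares two images with the histogram-intersection similarity; the bin counts and helper names are illustrative assumptions, not taken from the cited work.

```python
# Sketch of histogram-based color indexing: quantize each pixel's HSV value into
# a coarse 3-D histogram and compare two images with histogram intersection.
# Bin counts and helper names are illustrative.
import numpy as np
import colorsys

def hsv_histogram(rgb_image: np.ndarray, bins=(16, 4, 4)) -> np.ndarray:
    """rgb_image: H x W x 3 array with values in [0, 1]. Returns a normalized histogram."""
    pixels = rgb_image.reshape(-1, 3)
    hsv = np.array([colorsys.rgb_to_hsv(*p) for p in pixels])
    hist, _ = np.histogramdd(hsv, bins=bins, range=((0, 1), (0, 1), (0, 1)))
    return hist.ravel() / hist.sum()

def histogram_intersection(h1: np.ndarray, h2: np.ndarray) -> float:
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return float(np.minimum(h1, h2).sum())

# Usage: compare a key-frame against a query image (synthetic data here).
img_a = np.random.rand(64, 64, 3)
img_b = np.random.rand(64, 64, 3)
print(histogram_intersection(hsv_histogram(img_a), hsv_histogram(img_b)))
```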

Color moments [Stricker96] and color correlograms [Huang97] can also be used to characterize the color information. To overcome the problem of high-dimensionality, SVD [Hafner95], dominant color regions [Zhang95] [Ravishankar99] and color clustering [Wan98] have been proposed.

3.3 Shape

Shape is also an important feature for perceptual object recognition and classification. Many techniques, including chain code, polygonal approximations, curvature, Fourier descriptors and moment descriptors have been proposed for measuring shape for image indexing.

In [Flickner95], [Mohamad95] features such as moment invariants and region area have been used, but they do not capture perceptual shape similarity. Cortelazzo [Cortelazo94] used chain codes for trademark image shape description together with a string-matching technique; however, the chain codes are not normalized and string matching is not invariant to shape scale. Jain and Vailaya [Jain95] proposed a shape representation based on a histogram of edge directions, but these histograms are not scale-normalized and the similarity measures are computationally expensive. The Curvature Scale-Space representation [Mokhtarian96] has been proposed to characterize the shape of an object or region based on its contour. This representation has a number of advantages, such as robustness to non-rigid motion, partial occlusion, and perspective transformations due to camera motion.

3.4 Texture

Many algorithms have been proposed for image indexing using texture. Filtering approaches are the most commonly used techniques for the extraction of texture features. They include Laws masks [Laws80], ring/wedge filters [Coggins85], dyadic Gabor filter-banks [Jain91], discrete cosine transforms [Tan92], and optimized Gabor filters [Bovik90]. An extensive evaluation of filtering approaches has been reported in [Randen99]; one of its conclusions is that wavelet-based techniques achieve good performance.

3.5 Motion

The main goal of motion-based indexing is to capture the essential motion characteristics of the motion field. Since motion vectors are easy to extract from the compressed bitstream, motion-feature extraction in the MPEG-1/2 compressed domain has been proposed [Meng96]. The most important motion descriptors are motion activity, camera motion, and motion trajectory. Motion activity gives an idea of the intensity of the action: for example, the motion activity of a “goal scoring” scene is perceived as high, whereas the motion activity of a news-anchor scene is considered low. In [Divakaran00] a technique for measuring motion activity is presented, in which the magnitude of the motion vectors is used as a measure of the intensity of motion activity in a macroblock. Camera motion reflects the intended focus of the viewer’s attention and can thus be used to discriminate events [Jeannin00]. The motion trajectory of an object is defined as the localization, in time and space, of one representative point of this object; it is a simple high-level description which enables video indexing.
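A hedged sketch of a motion-activity measure along the lines of [Divakaran00] is given below: the intensity of activity is summarized from the magnitudes of the macroblock motion vectors. The motion-field array layout and the statistics chosen are illustrative assumptions, not a codec-defined interface.

```python
# Hedged sketch of a motion-activity measure: summarize the magnitudes of the
# macroblock motion vectors of a P-frame. The (rows x cols x 2) layout is an
# assumption made for illustration only.
import numpy as np

def motion_activity(motion_field: np.ndarray) -> dict:
    """motion_field: array of shape (mb_rows, mb_cols, 2) with (dx, dy) per macroblock."""
    magnitudes = np.linalg.norm(motion_field, axis=-1)
    return {
        "mean_magnitude": float(magnitudes.mean()),  # average intensity of motion
        "std_magnitude": float(magnitudes.std()),    # spatial spread of the activity
    }

# A "goal scoring" shot would typically yield a much larger mean magnitude
# than a static news-anchor shot.
field = np.random.randn(36, 45, 2) * 4.0   # synthetic motion field
print(motion_activity(field))
```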

3.6 Bibliography

[Zhang97] Zhang HJ, Wu J, Zhong D, Smoliar SW (1997) An integrated system for content-based video retrieval and browsing. Pattern Recognition 30: 643-653.
[Zhong95] Zhong D, Zhang HJ, Chang SF (1995) Clustering methods for video browsing and annotation. In: Proc. SPIE Conference on Storage and Retrieval for Image and Video Databases, San Jose, CA.
[Hampapur98] Hampapur A, Jain R, Weymouth T (1994) Digital video segmentation. ACM Multimedia 94: 357-364.
[Bolle98] Bolle RM, Yeo BL, Yeung M (1998) Video query: Beyond the keywords. IBM J. Res. Dev. (see also IBM Res. Rep. RC20586, 1996).
[Chang98] S. F. Chang, W. Chen et al., “A fully automated content-based video search engine supporting multi-objects spatio-temporal queries”, IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 602-615, Sept. 1998.
[Funt95] B. V. Funt and G. D. Finlayson, “Color Constant Color Indexing”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 5, pp. 522-529, May 1995.
[Swain91] M. J. Swain and B. H. Ballard, “Color Indexing”, Int'l J. Computer Vision, vol. 7, no. 1, pp. 11-32, 1991.
[Stricker96] M. Stricker and A. Dimai, “Color indexing with weak spatial constraints”, Proceedings of SPIE Storage and Retrieval of Still Image and Video Databases IV, Vol. 2670, pp. 29-40, 1996.
[Huang97] J. Huang et al., “Image indexing using color correlograms”, Proceedings of CVPR, pp. 762-768, 1997.
[Hafner95] J. Hafner et al., “Efficient color histogram indexing for quadratic form distance functions”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 17, No. 7, pp. 729-736, July 1995.


[Zhang95] H. Zhang et al., “Image retrieval based on color features: an evaluation study”, Proceedings of SPIE, Vol. 2606, pp. 212-220, 1995.
[Ravishankar99] K. C. Ravishankar, B. G. Prasad, S. K. Gupta and K. K. Biswas, “Dominant Color Region Based Indexing Technique for CBIR”, in Proceedings of the International Conference on Image Analysis and Processing (ICIAP'99), Venice, Italy, Sept. 1999, pp. 887-892.
[Wan98] X. Wan and C. J. Kuo, “A multiresolution color clustering approach to image indexing and retrieval”, Proceedings of ICASSP, 1998.
[Flickner95] M. Flickner et al., “Query by image and video content: the QBIC system”, IEEE Computer, 28(9), pp. 23-32, 1995.
[Mohamad95] D. Mohamad, G. Sulong and S. S. Ipson, “Trademark Matching using Invariant Moments”, Second Asian Conference on Computer Vision, Singapore, 5-8 Dec. 1995.
[Cortelazo94] G. Cortelazzo et al., “Trademark Shapes Description by String-Matching Techniques”, Pattern Recognition, 27(8), pp. 1005-1018, 1994.
[Jain95] A. K. Jain and A. Vailaya, “Image Retrieval using Color and Shape”, Second Asian Conference on Computer Vision, Singapore, 5-8 Dec. 1995, pp. 529-533.
[Mokhtarian96] F. Mokhtarian, S. Abbasi and J. Kittler, “Robust and Efficient Shape Indexing through Curvature Scale Space”, Proc. British Machine Vision Conference, pp. 53-62, Edinburgh, UK, 1996.
[Laws80] K. I. Laws, “Rapid Texture Identification”, Proc. SPIE Conf. Image Processing for Missile Guidance, pp. 376-380, 1980.
[Coggins85] J. M. Coggins and A. K. Jain, “A Spatial Filtering Approach to Texture Analysis”, Pattern Recognition Letters, vol. 3, no. 3, pp. 195-203, 1985.
[Jain91] A. K. Jain and F. Farrokhnia, “Unsupervised Texture Segmentation Using Gabor Filters”, Pattern Recognition, vol. 24, no. 12, pp. 1167-1186, 1991.
[Tan92] I. Ng, T. Tan, and J. Kittler, “On Local Linear Transform and Gabor Filter Representation of Texture”, Proc. Int’l Conf. Pattern Recognition, pp. 627-631, Int’l Assoc. for Pattern Recognition, 1992.
[Bovik90] A. C. Bovik, M. Clark, and W. S. Geisler, “Multichannel Texture Analysis Using Localized Spatial Filters”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, pp. 55-73, Jan. 1990.
[Randen99] T. Randen and J. H. Husoy, “Filtering for texture classification: A comparative study”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4):291-310, 1999.
[Meng96] J. Meng and S.-F. Chang, “CVEPS – A Compressed Video Editing and Parsing System”, Proc. of ACM Multimedia 96, Boston, MA, Nov. 1996.
[Divakaran00] A. Divakaran, H. Sun, “Descriptor for Spatial Distribution of Motion Activity for Compressed Video”, SPIE Conference on Storage and Retrieval for Multimedia Databases, Vol. 3972, pp. 392-898, January 2000.
[Jeannin00] S. Jeannin, R. Jasinschi et al., “Motion descriptors for content-based video representation”, Signal Processing: Image Communication, vol. 16, no. 1-2, pp. 59-85, Sept. 2000.


4. STATISTICAL PATTERN RECOGNITION: A REVIEW

4.1 Introduction

The problem of classification can be formulated as follows: a given pattern has to be assigned to one of the C classes ω_1, …, ω_C on the basis of a p-dimensional data vector x = (x_1, …, x_p)^T, whose components are measurements of the features of an object. A decision rule partitions the measurement space into C regions Ω_i, i = 1, …, C. If a measurement vector x lies in Ω_i, then it is assumed to belong to class ω_i. The boundaries between the regions are the decision boundaries. Examples of patterns are measurements on a patient to identify a disease, measurements of an acoustic waveform for speech recognition, and a digital image for face recognition.

A typical pattern recognition system (see Fig 4.1) involves three modules: pre-processing, feature extraction/selection, and classification. The goal of the pre-processing module is to represent the pattern of interest in a compact way; it includes segmentation, normalization and noise removal operations. The feature extraction/selection module finds the appropriate features for representing the input patterns. The classifier yields an estimate of the class to which the pattern belongs. In order to design a classifier (i.e. to specify its parameters), a set of patterns of known class (the training set or design set) is used.

Fig 4.1 A typical pattern recognition system: the input pattern is pre-processed into a compact pattern representation, a feature pattern is obtained by feature extraction/selection, and the classifier produces the decision.

This section is divided into two parts: Classifiers and Classifier Combination. The first part introduces two approaches to statistical pattern recognition. The first approach is based on the Bayes rule and assumes knowledge of the class-conditional probability density functions. The second approach uses the data to estimate the decision boundary directly, without explicit calculation of the probability density functions. The second part addresses the problem of combining more than one classifier to achieve the final solution to a complex problem; this approach is generally used in applications where more than one information source is available. The main principles of classifier combination are discussed in this section. The next section is devoted to information fusion and extends the concepts presented here to the problem of combining classifiers working with audio and video signals.

4.2 Classifiers

4.2.1 BAYESIAN APPROACH

We assume that the a priori distributions and the class-conditional distributions are known. Parametric (e.g. estimating the parameters of a normal distribution model from the data) or non-parametric (e.g. kernel density estimation) methods are used to estimate the class-conditional distributions. The Bayes rule for minimum error can be formulated as follows: assign the pattern x to class ω_j if

\[ p(x|\omega_j)\, P(\omega_j) > p(x|\omega_k)\, P(\omega_k), \qquad k = 1, \ldots, C,\; k \neq j \]

Another variant of the Bayesian approach is based on minimizing the expected loss or risk. The loss λ_{ji} is defined as the cost of assigning a pattern x to ω_i when x ∈ ω_j. The Bayes rule for minimum risk can be described as follows: assign the pattern x to class ω_i if

\[ \sum_{j=1}^{C} \lambda_{jk}\, p(x|\omega_j)\, P(\omega_j) \;>\; \sum_{j=1}^{C} \lambda_{ji}\, p(x|\omega_j)\, P(\omega_j), \qquad k = 1, \ldots, C,\; k \neq i \]
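The following minimal sketch applies the Bayes rule for minimum error with Gaussian class-conditional densities estimated from training data; the Gaussian model, the priors and the names used are illustrative assumptions rather than a prescribed implementation.

```python
# Minimal sketch of the Bayes rule for minimum error with parametric (Gaussian)
# class-conditional densities estimated from training data. The Gaussian model
# and variable names are illustrative assumptions.
import numpy as np

def fit_gaussian(X: np.ndarray):
    """Estimate mean and covariance of one class from its training patterns."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_gaussian(x, mean, cov):
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

def bayes_classify(x, class_params, priors):
    """Assign x to the class maximizing p(x|w_j) P(w_j)."""
    scores = [log_gaussian(x, m, C) + np.log(p)
              for (m, C), p in zip(class_params, priors)]
    return int(np.argmax(scores))

# Usage with two synthetic classes.
rng = np.random.default_rng(1)
X1 = rng.normal(loc=0.0, size=(200, 2))
X2 = rng.normal(loc=2.0, size=(200, 2))
params = [fit_gaussian(X1), fit_gaussian(X2)]
print(bayes_classify(np.array([1.8, 2.1]), params, priors=[0.5, 0.5]))
```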

4.2.2 DISCRIMINANT FUNCTIONS

A discriminant function is a function of the pattern x whose value determines the classification rule. For example, in a two-class problem, the classification rule is given by:

\[ h(x) > k \;\Rightarrow\; x \in \omega_1, \qquad h(x) < k \;\Rightarrow\; x \in \omega_2 \]

where h(x) is a discriminant function and k is a constant. The main difference with respect to the Bayesian approach is that the form of the discriminant function is specified in advance and is not derived from the class-conditional distributions. Many different forms of discriminant function have been proposed in the literature, varying in complexity from the linear discriminant to multi-parameter nonlinear functions such as the multilayer perceptron. The following subsections give a brief description of some of these approaches.

4.2.3 LINEAR DISCRIMINANT FUNCTIONS

A linear discriminant function is defined as a linear combination of the components of the measurement vector x:

\[ g(x) = w^{T} x + w_0 \]

where w is the weight vector and w_0 is the threshold weight. A linear discriminant function can be interpreted as the equation of a hyperplane with unit normal in the direction of w and at a perpendicular distance |w_0|/|w| from the origin. The value of the discriminant function for a pattern x is proportional to the perpendicular distance of x from the hyperplane, given by g(x)/|w| (see Fig 4.2).


Fig 4.2 Geometric interpretation of a linear discriminant function: the hyperplane g = 0 has normal direction w, and the perpendicular distance of a pattern x from the hyperplane is g(x)/|w|.

A special case of the linear discriminant function is the minimum-distance classifier or nearest neighbor classifier. Assume that each class ω_i is represented by a prototype p_i. The minimum-distance classifier assigns the pattern x to the class ω_i associated with the nearest prototype p_i. The linear discriminant function for the minimum-distance classifier is

\[ g_i(x) = x^{T} p_i - \tfrac{1}{2}\, |p_i|^{2} \]
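A minimal sketch of the minimum-distance classifier defined by the discriminant above; the prototype values are illustrative.

```python
# Minimal sketch of the minimum-distance classifier: each class is represented
# by a single prototype p_i and a pattern is assigned to the class whose linear
# discriminant g_i(x) = x^T p_i - 0.5 |p_i|^2 is largest (i.e. whose prototype
# is nearest). Prototype values are illustrative.
import numpy as np

def min_distance_classify(x: np.ndarray, prototypes: np.ndarray) -> int:
    """prototypes: array of shape (C, p), one prototype per class."""
    g = prototypes @ x - 0.5 * np.sum(prototypes ** 2, axis=1)
    return int(np.argmax(g))

prototypes = np.array([[0.0, 0.0],   # class 1
                       [4.0, 0.0],   # class 2
                       [0.0, 4.0]])  # class 3
print(min_distance_classify(np.array([3.5, 0.5]), prototypes))  # -> 1 (class 2)
```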

Decision regions for a minimum-distance classifier are shown in Fig 4.3. Each boundary corresponds to the perpendicular bisector of the lines joining the prototype points of regions that are contiguous. One of the main properties of linear discriminant functions is that the decision regions are convex (i.e. two arbitrary points lying inside the region can be joined by a straight line that lies entirely within the region). In order to allow non-convex and disjoint decision regions (see Fig 4.4), piecewise linear discriminant functions and generalized linear discriminant functions have been proposed.

Fig 4.3 Decision regions of a minimum-distance classifier with prototypes p1, …, p4. The regions are convex: any two elements of a class can be joined by a line that lies entirely within the class region.


Fig 4.4 Example of non-convex regions: the line joining two elements of one class passes through the partition of the other class.

4.2.4 PIECEWISE LINEAR DISCRIMINANT FUNCTIONS

Piecewise linear discriminant functions correspond to a generalization of the minimum-distance classifier in which each class is represented by more than one prototype. That is, each class ω_i is represented by n_i prototypes p_i^1, …, p_i^{n_i}. The discriminant function is defined as

\[ g_i(x) = \max_{j=1,\ldots,n_i} g_i^{j}(x), \qquad \text{where} \quad g_i^{j}(x) = x^{T} p_i^{j} - \tfrac{1}{2}\, |p_i^{j}|^{2} \]

A pattern x is assigned to the class with the largest g_i(x). In this way the space is partitioned into \sum_{i=1}^{C} n_i regions. The prototypes can be estimated from the training set by using clustering schemes.

4.2.5 GENERALIZED LINEAR DISCRIMINANT FUNCTIONS

The generalized linear discriminant function is defined as

\[ g(x) = w^{T} \Phi + w_0, \qquad \Phi = (\Phi_1(x), \ldots, \Phi_D(x))^{T} \]

where Φ is a vector function of x. The basic principle of generalized linear discriminant functions is to transform disjoint classes into a Φ-space in which a linear discriminant function can separate the classes. Typical functions Φ_i(x) are:

Quadratic: Φ_i(x) = x_{l1}^{k1} x_{l2}^{k2}, i = 1, …, (p+1)(p+2)/2 − 1, with l1, l2 = 1, …, p and k1, k2 = 0 or 1, not both zero.

Radial basis function: Φ_i(x) = Φ(|x − v_i|) for centre v_i and function Φ.

Multilayer perceptron: Φ_i(x) = f(x^T v_i + v_{i0}) for direction v_i and offset v_{i0}, where f is the logistic function f(z) = 1/(1 + exp(−z)).

The parameters of the functions Φ_i are estimated using the training set.
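The sketch below illustrates a generalized linear discriminant with radial-basis-function features: patterns are mapped into a Φ-space and a linear discriminant is fitted there by least squares. The Gaussian basis, the choice of centres and the training procedure are illustrative assumptions, not the only possible design.

```python
# Sketch of a generalized linear discriminant: map patterns into a Phi-space with
# radial basis functions Phi_i(x) = exp(-|x - v_i|^2 / (2 s^2)) and fit a linear
# discriminant there. Centres, width and the least-squares fit are illustrative.
import numpy as np

def rbf_features(X: np.ndarray, centres: np.ndarray, s: float = 1.0) -> np.ndarray:
    """Map each row of X to (Phi_1(x), ..., Phi_D(x)) plus a bias term."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * s ** 2))
    return np.hstack([Phi, np.ones((len(X), 1))])   # bias column for w_0

# Two-class toy problem that is not linearly separable in the original space
# (class 1 near the origin, class 2 on a surrounding ring).
rng = np.random.default_rng(2)
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
y = np.concatenate([np.ones(100), -np.ones(100)])   # +1 / -1 targets

centres = X[rng.choice(len(X), size=20, replace=False)]
Phi = rbf_features(X, centres)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # least-squares discriminant
pred = np.sign(Phi @ w)
print("training accuracy:", (pred == y).mean())
```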


4.3 Classifier combination

In the previous section a review of the most important statistical pattern recognition classifiers has been presented. Recently, the combination of multiple classifiers has been viewed as a new direction for the development of highly reliable pattern recognition systems. Preliminary results indicate that the combination of several complementary classifiers leads to classifiers with better performance (see section 7.3). The main reasons for combining different classifiers are:

1. For a specific classification or recognition problem, there are often numerous types of features which could be used to represent and recognize patterns. These features are represented in diverse forms, and it is rather hard to group them together for a single classifier to make a decision. An example is the identification of persons by their voice, face and handwriting.

2. Sometimes more than a single training set is available, each collected at a different time or in a different environment and even with different features.

3. Some classifiers working with the same data may produce uncorrelated results and even different global performance. Thus, it is possible to identify local regions of the feature space where each classifier performs best.

Summarizing, if different feature sets, different training sets, different classification methods or different training sessions are available, it is possible to combine the different classifier outputs in order to improve the overall classification accuracy. The main problem is the selection of the combination function. The next subsections (4.3.1 and 4.3.2) briefly present the fundamentals and requirements of combination schemes from the point of view of classification theory, whereas section 5 extends this generic perspective by describing the combination techniques in more detail.

4.3.1 COMBINATION SCHEMES

A large number of combination schemes have been proposed in the literature [Xu92, Chen97, Jain00]. A typical combination scheme consists of a set of individual classifiers and a combiner, which is responsible for making the final decision by combining the different classification results. Different categorizations have been proposed to differentiate between combination schemes.

[Xu92] proposed a categorization depending on the output of the classifiers, i.e. on the level of information produced by the different classifiers. The three categories reported in that paper are: abstract level, rank level and measurement level. At the abstract level, the decision yielded by classifier i consists of a unique label j; at the rank level, a ranked list of candidate classes is given at the output, with the class at the top being the first choice. Finally, at the measurement level, classifier i assigns each possible class j a confidence value depending on the degree of similarity of class j to a given test set. Among the three levels, the measurement level contains the highest amount of information and the abstract level the lowest.

[Chen97] classified the combination of multiple classifiers into three frameworks: linear opinion pools, winner-take-all and evidential reasoning. In the linear opinion pool framework, the combination schemes make the decision using a linear combination of the results of the multiple classifiers. The winner-take-all scheme chooses, among several classifiers, the single one responsible for taking the final decision for a specific input pattern. In the evidential reasoning framework, the output of each individual classifier for an input pattern is regarded as evidence (an event), and the combination scheme makes the final decision based upon a voting principle.

More recently, Jain et al. [Jain00] have proposed a more generic categorization according to the basic architecture of the combination scheme, without analyzing how the combiner invokes the different classifiers. The authors grouped the combination schemes into parallel, cascading (serial) and hierarchical (tree-like) architectures. In the parallel architecture, all individual classifiers are invoked independently, and their results are then combined by a combiner (see Fig 4.5). Most combination schemes in the literature belong to this category.


In the cascading architecture, individual classifiers are invoked in a linear sequence, and the number of possible classes for a given pattern is gradually reduced as more classifiers in the sequence become involved. For the sake of efficiency, inaccurate but cheap classifiers (with low computational and measurement demands) are applied first, followed by more accurate but expensive classifiers. In the hierarchical architecture, individual classifiers are combined into a structure similar to that of a decision-tree classifier; the tree nodes, however, may now be associated with complex classifiers demanding a large number of features. The advantage of this architecture is its high efficiency and flexibility in exploiting the discriminant power of different types of features. More sophisticated and complex architectures can also be built by combining these three basic ones.

As already mentioned, most multi-expert systems use the parallel combination architecture. In this case, once the individual classifiers have been selected, they are combined by a module called the combiner (see Fig 4.5). This module is responsible for pooling the different classifier outputs in order to make a final decision. A combiner can present different characteristics, such as static vs. trainable, or adaptive vs. non-adaptive. Trainable combiners may yield better performance than static combiners, at the cost of additional training and of requiring additional training data. Furthermore, whereas non-adaptive combiners treat all input patterns in the same way, adaptive ones can vary the weight (importance) of the individual classifiers depending on the input pattern and the channel conditions (e.g. in face recognition some classifiers can deal with pose variations whereas others are illumination independent).

Fig 4.5 Parallel combination scheme: classifiers 1, 2, …, N are invoked independently and their outputs are pooled by a combiner to produce the final decision.

The different possibilities for combining individual classifiers are further analyzed and reviewed in section 5, which provides an overview of information fusion in the context of audio and video signals. A large number of experimental studies and results have shown that classifier combination can improve the recognition accuracy. However, only a few approaches give a theoretical explanation for this better performance of multi-classifier systems, and in these cases only the simplest combination schemes, under rather restrictive assumptions, have been analyzed [Kleinberg90, Perrone93, Tumer96]. The most popular analysis of a combination scheme is based on the bias-variance dilemma. The expected classification error can be separated into a bias and a variance term, where the bias term refers to the persistent training error of the learning algorithm and the variance term to the generalization error. The main problem is that these two errors are not independent: one term decreases as the other increases. Procedures with increased flexibility to adapt to the training data (more free parameters) tend to have lower bias but higher variance. According to several approaches [Kleinberg90, Perrone93, Tumer96], a multi-classifier system is supposed to improve the classification error by reducing the variance term.


4.3.2 TRAINING METHODS OF INDIVIDUAL CLASSIFIERS TO ENSURE INDEPENDENCE

A classifier combination is especially useful if the individual classifiers are largely independent. If this is not already guaranteed by the use of different training sets, various resampling techniques such as rotation and bootstrapping may be used to artificially create such differences. These techniques are related to methods for estimating and comparing classifier models. The best-known training techniques for multiple classifiers are stacking [Wolpert92], bagging [Breiman96] and boosting [Schapire90].

In stacking, the main purpose is to minimize the generalization error rate of one or more classifiers; in other words, stacked generalization works by deducing the biases of the classifiers with respect to a provided learning set. The final decision is made based on the outputs of the stacked classifier in conjunction with the outputs of the individual classifiers.

Bagging is a method for generating multiple versions of a classifier and combining them to obtain an averaged classifier. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. According to [Breiman96], tests on real and simulated data sets show that bagging can give substantial gains in accuracy. Furthermore, bagging is expected to improve recognition for unstable classifiers (a classifier or learning algorithm is informally called unstable if small changes in the training data lead to significantly different classifications and relatively large changes in accuracy), because it effectively averages over such instabilities.

Finally, boosting is a general method for improving the performance of any learning algorithm. In theory, boosting can be used to significantly reduce the error of any “weak” learning algorithm that consistently generates classifiers which are only slightly better than random guessing. Boosting works by repeatedly running a given weak learning algorithm on various distributions over the training data, and then combining the classifiers produced by the weak learner into a single composite classifier. The first effective boosting algorithms were presented by Schapire [Schapire90] and Freund [Freund96], showing that it is in principle possible for a combination of weak classifiers (whose performances are only slightly better than random guessing) to achieve an arbitrarily small error rate on the training data. There are several variations on basic boosting, but the most popular is AdaBoost (“Adaptive Boosting”) [Viola01], which allows the designer to continue adding weak learners until some desired low training error has been achieved.

Another way of encouraging independence among the individual classifiers that are combined is to use different feature sets instead of building different classifiers on different sets of training patterns. This even more explicitly forces the individual classifiers to contain independent information.
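As a sketch of the bagging procedure described above, the following code trains several versions of a deliberately simple base classifier on bootstrap replicates of the learning set and combines their hard decisions by voting; the nearest-mean base classifier and all parameters are illustrative assumptions, not taken from [Breiman96].

```python
# Sketch of bagging: train multiple versions of a simple base classifier on
# bootstrap replicates of the learning set and combine their decisions by voting.
# The nearest-mean base classifier and parameters are illustrative.
import numpy as np

def train_nearest_mean(X, y):
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def predict_nearest_mean(model, X):
    classes, means = model
    d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def bagging_predict(X_train, y_train, X_test, n_models=15, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))   # bootstrap replicate
        model = train_nearest_mean(X_train[idx], y_train[idx])
        votes.append(predict_nearest_mean(model, X_test))
    votes = np.stack(votes)                                       # (n_models, n_test)
    # majority vote over the bagged classifiers
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
print(bagging_predict(X, y, np.array([[0.2, 0.1], [2.1, 1.9]])))
```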

4.4 Bibliography

[Breiman96] L. Breiman, “Bagging Predictors”, Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[Chen97] K. Chen, L. Wang, and H. Chi, “Methods of combining multiple classifiers with different features and their applications to text-independent speaker identification”, Int. Journal of Pattern Recognition and Artificial Intelligence, 11(3), pp. 417-445, 1997.
[Freund96] Y. Freund and R. Schapire, “Experiments with a New Boosting Algorithm”, Proc. 13th Int. Conference on Machine Learning, pp. 148-156, 1996.
[Jain00] A. K. Jain, R. P. W. Duin, and J. Mao, “Statistical pattern recognition: a review”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, January 2000.
[Kleinberg90] R. M. Kleinberg, “Stochastic discrimination”, Annals of Math. and Artificial Intelligence, Vol. 1, pp. 226-239, 1990.


[Perrone93] M. P. Perrone and L. N. Cooper, “When networks disagree: Ensemble methods for hybrid Neural Networks”, in Neural Networks for Speech and Image Processing, R. J. Mammone, ed., Chapman-Hall, 1993.
[Schapire90] R. E. Schapire, “The strength of weak learnability”, Machine Learning, vol. 5, pp. 197-227, 1990.
[Tumer96] K. Tumer and J. Ghosh, “Analysis of decision boundaries in linearly combined neural classifiers”, Pattern Recognition, vol. 29, pp. 341-348, 1996.
[Xu92] L. Xu, A. Krzyzak, and C. Suen, “Methods of combining multiple classifiers and their applications to handwriting recognition”, IEEE Trans. Sys. Man. Cyb., 22:418-435, 1992.
[Wolpert92] D. Wolpert, “Stacked Generalization”, Neural Networks, Vol. 5, pp. 241-259, 1992.
[Viola01] P. Viola and M. Jones, “Rapid Object Detection using a Boosted Cascade of Simple Features”, Computer Vision and Pattern Recognition, 2001.


5. FUNDAMENTALS OF INFORMATION FUSION

5.1 Introduction

Broadly speaking, the term information fusion encompasses any area which deals with utilizing a combination of different sources of information, either to generate one representation format or to reach a decision. It is a relatively new research area, with pioneering publications dating back to the early 1980s [Barniv81, Pau82, Tenney81a, Tenney81b]. Examples of areas that use information fusion are consensus building, team decision theory, committee machines, integration of multiple sensors, multi-modal data fusion, combination of multiple experts/classifiers, distributed detection and distributed decision making. In general, the use of information fusion can be justified by some of the following advantages [Sanderson02]:

• Utilizing complementary information (e.g. audio and video) can reduce error rates.
• The system design and implementation complexity can be reduced by using several complementary systems rather than one very sophisticated system.
• Use of multiple information sources can increase reliability.
• Sensors can be physically separated, allowing the acquisition of information from different points of view.

This section reviews the most important and common approaches to information fusion, which can be broadly classified into two main categories:

1. Pre-mapping fusion (input-level fusion): information is combined before any use of a classifier or an expert. (Note that while a classifier provides a hard decision, an expert provides an opinion, i.e. a soft decision or confidence value, on each possible decision.)

2. Post-mapping fusion (classifier-level fusion): information is combined after mapping from the sensor-data/feature space into the opinion/decision space, using either an expert or a classifier.

Other researchers [Hall01, Neti99] use a slightly different grouping, subdividing the pre-mapping and post-mapping categories. In this case, information fusion techniques are grouped into four groups, namely:

(a) Sensor data level fusion
(b) Feature level fusion
(c) Decision fusion
(d) Opinion fusion

Fig 5.1 illustrates both categorizations, together with their subcategories and techniques.


Fig 5.1 Information fusion techniques. Pre-mapping fusion comprises sensor data level fusion (weighted summation) and feature level fusion (concatenation, weighted summation); post-mapping fusion comprises decision fusion (majority voting, ranked lists, AND, OR) and opinion fusion (weighted summation, weighted product, post-classifier).

The next subsections describe these categories in more detail. Finally, fusion strategies can further be classified as non-adaptive or adaptive. In non-adaptive approaches the contribution of each expert is fixed a priori; in the adaptive approach the contribution of at least one expert is varied according to its reliability and discrimination capability under the current environmental conditions. Adaptive strategies are briefly reviewed in section 5.4.

5.2 Pre-mapping fusion

As already mentioned, pre-mapping techniques can be further divided into two subcategories: sensor-data-level fusion and feature-level fusion.

5.2.1 SENSOR DATA LEVEL FUSION

In sensor data level fusion [Hall01], the raw data extracted from the sensors is directly combined. Depending on the specific application, there are two main methods: weighted summation and mosaic construction. For example, weighted summation can be employed to combine visual and infrared images into one image, or to combine the data from two microphones (to reduce the effects of noise) in the form of an averaging operation. It must be emphasized that the data must first be commensurate (i.e. expressed on a common measure), which can be accomplished by pre-mapping it to a common interval. Mosaic construction can be employed to create one image out of the images provided by several cameras, where each camera observes a different part of the same object [Iyengar95].

5.2.2 FEATURE LEVEL FUSION

In feature level fusion, features extracted from data provided by several sensors (or from one sensor but using different feature extraction techniques) are combined. If the features are commensurate, the combination can be accomplished by a weighted summation (e.g., features extracted from data provided by two microphones). If the features are not commensurate, feature vector concatenation can be used [Hall01, Adjoudani95, Luettin97]. In this case, a new feature vector is constructed by concatenating the feature vectors obtained from each information source. Feature vector concatenation has three main drawbacks:

a. There is no explicit control over how much each vector contributes to the final decision.



b. Separate feature vectors must be available at the same frame rate (i.e., the feature extraction must be synchronous). This might be a problem when combining speech and visual feature vectors.

c. The dimensionality of the resulting feature vector can be too high, leading to the “curse of dimensionality” problem [Duda01].

Due to the above problems, the post-mapping fusion approach is preferred in many situations.

5.3 Post-Mapping fusion

Post-mapping techniques can be divided into decision and opinion fusion. In the first group, hard decisions from a set of classifiers are used, while in the second group, opinions from a set of experts are utilized.

5.3.1 DECISION FUSION

In decision fusion the hard decisions made by a set of classifiers are used [Hall01, Iyengar95]. These classifiers can be of the same type but working with different features (e.g. audio and video); alternatively, several different classifiers can work with the same features, or hybrid combinations of the previous two cases can be used. The idea behind using different classifiers with the same features stems from the belief that each classifier (due to its internal representation) may be good at recognizing a particular set of classes while being bad at recognizing a different set of classes. Thus, a combination of classifiers may overcome the weaknesses of each individual classifier [Ho94, Kittler98]. In decision fusion, hard decisions can be combined by majority voting, by combining ranked lists, or by AND and OR operations.

Majority voting
In majority voting [Genoud96, Iyengar95], a consensus is reached on the decision for which a majority of classifiers agree. There are two downsides to the voting approach: first, an odd number of classifiers is required to prevent ties; second, the number of classifiers must be greater than the number of classes (possible decisions) to ensure that a decision is reached (this last condition is usually not a limitation in verification problems).

Ranked list combination
In ranked list combination [Achermann96, Ho94], each classifier provides a ranked list of class labels. In this list, labels are sorted according to the degree of preference for each class; usually the top entry corresponds to the most preferred class. The ranked lists can then be combined via various means [Ho94], possibly taking into account the reliability and discrimination ability of each classifier. The final decision is reached by selecting the top entry in the combined ranked list.

AND fusion
In AND fusion [Luo95, Vapnik95], a particular class is accepted only when all classifiers agree. This type of fusion is quite restrictive and is used mainly in applications where a low false-alarm rate is required. For multi-class identification problems no decision may be reached, so it is mainly used in verification scenarios.

OR fusion
In OR fusion [Luo95, Vapnik95], a particular class is accepted as soon as one of the classifiers makes a positive decision. In contrast to AND fusion, this type of fusion is very relaxed and can even provide multiple possible decisions in multi-class problems. OR fusion is used mainly in verification applications where a low false-rejection rate is required.
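The decision-fusion rules described above can be summarized in a few lines; the sketch below is illustrative, with hypothetical class labels and verification decisions.

```python
# Hedged sketch of the decision-fusion rules just described, applied to the hard
# decisions of N classifiers. Labels and decisions are illustrative.
import numpy as np

def majority_vote(decisions):
    """decisions: list of class labels (ints), one per classifier."""
    counts = np.bincount(np.asarray(decisions))
    return int(np.argmax(counts))          # ties resolved towards the lowest label

def and_fusion(accepts):
    """accepts: list of booleans (verification decisions); accept only if all agree."""
    return all(accepts)

def or_fusion(accepts):
    """Accept as soon as one classifier makes a positive decision."""
    return any(accepts)

print(majority_vote([2, 2, 1]))            # -> 2
print(and_fusion([True, True, False]))     # -> False (restrictive, low false alarms)
print(or_fusion([True, False, False]))     # -> True  (relaxed, low false rejections)
```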


5.3.2 OPINION FUSION

In opinion fusion [Hall01, Iyengar95, Verlinde99], a set of experts provides opinions on each possible decision. Since different types of experts can be used, the opinions usually need to be made commensurate before further processing (e.g. one expert gives an opinion in terms of distances while another gives a likelihood measure); this can be accomplished by mapping the output of each expert to the [0, 1] interval. Often in the literature [Hall01, Iyengar95, Verlinde99] the terms decision fusion and opinion fusion are used interchangeably; however, since each expert provides an opinion and not a hard decision, the term opinion fusion is more appropriate when an ensemble of experts is used. The main advantage of opinion fusion with respect to decision fusion is that in the latter some information regarding the goodness of each possible decision is lost. Opinions can be combined using weighted summation or weighted product approaches before applying a classification criterion, such as the max operator (which selects the class with the highest fused opinion), to reach a decision.

Weighted summation
In weighted summation, the opinions regarding class j from N_E experts are combined using:

\[ f_j = \sum_{i=1}^{N_E} w_i\, o_{i,j} \]

where o_{i,j} is the opinion from the i-th expert and w_i is the corresponding weight in the [0, 1] interval, with the weights summing to one (Σ_i w_i = 1). When all the weights are equal, the previous equation reduces to an arithmetic mean. The weighted summation approach is also known as the linear opinion pool [Altiçay00] and the sum rule [Alexandre01].

Weighted product
If the opinions provided by the experts are considered to be independent a posteriori probabilities in a Bayesian framework [Brunelli95], then the opinions regarding class j from N_E experts can be combined using a product rule:

\[ f_j = \prod_{i=1}^{N_E} o_{i,j} \]

Moreover, to account for the varying discrimination ability and reliability of each expert, weights can be introduced:

\[ f_j = \prod_{i=1}^{N_E} (o_{i,j})^{w_i} \]

When all the weights are equal, the previous equation reduces to a geometric mean. The weighted product approach is also known as the logarithmic opinion pool [Altiçay00] and the product rule [Alexandre01].
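A minimal sketch of the sum rule and the weighted product rule defined above, assuming the opinions have already been mapped to the [0, 1] interval; the opinion values and weights are illustrative.

```python
# Minimal sketch of weighted-summation (sum rule) and weighted-product (product
# rule) opinion fusion. Opinions o[i, j] are assumed commensurate, i.e. already
# mapped to [0, 1]; values and weights are illustrative.
import numpy as np

def sum_rule(opinions: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """opinions: (N_E experts x N_C classes); weights sum to 1. Returns fused opinion per class."""
    return weights @ opinions

def product_rule(opinions: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted product: prod_i o[i, j] ** w_i for each class j."""
    return np.prod(opinions ** weights[:, None], axis=0)

opinions = np.array([[0.7, 0.2, 0.1],    # speech expert
                     [0.4, 0.5, 0.1]])   # face expert
weights = np.array([0.6, 0.4])           # e.g. favour the audio expert at high SNR

for fuse in (sum_rule, product_rule):
    fused = fuse(opinions, weights)
    print(fuse.__name__, fused, "-> decision:", int(np.argmax(fused)))
```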


There are two drawbacks to weighted product fusion. The first is that one expert can have a large influence over the fused opinion: for example, an opinion close to zero from one expert sets the fused opinion also close to zero. The second is that the independence assumption is only strictly valid when each expert uses independent features. An advantage of weighted summation and weighted product fusion over feature vector concatenation is that the opinions from each expert can be weighted. Moreover, these weights can be chosen to reflect the reliability and discrimination ability of each expert. Thus, when fusing opinions from speech and face experts, it is possible to modify the contribution of the speech expert under variable-SNR channel conditions. This type of fusion is known as adaptive fusion (see section 5.4).

Post-classifier
As an alternative to the weighted summation and weighted product techniques, a post-classifier can be used to reach a decision from the opinions provided by the experts. In this case the opinions can be considered as features in the likelihood space. The opinions from N_E experts regarding N_C classes form an N_E × (N_C − 1) dimensional opinion vector, which is used by a post-classifier to make the final decision. An important advantage of the post-classifier approach is that the opinions do not necessarily need to be commensurate, as in the two previous approaches: the post-classifier performs the appropriate mapping from the likelihood space to the class-label space. Notice that in a verification scenario the dimensionality of the opinion vector depends only on the number of experts [Ben99], since only two classes (accept/reject) exist. Therefore, the post-classifier provides a decision surface in an N_E-dimensional space, separating the impostor and the true-claimant classes. The post-classifier can be any of the classifiers reviewed in section 4; the most common ones are the Bayesian classifier and linear discriminant functions.

5.4 Adaptive fusion

Pioneering approaches that addressed the problem of adaptive fusion were published in the early 1990s [Hampshire89, Jacobs90, Jacobs91, Tresp95]. Hampshire [Hampshire89] and Jacobs [Jacobs90] were the first who tried to solve the problem of using several different expert networks plus a gating network that decides which of the experts should be used for each specific case. Jacobs extended the system of Hampshire so that it can learn how to allocate new cases to experts by checking the output: if the output is incorrect, the weights of the selected experts are changed through the gating network, so there is no interference with the weights of other experts that specialize in quite different cases. The experts are therefore local in the sense that the weights in one expert are decoupled from the weights in other experts. Both approaches [Hampshire89, Jacobs90] use the following error function to adjust the weights:

\[ E^{j} = \Big| d^{j} - \sum_i p_i^{j}\, o_i^{j} \Big|^{2} \]

where o_i^j is the output vector of expert i on case j, p_i^j is the proportional contribution of expert i to the combined output vector, and d^j is the desired output vector for case j. The error measure compares the desired output with a combination of the local experts and adjusts the weights (the contributions of the local experts) to cancel this error. When the weights in one expert change, so do the residual error and the error derivatives for all the other local experts. This strong coupling between experts causes them to cooperate, but tends to lead to solutions in which too many experts are used. This technique is called associative learning by its authors.


In [Jacobs91] this associative learning is extended to competitive learning by using only one expert for each case instead of combining the outputs of all of them. The error is then the expected value of the squared difference between the desired output and the actual output of one expert:

\[ E^{j} = \Big\langle \big| d^{j} - o_i^{j} \big|^{2} \Big\rangle = \sum_i p_i^{j}\, \big| d^{j} - o_i^{j} \big|^{2} \]

or a logarithmic version, which gives better and faster performance:

\[ E^{j} = -\log \sum_i p_i^{j}\, e^{-\frac{1}{2}\, | d^{j} - o_i^{j} |^{2}} \]

Notice that in this new error function each expert is required to produce the whole output vector rather than a residual. As a result, the goal of a local expert on a given training case is not directly affected by the weights within other local experts; however, if the weights of one expert are adjusted, this may change the choice of the gating network in favour of a different expert. In [Tresp95] different ways of designing the error or weighting functions are described. Intuitively, the weighting functions should represent the competence or certainty of a module (expert), given the available information x. The authors modelled the global dependency between the input x and the output y as a weighted combination of the individual experts NN_i(x):

\[ \hat{y}(x) = \frac{1}{n(x)} \sum_{i=1}^{M} h_i(x)\, NN_i(x) = \sum_{i=1}^{M} g_i(x)\, NN_i(x) \]

where

\[ n(x) = \sum_{i=1}^{M} h_i(x) \]

with weighting functions h_i(x) ≥ 0 and g_i(x) = h_i(x)/n(x).

According to the authors the weights can be adjusted based on the variance, the density or the error.

Variance-based weighting
Here, Tresp [Tresp95] assumed that the different experts NN_i(x) were trained with different data sets but with identical (commensurate) input-output relationships. Under the restriction that the individual classifiers are uncorrelated and unbiased (according to the author, since different training data have been used it is reasonable to assume that the experts are uncorrelated and unbiased), the combined estimator is also unbiased and has the smallest variance if the weighting function is inversely proportional to the variance of the experts:

\[ h_i(x) = \frac{1}{\mathrm{var}\left[ NN_i(x) \right]} \]

Density-based weighting
Another possibility is that the experts were trained with data sets from different regions, so that the input-output relationships are not identical. In this case, it is necessary to know the state of the input during the test stage in order to estimate the weighting function as the probability of state i given an input vector x:

h_i(x) = P(i \mid x) = \frac{P(i, x)}{P(x)} \propto P(x \mid i)\, P(i)

The authors used a mixture-of-Gaussians model to estimate P(i, x), but it might be estimated in other ways. In this adaptive fusion technique, the probability density function can be seen as the gating network (the combiner) which decides how the individual classifiers are fused. The authors also proposed a hybrid weighting function combining the variance-based and the density-based weighting techniques.
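As a concrete illustration of this adaptive weighting scheme, the following minimal Python/NumPy sketch combines the outputs of M experts with the normalized weights g_i(x) = h_i(x)/n(x), using either variance-based or density-based raw weights. The function names and the toy expert outputs, variances and per-expert Gaussian densities are illustrative placeholders, not values or code from [Tresp95].

    import numpy as np

    def combine_experts(predictions, h):
        """Weighted combination y_hat = sum_i g_i(x) NN_i(x), with
        g_i(x) = h_i(x) / n(x) and n(x) = sum_i h_i(x)."""
        g = h / np.sum(h)
        return np.dot(g, predictions)

    def variance_weights(variances):
        """Variance-based weighting: h_i(x) = 1 / var[NN_i(x)]."""
        return 1.0 / np.asarray(variances, dtype=float)

    def density_weights(x, means, sigmas, priors):
        """Density-based weighting: h_i(x) proportional to P(i, x) = P(x|i) P(i).
        P(x|i) is modelled here, for illustration only, as a 1-D Gaussian per expert."""
        likelihoods = np.exp(-0.5 * ((x - means) / sigmas) ** 2) / (sigmas * np.sqrt(2.0 * np.pi))
        return priors * likelihoods

    # Toy example: three experts giving estimates for the same input x = 0.3
    preds = np.array([0.9, 1.1, 1.6])
    print(combine_experts(preds, variance_weights([0.1, 0.2, 1.0])))
    print(combine_experts(preds, density_weights(0.3,
                                                 means=np.array([0.0, 0.5, 2.0]),
                                                 sigmas=np.array([1.0, 1.0, 1.0]),
                                                 priors=np.array([1.0, 1.0, 1.0]) / 3.0)))

In a real system the joint density P(i, x) would be estimated from training data, for instance with the mixture-of-Gaussians model used by the authors.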

5.5 Bibliography

[Achermann96] B. Achermann and H. Bunke, "Combination of Classifiers on the Decision Level for Face Recognition", Technical Report IAM-96-002, Institut für Informatik und angewandte Mathematik, Universität Bern, 1996

[Adjoudani95] A. Adjoudani and C. Benoit, "Audio-Visual Speech Recognition Compared Across Two Architectures", Proc. 4th European Conf. Speech Communication and Technology, Vol. 2, pp. 1563-1567, Madrid, Spain, 1995

[Alexandre01] L. A. Alexandre, A. C. Campillo and M. Kamel, "On combining classifiers using sum and product rules", Pattern Recognition Letters, Vol. 22, pp. 1283-1289, 2001

[Altiçay00] H. Altiçay and M. Demirekler, "An information theoretic framework for weight estimation in the combination of probabilistic classifiers for speaker identification", Speech Communication, Vol. 30, pp. 255-272, 2000

[Barniv81] Y. Barniv and D. Casasent, "Multisensor image registration: Experimental verification", Proceedings of the SPIE, Vol. 292, pp. 160-171, 1981

[Ben99] S. Ben-Yacoub, Y. Abdeljaoued and E. Mayoraz, "Fusion of Face and Speech Data for Person Identity Verification", IEEE Trans. on Neural Networks, Vol. 10, No. 5, pp. 1065-1074, 1999

[Brunelli95] R. Brunelli and D. Falavigna, "Person Identification Using Multiple Cues", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 17, No. 10, pp. 955-965, 1995

[Duda01] R. O. Duda, P. E. Hart and D. G. Stork, "Pattern Classification", John Wiley & Sons, USA, 2001

[Genoud96] D. Genoud, F. Bimbot, G. Gravier and G. Chollet, "Combining methods to improve speaker verification", Proc. 4th International Conf. Spoken Language Processing, Vol. 3, pp. 1756-1759, Philadelphia, 1996

[Hall01] D. L. Hall and J. Llinas, "Handbook of Multisensor Data Fusion", CRC Press, USA, 2001

[Hampshire89] J. B. Hampshire and A. Waibel, "The meta-pi network: building distributed knowledge representations for robust pattern recognition", Carnegie Mellon Technical Report, August 1989

[Ho94] T. K. Ho, J. J. Hull and S. N. Srihari, "Decision Combination in Multiple Classifier Systems", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 16, No. 1, pp. 66-75, 1994


[Jacobs90] R. A. Jacobs, M. I. Jordan and A. G. Barto, "Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks", Cognitive Science, 1990

[Jacobs91] R. A. Jacobs, M. I. Jordan, S. J. Nowlan and G. E. Hinton, "Adaptive mixtures of local experts", Neural Computation, Vol. 3, pp. 79-87, 1991

[Luettin97] J. Luettin, "Visual Speech and Speaker Recognition", PhD Thesis, Department of Computer Science, University of Sheffield, 1997

[Luo95] R. C. Luo and M. G. Kay, "Introduction", in: Multisensor Integration and Fusion for Intelligent Machines and Systems (editors: R. C. Luo and M. G. Kay), Ablex Publishing Corporation, Norwood, NJ, 1995

[Neti99] C. Neti and A. Senior, "Audio-visual speaker recognition for video broadcast news", in DARPA HUB4 Workshop, Washington D.C., March 1999

[Pau82] L. F. Pau, "Fusion of multisensor data in pattern recognition", in: Pattern Recognition Theory and Applications (editors: J. Kittler, K. S. Fu and L. F. Pau), D. Reidel Publ., Dordrecht, Holland, 1982

[Sanderson02] C. Sanderson, "Automatic person verification using speech and face information", PhD Thesis, School of Microelectronic Engineering, Griffith University, 2002

[Iyengar95] S. S. Iyengar, L. Prasad and H. Min, "Advances in Distributed Sensor Technology", Prentice Hall PTR, New Jersey, 1995

[Tenney81a] R. R. Tenney and N. R. Sandell Jr., "Detection with Distributed Sensors", IEEE Trans. on Aerospace and Electronic Systems, Vol. 17, pp. 98-101, 1981

[Tenney81b] R. R. Tenney and N. R. Sandell Jr., "Strategies for Distributed Decisionmaking", IEEE Trans. on Systems, Man and Cybernetics, Vol. 11, pp. 527-537, 1981

[Tresp95] V. Tresp and M. Taniguchi, "Combining estimators using non-constant weighting functions", Advances in Neural Information Processing Systems, 1995

[Vapnik95] V. N. Vapnik, "The Nature of Statistical Learning Theory", Springer-Verlag, New York, 1995

[Verlinde99] P. Verlinde, "A Contribution to Multi-Modal Identity Verification Using Decision Fusion", PhD Thesis, Department of Signal and Image Processing, Telecom Paris, France, 1999


6. PEOPLE LOCATION USING AUDIO VISUAL INFORMATION

6.1 Introduction

Distributed sensor networks have been proposed for a wide range of applications. The main purpose of sensor networks is to monitor an area, including detecting, identifying, localizing, and tracking one or more objects of interest. With the increase of multimedia applications, person localization and tracking have gained in importance. In the context of multimedia applications, e.g. video conference systems, audio-visual information is usually used to perform the localization. In multi-channel signal processing, source localization has been of great research interest for more than two decades [Krim96]. The location of the sources is of particular importance for multi-channel audio signal processing: with beamforming techniques (spatial filtering), sources can be separated or noise reduction can be achieved, which is very important for many applications. The use of visual information to perform person tracking is relatively new compared to audio because of its computational cost, but with the increasing power of computers, visual signal analysis has assumed a greater importance in recent research; video indexing and retrieval is one potential application. However, sound and visual information are jointly generated when people speak, and provide complementary advantages for speaker tracking if their dependencies are jointly modelled [Vermaag01]. On the one hand, initialization and recovery from failures - bottlenecks in visual tracking - can be robustly addressed with audio. On the other hand, precise object localization is better suited to visual processing.
6.2 People location

6.2.1 REVIEW OF MICROPHONE ARRAY SPEAKER LOCALIZATION

Source localization based on sensor arrays has been a mainstream research topic in the signal processing community for over two decades. In fact, it offers a wide variety of potential applications, as it deals with signals that range from electromagnetic (e.g. radar, cellular and wireless communications, satellite) to acoustic (sonar, audio, etc.). All such fields of application share a wide common knowledge base [Brandstein01, Stoica97, Haykin84], as they all deal with phased-array problems, but they specialize quite differently to their specific scenarios, as the physical phenomena that come into play in the various situations differ a great deal from each other. Recently, significant attention has been given to the localization of acoustic sources using microphone arrays [Brandstein01]. In particular, the localization of a speech source is of great interest, as it enables the tracking of individuals in a variety of environments. When dealing with acoustic sources, the physics of propagation is strongly affected by the long wavelengths involved. This introduces severe problems of diffraction and makes environmental reverberation a major issue of concern. Acoustic source localization has been shown to suffer from significant accuracy problems due to a number of factors, such as the number, the quality and the relative location of the microphones; the number, the location and the spectral content of the sound sources; and the noise and reverberation level of the ambient enclosure. As a general rule and with the techniques available today, in order to achieve optimal performance it is usually necessary to select the array geometry according to criteria that strongly depend on the geometry and the acoustic conditions of the environment. In addition to accuracy, in order to devise solutions of practical use, we need to take into account issues of computational complexity: it is important to be able to update the source location frequently and track it over time. The solutions available in the literature can be coarsely classified into three broad categories: those based on the concept of Steered Beamforming (SB); those based on High-Resolution Spectral Estimation (HRSE) methods; and those based on Time-Difference Of Arrival (TDOA). Steered Beamforming is a well-known method for deriving information on the source locations directly from a filtered linear combination of


the acquired signals. Methods based on HRSE imply the analysis of the correlations between the acquired signals, while methods based on TDOA extract information on the source location through the analysis of a set of delay estimates. Roughly speaking, the optimal maximum likelihood (ML) SB-based localization methods rely on a focused beamformer, which steers the array to various locations in space and looks for peaks in the detected output power [Hahn73]. In its simplest implementation, the steered response can be obtained through a delay-and-sum process performed on the signals acquired by two microphones: one of the two signals is delayed in order to compensate for the propagation delays due to the incidence direction of the sound. When using microphone arrays, more complex configurations of time shifts must be devised and applied for focusing purposes [Brandstein01, Hahn73]. The various SB-based methods available in the literature tend to differ from each other in the way filters are combined with time shifts on the array signals in order to perform the focusing task. Indeed, the ML estimation is performed through a nonlinear optimization problem. This raises questions on the effectiveness of the approach, as the goal function does not have a strong global peak and often exhibits a number of local extrema. In addition to these optimization problems, it is worth mentioning that the steered response of beamformers is strongly dependent on the spectral content of the signal, and such methods usually require accurate a-priori knowledge of the background noise. As a consequence, strongly reverberant environments tend to significantly affect the performance of such solutions, as they introduce a strong correlation between signal and noise (reverberation components are interpreted as noise). Source localization methods of the second category are all based on the analysis of the data covariance matrix of the array sensors. This matrix is usually unknown, and therefore needs to be estimated from the acquired data; to do so, such solutions rely on high-resolution spectral estimation techniques [Stoica97], for example Auto-Regressive (AR) or Minimum Variance (MV) estimation techniques. Also belonging to this category are those techniques that partition the spatial data covariance matrix (SCM) into signal and noise subspaces, which are defined through the eigenvectors of the SCM [Schmidt86]. All implications of this idea were then studied in [Bienvenu83, Moor93] and generated a variety of algorithms for narrowband beamforming, such as MUSIC [Schmidt86], Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT) [Roy89], MIN-NORM [Kumaresan83], and Weighted Subspace Fitting (WSF) [Viberg91]. All such techniques are statistically consistent under the standard assumptions of Gaussian signals and noise and perfect array calibration. One remarkable feature of subspace algorithms, as opposed to other HRSE-based methods that strictly depend on the exact knowledge of the data distribution, is that they only require prior knowledge of the array signal model and the availability of a consistent estimate of the SCM. In addition, there exist specific instances of subspace algorithms (e.g. ESPRIT [Roy89] and ROOT-MUSIC [Rao89]) that guarantee fast and global numerical convergence without the need for any prior guess about the source locations.
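As an illustration of the delay-and-sum steering described above, the sketch below computes the steered response power of a linear microphone array over a set of candidate far-field directions; the direction with the largest output power is taken as the source estimate. This is a simplified frequency-domain formulation written for this review (function name and parameters are our own), not the implementation of any of the cited works.

    import numpy as np

    def srp_delay_and_sum(signals, mic_pos, fs, angles_deg, c=343.0):
        """Steered response power of a delay-and-sum beamformer.
        signals: (M, N) array, one row per microphone; mic_pos: (M,) positions [m]
        along the array axis; angles_deg: candidate directions from broadside."""
        M, N = signals.shape
        spectra = np.fft.rfft(signals, axis=1)
        freqs = np.fft.rfftfreq(N, d=1.0 / fs)
        powers = []
        for theta in np.deg2rad(np.asarray(angles_deg, dtype=float)):
            delays = mic_pos * np.sin(theta) / c          # relative propagation delays
            # Compensate each channel for its delay, then sum the aligned spectra
            steering = np.exp(2j * np.pi * np.outer(delays, freqs))
            aligned_sum = np.sum(spectra * steering, axis=0)
            powers.append(np.sum(np.abs(aligned_sum) ** 2))
        return np.array(powers)

In practice the candidate grid covers positions in the room rather than far-field angles, and per-channel filters are combined with the time shifts, as discussed above.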
Although the above-mentioned subspace solutions are designed to work with narrowband signals, generalizations to the more difficult case of wideband signals (including speech) have been proposed in the literature. The complexity of such generalizations ranges from a simple serial application of the narrowband methods (where the array geometry changes with frequency) to methods that account for the fact that point sources lose their rank-one signature in the SCM [Krolik89], and other sophisticated solutions such as [Wang85, Buckley88]. All such solutions, however, imply heavier computational costs. As a general rule, HRSE-based methods are significantly less robust to source and sensor modelling errors than SB-based methods. TDOA-based locators are all based on a two-step procedure applied to a set of spatially separated microphones. Time delay estimation is first performed on pairs of distant sensors. This information is then used for constructing hyperbolic curves that describe the locus of all points whose difference in distance from the two sensors corresponds to the estimated delay. The curves drawn for the various pairs of sensors are then intersected in some clever way in order to identify the source locations. The solutions available in the literature vary a great deal depending on the problem conditions and on the derivation of the approach (e.g. see [Brandstein97, Schmidt72, Smith87, Parisi02]).


Such solutions are generally characterized by a modest computational complexity and usually perform quite well in a reasonable range of favourable conditions. For these reasons, most of the acoustic localization techniques currently in use belong to this category. The performance of TDOA-based solutions depends very critically on the accuracy and the robustness of the time delay estimation (TDE). As a consequence, we can identify two major sources of performance degradation for TDOA methods: background noise and room reverberation (multiple sound propagation paths). The problem of how to deal with background noise has been thoroughly addressed in the literature and is well understood. In fact, a variety of clever TDOA-based solutions have been proposed for stationary Gaussian signal and noise sources with known statistics and no multi-path. Examples are those that perform the ML estimation of the time delay from the Generalized Cross-Correlation (GCC) function [Knapp76]. Solutions for non-stationary speech sources have also been proposed [Brandstein95] and generalized to modestly reverberating environments [Brandstein97a]. Unfortunately, such methods turn out to be extremely sensitive to room reverberation, and as soon as multiple paths exceed a minimum level, their performance drops dramatically. In order to deal with this problem, several solutions have been proposed for specific classes of sounds (typically speech). Most of them attempt to remove the impact of reverberation through pre-processing, or try to deal with it in the estimation phase. An example of the first sort is cepstral pre-filtering [Stéphenne95] performed before GCC in the case of small rooms. One problem of this solution, however, is that, even when limiting its applicability to small environments, it still requires long chunks of data; in addition, like most TDOA methods, it is inherently unable to deal with multiple sound sources. In conclusion, TDOA-based methods appear to be the most practical ones, but they seem suitable only for rather favourable (and often unrealistic) acoustic conditions and are limited to single-source scenarios. Conversely, steered beamformers exhibit an improved robustness and have lower memory requirements, but are characterized by a heavier computational cost. Finally, signal-space methods such as ROOT-MUSIC are characterized by the lowest computational complexity but, as of now, appear to be comparatively poor in terms of robustness to reverberation and SNR fluctuations. One way to improve the performance of source localization techniques in unfavourable acoustic conditions is to integrate the information acquired with the microphone arrays with that of sensors of another nature, such as video cameras.
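The GCC-based time delay estimation mentioned above can be made concrete with the widely used phase transform (PHAT) weighting. The sketch below is our own minimal NumPy implementation of the general idea, not code from [Knapp76]; it returns the delay of the second channel with respect to the first.

    import numpy as np

    def gcc_phat_delay(x1, x2, fs, max_tau=None):
        """Time delay of x2 with respect to x1 (in seconds), estimated from the
        peak of the PHAT-weighted generalized cross-correlation."""
        n = len(x1) + len(x2)                       # zero-pad to avoid circular wrap-around
        X1 = np.fft.rfft(x1, n=n)
        X2 = np.fft.rfft(x2, n=n)
        cross = X2 * np.conj(X1)
        cross /= np.abs(cross) + 1e-12              # PHAT weighting: keep the phase only
        cc = np.fft.irfft(cross, n=n)
        max_shift = n // 2
        if max_tau is not None:
            max_shift = min(int(fs * max_tau), max_shift)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = np.argmax(np.abs(cc)) - max_shift
        return shift / float(fs)

    # Toy check: x2 is x1 delayed by 40 samples
    fs = 16000
    x1 = np.random.randn(4096)
    x2 = np.concatenate((np.zeros(40), x1))[:4096]
    print(round(gcc_phat_delay(x1, x2, fs) * fs))   # approximately 40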

6.2.2 VIDEO PERSON LOCALIZATION

Many techniques for person localization and automatic person counting have been developed over the years. A complete review and some of the most reliable approaches for people localization, tracking and counting may be found in [Rossi94, Ramanan03, Hashimoto97, Yang03]. A complete survey on face tracking is also available in VISNET D32 on Video Analysis. This section reviews a widely used technique for people tracking based on template matching. People tracking can be performed efficiently with a template-based tracking approach [Beymer99]: this approach is based on the correlation between the current image and a template image T(x,y) that is recursively updated to handle changing object appearance.


The recursive updating algorithm is:

T(x,y) = α·T(x,y) + (1-α)·I(x+x_p, y+y_p)

where I is the current image, (x_p, y_p) is the object position and α is a coefficient indicating how quickly the template adapts to the current image. The main drawbacks of this approach concern the object initialization/detection and the template drift.
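A minimal NumPy sketch of this template tracker is given below. The brute-force normalized cross-correlation and the function names are our own simplification of the approach in [Beymer99]; a real implementation would restrict the search to a window around the previous position (or use FFT-based correlation) and keep α close to 1 to limit template drift.

    import numpy as np

    def match_template(frame, template):
        """Position (x, y) in `frame` where `template` matches best,
        using brute-force normalized cross-correlation."""
        fh, fw = frame.shape
        th, tw = template.shape
        t = template - template.mean()
        t_norm = np.sqrt(np.sum(t ** 2)) + 1e-12
        best_score, best_pos = -np.inf, (0, 0)
        for y in range(fh - th + 1):
            for x in range(fw - tw + 1):
                patch = frame[y:y + th, x:x + tw]
                p = patch - patch.mean()
                score = np.sum(p * t) / (np.sqrt(np.sum(p ** 2)) * t_norm + 1e-12)
                if score > best_score:
                    best_score, best_pos = score, (x, y)
        return best_pos

    def update_template(template, frame, pos, alpha=0.9):
        """Recursive update T = alpha*T + (1 - alpha)*I(x + xp, y + yp)."""
        x, y = pos
        th, tw = template.shape
        return alpha * template + (1.0 - alpha) * frame[y:y + th, x:x + tw].astype(float)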


The background can be efficiently subtracted using stereo disparity maps: this allows an efficient foreground detection and segmentation as shown in the figure below:

Fig 6.1 Foreground detection and segmentation (panels: left image, background, disparities, foreground)

The person detection is based on templates encoding head and torso shape that are used for the correlation function.

Fig 6.2 Templates for head and torso shape

The simultaneous use of template matching and background subtraction offers an affordable person segmentation tool. The full scheme for person detection and tracking is shown below:

Fig 6.3 Full scheme for person detection and tracking (blocks: background initialization, background subtraction, foreground, stereo, left intensity, person templates, detection, tracking)

The person tracking is based on two classical algorithms: template updating and Kalman filtering of the person location in 3D. Using the disparity maps, the path of each person in the scene can be tracked as shown below.


Fig 6.4 Person path tracking

The first person appearing in the scene is detected by cycling through the following flow-chart: the presence of a new object (such as a person) in the scene is detected by looking at the variations in the image histogram. If something changes inside the image, each disparity map is correlated with the person template. If the correlation exceeds a threshold, the approach decides that a person has entered the scene and starts tracking it.

Fig 6.5 Flow-chart of the person tracking algorithm (blocks: foreground f(x,y), histogram, another peak?, threshold disparities, correlate with person template, found person?, remove person from layer(x,y), count, exit)

6.2.3 RECENT WORKS IN VIDEO AND AUDIO FUSION FOR PEOPLE LOCALIZATION

In the last few years there has been a great deal of research interest in the development of robust techniques and methodologies for finding the location of mobile objects. This new interest is motivated by the increasing popularity of new applications and services that are strongly related to the ability of determining the user position.



Positioning is a major service nowadays in military, navigation and civil applications, and is especially important for multimedia applications. Audio and video signals originating from the same source tend to be related. To achieve optimal performance, a tracking system must exploit not just the statistics of each modality alone, but also the relationships between the two. Consider a system that tracks moving objects in an audio-visual scene. Such a system may use video data to track the spatial location of an object and, if an object emits sound, it may use audio data captured by a microphone array to track its location using the time delay of arrival of the audio signals at the different microphones. A tracker that exploits both modalities may be more robust and achieve better performance than one which uses either alone; each modality may compensate for weaknesses of the other. Previous approaches to tracking multiple people have mostly been limited to using solely vision or audio. Vision-based trackers typically use a combination of foreground/background classification, clustering of novel points, and trajectory estimation in one or more camera views [Beymer99]. Many of these systems are ad hoc and thus successful in only limited scenarios. Since they are not cast in a probabilistic framework, any uncertainty that might arise is not modelled. For example, blobs corresponding to separate people are merged when they come close together and split again when they separate, making it difficult to provide unique labels for different people; non-probabilistic trackers have difficulty in resolving the ambiguity in this situation. There are several proposals on how to fuse the acoustic localization approach with the visual localization; the underlying fusion techniques have been described in the previous chapter. In [Spors01], two multi-state decentralized Kalman filters are used, one for the audio data and another for the video localization. The movement of the persons is modelled with a first-order motion model where the acceleration is modelled by white Gaussian noise. Finally, a fusion module exploits the probabilistic information from the Kalman filters to estimate the position of the person. Bayesian networks have been proposed in [Pavlovic00], [Beal02]. Beal used a single probabilistic model to combine the audio and visual location estimates and introduced probabilistic graphical models for the fusion. These methods have several advantages. Since they explicitly model the actual sources of variability in the problem, such as object appearance and background noise, the resulting algorithm turns out to be quite robust. The use of a probabilistic framework leads to a solution by an estimation algorithm which is Bayes-optimal and, finally, parameter estimation and object tracking can be performed efficiently using the expectation-maximization (EM) algorithm. Vermaak et al. proposed a sequential Monte Carlo (SMC) method, also known as particle filter (PF), to fuse the sensor results [Vermaag01]. For a state-space model, a basic PF recursively approximates the filtering distribution of the states given the observations using a dynamical model and random sampling by (i) predicting candidate configurations and (ii) measuring their likelihood, as sketched below. Gatica-Perez et al. extended these ideas and use importance sampling particle filters for audio-visual tracking [Gatica03].
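The predict/measure recursion referred to above can be sketched as a generic bootstrap particle filter that fuses the two modalities by multiplying their observation likelihoods. This is not the specific algorithm of [Vermaag01] or [Gatica03]; the motion noise level and the audio and video likelihood functions are placeholders to be supplied by the application (for example a TDOA-based likelihood and a template- or colour-based likelihood).

    import numpy as np

    def particle_filter_step(particles, weights, audio_obs, video_obs,
                             audio_lik, video_lik, motion_std=0.05):
        """One predict/update/resample step of a bootstrap particle filter.
        particles: (N, d) state hypotheses; weights: (N,) importance weights;
        audio_lik(obs, particles) and video_lik(obs, particles) must return one
        likelihood value per particle."""
        n = len(particles)
        # 1. Predict: first-order motion model with white Gaussian noise
        particles = particles + motion_std * np.random.randn(*particles.shape)
        # 2. Update: weight each hypothesis by the joint audio-visual likelihood
        weights = weights * audio_lik(audio_obs, particles) * video_lik(video_obs, particles)
        weights = weights / (np.sum(weights) + 1e-300)
        # 3. Resample when the effective sample size becomes too small
        if 1.0 / np.sum(weights ** 2) < n / 2.0:
            idx = np.random.choice(n, size=n, p=weights)
            particles, weights = particles[idx], np.full(n, 1.0 / n)
        estimate = np.average(particles, axis=0, weights=weights)
        return particles, weights, estimate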

6.2.4 EXAMPLES AND APPLICATIONS FOR AUDIO-VISUAL LOCALIZATION

In the previous sections the principles of audio-visual localization were described. In this section a list of some potential applications and the benefits of audio-visual localization is given. With globalization, teleconferencing has found a wide range of commercial applications: from facilitating business meetings to aiding in remote medical diagnoses, teleconferencing is used by corporate, educational, medical, government and military organizations. Teleconferencing enables new operational efficiencies resulting in reduced travel costs, faster business decision making, increased productivity, reduced time to market, and remote classroom environments. In teleconferencing systems the localization of speaking persons is very important. From the audio point of view, the position can be used to improve the quality of the speech by using beamforming techniques. From the visual point of view, the position of the actual speaker can be used to focus the camera on the speaker.


An example of a videoconferencing system with audio-visual localization can be found in [Kapralos03]. 3D video communication is a new field of research in multimedia applications such as immersive videoconferencing or augmented reality [Mulligan04, Moezzi97, Thalmann03]. In these applications, first a 3D model of the scene has to be extracted from the audio-visual data, and second this scene has to be virtually reconstructed at the other end of the communication. Another field of application is security and surveillance. The need for security is constantly increasing; the range of applications goes from surveillance of public places, e.g. parking areas, to people counting systems.
6.3 Bibliography

[Agrawal00] M. Agrawal and S. Prasad, “Broadband DOA estimation using spatial-only modeling of array data”, IEEE Trans. Signal Processing, Vol. 48, pp. 663–670, Mar. 2000.

[Bahadori03] S. Bahadori, L. Iocchi “A Stereo Vision System for 3D Reconstruction and Semi-Automatic Surveillance of Museum Areas”, Workshop "Intelligenza Artificiale per i Beni Culturali", AI*IA-03, Pisa, Italy.

[Bahadori03a] S. Bahadori, A. Cesta, G. Grisetti, L. Iocchi, R. Leone, D. Nardi, A. Oddi, F. Pecora, and R. Rasconi. “RoboCare: an Integrated Robotic System for the Domestic Care of the Elderly.” In Proceedings of Workshop on Ambient Intelligence AI*IA-03, Pisa, Italy

[Beal02] M. Beal, H. Attias, and N. Jojic, “Audio-video sensor fusion with probabilistic graphical models,” in Proc. ECCV, May 2002.

[Beymer00] Beymer, D. "Person counting using stereo". Workshop in Human Motion Dec 2000. Proceedings.

[Beymer99] D. J. Beymer and K. Konolige, “Real-time tracking of multiple people using stereo”. In Frame-Rate Workshop, 1999

[Bienvenu83] G. Bienvenu, L. Kopp, “Optimality of high-resolution array processing using the eigensystem approach,” IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-31, pp. 1235–1248, Oct. 1983

[Brandstein01] M. Brandstein, D. Ward, Eds. “Microphone arrays – Signal processing techniques and applications”. Springer-Verlag, 2001

[Brandstein95] M. S. Brandstein, J. E. Adcock, H. F. Silverman, “A practical time-delay estimator for localizing speech sources with microphone arrays”, Computer Speech and Language, Vol. 9, pp.153-169, Apr. 1995

[Brandstein97] M. S. Brandstein, J. E. Adcock, H. F. Silverman, “A Closed-Form Estimator for use with Room Environment Microphone Arrays,” IEEE Transactions on Speech and Audio Processing, Vol. 5, No. 1, pp. 45-50, Jan. 1997

[Brandstein97a] M. S. Brandstein, J. E. Adcock, H. F. Silverman, “A practical methodology for speech source localization with microphone arrays”, Computer Speech and Language, Vol. 11, pp. 91-126, Apr. 1997

[Buckley88] K. Buckley, L. Griffiths, “Broad-band signal-subspace spatial-spectrum (BASS-ALE) estimation”. IEEE Tr. Acoust., Speech, Signal Processing, Vol. ASSP-36, pp. 953-964, July 1988


[Gatica03] D. Gatica-Perez, G. Lathoud, I. McCowan, J-M. Odobez, and D. Moore, “Audio-visual speaker tracking with importance particle filters”, in Proceedings of the IEEE International Conference on Image Processing, September 2003

[Hahn73] W. Hahn, S. Tretter, “Optimum processing for delay-vector estimation in passive signal arrays”. IEEE Tr. Information Theory, Vol. IT-19, pp. 608-614, Sept. 1973

[Hashimoto97] Hashimoto, K. et al. "People count system using multi-sensing application". International Conference on Transducers, June 1997.

[Haykin84] S. Haykin, Ed., “Array Signal Processing”. Prentice Hall, 1984

[Haykin91] S. Haykin, “Adaptive filter theory”. Prentice Hall, Second ed., 1991

[Johnson93] D. Johnson, D. Dudgeon, “Array signal processing: concepts and techniques”. Prentice Hall, 1993

[Kapralos03] B. Kapralos, M. R. M. Jenkin, and E. Milios, “Audio-visual localization of multiple speakers in a video teleconferencing setting”, Int. J. Imaging Systems and Technology, 13:95-105, 2003

[Knapp76] C.H. Knapp, G.C. Carter, “The generalized correlation method for estimation of time delay”. IEEE Tr. Acoust., Speech, Signal Processing, Vol. ASSP-24, pp. 320-327, Aug. 1976

[Krim96] H. Krim, M. Viberg, “Two decades of array signal processing research: the parametric approach.” IEEE Signal Processing Magazine, Volume: 13 Issue: 4 , Page(s): 67-94, July 1996

[Krolik89] J. Krolik and D. N. Swingler, “Multiple wide-band source location using steered covariance matrices,” IEEE Trans. Acoust., Speech, Signal Processing, Vol. 37, pp. 1481–1494, Oct. 1989

[Kumaresan83] R. Kumaresan and D.W. Tufts, “Estimating the angles of arrival of multiple plane waves,” IEEE Trans. Aerosp. Electron. Syst., vol. AES-19, pp. 134–139, Jan. 1983

[Moezzi97] Moezzi, S. "Immersive Telepresence". IEEE Multimedia. March 1997.

[Moor93] B. De Moor, “The singular value decomposition and long and short spaces of noisy matrices,” IEEE Trans. Signal Processing, vol. 41, pp. 2826–2838, Sept. 1993

[Mulligan04] Mulligan,J; Zabulis, X; Kelshikar, N; Daniilidis, K. "Stereo-based environment scanning for immersive telepresence". IEEE Trans. on Circuits and Systems, March 2004.

[Parisi02] R. Parisi, R. Gazzetta, E. Di Claudio, “Prefiltering approaches for time delay estimation in reverberant environments”. IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP2002, Orlando, Florida, USA, May 13-17, 2002

[Pavlovic00] V. Pavlovic, A. Garg, and J. Rehg, “Multimodal speaker detection using error feedback dynamic Bayesian networks,” in Proc. IEEE CVPR, Hilton Head Island, SC, 2000

[Ramanan03] Ramanan, D.; Forsyth, D.A.; "Finding and tracking people from the bottom up". Computer Vision and Pattern Recognition. June 2003.

[Rao89] B. D. Rao and K.V. S. Hari, “Performance analysis of root-music,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 1939–1949, Dec. 1989

[Rossi94] Rossi, M.; Bozzoli, A. "Tracking and counting moving people". ICIP-94. Nov. 1994


[Roy89] R. Roy and T. Kailath, “ESPRIT—Estimation of signal parameters via rotational invariance techniques,” IEEE Trans. Acoust., Speech, Signal Processing, Vol. 37, pp. 984–995, July 1989

[Schmidt72] R. Schmidt, “A new approach to geometry of range difference location”. IEEE Tr. Aerosp. Electron., Vol. AES-8, pp. 821-835, Nov. 1972

[Schmidt86] R. O. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Trans. Antennas Propagat., vol. AP-34, pp. 276–280, March 1986

[Smith87] J. Smith, J. Abel, “Closed-form least-squares source location estimation from range-difference measurements”. IEEE Tr. Acoust., Speech, Signal Processing, Vol ASSP-35, pp. 1661-1669, Dec. 1987

[Spors01] S. Spors, N. Strobel, R. Rabenstein, “A Multi-Sensor Object Localization System”. Vision, Modeling and Visualization (VMV), pp. 19-26, T. Ertl, B. Girod, G. Greiner, H. Niemann, H.-P. Seidel, Stuttgart, Germany, 2001

[Stéphenne95] A. Stéphenne, B. Champagne, “Cepstral prefiltering for time delay estimation in reverberant environments”. IEEE Intl. Conf. Acoust., Speech, Signal Processing (ICASSP-95), Detroit MI, USA, pp. 3055-3058, May 1995

[Stoica97] P. Stoica, R. Moses, “Introduction to Spectral Analysis”, Prentice Hall, 1997

[Thalmann03] Thalmann, D. "Virtual humans for virtual reality and augmented reality" Proceedings of the IEEE Virtual Reality. Tutorial. March 03

[Vermaag01] J. Vermaak, M. Gagnet, A. Blake, and P. Perez, “Sequential Monte-Carlo fusion of sound and vision for speaker tracking,” in Proc. IEEE ICCV, Vancouver, July 2001

[Viberg91] M. Viberg, B. Ottersten, and T. Kailath, “Detection and estimation in sensor arrays using weighted subspace fitting”, IEEE Trans. Signal Processing, Vol. 39, pp. 2436–2449, Nov. 1991

[Wang85] H. Wang, M. Kaveh, “Coherent signal-subspace processing for the detection and estimation of angles of arrival of multiple wide-band sources”, IEEE Tr. Acoust., Speech, Signal Processing, Vol. ASSP-33, pp. 823-831, Aug. 1985

[Yang03] Yang, D.B; Gonzalez-Banos, H.H. "Counting people in crowds with real-time network of simple image sensors". International Conference on Computer Vision. Proceedings. 2003


7. AUDIOVISUAL PERSON RECOGNITION AND VERIFICATION

7.1 Introduction

Detecting and recognizing people in video sequences (in this section, the term video sequence also implies an associated audio stream) is a key process in video indexing and surveillance applications. For applications with a controlled background, such as access control systems or driving license databases, the segmentation and recognition of people (faces) from still images is not a serious problem to overcome; but this is not the case for the majority of surveillance applications, where the images present a cluttered background full of objects. Take the example of a static picture of an airport scene: for these kinds of images the segmentation, detection and recognition of a person can represent a very serious and complicated challenge. For this reason, over the last few years, research on human recognition from image sequences has been a very active topic. A recognition advantage for moving faces was first demonstrated in [Knight97], which found that faces presented in negative (contrast-reversed) were better recognised when shown as an animated sequence than when they were shown as a single static image. Moreover, psychological studies [Lander99] have shown that the recognition advantage for moving images is not solely due to the increased amount of static-based information contained in a moving sequence: when the same frames were shown but not animated, recognition rates were significantly lower than when they were shown as an animated sequence. Furthermore, the precise dynamic characteristics of the motion seem to be important in mediating the recognition advantage of motion. Recognition of thresholded images was better from sequences which maintained their original dynamic properties (motion) than when the same sequences were shown speeded up, slowed down, with the rhythm disrupted or in reverse. The main conclusion of these experiments is that it may be more useful to use video sequences instead of still images for person (face) recognition applications and systems where a video sequence is available. The main advantages of using a video-based face recognition system should be:

1. Possibility of selecting good frames for the recognition. 2. Use of multiple cues (audio, motion, text, etc…). 3. Estimation of occluded parts of the face using several adjacent frames.

In [Zhao03] three categories of video-based face recognition were enumerated, according to how they use the spatio-temporal information in a sequence:

1. Still-image methods, which apply one of the traditional techniques such as PCA, LDA, ICA, kernel functions, etc. frame-by-frame; the results for each frame are then combined in order to take a decision about the identity of the person.

2. Multimodal methods, which make use of the different cues present in a video sequence, such as audio or captions (text), in order to obtain a more accurate recognition.

3. Spatio-temporal methods, which can be seen as an extension of the second category but which also use tracking features over time for recognition purposes.

The first category does not exploit all the advantages mentioned above, because it does not use the different modalities of the video. Moreover, still-image approaches present a high computational burden because they process all the video frames of the sequence without taking the temporal correlation between consecutive frames into account. This leads to a computationally inefficient algorithm without improving the overall performance.


Since the third category is only an extension of the second one, in the rest of this section only multimodal recognition approaches will be considered. These are described in more detail in section 7.3, which in fact provides an overview of the state of the art in this field. A generic block diagram of an audiovisual recognition system is illustrated in Fig 7.1. The first step, video pre-processing, refers to a shot selection module where only the shots where people appear are accepted. This step helps to reduce the number of shots to be further analyzed and is described with a specific example for news sequences in the next section (7.2). Then, individual experts are combined with some of the techniques explained in section 5 to verify or recognize whether person m appears in each selected shot. A complete review of the main multimodal approaches for person recognition found in the literature is presented in section 7.3, differentiating between adaptive and non-adaptive methods.

Fig 7.1 Multimodal person recognition system

7.2 Shot selection in video sequences

In video-based person recognition approaches, one of the most important parameters is the computational burden of the algorithms. If the video-based method analyzes all the frames of all video shots or audio segments, the associated computational cost may be too high for real-time applications. Thus, in general, video-based person recognition approaches include a pre-processing module for the detection of the shots where people appear. This module reduces the computational load of the analysis system, since the number of shots to be processed will be much smaller. Obviously, the criteria for selecting only some target video shots depend on the application scenario. It is necessary to process and make use of all the a priori knowledge about the video sequence in order to extract as much lateral information as possible, which can then be used to accept or discard video shots in an accurate and fast way. Thus, the methods involved in the shot selection module cannot be generalized to all possible application scenarios, such as detection of people in movies or surveillance applications. In the following, a shot selection module for detecting people in news sequences will be described in more detail; it will serve as a practical example. This shot selection module is part of a complete audio-visual person detection and verification system proposed by Dr. Alberto Albiol from the Universitat Politècnica de Valencia in his thesis [Albiol03b], which was supervised by Luis Torres, a member of the UPC group. In the proposed work [Albiol03b], the shot selection module starts by taking advantage of a priori knowledge of the editing techniques used in news sequences. Secondly, the video is segmented into shorter logical units, called video shots and audio segments for video and audio respectively. Finally, the concordance between the video and audio logical units is evaluated using lateral information, such as shot activity and speech detection, in order to select the correct shots where people appear and reject the others.


7.2.1 ELEMENTS OF THE VIDEO SEQUENCE (NEWS SEQUENCE)

In general, the people that appear in a news video sequence can be divided into two types: the first are news anchors and reporters; the second are people who are the subject of the news stories. Normally, the latter type is the one used for video indexing applications; for this reason, the goal of the shot selection system is to detect shots that contain the second type of people and in which these people are also speaking. It is important to note that these types of sequences are usually pre-recorded. Fig 7.2 illustrates the type of shots that the algorithm aims to locate. The editing procedure for this type of scene can be summarized as follows:

1. The anchorman or reporter starts recording his audio narration.
2. If the comments of somebody related to the news story are going to be inserted, then the audio and video of the person are inserted together, creating a simultaneous audio and video transition.
3. If the reporter continues speaking or if more comments have to be inserted, the first two steps are repeated until the narration is done.
4. Finally, images are used to pad the video track during the reporter's narration. Usually, several shots are inserted for each audio segment.

The main consequence of this editing procedure is that it is possible to detect these shots by examining the matching of the audio and video transitions.

Fig 7.2 Examples of two pre-recorded news stories: (a) the news reporter speaks the whole time; (b) the comments of a person related to the news story are inserted.

Of course, additional clues can be considered; for instance, speech detection can be used to discard non-speech segments. Also, shots where the person who is speaking also appears in the image are usually characterized by low motion activity. All of these properties can be used in the shot selection block to reject false shot candidates quickly and efficiently. Fig 7.3 shows a possible block diagram of the shot selection module for the news sequence application.


Fig 7.3 Possible block diagram used to detect person shots in news sequences

7.2.2 AUDIO AND VIDEO SEGMENTATION

As mentioned before, it is not possible to analyze a complete video sequence at once; instead it is necessary to decompose it into smaller logical units. The smallest logical unit for video is called a shot, which corresponds to a continuous set of frames taken from one camera. Transitions between shots can be abrupt (cuts) or gradual (special effects such as dissolves, wipes, fade-in, fade-out…). The second type is more difficult to detect because the difference between consecutive frames is smaller. A great variety of techniques has been proposed in the literature for the detection of transitions in video sequences. These video segmentation techniques can be broadly divided into three main categories: colour-based, edge-based and compressed-domain approaches. The colour-based algorithms are the most widely used and are considered to offer the best trade-off between computational cost and performance; their main drawback is a low detection rate for gradual transitions. The edge-based approaches obtain more or less the same results as the previous ones when detecting abrupt transitions but, in general, they achieve a higher detection rate for gradual transitions; their main disadvantage is the high computational cost. Finally, the compressed-domain algorithms use the basics of compression standards (MPEG) to segment the video. Normally, they are very fast, but their overall performance is lower than that of the two previous categories. As the main goal of the shot selection module is to reduce the computational burden of the person recognition system, the video should be segmented with a fast and reasonably good algorithm. Moreover, in news sequences gradual transitions are much less frequent than cuts, so that a simple pixel-wise segmentation approach can be implemented to measure the difference between consecutive frames using the mean absolute frame difference (MAFD):

MAFD(n) = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \left| f_n(i,j) - f_{n-1}(i,j) \right|
where I and J are the horizontal and vertical dimensions of the frames, n is the frame index, and (i, j) are the spatial coordinates. The same procedure has to be applied to the audio in order to obtain shorter audio segments which can be compared with the video shots in the next block (concordance analysis). To achieve this, a simple audio silence detector can be used.
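A minimal sketch of MAFD-based cut detection is given below; the fixed threshold and the function names are illustrative choices, not parameters from [Albiol03b].

    import numpy as np

    def mafd(prev_frame, frame):
        """Mean absolute frame difference between two grey-level frames."""
        return np.mean(np.abs(frame.astype(float) - prev_frame.astype(float)))

    def detect_cuts(frames, threshold):
        """Indices n at which MAFD(n) exceeds the threshold, i.e. the
        candidate cut positions of this simple pixel-wise segmentation."""
        return [n for n in range(1, len(frames))
                if mafd(frames[n - 1], frames[n]) > threshold]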

7.2.3 MODALITIES CORRESPONDENCE

Audio and video correspondence
Once the audio and video segments are located, the objective is to find the correspondence between them. Fig 7.4 shows how the audio and video segments overlap when a person appears and is also talking in the news story. However, for real sequences the borders of the audio and video segments do not match exactly, as shown in Fig 7.4. This is mainly because silence periods are usually located at the audio segment borders, creating small inaccuracies.


Fig 7.4 shows an example of the typical situation for news stories, where a long audio segment coexists with several short video shots.

Fig 7.4 (a) Audio and video borders match exactly. (b) Audio and video borders almost match. (c) Audio segment contains several video shots.

Given an audio segment in the time interval [t_{min1}, t_{max1}] and a video segment defined in the interval [t_{min2}, t_{max2}], their intersection is defined as:

[t_{min\cap}, t_{max\cap}] = \left[ \max(t_{min1}, t_{min2}),\; \min(t_{max1}, t_{max2}) \right]

Then, if (t_{max\cap} - t_{min\cap}) > 0 for a pair of audio and video segments, the overlap is defined as:

overlap = \min\left\{ \frac{t_{max\cap} - t_{min\cap}}{t_{max1} - t_{min1}},\; \frac{t_{max\cap} - t_{min\cap}}{t_{max2} - t_{min2}} \right\}

If this overlap parameter is above a certain threshold th then the audio and video segments are said to match and the shot is considered for further analysis using other modalities or cues.
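The intersection and overlap test can be written directly from the formulas above; the default threshold in the sketch is only an illustrative value.

    def segment_overlap(audio_seg, video_seg):
        """Overlap between an audio segment (t_min1, t_max1) and a video
        segment (t_min2, t_max2); returns 0 when they do not intersect."""
        t_min1, t_max1 = audio_seg
        t_min2, t_max2 = video_seg
        t_min_i = max(t_min1, t_min2)
        t_max_i = min(t_max1, t_max2)
        if t_max_i - t_min_i <= 0:
            return 0.0
        return min((t_max_i - t_min_i) / (t_max1 - t_min1),
                   (t_max_i - t_min_i) / (t_max2 - t_min2))

    def segments_match(audio_seg, video_seg, th=0.5):
        """A pair is kept for further analysis when the overlap exceeds th."""
        return segment_overlap(audio_seg, video_seg) > th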

Other modalities (motion and speech detection)
Normally, when a speaker appears in the scene, the camera is placed at a fixed position focusing on the speaker. This low activity in the image can be used to further discard some of the shots selected by only examining the matching of the audio and video segments. The proposed measurement to quantify this low activity within a shot is based on the mean of the MAFD and can be calculated as:

SA = \frac{1}{n_e - n_s + 1} \sum_{n=n_s}^{n_e} MAFD(n)
where n_s and n_e are the initial and final frame numbers of the analyzed shot. Low values correspond to "static" shots. Finally, speech detection can also be applied to discard all non-speech shots.
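A small sketch of the shot-activity test follows, reusing the MAFD values already computed during segmentation; the activity threshold is an illustrative parameter.

    def shot_activity(mafd_values, ns, ne):
        """Mean MAFD over the frames ns..ne of a shot; low values correspond to
        the static shots that typically show the person who is speaking."""
        return sum(mafd_values[ns:ne + 1]) / float(ne - ns + 1)

    def is_static_shot(mafd_values, ns, ne, activity_threshold):
        """Keep a shot only if its activity stays below the threshold."""
        return shot_activity(mafd_values, ns, ne) < activity_threshold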


7.3 State-of-the-art of multimodal person recognition

7.3.1 INTRODUCTION

Multimodal person recognition is a relatively recent problem that has been studied by several researchers. This section reviews the most representative approaches to multimodal person recognition found in the literature. Although some numerical results are presented, it must be clear that a direct comparison between them is meaningless, because the experiments were not carried out on the same database or with the same modality experts for all approaches. Numerical results are only used to demonstrate, in an experimental and heuristic way, the improvement of multimodal analysis (information fusion) over monomodal analysis. The intention of this section is not to give a detailed description of all the experts (face, lip, speech, text experts) present in the approaches, but only to give a general view of each system and how its experts are fused. Multimodal person recognition approaches can be subdivided into two areas: adaptive and non-adaptive. In non-adaptive approaches, the contribution of each expert is fixed a priori. In adaptive approaches, the contribution of at least one expert is varied according to its reliability and discrimination ability under the current environmental conditions (see section 5.4). For example, the contribution of a speech expert is decreased when the audio SNR is lowered.

Among all the techniques proposed to combine multimodal information sources, post-mapping fusion techniques are by far the most frequent solution adopted in multimodal person recognition. However, some systems that use pre-mapping fusion techniques have also been proposed. An example of an approach that uses pre-mapping fusion is presented in [Luettin97]. In this work, speech features and visual speech features extracted from lip tracking are used for person recognition via feature vector concatenation. In order to match the frame rates of the two feature sets, speech information is extracted at 30 fps instead of the usual 100 fps. The results presented in this work show a better performance when both modalities are used. More concretely, the author reported numerical results for two video sequences with a 77% identification rate (on average) for visual information, 98.5% for audio information and 100% for the multimodal analysis. Although this is not a representative sample, it indicates that acoustic and lip features contain complementary information for person recognition. On the other hand, opinion fusion is usually preferred to decision fusion in systems that use post-mapping fusion techniques. An example of a multimodal system that uses decision fusion techniques can be found in [Verlinde99]. In this work, majority voting and AND & OR fusion methods are compared using three different modalities: frontal face, face profile and text-independent speaker recognition. For each modality a classifier was used to provide a hard decision about each class. From the experiments, it was found that the performances of the three fusion approaches improved on those obtained by the best single modality, reducing the Equal Error Rate (EER, the point where the False Acceptance Rate equals the False Rejection Rate) from 10% to 7%. When the three fusion schemes were compared, it was reported that the AND fusion outperformed the other two, followed by majority voting and finally the OR fusion approach. Another system that uses the AND rule was presented in [Poh01]. In [Chibelushi93] a decision fusion system is presented, where information from still face profile images and speech is combined using a form of weighted summation fusion:

f = w_1 o_1 + w_2 o_2



where o_1 and o_2 are the opinions from the speech and face profile experts respectively, and w_1 and w_2 are their corresponding weights. Each opinion reflects the likelihood that a given claimant is the true claimant. The verification decision was reached by thresholding the fused opinion f, which outperformed each separate modality. A similar approach is presented in [Choudhury99]. This system was composed of a face recognition module, a speaker identification module and a classifier fusion module. The first step of the recognition system is to accurately and robustly detect the face. Choudhury uses the following stages: (a) detection using skin colour information; (b) detection of the face feature locations using symmetry transforms and the image intensity gradient; (c) computation of the feature trajectories using correlation-based tracking; (d) processing of the trajectories to stably recover the 3D structure and 3D facial pose; and (e) use of a 3D head model to warp and normalize the face to a frontal position. Then a probabilistic PCA approach retaining 35 eigenvectors (eigenfaces) is used to recognize the face candidates, with the main contribution of using the 3D information to get depth values for the tracked faces in order to differentiate between real heads and a photo held in front of the camera. This is very useful for security applications such as Automated Teller Machine (ATM) authentication procedures. On the other hand, Choudhury's speaker identification is based on a simple set of linear spectral features which are characterized with HMMs. Finally, the face and speaker recognition modules are combined using weighted summation fusion whose weights are estimated in a Bayesian framework using the confidences of each expert. The training and test data from 26 people were collected in an ATM scenario. The setup included a single camera and a microphone placed at average head height. The experiments showed that the fusion of audio and video improved performance and that 100% recognition and verification were achieved when the image / audio clips with the highest confidence scores were used. In [Jourlin97a, Jourlin97b], a form of weighted summation fusion is used to combine the opinions of a text-dependent speech expert and a text-independent lip expert. Using an optimal weight, fusion led to better performance than using the underlying experts alone, reducing the false acceptance rate of the acoustic sub-system from 2.3% to 0.5% and obtaining a 100% recognition rate. The experimental results were obtained on the M2VTS audio-visual database. Another speaker recognition system that uses the weighted summation approach to combine the opinions of a speech expert and a lip expert (both text-independent) can be found in [Wark99]. In this work, the performance of the speech expert was deliberately decreased by adding varying amounts of white noise to the speech data. Experimental results showed that although performance was always better than using the speech expert alone, it significantly decreased as the noise level was increased. It was also found that, depending on the values of the weights, the performance at high noise levels was actually worse than using the lip expert alone. From these results, the authors proposed a statistically based method for selecting the weights which resulted in good performance in clean conditions and never fell below the performance of the lip expert in noisy conditions.
In [Kittler97] multiple images of one person are used to generate multiple opinions using a frontal face expert. Although this is formally not a multimodal system (since only visual information is used), decision fusion techniques are applied to combine the opinions extracted from each image. In this work, opinions are fused by various means, including simple averaging. It was shown that error rates were reduced by up to 40% compared to using a single image. It was also found that performance gains tended to saturate when more than five images were used. These results suggest that using a video sequence rather than a single image provides superior performance. In [Brunelli95a], the opinions from a face expert (which used geometric features from static frontal images) and a speech expert are combined using the weighted product approach:

f = o_1(x)^{w_1} \, o_2(x)^{(1 - w_1)}

In this system, identification rates of 81% and 92%, obtained by the speech and face experts respectively, were improved to 95% for the combined approach using optimal weights.
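These two opinion fusion rules can be illustrated with a minimal Python sketch. The opinions, weights and decision threshold below are arbitrary illustrative values, not taken from any of the cited systems, and the opinions are assumed to be normalized to [0, 1].

import numpy as np

def weighted_sum_fusion(opinions, weights):
    # Weighted summation fusion: f = sum_i w_i * o_i
    return float(np.dot(np.asarray(weights, float), np.asarray(opinions, float)))

def weighted_product_fusion(opinions, weights):
    # Weighted product fusion: f = prod_i o_i ** w_i
    o = np.asarray(opinions, float)
    w = np.asarray(weights, float)
    return float(np.prod(o ** w))

if __name__ == "__main__":
    opinions = [0.82, 0.64]   # illustrative opinions of a speech expert and a face expert
    weights = [0.7, 0.3]      # w_2 = 1 - w_1 in the two-expert case
    threshold = 0.5           # accept the identity claim if the fused opinion exceeds it
    for name, fuse in [("sum", weighted_sum_fusion), ("product", weighted_product_fusion)]:
        f = fuse(opinions, weights)
        print(name, "fusion:", round(f, 3), "->", "accept" if f > threshold else "reject")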

Hybrid schemes that combine opinion and decision fusion techniques are also found in the literature. For instance, in [Dieckmann97] three experts are proposed: a frontal face expert, a dynamic lip image expert and a text-dependent speaker expert. A hybrid fusion scheme involving majority voting and opinion fusion was used: two of the experts had to agree on the decision and the combined opinion had to exceed a specific threshold. The hybrid fusion scheme performed better than the individual experts.

In other cases, hybrid systems that use both decision and opinion techniques are proposed when confidence values are available for some modalities but only decisions are available for the others. In [Trivedi00] an example of this situation is presented, where the opinions from a face expert are fused with the decisions obtained from a speaker classifier; the authors did not present any experimental results. Another example of a hybrid system can be found in [Hong98], where the outputs of a fingerprint expert and a frontal face expert are combined. A hybrid fusion scheme involving ranked-list and opinion fusion is used: the opinions of the face expert for the top n identities are combined with the opinions of the fingerprint expert for the corresponding identities using a form of the product approach. This hybrid approach was adopted to take into account the relatively high computational complexity of the fingerprint expert. It was shown that in all tested cases fusion led to better performance than using either expert alone, as illustrated in Fig 7.5.

False Acceptance Rate (FAR)    Face Expert FRR    Fingerprint Expert FRR    Multimodal system FRR
1.0%                           15.8%              3.9%                      1.8%
0.1%                           42.2%              6.9%                      4.4%
0.01%                          61.2%              10.6%                     6.6%
0.001%                         64.1%              14.9%                     9.8%

Fig 7.5 Experimental results according to [Hong98]
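The ranked-list/opinion hybrid of [Hong98] can be sketched roughly as follows; the candidate identities, the scores and the value of n are invented for illustration, and the code is only a simplified reading of the scheme described above, not the actual matchers of [Hong98].

def hybrid_rank_product_fusion(face_opinions, fingerprint_opinion_fn, n=3):
    # face_opinions: dict identity -> opinion in [0, 1] from the face expert.
    # fingerprint_opinion_fn: callable identity -> opinion in [0, 1]; it is only
    # evaluated for the top-n face candidates, to limit its computational cost.
    top_n = sorted(face_opinions, key=face_opinions.get, reverse=True)[:n]
    fused = {ident: face_opinions[ident] * fingerprint_opinion_fn(ident)
             for ident in top_n}
    best = max(fused, key=fused.get)
    return best, fused[best]

if __name__ == "__main__":
    face = {"alice": 0.91, "bob": 0.88, "carol": 0.35, "dave": 0.10}
    fingerprint = {"alice": 0.40, "bob": 0.95, "carol": 0.60, "dave": 0.20}
    identity, score = hybrid_rank_product_fusion(face, fingerprint.get, n=3)
    print(identity, round(score, 3))   # fingerprint evidence promotes "bob"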

As introduced above (Section 5), another possible alternative for opinion fusion is the use of a post-classifier, which reaches the final decision from the opinions provided by the experts. In [Ben99] the use of a post-classifier for opinion fusion is proposed and several binary classifiers are investigated for this role: an SVM, a Bayesian classifier using Beta distributions, Fisher's linear discriminant, a decision tree and a multilayer perceptron. Three experts were used: a frontal face expert and two speech experts. In order to evaluate the different fusion schemes, their performance was tested on a database of 295 subjects following a specific testing protocol. It was shown that the SVM classifier (using a polynomial kernel) and the Bayesian classifier provided the best results, with an Equal Error Rate lower than 1.9%.
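A post-classifier of this kind can be sketched with scikit-learn as shown below. The synthetic opinion vectors, the polynomial degree and the sample sizes are illustrative assumptions and do not reproduce the configuration or testing protocol of [Ben99].

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic opinion vectors from three experts: true claims cluster near 1,
# impostor claims cluster near 0 (with some overlap).
true_claims = np.clip(rng.normal(0.8, 0.15, size=(200, 3)), 0, 1)
impostor_claims = np.clip(rng.normal(0.3, 0.15, size=(200, 3)), 0, 1)
X = np.vstack([true_claims, impostor_claims])
y = np.concatenate([np.ones(200), np.zeros(200)])

# SVM post-classifier with a polynomial kernel, one of the options studied in [Ben99].
post_classifier = SVC(kernel="poly", degree=2)
post_classifier.fit(X, y)

# Decide on a new claim from its vector of expert opinions.
new_claim = np.array([[0.75, 0.60, 0.85]])
print("accept" if post_classifier.predict(new_claim)[0] == 1 else "reject")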

A similar and more recent multimodal approach was presented by Albiol and Torres in [Albiol02] and [Albiol03], focused on the indexing of news video sequences using audio and video features. Their method consists of four main modules: (a) a shot selection block; (b) a speaker recognition expert; (c) a face recognition expert; and (d) an audio-visual combiner using a post-classifier. The objective of the system is to locate the video shots where a given person m appears and is also speaking. Speaker recognition is based on Gaussian Mixture Models built from feature vectors consisting of 12 mel-frequency cepstral coefficients and their corresponding delta coefficients; a 32-component GMM is created for each person using 2-3 minutes of clean speech, while a universal background model is built from 1 hour of speech recorded from a variety of speakers. The face recognition expert is based on PCA using self-eigenfaces. Two possible post-classifiers were tested to fuse the information: a Bayesian post-classifier and linear discriminant functions, the best performance being achieved by the post-classifier based on linear discriminant functions. The reported results improved from a 92% correct recognition rate using either video or audio separately to a 97% correct detection rate using multimodal analysis.
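The speaker expert of such a system can be sketched as a GMM/UBM log-likelihood-ratio scorer, in the spirit of the description above. The use of scikit-learn's GaussianMixture, the random MFCC-like features and the diagonal-covariance setting are illustrative assumptions rather than the actual implementation of [Albiol03].

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Stand-ins for 24-dimensional MFCC + delta feature frames.
target_train = rng.normal(0.5, 1.0, size=(2000, 24))   # frames from the target speaker
background = rng.normal(0.0, 1.0, size=(8000, 24))      # frames from many other speakers
test_segment = rng.normal(0.5, 1.0, size=(300, 24))     # frames from one video shot

# 32-component models, following the component count quoted above.
speaker_gmm = GaussianMixture(n_components=32, covariance_type="diag",
                              random_state=0).fit(target_train)
ubm = GaussianMixture(n_components=32, covariance_type="diag",
                      random_state=0).fit(background)

# Average log-likelihood ratio of the test frames; high values support the
# hypothesis that the target person is speaking in this segment.
llr = np.mean(speaker_gmm.score_samples(test_segment)
              - ubm.score_samples(test_segment))
print("average log-likelihood ratio:", llr)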

7.3.3 ADAPTIVE APPROACHES

In [Brunelli95b] a weighted geometric average is used to fuse the opinions from two speech experts (for static and delta features) and three face experts (for the eyes, nose and mouth areas) for person identification. The weights used in the geometric average are adaptive and depend on the score distributions. According to the authors, the correct identification rate of the integrated system is 98%, which represents a significant improvement with respect to the 88% and 91% provided by the speaker and face recognition systems respectively. The scores are normalized in order to obtain confidence values mapped into a standard interval [0,1] using the tanh-estimators introduced by Hampel [Huber81], which are more efficient and robust to outliers than standard statistics such as the mean or the variance. Each list of scores \{S_{i,j}\}, i = 1…N, from classifier j, where N is the number of people in the reference database, can thus be transformed into a normalized list by the following mapping:

S'_{i,j} = \frac{1}{2}\left[\tanh\left(0.01\,\frac{S_{i,j} - \mu_{\tanh}}{\sigma_{\tanh}}\right) + 1\right] \in [0,1]

where \mu_{\tanh} and \sigma_{\tanh} are the mean and standard deviation estimates of the scores \{S_{i,j}\}, i = 1…N, as given by the Hampel estimators. As already mentioned, the normalized scores are then integrated using a weighted geometric average:

S_i = \left[\prod_j \left(S'_{i,j}\right)^{w_j}\right]^{1/\sum_j w_j}

where the weights w_j represent an estimate of the score dispersion in the right tail of the corresponding distributions:

w_j = \frac{S'_{1,j} - 0.5}{S'_{2,j} - 0.5} - 1.0

According to the previous equation, each feature is given an importance proportional to the separation of the two best scores, S'_{1,j} and S'_{2,j} being the highest and second highest normalized scores of classifier j. Ambiguous classifications from a single feature are effectively discarded by assigning them a low weight. A major advantage of this scheme is that no distribution of the features has to be assumed.
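A minimal sketch of this normalization and adaptive weighting scheme is given below. For brevity, the robust Hampel/tanh estimators of location and scale are approximated by the ordinary mean and standard deviation, negative weights are clipped to zero, and the score matrix is synthetic; these are all simplifying assumptions rather than the exact procedure of [Brunelli95b].

import numpy as np

def tanh_normalize(scores):
    # Map the raw scores of one classifier into [0, 1] with the tanh rule.
    mu, sigma = scores.mean(), scores.std()
    return 0.5 * (np.tanh(0.01 * (scores - mu) / sigma) + 1.0)

def dispersion_weight(norm_scores):
    # Weight proportional to the separation of the two best normalized scores.
    s1, s2 = np.sort(norm_scores)[::-1][:2]
    return (s1 - 0.5) / (s2 - 0.5) - 1.0

def fuse(score_matrix):
    # score_matrix[i, j]: score of person i according to classifier j.
    norm = np.column_stack([tanh_normalize(score_matrix[:, j])
                            for j in range(score_matrix.shape[1])])
    w = np.array([dispersion_weight(norm[:, j]) for j in range(norm.shape[1])])
    w = np.clip(w, 0.0, None)                      # avoid negative weights
    return np.prod(norm ** w, axis=1) ** (1.0 / w.sum())

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    scores = rng.normal(size=(50, 5))              # 50 people, 5 classifiers
    scores[7, :] += 3.0                            # person 7 is the true identity
    print("identified person:", int(np.argmax(fuse(scores))))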

In [Wark00], the authors extended their previous work [Wark99] on speaker recognition (using speech and visual lip information) by proposing another heuristic method to adjust the weights. Experimental results showed that, although the performance significantly decreased as the noise level increased, it was always better than using the speech expert alone. However, at high noise levels, non-adaptive weights were shown to provide better performance. A major disadvantage of the method is that the calculation of the weights involves finding the opinion of the speech expert for all possible identities, which limits the total number of clients due to the computational cost. The weight for the speech expert was found as follows:

w_1 = \frac{\xi_1}{\xi_1 + \xi_2} + \frac{\kappa_1}{\kappa_1 + \kappa_2}

where the a priori term was calculated during the training stage using the following equation:

\xi_i = \sqrt{\frac{\sigma^2_{i,true}}{N_{true}} + \frac{\sigma^2_{i,imp}}{N_{imp}}}

where, for the i-th expert, \xi_i is the standard error of the difference between the sample means \mu_{i,true} and \mu_{i,imp} of the opinions for true and impostor claims respectively, \sigma^2_{i,true} and \sigma^2_{i,imp} are the corresponding variances, and N_{true} and N_{imp} are the numbers of opinions for true and impostor claims respectively. Wark et al. referred to \xi_i as an a priori confidence. In contrast, \kappa_i is the a posteriori confidence and is calculated during the test stage by the following expression:

\kappa_i = \frac{1}{\mu_{i,true}}\left[\mathcal{M}_{true}(o_i) - \mathcal{M}_{imp}(o_i)\right]

For the i-th expert, \mathcal{M}_{true}(o_i) is the one-dimensional Mahalanobis distance between opinion o_i and the model of opinions for true claims, while \mathcal{M}_{imp}(o_i) is the Mahalanobis distance between o_i and the model of opinions for impostor claims. The value of \kappa_i should be large for clean conditions but should decrease under noisy conditions, so that the weight or importance of the speech expert under noisy conditions is reduced.
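This weighting scheme can be sketched as follows; the opinion models (represented here by simple mean/variance pairs for the one-dimensional Mahalanobis distances) and all numeric values are illustrative assumptions rather than Wark's actual configuration.

import math

def a_priori_confidence(var_true, n_true, var_imp, n_imp):
    # Standard error of the difference between the true- and impostor-claim
    # opinion means of one expert (computed once, during training).
    return math.sqrt(var_true / n_true + var_imp / n_imp)

def mahalanobis_1d(o, mean, var):
    # One-dimensional Mahalanobis distance of opinion o to an opinion model.
    return abs(o - mean) / math.sqrt(var)

def a_posteriori_confidence(o, true_model, imp_model):
    # Computed at test time from the current opinion o of the expert.
    m_true = mahalanobis_1d(o, *true_model)
    m_imp = mahalanobis_1d(o, *imp_model)
    return (m_true - m_imp) / true_model[0]

def speech_weight(xi, kappa):
    # Weight of the speech expert (index 0); xi and kappa are (speech, lip)
    # pairs of a priori / a posteriori confidences.
    return xi[0] / (xi[0] + xi[1]) + kappa[0] / (kappa[0] + kappa[1])

if __name__ == "__main__":
    # Illustrative statistics for the speech and lip experts: (mean, variance).
    speech_true, speech_imp = (0.8, 0.02), (0.3, 0.03)
    lip_true, lip_imp = (0.7, 0.04), (0.4, 0.05)
    xi = (a_priori_confidence(0.02, 500, 0.03, 500),
          a_priori_confidence(0.04, 500, 0.05, 500))
    kappa = (a_posteriori_confidence(0.75, speech_true, speech_imp),
             a_posteriori_confidence(0.65, lip_true, lip_imp))
    print("speech expert weight:", speech_weight(xi, kappa))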

More recently, several new methods for combining speech and face information in noisy conditions were proposed in [Sanderson02], namely: a weight adjustment procedure that explicitly measures the quality of the speech signal; a modification of the Bayesian post-classifier that allows adjusting the degree of contribution of each expert to the final verification decision; a structurally noise-resistant piece-wise linear post-classifier, which attempts to minimize the effects of noisy conditions via structural constraints on the decision boundary; and a modification of the Bayesian post-classifier that also attempts to impose such structural constraints. According to the author, experimental results showed that the proposed weight adjustment procedure outperforms Wark's adaptive approach [Wark00]. Moreover, in noisy conditions, the noise-resistant piece-wise linear post-classifier has similar performance to the proposed weight adjustment procedure, with the advantage of having a fixed (non-adaptive) structure.

7.4 Bibliography

[Albiol02] A. Albiol, L. Torres, E. Delp, "Combining Audio and Video for Video Sequence Indexing", IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, August 26-29, 2002
[Albiol03] A. Albiol, L. Torres, E. Delp, "The Indexing of Persons in News Sequences Using Audio-visual Data", International Conference on Acoustics, Speech and Signal Processing, Hong Kong, China, April 6-10, 2003
[Albiol03b] A. Albiol, "Video indexing using multimodal information", PhD thesis, Universidad Politecnica de Valencia, Valencia, Spain, April 2003
[Ben99] S. Ben-Yacoub, Y. Abdeljaoued, E. Mayoraz, "Fusion of Face and Speech Data for Person Identity Verification", IEEE Trans. on Neural Networks, Vol. 10, No. 5, pp. 1065-1074, 1999
[Brunelli95a] R. Brunelli, D. Falavigna, T. Poggio, L. Stringa, "Automatic person recognition using acoustic and geometric features", Machine Vision & Applications, 8:317-325, 1995
[Brunelli95b] R. Brunelli, D. Falavigna, "Person identification using multiple cues", IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(10):955-965, October 1995
[Chibelushi93] C. C. Chibelushi, F. Deravi, J. S. Mason, "Voice and facial image integration for speaker recognition", IEEE International Symposium on Multimedia Technologies and Future Applications, Southampton, 1993
[Choudhury99] T. Choudhury, B. Clarkson, T. Jebara, A. Pentland, "Multimodal person recognition using unconstrained audio and video", in Audio- and Video-based Biometric Person Authentication, 1999
[Dieckmann97] U. Dieckmann, P. Plankensteiner, T. Wagner, "SESAM: A biometric person identification system using sensor fusion", Pattern Recognition Letters, 18(9):827-833, September 1997
[Hong98] L. Hong, A. Jain, "Integrating faces and fingerprints for personal identification", IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(12):1295-1306, December 1998
[Huber81] P. J. Huber, "Robust Statistics", Wiley, 1981
[Jourlin97a] P. Jourlin, J. Luettin, D. Genoud, H. Wassner, "Acoustic-labial speaker verification", Pattern Recognition Letters, 18(9):853-858, September 1997
[Jourlin97b] P. Jourlin, J. Luettin, D. Genoud, H. Wassner, "Integrating acoustic and labial information for speaker identification and verification", Proceedings of the 5th European Conference on Speech Communication and Technology, pp. 1603-1606, Greece, 1997
[Kittler97] J. Kittler, J. Matas, K. Johnson, M. U. Ramos, "Combining evidence in personal identity verification systems", Pattern Recognition Letters, 18(9):845-852, September 1997
[Knight97] B. Knight, A. Johnston, "The Role of Movement in Face Recognition", Vis. Cog., 4:265-274, 1997
[Lander99] K. Lander, V. Bruce, "Dynamic information and famous face recognition: Exploring the beneficial effect of motion", 1999, http://www.stir.ac.uk/staff/psychology/kl3/
[Luettin97] J. Luettin, "Visual Speech and Speaker Recognition", PhD Thesis, Department of Computer Science, University of Sheffield, 1997

[Poh01] N. Poh, J. Korczak, "Hybrid biometric person authentication using face and voice features", Proceedings of the 3rd International Conference on Audio- and Video-based Biometric Person Authentication, pp. 348-353, Sweden, 2001
[Sanderson02] C. Sanderson, "Automatic person verification using speech and face information", PhD thesis, School of Microelectronic Engineering, Griffith University, 2002
[Trivedi00] M. Trivedi, I. Mikic, S. Bhonsle, "Intelligent environments and active camera networks", IEEE Systems, Man and Cybernetics Conference, Nashville, October 2000
[Verlinde99] P. Verlinde, "A Contribution to Multi-Modal Identity Verification Using Decision Fusion", PhD Thesis, Department of Signal and Image Processing, Telecom Paris, France, 1999
[Wark99] T. Wark, S. Sridharan, V. Chandran, "Robust speaker verification via fusion of speech and lip modalities", Proc. of the International Conference on Acoustics, Speech and Signal Processing, pp. 3061-3064, Phoenix, 1999
[Wark00] T. Wark, S. Sridharan, V. Chandran, "The use of temporal speech and lip information for multi-modal speaker identification via multi-stream HMMs", Proc. of the International Conference on Acoustics, Speech and Signal Processing, pp. 2389-2392, Istanbul, 2000
[Zhao03] W. Zhao, R. Chellappa, P. J. Phillips, A. Rosenfeld, "Face recognition: A literature survey", ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399-458, 2003

8. MULTIMODAL VIDEO INDEXING

8.1 Introduction

Recent advances in storage, acquisition, and networking technologies are driving the creation of large amounts of rich multimedia content. However, interaction with multimedia data is still difficult. In order to facilitate video browsing, manipulation, and retrieval, a video index is required. Video indexing is the process of attaching content-based labels to video. Since the manual creation of video indexes is a tedious task, automatic video indexing is necessary.

Many single-modality video indexing methods have been proposed. Because of the multimodal nature of video (i.e. communication along different types of information channels such as visual, auditory, and textual channels), it is expected that combining different single-modality methods improves the efficiency in analyzing the semantic content of video. Some representative works on multimodal content analysis and video indexing are [Tsekeridou99, Tsekeridou01, Krinidis01].

A typical method for multimodal video indexing consists of video document segmentation, audio document segmentation, and multimodal integration.

8.2 Video Document Segmentation

Video document segmentation decomposes a video document into its layout and content elements.

8.2.1 LAYOUT RECONSTRUCTION

Layout reconstruction corresponds to the detection of boundaries between shots. A shot is defined as a sequence of frames captured by one camera in a single continuous action in time and space. The most important layout reconstruction methods are based on color, motion, and edges.

Color-based algorithms: Two consecutive frames from different shots are unlikely to have similar colors. One of the most representative algorithms of this category was presented by Zhang et al. [Zhang97], who used histograms as color descriptors. If the computed histogram distance between two consecutive frames is higher than a threshold, a cut is declared.
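A minimal sketch of such a histogram-based cut detector is shown below, operating on greyscale frames for simplicity; the bin count, the L1 histogram distance and the threshold value are illustrative choices rather than the exact settings of [Zhang97].

import numpy as np

def grey_histogram(frame, bins=64):
    # Normalized grey-level histogram of a frame (2-D uint8 array).
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_cuts(frames, threshold=0.4, bins=64):
    # Return the indices i where a cut is declared between frame i-1 and frame i.
    cuts = []
    prev = grey_histogram(frames[0], bins)
    for i, frame in enumerate(frames[1:], start=1):
        curr = grey_histogram(frame, bins)
        if np.abs(curr - prev).sum() > threshold:   # L1 histogram distance
            cuts.append(i)
        prev = curr
    return cuts

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    shot_a = [rng.integers(0, 100, size=(120, 160), dtype=np.uint8) for _ in range(10)]
    shot_b = [rng.integers(150, 256, size=(120, 160), dtype=np.uint8) for _ in range(10)]
    print("cuts at frames:", detect_cuts(shot_a + shot_b))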

Edge-based algorithms: Pioneering work on edge-based algorithms was carried out by Zabih et al. [Zabih93], who proposed to use edges as visual primitives. First, edges are extracted from two consecutive frames using the Canny detector. By computing a dissimilarity measure based on the fraction of edge pixels which enter and exit between two consecutive frames, it is possible to detect cuts and gradual transitions as local maxima.
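A dissimilarity measure in this spirit can be sketched with OpenCV as below; the Canny thresholds and the dilation size are illustrative assumptions, and the global-motion compensation of the original method is omitted.

import numpy as np
import cv2

def edge_change_ratio(prev_frame, curr_frame, canny_lo=100, canny_hi=200, dilate_px=5):
    # Fraction of edge pixels that enter or exit between two greyscale frames.
    e_prev = cv2.Canny(prev_frame, canny_lo, canny_hi) > 0
    e_curr = cv2.Canny(curr_frame, canny_lo, canny_hi) > 0
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    d_prev = cv2.dilate(e_prev.astype(np.uint8), kernel) > 0
    d_curr = cv2.dilate(e_curr.astype(np.uint8), kernel) > 0
    entering = np.logical_and(e_curr, ~d_prev).sum() / max(e_curr.sum(), 1)
    exiting = np.logical_and(e_prev, ~d_curr).sum() / max(e_prev.sum(), 1)
    return max(entering, exiting)   # high values indicate a shot change

if __name__ == "__main__":
    frame_a = np.zeros((120, 160), np.uint8)
    frame_a[30:60, 40:80] = 255          # bright rectangle in one position
    frame_b = np.zeros((120, 160), np.uint8)
    frame_b[70:110, 90:150] = 255        # rectangle somewhere else entirely
    print("ECR across a shot change:", edge_change_ratio(frame_a, frame_b))
    print("ECR between identical frames:", edge_change_ratio(frame_a, frame_a))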

Motion-based algorithms: Bouthemy et al. [Bouthemy99] proposed to use the Iteratively Reweighted Least Squares (IRLS) technique to efficiently estimate the dominant motion. This robust estimation technique makes it possible to detect the points which belong to the part of the image undergoing the dominant motion (inliers). If a cut occurs between two consecutive frames, the number of inliers is close to zero; on the other hand, if the consecutive frames are within the same shot, the number of inliers remains nearly constant.

8.2.2 CONTENT SEGMENTATION

Content segmentation is the detection of the following elements:

Setting: the place where the video occurs. Since the setting is static, content-based image retrieval can be used; see the survey of this field [Smeulders00] for more details. In [Szummer98] images are classified into indoor and outdoor using color, texture, and frequency information. By using color histograms, color coherence vectors, Discrete Cosine Transform (DCT) coefficients, edge direction histograms, and edge direction coherence vectors, outdoor images are further classified into city and landscape images in [Vailaya98].

People: the people appearing in the video document. Face detection can be used to detect people, and various algorithms are proposed in the literature for face detection. For more details, see the survey [Yang02] and the VISNET deliverable D32 [Visnet.D32].

Objects: noticeable static or dynamic entities in the video document. Object detection can be considered a generalization of the problem of people detection. A representative technique is reported in [Schneiderman00], where cars are detected by using a product of histograms.

8.3 Audio Document Segmentation

Indexing and content-based retrieval are necessary to handle the large amounts of audio and multimedia data that are becoming available on the web and elsewhere. Since manual indexing using existing audio editors is extremely time-consuming, a number of automatic content analysis systems have been proposed. Most of these systems rely on speech recognition techniques to create text indices; on the other hand, very few systems have been proposed for the automatic indexing of music and general audio. Intensive studies have been conducted on audio classification and segmentation employing different features and methods. In spite of these research efforts, high-accuracy audio classification is only achieved for simple cases such as speech/music discrimination. Pfeiffer et al. [Pfeiffer96] presented a theoretical framework and an application of automatic audio content analysis using perceptual features. Saunders [Saunders96] presented a speech/music classifier for radio broadcasts based on simple features such as the zero-crossing rate and short-time energy. A number of other techniques for the automatic analysis of audio information have been proposed [Foote99]; these approaches work reasonably well for restricted classes of audio and are based on pattern recognition techniques for classification. A general methodology for temporal audio segmentation based on multiple features is described in [Tzanet99]. A content-based classification and segmentation approach using support vector machines (SVM) was suggested by Lu et al. [Lie02]. Five classes were considered: silence, music, background sound, pure speech, and non-pure speech (which includes speech over music and speech over noise); a sound stream is segmented by classifying each segment into one of these five classes. In order to obtain higher classification accuracy, model-based procedures are used. These methods achieve accuracies of up to 90%; however, they require a priori knowledge about the type of data to be classified, and a further disadvantage is that large amounts of training data are needed.

Recent works have investigated the problem of segmenting broadcast news in the broadcast news transcription and understanding evaluations [Wood98] [Chen98] [Sankar98]. The different approaches which have been used in the ARPA evaluations can be categorized into three classes [Chen98] [Kemp00]:

1. Model-based segmentation: Different models, e.g. HMMs, are constructed from a training corpus for a fixed set of acoustic classes such as anchor speaker, music, etc. The incoming audio stream can then be classified by maximum likelihood selection. Segment boundaries are assumed where a change in the acoustic class occurs.

2. Metric-based segmentation: The audio stream is segmented at maxima of the distances between neighboring windows placed at evenly spaced time intervals.

3. Energy-based segmentation: Silence in the input audio stream is detected either by a decoder or directly by measuring and thresholding the audio energy. The segments are then generated by cutting the input at the silence locations (a minimal sketch of this variant is given below).
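As an illustration of the energy-based class, the sketch below thresholds the short-time energy to find sufficiently long silences and cuts the stream there; the frame length, the energy threshold and the minimum silence duration are illustrative values.

import numpy as np

def energy_based_segmentation(samples, sr, frame_ms=20, energy_thresh=1e-4,
                              min_silence_frames=10):
    # Split an audio signal into non-silent segments, cutting at long silences.
    # Returns a list of (start_sample, end_sample) tuples.
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)               # short-time energy per frame
    silent = energy < energy_thresh

    segments, start, run = [], None, 0
    for i, is_silent in enumerate(silent):
        if not is_silent:
            if start is None:
                start = i
            run = 0
        else:
            run += 1
            if start is not None and run >= min_silence_frames:
                segments.append((start * frame_len, (i - run + 1) * frame_len))
                start = None
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments

if __name__ == "__main__":
    sr = 16000
    rng = np.random.default_rng(5)
    speech = 0.1 * rng.normal(size=sr)                # 1 s of "speech"
    silence = np.zeros(sr)                            # 1 s of silence
    audio = np.concatenate([speech, silence, speech])
    print(energy_based_segmentation(audio, sr))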

8.4 Multimodal Integration

After the video segmentation step, the integration of multimodal layout and content is performed to construct the video index. In [Snoek01] the approaches are classified according to their distinctive properties with respect to the processing cycle, the content segmentation, and the classification method. The processing cycle of the integration method can be iterated, allowing for incremental use of context, or non-iterated. The content segmentation can be performed by using the different modalities in a symmetric (i.e. simultaneous) or asymmetric (i.e. ordered) manner. The classification can be a statistical or a knowledge-based approach. Most integration methods in the literature are symmetric and non-iterated. For example, Satoh et al. [Satoh99] proposed a system that associates names and faces in news video. To accomplish this task, the system combines the results of face sequence extraction and similarity evaluation from video, name extraction from transcripts, and video caption recognition. The integration is expressed by a co-occurrence factor C(N,F) between a face F and a name N.

A representative statistical classification method for multimodal integration is the Hidden Markov Model [Alatan01, Dimitrova00, Eickeler99, Huang99]. The advantages of this framework are its capability to integrate multimodal features and to model sequential behaviour; moreover, an HMM can also be used as a classifier combination method. In [Naphade01] a probabilistic framework for video indexing is proposed, based on multijects and multinets. Multijects correspond to probabilistic multimedia objects (e.g. mountain, beach, explosion, etc.). To model the interactions between the multijects, a multinet (i.e. a Bayesian belief network) is defined, and the multimodal integration is achieved by the use of the multinet. The authors reported a significant improvement in detection performance.

Nam et al. [Nam98] presented a technique to characterize and index violent scenes in general TV dramas and movies. Unlike [Pfeiffer96], this approach characterizes the violent scenes by integrating video and audio cues. First, motion, texture and colour are used to detect action scenes as well as explosion and blood frames. Then, it is assumed that these violent visual effects are accompanied by characteristic sound effects which are temporally correlated with them. For this audio content analysis, the authors defined two sound classes (violent and non-violent) based on a Gaussian modelling technique. The temporal signature of the audio segments is measured using the energy entropy criterion. As with probabilistic entropy, the energy entropy value drops in frames with sudden energy transitions, while it is larger in frames with nearly constant energy; therefore, violent events based on burst sounds have a small energy entropy value. The experiments were carried out using 24 ms audio frames and a classification test over 41 different audio clips with violent and non-violent scenes. The reported classification rate was about 88%.
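The energy entropy measure can be sketched as follows: a frame is divided into sub-frames and the entropy of their normalized energies is computed. The number of sub-frames and the synthetic signals are illustrative assumptions.

import numpy as np

def energy_entropy(frame, num_subframes=8, eps=1e-12):
    # Entropy of the normalized sub-frame energies within one audio frame.
    # Low values indicate abrupt energy transitions (e.g. shots, explosions);
    # values approach log2(num_subframes) for nearly constant energy.
    sub_len = len(frame) // num_subframes
    subs = frame[:sub_len * num_subframes].reshape(num_subframes, sub_len)
    energies = (subs ** 2).sum(axis=1)
    sigma = energies / (energies.sum() + eps)          # normalized sub-frame energies
    return float(-(sigma * np.log2(sigma + eps)).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    steady = rng.normal(scale=0.2, size=384)           # 24 ms frame at 16 kHz
    burst = np.concatenate([np.zeros(336), rng.normal(scale=1.0, size=48)])
    print("steady frame entropy:", energy_entropy(steady))
    print("burst frame entropy: ", energy_entropy(burst))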
Saraceno and Leonardi [Saraceno98] considered segmenting and indexing a video into dialog, story, action and generic categories. First, they divided the video sequence into audio and video shots independently; then, they checked the concordance between both types of shots and grouped the ones that follow some predefined patterns. The audio signal is segmented into four basic types (silence, speech, music or noise) based on signal energy and zero-crossing rate features. Video shot breaks are determined based on colour histograms, and for each shot a codebook is designed which contains typical block patterns of the frames within that shot. Successive shots are then compared and labelled sequentially: a shot that is close to a previous shot is given the same label; otherwise, a new label is assigned. The assignment of these labels is determined using a vector quantization method. Scene detection and classification depend on the definition of scene types in terms of their audio and visual patterns.

A dialog scene is supposed to contain speech audio segments and an alternating change of video shots, i.e. the visual labels have to follow a pattern of the type ABABAB. A story scene also presents mostly speech audio segments, but the visual information has a completely different pattern with some repetitions, e.g. ABCADEFGAH…. On the contrary, in an action event the audio signal belongs mostly to one class which is not speech, and the visual information exhibits a progressive pattern of shots with contrasting visual contents of the type ABCDEF…. Finally, consecutive shots which do not correspond to the previous patterns are considered generic scenes. Simulations were carried out on 75 minutes of a movie and 30 minutes of news with a single anchor. The two parameters used for the evaluation were recall and precision, and it was found that more accurate results can be obtained for news (about 90% recall and over 80% precision) than for movies (about 80% recall and 60% precision).

Lienhart et al. [Lienhart99] extended their previous research efforts in audio analysis [Pfeiffer96] and proposed an algorithm to automatically cluster a video sequence into scenes using basic attributes such as dialogs, similar settings and continuing audio characteristics. The system proceeds in four steps. First, the shots are recovered from the video using the shot detection algorithm proposed in [Li96]. Then, audio features, colour features, orientation features and the faces appearing in the shots are extracted. Next, the distances between consecutive shots are calculated using the different features separately, and a distance table is computed for each feature. Finally, based on the calculated shot distance tables, shots are merged for each feature separately but by means of a single algorithm. The use of the audio features is described in more detail in the following. To examine audio similarity, an audio feature vector, which includes the magnitude spectrum of the audio samples, is computed for each shot audio clip. A forecasting feature vector is also calculated at every instant using exponential smoothing of the previous feature vectors. The decision about an audio cut is based on the difference between a new feature vector and the forecasting vector, calculated via the Euclidean distance. Two thresholds are defined: a high one which directly determines a cut, and a lower one that indicates a certain degree of similarity. If too many consecutive vectors are classified as "similar", an audio cut is also declared. The distance between two (video) shots is defined as the minimal distance between two audio feature vectors of the respective video shots. In the last step, the merging algorithm is applied to the distance table; it integrates into audio sequences all shots which are no further apart than a look-ahead number of shots and whose distance is below a certain threshold. A dialog scene is detected using only visual information from a face detector based on the eigenface approach; similarly, the setting of the scene is determined using colour and orientation features.
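The audio cut rule just described can be sketched as follows; the smoothing factor, the two thresholds, the length of the "similar" run and the synthetic magnitude-spectrum features are illustrative assumptions rather than the actual parameters of [Lienhart99], and the handling of the consecutive-run case is only one possible reading of the description above.

import numpy as np

def detect_audio_cuts(features, alpha=0.3, low_thresh=5.0, high_thresh=12.0, max_run=15):
    # features: sequence of audio feature vectors (e.g. magnitude spectra), one per instant.
    cuts = []
    forecast = np.asarray(features[0], dtype=float)
    run = 0
    for i in range(1, len(features)):
        vec = np.asarray(features[i], dtype=float)
        dist = np.linalg.norm(vec - forecast)          # Euclidean distance to the forecast
        if dist > high_thresh:                         # strong deviation: declare a cut
            cuts.append(i)
            run = 0
            forecast = vec                             # restart forecasting in the new clip
            continue
        if dist > low_thresh:                          # moderate ("similar") degree of change
            run += 1
            if run >= max_run:                         # too many in a row: also declare a cut
                cuts.append(i)
                run = 0
        else:
            run = 0
        forecast = alpha * vec + (1.0 - alpha) * forecast   # exponential smoothing update
    return cuts

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    clip_a = rng.normal(0.0, 0.5, size=(50, 32))       # spectra of a first audio clip
    clip_b = rng.normal(4.0, 0.5, size=(50, 32))       # a clearly different clip
    print("audio cuts at:", detect_audio_cuts(np.vstack([clip_a, clip_b])))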
[Aydin00] and [Aydin01] aimed to index and classify dialog scenes using a multimodal approach based on Hidden Markov Models. The content is segmented into dialogue scenes using the state transitions of a hidden Markov model (HMM), and each shot is classified using both audio and video information to determine the state/scene transitions of this model. Face detection and coarse-level audio segmentation are the basic tools for the indexing. While face information is extracted after applying some heuristics to skin-coloured regions, audio analysis is achieved by examining the signal energy, periodicity and ZCR of the audio waveform. According to the authors, a dialogue scene requires three elements at the same time: some people, a conversation (speech audio segments) and a location. Three analyses are therefore performed to determine whether a face is detected (F) or not (N); which kind of audio segment is detected (T for speech, S for silence and M for music); and whether the setting changes (C) or remains unchanged (U). The different combinations of these three analyses (FTU, NMC, etc.) control the different state/scene transitions of the hidden Markov model. Detection rates of about 70% were reported.

One of the areas within content analysis that has attracted growing interest is video program categorization. Fischer et al. [Fischer95] investigated the automatic recognition of film genres based on multimodal analysis. They classified TV programs into newscasts, car races, tennis, commercials or animated cartoons using three levels of abstraction. At the first level, some syntactic properties of the video, including colour histograms, motion, and the waveform and spectrum of the audio, are extracted. Shot boundaries are detected using only visual information (colour and motion), but at the second level audio and video features are both utilized to determine the video style of each shot. Typical style attributes are audio type (low-level classification), camera motion, shot length and transitions, and TV station logos. The audio analysis made use of the audio spectrum and loudness features.

At the final level, the temporal variation pattern of each style attribute (the style profile) is compared to predefined typical profiles of the different TV program categories. Promising results were reported.

More recently, Wang et al. [Wang01] implemented an HMM-based classifier for TV program categorization which is driven by training data instead of manually devised rules. The authors' motivation for using HMMs to model a video program is that the feature values of different programs at any given time can be similar, whereas their temporal behaviour is quite different for each category. The TV program types that can be classified by this approach are commercials, live basketball games, live football games, news reports and weather forecasts. For each program type, 20 minutes of data were collected, with half used for training and the remaining half for testing. The features selected for the classification task were a mixture of audio features (statistical parameters of the volume, zero-crossing rate, Non-Silence Ratio, 4-Hz modulation energy, Non-Pitch Ratio, Frequency Centroid, Bandwidth, ERSB1, ERSB2 and ERSB3), video features and motion features. For each scene class (TV program category), a five-state ergodic HMM is used and the feature vector is quantized into 256 observation symbols. Results were presented using audio, video and motion features separately and jointly. It could be concluded that the audio characteristics were unable to discriminate between basketball and football game events, or between newscast and weather forecast scenes, but globally the audio features gave better results than the video and motion features alone. The authors investigated different ways to combine the features: direct concatenation and the product HMM. In the first combination method, the feature vectors from the different modalities are concatenated into one super-vector; the main drawbacks of this method are that the audio and video features need to be synchronized (the modal features have to be calculated for the same period of time) and that more data are needed for training. In the product HMM, the features from the different modalities are classified independently using separate HMM modules, and the final likelihood for a scene class is the product of the results from all modules. The average accuracy rate of the multimodal approach increased by more than 12%.

8.5 Bibliography

[Alatan01] A.A. Alatan, A.N. Akansu, and W. Wolf, “Multi-modal dialogue scene detection using hidden markov models for content-based multimedia indexing” Multimedia Tools and Applications, 14(2):137-151, 2001

[Aydin00] A. Aydin Alatan, A. N. Akansu and W. Wolf, “Comparative Analysis of hidden Markov models for multi-modal dialogue scene indexing”, Kluwer Acad., Int. Journal on Multimedia Tools and Applications, 2000

[Aydin01] A. Aydin Alatan, “Automatic multi-modal dialogue scene indexing”, IEEE ICIP 2001, Thessaloniki, Greece

[Bouthemy99] P. Bouthemy, M. Gelgon, and F. Ganansia, “A unified approach to shot change detection and camera motion characterization”, IEEE Trans. Circuits and Systems for Video Technology, 9(7):1030–1044, October 1999

[Dimitrova00] N. Dimitrova, L. Agnihotri, and G. Wei, “Video classification based on HMM using text and faces”, In European Signal Processing Conference, Tampere, Finland, 2000

[Eickeler99] S. Eickeler and S. Muller, “Content-based video indexing of TV broadcast news using hidden markov models”, In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 2997-3000, Phoenix, USA, 1999

[Fischer95] S. Fischer, R. Lienhart and W. Effelsberg, “Automatic recognition of film genres”, in Proc. 3rd ACM International Conf. Multimedia, San Francisco, November 1995

[Foote99] J. Foote, “An overview of audio information retrieval,” ACM Multimedia Systems, vol. 7, pp. 2–10, 1999

[Huang99] J. Huang, Z. Liu, Y. Wang, Y. Chen, and E.K. Wong, “Integration of multimodal features for video scene classification based on HMM” In IEEE Workshop on Multimedia Signal Processing, Copenhagen, Denmark, 1999

[Kemp00] T.Kemp, M. Schmidt, M. Westphal, A. Waibel, “Strategies for automatic segmentation of audio data”, ICASSP 2000

[Krinidis01] S.Krinidis, S.Tsekeridou and I.Pitas, "Multimodal Interaction for Scene Boundary Detection", in Proc. of IEEE Int. Conf. on Nonlinear Signal and Image Processing(NSIP01), Baltimore, Maryland, USA, 3-6 June 2001

[Li96] W. Li, S. Gauch, J. Gauch and K.M. Pua, “VISION: A digital video library”, ACM Multimedia Magazine, pp 19-27, 1996

[Lie02] L. Lu, H. Zhang, S. Z. Li, “Content-based audio classification and segmentation by using support vector machines”, Multimedia Systems, 2002

[Lienhart99] R. Lienhart, S. Pfeiffer and W. Effelsberg, “Scene determination based on video and audio features”, Multimedia Tools and Applications, 15(1):59--81, 2001

[Nam98] J. Nam, M. Alghoniemy and A.H. Tewfik, “Audio-visual content-based violent scene characterization”, in Proceedings of ICIP'98, pp. 353--357, 1998

[Naphade01] M.R. Naphade and T.S. Huang, “A probabilistic framework for semantic video indexing, filtering, and retrieval”, IEEE Transactions on Multimedia, 3(1):141-151, 2001

[Pfeiffer96] S. Pfeiffer, S. Fischer, and W. Effelsberg, “Automatic audio content analysis,” in Proc. 4th ACM Int. Conf. Multimedia, pp. 21–30, 1996

[Sankar98] A. Sankar, F. Weng, Z. Rivlin, A. Stolcke, R. Gadde, “The development of SRI’s 1997 broadcast news transcription system”, DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne, VA, Feb 8-11, 1998

[Saraceno98] C. Saraceno, R. Leonardi, “Identification of story units in audio-visual sequences by joint audio and video processing”, in Proc. Int. Conf. Image Processing (ICIP-98), vol. 1, Chicago, IL, Oct. 4-7, 1998, pp. 363-367

[Satoh99] S. Satoh, Y. Nakamura, and T. Kanade. Name-it: Naming and detecting faces in news videos. IEEE Multimedia, 6(1):22-35, 1999

[Saunders96] J. Saunders, “Real-time discrimination of broadcast speech/music,” in Proc. ICASSP’96, vol. II, pp. 993–996, Atlanta, GA, May 1996

[Schneiderman00] H. Schneiderman and T. Kanade, “A statistical method for 3D object detection applied to faces and cars”, In IEEE Computer Vision and Pattern Recognition, Hilton Head, USA, 2000.

[Smeulders00] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content based image retrieval at the end of the early years”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349-1380, 2000

[Snoek01] C. G. M. Snoek and M. Worring, “Multimodal Video Indexing: A Review of the State-of-the-art” ISIS Technical Report Series, Vol 2001-20, Intelligent Sensory Information Systems Group, University of Amsterdam, December 2001

[Szummer98] M. Szummer and R.W. Picard. “Indoor-outdoor image classification”, In IEEE International Workshop on Content-based Access of Image and Video Databases, in conjunction with ICCV'98, Bombay, India, 1998

[Tsekeridou01] S.Tsekeridou and I.Pitas, "Content-based Video Parsing and Indexing based on Audio-Visual Interaction", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 4, 2001

[Tsekeridou99] S.Tsekeridou and I.Pitas, "Audio-Visual Content Analysis for Content-based Video Indexing", in Proc. of IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS'99), vol. 1, pp. 667-672, Florence, Italy, 7-11 June 1999

[Vailaya98] A. Vailaya, A.K. Jain, and H.-J. Zhang “On image classification: City images vs. landscapes”, Pattern Recognition, 31:1921-1936, 1998

[Visnet.D32] VISNET deliverable D32, “Face analysis system overview”. 2004

[Wang01] Y. Wang, Z. Liu and J.C Huang, “Multimedia content analysis using both audio and visual clues” IEEE Signal Processing Magazine, 2001

[Wood98] P.C.Woodland, T. Hain, S. Johnson, T. Niesler, A. Tuerk, S.Young, “Experiments in broadcast news transcription”, Proc. ICASSP 1998, pp. 909 ff, Seattle, Washington, 1998

[Yang02] M. H. Yang, D. Kriegman, and N. Ahuja, “Detecting Faces in Images: A Survey”, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 24, no. 1, pp. 34-58, 2002

[Zabih93] R. Zabih, J. Miller, and K. Mai, “Feature-based algorithms for detecting and classifying scene breaks”, in Proc. ACM Multimedia Conf., pages 189–200, San Francisco, USA, November 1993

[Zhang97] H. J. Zhang, J. Wu, D. Zhong, and S. W. Smoliar “An integrated system for content-based video retrieval and browsing”, Pattern Recognition, 30(4):643–658, 1997

9. FUTURE RESEARCH AND RESEARCH ACTIVITIES IN VISNET

9.1 Research Activities

The first step of each classification process consists of extracting a set of measures, or features, from the signal. The features are selected to describe a wide range of signal properties and they can be of a completely different nature (audio, video, text, etc.). Efficient video indexing has to take into account both the video and the audio streams, which implies the computation of several video and audio features. Conventional approaches consider a large number of features before presenting the measures to a general classifier. Obviously, the computational complexity of these approaches is very high and arrays of computers have to be used in order to obtain practical results. In this WP, UPC will analyze alternative strategies which use feasible measures, taking into account their relevance and computational burden at the classification stage. It will be meaningful to investigate and develop new approaches in which the features are selected according to their suitability to distinguish the different classes at the current level of the video indexing analysis. The feature selection will ideally be performed automatically, ensuring that the selected features are the best choice with respect to two parameters: computational cost and classification accuracy. The main block diagram of the potential system is illustrated in the next figure:

[Figure: block diagram of the proposed system. The audio and video streams feed a feature extraction stage followed by a relevance analysis stage, which selects feature sets 1..N for classifiers 1..N; the classifier outputs are merged by a combiner that produces the final decision, driven by the criterion MIN(computational cost, classification error).]

This relevance analysis will be studied in three different stages or phases:

1. In the first stage, the system will be analyzed based only on audio features, with the main objective of classifying TV programs using only audio segments. This task is closely related to the main research activities of UPC in work package 4.1.

2. In a second stage, the relevance of only video features will be analyzed for the TV program classification, so that a first comparison between both modalities (audio and video) can be reported.

3. In the last stage, both modalities will be fused and combined using different combination techniques. The main idea behind this stage is to combine weak classifiers to build up a more accurate multi-classifier system. Furthermore, the relevance analysis will be considered from a multimodal point of view, with the main objective of detecting correlations between audio and video features in the final classification.

Ideally, it is intended to discover possible linear combinations of audio features which can replace more sophisticated and computationally expensive video features.
