
Multimedia Information Retrieval

Norbert Fuhr

Tutorial @ HS-IR ’98

Chapter 1

Introduction

• document structures and attributes
• media types
• terminology


Document structures and attributes

[Figure: an example document ("IR in networks" by J. Doe) with its structures and attributes. Logical structure: document, head, title, author, chapter, section. Layout structure. Content structure: IR, networks, heterogeneity, effectiveness, user friendliness. External attributes: author = 'J. Doe', crdate = 25-05-96, ladate = 30-09-96.]

Universität Dortmund, Informatik VI, N. Fuhr

Media types

[Figure: the media types audio, image and video, drawn against the axes t, x and y; ...]

text is a linear medium


Terminology

monomedia object/document: object containing data of a single medium

multimedia object/document: object containing data of multiple media

hypertext document: nonlinear text document (i.e. with links)

hypermedia document: nonlinear multimedia document


Course structure:

1. introduction
2. views on media
3. multimedia indexing


Chapter 2

Views on media

• views on media objects
• FERMI multimedia data model


2.1 Views on media objects

here: images

physical view: pixel matrix

logical view:

perceptive view
• colour
• texture
• brightness

symbolic view

spatial view: spatial relations (depending on modelling space)

structural view
• set of image objects
• structural relations between image objects (aggregation)


2.2 The FERMI Multimedia Document Model

2.2.1 Document structure and IR

impact of structure:

• of multimedia information: heterogeneity of multimedia data

• on semantic content: logical structure ≙ discourse structure

• on corpus
  – classic IR: document = atomic unit
  – MMIR: retrieval of document components


2.2.2 Elements of the multimedia data model

• logical structure
  – hierarchy of structural objects
  – leaves = single-media data
  – implements explicit organization of discourse
  – other data model elements refer to logical structure

• attributes
  – classical attributes (author, dates, ...)
  – index expressions

• navigational structure
  – links


2.2.2.1 The logical structure

logical structure ≙ hierarchical aggregation of structural objects:

$LS = (O_S, \leq_{str}, \leq_{seq}, TYPE_{ST}, \leq_{tst}, type_{st}, TYPE_M, type_m)$

$O_S$: finite set of document structural objects; elements: $os_i$

$\leq_{str}$: aggregative relation between structural objects, defines hierarchical composition

$\leq_{seq}$: defines a linear sequence on $O_S$ (corresponds to standard, linear order to access components)

$TYPE_{ST}$: set of types of structural objects
• e.g. for books: $TYPE_{ST} = \{Document, Chapter, Section, Sub\text{-}Section, Paragraph, Figure\}$
• types correspond to abstraction levels

$\leq_{tst}$: relation on structural object types defining hierarchy of abstraction levels

$type_{st}$: total function assigning each structural object its structural type in $TYPE_{ST}$

$TYPE_M$: set of media types, $TYPE_M = \{text, image, graphic, multimedia\}$

$type_m$: total function assigning to each structural object its media type in $TYPE_M$
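The logical structure defined above can be pictured as a typed tree. The following is a minimal Python sketch under stated assumptions: the class `StructuralObject` and the helper `linear_order` are hypothetical names, the aggregative relation is encoded as parent/child nesting, and the sequence relation is read as a pre-order traversal (one plausible reading, not stated on the slide).

```python
from dataclasses import dataclass, field
from typing import List

# Type sets taken from the slide's book example.
TYPE_ST = ["Document", "Chapter", "Section", "Sub-Section", "Paragraph", "Figure"]
TYPE_M = {"text", "image", "graphic", "multimedia"}

@dataclass
class StructuralObject:
    name: str
    type_st: str                      # structural type, must be in TYPE_ST
    type_m: str                       # media type, must be in TYPE_M
    children: List["StructuralObject"] = field(default_factory=list)

    def __post_init__(self):
        # type_st and type_m are total functions: every object needs both.
        if self.type_st not in TYPE_ST or self.type_m not in TYPE_M:
            raise ValueError("unknown structural or media type")

def linear_order(root):
    """Pre-order traversal: an assumed reading of the sequence relation."""
    order = [root.name]
    for child in root.children:
        order.extend(linear_order(child))
    return order

doc = StructuralObject("os1", "Document", "multimedia", [
    StructuralObject("os2", "Chapter", "text", [
        StructuralObject("os3", "Paragraph", "text"),
        StructuralObject("os4", "Figure", "image"),
    ]),
])
print(linear_order(doc))
```

In this sketch the leaves (`os3`, `os4`) carry single-media data, while inner nodes may be typed `multimedia`.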


2.2.2.2 Attributes

$A = (O_S, NAME_A, VALUE_A, name_a, domain_a, value_a, SM)$

where:

$O_S$: the set of structural objects in the document; elements: $os_i$

$NAME_A$: set of attribute names

$VALUE_A$: set of all possible attribute values (union of all the domain languages of all attributes)

$name_a$: partial function associating to structural objects a non-empty set of attribute names

$domain_a$: total function defining the domain of any attribute name (i.e. all the expressions of its associated language)

$value_a$: partial function assigning to structural objects the value for a related attribute name (definition allows multi-valued attributes)


Content Attributes

single-media models involve up to five types of views:

• the physical view
• the structural view
• the symbolic view
• the spatial view (only in image and graphic models)
• the perceptive view (only in image and graphic models)

→ standard attribute names (called Content Attributes) for views:

• physical
• structural
• symbolic
• spatial
• perceptive


2.2.2.3 Indexing model

indexing: assign index expressions to document structural objects

retrieval of multimedia documents: retrieve smallest units that fulfill the query

→ index expressions assigned to a parent object have to imply the index expressions of its component objects

index objects: structural objects that are indexed (assigned a value of the attribute symbolic)

index model of a document base:

$I = (O_I, TYPE_I, \leq_{ind})$

$O_I$: set of index objects $oi_i$, $O_I \subseteq O_S$

$TYPE_I$: set of index object types, $TYPE_I \subseteq TYPE_{ST}$

$\leq_{ind}$: relation representing structural dependency between index objects, $\leq_{ind} \subseteq O_I \times O_I$
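The rule "retrieve smallest units that fulfill the query" can be illustrated with a short Python sketch. It assumes, for illustration only, that index expressions are plain sets of terms and that a parent's set contains the terms of all its components (one simple way to satisfy the implication requirement). The function name `smallest_units` and the dictionary layout are hypothetical.

```python
def smallest_units(node, query):
    """Return names of the smallest index objects whose terms cover the query.

    node: dict with keys 'name', 'terms' (set of str), 'children' (list of nodes).
    Assumes a parent's terms include the terms of all its components.
    """
    if not query <= node["terms"]:
        return []                      # this subtree cannot fulfill the query
    hits = []
    for child in node["children"]:
        hits.extend(smallest_units(child, query))
    # If no component fulfills the query on its own, this node is the smallest unit.
    return hits or [node["name"]]

sub1 = {"name": "subsection1", "terms": {"audio", "pitch"}, "children": []}
sub2 = {"name": "subsection2", "terms": {"image", "colour"}, "children": []}
chapter = {"name": "chapter1",
           "terms": {"audio", "pitch", "image", "colour"},
           "children": [sub1, sub2]}

print(smallest_units(chapter, {"pitch"}))           # a single subsection suffices
print(smallest_units(chapter, {"pitch", "image"}))  # only the chapter covers both
```

Because parent expressions imply those of the components, a subtree can be pruned as soon as its root fails the query.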


2.2.2.4 Example of an indexing structure

example of structure and index hierarchy types
(index objects of type Chapter or Subsection only)

[Figure: structure types (Document, Chapter, Section, Subsection, Paragraph) shown next to the symbolic types.]

[Figure: parts of the structural and semantic views of a document. Structural view: structural objects Os1 to Os19 on the levels Document, Chapter, Section, Subsection and Paragraph, with labels U1 to U7. Semantic view: objects Osem2, Osem3 and Osem8 to Osem12.]


Chapter 3

Multimedia Indexing

• audio
• images
• video


3.1 Audio

3.1.1 Sound retrieval

E. Wold et al.: Content-based classification, search and retrieval of audio. IEEE Multimedia 3(3), pp. 27-36.

Levels of audio retrieval

1. exact match of sound samples
2. inexact match of sounds, irrespective of sample rate, quantization, compression, ...
3. inexact match of acoustic features / perceptual properties of sound
4. content-based match (for speech, musical content)

here: inexact match of acoustic features and perceptual properties


Acoustic features

aspects of sound considered:

loudness: root-mean-square of audio signal (in decibels)

pitch: greatest common divisor of peaks in Fourier spectra

brightness: centroid of short-time Fourier magnitude spectra (higher frequency content of signal)

bandwidth: magnitude-weighted average of differences between spectral components and the centroid (variation of frequencies, e.g. sine wave vs. white noise)

harmonicity: deviation of the sound’s spectrum from a harmonic spectrum (i.e. harmonic spectra vs. inharmonic spectra vs. noise)
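Three of these aspects (loudness, brightness, bandwidth) can be sketched for a single frame in plain Python. This is an illustration under assumptions, not the procedure of Wold et al.: the function names are hypothetical, a naive DFT stands in for a proper FFT, frequencies are kept on a linear Hz scale, and the "differences" in the bandwidth definition are taken as absolute differences.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (fine for short frames)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def loudness_db(frame):
    """Root-mean-square level of the frame in decibels."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20 * math.log10(max(rms, 1e-12))   # floor avoids log(0) on silence

def brightness_and_bandwidth(frame, sample_rate):
    """Spectral centroid and magnitude-weighted mean distance to it, in Hz."""
    mags = magnitude_spectrum(frame)
    freqs = [k * sample_rate / len(frame) for k in range(len(mags))]
    total = sum(mags) or 1e-12
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    bandwidth = sum(abs(f - centroid) * m for f, m in zip(freqs, mags)) / total
    return centroid, bandwidth

# A pure 1000 Hz sine: centroid at 1000 Hz, bandwidth close to zero.
sr = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(64)]
print(loudness_db(frame), brightness_and_bandwidth(frame, sr))
```

The sine-wave input matches the slide's own contrast: a sine wave has almost no bandwidth, whereas white noise would spread magnitude across all bins.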

variation of aspects over time:

1. compute aspect values at certain time intervals
2. derive features from sequences:
   • average value
   • variance
   • autocorrelation
   (feature values weighted by amplitude)
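The two steps above can be sketched as follows. The function name `trajectory_stats` is hypothetical, and two details are assumptions because the slide does not give them: the autocorrelation is taken at a small fixed lag and normalized by the variance, and the amplitude weighting is applied to the mean and variance only.

```python
def trajectory_stats(values, weights=None, lag=1):
    """Weighted mean, weighted variance and normalized lag autocorrelation."""
    if weights is None:
        weights = [1.0] * len(values)
    w_sum = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / w_sum
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / w_sum
    if var == 0:
        return mean, var, 0.0          # autocorrelation undefined for a constant
    # Unweighted lag autocorrelation, normalized by the variance (an assumption).
    cov = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(len(values) - lag)) / (len(values) - lag)
    return mean, var, cov / var

# Hypothetical loudness trajectory sampled at four time steps.
print(trajectory_stats([1.0, 2.0, 3.0, 4.0]))
```

Applying this to each aspect yields three numbers per aspect, which is the layout of the sound example table in this section.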


sound example


Property      Mean        Variance     Autocorrelation
Loudness      -54.4112    221.451      0.938929
Pitch         4.21221     0.151228     0.524042
Brightness    5.78007     0.0817046    0.690073
Bandwidth     0.272099    0.0169697    0.519198


Indexing and retrieval

Indexing of a sound: compute and store feature vector $a$ (mean, variance and autocorrelation for loudness, pitch, brightness, bandwidth and harmonicity)

Retrieval:

1. conditions w.r.t. feature values
2. similarity of sounds: weighted Euclidean distance

mean: $\mu = \frac{1}{M} \sum_{j=1}^{M} a_j$

covariance: $R = \frac{1}{M} \sum_{j=1}^{M} (a_j - \mu)(a_j - \mu)^T$

distance: $D = \sqrt{(a - b)^T R^{-1} (a - b)}$

$M$: number of sounds considered
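The mean, covariance and distance formulas can be checked with a small, dependency-free Python sketch. All function names are hypothetical and the four two-dimensional "feature vectors" are made-up toy data. Instead of inverting $R$ explicitly, the sketch solves $R y = a - b$ by Gaussian elimination, which gives the same value of $D$.

```python
import math

def mean_and_covariance(vectors):
    """mu and R as defined on the slide (division by M, not M - 1)."""
    m, n = len(vectors), len(vectors[0])
    mu = [sum(v[i] for v in vectors) / m for i in range(n)]
    cov = [[sum((v[i] - mu[i]) * (v[j] - mu[j]) for v in vectors) / m
            for j in range(n)] for i in range(n)]
    return mu, cov

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def distance(a, b, cov):
    """D = sqrt((a - b)^T R^{-1} (a - b))."""
    diff = [p - q for p, q in zip(a, b)]
    y = solve(cov, diff)               # y = R^{-1} (a - b)
    return math.sqrt(sum(d * v for d, v in zip(diff, y)))

sounds = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]]   # toy feature vectors
mu, R = mean_and_covariance(sounds)
print(mu, R, distance(sounds[0], sounds[3], R))
```

With a diagonal $R$, as in the toy data, the distance reduces to a Euclidean distance in which each feature is weighted by its inverse variance.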


Property-based training and classification

training: based on set of training sounds for a property (e.g. scratchiness)

compute property-specific mean and covariance

importance of feature: mean divided by standard deviation

classification:

compute distances to means of all classes, select class with minimum distance

likelihood: $L = \exp\left(-\frac{D^2}{2}\right)$
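Training and classification can be sketched in a few lines of Python. To stay short, the sketch assumes a diagonal covariance (per-feature variances only), which is a simplification of the full covariance on the slide. The names `train`, `likelihood` and `classify` and the toy training vectors are hypothetical.

```python
import math

def train(examples):
    """Per-feature mean and variance of the training vectors for one class."""
    m, n = len(examples), len(examples[0])
    mean = [sum(e[i] for e in examples) / m for i in range(n)]
    var = [sum((e[i] - mean[i]) ** 2 for e in examples) / m for i in range(n)]
    return mean, var

def likelihood(x, model):
    """L = exp(-D^2 / 2), with D from a diagonal covariance (assumption).

    Requires non-zero variances in the class model.
    """
    mean, var = model
    d2 = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return math.exp(-d2 / 2)

def classify(x, models):
    """Pick the class with the largest likelihood, i.e. the smallest distance."""
    return max(models, key=lambda name: likelihood(x, models[name]))

models = {
    "laughter": train([[1.0, 5.0], [3.0, 7.0]]),   # toy data, not from the paper
    "speech": train([[10.0, 1.0], [12.0, 3.0]]),
}
print(classify([2.5, 6.0], models))
```

Since $L$ decreases monotonically with $D$, choosing the maximum likelihood is the same decision as choosing the minimum distance.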


Example: classification of laughter sounds


Example: class model for laughter

Feature              Mean          Variance       Importance
Duration             2.71982       0.191312       6.21826
Loudness: Mean       -45.0014      18.9212        10.3455
– Variance           200.109       1334.99        5.47681
– Autocorrelation    0.955071      7.71106e-05    108.762
Brightness: Mean     6.16071       0.0204748      43.0547
– Variance           0.0288125     0.000113187    2.70821
– Autocorrelation    0.715438      0.0108014      6.88386
Bandwidth: Mean      0.363269      0.000434929    17.4188
– Variance           0.00759914    3.57604e-05    1.27076
– Autocorrelation    0.664325      0.0122108      6.01186
Pitch: Mean          4.48992       0.39131        7.17758
– Variance           0.207667      0.0443153      0.986485
– Autocorrelation    0.562178      0.00857394     6.07133

importance = $|mean| / \sqrt{variance}$
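The importance column can be reproduced from the mean and variance columns. A short Python check, using two rows of the laughter class model (the function name `importance` is chosen here for illustration):

```python
import math

def importance(mean, variance):
    """importance = |mean| / sqrt(variance), as defined on the slide."""
    return abs(mean) / math.sqrt(variance)

# Rows taken from the laughter class model.
print(round(importance(2.71982, 0.191312), 3))   # Duration
print(round(importance(-45.0014, 18.9212), 3))   # Loudness: Mean
```

A feature with a large mean relative to its spread across the training sounds is characteristic of the class, which is why the loudness autocorrelation stands out in the table.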


3.1.2 Speech retrieval

1. speech recognition → uncertain term identification
2. application of text retrieval methods on recognized terms

→ TREC speech retrieval track

3.1.3 Music retrieval

McNab et al.: The New Zealand Digital Library MELody inDEX. D-Lib Magazine, May 1997.

1. melody transcription
2. approximate string matching
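Approximate string matching on transcribed melodies can be sketched with a standard edit distance. The encoding below (Up/Down/Repeat steps between successive notes) and the names `contour` and `rank` are assumptions made for illustration; the slide does not say how the melodies are encoded or which matching algorithm is used.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def contour(pitches):
    """Encode a note sequence as Up/Down/Repeat steps (hypothetical encoding)."""
    return "".join("U" if q > p else "D" if q < p else "R"
                   for p, q in zip(pitches, pitches[1:]))

def rank(query, melodies):
    """Order melody names by edit distance between contours, best match first."""
    q = contour(query)
    return sorted(melodies, key=lambda name: edit_distance(q, contour(melodies[name])))

melodies = {"tune_a": [60, 62, 64, 62, 60], "tune_b": [60, 60, 60, 60, 60]}
print(rank([48, 50, 52, 50, 49], melodies))   # query sung in a different key
```

A contour encoding makes the match independent of the key in which the query is sung, at the cost of discarding interval sizes.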


3.2 Images

3.2.1 Introduction

3.2.1.1 Semantic vs. syntactic indexing and retrieval

syntactic image features:

• color
• texture
• contour

semantic image features:

• objects (humans, animals, buildings, art works)

• topics (pollution, demonstration, political visit)

most image indexing methods support syntactic features only


3.2.1.2 Aboutness vs. ofness

ofness: objects shown in the image

aboutness: topic which is illustrated by the image

aboutness is very much user-dependent
e.g. image showing water pollution


3.2.2 QBIC

tool for querying image and video databases

• example images
• user-constructed sketches and drawings
• selected color and texture patterns
• camera and object motion


3.2.2.1 System overview

main components:

• database population:
  1. processing of images and videos to extract syntactical features:
     – colors
     – textures
     – shape
     – camera motion
     – object motion
  2. storing features in database

• database querying:
  1. user composes query graphically
  2. generate features from graphical query
  3. search for database objects with similar features


Data model

basic elements:

• still images/scenes: contain objects

• video shots
  – sets of contiguous frames
  – contain motion objects

still images:

• scene: image or video frame

• object: part of a scene

videos:

1. break into clips (shots)
2. generate representative frame for each shot, treated as still image
3. generate motion objects from shots


querying:

• on objects: images with a red, round object

• on scenes: images with 30 % red and 20 % blue

• on shots: shots panning from left to right

• on combinations: images with 30 % red containing a blue object


3.2.2.2 Feature Calculation

color

color models: RGB, HSV, YUV, MTM

• average coordinates in color space
• k-element histogram (typically k = 64, 256)
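Both color features can be sketched in Python for a list of RGB pixels. The uniform quantization into `levels` steps per channel (giving k = 64 bins for `levels=4`) is an assumption; the slide does not say how QBIC chooses its histogram bins. The function names are hypothetical.

```python
def color_histogram(pixels, levels=4):
    """Normalized histogram with levels**3 bins over RGB (64 bins for levels=4)."""
    hist = [0.0] * levels ** 3
    step = 256 // levels

    def quantize(value):
        return min(value // step, levels - 1)   # clamp when 256 % levels != 0

    for r, g, b in pixels:
        index = quantize(r) * levels * levels + quantize(g) * levels + quantize(b)
        hist[index] += 1
    return [count / len(pixels) for count in hist]

def average_color(pixels):
    """Average coordinates in RGB color space."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

pixels = [(255, 0, 0)] * 3 + [(0, 0, 255)]      # three red pixels, one blue
hist = color_histogram(pixels)
print(average_color(pixels), hist[48], hist[3])
```

The example shows why both features are kept: the average color of this image is a dark magenta that appears nowhere in it, while the histogram retains the 75 % red / 25 % blue split.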


texture

coarseness: scale of texture

contrast: vividness of a pattern (function of variance of grey-level histogram)

directionality: “peakedness” of distribution of gradient directions in image (favoured direction (e.g. grass) vs. isotropic (e.g. sand))


shape

area: # pixels set in binary image

circularity: perimeter² / area

major axis orientation:
  1. compute 2nd order covariance matrix from boundary pixels
  2. major axis orientation = direction of the eigenvector with the largest eigenvalue

eccentricity: (largest eigenvalue) / (smallest eigenvalue)

algebraic moment invariants:
  • consider 18 features invariant to affine transformations
  • compute first m central moments as eigenvalues of predefined matrices
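The orientation, eccentricity and circularity features can be sketched in Python. The sketch uses the closed-form eigenvalues of a symmetric 2×2 matrix; the function names are hypothetical, and the input is assumed to be a list of (x, y) boundary coordinates.

```python
import math

def axis_features(points):
    """Major-axis orientation (radians) and eccentricity from (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = (cxx + cyy) / 2
    root = math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
    largest, smallest = half_trace + root, half_trace - root
    # Direction of the eigenvector belonging to the largest eigenvalue.
    orientation = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    eccentricity = largest / smallest if smallest > 0 else math.inf
    return orientation, eccentricity

def circularity(perimeter, area):
    """perimeter^2 / area; minimal (4 * pi) for a circle."""
    return perimeter ** 2 / area

corners = [(0, 0), (4, 0), (4, 2), (0, 2)]       # corners of a 4 x 2 rectangle
print(axis_features(corners), circularity(2 * math.pi, math.pi))
```

For the rectangle corners the major axis lies along x (orientation 0) and the eccentricity is 4, the ratio of the two coordinate variances.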


sketch

based on reduced resolution edge map:

1. convert color image to single band luminance
2. compute binary edge image
3. reduce edge image to 64 × 64
4. thin reduced image


3.2.2.3 Sample queries

• average color queries
  search for images/objects with similar color
  computed as weighted Euclidean distance in color space

• histogram color queries
  search for images with specified color distribution
  based on 256-element histogram:

  $Q$: query histogram
  $D$: image histogram
  $Z$: element difference histogram, $Z = Q - D$
  $A$: symmetric color similarity matrix, $a(i, j) = 1 - d(c_i, c_j) / d_{max}$
  $c_k$: $k$th color in histogram
  $d(c_i, c_j)$: MTM color distance
  $d_{max}$: maximum distance between any two colors
  similarity: $\|R\| = Z^T A Z$

• texture queries
  user selects texture from a sampler
  compute weighted Euclidean distance in 3D texture space (coarseness, contrast, directionality)

• object shape
  – user draws shape
  – shape features: area, circularity, eccentricity, major-axis direction, object moments, tangent angles around object perimeter
  – compute weighted Euclidean distance, weights are inverse variances of features

• query by sketch
  user draws dominant lines and edges
  1. reduce user sketch to 64 × 64
  2. for each db image, correlate sketch with user sketch, based on edge/no edge comparison
  3. compute correlation scores
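The histogram color query from the list above can be sketched in Python. The quadratic form $Z^T A Z$ lets similar colors partially cancel, so a histogram shifted to a neighbouring color scores closer than one shifted to a very different color. The MTM color distance is not defined on the slide, so a Euclidean distance in RGB is used as a stand-in; that substitution and all function names are assumptions.

```python
def similarity_matrix(colors, color_distance):
    """a(i, j) = 1 - d(c_i, c_j) / d_max for all pairs of histogram colors."""
    n = len(colors)
    d = [[color_distance(colors[i], colors[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in d) or 1.0
    return [[1 - d[i][j] / d_max for j in range(n)] for i in range(n)]

def histogram_distance(q, d, a):
    """Quadratic form Z^T A Z with Z = Q - D."""
    z = [qi - di for qi, di in zip(q, d)]
    return sum(z[i] * a[i][j] * z[j] for i in range(len(z)) for j in range(len(z)))

def euclidean(c1, c2):
    """Stand-in for the MTM color distance (assumption)."""
    return sum((x - y) ** 2 for x, y in zip(c1, c2)) ** 0.5

colors = [(255, 0, 0), (250, 0, 0), (0, 0, 255)]   # red, near-red, blue
A = similarity_matrix(colors, euclidean)
query = [1.0, 0.0, 0.0]                            # all red
near = histogram_distance(query, [0.0, 1.0, 0.0], A)
far = histogram_distance(query, [0.0, 0.0, 1.0], A)
print(near, far)
```

Smaller values of the quadratic form indicate more similar histograms; a bin-by-bin comparison would rate the near-red and the blue image as equally far from the query.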


3.2.3 IRIS

semantic indexing of images

1. image analysis
   • color
   • contour
   • texture

2. object recognition
   (a) basic objects: clouds, snow, water, sky, forest, grass, sand, stone
   (b) high-level objects: forestscene, skyscene, mountainscene, landscapescene, ...


3.2.3.1 Image Analysis

Color

IRIS subdivides color space into about 20 different colors

1. subdivide image into nonoverlapping tiles
2. compute color histogram for each tile
3. most frequent color =: color of tile
4. join tiles with similar colors and compute circumscribing rectangle
5. compute attributes of color rectangles:
   • position
   • size
   • color
   • color density (# tiles with color / # tiles in rectangle)
   • color evidence


original image


color-based segmentation:

...

colour2 HOR=mid,VER=up,SIZ=XL,SHP=Rect,COL=BLUE,

UL=0—1,LR=44—11,DEN=415—495

colour3 HOR=mid,VER=mid,SIZ=M,SHP=Rect,COL=BLUE,

UL=15—10,LR=44—17,DEN=136—240

colour4 HOR=left,VER=mid,SIZ=XS,SHP=Quad,COL=BLUE,

UL=1—11,LR=1—11,DEN=1—1

colour5 HOR=left,VER=mid,SIZ=XS,SHP=Rect,COL=BLUE,

UL=3—11,LR=14—12,DEN=13—24

...


Texture

consider local distribution and variation of grey values

1. compute normalized co-occurrence matrix p for 4 directions: 0°, 90°, 45°, 135°

2. for each of the four directions, compute the following features from C:
   • angular second moment
   • contrast (local variations)
   • correlation (linear relationship between pixel values)
   • variance (deviation from the average)
   • entropy

3. for each of the five parameters, compute the average from the values for the 4 directions (→ invariance against rotation)

4. feed average values into neural network

[Figure: neural network with an input layer (asm, contrast, correlation, variance, entropy), two hidden layers, and an output layer (forest, grass, sand, water, stone, sky, clouds, ice).]

5. NN yields texture for each tile
6. join tiles with identical textures and compute circumscribing rectangles
7. compute attributes of texture rectangles:
   • position
   • size
   • texture
   • texture density (# tiles with texture / # tiles in rectangle)
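Steps 1 to 3 of the texture analysis can be sketched in Python for three of the five features (angular second moment, contrast, entropy). The sketch assumes a pixel distance of 1, a symmetric co-occurrence count and natural-logarithm entropy; none of these details is given on the slide, and the function names are hypothetical. The neural network of step 4 is not covered.

```python
import math

def cooccurrence(image, dx, dy, levels):
    """Normalized, symmetric grey-level co-occurrence matrix for offset (dx, dy)."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            nx, ny = x + dx, y + dy
            if 0 <= nx < cols and 0 <= ny < rows:
                counts[image[y][x]][image[ny][nx]] += 1
                counts[image[ny][nx]][image[y][x]] += 1
    total = sum(map(sum, counts)) or 1.0
    return [[c / total for c in row] for row in counts]

def features(p):
    """Angular second moment, contrast and entropy of a normalized matrix p."""
    n = len(p)
    asm = sum(v * v for row in p for v in row)
    contrast = sum((i - j) ** 2 * p[i][j] for i in range(n) for j in range(n))
    entropy = -sum(v * math.log(v) for row in p for v in row if v > 0)
    return asm, contrast, entropy

# Offsets for 0, 45, 90 and 135 degrees at distance 1 (image rows grow downwards).
OFFSETS = [(1, 0), (1, -1), (0, -1), (-1, -1)]

def rotation_averaged(image, levels):
    """Average each feature over the four directions."""
    per_direction = [features(cooccurrence(image, dx, dy, levels)) for dx, dy in OFFSETS]
    return tuple(sum(f[k] for f in per_direction) / len(OFFSETS) for k in range(3))

flat = [[0, 0], [0, 0]]          # uniform tile
checker = [[0, 1], [1, 0]]       # 2 x 2 checkerboard tile
print(rotation_averaged(flat, 2), rotation_averaged(checker, 2))
```

The uniform tile has no contrast in any direction, while the checkerboard has contrast along rows and columns but none along its diagonals, so its averaged contrast is 0.5.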


...

texture3 HOR=mid,VER=mid,SIZ=L,SHP=Rect,TEX=ice,

UL=2—2,LR=10—3,DEN=11—18

texture4 HOR=left,VER=mid,SIZ=S,SHP=Path,TEX=clouds,

UL=0—3,LR=3—3,DEN=4—4

texture5 HOR=left,VER=mid,SIZ=S,SHP=Quad,TEX=stone,

UL=4—3,LR=5—4,DEN=3—4

texture6 HOR=mid,VER=mid,SIZ=S,SHP=Rect,TEX=clouds,

UL=5—3,LR=8—4,DEN=5—8

...


Contour

based on grey level image

1. gradient-based edge detection
2. determination of object contours
3. shape analysis: compute
   • position of centroid
   • size of region
   • bound coordinates of region


3.2.3.2 Object Recognition

1. step from syntactical to semantical features: identification of primitive objects

2. derivation of higher-level semantical features

identification of primitive objects

• basis: color, texture and contour features

• for each feature, consider corresponding region

• form graph describing topological relationships between feature regions:
  – node = feature
  – edge = topological relationship: overlaps, meets, contains

[Figure: example graph whose nodes are labelled CL, CT and T and whose edges are labelled meets, contains and overlaps.]

• formulate graph grammar rules for detecting primitive objects

[Figure: graph grammar rule for Clouds, combining a Texture Segment, a Contour Segment and a Color Segment.]

Conditions of "Clouds":

predicate(
    (valcompeq(*self(2,"colorseg","COL"),"blue") ||
     valcompeq(*self(2,"colorseg","COL"),"white")) &&
    valcompeq(*self(2,"colorseg","VER"),"up"));

predicate(
    nrkind(*self(1,"contourseg"),"contains",*self(1,"colorseg")) &&
    nrkind(*self(1,"contourseg"),"contains",*self(1,"textureseg")));

[Figure: higher-level object Mountainlake with the components Sky, Lake, Mountain and Forest.]


3.2.4 Photobook

developed at MIT Media Lab

goal: semantic retrieval of images, based on semantics-preserving image compression

types of descriptions:

• appearance (faces)
• shape
• texture


3.2.4.1 Appearance

based on eigenimage representations

Training: Building Eigenrepresentations

1. preprocessing of input images: normalize w.r.t. position, scale, orientation

2. computation of eigenvectors of normalized image covariance for
   • training images (faces)
   • subregions of training images (eyes, nose, mouth)

[Figure: mean and first few eigenvectors]

Retrieval

Γ: new image (region)

1. transform Γ into face space
2. retrieval based on similarity measure
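The two retrieval steps can be sketched in Python, assuming the mean image and a set of orthonormal eigenvectors are already available from training. The projection onto the eigenvectors and the squared Euclidean distance between coefficient vectors are common choices but are assumptions here, since the slide only says "similarity measure". The four-pixel "images" and all names are made up for illustration.

```python
def project(image, mean, eigenvectors):
    """Coefficients of (image - mean) along each eigenvector (assumed orthonormal)."""
    centered = [p - m for p, m in zip(image, mean)]
    return [sum(c * e for c, e in zip(centered, vec)) for vec in eigenvectors]

def nearest(query, database, mean, eigenvectors):
    """Name of the stored image closest to the query in the eigenspace."""
    q = project(query, mean, eigenvectors)

    def dist(name):
        w = project(database[name], mean, eigenvectors)
        return sum((a - b) ** 2 for a, b in zip(q, w))

    return min(database, key=dist)

# Toy data: 4-pixel "images", a mean image and two orthonormal eigenvectors.
mean = [1.0, 1.0, 1.0, 1.0]
eigenvectors = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
database = {"face_a": [3.0, 3.0, 1.0, 1.0], "face_b": [1.0, 1.0, 3.0, 3.0]}
print(project(database["face_a"], mean, eigenvectors))
print(nearest([4.0, 3.0, 1.0, 1.0], database, mean, eigenvectors))
```

Comparing a handful of coefficients instead of all pixels is what makes the representation act as a form of compression.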


3.2.4.2 Shape

representation: based on modelling of physical deformations
finite element method → stiffness matrix → eigenvectors

Retrieval

compute amount of energy needed to align object


3.2.4.3 Texture

representation: based on Wold decomposition for regular stochastic processes in 2D = sum of three orthogonal components:

1. harmonic field
2. generalized-evanescent field
3. purely-indeterministic field

retrieval

1. derive parameters of Wold decompositions
2. compute similarity of parameter vectors


3.3 Video

3.3.1 QBIC

3.3.1.1 Representation of video data

1. shot detection
2. creation of representative frame
3. identify moving structures/objects

Shot detection

set of frames grouped into shots because they

• depict same scene
• signify single camera operation
• contain distinct event/action
• are chosen as single indexable unit


representative frame generation

representative frames

• treated as still images in database population
• in retrieval, returned as the answer representing the shot

representative frame generation methods:

• random frame from a shot
• synthesized r-frames
  – mosaicking all frames in a panning shot
  – remove moving objects


layered representation

different layers used for identifying significant objects in the scene

algorithm divides a shot into a number of layers, each with its own

• 2D affine motion parameters
• region of support in each frame


3.4 Summary: media indexing and matching

1. exact match
2. inexact media match (irrespective of digitization parameters)
3. inexact media feature match
4. content-based match


Chapter 4

Conclusions

Issues in MMIR

• syntactic (signal-based) vs. semantic (symbolic) indexing of MM objects

• dealing with document structure

• IR models for multimedia documents

