Multimedia Information Retrieval
Norbert Fuhr
Tutorial @ HS-IR ’98
Chapter 1
Introduction
• document structures and attributes
• media types
• terminology
Document structures and attributes
[Figure: a sample document, "IR in networks" by J. Doe, shown with its logical structure (document, head, title, author, chapters, sections), its layout structure, its content structure (IR, networks, heterogeneity, effectiveness, user friendliness), and its external attributes (author = 'J. Doe', crdate = 25-05-96, ladate = 30-09-96)]
Universität Dortmund, Informatik VI, N. Fuhr
Media types
[Figure: the media types audio, image, and video, drawn against the axes t, x, and y]
text is a linear medium
...
Terminology
monomedia object/document: object containing data of a single medium
multimedia object/document: object containing data of multiple media
hypertext document: nonlinear text document (i.e. with links)
hypermedia document: nonlinear multimedia document
Course structure:
1. introduction
2. views on media
3. multimedia indexing
Chapter 2
Views on media
• views on media objects
• FERMI multimedia data model
2.1 Views on media objects
here: images
physical view: pixel matrix
logical view:
• perceptive view
  – colour
  – texture
  – brightness
• symbolic view
• spatial view: spatial relations (depending on modelling space)
• structural view
  – set of image objects
  – structural relations between image objects (aggregation)
2.2 The FERMI Multimedia Document Model
2.2.1 Document structure and IR
impact of structure:
• of multimedia information: heterogeneity of multimedia data
• on semantic content: logical structure ≙ discourse structure
• on corpus
  – classic IR: document = atomic unit
  – MMIR: retrieval of document components
2.2.2 Elements of the multimedia data model
• logical structure
  – hierarchy of structural objects
  – leaves = single-media data
  – implements explicit organization of discourse
  – other data model elements refer to logical structure
• attributes
  – classical attributes (author, dates, ...)
  – index expressions
• navigational structure
  – links
2.2.2.1 The logical structure
logical structure ≙ hierarchical aggregation of structural objects:
$LS = (O_S, \prec_{str}, \prec_{seq}, TYPE_{ST}, \prec_{tst}, type_{st}, TYPE_M, type_m)$
$O_S$: finite set of document structural objects (elements: $os_i$)
$\prec_{str}$: aggregative relation between structural objects, defines hierarchical composition
$\prec_{seq}$: defines a linear sequence on $O_S$ (corresponds to the standard, linear order in which components are accessed)
$TYPE_{ST}$: set of types of structural objects
• e.g. for books: $TYPE_{ST} = \{Document, Chapter, Section, Sub\text{-}Section, Paragraph, Figure\}$
• types correspond to abstraction levels
$\prec_{tst}$: relation on structural object types defining the hierarchy of abstraction levels
$type_{st}$: total function assigning each structural object its structural type in $TYPE_{ST}$
$TYPE_M$: set of media types, $TYPE_M = \{text, image, graphic, multimedia\}$
$type_m$: total function assigning each structural object its media type in $TYPE_M$
2.2.2.2 Attributes
$A = (O_S, NAME_A, VALUE_A, name_a, domain_a, value_a, SM)$
where:
$O_S$: the set of structural objects in the document (elements: $os_i$)
$NAME_A$: set of attribute names
$VALUE_A$: set of all possible attribute values (union of the domain languages of all attributes)
$name_a$: partial function associating with each structural object a non-empty set of attribute names
$domain_a$: total function defining the domain of any attribute name (i.e. all the expressions of its associated language)
$value_a$: partial function assigning to structural objects the value for a related attribute name (definition allows multi-valued attributes)
Content Attributes
single-media models involve up to five types of views:
• the physical view
• the structural view
• the symbolic view
• the spatial view (only in image and graphic models)
• the perceptive view (only in image and graphic models)
→ standard attribute names (called Content Attributes) for views:
• physical
• structural
• symbolic
• spatial
• perceptive
2.2.2.3 Indexing model
indexing: assign index expressions to document structural objects
retrieval of multimedia documents: retrieve smallest units that fulfill the query
→ index expressions assigned to a parent object have to imply the index expressions of its component objects
index objects: structural objects that are indexed (assigned a value of attribute symbolic)
index model of a document base:
$I = (O_I, TYPE_I, \prec_{ind})$
$O_I$: set of index objects $oi_i$, $O_I \subseteq O_S$
$TYPE_I$: set of index object types, $TYPE_I \subseteq TYPE_{ST}$
$\prec_{ind}$: relation representing structural dependency between index objects, $\prec_{ind} \subseteq O_I \times O_I$
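The retrieval rule of the indexing model (return the smallest units that fulfill the query, with a parent's index implying that of its components) can be sketched as follows. This is a minimal illustration, not the FERMI implementation: index expressions are modelled as plain term sets, "implies" is modelled as set inclusion, and the names `Node`, `propagate` and `smallest_units` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A structural object; `terms` stands in for its index expression."""
    name: str
    terms: set = field(default_factory=set)
    children: list = field(default_factory=list)

def propagate(node):
    """Make each parent's index imply its components' (here: set union)."""
    for child in node.children:
        node.terms |= propagate(child)
    return node.terms

def smallest_units(node, query):
    """Return the deepest objects whose index contains all query terms."""
    if not query <= node.terms:
        return []
    hits = [h for c in node.children for h in smallest_units(c, query)]
    return hits or [node.name]

# Made-up document: only sections and Chapter 2 carry their own index terms.
doc = Node("Document", children=[
    Node("Chapter 1", children=[Node("Section 1.1", {"audio", "pitch"}),
                                Node("Section 1.2", {"image", "colour"})]),
    Node("Chapter 2", {"video", "image"}),
])
propagate(doc)
print(smallest_units(doc, {"image", "colour"}))   # ['Section 1.2']
print(smallest_units(doc, {"audio", "image"}))    # ['Chapter 1']
```

The second query shows why the implication condition matters: no single section covers both terms, so the enclosing chapter is the smallest unit that answers it.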
2.2.2.4 Example of an indexing structure
example of structure and index hierarchy types (index objects of type Chapter or Subsection only)
[Figure: structure types (Document, Chapter, Section, Subsection, Paragraph) shown next to the symbolic types]
[Figure: parts of the structural and semantic views of a document. Structural view: objects Os1 to Os19 on the levels Document, Chapter, Section, Subsection, and Paragraph, with nodes labelled U1 to U7. Semantic view: objects Osem2, Osem3, and Osem8 to Osem12.]
Chapter 3
Multimedia Indexing
• audio
• images
• video
3.1 Audio
3.1.1 Sound retrieval
E. Wold et al.: Content-based classification, search and retrieval of audio. IEEE Multimedia 3(3), pp. 27-36.
Levels of audio retrieval
1. exact match of sound samples
2. inexact match of sounds, irrespective of sample rate, quantization, compression, ...
3. inexact match of acoustic features / perceptual properties of sound
4. content-based match (for speech, musical content)
here: inexact match of acoustic features and perceptual properties
Acoustic features
aspects of sound considered:
loudness: root-mean-square of audio signal (in decibels)
pitch: greatest common divisor of peaks in Fourier spectra
brightness: centroid of short-time Fourier magnitude spectra (higher frequency content of signal)
bandwidth: magnitude-weighted average of differences between spectral components and the centroid (variation of frequencies, e.g. sine wave vs. white noise)
harmonicity: deviation of the sound's spectrum from a harmonic spectrum (i.e. harmonic spectra vs. inharmonic spectra vs. noise)
variation of aspects over time:
1. compute aspect values at certain time intervals
2. derive features from sequences:
   • average value
   • variance
   • autocorrelation
   (feature values weighted by amplitude)
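The two steps above (per-interval aspect values, then statistics over the sequence) can be sketched for the loudness aspect. This is a rough illustration under stated assumptions: the frame length, the lag-1 form of the autocorrelation, and the synthetic test signal are choices made here, and the amplitude weighting mentioned above is left out for brevity.

```python
import math

def rms_db(frame):
    """Loudness of one frame: root-mean-square of the samples, in decibels."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-10))  # floor avoids log(0) on silence

def trajectory_stats(values):
    """Mean, variance and lag-1 autocorrelation of a per-frame value sequence."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        return mean, var, 0.0
    cov1 = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1)) / n
    return mean, var, cov1 / var

# Synthetic signal (assumption): a 440 Hz tone whose amplitude decays linearly.
rate, frame_len = 8000, 200
signal = [(1.0 - t / 4000) * math.sin(2 * math.pi * 440 * t / rate) for t in range(4000)]
frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
loudness = [rms_db(f) for f in frames]
mean, var, acf = trajectory_stats(loudness)
print(f"loudness: mean={mean:.2f} dB, variance={var:.2f}, autocorrelation={acf:.3f}")
```

The same `trajectory_stats` helper would apply unchanged to per-frame pitch, brightness or bandwidth values once those are available.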
[Figure: sound example]
Property     Mean       Variance    Autocorrelation
Loudness     -54.4112   221.451     0.938929
Pitch        4.21221    0.151228    0.524042
Brightness   5.78007    0.0817046   0.690073
Bandwidth    0.272099   0.0169697   0.519198
Indexing and retrieval
Indexing of a sound: compute and store feature vector $a$ (mean, variance and autocorrelation for loudness, pitch, brightness, bandwidth and harmonicity)
Retrieval:
1. conditions w.r.t. feature values
2. similarity of sounds: weighted Euclidean distance
mean: $\mu = \frac{1}{M} \sum_{j=1}^{M} a_j$
covariance: $R = \frac{1}{M} \sum_{j=1}^{M} (a_j - \mu)(a_j - \mu)^T$
distance: $D = \sqrt{(a - b)^T R^{-1} (a - b)}$
$M$: number of sounds considered
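The mean, covariance and distance formulas above can be sketched in plain Python. The covariance-weighted distance used here is commonly known as the Mahalanobis distance. The feature vectors are made up for illustration, and the helper `solve` (a small Gauss-Jordan routine used instead of forming the inverse of R explicitly) is an implementation choice of this sketch, not something the slides prescribe.

```python
import math

def mean_vector(vectors):
    m, n = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / m for i in range(n)]

def covariance(vectors):
    """R = (1/M) * sum_j (a_j - mu)(a_j - mu)^T."""
    m, n = len(vectors), len(vectors[0])
    mu = mean_vector(vectors)
    return [[sum((v[i] - mu[i]) * (v[k] - mu[k]) for v in vectors) / m
             for k in range(n)] for i in range(n)]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def distance(a, b, cov):
    """D = sqrt((a - b)^T R^{-1} (a - b))."""
    diff = [x - y for x, y in zip(a, b)]
    return math.sqrt(sum(d * s for d, s in zip(diff, solve(cov, diff))))

# Made-up two-dimensional feature vectors for M = 4 sounds.
sounds = [[1.0, 10.0], [2.0, 14.0], [3.0, 11.0], [4.0, 17.0]]
R = covariance(sounds)
print(R)
print(distance(sounds[0], sounds[3], R))
```

With an identity covariance matrix the distance reduces to the ordinary Euclidean distance, which is a convenient sanity check.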
Property-based training and classification
training: based on set of training sounds for a property (e.g. scratchiness)
compute property-specific mean and covariance
importance of feature: mean divided by standard deviation
classification
compute distances to means of all classes, select class with minimum distance
likelihood: $L = \exp(-D^2 / 2)$
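A minimal sketch of the training and classification steps above. To stay short it assumes a diagonal covariance (one variance per feature) rather than the full covariance matrix; the class names and training vectors are invented for illustration.

```python
import math

def train(examples):
    """Per-feature mean and variance of the training vectors for one class."""
    m, n = len(examples), len(examples[0])
    mean = [sum(e[i] for e in examples) / m for i in range(n)]
    var = [sum((e[i] - mean[i]) ** 2 for e in examples) / m for i in range(n)]
    return mean, var

def distance(x, model):
    """Distance to the class mean; diagonal covariance assumed for brevity."""
    mean, var = model
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))

def classify(x, models):
    """Select the class at minimum distance and report L = exp(-D^2 / 2)."""
    name = min(models, key=lambda c: distance(x, models[c]))
    d = distance(x, models[name])
    return name, math.exp(-d * d / 2.0)

# Made-up training vectors (say: mean loudness, mean brightness).
models = {
    "laughter": train([[-45.0, 6.0], [-44.0, 6.2], [-46.0, 6.1], [-45.5, 6.3]]),
    "bell": train([[-30.0, 8.0], [-31.0, 8.4], [-29.0, 8.2], [-30.5, 8.1]]),
}
label, likelihood = classify([-44.8, 6.1], models)
print(label, round(likelihood, 3))
```

A sound that sits exactly on a class mean has D = 0 and therefore likelihood 1; the likelihood falls towards 0 as the distance grows.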
Example: classification of laughter sounds
Example: class model for laughter
Feature                       Mean         Variance       Importance
Duration                      2.71982      0.191312       6.21826
Loudness: Mean                -45.0014     18.9212        10.3455
Loudness: Variance            200.109      1334.99        5.47681
Loudness: Autocorrelation     0.955071     7.71106e-05    108.762
Brightness: Mean              6.16071      0.0204748      43.0547
Brightness: Variance          0.0288125    0.000113187    2.70821
Brightness: Autocorrelation   0.715438     0.0108014      6.88386
Bandwidth: Mean               0.363269     0.000434929    17.4188
Bandwidth: Variance           0.00759914   3.57604e-05    1.27076
Bandwidth: Autocorrelation    0.664325     0.0122108      6.01186
Pitch: Mean                   4.48992      0.39131        7.17758
Pitch: Variance               0.207667     0.0443153      0.986485
Pitch: Autocorrelation        0.562178     0.00857394     6.07133
$importance = |mean| / \sqrt{variance}$
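As a quick check of the importance formula, the following sketch recomputes three rows of the laughter class model from their mean and variance and prints them next to the tabulated value. The function name `importance` and the choice of rows are illustrative.

```python
import math

def importance(mean, variance):
    """importance = |mean| / sqrt(variance)"""
    return abs(mean) / math.sqrt(variance)

# (mean, variance, importance as printed in the class model table)
rows = {
    "Duration": (2.71982, 0.191312, 6.21826),
    "Loudness: Mean": (-45.0014, 18.9212, 10.3455),
    "Brightness: Mean": (6.16071, 0.0204748, 43.0547),
}
for name, (mean, variance, printed) in rows.items():
    print(f"{name}: computed {importance(mean, variance):.4f}, table {printed}")
```

Features whose values barely vary across the training sounds relative to their size (such as the loudness autocorrelation in the table) receive a high importance.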
3.1.2 Speech retrieval
1. speech recognition → uncertain term identification
2. application of text retrieval methods on recognized terms
→ TREC speech retrieval track
3.1.3 Music retrieval
McNab et al.: The New Zealand Digital Library MELody inDEX. D-Lib Magazine, May 1997.
1. melody transcription
2. approximate string matching
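The two steps can be illustrated with a small sketch. It assumes that a transcribed melody is reduced to a contour string (U = up, D = down, R = repeat) and that approximate matching is done with the Levenshtein edit distance; both are choices made for this illustration and not necessarily the exact representation used by the cited system. The tunes and the query are made up.

```python
def contour(pitches):
    """Encode a note sequence as U (up), D (down), R (repeat) steps."""
    return "".join("U" if b > a else "D" if b < a else "R"
                   for a, b in zip(pitches, pitches[1:]))

def edit_distance(s, t):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution or match
        prev = cur
    return prev[-1]

# MIDI note numbers; the database tunes and the sung query are invented.
database = {
    "tune A": [60, 60, 67, 67, 69, 69, 67],
    "tune B": [64, 62, 60, 62, 64, 64, 64],
}
query = [55, 55, 62, 62, 64, 63, 62]   # transposed, with one wrong note
ranked = sorted(database,
                key=lambda k: edit_distance(contour(query), contour(database[k])))
print(contour(query), ranked)
```

Because the contour only records the direction of each step, the transposed query still matches tune A, and the single wrong note costs one edit.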
3.2 Images
3.2.1 Introduction
3.2.1.1 Semantic vs. syntactic indexing and retrieval
syntactic image features:
• color
• texture
• contour
semantic image features:
• objects (humans, animals, buildings, art works)
• topics (pollution, demonstration, political visit)
most image indexing methods support syntactic features only
3.2.1.2 Aboutness vs. ofness
ofness: objects shown in the image
aboutness: topic which is illustrated by the image
aboutness is very much user-dependent, e.g. image showing water pollution
3.2.2 QBIC
tool for querying image and video databases by:
• example images
• user-constructed sketches and drawings
• selected color and texture patterns
• camera and object motion
3.2.2.1 System overview
main components:
• database population:
  1. processing of images and videos to extract syntactical features:
     – colors
     – textures
     – shape
     – camera motion
     – object motion
  2. storing features in database
• database querying:
  1. user composes query graphically
  2. generate features from graphical query
  3. search for database objects with similar features
Data model
basic elements:
• still images/scenes: contain objects
• video shots
  – sets of contiguous frames
  – contain motion objects
still images:
• scene: image or video frame
• object: part of a scene
videos:
1. break into clips (shots)
2. generate representative frame for each shot, treated as still image
3. generate motion objects from shots
querying:
• on objects: images with a red, round object
• on scenes: images with 30 % red and 20 % blue
• on shots: shots panning from left to right
• on combinations: images with 30 % red containing a blue object
3.2.2.2 Feature Calculation
color
color models: RGB, HSV, YUV, MTM
• average coordinates in color space
• k-element histogram (typically k = 64 or 256)
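Both colour features can be sketched on a tiny list of pixels. The sketch assumes RGB pixels with 8-bit channels and a uniform quantization into 4 bins per channel, which gives k = 64; how QBIC itself quantizes the colour space is not stated in the slides, so this is only one possible choice.

```python
def average_color(pixels):
    """Mean of each colour coordinate over all pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def color_histogram(pixels, bins_per_channel=4):
    """Normalized histogram with k = bins_per_channel ** 3 elements."""
    k = bins_per_channel ** 3
    hist = [0.0] * k
    for r, g, b in pixels:
        # Quantize each 0..255 channel to a bin index, then combine the three.
        qr, qg, qb = (v * bins_per_channel // 256 for v in (r, g, b))
        hist[(qr * bins_per_channel + qg) * bins_per_channel + qb] += 1.0 / len(pixels)
    return hist

# A made-up 4-pixel "image": three reddish pixels and one blue one.
pixels = [(250, 10, 10), (240, 20, 5), (255, 0, 0), (0, 0, 255)]
hist = color_histogram(pixels)   # k = 64
print(average_color(pixels))
print([(i, round(h, 2)) for i, h in enumerate(hist) if h > 0])
```

The average colour is a single point in colour space, while the histogram keeps the distribution: here it records that three quarters of the image fall into one red bin and one quarter into a blue bin.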
texture
coarseness: scale of texture
contrast: vividness of a pattern (function of variance of grey-level histogram)
directionality: "peakedness" of distribution of gradient directions in image (favoured direction (e.g. grass) vs. isotropic (e.g. sand))
shape
area: # pixels set in binary image
circularity: perimeter² / area
major axis orientation:
1. compute 2nd order covariance matrix from boundary pixels
2. major axis orientation = direction of the eigenvector with the largest eigenvalue
eccentricity: (largest eigenvalue) / (smallest eigenvalue)
algebraic moment invariants:
• consider 18 features invariant to affine transformations
• compute first m central moments as eigenvalues of predefined matrices
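Area, circularity, major axis orientation and eccentricity can be sketched for a small binary image. Two points are assumptions of this sketch: a boundary pixel is taken to be a set pixel with at least one unset 4-neighbour, and the perimeter is approximated by the number of boundary pixels. The eigenvalues of the 2×2 covariance matrix are computed in closed form.

```python
import math

def shape_features(image):
    """image: rows of 0/1. Returns (area, circularity, orientation, eccentricity)."""
    h, w = len(image), len(image[0])

    def is_set(y, x):
        return 0 <= y < h and 0 <= x < w and image[y][x] == 1

    pixels = [(y, x) for y in range(h) for x in range(w) if image[y][x] == 1]
    # Boundary pixel: a set pixel with at least one unset 4-neighbour (assumption).
    boundary = [(y, x) for y, x in pixels
                if not all(is_set(y + dy, x + dx)
                           for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)))]
    area, perimeter, n = len(pixels), len(boundary), len(boundary)
    my = sum(y for y, _ in boundary) / n
    mx = sum(x for _, x in boundary) / n
    sxx = sum((x - mx) ** 2 for _, x in boundary) / n
    syy = sum((y - my) ** 2 for y, _ in boundary) / n
    sxy = sum((x - mx) * (y - my) for y, x in boundary) / n
    # Closed-form eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    largest, smallest = half_trace + root, half_trace - root
    orientation = 0.5 * math.atan2(2 * sxy, sxx - syy)   # radians from the x axis
    eccentricity = largest / smallest if smallest > 0 else math.inf
    return area, perimeter ** 2 / area, orientation, eccentricity

# A 3x6 filled rectangle inside a 5x8 binary image.
image = [[1 if 1 <= y <= 3 and 1 <= x <= 6 else 0 for x in range(8)] for y in range(5)]
area, circularity, orientation, eccentricity = shape_features(image)
print(area, round(circularity, 2), round(orientation, 3), round(eccentricity, 2))
```

For the horizontal rectangle the major axis lies along the x axis, and the eccentricity is greater than 1 because the boundary spreads further horizontally than vertically.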
sketch
based on reduced resolution edge map:
1. convert color image to single band luminance
2. compute binary edge image
3. reduce edge image to 64×64
4. thin reduced image
3.2.2.3 Sample queries
• average color queries: search for images/objects with similar color, computed as weighted Euclidean distance in color space
• histogram color queries: search for images with specified color distribution, based on 256-element histogram:
  $Q$: query histogram
  $D$: image histogram
  $Z$: element difference histogram, $Z = Q - D$
  $A$: symmetric color similarity matrix, $a(i,j) = 1 - d(c_i, c_j) / d_{max}$
  $c_k$: $k$th color in histogram
  $d(c_i, c_j)$: MTM color distance
  $d_{max}$: maximum distance between any two colors
  similarity: $\|R\| = Z^T A Z$
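The histogram comparison can be sketched with a three-bin toy histogram. The Euclidean distance between RGB bin centres stands in for the MTM colour distance here, which is an assumption made to keep the example self-contained; the bin colours and histograms are invented.

```python
import math

def similarity_matrix(colors):
    """a(i, j) = 1 - d(c_i, c_j) / d_max, with Euclidean d standing in for MTM."""
    d = [[math.dist(ci, cj) for cj in colors] for ci in colors]
    d_max = max(max(row) for row in d)
    return [[1.0 - dij / d_max for dij in row] for row in d]

def quadratic_form(q, d, a):
    """Z^T A Z with Z = Q - D; 0 for identical histograms, larger = less similar."""
    z = [qi - di for qi, di in zip(q, d)]
    return sum(z[i] * a[i][j] * z[j] for i in range(len(z)) for j in range(len(z)))

# Toy histogram bins: red, orange and blue (RGB bin centres, made up).
colors = [(255, 0, 0), (255, 128, 0), (0, 0, 255)]
A = similarity_matrix(colors)
query = [1.0, 0.0, 0.0]    # an all-red query
orange = [0.0, 1.0, 0.0]
blue = [0.0, 0.0, 1.0]
print(quadratic_form(query, orange, A), quadratic_form(query, blue, A))
```

The matrix A is what distinguishes this measure from a bin-by-bin comparison: the all-orange image scores closer to the red query than the all-blue image, even though neither shares a bin with the query.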
• texture queries: user selects texture from a sampler; compute weighted Euclidean distance in 3D texture space (coarseness, contrast, directionality)
• object shape
  – user draws shape
  – shape features: area, circularity, eccentricity, major-axis-direction, object moments, tangent angles around object perimeter
  – compute weighted Euclidean distance, weights are inverse variances of features
• query by sketch: user draws dominant lines and edges
  1. reduce user sketch to 64×64
  2. for each db image, correlate its sketch with the user sketch, based on edge/no edge comparison
  3. compute correlation scores
3.2.3 IRIS
semantic indexing of images
1. image analysis
   • color
   • contour
   • texture
2. object recognition
   (a) basic objects: clouds, snow, water, sky, forest, grass, sand, stone
   (b) high-level objects: forestscene, skyscene, mountainscene, landscapescene, ...
3.2.3.1 Image Analysis
Color
IRIS subdivides color space into about 20 different colors
1. subdivide image into nonoverlapping tiles
2. compute color histogram for each tile
3. most frequent color =: color of tile
4. join tiles with similar colors and compute circumscribing rectangle
5. compute attributes of color rectangles:
   • position
   • size
   • color
   • color density (# tiles with color / # tiles in rectangle)
   • color evidence
[Figure: original image]
color-based segmentation:
...
colour2 HOR=mid,VER=up,SIZ=XL,SHP=Rect,COL=BLUE,
UL=0—1,LR=44—11,DEN=415—495
colour3 HOR=mid,VER=mid,SIZ=M,SHP=Rect,COL=BLUE,
UL=15—10,LR=44—17,DEN=136—240
colour4 HOR=left,VER=mid,SIZ=XS,SHP=Quad,COL=BLUE,
UL=1—11,LR=1—11,DEN=1—1
colour5 HOR=left,VER=mid,SIZ=XS,SHP=Rect,COL=BLUE,
UL=3—11,LR=14—12,DEN=13—24
...
Texture
consider local distribution and variation of grey values
1. compute normalized co-occurrence matrix p for 4 directions: 0°, 90°, 45°, 135°
2. for each of the four directions, compute the following features from p:
   • angular second moment
   • contrast (local variations)
   • correlation (linear relationship between pixel values)
   • variance (deviation from the average)
   • entropy
3. for each of the five parameters, compute the average from the values for the 4 directions (→ invariance against rotation)
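Steps 1 to 3 can be sketched as follows for three of the five features (angular second moment, contrast and entropy; correlation and variance are omitted to keep the sketch short). Counting each pixel pair in both orders, so that the matrix is symmetric, is a common convention assumed here; the two test images are made up.

```python
import math

def cooccurrence(image, dy, dx, levels):
    """Normalized co-occurrence matrix p for pixel pairs offset by (dy, dx)."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(image), len(image[0])
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[image[y][x]][image[y2][x2]] += 1
                counts[image[y2][x2]][image[y][x]] += 1   # both orders (symmetric)
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]

def features(p):
    """Angular second moment, contrast and entropy of one matrix p."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    asm = sum(v * v for _, _, v in cells)
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    entropy = -sum(v * math.log(v) for _, _, v in cells if v > 0)
    return asm, contrast, entropy

def texture_features(image, levels):
    """Average each feature over the directions 0, 45, 90 and 135 degrees."""
    offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]
    per_dir = [features(cooccurrence(image, dy, dx, levels)) for dy, dx in offsets]
    return tuple(sum(f[k] for f in per_dir) / len(per_dir) for k in range(3))

flat = [[1] * 4 for _ in range(4)]                               # uniform grey
checker = [[(x + y) % 2 for x in range(4)] for y in range(4)]    # checkerboard
print(texture_features(flat, 2))
print(texture_features(checker, 2))
```

The uniform image has maximal angular second moment and zero contrast, whereas the checkerboard has full contrast in the horizontal and vertical directions and none along the diagonals, so its averaged contrast lies in between.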
4. feed average values into neural network
[Figure: neural network with input layer (asm, contrast, correlation, variance, entropy), two hidden layers, and output layer (forest, grass, sand, water, stone, sky, clouds, ice)]
5. NN yields texture for each tile
6. join tiles with identical textures and compute circumscribing rectangles
7. compute attributes of texture rectangles:
   • position
   • size
   • texture
   • texture density (# tiles with texture / # tiles in rectangle)
...
texture3 HOR=mid,VER=mid,SIZ=L,SHP=Rect,TEX=ice,
UL=2—2,LR=10—3,DEN=11—18
texture4 HOR=left,VER=mid,SIZ=S,SHP=Path,TEX=clouds,
UL=0—3,LR=3—3,DEN=4—4
texture5 HOR=left,VER=mid,SIZ=S,SHP=Quad,TEX=stone,
UL=4—3,LR=5—4,DEN=3—4
texture6 HOR=mid,VER=mid,SIZ=S,SHP=Rect,TEX=clouds,
UL=5—3,LR=8—4,DEN=5—8
...
Contour
based on grey level image
1. gradient-based edge detection
2. determination of object contours
3. shape analysis: compute
   • position of centroid
   • size of region
   • bound coordinates of region
3.2.3.2 Object Recognition
1. step from syntactical to semantical features: identification of primitive objects
2. derivation of higher-level semantical features
identification of primitive objects
• basis: color, texture and contour features
• for each feature, consider corresponding region
• form graph describing topological relationships between feature regions:
  – node = feature
  – edge = topological relationship: overlaps, meets, contains
[Figure: example graph with nodes labelled CL, CT, and T, connected by meets, contains, and overlaps edges]
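The graph of topological relationships can be sketched for rectangular feature regions. The rectangle representation (with exclusive right and bottom bounds), the region names, and the rule that touching rectangles "meet" are assumptions of this sketch, not details taken from IRIS.

```python
from itertools import combinations
from typing import NamedTuple

class Rect(NamedTuple):
    left: int
    top: int
    right: int    # exclusive
    bottom: int   # exclusive

def contains(a, b):
    return a.left <= b.left and a.top <= b.top and a.right >= b.right and a.bottom >= b.bottom

def edge(m, n, regions):
    """Labelled edge between regions m and n, or None if they are apart."""
    a, b = regions[m], regions[n]
    if contains(a, b):
        return (m, "contains", n)
    if contains(b, a):
        return (n, "contains", m)
    x_overlap = min(a.right, b.right) - max(a.left, b.left)
    y_overlap = min(a.bottom, b.bottom) - max(a.top, b.top)
    if x_overlap > 0 and y_overlap > 0:
        return (m, "overlaps", n)
    if x_overlap >= 0 and y_overlap >= 0:
        return (m, "meets", n)   # the rectangles share only an edge or a corner
    return None

# Made-up feature regions: nodes of the graph.
regions = {
    "colour": Rect(0, 0, 10, 6),
    "texture": Rect(2, 1, 8, 5),
    "contour": Rect(10, 0, 14, 6),
}
edges = [e for m, n in combinations(regions, 2) if (e := edge(m, n, regions))]
print(edges)
```

Rules for primitive objects can then be written as conditions over these labelled edges, for instance requiring that one region contains two others.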
• formulate graph grammar rules for detecting primitive objects
[Figure: graph grammar rule for Clouds, combining a texture segment, a contour segment, and a color segment]
Conditions of "Clouds":
predicate((valcompeq(*self(2,"colorseg","COL"),"blue") ||
           valcompeq(*self(2,"colorseg","COL"),"white")) &&
          valcompeq(*self(2,"colorseg","VER"),"up"));
predicate(nrkind(*self(1,"contourseg"),"contains",*self(1,"colorseg")) &&
          nrkind(*self(1,"contourseg"),"contains",*self(1,"textureseg")));
[Figure: rule for the high-level object Mountainlake, built from Sky, Lake, Mountain, and Forest]
3.2.4 Photobook
developed at MIT Media Lab
goal: semantic retrieval of images, based on semantics-preserving image compression
types of descriptions:
• appearance (faces)
• shape
• texture
3.2.4.1 Appearance
based on eigenimage representations
Training: Building Eigenrepresentations
1. preprocessing of input images: normalize w.r.t. position, scale, orientation
2. computation of eigenvectors of normalized image covariance for
   • training images (faces)
   • subregions of training images (eyes, nose, mouth)
[Figure: mean and first few eigenvectors]
Retrieval
Γ: new image (region)
1. transform Γ into face space
2. retrieval based on similarity measure
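In the usual eigenimage formulation, the two retrieval steps can be written as below. Only $\Gamma$ comes from the slides; the symbols $\Psi$ (mean training image), $u_k$ (the $k$th of $K$ retained eigenvectors), $\Omega$ (coefficient vector) and $\Omega_i$ (coefficients of the $i$th stored image) are introduced here for illustration, and the Euclidean distance is one possible similarity measure.

```latex
% Step 1: project the mean-subtracted image onto the retained eigenvectors
\omega_k = u_k^T (\Gamma - \Psi), \qquad k = 1, \dots, K,
\qquad \Omega = (\omega_1, \dots, \omega_K)^T

% Step 2: rank stored images by distance in face space (smaller = more similar)
\varepsilon_i = \| \Omega - \Omega_i \|
```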
3.2.4.2 Shape
representation: based on modelling of physical deformations
finite element method → stiffness matrix → eigenvectors
Retrieval
compute amount of energy needed to align object
3.2.4.3 Texture
representation: based on Wold decomposition for regular stochastic processes in 2D = sum of three orthogonal components:
1. harmonic field
2. generalized-evanescent field
3. purely-indeterministic field
retrieval
1. derive parameters of Wold decompositions
2. compute similarity of parameter vectors
3.3 Video
3.3.1 QBIC
3.3.1.1 Representation of video data
1. shot detection
2. creation of representative frame
3. identify moving structures/objects
Shot detection
set of frames grouped into shots because they
• depict same scene
• signify single camera operation
• contain distinct event/action
• are chosen as single indexable unit
Representative frame generation
representative frames (r-frames)
• treated as still images in database population
• returned in retrieval as the answer representing the shot
representative frame generation methods:
• random frame from a shot
• synthesized r-frames
  – mosaicking all frames in a panning shot
  – remove moving objects
layered representation
different layers used for identifying significant objects in the scene
algorithm divides a shot into a number of layers, each with its own
• 2D affine motion parameters
• region of support in each frame
3.4 Summary: media indexing and matching
1. exact match
2. inexact media match (irrespective of digitization parameters)
3. inexact media feature match
4. content-based match
Chapter 4
Conclusions
Issues in MMIR
• syntactic (signal-based) vs. semantic (symbolic) indexing of MM objects
• dealing with document structure
• IR models for multimedia documents