Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues

Written by: W. H. Adams, Giridharan Iyengar, Ching-Yung Lin, Milind Ramesh Naphade, Chalapathy Neti, Harriet J. Nock, John R. Smith

Presented by: Archana Reddy Jammula (800802639)
WALKING THROUGH…
• INTRODUCTION
• SEMANTIC-CONTENT ANALYSIS SYSTEM
• EXPERIMENTAL RESULTS
• CONCLUSIONS
INTRODUCTION
• Large digital video libraries require tools for representing, searching, and retrieving content.
• The field is shifting from query-by-example (QBE) to query-by-keyword (QBK).
OVERVIEW
• An IBM project.
• A trainable QBK system for the labeling and retrieval of generic multimedia semantic-concepts in video.
• Focus on the detection of semantic-concepts using information cues from multiple modalities.
RELATED WORK
• A novel probabilistic framework for semantic video indexing that learns probabilistic multimedia representations of semantic events to represent keywords and key concepts.
• A library-of-examples approach.
• A rule-based system for indexing basketball videos by detecting semantics in audio.
• A framework for detecting sources of sounds in audio using cues such as onset and offset.
• A hidden-Markov-model (HMM) framework for generalized sound recognition.
• Use of tempo to characterize motion pictures.
INTRODUCTION OF NEW APPROACH
• Prior work emphasized the extraction of semantics from individual modalities, in some instances using both audio and visual modalities.
• This work combines content analysis with information retrieval in a unified setting for the semantic labeling of multimedia content. It proposes a novel approach for representing semantic-concepts using a basis of other semantic-concepts, and a novel discriminant framework to fuse the different modalities.
• THE CHANGE: an approach that combines audio and visual content analysis with textual information retrieval for semantic modeling of multimedia content.
APPROACH
• Cast semantic indexing as a machine-learning problem.
• Assumption: an a priori definition of a set of atomic semantic-concepts (objects, scenes, and events), assumed broad enough to cover the semantic query space of interest, including high-level concepts.
• Steps:
  • The set of atomic concepts is annotated manually in audio, speech, and/or video within a set of "training" videos.
  • The annotated training data is then used to develop explicit statistical models of these atomic concepts; each such model can then be used to automatically label occurrences of the corresponding concept in new videos.
CHALLENGES
• Low-level features appropriate for labeling atomic concepts must be identified, and appropriate schemes for modeling these features must be selected.
• High-level concepts must be linked to the presence/absence of other concepts, and statistical models for combining these concept models into a high-level model must be chosen.
• Cutting across these levels, information from multiple modalities must be integrated or fused.
SEMANTIC-CONCEPT ANALYSIS SYSTEM:
Three components:
(i) Tools for defining a lexicon of semantic-concepts and annotating examples of those concepts within a set of training videos;
(ii) Schemes for automatically learning the representations of semantic-concepts in the lexicon based on the labeled examples;
(iii) Tools supporting data retrieval using the (defined) semantic-concepts.
Fig: The semantic-concept analysis system.
LEXICON OF SEMANTIC-CONCEPTS
• A working set of intermediate- and high-level concepts covering events, scenes, and objects.
• Concepts are defined independently of the modality in which their cues occur, although some are naturally expressed in one modality more than another.
• Imposing a hierarchy is difficult, but where one is imposed, it is defined in terms of feature extraction or of derivation from other concepts.
Example: the semantic-concept "parade" is defined as:
• a collection of people
• music
• a context in which the clip is interpreted as a parade
ANNOTATING A CORPUS
• Annotation of visual data is performed at the shot level.
• Annotation of audio data is performed by specifying time spans over which each audio concept (such as speech) occurs.
• Multimodal annotation follows, with synchronized playback of audio and video during the annotation process.
• Media Streams presents the lexicon of semantic-concepts as a set of icons.
LEARNING SEMANTIC-CONCEPTS FROM FEATURES
• Atomic concepts are modeled using features from a single modality; the integration of cues from multiple modalities occurs only within models of high-level concepts (a late-integration approach).
• Focus is on the joint analysis of audio, visual, and textual modalities for the semantic modeling of video.
• Modeling approaches:
  1. Probabilistic modeling of semantic-concepts and events using models such as GMMs and HMMs.
  2. Bayesian networks and discriminant approaches such as support vector machines (SVMs).
PROBABILISTIC MODELING FOR SEMANTIC CLASSIFICATION
• LOGIC: Model each semantic-concept as a class-conditional probability density function over a feature space.
• Given the set of semantic-concepts and a feature observation, choose as the label the class whose conditional density yields the maximum likelihood of the observed feature.
• As the true class-conditional densities are not available, assumptions are made; the usual choices are:
  • GMMs for independent observation vectors
  • HMMs for time-series data
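In symbols, the maximum-likelihood labeling rule described above: given concepts c_1, ..., c_K and an observed feature vector x, assign

\[ \hat{c} = \arg\max_{c \in \{c_1, \ldots, c_K\}} p(x \mid c). \]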
GMM - PROBABILITY DENSITY FUNCTION
• A GMM defines a probability density function of an n-dimensional observation vector x given a model M:
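The slide's equation was an image; the standard GMM density, which the model M here is assumed to take, is

\[ P(x \mid M) = \sum_{i=1}^{K} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i), \qquad \sum_{i=1}^{K} w_i = 1, \]

where the mixture has K Gaussian components with means \(\mu_i\), covariances \(\Sigma_i\), and nonnegative mixture weights \(w_i\).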
HMM - PROBABILITY DENSITY FUNCTION
• An HMM allows us to model a sequence of observations (x1, x2, . . . , xn) as having been generated by an unobserved state sequence s1, . . . , sn with a unique starting state s0, giving the probability of the model M generating the output sequence as:
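This equation was also an image; in the standard first-order HMM form implied by the description, the output probability is

\[ P(x_1, \ldots, x_n \mid M) = \sum_{s_1, \ldots, s_n} \prod_{t=1}^{n} P(s_t \mid s_{t-1}) \, P(x_t \mid s_t), \]

summing over all state sequences starting from the unique initial state \(s_0\), with transition probabilities \(P(s_t \mid s_{t-1})\) and state-conditional emission densities \(P(x_t \mid s_t)\).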
DISCRIMINANT TECHNIQUES: SUPPORT VECTOR MACHINES
• Flaw of the probabilistic-modeling approach to semantic classification: reliable estimation of the class-conditional parameters requires large amounts of training data for each class, and the forms assumed for the class-conditional distributions may not be the most appropriate.
• Benefit of discriminant techniques such as support vector machines: a more discriminant learning approach requires fewer parameters and assumptions, and may yield better results.
SVM technique:
• Separates two classes using a linear hyperplane. The classes are represented as:
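The equation here was an image; the representation is presumably the standard separable-case constraints. With labels \(y_i \in \{+1, -1\}\):

\[ w \cdot x_i + b \ge +1 \;\; \text{for } y_i = +1, \qquad w \cdot x_i + b \le -1 \;\; \text{for } y_i = -1, \]

i.e., \(y_i (w \cdot x_i + b) \ge 1\) for all training points, with the SVM choosing the hyperplane \((w, b)\) that maximizes the margin \(2 / \lVert w \rVert\).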
LEARNING AUDIO CONCEPTS
• The scheme for modeling audio-based atomic concepts starts with an annotated audio training set.
• One HMM is trained per audio concept; during testing, two distinct schemes are used to compute the confidences of the different hypotheses. A minimal training/scoring sketch follows.
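A minimal sketch of the one-HMM-per-concept setup, assuming the hmmlearn library; the audio feature extraction and the paper's two confidence schemes are omitted, and the data layout is hypothetical.

```python
# One Gaussian-emission HMM per audio concept; label a test segment by the
# concept whose HMM gives the highest (frame-normalized) log-likelihood.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_concept_models(training_data, n_states=3):
    """training_data: dict mapping concept name -> list of feature
    sequences, each an (n_frames, n_features) array (hypothetical layout)."""
    models = {}
    for concept, sequences in training_data.items():
        X = np.vstack(sequences)               # concatenate all sequences
        lengths = [len(s) for s in sequences]  # per-sequence frame counts
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                          n_iter=25, random_state=0)
        hmm.fit(X, lengths)                    # Baum-Welch training
        models[concept] = hmm
    return models

def label_segment(models, segment):
    """Score a test segment against every concept HMM; return the
    maximum-likelihood concept and all per-concept scores."""
    scores = {c: m.score(segment) / len(segment) for c, m in models.items()}
    return max(scores, key=scores.get), scores
```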
REPRESENTING CONCEPTS USING SPEECH
• Speech cues may be derived from one of two sources:
  • manual transcriptions
  • the results of automatic speech recognition (ASR) on the speech segments of the audio.
• The procedure for labeling a particular semantic-concept using speech information alone assumes an a priori definition of a set of query terms pertinent to that concept.
Cont..
SCHEME FOLLOWED:
• A scheme for obtaining such a set of query terms automatically would be to use the most frequent words occurring within shots annotated with a particular concept; the set might also be derived using human knowledge or WordNet.
• Tagging, morphologically analyzing, and applying a stop list to this set of words yields a set of query terms Q for use in retrieving the concept of interest.
• Retrieval of shots containing the concept then proceeds by ranking documents against Q according to their score, as in standard text retrieval, which provides a ranking of shots; see the sketch below.
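The slides do not give the exact scoring function; this sketch stands in TF-IDF weighting with cosine similarity (an assumption) for "standard text retrieval", using scikit-learn.

```python
# Rank shot transcripts against the concept query-term set Q.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_shots(shot_transcripts, query_terms):
    """shot_transcripts: list of strings, one per video shot.
    query_terms: the set Q of concept query terms (e.g. for rocket launch).
    Returns shot indices ranked best-first, plus the raw scores."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(shot_transcripts)
    query_vec = vectorizer.transform([" ".join(query_terms)])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    return scores.argsort()[::-1], scores
```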
LEARNING MULTIMODAL CONCEPTS
• Information cues from one or more modalities are integrated.
• By modeling high-level concepts in terms of atomic-concept scores, we can build richer models that exploit the interrelationships between atomic concepts, which may not be possible if we model these high-level concepts directly in terms of low-level features.
INFERENCE USING GRAPHICAL MODELS
• Models used: Bayesian networks of various topologies and parameterizations.
• Advantage: a Bayesian network allows us to graphically specify a particular form of the joint probability density function.
• The joint probability function encoded by the Bayesian network is

  P(E, A, V, T) = P(E) P(A | E) P(V | E) P(T | E)

  where E is the high-level event and A, V, and T are the audio, visual, and text cues. A scoring sketch follows the figure caption below.

(a) Bayesian Networks
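A minimal sketch of posterior inference under the factorization above, with all variables binary as in the experiments. The conditional-probability values are placeholders, not the paper's learned parameters, and treating the normalized classifier scores as soft evidence is an assumption.

```python
# Posterior P(E=1 | A, V, T) for P(E,A,V,T) = P(E)P(A|E)P(V|E)P(T|E).
def posterior_event(p_a, p_v, p_t, prior_e=0.1, cpt=None):
    """p_a, p_v, p_t: per-modality concept probabilities in [0, 1]
    (classifier scores already normalized to 0-1, as in the paper).
    Returns P(E=1 | A, V, T) treating each cue as a soft binary observation."""
    # cpt[m][e] = P(modality cue m fires | E = e); placeholder numbers.
    cpt = cpt or {"A": (0.2, 0.8), "V": (0.1, 0.9), "T": (0.2, 0.7)}

    def likelihood(e):
        like = prior_e if e else (1.0 - prior_e)
        for m, p in (("A", p_a), ("V", p_v), ("T", p_t)):
            # Soft evidence: mix P(cue | E=e) over the observed probability p.
            like *= p * cpt[m][e] + (1.0 - p) * (1.0 - cpt[m][e])
        return like

    l1, l0 = likelihood(1), likelihood(0)
    return l1 / (l1 + l0)  # Bayes-rule normalization
```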
CLASSIFYING CONCEPTS USING SVMS
• Scores from all the intermediate concept classifiers are concatenated into a vector, and this vector is used as the feature in the SVM; see the sketch after the caption below.

(b) Support Vector Machines
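A minimal sketch of this SVM late fusion using scikit-learn; the stacked-score layout is hypothetical and the kernel choice is a guess, as the slides do not specify it.

```python
# Fuse intermediate concept-classifier scores with an SVM.
import numpy as np
from sklearn.svm import SVC

# X: one row per shot, columns = scores from the intermediate concept
# classifiers (the rocket-launch experiment later uses 9 of them).
# y: 1 if the shot contains the high-level concept, else 0.
def train_fusion_svm(X, y):
    fusion = SVC(kernel="rbf", probability=True)
    fusion.fit(np.asarray(X), np.asarray(y))
    return fusion

def rank_by_concept(fusion, X_test):
    """Rank test shots by the fused confidence for the high-level concept."""
    conf = fusion.predict_proba(np.asarray(X_test))[:, 1]
    return conf.argsort()[::-1], conf
```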
EXPERIMENTAL RESULTS
THE CORPUS
• Dataset: a subset of the NIST Video TREC 2001 corpus, which comprises production videos derived from sources such as NASA and the Open Video Consortium.
• 7 videos comprising 1248 video shots; the 7 videos describe NASA activities, including its space program.
• The most pertinent audio cues are:
  • Music: 84% of manually labeled audio samples
  • Rocket engine explosion: 60% of manually labeled audio samples
PREPROCESSING AND FEATURE EXTRACTION
• Visual shot detection and feature extraction:
  • Color
  • Structure
  • Shape
• Audio feature extraction
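The slide does not spell out the feature computations. As one illustration only, a normalized per-keyframe color histogram (a common choice for a "color" feature, assumed here; the paper's actual color, structure, and shape descriptors may differ) could be computed as:

```python
# Illustrative color feature: normalized joint RGB histogram of a keyframe.
import numpy as np

def color_histogram(frame, bins_per_channel=8):
    """frame: (H, W, 3) uint8 RGB keyframe.  Returns a normalized
    bins_per_channel**3 joint color histogram as the feature vector."""
    quantized = (frame.astype(np.int32) * bins_per_channel) // 256
    flat = (quantized[..., 0] * bins_per_channel + quantized[..., 1]) \
           * bins_per_channel + quantized[..., 2]
    hist = np.bincount(flat.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()
```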
LEXICON
• The lexicon in this experiment comprises more than 50 semantic-concepts for describing events, sites, and objects with cues in audio, video, and/or speech.
• A subset is examined in the visual, audio, and multimodal concept experiments.
EVALUATION METRICS
• Retrieval performance is measured using precision-recall curves.
• A figure-of-merit (FOM) of retrieval effectiveness summarizes performance, defined as the average precision over the top 100 retrieved documents; one common form of this measure is given below.
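The slide gives only the verbal definition; a standard average-precision-at-100 formula, which the FOM is assumed to follow, is

\[ \mathrm{FOM} = \frac{1}{\min(R, 100)} \sum_{k=1}^{100} P(k)\,\mathrm{rel}(k), \]

where \(P(k)\) is the precision over the top \(k\) retrieved documents, \(\mathrm{rel}(k)\) is 1 if the k-th document is relevant and 0 otherwise, and \(R\) is the total number of relevant documents.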
RETRIEVAL USING MODELS FOR VISUAL FEATURES
• Results: GMM versus SVM classification.
• The figure depicts the overall retrieval effectiveness for a variety of intermediate (visual) semantic-concepts with SVM and GMM classifiers.
PRECISION-RECALL CURVES
Fig: Precision-recall curves for the GMM and SVM visual-concept classifiers.
RETRIEVAL USING MODELS FOR AUDIO FEATURES
• Results: minimum-duration modeling.
Fig: Effect of duration modeling
IMPLICIT VERSUS EXPLICIT FUSION PERFORMANCE GRAPH
Fig: Implicit versus explicit fusion
RESULTS: FUSION OF SCORES FROM MULTIPLE AUDIO MODELS
FOM results: audio retrieval, different intermediate concepts.
RESULTS: FUSION OF SCORES FROM MULTIPLE AUDIO MODELS
FOM results: audio retrieval, GMM versus HMM performance and implicit versus explicit fusion.
RETRIEVAL USING SPEECH
Two sets of results:
• Retrieval of the rocket launch concept using manually produced ground-truth transcriptions.
• Retrieval using transcriptions produced by ASR.
Cont..
FOM results: speech retrieval using a human-knowledge-based query.
BAYESIAN NETWORK INTEGRATION
• All random variables are assumed to be binary-valued.
• The scores emitted by the individual classifiers (rocket object and rocket engine explosion) are mapped into the 0-1 range using the precision-recall curve as a guide.
• Acceptable operating points on the precision-recall curve are mapped to probability 0.5; see the sketch below.
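The slides do not give the exact mapping; one piecewise-linear normalization consistent with "the operating point maps to 0.5" (an assumption, not the paper's stated procedure) is:

```python
# Hypothetical piecewise-linear score normalization: the classifier score at
# the chosen precision-recall operating point maps to probability 0.5.
def normalize_score(score, op_point, s_min, s_max):
    """score: raw classifier output; op_point: score at the acceptable
    operating point on the precision-recall curve; s_min/s_max: observed
    score range.  Returns a value in [0, 1] with op_point -> 0.5."""
    if score <= op_point:
        span = max(op_point - s_min, 1e-9)   # guard against zero span
        return 0.5 * max(score - s_min, 0.0) / span
    span = max(s_max - op_point, 1e-9)
    return 0.5 + 0.5 * min(score - op_point, span) / span
```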
SVM INTEGRATION
• For fusion using SVMs, the scores from all nine semantic models are concatenated into a 9-dimensional feature vector, which the fusion SVM classifies (as sketched earlier).
Fig: Fusion of audio, text, and visual models using the SVM fusion model for rocket launch retrieval.
Fig: The top 20 video shots of rocket launch/take-off retrieved using multimodal detection based on the SVM model. Nineteen of the top 20 are rocket launch shots.
SVM INTEGRATION cont..
FOM results for unimodal retrieval and the two multimodal fusion models
CONCLUSION
• Feasibility of the framework was demonstrated for the semantic-concept rocket launch:
  • for concept classification using information in single modalities;
  • for concept classification using information from multiple modalities.
• Experimental results show that information from multiple modalities can be successfully integrated to improve semantic-labeling performance over that achieved by any single modality.
FUTURE WORK
• Schemes must be identified for automatically determining the low-level features most appropriate for labeling atomic concepts, and for determining which atomic concepts are related to higher-level semantic-concepts.
• The scalability of the scheme and its extension to much larger numbers of semantic-concepts must also be investigated.
THANK YOU