Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video
TRECVID 2003
Carnegie Mellon University
A. Hauptmann, R.V. Baron, M.-Y. Chen, M. Christel, P. Duygulu, C. Huang, R. Jin, W.-H. Lin, T. Ng, N. Moraveji, N. Papernick, C.G.M. Snoek, G. Tzanetakis, J. Yang, R. Yang, and H.D. Wactlar
Overview (1/3)
TRECVID 2003
Shot boundary determination: identify the shot boundaries in the given video clip(s)
Story segmentation: identify the story boundaries and types (miscellaneous or news)
High-level feature extraction: Outdoors, News subject face, People, Building, Road, Animal, ...
Search: given the search test collection and a multimedia statement of information need (topic), return a ranked list of common reference shots from the test collection
Overview (2/3)
Search
Interactive Search
Manual Search
Overview (3/3)
Semantic Classifiers: most are trained on keyframes
Interactive Search: allow more effective browsing and visualization of the results of text queries using a variety of filter strategies
Manual Search: use multiple retrieval agents (color, texture, ASR, OCR and some of the classifiers, e.g. anchor, Person X); Negative Pseudo-Relevance; Co-Retrieval. Even the text-based baseline using the OKAPI formula performed better than other groups.
Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (1/3)
Audio Features
These features assist the extraction of the following medium-level audio-based features: music, male speech, female speech, and noise.
Based on the magnitude spectrum calculated using a Short-Time Fourier Transform (STFT).
They consist of features that summarize the overall spectral characteristics: Spectral Centroid, Rolloff, Relative Subband Energies, and the Mel-Frequency Cepstral Coefficients.
Male/female speech: using the Average Magnitude Difference Function (AMDF)
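The spectral summary features above can be sketched directly from one STFT magnitude frame. A minimal numpy illustration follows; the function names, the 85% rolloff fraction, and the toy frame are our choices, not the paper's:

```python
import numpy as np

def spectral_centroid(mag, freqs):
    """Magnitude-weighted mean frequency of one STFT frame."""
    return float(np.sum(freqs * mag) / np.sum(mag))

def spectral_rolloff(mag, freqs, fraction=0.85):
    """Frequency below which `fraction` of the cumulative magnitude lies."""
    cumulative = np.cumsum(mag)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return float(freqs[idx])

# Toy frame: all spectral energy concentrated at 1000 Hz.
freqs = np.linspace(0, 4000, 5)           # [0, 1000, 2000, 3000, 4000]
mag = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
print(spectral_centroid(mag, freqs))      # 1000.0
print(spectral_rolloff(mag, freqs))       # 1000.0
```

In practice these two scalars, together with subband energies and MFCCs, would be computed per frame and aggregated over a shot.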
Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (2/3)
Low-level Image Features
The color feature is the mean and variance of each color channel in HSV (Hue-Saturation-Value) color space over a 5x5 image tessellation.
Another low-level feature is the Canny edge direction histogram.
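As a sketch of the color feature, assuming an image already converted to HSV, the per-cell channel means and variances over a 5x5 tessellation can be computed as follows (the helper name and cell-boundary handling are illustrative):

```python
import numpy as np

def color_feature(hsv_image, grid=5):
    """Mean and variance of each HSV channel over a grid x grid tessellation.

    hsv_image: (H, W, 3) array already in HSV space.
    Returns a vector of length grid * grid * 3 channels * 2 statistics.
    """
    h, w, _ = hsv_image.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = hsv_image[i * h // grid:(i + 1) * h // grid,
                             j * w // grid:(j + 1) * w // grid]
            for c in range(3):
                feats.append(cell[..., c].mean())
                feats.append(cell[..., c].var())
    return np.array(feats)

img = np.zeros((50, 50, 3))   # uniform image -> all means/variances are 0
vec = color_feature(img)
print(vec.shape)              # (150,)
```

The resulting 150-dimensional vector is the kind of fixed-length descriptor that can be fed to the SVM classifiers described later.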
Face Features
Schneiderman’s face detector algorithm
Size and position of the largest face are used as additional face features
Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (3/3)
Text-based features: the most reliable high-level feature
Automatic Speech Recognition (ASR) transcripts and Video Optical Character Recognition (VOCR)
Video OCR (VOCR): Manber and Wu’s approximate string matching technique, e.g. “Clinton” may retrieve “Cllnton”, “Ciintonfi”, “Cltnton” and “Clinton”. However, it also matches incorrect text such as “EIICKINSON” (for “DICKINSON”) and “Cincintoli” (for “Cincinnati”).
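The approximate-matching idea can be illustrated with Python's standard difflib as a stand-in for Manber and Wu's bitap algorithm; the similarity threshold and the OCR tokens below are our toy choices:

```python
from difflib import SequenceMatcher

def fuzzy_match(query, candidate, threshold=0.8):
    """True if candidate is within an edit-similarity threshold of query."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio() >= threshold

# Hypothetical VOCR output tokens, including typical recognition errors.
ocr_tokens = ["Cllnton", "Ciinton", "Cltnton", "Clinton", "Congress"]
hits = [t for t in ocr_tokens if fuzzy_match("Clinton", t)]
print(hits)   # ['Cllnton', 'Ciinton', 'Cltnton', 'Clinton']
```

The threshold trades recall of garbled names against false matches like the “EIICKINSON” example above.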
Fisher Linear Discriminant for Anchors and Commercials (1/2)
Multi-modal combination approach: apply FLD to every feature set and synthesize new feature vectors
Use these synthesized feature vectors to represent the content, then apply standard feature-vector classification approaches.
Two different SVM-based classifiers:
anchor: color histogram, face info., and speaker info.
commercial: color histogram and audio feature
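A minimal two-class Fisher Linear Discriminant sketch, assuming the classic closed form w ∝ Sw⁻¹(m₊ − m₋); the synthetic data and the small regularization term are our additions:

```python
import numpy as np

def fld_direction(X_pos, X_neg):
    """Two-class Fisher discriminant direction w = Sw^{-1} (m_pos - m_neg)."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter as the sum of the two class covariances, lightly
    # regularized so the solve is stable.
    Sw = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), m_pos - m_neg)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(100, 2))   # toy "anchor" class
X_neg = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(100, 2))  # toy background
w = fld_direction(X_pos, X_neg)
# The projection X @ w is the synthesized scalar feature; the classes
# separate cleanly along it.
print((X_pos @ w).mean() > (X_neg @ w).mean())   # True
```

Applying this per feature set and concatenating the projections gives the kind of synthesized vector the slide describes feeding into the SVM classifiers.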
Fisher Linear Discriminant for Anchors and Commercials (2/2)
FLD weights for anchor detection
Anchor and Commercial classifier result
Feature Classifiers (1/7)
Baseline SVM Classifier with Common Annotation Data
SVM with a power=2 polynomial kernel
Uses only image features (no face)
Performs video-based cross-validation with portions of the common annotation data
MAP per feature:
Outdoors 0.112
Buildings 0.071
Roads 0.028
Vegetation 0.112
Cars 0.040
Aircraft 0.059
Sports 0.051
Weather News 0.017
Physical violence 0.012
Animals 0.017
Feature Classifiers (2/7)
Building Detection
Explore a classifier by adapting the man-made structure detection method of Kumar and Hebert
This method produces binary detection outputs for each of 22x16 grid cells; 5 features are extracted from the binary outputs: number of positive cells; area of the bounding box that includes all positive cells; x and y coordinates of the center of mass of the positive cells; ratio of width to height; compactness
462 images are used as positive examples and 495 images as negative examples, with FLD and SVM
MAP 0.042 (man-made structures) vs. 0.071 (baseline SVM)
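The grid-derived features can be sketched as follows. The paper counts the x and y center-of-mass coordinates as a single feature; our compactness definition (positive cells divided by bounding-box area) is an assumption, since the slide does not define it:

```python
import numpy as np

def grid_features(binary):
    """Summarize a binary detection grid (e.g. 16 rows x 22 cols) into features:
    count, bbox area, center of mass (x, y), aspect ratio, compactness."""
    ys, xs = np.nonzero(binary)
    n = len(xs)
    if n == 0:
        return np.zeros(6)
    width = xs.max() - xs.min() + 1
    height = ys.max() - ys.min() + 1
    bbox_area = width * height
    cx, cy = xs.mean(), ys.mean()        # center of mass of positive cells
    aspect = width / height
    compactness = n / bbox_area          # assumed: fill fraction of the bbox
    return np.array([n, bbox_area, cx, cy, aspect, compactness])

grid = np.zeros((16, 22), dtype=int)
grid[4:8, 5:10] = 1                      # a solid 4 x 5 block of positives
f = grid_features(grid)
print(f.tolist())                        # [20.0, 20.0, 7.0, 5.5, 1.25, 1.0]
```

This vector is then what FLD and the SVM would consume per image.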
Feature Classifiers (3/7)
Plane Detection using additional still-image data
Uses the image features described above
3368 plane examples, selected from the web, the Corel data set, and the University of Oxford data set, as positive examples; 3516 negative examples
With FLD and SVM, MAP 0.008 vs. 0.059 (baseline)
Car Detection
Modifies the Schneiderman face detector algorithm
Outperforms the baseline with MAP 0.114 vs. 0.040
Feature Classifiers (4/7)
Zoom Detection
Uses MPEG motion vectors to estimate the probability of a zoom pattern
MAP 0.632
Female Speech
Uses an SVM trained on the LIMSI-provided speech features, together with the face characteristics
MAP 0.465
Feature Classifiers (5/7)
Text and Timing for Weather News, Outdoors, Sporting Event, Physical Violence and Person X Classifiers
Models based only on text info. are better than random baselines on the development data
Feature Classifiers (6/7)
Timing info. captures the implicit temporal structure of broadcast news, which is especially strong for weather reports and sports.
Feature Classifiers (7/7)
For each shot, predictions from both the text-based and timing-based classifiers have to be considered
Except for weather news, the results suggest that the text info. of the broadcast news in a shot may not be enough to detect these high-level features.
News Subject Monologues (1/2)
Based on the LIMSI speech annotations, a voice-over detector and a frequent-speaker detector were developed
VOCR is applied to extract overlaid text in the hope of finding people's names
News Subject Monologues (2/2)
Another feature measures the average amount of motion in a camera shot, based on frame difference
also use commercial and anchor detectors
combine individual detectors and features by using two well-known classifier combination schemes, namely stacking and bagging
MAP 0.616
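The stacking scheme mentioned above can be sketched as a meta-classifier that consumes base-detector scores as its feature vector. Here the meta-classifier is a fixed linear rule purely for illustration; in practice its weights would be learned, and bagging would resample the training data instead:

```python
import numpy as np

def stack_predict(base_scores, meta_w, meta_b=0.0):
    """Stacking: base-detector outputs become the meta-classifier's features.

    base_scores: (n_shots, n_detectors) scores from e.g. voice-over,
    frequent-speaker, and commercial detectors (hypothetical here).
    meta_w, meta_b: weights/bias of a linear meta-classifier.
    """
    return (base_scores @ meta_w + meta_b) > 0

# Invented scores from three detectors for four shots.
scores = np.array([[0.9, 0.8, 0.1],
                   [0.2, 0.1, 0.9],
                   [0.7, 0.6, 0.3],
                   [0.1, 0.2, 0.8]])
# Fixed meta-weights: trust the first two detectors, treat the third
# (say, the commercial detector) as negative evidence.
pred = stack_predict(scores, np.array([1.0, 1.0, -1.0]), meta_b=-0.5)
print(pred.tolist())   # [True, False, True, False]
```

The point of stacking is exactly this indirection: the meta-level sees only detector outputs, so detectors can be added or retrained independently.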
Finding Person X in Broadcast News (1/3)
Use text info. from a transcript and face info.
Relationship between the name of person X and time
S: one shot; T_S: time of the key frame; T_O: time of the person's name
P_text(S) = P_name(S) · P_anchor(S), where P_name(S) = p(T_S − T_O)
Finding Person X in Broadcast News (2/3)
More limited face recognition based on video shots
Collect sample faces {F1, F2, …, Fn} for person X and all faces {f1, f2, …, fm} from the I-frames of the news shots for which P_text is larger than zero
Build the eigenspace for the faces {f1, f2, …, fm, F1, F2, …, Fn} and represent them by the eigenfaces {eigf1, eigf2, …, eigfm, eigF1, …, eigFn}
Combine the rank scores to estimate which shots are most likely to contain that face:
R(eigf_i) = (1/n) Σ_{j=1}^{n} r_j(eigf_i), where r_j(eigf_i) is the rank of shot face eigf_i with respect to sample face eigF_j
S_face(S) = (1/k) Σ_{eigf_i ∈ S} R(eigf_i), averaging over the k faces detected in shot S
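One plausible reading of this rank combination, as a pure-Python sketch; the distance-matrix layout and the averaging scheme are our reconstruction:

```python
def average_rank_score(distances):
    """distances[i][j]: eigenspace distance from shot face i to sample face j.

    Each shot face gets the average of its ranks (1 = closest) across the
    n sample faces of person X; the shot score averages over its m faces
    (lower scores mean the shot is more likely to contain person X).
    """
    m = len(distances)            # faces found in the shot
    n = len(distances[0])         # sample faces of person X
    ranks = [[0] * n for _ in range(m)]
    for j in range(n):
        order = sorted(range(m), key=lambda i: distances[i][j])
        for rank, i in enumerate(order, start=1):
            ranks[i][j] = rank
    face_scores = [sum(r) / n for r in ranks]
    return sum(face_scores) / m

# Two shot faces vs. three sample faces: face 0 is consistently closer.
d = [[0.1, 0.2, 0.1],
     [0.9, 0.8, 0.7]]
print(average_rank_score(d))   # 1.5 (face 0 always rank 1, face 1 always rank 2)
```

Rank-based combination avoids having to calibrate raw eigenspace distances across different sample faces.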
Finding Person X in Broadcast News (3/3)
Using “Madeleine Albright” as person X, we obtained 20 faces from a Google image search as sample query faces.
Learning Combination Weights in Manual Retrieval (1/5)
In shot-based video retrieval, a set of features is extracted
Each shot is associated with a vector of individual retrieval scores from the different media search modules
Finally, these retrieval scores are fused into a final ordered list via some aggregation algorithm
Learning Combination Weights in Manual Retrieval (2/5)
Use the weighted Borda fuse model as the basic combination approach for multiple search modules; for each shot, its final score is
y = Σ_{i=1}^{n} w_i · s_i
where s_i is the retrieval score from the i-th search module and w_i its weight.
Similarity Measures
For video frames, a harmonic mean of the Euclidean distances (color, texture, edge) from each query image is computed as the distance between the query and a video frame
For text, retrieval over the CC and OCR transcripts uses the OKAPI BM-25 formula
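The weighted fusion can be sketched directly. The per-shot module scores below are invented; the weights follow the person-query baseline given later in the slides (text 2, face 1, color 1, anchor 0):

```python
def fuse(scores, weights):
    """Weighted linear fusion: y = sum_i w_i * s_i over search modules."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical per-shot scores from the (text, face, color, anchor) modules.
shot_scores = {"shot_1": [0.8, 0.9, 0.4, 0.7],
               "shot_2": [0.9, 0.1, 0.6, 0.2]}
weights = [2, 1, 1, 0]   # person-query baseline weights
ranked = sorted(shot_scores,
                key=lambda s: fuse(shot_scores[s], weights),
                reverse=True)
print(ranked)   # ['shot_1', 'shot_2']
```

Changing the weight vector per query type, or learning it from labeled data, is exactly the subject of the following slides.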
Learning Combination Weights in Manual Retrieval (3/5)
Negative Pseudo-Relevance Feedback (NPRF)
NPRF is effective at providing a more adaptive similarity measure for image retrieval
We propose a better strategy to sample negative examples, inspired by Maximal Marginal Relevance
Maximal Marginal Irrelevance (MMIR):
MMIR = argmin_{D_i ∈ T\S} [ λ · Sim_1(D_i, Q) + (1 − λ) · max_{D_j ∈ S} Sim_2(D_i, D_j) ]
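A greedy sketch of MMIR-style negative sampling: documents are picked one at a time to minimize a blend of similarity to the query and redundancy with already-selected negatives. The sign convention and the toy one-dimensional similarity are our assumptions:

```python
def mmir_select(candidates, sim_to_query, sim, k, lam=0.5):
    """Greedily pick k negative pseudo-examples that are dissimilar both
    to the query and to each other (Maximal Marginal Irrelevance)."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * sim_to_query[d] + (1 - lam) * redundancy
        best = min(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy example: docs on a line; similarity decays with distance.
pos = {"a": 0.0, "b": 0.1, "c": 5.0, "d": 9.0}
query = 10.0
sim_to_query = {d: 1.0 / (1.0 + abs(pos[d] - query)) for d in pos}
sim = lambda x, y: 1.0 / (1.0 + abs(pos[x] - pos[y]))
print(mmir_select(list(pos), sim_to_query, sim, k=2))   # ['a', 'c']
```

Note that 'b' is skipped despite being far from the query, because it is nearly identical to the already-selected 'a'; that diversity among negatives is the point of MMIR over naive bottom-of-the-ranking sampling.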
Learning Combination Weights in Manual Retrieval (4/5)
The Value of Intermediate-level Detectors
The text-based feature is good at global ranking, while the other features are useful for refining the ranking afterwards
Learning Weights for each Modality in Video Retrieval
Baseline: setting weights based on query type
Person query: w = (text 2, face 1, color 1, anchor 0)
Non-person query: w = (text 2, face -1, color 1, anchor -1)
Aircraft and animal: w = (text 2, face -1, edge 1, anchor -1)
Learning Combination Weights in Manual Retrieval (5/5)
Learning weights using a labeled training set: supervised learning algorithm on the development set
Co-Retrieval: a set of video shots is first labeled as relevant using text-based features, and the results are augmented by learning with the other visual and intermediate-level features
Experimental results
Interactive TREC Video Retrieval Evaluation for 2003 (1/2)
This interface has the following features:
Storyboards of images spanning video story segments
Emphasis on shots matching a user's query, to reduce the image count
Resolution and layout under user control
Additional filtering provided through shot classifiers
Display of filter counts and distributions to guide manipulation of storyboard views
Interactive TREC Video Retrieval Evaluation for 2003 (2/2)
Conclusions
We believe the browsing interfaces and image-based search improvements made for 2003 led to the increased performance of the new system, as these strategies allowed relevant content to be found even when it had no associated narrative or text metadata.