Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video
TRECVID 2003
Carnegie Mellon University
A. Hauptmann, R.V. Baron, M.-Y. Chen, M. Christel, P. Duygulu, C. Huang, R. Jin, W.-H. Lin, T. Ng, N. Moraveji, N. Papernick, C.G.M. Snoek, G. Tzanetakis, J. Yang, R. Yang, and H.D. Wactlar
Overview (1/3)
TRECVID 2003
Shot boundary determination: identify the shot boundaries in the given video clip(s)
Story segmentation: identify the story boundaries and types (miscellaneous or news)
High-level feature extraction: Outdoors, News subject face, People, Building, Road, Animal, ...
Search: given the search test collection and a multimedia statement of information need (topic), return a ranked list of common reference shots from the test collection
Overview (2/3)
Search
Interactive Search
Manual Search
Overview (3/3)
Semantic Classifiers: most are trained on keyframes
Interactive Search: allow more effective browsing and visualization of the results of text queries using a variety of filter strategies
Manual Search: use multiple retrieval agents (color, texture, ASR, OCR and some of the classifiers, e.g. anchor, Person X); Negative Pseudo-Relevance; Co-Retrieval. Even the text-based baseline using the OKAPI formula performed better than other groups.
Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (1/3)
Audio Features
These features assist the extraction of the following medium-level audio-based features: music, male speech, female speech, and noise.
Based on the magnitude spectrum calculated using a Short-Time Fourier Transform (STFT).
They consist of features that summarize the overall spectral characteristics: Spectral Centroid, Rolloff, Relative Subband Energies, and the Mel-Frequency Cepstral Coefficients.
Male/female speech: using the Average Magnitude Difference Function (AMDF)
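The spectral summary features above can be sketched directly from one STFT magnitude frame. A minimal numpy illustration follows; the function names, the 85% rolloff fraction, and the toy frame are our choices, not the paper's:

```python
import numpy as np

def spectral_centroid(mag, freqs):
    """Magnitude-weighted mean frequency of one STFT frame."""
    return float(np.sum(freqs * mag) / np.sum(mag))

def spectral_rolloff(mag, freqs, fraction=0.85):
    """Frequency below which `fraction` of the cumulative magnitude lies."""
    cumulative = np.cumsum(mag)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return float(freqs[idx])

# Toy frame: all spectral energy concentrated at 1000 Hz.
freqs = np.linspace(0, 4000, 5)           # [0, 1000, 2000, 3000, 4000]
mag = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
print(spectral_centroid(mag, freqs))      # 1000.0
print(spectral_rolloff(mag, freqs))       # 1000.0
```

In practice these two scalars, together with subband energies and MFCCs, would be computed per frame and aggregated over a shot.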
Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (2/3)
Low-level Image Features
The color feature is the mean and variance of each color channel in HSV (Hue-Saturation-Value) color space over a 5x5 image tessellation.
Another low-level feature is the Canny edge direction histogram.
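As a sketch of the color feature, assuming an image already converted to HSV, the per-cell channel means and variances over a 5x5 tessellation can be computed as follows (the helper name and cell-boundary handling are illustrative):

```python
import numpy as np

def color_feature(hsv_image, grid=5):
    """Mean and variance of each HSV channel over a grid x grid tessellation.

    hsv_image: (H, W, 3) array already in HSV space.
    Returns a vector of length grid * grid * 3 channels * 2 statistics.
    """
    h, w, _ = hsv_image.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = hsv_image[i * h // grid:(i + 1) * h // grid,
                             j * w // grid:(j + 1) * w // grid]
            for c in range(3):
                feats.append(cell[..., c].mean())
                feats.append(cell[..., c].var())
    return np.array(feats)

img = np.zeros((50, 50, 3))   # uniform image -> all means/variances are 0
vec = color_feature(img)
print(vec.shape)              # (150,)
```

The resulting 150-dimensional vector is the kind of fixed-length descriptor that can be fed to the SVM classifiers described later.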
Face Features
Schneiderman’s face detector algorithm
Size and position of the largest face are used as additional face features
Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (3/3)
Text-based features: the most reliable high-level feature
Automatic Speech Recognition (ASR) transcripts and Video Optical Character Recognition (VOCR)
Video OCR (VOCR): Manber and Wu’s approximate string matching technique, e.g. “Clinton” may retrieve “Cllnton”, “Ciintonfi”, “Cltnton” and “Clinton”. However, it also matches incorrect text such as “EIICKINSON” (for “DICKINSON”) and “Cincintoli” (for “Cincinnati”).
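The approximate-matching idea can be illustrated with Python's standard difflib as a stand-in for Manber and Wu's bitap algorithm; the similarity threshold and the OCR tokens below are our toy choices:

```python
from difflib import SequenceMatcher

def fuzzy_match(query, candidate, threshold=0.8):
    """True if candidate is within an edit-similarity threshold of query."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio() >= threshold

# Hypothetical VOCR output tokens, including typical recognition errors.
ocr_tokens = ["Cllnton", "Ciinton", "Cltnton", "Clinton", "Congress"]
hits = [t for t in ocr_tokens if fuzzy_match("Clinton", t)]
print(hits)   # ['Cllnton', 'Ciinton', 'Cltnton', 'Clinton']
```

The threshold trades recall of garbled names against false matches like the “EIICKINSON” example above.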
Fisher Linear Discriminant for Anchors and Commercials (1/2)
Multi-modal combination approach: apply FLD to every feature set and synthesize new feature vectors
Use these synthesized feature vectors to represent the content, then apply standard feature-vector classification approaches.
Two different SVM-based classifiers:
anchor: color histogram, face info., and speaker info.
commercial: color histogram and audio feature
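A minimal two-class Fisher Linear Discriminant sketch, assuming the classic closed form w ∝ Sw⁻¹(m₊ − m₋); the synthetic data and the small regularization term are our additions:

```python
import numpy as np

def fld_direction(X_pos, X_neg):
    """Two-class Fisher discriminant direction w = Sw^{-1} (m_pos - m_neg)."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter as the sum of the two class covariances, lightly
    # regularized so the solve is stable.
    Sw = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), m_pos - m_neg)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(100, 2))   # toy "anchor" class
X_neg = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(100, 2))  # toy background
w = fld_direction(X_pos, X_neg)
# The projection X @ w is the synthesized scalar feature; the classes
# separate cleanly along it.
print((X_pos @ w).mean() > (X_neg @ w).mean())   # True
```

Applying this per feature set and concatenating the projections gives the kind of synthesized vector the slide describes feeding into the SVM classifiers.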
Fisher Linear Discriminant for Anchors and Commercials (2/2)
FLD weights for anchor detection
Anchor and Commercial classifier result
Feature Classifiers (1/7)
Baseline SVM Classifier with Common Annotation Data
SVM with a power=2 polynomial kernel
Uses only image features (no face)
Performs video-based cross-validation with portions of the common annotation data
MAP per feature:
Outdoors 0.112
Buildings 0.071
Roads 0.028
Vegetation 0.112
Cars 0.040
Aircraft 0.059
Sports 0.051
Weather News 0.017
Physical violence 0.012
Animals 0.017
Feature Classifiers (2/7)
Building Detection
Explore a classifier by adapting the man-made structure detection method of Kumar and Hebert
This method produces binary detection outputs for each of 22x16 grid cells; 5 features are extracted from the binary outputs: number of positive cells; area of the bounding box that includes all positive cells; x and y coordinates of the center of mass of the positive cells; ratio of width to height; compactness
462 images are used as positive examples and 495 images as negative examples, with FLD and SVM
MAP 0.042 (man-made structures) vs. 0.071 (baseline SVM)
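The grid-derived features can be sketched as follows. The paper counts the x and y center-of-mass coordinates as a single feature; our compactness definition (positive cells divided by bounding-box area) is an assumption, since the slide does not define it:

```python
import numpy as np

def grid_features(binary):
    """Summarize a binary detection grid (e.g. 16 rows x 22 cols) into features:
    count, bbox area, center of mass (x, y), aspect ratio, compactness."""
    ys, xs = np.nonzero(binary)
    n = len(xs)
    if n == 0:
        return np.zeros(6)
    width = xs.max() - xs.min() + 1
    height = ys.max() - ys.min() + 1
    bbox_area = width * height
    cx, cy = xs.mean(), ys.mean()        # center of mass of positive cells
    aspect = width / height
    compactness = n / bbox_area          # assumed: fill fraction of the bbox
    return np.array([n, bbox_area, cx, cy, aspect, compactness])

grid = np.zeros((16, 22), dtype=int)
grid[4:8, 5:10] = 1                      # a solid 4 x 5 block of positives
f = grid_features(grid)
print(f.tolist())                        # [20.0, 20.0, 7.0, 5.5, 1.25, 1.0]
```

This vector is then what FLD and the SVM would consume per image.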
Feature Classifiers (3/7)
Plane Detection using additional still-image data
Uses the image features described above
3368 plane examples, selected from the web, the Corel data set, and the University of Oxford data set, as positive examples; 3516 negative examples
With FLD and SVM, MAP 0.008 vs. 0.059 (baseline)
Car Detection
Modifies the Schneiderman face detector algorithm
Outperforms the baseline with MAP 0.114 vs. 0.040
Feature Classifiers (4/7)
Zoom Detection
Uses MPEG motion vectors to estimate the probability of a zoom pattern
MAP 0.632
Female Speech
Uses an SVM trained on the LIMSI-provided speech features, together with the face characteristics
MAP 0.465
Feature Classifiers (5/7)
Text and Timing for Weather News, Outdoors, Sporting Event, Physical Violence and Person X Classifiers
Models based only on text info. are better than random baselines on the development data
Feature Classifiers (6/7)
Timing info. captures the implicit temporal structure of broadcast news, which is especially strong for weather reports and sports.
Feature Classifiers (7/7)
For each shot, predictions from both the text-based and timing-based classifiers have to be considered
Except for weather news, the results suggest that the text info. of the broadcast news in a shot may not be enough to detect these high-level features.
News Subject Monologues (1/2)
Based on the LIMSI speech annotations, a voice-over detector and a frequent-speaker detector were developed
VOCR is applied to extract overlaid text in the hope of finding people's names
News Subject Monologues (2/2)
Another feature measures the average amount of motion in a camera shot, based on frame difference
also use commercial and anchor detectors
combine individual detectors and features by using two well-known classifier combination schemes, namely stacking and bagging
MAP 0.616
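The stacking scheme mentioned above can be sketched as a meta-classifier that consumes base-detector scores as its feature vector. Here the meta-classifier is a fixed linear rule purely for illustration; in practice its weights would be learned, and bagging would resample the training data instead:

```python
import numpy as np

def stack_predict(base_scores, meta_w, meta_b=0.0):
    """Stacking: base-detector outputs become the meta-classifier's features.

    base_scores: (n_shots, n_detectors) scores from e.g. voice-over,
    frequent-speaker, and commercial detectors (hypothetical here).
    meta_w, meta_b: weights/bias of a linear meta-classifier.
    """
    return (base_scores @ meta_w + meta_b) > 0

# Invented scores from three detectors for four shots.
scores = np.array([[0.9, 0.8, 0.1],
                   [0.2, 0.1, 0.9],
                   [0.7, 0.6, 0.3],
                   [0.1, 0.2, 0.8]])
# Fixed meta-weights: trust the first two detectors, treat the third
# (say, the commercial detector) as negative evidence.
pred = stack_predict(scores, np.array([1.0, 1.0, -1.0]), meta_b=-0.5)
print(pred.tolist())   # [True, False, True, False]
```

The point of stacking is exactly this indirection: the meta-level sees only detector outputs, so detectors can be added or retrained independently.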
Finding Person X in Broadcast News (1/3)
Use text info. from a transcript and face info.
Relationship between the name of person X and time
S: one shot; T_S: time of the key frame; T_O: time of the person's name
P_text(S) = P_name(S) · P_anchor(S), where P_name(S) = p(T_S − T_O)
Finding Person X in Broadcast News (2/3)
More limited face recognition based on video shots
Collect sample faces {F1, F2, …, Fn} for person X and all faces {f1, f2, …, fm} from the I-frames of the news shots for which P_text is larger than zero
Build the eigenspace for the faces {f1, f2, …, fm, F1, F2, …, Fn} and represent them by the eigenfaces {eigf1, eigf2, …, eigfm, eigF1, …, eigFn}
Combine the rank scores to estimate which shots are most likely to contain that face:
R(eigf_i) = (1/n) Σ_{j=1}^{n} r_j(eigf_i), where r_j(eigf_i) is the rank of shot face eigf_i with respect to sample face eigF_j
S_face(S) = (1/k) Σ_{eigf_i ∈ S} R(eigf_i), averaging over the k faces detected in shot S
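One plausible reading of this rank combination, as a pure-Python sketch; the distance-matrix layout and the averaging scheme are our reconstruction:

```python
def average_rank_score(distances):
    """distances[i][j]: eigenspace distance from shot face i to sample face j.

    Each shot face gets the average of its ranks (1 = closest) across the
    n sample faces of person X; the shot score averages over its m faces
    (lower scores mean the shot is more likely to contain person X).
    """
    m = len(distances)            # faces found in the shot
    n = len(distances[0])         # sample faces of person X
    ranks = [[0] * n for _ in range(m)]
    for j in range(n):
        order = sorted(range(m), key=lambda i: distances[i][j])
        for rank, i in enumerate(order, start=1):
            ranks[i][j] = rank
    face_scores = [sum(r) / n for r in ranks]
    return sum(face_scores) / m

# Two shot faces vs. three sample faces: face 0 is consistently closer.
d = [[0.1, 0.2, 0.1],
     [0.9, 0.8, 0.7]]
print(average_rank_score(d))   # 1.5 (face 0 always rank 1, face 1 always rank 2)
```

Rank-based combination avoids having to calibrate raw eigenspace distances across different sample faces.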
Finding Person X in Broadcast News (3/3)
Using “Madeleine Albright” as person X, we obtained 20 faces from a Google image search as sample query faces.
Learning Combination Weights in Manual Retrieval (1/5)
In shot-based video retrieval, a set of features is extracted
Each shot is associated with a vector of individual retrieval scores from the different media search modules
Finally, these retrieval scores are fused into a final ordered list via some aggregation algorithm
Learning Combination Weights in Manual Retrieval (2/5)
Use the weighted Borda fuse model as the basic combination approach for multiple search modules; for each shot, its final score is
y = Σ_{i=1}^{n} w_i · s_i
where s_i is the retrieval score from the i-th search module and w_i its weight.
Similarity Measures
For video frames, a harmonic mean of the Euclidean distances (color, texture, edge) from each query image is computed as the distance between the query and a video frame
For text, retrieval over the CC and OCR transcripts uses the OKAPI BM-25 formula
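The weighted fusion can be sketched directly. The per-shot module scores below are invented; the weights follow the person-query baseline given later in the slides (text 2, face 1, color 1, anchor 0):

```python
def fuse(scores, weights):
    """Weighted linear fusion: y = sum_i w_i * s_i over search modules."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical per-shot scores from the (text, face, color, anchor) modules.
shot_scores = {"shot_1": [0.8, 0.9, 0.4, 0.7],
               "shot_2": [0.9, 0.1, 0.6, 0.2]}
weights = [2, 1, 1, 0]   # person-query baseline weights
ranked = sorted(shot_scores,
                key=lambda s: fuse(shot_scores[s], weights),
                reverse=True)
print(ranked)   # ['shot_1', 'shot_2']
```

Changing the weight vector per query type, or learning it from labeled data, is exactly the subject of the following slides.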
Learning Combination Weights in Manual Retrieval (3/5)
Negative Pseudo-Relevance Feedback (NPRF)
NPRF is effective at providing a more adaptive similarity measure for image retrieval
We propose a better strategy to sample negative examples, inspired by Maximal Marginal Relevance
Maximal Marginal Irrelevance (MMIR):
MMIR = argmin_{D_i ∈ T\S} [ λ · Sim_1(D_i, Q) + (1 − λ) · max_{D_j ∈ S} Sim_2(D_i, D_j) ]
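A greedy sketch of MMIR-style negative sampling: documents are picked one at a time to minimize a blend of similarity to the query and redundancy with already-selected negatives. The sign convention and the toy one-dimensional similarity are our assumptions:

```python
def mmir_select(candidates, sim_to_query, sim, k, lam=0.5):
    """Greedily pick k negative pseudo-examples that are dissimilar both
    to the query and to each other (Maximal Marginal Irrelevance)."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * sim_to_query[d] + (1 - lam) * redundancy
        best = min(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy example: docs on a line; similarity decays with distance.
pos = {"a": 0.0, "b": 0.1, "c": 5.0, "d": 9.0}
query = 10.0
sim_to_query = {d: 1.0 / (1.0 + abs(pos[d] - query)) for d in pos}
sim = lambda x, y: 1.0 / (1.0 + abs(pos[x] - pos[y]))
print(mmir_select(list(pos), sim_to_query, sim, k=2))   # ['a', 'c']
```

Note that 'b' is skipped despite being far from the query, because it is nearly identical to the already-selected 'a'; that diversity among negatives is the point of MMIR over naive bottom-of-the-ranking sampling.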
Learning Combination Weights in Manual Retrieval (4/5)
The Value of Intermediate-level Detectors
The text-based feature is good at global ranking, while the other features are useful for refining the ranking afterwards
Learning Weights for each Modality in Video Retrieval
Baseline: setting weights based on query type
Person query: w = (text 2, face 1, color 1, anchor 0)
Non-person query: w = (text 2, face -1, color 1, anchor -1)
Aircraft and animal: w = (text 2, face -1, edge 1, anchor -1)
Learning Combination Weights in Manual Retrieval (5/5)
Learning weights using a labeled training set: supervised learning algorithm on the development set
Co-Retrieval: a set of video shots is first labeled as relevant using text-based features, and the results are augmented by learning with the other visual and intermediate-level features
Experimental results
Interactive TREC Video Retrieval Evaluation for 2003 (1/2)
This interface has the following features:
Storyboards of images spanning video story segments
Emphasis on shots matching a user's query, to reduce the image count
Resolution and layout under user control
Additional filtering provided through shot classifiers
Display of filter counts and distributions to guide manipulation of storyboard views
Interactive TREC Video Retrieval Evaluation for 2003 (2/2)
Conclusions
We believe the browsing interfaces and image-based search improvements made for 2003 led to the increased performance of the new system, as these strategies allowed relevant content to be found even when it had no associated narrative or text metadata.