Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video TRECVID 2003 Carnegie...

Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video TRECVID 2003 Carnegie Mellon University A. Hauptmann, R.V. Bron, M.-Y. Chen, M. Christel, P. Duygulu, C. Huang, R. Jin, W.-H. Lin, T. Ng, N. Moraveji, N. Papernick, C.G.M. Snoek, G. Tzanetakis, J. Yang, R. Yang, and H.D. Wactlar


Page 1: Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video TRECVID 2003 Carnegie Mellon University A. Hauptmann, R.V. Bron, M.-Y. Chen, M.Christel,

Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video

TRECVID 2003

Carnegie Mellon University

A. Hauptmann, R.V. Bron, M.-Y. Chen, M. Christel, P. Duygulu, C. Huang, R. Jin, W.-H. Lin, T. Ng, N. Moraveji, N. Papernick, C.G.M. Snoek, G. Tzanetakis, J. Yang, R. Yang, and H.D. Wactlar

Page 2

Overview (1/3)

TRECVID 2003

Shot boundary determination: identify the shot boundaries in the given video clip(s)

Story segmentation: identify the story boundaries and types (miscellaneous or news)

High-level feature extraction: Outdoors, News subject face, People, Building, Road, Animal, ...

Search: given the search test collection and a multimedia statement of information need (topic), return a ranked list of common reference shots from the test collection

Page 3

Overview (2/3)

Search

Interactive Search

Manual Search

Page 4

Overview (3/3)

Semantic Classifiers: most are trained on keyframes

Interactive Search: allows more effective browsing and visualization of the results of text queries using a variety of filter strategies

Manual Search: uses multiple retrieval agents (color, texture, ASR, OCR, and some of the classifiers, e.g. anchor, PersonX)

Negative Pseudo-relevance

Co-retrieval

Even the text-based baseline using the OKAPI formula performed better than other groups

Page 5

Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (1/3)

Audio Features

These features assist the extraction of the following medium-level audio-based features: music, male speech, female speech, and noise.

They are based on the magnitude spectrum calculated using a Short Time Fourier Transform.

They consist of features that summarize the overall spectral characteristics: Spectral Centroid, Rolloff, Relative Subband Energies, and the Mel Frequency Cepstral Coefficients.

Male/female: uses the Average Magnitude Difference Function (AMDF).
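The spectral summary features and the AMDF named on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the 0.85 rolloff fraction, and the lag search range are hypothetical, and the slides do not say how the AMDF output is turned into a male/female decision, so the sketch stops at a pitch-period estimate.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of the DFT of one frame (non-negative frequency bins only)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectral_centroid(mags):
    """Magnitude-weighted mean bin index of the spectrum."""
    total = sum(mags)
    return sum(k * m for k, m in enumerate(mags)) / total if total else 0.0


def spectral_rolloff(mags, fraction=0.85):
    """Smallest bin index below which `fraction` of the total magnitude lies.

    The 0.85 default is an assumed, commonly used value.
    """
    target = fraction * sum(mags)
    running = 0.0
    for k, m in enumerate(mags):
        running += m
        if running >= target:
            return k
    return len(mags) - 1


def amdf(frame, lag):
    """Average Magnitude Difference Function of one frame at one lag."""
    n = len(frame) - lag
    if n <= 0:
        raise ValueError("lag must be shorter than the frame")
    return sum(abs(frame[t] - frame[t + lag]) for t in range(n)) / n


def amdf_period(frame, min_lag, max_lag):
    """Lag (in samples) with the deepest AMDF valley, i.e. the pitch period."""
    return min(range(min_lag, max_lag + 1), key=lambda lag: amdf(frame, lag))


# Synthetic frame: a sine with 4 cycles in 64 samples (period 16 samples).
frame = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
mags = magnitude_spectrum(frame)
print(spectral_centroid(mags), spectral_rolloff(mags), amdf_period(frame, 8, 24))
```

A real front end would apply this per windowed frame of the STFT; the direct DFT here is quadratic in the frame length and is used only to keep the sketch dependency-free.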

Page 6

Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (2/3)

Low-level Image Features

The color feature is the mean and variance of each color channel in HSV (Hue-Saturation-Value) color space in a 5*5 image tessellation.

Another low-level feature is the Canny edge direction histogram.
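A minimal sketch of the color feature described above, assuming the image is given as rows of (r, g, b) tuples in 0..255 and is at least 5 pixels on each side. The function name and the ordering of the output vector are hypothetical; hue is averaged naively here even though it is a circular quantity.

```python
import colorsys


def hsv_grid_features(image, grid=5):
    """Mean and variance of H, S, V in each cell of a grid x grid tessellation.

    image: list of rows, each row a list of (r, g, b) tuples in 0..255.
    Returns a flat list of grid * grid * 3 * 2 values.
    """
    height, width = len(image), len(image[0])
    features = []
    for gy in range(grid):
        for gx in range(grid):
            # Pixel bounds of this cell.
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            pixels = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
                      for row in image[y0:y1] for (r, g, b) in row[x0:x1]]
            for channel in range(3):
                values = [p[channel] for p in pixels]
                mean = sum(values) / len(values)
                var = sum((v - mean) ** 2 for v in values) / len(values)
                features.extend([mean, var])
    return features


# A 10x10 pure-red test image yields a 150-dimensional vector.
red = [[(255, 0, 0)] * 10 for _ in range(10)]
print(len(hsv_grid_features(red)))
```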

Face Features

Schneiderman’s face detector algorithm

Size and position of the largest face are used as additional face features

Page 7

Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (3/3)

Text-based features: the most reliable high-level feature

Automatic Speech Transcripts (ASR), Video Optical Character Recognition (VOCR)

Video OCR (VOCR)

Manber and Wu’s approximate string matching technique, e.g. “Clinton” may retrieve “Cllnton”, “Ciintonfi”, “Cltnton” and “Clinton”

However, incorrect text like “EIICKINSON” (for “DICKINSON”), and “Cincintoli” (for “Cincinnati”)
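To illustrate the idea of approximate matching against noisy VOCR output, here is a small sketch that uses plain Levenshtein edit distance. This is a stand-in and not Manber and Wu's algorithm itself; the function names and the error threshold are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_matches(query, candidates, max_errors):
    """Candidates within max_errors edits of the query (case-insensitive)."""
    q = query.lower()
    return [c for c in candidates if edit_distance(q, c.lower()) <= max_errors]


vocr_words = ["Cllnton", "Ciintonfi", "Cltnton", "Clinton", "Cincintoli"]
print(approximate_matches("Clinton", vocr_words, max_errors=3))
```

The threshold trades recall for precision: a tight budget misses heavily corrupted words, a loose one starts matching unrelated words.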

Page 8

Fisher Linear Discriminant for Anchors and Commercials (1/2)

Multimode combination approach: apply FLD to every feature set and synthesize new feature vectors

Use these synthesized feature vectors to represent the content and then apply standard feature vector classification approaches.

Two different SVM-based classifiers:

anchor: color histogram, face info., and speaker info.

commercial: color histogram and audio feature
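The FLD step can be sketched for the two-class, two-dimensional case, where the discriminant direction is w = Sw^-1 (m1 - m0). The sketch assumes that each feature set is reduced to a single FLD projection and that the projections are concatenated into the synthesized vector; the slides do not spell this out, and all function names are hypothetical.

```python
def mean_vector(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def fisher_direction_2d(class0, class1):
    """Fisher linear discriminant direction for 2-D points of two classes."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    a = b = c = 0.0  # within-class scatter matrix [[a, b], [b, c]]
    for points, m in ((class0, m0), (class1, m1)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            a += dx * dx
            b += dx * dy
            c += dy * dy
    det = a * c - b * b
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    d0, d1 = m1[0] - m0[0], m1[1] - m0[1]
    # Multiply the inverse scatter matrix by the mean difference.
    return [(c * d0 - b * d1) / det, (-b * d0 + a * d1) / det]


def project(point, w):
    """Scalar projection of a feature vector onto the FLD direction."""
    return sum(p * wi for p, wi in zip(point, w))


def synthesize(feature_sets, directions):
    """One FLD projection per feature set, concatenated into a new vector."""
    return [project(f, w) for f, w in zip(feature_sets, directions)]


non_anchor = [(1, 2), (2, 1), (2, 3), (3, 2)]
anchor = [(5, 6), (6, 5), (6, 7), (7, 6)]
w = fisher_direction_2d(non_anchor, anchor)
print(w, [project(p, w) for p in anchor])
```

The synthesized vectors would then be passed to a standard classifier such as the SVMs listed on this slide.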

Page 9

Fisher Linear Discriminant for Anchors and Commercials (2/2)

FLD weights for anchor detection

Anchor and Commercial classifier result

Page 10

Feature Classifiers (1/7)

Baseline SVM Classifier with Common Annotation Data

SVM with a degree-2 polynomial kernel

uses only image features (no face)

performs a video-based cross validation with portions of the common annotation data

Feature              MAP
Outdoors             0.112
Buildings            0.071
Roads                0.028
Vegetation           0.112
Cars                 0.040
Aircraft             0.059
Sports               0.051
Weather News         0.017
Physical violence    0.012
Animals              0.017
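For readers unfamiliar with the metric, the following sketch computes standard non-interpolated average precision for one ranked list and its mean over several lists. It assumes the MAP figures above follow this usual definition; the function names are hypothetical.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Non-interpolated average precision of one ranked result list.

    ranked_relevance: booleans, True where the shot at that rank is relevant.
    total_relevant: number of relevant shots in the collection; defaults to
    the number of relevant shots that appear in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    denom = total_relevant if total_relevant is not None else hits
    return precision_sum / denom if denom else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over several ranked lists."""
    return sum(average_precision(r) for r in runs) / len(runs)


print(average_precision([True, False, True, False]))
```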

Page 11

Feature Classifiers (2/7)

Building Detection

explores a classifier adapted from the man-made structure detection method by Kumar and Hebert

this method produces binary detection outputs for each of 22*16 grids; 5 features are extracted from the binary detection outputs: number of positive grids; area of the bounding box that includes all the positive grids; x and y coordinates of the center of mass of the bounding grids; ratio of the width and height; compactness

462 images are used as positive examples, and 495 images are used as negative examples

By FLD and SVM, MAP 0.042 (man-made structures) vs. 0.071 (baseline SVM)
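The grid-level features listed on this slide could be computed along these lines. Several details are assumptions: the grid is taken as 16 rows by 22 columns, the center of mass is computed over the positive cells, and compactness is taken as the fraction of the bounding box that is positive, since the slides do not define it. The function name and dictionary keys are hypothetical.

```python
def grid_features(grid):
    """Summary features of a binary detection grid (rows of 0/1 values)."""
    cells = [(x, y) for y, row in enumerate(grid)
             for x, v in enumerate(row) if v]
    if not cells:
        return {"count": 0, "bbox_area": 0, "center": None,
                "aspect": None, "compactness": None}
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    width = max(xs) - min(xs) + 1    # bounding box size in grid cells
    height = max(ys) - min(ys) + 1
    count = len(cells)
    return {
        "count": count,
        "bbox_area": width * height,
        "center": (sum(xs) / count, sum(ys) / count),
        "aspect": width / height,
        "compactness": count / (width * height),  # assumed definition
    }


# 16 x 22 grid with a 3 x 2 block of positive detections.
grid = [[0] * 22 for _ in range(16)]
for y in (2, 3):
    for x in (4, 5, 6):
        grid[y][x] = 1
print(grid_features(grid))
```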

Page 12

Feature Classifiers (3/7)

Plane Detection using additional still image data

uses the image features described above

3368 plane examples are selected from the web, the Corel data set, and the University of Oxford data set as positive examples

3516 negative examples

By FLD and SVM, MAP 0.008 vs. 0.059 (baseline)

Car Detection

modifies the Schneiderman face detector algorithm

Outperforms the baseline with MAP 0.114 vs. 0.040

Page 13

Feature Classifiers (4/7)

Zoom Detection

uses MPEG motion vectors to estimate the probability of a zoom pattern

MAP 0.632

Female Speech

uses an SVM trained on the LIMSI-provided speech features, together with the face characteristics

MAP 0.465

Page 14

Feature Classifiers (5/7)

Text and Timing for Weather News, Outdoors, Sporting Event, Physical Violence and Person X Classifiers

Models based only on text info. are better than random baselines on the development data

Page 15

Feature Classifiers (6/7)

Timing info. captures the implicit temporal structure of broadcast news, especially for weather reports and sports.

Page 16

Feature Classifiers (7/7)

For each shot, the predictions from both the text-based and the timing-based classifiers have to be considered

Except for weather news, the results suggest the text info. of the broadcast news in the shot may not be enough to detect these high-level features.

Page 17

News Subject Monologues (1/2)

Based on the LIMSI speech annotations, they developed a voice-over detector and a frequent speaker detector

VOCR is applied to extract overlaid text in the hope of finding people's names

Page 18

News Subject Monologues (2/2)

Another feature measures the average amount of motion in a camera shot, based on frame difference

also use commercial and anchor detectors

combine individual detectors and features by using two well-known classifier combination schemes, namely stacking and bagging

MAP 0.616
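The motion feature mentioned on this slide (average amount of motion in a shot, based on frame difference) can be sketched as the mean absolute pixel difference between consecutive frames. The grayscale list-of-rows frame format and the function name are assumptions; the slides do not give the exact difference measure.

```python
def average_motion(frames):
    """Mean absolute per-pixel difference between consecutive frames.

    frames: list of grayscale frames, each a list of rows of intensities,
    all with the same dimensions.
    """
    if len(frames) < 2:
        return 0.0
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        total = sum(abs(a - b)
                    for row_p, row_c in zip(prev, curr)
                    for a, b in zip(row_p, row_c))
        diffs.append(total / (len(curr) * len(curr[0])))
    return sum(diffs) / len(diffs)


dark = [[0, 0], [0, 0]]
bright = [[10, 10], [10, 10]]
print(average_motion([dark, bright, bright]))
```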

Page 19

Finding Person X in Broadcast News (1/3)

Use text info. from a transcript and face info.

Relationship between the name of person x and time

S: one shot; T_S: key frame; T_O: time of the person's name

[Equations not recoverable from the transcript: p_name(S), defined in terms of T_S and T_O, and P_text(S), defined in terms of P_name(S) and P_anchor(S)]

Page 20

Finding Person X in Broadcast News (2/3)

More limited face recognition based on video shot

collect sample faces {F1, F2, …, Fn} for person X and all faces {f1, f2, …, fm} of i-frames in the news shots for which Ptext is larger than zero

build the eigenspace for those faces {f1, f2, …, fm, F1, F2, …, Fn} and represent them by the eigenfaces {eigf1, eigf2, …, eigfm, eigF1, …, eigFn}

compute a combined rank score and estimate which shots have a high probability of containing that face

R(eigf_i) = \frac{1}{n} \sum_{j=1}^{n} r_j(eigf_i)

[A second equation, defining S_face(S) from R(eigf_i) with a factor 1/k, is not recoverable from the transcript]
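One plausible reading of the combined rank score is that each candidate face is ranked by its distance to every sample face in eigenspace, and the ranks are averaged over the n samples. The sketch below implements that reading only; the Euclidean distance, the function name, and the toy 2-D coordinates are assumptions, and the eigenspace projection itself is omitted.

```python
import math


def average_rank_scores(candidates, samples):
    """Average, over all sample vectors, of each candidate's distance rank.

    candidates, samples: lists of equal-length coordinate lists (assumed to
    be faces already projected into eigenspace). Lower is a better match.
    """
    totals = [0.0] * len(candidates)
    for sample in samples:
        # Rank 1 goes to the candidate closest to this sample face.
        order = sorted(range(len(candidates)),
                       key=lambda i: math.dist(candidates[i], sample))
        for rank, i in enumerate(order, start=1):
            totals[i] += rank
    return [t / len(samples) for t in totals]


candidates = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
samples = [[0.0, 0.1], [0.9, 0.0]]
print(average_rank_scores(candidates, samples))
```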

Page 21

Finding Person X in Broadcast News (3/3)

Using “Madeleine Albright” as person x, we obtained 20 faces from a Google image search as sample query faces.

Page 22

Learning Combination Weights in Manual Retrieval (1/5)

In shot-based video retrieval, a set of features is extracted

each shot is associated with a vector of individual retrieval scores from different media search modules

finally, these retrieval scores are fused into a final ordered list via some aggregation algorithm

Page 23

Learning Combination Weights in Manual Retrieval (2/5)

use the weighted Borda fuse model as the basic combination approach for multiple search modules, i.e. for each shot its final score is

y = \sum_{i=1}^{n} w_i s_i

Similarity Measures

For video frames, a harmonic mean of the Euclidean distances from each query image (color, texture, edge) is computed as the distance between the query and the video frame

For text, retrieval over CC and OCR transcripts is done using the OKAPI BM-25 formula
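The weighted Borda fusion on this slide, where a shot's final score is the weighted sum of its per-module scores, can be sketched as below. It assumes each module's score s_i is a rank-based Borda count (N points for first place, 1 for last) and that shots missing from a module's list score 0 there; the module names, weights, and shot IDs are purely illustrative.

```python
def borda_scores(ranking):
    """Rank-based scores: first of N shots gets N points, the last gets 1."""
    n = len(ranking)
    return {shot: n - pos for pos, shot in enumerate(ranking)}


def weighted_borda_fuse(rankings, weights):
    """Fuse per-module rankings into one list ordered by sum of w_i * s_i.

    rankings: dict mapping module name to a ranked list of shot IDs.
    weights: dict mapping module name to its weight (may be negative).
    """
    fused = {}
    for name, ranking in rankings.items():
        for shot, score in borda_scores(ranking).items():
            fused[shot] = fused.get(shot, 0.0) + weights[name] * score
    # Highest fused score first; ties broken by shot ID for determinism.
    return sorted(fused, key=lambda shot: (-fused[shot], shot))


rankings = {
    "text": ["s1", "s2", "s3"],
    "color": ["s3", "s1", "s2"],
    "anchor": ["s2", "s3", "s1"],
}
weights = {"text": 2, "color": 1, "anchor": -1}
print(weighted_borda_fuse(rankings, weights))
```

A negative weight, as for the anchor module here, pushes shots that the module ranks highly toward the bottom of the fused list.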

Page 24

Learning Combination Weights in Manual Retrieval (3/5)

Negative Pseudo-Relevance Feedback (NPRF)

NPRF is effective at providing a more adaptive similarity measure for image retrieval

Propose a better strategy to sample negative examples, inspired by Maximal Marginal Relevance

Maximal Marginal Irrelevance (MMIR)

MMIR = \arg\min_{D_i \in T \setminus S} [ \lambda Sim_1(D_i, Q) + (1 - \lambda) \max_{D_j \in S} Sim_2(D_i, D_j) ]
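A greedy selection in the spirit of MMIR might look like the following sketch. It assumes the criterion minimizes a weighted sum of similarity to the query and maximum similarity to the negatives already chosen, with a trade-off parameter `lam`; the function names, the default `lam`, and the toy similarity are hypothetical, not the authors' implementation.

```python
def mmir_select(candidates, query, k, sim_query, sim_pair, lam=0.5):
    """Greedily pick k negative examples that are dissimilar to the query
    and to each other.

    sim_query(d, query) and sim_pair(d, s) are similarity functions where
    larger values mean more similar.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def cost(d):
            # Similarity to the closest already-selected example (0 if none).
            redundancy = max((sim_pair(d, s) for s in selected), default=0.0)
            return lam * sim_query(d, query) + (1 - lam) * redundancy
        best = min(remaining, key=cost)
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_sim(a, b):
    """Toy similarity on numbers: 1 for identical values, falling toward 0."""
    return 1.0 / (1.0 + abs(a - b))


print(mmir_select([1, 9, 10, 5], query=0, k=2,
                  sim_query=toy_sim, sim_pair=toy_sim))
```

In the toy run the first pick is the item farthest from the query, and the second avoids the near-duplicate of the first even though that item is also far from the query.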

Page 25

Learning Combination Weights in Manual Retrieval (4/5)

The Value of Intermediate-level Detectors

Text-based features are good at global ranking, and other features are useful in refining the ranking afterwards

Learning Weights for each Modality in Video Retrieval

Baseline: Setting weights based on query types

Person query: w=(text 2, face 1, color 1, anchor 0)

Non-person query: w=(text 2, face -1, color 1, anchor -1)

Aircraft and animal: w=(text 2, face -1, edge 1, anchor -1)

Page 26

Learning Combination Weights in Manual Retrieval (5/5)

Learning weights using a labeled training set

Supervised learning algorithm on the development set

Co-Retrieval

a set of video shots is first labeled as relevant using text-based features, and the results are augmented by learning with the other visual and intermediate-level features

Experimental results

Page 27

Interactive TREC Video Retrieval Evaluation for 2003 (1/2)

This interface has the following features:

Storyboards of images spanning across video story segments

Emphasizing matching shots to a user’s query to reduce the image count

Resolution and layout under the user control

Additional filtering provided through shot classifiers

Display of filter count and distribution to guide manipulation of storyboard views

Page 28

Interactive TREC Video Retrieval Evaluation for 2003 (2/2)

Page 29

Conclusions

We believe the browsing interfaces and image-based search improvements made for 2003 led to the increase in performance for the new system, as these strategies allowed relevant content with no associated narrative or text metadata to be found.