1/30
Video indexing and retrieval at TREC 2002

Christian Wolf¹ ([email protected]), David Doermann²

¹ Laboratoire de Reconnaissance de Formes et Vision
Institut National des Sciences Appliquées de Lyon
Bât. Jules Verne, 20, Avenue Albert Einstein
69621 Villeurbanne cedex, France

² Laboratory for Language and Media Processing
Institute for Advanced Computer Studies
University of Maryland
College Park, MD 20742-3275, USA
2/30
Plan of the presentation

- Introduction - the TREC competition
- Features & query techniques
- Experiments & results
  - Run types
  - Example queries
  - The impact of speech/text/color
- Conclusion and outlook
3/30
The NIST Text Retrieval Conference

68.45 hours of MPEG-1 video from the Internet Archive and the Open Video Project.

The goal of the conference series is to encourage research in information retrieval from large amounts of text by providing:

- a large test collection,
- uniform scoring procedures,
- a forum for organizations interested in comparing their results.

The Video Retrieval Track investigates content-based retrieval from digital video.
4/30
Aims and Tasks

Three subtasks are defined in the Video Track, and participants are free to choose the tasks for which they want to submit results:

- Shot boundary determination
- Feature extraction
- Search

Collections:

- Feature development collection (23.6 h)
- Feature test collection (5 h)
- Search test collection (40.12 h)
5/30
Search: different query types

Two query types are supported by the competition: manual and interactive queries.
6/30
Example search topics

- Find shots with Eddie Rickenbacker in them.
- Find additional shots with James H. Chandler.
- Find pictures of George Washington.
- Find shots with a depiction of Abraham Lincoln.
- Find shots of people spending leisure time at the beach, for example: walking.
- Find shots of one or more musicians: a man or woman playing a musical instrument with instrumental music audible. Musician(s) and instrument(s) must be at least partly visible sometime during the shot.
- Find shots of football players.
- Find shots of one or more women standing in long dresses. The dress should be one piece and extend below the knees; the entire dress from top to below the knees should be visible at some point.
- Find shots of the Golden Gate Bridge.
- Find shots of Price Tower, designed by Frank Lloyd Wright and built in Bartlesville, Oklahoma.
- Find shots containing Washington Square Park's arch in New York City. The entire arch should be visible at some point.
- Find overhead views of cities, downtown and suburbs. The viewpoint should be higher than the highest building visible.
- Find shots of oil fields, rigs, derricks, and oil drilling/pumping equipment. Shots just of refineries are not desired.
- Find shots with a map (sketch or graphic) of the continental US.
- Find shots of a living butterfly.
- Find more shots with one or more snow-covered mountain peaks or ridges. Some sky must be visible behind them.
7/30
The feature extraction task: overlay text

Pipeline: detection and multiple-frame integration, binarization, OCR (ScanSoft), suppression of false alarms. Example result: "Soukaina Oufkir".

A linear classifier trained with Fisher's linear discriminant is used to classify the OCR output for each text box into text and non-text.

Text example (noisy OCR of film credits):
"TONY RIYERA ARNOLD GILLESPIE EUGENE PODDANY EMERY NAWKúN5 GEORGE GORDON GERALD NEYIUD i recto r TRUE BOAROMAN CARL URBAN Art Direction EMERY NAWKINS Music Score Director GEORGE GORDON l E W K E LLER PRODUCTION a yen Pu s1c~"

Non-text example (OCR of a non-text box):
". .ai ~ ia 7) E nAl~1I.Mol, 6I J'-Nr~vir lowre,740~17-jF 00Iis!'/"
Separation of characters into 4 classes:

- upper case: A-Z
- lower case: a-z
- digits: 0-9
- bad: the rest

Features:

$F_1 = \frac{\text{number of good characters (upper + lower + digits)}}{\text{number of characters}}$

$F_2 = \frac{\text{number of class changes}}{\text{number of characters}}$
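As a minimal sketch of these two features (assuming whitespace and punctuation count as "bad" characters, which the slide does not specify; all names are illustrative):

```python
def ocr_features(s: str):
    """Return (F1, F2) for an OCR output string.

    F1: fraction of good characters (upper case, lower case, digits).
    F2: fraction of class changes between adjacent characters.
    """
    def char_class(c: str) -> str:
        if c.isupper():
            return "upper"
        if c.islower():
            return "lower"
        if c.isdigit():
            return "digit"
        return "bad"  # assumption: whitespace and punctuation are "bad"

    if not s:
        return 0.0, 0.0
    classes = [char_class(c) for c in s]
    good = sum(cls != "bad" for cls in classes)
    changes = sum(a != b for a, b in zip(classes, classes[1:]))
    return good / len(s), changes / len(s)

# Real text scores high F1 / low F2; OCR noise tends to the opposite.
print(ocr_features("Soukaina Oufkir"))
print(ocr_features(". .ai ~ ia 7) E nAl~1I"))
```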
8/30
Features

Shot boundary definition (MPEG-7 XML): 14524 shots in the search test collection (40 h).

Donated binary features (MPEG-7 XML): 10 different binary features from different donors (32 detectors in total); a confidence value is given for each shot. Examples:

- Outdoors: IBM, MSRA, Mediamill
- Face: IBM, MSRA, Mediamill

Temporal color correlograms: developed by UMD in collaboration with the University of Oulu [Rautiainen and Doermann, 2002].

Detected and recognized text: developed by INSA de Lyon [Wolf and Jolion, 2002].

Speech recognition (donated feature, MPEG-7 XML): LIMSI and MSRA.
9/30
Query techniques

[Diagram: a query is composed from text/speech, binary features, and temporal color features.]
10/30
Recognized text and speech

For the actual retrieval we used the freely available Managing Gigabytes (MG) software (http://www.cs.mu.oz.au/mg). Two query metrics are available:

- Boolean
- Ranked, based on the cosine measure
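For reference, the ranked metric is the standard cosine measure between the query and document term-weight vectors (standard formulation, not MG-specific):

$$\cos(q, d) = \frac{\sum_t w_{q,t}\, w_{d,t}}{\sqrt{\sum_t w_{q,t}^2}\,\sqrt{\sum_t w_{d,t}^2}}$$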
MG was written for error-free documents, so it checks for exact matches on the stemmed words (e.g. "produced" matches "producer"). We added an inexact match feature by using N-grams (see the sketch below):

Target: "Nick Chandler"
Query: "chandler"
N-grams: chand|handl|andle|ndler|chandl|handle|andler|chandle|handler|chandler
Results: "ni ck l6 tia ndler", "colleges cattlemen handlers of livestock"
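A minimal sketch of the N-gram generation, assuming substrings of length 5 up to the full word (these bounds reproduce the "chandler" example above; the exact bounds used are not stated on the slide):

```python
def char_ngrams(word: str, min_n: int = 5) -> list[str]:
    """All substrings of length >= min_n, so that OCR/ASR errors
    still produce partial matches against the index."""
    word = word.lower()
    return [word[i:i + n]
            for n in range(min_n, len(word) + 1)
            for i in range(len(word) - n + 1)]

print("|".join(char_ngrams("chandler")))
# chand|handl|andle|ndler|chandl|handle|andler|chandle|handler|chandler
```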
11/30
Binary features

The binary features specify the presence of a feature in each shot, given as a confidence value in [0, 1]:

- People: IBM, Mediamill, MSRA
- Outdoors: IBM, Mediamill, MSRA
- ...
The product rule:

$Q_i(x) = \prod_j C_{ij}(x)$

Quantifies the true likelihood if the features are statistically independent. Bad if the base classifiers are weakly trained or have high error rates.

The sum rule:

$Q_i(x) = \sum_j C_{ij}(x)$

Works well with base classifiers with independent noise behaviour.

A further option: training the combining classifier.

$C_{ij}$ ... output of classifier $j$ for class $i$; $Q_i$ ... output of the combined classifier for class $i$.
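A minimal sketch of the two rules, assuming detector confidences for one semantic class are stacked into a (detectors × shots) array; names are illustrative and the values reuse the "people" example on the next slide:

```python
import numpy as np

def product_rule(C: np.ndarray) -> np.ndarray:
    # Q_i(x) = prod_j C_ij(x): the true likelihood under independence,
    # but a single confident zero vetoes the whole shot.
    return np.prod(C, axis=0)

def sum_rule(C: np.ndarray) -> np.ndarray:
    # Q_i(x) = sum_j C_ij(x): more robust when detectors are noisy.
    return np.sum(C, axis=0)

C = np.array([[0.27, 0.27],   # People - IBM        (shot 1, shot 2)
              [0.87, 0.23],   # People - Mediamill
              [0.94, 0.56]])  # People - MSRA
print(product_rule(C))  # [0.2208..., 0.0347...]
print(sum_rule(C))      # [2.08, 1.06]
```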
12/30
Binary features - ranked queries

Example: confidence values for two shots and a query vector expressing the desired features:

    Feature                 Shot 1   Shot 2   Query vector
    People - IBM            0.27     0.27     1.0
    People - Mediamill      0.87     0.23     1.0
    People - MSRA           0.94     0.56     1.0
    Outdoors - IBM          0.15     0.15     1.0
    Outdoors - Mediamill    0.08     0.76     1.0
    Indoors - IBM           0.65     0.07     0.0
Euclidean distance:

$D(x, y) = (x - y)^T (x - y)$

Mahalanobis distance:

$D(x, y) = (x - y)^T \Sigma^{-1} (x - y)$

$\Sigma$ ... covariance matrix for the complete data set.

[Figure: illustration in the 3-dimensional case.]
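A minimal sketch of ranking shots by these distances, assuming one confidence vector per shot and a binary query vector; $\Sigma$ is estimated from the complete data set as on the slide, and all names are illustrative:

```python
import numpy as np

def euclidean_sq(x, y):
    d = x - y
    return d @ d                      # (x - y)^T (x - y)

def mahalanobis_sq(x, y, sigma_inv):
    d = x - y
    return d @ sigma_inv @ d          # (x - y)^T Sigma^-1 (x - y)

rng = np.random.default_rng(0)
shots = rng.random((14524, 6))        # one confidence vector per shot
query = np.array([1, 1, 1, 1, 1, 0.0])
sigma_inv = np.linalg.pinv(np.cov(shots, rowvar=False))

# Rank all shots by distance to the query vector (smallest first).
order = sorted(range(len(shots)),
               key=lambda i: mahalanobis_sq(shots[i], query, sigma_inv))
print(order[:5])
```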
13/30
Temporal color features

For each shot, a temporal color correlogram is kept [Rautiainen and Doermann, 2002]:

$\gamma^{(d)}_{c_i, c_j} = \Pr_{p_1 \in I_{c_i},\; p_2 \in I_n} \left[\, p_2 \in I_{c_j} \;\middle|\; |p_1 - p_2| = d \,\right]$

It stores the probability that, given any pixel $p_1$ of color $c_i$, a pixel $p_2$ at distance $d$ is of color $c_j$, among the shot's frames $I_n$. The distance is calculated using the L1 norm.

TREC: auto-correlogram ($c_i = c_j$).
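A minimal sketch of an auto-correlogram ($c_i = c_j$) for one shot, pooled over its frames. It assumes frames already quantized to color indices and, as a simplification, samples only the four axis-aligned offsets at each L1 distance rather than the full L1 neighborhood:

```python
import numpy as np

def auto_correlogram(frames, n_colors, distances=(1, 3, 5, 7)):
    """frames: list of 2D int arrays of quantized color indices.
    Returns gamma[c, k] = Pr(a pixel at L1 distance distances[k]
    has the same color c as the reference pixel), pooled over frames."""
    counts = np.zeros((n_colors, len(distances)))
    totals = np.zeros((n_colors, len(distances)))
    for img in frames:
        h, w = img.shape
        for k, d in enumerate(distances):
            # axis-aligned offsets at L1 distance d
            for dy, dx in ((-d, 0), (d, 0), (0, -d), (0, d)):
                src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
                dst = img[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
                same = src == dst
                for c in range(n_colors):
                    mask = src == c
                    counts[c, k] += np.count_nonzero(same & mask)
                    totals[c, k] += np.count_nonzero(mask)
    return counts / np.maximum(totals, 1)   # avoid division by zero
```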
14/30
The query tool
15/30
Querying

- Keyword-based queries on text or speech or both together, with or without N-grams, boolean or ranked.
- Ranked color queries.
- Ranked queries on binary features.
- Filters on binary features.

[Figure: example result scores of four queries (Query 1-4), each normalized from 1.00 down to 0.00.]

- AND, OR combination of query results, including weighted combinations of the rankings of both queries (a sketch follows below).
- Truncate queries.
- View the keyframes of queries.
- Export query results into stardom, the graphical browsing tool.
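A minimal sketch of the weighted OR/AND combination; the exact fusion rule of the tool is not given on the slides, so a weighted sum of scores stands in for OR and an intersection for AND (all names are illustrative):

```python
def combine_or(results_list, weights):
    """Weighted OR: a shot scores if any sub-query returns it.
    Each results dict maps shot id -> score in [0, 1]."""
    combined = {}
    total = sum(weights)
    for results, w in zip(results_list, weights):
        for shot, score in results.items():
            combined[shot] = combined.get(shot, 0.0) + w * score / total
    return combined

def combine_and(results_list, weights):
    """Weighted AND: a shot must appear in every sub-query's result set."""
    keys = set.intersection(*(set(r) for r in results_list))
    total = sum(weights)
    return {s: sum(w * r[s] for r, w in zip(results_list, weights)) / total
            for s in keys}

speech = {101: 0.9, 102: 0.4}
color  = {101: 0.2, 103: 0.8}
print(combine_or([speech, color], weights=[4, 1]))
# {101: 0.76, 102: 0.32, 103: 0.16}
```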
18/30
Experiments

    Topic   min.   Description
    75             Eddie Rickenbacker
    76      6      James Chandler
    77      14     George Washington
    78      19     Abraham Lincoln
    79      43     People at the beach
    80      20     Musicians with music playing
    81             Football players
    82      29     Women standing in long dresses
    83      19     Golden Gate Bridge
    84      11     Price Tower
    85             Washington Square Park's arch
    86      85     Overhead views of cities
    87      61     Oil fields, rigs, derricks
    88      43     Map of the continental US
    89      61     A living butterfly
    90      12     Snow-covered mountain peaks
    91             Parrots
    92      20     Sailboats, sailing ship
    93      15     Beef or dairy cattle, cows
    94      17     People walking in cities
    95      15     Nuclear explosion with mushroom cloud
    96      39     US flag
    97      23     Microscopic views of living cells
    98      15     Locomotive approaching
    99      19     Rocket or missile taking off

Three runs were submitted:

- Manual run using all available features.
- Manual run without speech recognition.
- Interactive run using all available features. The graphical tool was used to browse the data, but all submitted results were queries submitted through the command-line tool.
19/30
Example queries

"Find additional shots of James H. Chandler", manual query:

    OR-combination of (weights in parentheses):
        text/speech: "James Chandler" (4)
        text/speech: "Jim Chandler" (4)
        text/speech, N-gram: "James Chandler" (2)
        color: 1st example video (1)
        color: 2nd example video (1)
        color: 3rd example video (1)
    AND-combined (weight 100000) with:
        binary features: People >= 0.25, Landscape <= 0.75 (weight 1)

    Result (topic 76, manual): Prec./100 = 0.2, Avg. prec. = 0.38

"Shots of rockets or missiles taking off", manual & interactive:

    OR-combination of:
        text/speech: "rocket missile" (2)
        text/speech: "taking off launch start" (2)
        color: 1st example video (1)
        color: 2nd example video (1)
    AND-combined (weight 100000) with:
        binary features, ranked: 7000, -People, -Faces (weight 1)

    Result (topic 99, manual): Prec./100 = 0.05, Avg. prec. = 0.34
20/30
Manual vs. interactive queries

Topic 79 (people at the beach), manual query:

    OR-combination of:
        text/speech: "beach" (4)
        text/speech: "beach fun sun" (3)
        text/speech: "leisure sand vacation" (2)
        color: 1st example video (1)
        color: 2nd example video (1)
        color: 3rd example video (1)
        color: 4th example video (1)
    AND-combined (weight 100000) with:
        binary features: People >= 0.25, Indoors <= 0.75, Outdoors >= 0.25 (weight 1)

    Result (topic 79, manual): Prec./100 = 0, Avg. prec. = 0

Topic 79, interactive query:

    OR-combination of:
        text/speech: "swimming" (2)
        text/speech: "shore" (1)
    OR-combined (weights 2 / 1) with:
        text/speech: "water"
        binary features: PL <= 0.5, OD >= 0.5, CT <= 0.05, ID <= 0.75, LS >= 0.5
    AND-combined (weight 100000) with:
        binary features: Landscape >= 0.3, Cityscape <= 0.5, Outdoors >= 0.5 (weights 2 / 1)

    Result (topic 79, interactive): Prec./100 = 0.07, Avg. prec. = 0.11
21/30
Ranked binary queries: distance functions

Full query:

    Topic   Eucl.   Mah.
    82      0.24    0.16
    84      0       0
    99      0       0

Binary query only:

    Topic   Query                                     Eucl.   Mah.
    82      +People +Indoors -Outdoors -Landscape     0.24    0.16
    84      +Cityscape -People -Face                  0.48    0.06
    99      -People -Face                             0.96    0.8
[Figure: histograms of the confidence values of the three "people" detectors (IBM, Mediamill, MSRA) over the shots of the collection.]
Example false alarm (query vector vs. shot vector, with the per-feature standard deviation):

    Query   Shot    Std. dev.
    1       0       0.12
    1       0.18    0.13
    0       0       0.06
    0       0.23    0
    0       0       0.16
    0       0.05    0.07
    0       0.01    0
    0       1       0.12
    0       0       0.04
22/30
Precision curves per topic

[Figures: precision vs. result set size and precision vs. recall per topic, for the three runs (manual, manual without ASR, interactive).]
23/30
Precision curves consolidated

[Figures: precision vs. result set size and precision vs. recall, consolidated over all topics, for the three runs (manual, manual without ASR, interactive).]
24/30
Comparison with other teams

Average precision - manual runs:

    ID   mean   std. dev.
    1    0.23   0.14
    2    0.14   0.17
    3    0.11   0.15
    4    0.09   0.12
    5    0.09   0.16
    6    0.08   0.11
    7    0.07   0.20
    8    0.07   0.09
    9    0.06   0.10
    10   0.06   0.18
    11   0.06   0.10
    12   0.06   0.12
    13   0.06   0.09
    14   0.06   0.10
    15   0.04   0.11
    16   0.03   0.08
    17   0.03   0.06
    18   0.03   0.05
    19   0.02   0.05
    20   0.01   0.02
    21   0.01   0.01
    22   0.01   0.01
    23   0.01   0.00
    24   0.00   0.01
    25   0.00   0.01
    26   0.00   0.01
    27   0.00   0.00
25/30
Comparison with other teams

Average precision - interactive runs:

    ID   mean   std. dev.
    1    0.52   0.24
    2    0.32   0.21
    3    0.31   0.20
    4    0.29   0.21
    5    0.26   0.21
    6    0.24   0.22
    7    0.22   0.23
    8    0.18   0.21
    9    0.15   0.15
    10   0.15   0.15
    11   0.07   0.11
    12   0.05   0.08
    13   0.05   0.08
26/30
Speech

The quality of the speech queries depends heavily on the topic. In general, the result sets of speech queries are very heterogeneous and need to be filtered, e.g. by binary filters.

Example: "rocket missile"
27/30
Color

As expected, the color filters were very useful where the query images were very different from other images in terms of low-level features, or where the relevant shots in the database share common color properties with the example query (e.g. shots in the same environment).

Query "living cells": the results of the run without speech are better than those of the run including speech.
28/30
Color

Searching for "James Chandler" using the color features only.
29/30
Recognized text

Example keyframes with recognized text: "Dance", "Energy", "Gas", "Music", "Oil", "Airline", "Air plane".

The type of video present in the collection does not favor the use of recognized text: in most of the documentaries, the only text is the title at the beginning and the credits at the end.
30/30
Conclusion and Outlook

- Exploit temporal continuities between the frames, as already proposed by the Dutch team during TREC 2001. This seems especially important for video OCR, since single shots containing only text sometimes "interrupt" content shots.
- Train the combination of features; more research into the combination of the binary features (normalization, robust outlier detection, etc.).
- Browsing: the graphical viewing interface could be very promising if tiny (and enlargeable) keyframes can be integrated into the grid.
- Additional features: explicit color filters and query by (sketched) example (define regions and color ranges); motion features; use of the internet to get example images (Google).