TRECVID-2011 Semantic Indexing task: Overview
Georges Quénot Laboratoire d'Informatique de Grenoble
George Awad NIST
also with Franck Thollard, Bahjat Safadi (LIG) and Stéphane Ayache (LIF) and support from the Quaero Programme
Outline
Task summary
Evaluation details: inferred average precision
Participants
Evaluation results: pool analysis, results per category, results per concept, significance tests per category
Global observations
Issues
Semantic Indexing task (1)
Goal: Automatic assignment of semantic tags to video segments (shots)
Secondary goals: Encourage generic (scalable) methods for detector development.
Semantic annotation is important for filtering, categorization, browsing, and searching.
Participants submitted two types of runs:
Full run: includes results for 346 concepts, from which NIST evaluated 20.
Lite run: includes results for 50 concepts, a subset of the 346.
TRECVID 2011 SIN video data
Test set (IACC.1.B): 200 hrs, videos with durations between 10 seconds and 3.5 minutes.
Development set (IACC.1.A & IACC.1.tv10.training): 200 hrs, videos with durations just longer than 3.5 minutes.
Total shots (much more than in previous TRECVID years; no composite shots):
Development: 146,788 + 119,685
Test: 137,327
Common annotation for 346 concepts coordinated by LIG/LIF/Quaero
Semantic Indexing task (2)
Selection of the 346 target concepts:
Include all the TRECVID "high-level features" from 2005 to 2010, to favor cross-collection experiments.
Plus a selection of LSCOM concepts chosen so that:
- we end up with a number of generic-specific relations among them, promoting research on methods for indexing many concepts and on using ontology relations between them;
- we cover a number of potential subtasks, e.g. "persons" or "actions" (not really formalized).
It is also expected that these concepts will be useful for the content-based (known item) search task.
Set of relations provided:
559 "implies" relations, e.g. "Actor implies Person"
10 "excludes" relations, e.g. "Daytime_Outdoor excludes Nighttime"
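As an illustration, such relations can be used to post-process per-shot concept scores. The Python sketch below is hypothetical: the task did not prescribe any particular use of the relations, and the function name, score-combination rules, and example scores are all invented.

```python
def apply_relations(scores, implies, excludes):
    """Post-process per-shot concept scores with ontology relations.

    scores:   dict mapping concept name -> detection score in [0, 1]
    implies:  list of (a, b) pairs meaning "a implies b"
    excludes: list of (a, b) pairs meaning "a excludes b"
    """
    s = dict(scores)
    # "a implies b": any shot containing a also contains b,
    # so b's score should be at least a's score.
    for a, b in implies:
        s[b] = max(s.get(b, 0.0), s.get(a, 0.0))
    # "a excludes b": both cannot be present; damp the weaker score
    # in proportion to the stronger one's confidence (one arbitrary choice).
    for a, b in excludes:
        sa, sb = s.get(a, 0.0), s.get(b, 0.0)
        if sa >= sb:
            s[b] = sb * (1.0 - sa)
        else:
            s[a] = sa * (1.0 - sb)
    return s

# Toy example using the two relations quoted above, with invented scores.
shots = apply_relations(
    {"Actor": 0.9, "Person": 0.5, "Daytime_Outdoor": 0.8, "Nighttime": 0.6},
    implies=[("Actor", "Person")],
    excludes=[("Daytime_Outdoor", "Nighttime")],
)
```

Exploiting relations this way is one of the research directions the concept selection was meant to enable; real systems may propagate scores quite differently.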
Semantic Indexing task (3)
NIST evaluated 20 concepts and Quaero evaluated 30 concepts
Four training types were allowed:
A - used only IACC training data
B - used only non-IACC training data
C - used both IACC and non-IACC TRECVID (S&V and/or Broadcast news) training data
D - used both IACC and non-IACC non-TRECVID training data
Datasets comparison

                        TV2007  TV2008   TV2009   TV2010   TV2011
                                (=TV2007 (=TV2008          (=TV2010
                                 + New)   + New)            + New)
Dataset length (hours)   ~100    ~200     ~380     ~400     ~600
Master shots            36,262  72,028  133,412  266,473  403,800
Unique program titles       47      77      184      N/A      N/A
Number of runs for each training type

Training type                                  Full runs  Lite runs
A - Only IACC data                                    62         96
B - Only non-IACC data                                 2          2
C - Both IACC and non-IACC TRECVID data                1          1
D - Both IACC and non-IACC non-TRECVID data            3          3

Total lite runs (102): A 96 (94%), B 2 (2%), C 1 (1%), D 3 (3%)
50 concepts evaluated:
2 Adult, 5 Anchorperson, 10 Beach, 21 Car, 26 Charts, 27 Cheering*, 38 Dancing*, 41 Demonstration_Or_Protest*, 44 Doorway*, 49 Explosion_Fire*, 50 Face, 51 Female_Person, 52 Female-Human-Face-Closeup*, 53 Flowers*, 59 Hand*, 67 Indoor, 75 Male_Person, 81 Mountain*, 83 News_Studio, 84 Nighttime*, 86 Old_People*, 88 Overlaid_Text, 89 People_Marching, 97 Reporters, 100 Running*, 101 Scene_Text, 105 Singing*, 107 Sitting_down*, 108 Sky, 111 Sports, 113 Streets, 123 Two_People, 127 Walking*, 128 Walking_Running, 227 Door_Opening, 241 Event, 251 Female_Human_Face, 261 Flags, 292 Head_And_Shoulder, 332 Male_Human_Face, 354 News, 392 Quadruped, 431 Skating, 442 Speaking, 443 Speaking_To_Camera, 454 Studio_With_Anchorperson, 464 Table, 470 Text, 478 Traffic, 484 Urban_Scenes
The concepts marked with "*" are a subset of those tested in 2010.
Evaluation
Each feature is assumed to be binary: absent or present in each master reference shot.
Task: find shots that contain a given feature, rank them by confidence, and submit the top 2000.
NIST sampled the ranked pools and judged the top results from all submissions.
Performance was evaluated by calculating the inferred average precision of each feature result.
Runs were compared in terms of mean inferred average precision across the 50 feature results for full runs and the 23 feature results for lite runs.
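Concretely, building a per-concept submission amounts to ranking all test shots by detector confidence and truncating at 2000. A minimal sketch (shot IDs, scores, and the function name are invented for illustration):

```python
def make_submission(shot_scores, limit=2000):
    """Rank shots by descending confidence and keep the top `limit`,
    as required for each concept in a SIN submission."""
    ranked = sorted(shot_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [shot for shot, _ in ranked[:limit]]

# Toy example with three shots and invented confidence scores.
top = make_submission({"shot123_1": 0.40, "shot123_2": 0.90, "shot123_3": 0.10})
```

A full run repeats this for all 346 concepts; only the evaluated subset is then judged.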
Inferred average precision (infAP)
Developed* by Emine Yilmaz and Javed A. Aslam at Northeastern University
Estimates average precision surprisingly well from a small sample of judgments drawn from the usual submission pools.
This means that more features can be judged with the same annotation effort.
Experiments on feature submissions from previous TRECVID years confirmed the quality of the estimate, both in terms of actual scores and of system ranking.
* J. A. Aslam, V. Pavlu and E. Yilmaz, "A Statistical Method for System Evaluation Using Incomplete Judgments," Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
2011: mean extended Inferred average precision (xinfAP)
2 pools were created for each concept and sampled as: Top pool (ranks 1-100) sampled at 100%
Bottom pool (ranks 101-2000) sampled at 8%
Judgment process: one assessor per concept, who watched each complete shot while listening to the audio.
infAP was calculated from the judged and unjudged pools by sample_eval.
50 concepts; 268,156 total judgments; 52,522 total hits:
6,747 hits at ranks 1-10
28,899 hits at ranks 11-100
16,876 hits at ranks 101-2000
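To illustrate the sampling idea behind these two pools (this is a toy inverse-probability sketch, not NIST's actual sample_eval estimator; all names and details are invented):

```python
import random

def average_precision(ranked_rel):
    """Exact AP given complete 0/1 relevance down the ranked list."""
    total_rel = sum(ranked_rel)
    hits, acc = 0, 0.0
    for rank, rel in enumerate(ranked_rel, 1):
        if rel:
            hits += 1
            acc += hits / rank
    return acc / total_rel if total_rel else 0.0

def sampled_ap_estimate(ranked_rel, top=100, rate=0.08, rng=None):
    """Toy stratified estimate of AP: judge ranks 1..top fully, sample
    deeper ranks at `rate`, and weight sampled judgments by 1/rate."""
    rng = rng or random.Random(0)
    judged = []  # (rank, relevance, inverse-probability weight)
    for rank, rel in enumerate(ranked_rel, 1):
        if rank <= top:
            judged.append((rank, rel, 1.0))
        elif rng.random() < rate:
            judged.append((rank, rel, 1.0 / rate))
    est_total_rel = sum(w for _, rel, w in judged if rel)
    if est_total_rel == 0:
        return 0.0
    hits_w, acc = 0.0, 0.0
    for rank, rel, w in judged:
        if rel:
            hits_w += w                 # estimated relevant count so far
            acc += w * (hits_w / rank)  # estimated precision at this rank
    return acc / est_total_rel
```

When every rank is judged (top covers the whole list), the estimate reduces exactly to the true AP; with partial judgments it only approximates it, which is the trade-off infAP exploits.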
2011: 28/56 finishers

CCD INS KIS MED SED SIN  Team
--- --- KIS --- --- SIN  Aalto University
--- --- --- --- --- SIN  Beijing Jiaotong University
CCD INS KIS --- SED SIN  Beijing University of Posts and Telecommunications-MCPRL
CCD --- --- *** *** SIN  Brno University of Technology
--- *** *** MED SED SIN  Carnegie Mellon University
--- --- KIS MED --- SIN  Centre for Research and Technology Hellas
--- INS KIS MED --- SIN  City University of Hong Kong
--- --- KIS MED --- SIN  Dublin City University
--- --- --- *** --- SIN  East China Normal University
--- --- --- --- --- SIN  Ecole Centrale de Lyon, Université de Lyon
--- --- *** *** --- SIN  EURECOM
--- INS --- --- --- SIN  Florida International University
CCD --- --- --- --- SIN  France Telecom Orange Labs (Beijing)
--- --- --- --- --- SIN  Institut EURECOM
*** *** *** *** *** SIN  Tsinghua University, Fujitsu R&D and Fujitsu Laboratories
--- INS --- *** --- SIN  JOANNEUM RESEARCH Forschungsgesellschaft mbH and Vienna University of Technology
--- --- *** MED --- SIN  Kobe University
*** INS *** *** *** SIN  Laboratoire d'Informatique de Grenoble
*** INS *** MED *** SIN  National Inst. of Informatics
*** *** *** *** SED SIN  NHK Science and Technical Research Laboratories
--- --- --- --- --- SIN  NTT Cyber Solutions Lab
--- *** --- MED --- SIN  Quaero consortium
--- --- --- MED SED SIN  Tokyo Institute of Technology, Canon Corporation
CCD --- --- --- --- SIN  University of Kaiserslautern
*** *** --- *** --- SIN  University of Marburg
--- *** *** MED --- SIN  University of Amsterdam
--- *** *** MED --- SIN  University of Electro-Communications
CCD --- --- --- --- SIN  University of Queensland

*** : group didn't submit any runs; --- : group didn't participate
Year  Task finishers  Participants
2011        28             56
2010        39             69
2009        42             70
2008        43             64
2007        32             54
2006        30             54
2005        22             42
2004        12             33
Participation and finishing declined! Why?
[Chart: inferred unique hits per concept (y-axis 0-25,000 over the 50 evaluated concepts); the most frequent concepts include Adult, Face, Indoor, Scene_Text, and Event.]
[Chart: frequency of hits varies by feature; no feature exceeds ~5% of total test shots. Features common with 2010: Cheering (6), Dancing (7), Demonstration_Protest (8), Doorway (9), Explosion_Fire (10), Female_face_closeup (13), Flowers (14), Hand (15), Mountain (18), Nighttime (20), Old_People (21), Running (25), Singing (27), Sitting_down (28), Walking (33); also labeled: Overlaid_text, Male_person, Sky, Head_Shoulder, Male_human_face, Speaking, Text, Female_person.]
True shots contributed uniquely by team

Full runs:
Team  Shots    Team  Shots
Vid    1130    Mar      69
UEC     965    NHK      49
iup     822    dcu      49
vir     749    FTR      42
nii     429    Qua       9
CMU     385    FIU       2
ecl     214
brn     185
Pic     177
IRI     154
ITI     151
Tok     140
UvA      72

Lite runs:
Team  Shots    Team  Shots
UEC     506    ITI      41
JRS     404    brn      41
Vid     337    FTR      30
iup     318    Tok      25
vir     257    UvA      19
BJT     245    UQM      16
MCP     149    Eur      11
nii     145    Mar       9
cs2     120    ECN       3
CMU     102    Qua       2
IRI      50
thu      48
Pic      45

More unique shots were found than in TV2010 (there are more shots this year).
[Chart: Category A results (full runs). Mean infAP per run, ranked from A_TokyoTech_Canon_2 down through the A-category submissions to a random run; median = 0.109.]
[Chart: Category B results (full runs). Mean infAP for B_dcu.LocalFeatureBoW and B_vireo.SF_web_image; median = 0.033.]
[Chart: Category D results (full runs). Mean infAP for D_TokyoTech_Canon_4, D_vireo.A-SVM, and D_vireo.TradBoost; median = 0.112.]
Note: Category C has only 1 run (C_dcu.GlobalFeature) with score = 0.01.
[Chart: Category A results (lite runs). Mean infAP per run, ranked from A_TokyoTech_Canon_1 down through the A-category lite submissions; median = 0.056.]
[Chart: Category B results (lite runs). Mean infAP for B_dcu.LocalFeatureBoW and B_vireo.SF_web_image; median = 0.028.]
[Chart: Category D results (lite runs). Mean infAP for D_TokyoTech_Canon_4, D_vireo.A-SVM, and D_vireo.TradBoost; median = 0.082.]
Note: Category C has only 1 run (C_dcu.GlobalFeature) with score = 0.017.
Top 10 InfAP scores by feature (full runs)
[Chart: infAP from 0 to 0.6 across the 50 evaluated features, with the median marked.]
Feature key: 1 Adult, 2 Anchorperson, 3 Beach, 4 Car, 5 Charts, 6 Cheering, 7 Dancing, 8 Demonstration/Protest, 9 Doorway, 10 Explosion, 11 Face, 12 Female_person, 13 Female_face_closeup, 14 Flowers, 15 Hand, 16 Indoor, 17 Male_person, 18 Mountain, 19 News_studio, 20 Nighttime, 21 Old_people, 22 Overlaid_text, 23 People_marching, 24 Reporters, 25 Running, 26 Scene_Text, 27 Singing, 28 Sitting_down, 29 Sky, 30 Sports, 31 Streets, 32 Two_people, 33 Walking, 34 Walking_Running, 35 Door_opening, 36 Event, 37 Female_face, 38 Flags, 39 Head&Shoulders, 40 Male_human_face, 41 News, 42 Quadruped, 43 Skating, 44 Speaking, 45 Speaking_to_camera, 46 Studio_with_anchorperson, 47 Table, 48 Text, 49 Traffic, 50 Urban_scenes.
Top 10 InfAP scores for 23 common features (lite AND full runs)
[Chart: infAP from 0 to 0.45 across the 23 common features, with the median marked.]
Feature key: 1 Adult, 2 Car, 3 Cheering, 4 Dancing, 5 Demonstration/Protest, 6 Doorway, 7 Explosion, 8 Female_person, 9 Female_face_closeup, 10 Flowers, 11 Hand, 12 Indoor, 13 Male_person, 14 Mountain, 15 News_studio, 16 Nighttime, 17 Old_people, 18 Running, 19 Scene_Text, 20 Singing, 21 Sitting_down, 22 Walking, 23 Walking_Running.
Run name (mean infAP):
A_TokyoTech_Canon_2 (0.173)
A_TokyoTech_Canon_1 (0.173)
A_UvA.Leonardo_1 (0.172)
A_UvA.Raphael_3 (0.170)
A_UvA.Donatello_2 (0.168)
A_TokyoTech_Canon_3 (0.164)
A_Quaero1 (0.153)
A_Quaero2 (0.151)
A_UvA.Michelangelo_4 (0.150)
A_Quaero3 (0.150)
Significant differences among top 10 A-category full runs (randomization test, p < 0.05):
A_UvA.Leonardo_1 > A_Quaero1, A_Quaero2, A_Quaero3, A_UvA.Michelangelo_4
A_UvA.Raphael_3 > A_Quaero1, A_Quaero2, A_Quaero3, A_UvA.Michelangelo_4
A_UvA.Donatello_2 > A_Quaero1, A_Quaero2, A_Quaero3, A_UvA.Michelangelo_4
A_TokyoTech_Canon_2 > A_TokyoTech_Canon_3, A_UvA.Michelangelo_4, A_Quaero1, A_Quaero2, A_Quaero3
A_TokyoTech_Canon_1 > A_TokyoTech_Canon_3, A_UvA.Michelangelo_4, A_Quaero1, A_Quaero2, A_Quaero3
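A minimal sketch of a paired randomization test on two runs' per-concept infAP scores (illustrative only; NIST's exact procedure may differ in details such as the test statistic or number of permutations):

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization (sign-flip) test.

    scores_a, scores_b: per-concept infAP scores of two runs.
    Returns an estimated p-value for the observed paired difference.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the two runs are exchangeable on each
        # concept, so each per-concept difference keeps or flips its sign.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            extreme += 1
    return extreme / trials

# Identical runs yield p = 1.0, while a consistent gap
# across many concepts yields a small p-value.
p_same = randomization_test([0.2, 0.3, 0.1], [0.2, 0.3, 0.1])
p_diff = randomization_test([0.5] * 20, [0.1] * 20)
```

Two runs are then declared significantly different when the estimated p-value falls below 0.05.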
Significant differences among top B-category full runs (randomization test, p < 0.05):
Run name (mean infAP): B_dcu.LocalFeatureBoW_2 (0.046), B_vireo.SF_web_image_4 (0.019)
B_dcu.LocalFeatureBoW_2 > B_vireo.SF_web_image_4
Run name (mean infAP): D_TokyoTech_Canon_4 (0.172), D_vireo.A-SVM_3 (0.112), D_vireo.TradBoost_2 (0.076)
Significant differences among top D-category full runs (randomization test, p < 0.05):
D_TokyoTech_Canon_4 and D_vireo.A-SVM_3 each significantly better than D_vireo.TradBoost_2
Run name (mean infAP):
A_TokyoTech_Canon_1 (0.149)
A_TokyoTech_Canon_2 (0.148)
A_UvA.Leonardo_1 (0.142)
A_UvA.Raphael_3 (0.141)
A_UvA.Donatello_2 (0.140)
A_TokyoTech_Canon_3 (0.138)
A_UvA.Michelangelo_4 (0.121)
A_Quaero1 (0.120)
A_CMU4 (0.120)
A_Quaero2 (0.119)
Significant differences among top 10 A-category lite runs (randomization test, p < 0.05):
A_UvA.Leonardo_1 > A_CMU4, A_Quaero1, A_Quaero2, A_UvA.Michelangelo_4
A_UvA.Donatello_2 > A_CMU4, A_Quaero1, A_Quaero2, A_UvA.Michelangelo_4
A_UvA.Raphael_3 > A_CMU4, A_Quaero1, A_Quaero2, A_UvA.Michelangelo_4
A_TokyoTech_Canon_1 > A_TokyoTech_Canon_3, A_CMU4, A_Quaero1, A_Quaero2, A_UvA.Michelangelo_4
A_TokyoTech_Canon_2 > A_TokyoTech_Canon_3, A_CMU4, A_Quaero1, A_Quaero2, A_UvA.Michelangelo_4
Significant differences among top B-category lite runs (randomization test, p < 0.05):
Run name (mean infAP): B_dcu.LocalFeatureBoW_2 (0.039), B_vireo.SF_web_image_4 (0.017)
B_dcu.LocalFeatureBoW_2 > B_vireo.SF_web_image_4
Significant differences among top D-category lite runs (randomization test, p < 0.05):
Run name (mean infAP): D_TokyoTech_Canon_4 (0.148), D_vireo.A-SVM_3 (0.082), D_vireo.TradBoost_2 (0.054)
D_TokyoTech_Canon_4 and D_vireo.A-SVM_3 each significantly better than D_vireo.TradBoost_2
Observations

Site experiments include:
- focus on robustness, merging many different representations
- use of spatial pyramids
- improved bag-of-words approaches
- improved kernel methods
- sophisticated fusion strategies
- combination of low-level and intermediate/high-level features
- efficiency improvements (e.g. GPU implementations)
- analysis of more than one keyframe per shot
- audio analysis
- use of temporal context information
- not much use of motion information, metadata, or ASR
- use of external (ImageNet 1000-concept) data

Still not many experiments using external training data (main focus on category A), and no improvement was observed from using external training data.
Presentations to follow:
2:40 - 3:00, Tokyo Institute of Technology, Canon Corporation
3:00 - 3:20, PicSOM - Aalto University
3:20 - 3:40, CMU-Informedia - Carnegie Mellon University
3:40 - 4:00, Break in the NIST West Square Cafeteria
4:00 - 4:20, Quaero - Quaero Consortium
4:20 - 4:40, Discussion
Less participation – poll results – this year

Has the task become too big considering the video data? No (3). Close to the limit. Yes.
Has the task become too big considering the number of concepts? No (3). Yes (2); we did not participate for this reason, at least not in the full task.
Did the task not bring enough novelty compared to previous years? Yes, this is a concern; the task lacks excitement. Not so much; we found it sufficiently interesting to participate. Yes; a challenging topic for this year's task was the increase in the number of concepts.
Any other reason or issue with the task? US Aladdin program / MED task competition? Only 50 (of 346) concepts are evaluated in the testing phase; we would like to know how the mean infAP would change if the number of tested concepts were increased (lite versus full results already show some consistency).
Poll results – next year
Should we continue to increase the number of concepts for the full task? Why increase: what is the underlying scientific question? Possibly, but slowly. Slightly, or keep the current size. Yes, but the selected concepts should not be dropped as they were this year. It's okay to keep the number of concepts. No.
Should we keep, reduce, or increase the number of concepts for the lite task? No opinion. Reduce the number: it is important to be able to annotate the data with ground truth, which is not possible if there are too many concepts. Preferably fewer. Keep the current size (3).
Should we continue increasing the diversity of target concepts or not? Again, what is the scientific rationale? Maybe another task. Yes, definitely. Yes; how about adding concepts of human emotion? Yes.
Poll results – next year
Any other suggestion for introducing novelty in this task? Perhaps collecting training data automatically rather than using the collaborative annotations. Increase the diversity of video sources, in terms of countries and languages. Increase the diversity of evaluation measures, not confined to MAP. How about having multiple levels of appearance for positive samples? Consider an online variant.
Additional comments: Too much time is spent on extracting features; more effort should go into developing new frameworks and learning methods. Provide more auxiliary information, such as speech recognition results. The data size might be too big, and it seems that computation power and storage play a key role in getting promising results. Improve the quality of the videos. The low number of positive samples is a problem. Provide clearer specifications for all concepts; some concepts have very few positive instances. Suggest changing the data type every year.
Many thanks for the feedback!
SIN 2012
A maximum number of participants is good but not the goal; we want people to be happy with the proposed task.
What is the scientific rationale for many and diverse concepts? Potential applications require a large number of concepts and very diverse ones.
Scalability at the computing power level is not the only issue.
Relations between concepts (both explicit and implicit) may have a key role to play; this can be exploited and evaluated only at a sufficient scale.
Another possible novelty: Multiple levels of relevance for positive samples or ranking of positive samples
Same or similar task; same type of data; similar volume of data. Comparable or slightly reduced number of concepts. Better definition of concepts, better annotation. Encourage and provide infrastructure for sharing contributed elements: low-level features, detection scores, ...