Joint Visual-Text Modeling for Multimedia Retrieval


Transcript of Joint Visual-Text Modeling for Multimedia Retrieval

Page 1: Joint Visual-Text Modeling for Multimedia Retrieval

1

Joint Visual-Text Modeling for Multimedia Retrieval JHU CLSP Workshop 2004 – Final Presentation, August 17 2004

Page 2: Joint Visual-Text Modeling for Multimedia Retrieval

2

Team

Undergraduate Students: Desislava Petkova (Mt. Holyoke), Matthew Krause (Georgetown)

Graduate Students: Shaolei Feng (U. Mass), Brock Pytlik (JHU), Paola Virga (JHU)

Senior Researchers: Pinar Duygulu (Bilkent U., Turkey), Pavel Ircing (U. West Bohemia), Giri Iyengar (IBM Research), Sanjeev Khudanpur (CLSP, JHU), Dietrich Klakow (Uni. Saarland), R. Manmatha (CIIR, U. Mass Amherst), Harriet Nock (IBM Research, external participant)

Page 3: Joint Visual-Text Modeling for Multimedia Retrieval

3

Big Picture: Multimedia Retrieval Task

Find clips showing Yasser Arafat

[Diagram: VIDEO CLIPS feed a Multimedia Retrieval System. The text query “Yasser Arafat” is processed by Spoken Document Retrieval, the query image by Image Content-based Retrieval, and the scores are combined — Joint Visual-Text Models! ASR transcripts contain errors, e.g. “ … Palestinian leader Yes Sir You’re Fat today said …” for “ … Palestinian leader Yasser Arafat today said …”]

Most research has addressed:
I. Text queries, text (or degraded text) documents
II. Image queries, image data

Page 4: Joint Visual-Text Modeling for Multimedia Retrieval

4

Joint Visual-Text Modeling

[Diagram: the query “Yasser Arafat” plus a query image are processed by joint word-visterm retrieval; VIDEO CLIPS with ASR text such as “ … [Yes sir, you’re fat today said] …” are indexed as documents of words and visterms.]

Query of words -> document of words: retrieve documents using p(Document|Query)

Query of words and visterms -> document of words and visterms: retrieve documents using p(dw,dv | qw,qv)

Page 5: Joint Visual-Text Modeling for Multimedia Retrieval

5

Joint Visual-Text Modeling: KEY GOAL

Show that joint visual-text modeling improves multimedia retrieval

Demonstrate and evaluate the performance of these models on the TRECVID 2003 corpus and task

Page 6: Joint Visual-Text Modeling for Multimedia Retrieval

6

Key Steps

Automatically annotate video with concepts (meta-data)
E.g., the video contains a face, in a studio environment, …

Retrieve video
Given a query, select suitable meta-data for the query and retrieve
Combine with text retrieval in a unified language-model-based IR setting

Page 7: Joint Visual-Text Modeling for Multimedia Retrieval

7

TRECVID Corpus and Task

Corpus
Broadcast news videos used for Hub4 evaluations (ABC, CNN, C-SPAN)
120 hours of video

Tasks
Shot-boundary detection
News story segmentation (multimodal)
Concept detection (annotation)
Search task

Page 8: Joint Visual-Text Modeling for Multimedia Retrieval

8

Alternate (development) Corpus

COREL photograph database
5000 high-quality photographs with captions

Task: Annotation

Page 9: Joint Visual-Text Modeling for Multimedia Retrieval

9

TRECVID Search task definition

Statement of Information need + Examples

Manual Selection of System Parameters

Ranked list of video shots

Manual / Interactive

NIST Evaluation

Page 10: Joint Visual-Text Modeling for Multimedia Retrieval

10

Our search task definition

Statement of Information need + Examples

Automatic Selection of System Parameters

Ranked list of video shots

Isolate algorithmic issues from interface and user issues

NIST Evaluation

Page 11: Joint Visual-Text Modeling for Multimedia Retrieval

11

Language Model based Retrieval

[Diagram: 2×2 matrix of query (words qw, visterms qv) against document (words dw, visterms dv)]

Baseline model
Relating document visterms to query words (MT, Relevance Models, HMMs)
Relating document words to query images (text classification experiments)
Visual-only retrieval models

Rank documents with p(qw,qv | dw,dv)

Page 12: Joint Visual-Text Modeling for Multimedia Retrieval

12

Evaluation

Concept annotation performance
Compare against manual ground truth

Retrieval task performance

Compare against NIST relevance judgements

Both measured using Mean Average Precision (mAP)

Page 13: Joint Visual-Text Modeling for Multimedia Retrieval

13

Mean Average Precision (mAP)

mAP = (1/|T|) Σ_{t∈T} AP(t)

AP(t) = S(t) / |rel(t)|,   where   S(t) = Σ_{i ∈ {relevant}} precision(i)

T is the set of queries (or concepts), rel(t) the relevant items for query t, and precision(i) the precision at the rank of the i-th relevant item.
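This definition maps directly onto a few lines of code. The following is a minimal illustrative sketch (not part of the workshop software): it computes AP for one ranked list against a set of relevance judgements and averages over queries.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP(t): average of the precision values at the ranks of the relevant items."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank          # precision at this relevant document
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query/concept t in T."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```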

Page 14: Joint Visual-Text Modeling for Multimedia Retrieval

14

Experimental Setup: Corpora

TRECVID03 Corpus (120 hours, ground truth on dev data)
  Train: 38K shots
  DevTest: 10K shots
  TRECVID03 IR Collection: 32K shots

COREL Corpus (5000 images)
  Train: 4500 images
  Test: 500 images

Page 15: Joint Visual-Text Modeling for Multimedia Retrieval

15

Experimental Setup: Visual Features

[Figure: an original keyframe with its L*a*b, edge strength, and co-occurrence feature maps]

Page 16: Joint Visual-Text Modeling for Multimedia Retrieval

16

Interest Point Neighborhoods (Harris detector)

Greyscale image Interest points

Page 17: Joint Visual-Text Modeling for Multimedia Retrieval

17

Experimental Setup: Visual Feature list

Regular partition
  L*a*b moments (COLOR)
  Smoothed edge orientation histogram (EDGE)
  Grey-level co-occurrence matrix (TEXTURE)

Interest point neighborhood
  COLOR, EDGE, TEXTURE

Page 18: Joint Visual-Text Modeling for Multimedia Retrieval

18

Presentation Outline

[Diagram: query/document words-visterms matrix]

Translation (MT) models (Paola)
Relevance Models (Shaolei, Desislava)
Graphical Models (Pavel, Brock)
Text classification models (Matt)
Integration & Summary (Dietrich)

Page 19: Joint Visual-Text Modeling for Multimedia Retrieval

19

A Machine Translation Approach

to Image Annotation

Presented by Paola Virga

Page 20: Joint Visual-Text Modeling for Multimedia Retrieval

20

Presentation Outline

[Diagram: query/document words-visterms matrix]

Translation (MT) models

p(qw | dV) = Σ_c p(qw | c) p(c | dV)

Page 21: Joint Visual-Text Modeling for Multimedia Retrieval

21

Inspiration from Machine Translation

p(f|e) = Σ_a p(f,a|e)
p(c|v) = Σ_a p(c,a|v)

Direct translation model
[Figure: image regions aligned to the words “tiger” and “grass”, analogous to word alignment in MT]

Page 22: Joint Visual-Text Modeling for Multimedia Retrieval

22

Discrete Representation of Image Regions (visterms) to create analogy to MT

In machine translation we have discrete tokens; in our task, however, the features extracted from image regions are continuous.

Solution: vector quantization into visterms: {fn1, fn2, … fnm} -> vk

[Figure: example images with concept annotations (e.g. “sun sky sea waves”, “tiger water grass”, “water harbor sky clouds sea”) paired with their quantized visterm sequences (e.g. v10 v22 v35 v43)]
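As an illustration of this vector-quantization step, here is a hedged sketch using scikit-learn's KMeans; the 500-visterm vocabulary size follows the setup described elsewhere in the talk, while the `region_features` array is a hypothetical stand-in for the extracted block or interest-point features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in: one row of continuous features (colour/edge/texture)
# per image region, as produced by the block or interest-point extraction.
region_features = np.random.rand(10000, 30)

# Quantize the continuous region features into a discrete visterm vocabulary.
vq = KMeans(n_clusters=500, random_state=0).fit(region_features)

def to_visterms(image_regions):
    """Map an image's region feature vectors {f1, ..., fm} to visterm ids vk."""
    return [f"v{k}" for k in vq.predict(image_regions)]
```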

Page 23: Joint Visual-Text Modeling for Multimedia Retrieval

23

Image annotation using translation probabilities

p(c|v): probabilities obtained from direct translation

P0(c | dV) = (1/|dV|) Σ_{v∈dV} P(c | v)

[Figure: computing p(sun | image) from the image’s visterms v10 v22 v35 v43]
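A small sketch of the annotation rule P0(c|dV) above, assuming the translation table p(c|v) has already been learned (e.g. with a Model 1-style aligner); the dictionary format is a hypothetical choice for illustration.

```python
from collections import defaultdict

def annotate(visterms, p_c_given_v, top_n=5):
    """Score each concept c as (1/|dV|) * sum over visterms v of p(c|v)."""
    scores = defaultdict(float)
    for v in visterms:
        # p_c_given_v: {visterm: {concept: probability}} translation table
        for c, p in p_c_given_v.get(v, {}).items():
            scores[c] += p / len(visterms)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```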

Page 24: Joint Visual-Text Modeling for Multimedia Retrieval

24

Annotation Results (Corel set)

field foals horses maretree horses foals mare field

flowers leaf petals stemsflowers leaf petals grass tulip

people pool swimmers waterswimmers pool people water sky

mountain sky snow watersky mountain water clouds snow

jet plane sky sky plane jet tree clouds

people sand sky water sky water beach people hills

Top: manual annotations; bottom: predicted words (top 5 words with the highest probability). Red: correct matches.

Page 25: Joint Visual-Text Modeling for Multimedia Retrieval

25

Feature selection

Features: color, texture, edge
Extracted from blocks, or around interest points

Observations
Features extracted from blocks give better performance than features extracted around interest points
When the features are used individually, edge features give the best performance
Training using all features is the best
Using Information Gain to select the visterm vocabulary didn’t help
Integrating the number of faces increases the performance slightly

mAP values for different features

Page 26: Joint Visual-Text Modeling for Multimedia Retrieval

26

Model and iteration selection

Strategies compared:
(a) IBM Model 1 p(c|v)
(b) HMM on top of (a)
(c) IBM Model 4 on top of (b)
-> Observation: IBM Model 1 is the best

The number of iterations in GIZA training affects the performance
-> Fewer iterations give better annotation performance, but cannot produce rare words

mAP: Corel 0.125, TREC 0.124

Page 27: Joint Visual-Text Modeling for Multimedia Retrieval

27

Integrating word co-occurrences: Model 1 with word co-occurrence

Integrating word co-occurrences into the model helps for Corel but not for TREC

P1(ci | dV) = Σ_{j=1}^{|C|} P(ci | cj) P0(cj | dV)

Model                Corel   TREC
Model 1              0.125   0.124
Model 1 + Word-CO    0.145   0.124

Page 28: Joint Visual-Text Modeling for Multimedia Retrieval

28

Inspiration from CLIR

Treat image annotation as a cross-lingual IR problem: a visual document comprising visterms (target language) and a query comprising a concept (source language)

p(c | dV) = λ ( Σ_{v∈dV} p(c|v) p(v|dV) ) + (1−λ) p(c|G)      [the second term is the same for all dV]

Page 29: Joint Visual-Text Modeling for Multimedia Retrieval

29

Inspiration from CLIR

Treat image annotation as a cross-lingual IR problem: a visual document comprising visterms (target language) and a query comprising a concept (source language)

p(c | dV) = Σ_{v∈dV} p(v|dV) p(c|v)

The image does not provide a good estimate of p(v|dV); we tried p(v) and DF(v), and DF works best:

score(c | dV) = Σ_{v∈dV} DF_Train(v) p(c|v)
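A sketch of the DF-weighted scoring rule above, assuming the translation table p(c|v) and the training-set document frequencies DF(v) are available as plain dictionaries (the names are hypothetical).

```python
def clir_score(visterms, p_c_given_v, df_train, concepts):
    """score(c|dV) = sum over v in dV of DF_Train(v) * p(c|v)."""
    return {c: sum(df_train.get(v, 0) * p_c_given_v.get(v, {}).get(c, 0.0)
                   for v in visterms)
            for c in concepts}
```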

Page 30: Joint Visual-Text Modeling for Multimedia Retrieval

30

Annotation Performance on TREC

Model                 mAP
Model 1               0.124
CLIR using Model 1    0.126

Significant at p=0.04

Average Precision values for the top 10 words: for some concepts we achieved up to 0.6

Page 31: Joint Visual-Text Modeling for Multimedia Retrieval

31

Annotation Performance on TREC

Page 32: Joint Visual-Text Modeling for Multimedia Retrieval

32

Questions?

Page 33: Joint Visual-Text Modeling for Multimedia Retrieval

33

Relevance Models for Image Annotation
Presented by Shaolei Feng, University of Massachusetts, Amherst

Page 34: Joint Visual-Text Modeling for Multimedia Retrieval

34

Relevance Models as Visual Model

[Diagram: query/document words-visterms matrix]

Goal: use Relevance Models to estimate the probabilities of concepts given test keyframes

p(qw | dv) = Σ_c p(qw | c) p(c | dv)

Page 35: Joint Visual-Text Modeling for Multimedia Retrieval

35

Intuition

Images are defined by spatial context; isolated pixels have no meaning.
Context simplifies recognition/retrieval.
E.g., a tiger is associated with grass, trees, water, forest.

Less likely to be associated with computers.

Page 36: Joint Visual-Text Modeling for Multimedia Retrieval

36

Introduction to Relevance Models

Originally introduced for text retrieval and cross-lingual retrieval
(Lavrenko and Croft 2001; Lavrenko, Choquette and Croft 2002)
A formal approach to query expansion

A nice way of introducing context in images without having to do so explicitly: compute the joint probability of images and words

Page 37: Joint Visual-Text Modeling for Multimedia Retrieval

37

Cross Media Relevance Models (CMRM)

Two parallel vocabularies: words and visterms
Analogous to cross-lingual relevance models
Estimate the joint probabilities of words and visterms from training images

[Figure: relevance class R linking an image to the words Tiger, Tree, Grass]

P(c, dv) = Σ_{J∈T} P(J) P(c|J) Π_{i=1}^{|dv|} P(vi|J)

J. Jeon, V. Lavrenko and R. Manmatha, Automatic Image Annotation and Retrieval using Cross-Media Relevance Models, in Proc. SIGIR’03.
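To make the CMRM estimate concrete, here is a minimal sketch assuming the smoothed per-image estimates P(c|J) and P(v|J) have been precomputed (as described later for the relevance-model estimates) and that P(J) is uniform; the data structures are hypothetical.

```python
def cmrm_joint(concept, test_visterms, training_images):
    """P(c, dv) = sum over training images J of P(J) * P(c|J) * prod_i P(vi|J).

    training_images: list of (p_c_given_J, p_v_given_J) dictionaries, one per J.
    """
    total = 0.0
    p_J = 1.0 / len(training_images)          # uniform prior over training images
    for p_c_given_J, p_v_given_J in training_images:
        term = p_J * p_c_given_J.get(concept, 1e-12)
        for v in test_visterms:
            term *= p_v_given_J.get(v, 1e-12)  # smoothed estimates assumed
        total += term
    return total
```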

Page 38: Joint Visual-Text Modeling for Multimedia Retrieval

38

Continuous Relevance Models (CRM)

A continuous version of the Cross Media Relevance Model
Estimate P(v|J) using a kernel density estimate:

P(v|J) = (1/n_J) Σ_{i=1}^{|J|} K( (v − v_i) / β )

K: Gaussian kernel, β: bandwidth
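A sketch of this kernel density estimate with an isotropic Gaussian kernel; the bandwidth value and per-image feature matrix are assumptions made only for illustration.

```python
import numpy as np

def p_v_given_J(v, J_features, beta=1.0):
    """P(v|J): mean of Gaussian kernels centred on the region features of J.

    J_features: (n_J, d) matrix of region feature vectors of training image J.
    """
    n_J, d = J_features.shape
    diffs = (v - J_features) / beta
    sq_dist = np.sum(diffs ** 2, axis=1)
    kernels = np.exp(-0.5 * sq_dist) / ((2 * np.pi) ** (d / 2) * beta ** d)
    return kernels.mean()
```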

Page 39: Joint Visual-Text Modeling for Multimedia Retrieval

39

Continuous Relevance Model

A generative model
Concept words wj generated by an i.i.d. sample from a multinomial
Visterms vi generated by a multi-variate (Gaussian) density

Page 40: Joint Visual-Text Modeling for Multimedia Retrieval

40

Normalized Continuous Relevance Models

Normalized CRM: pad annotations to a fixed length, then use the CRM
Similar to using a Bernoulli model (rather than a multinomial) for words
Accounts for length (similar to document length in text retrieval)

S. L. Feng, V. Lavrenko and R. Manmatha, Multiple Bernoulli Models for Image and Video Annotation, in CVPR’04
V. Lavrenko, S. L. Feng and R. Manmatha, Statistical Models for Automatic Video Annotation and Retrieval, in ICASSP’04

Page 41: Joint Visual-Text Modeling for Multimedia Retrieval

41

Annotation Performance

On the Corel data set, Normalized-CRM works best:

Model                    CMRM   CRM    Normalized-CRM
Mean average precision   0.14   0.23   0.26

Page 42: Joint Visual-Text Modeling for Multimedia Retrieval

42

Annotation Examples (Corel set)

Sky train railroad locomotive water

Cat tiger bengaltree forest

Snow fox arctic tails water

Mountain plane jet water sky

Tree plane zebra herd water

Birds leaf nest water sky

Page 43: Joint Visual-Text Modeling for Multimedia Retrieval

43

Results: Relevance Model on the TREC Video Set

Model: normalized continuous relevance model
Features: color and texture
Our comparison experiments show that adding the edge feature gives only a very slight improvement

Annotation evaluated on the development dataset

Mean average precision: 0.158

Page 44: Joint Visual-Text Modeling for Multimedia Retrieval

44

Annotation Performance on TREC

Page 45: Joint Visual-Text Modeling for Multimedia Retrieval

45

Proposal: Using Dynamic Information for Video Retrieval
Presented by Shaolei Feng, University of Massachusetts, Amherst

Page 46: Joint Visual-Text Modeling for Multimedia Retrieval

46

Motivation

Current models are based on single frames in each shot, but video is dynamic: it has motion information.

Use dynamic (motion) information for:
Better image representations (segmentations)
Modeling events/actions

Page 47: Joint Visual-Text Modeling for Multimedia Retrieval

47

Why Dynamic Information

Model actions/events
Many TRECVID 2003 queries require motion information, e.g.:
  find shots of an airplane taking off
  find shots of a person diving into water
Motion is an important cue for retrieving actions/events.
But using the optical flow over the entire image doesn’t help; use motion features from objects.

Better image representations
It is much easier to segment moving objects from the background than to segment static images.

Page 48: Joint Visual-Text Modeling for Multimedia Retrieval

48

Problems with still images

Current approach: retrieve videos using static frames.
Feature representations: visterms from keyframes; rectangular partition or static segmentation, poorly correlated with objects; features – color, texture, edges.

Problem: visterms not correlated well with concepts.

Page 49: Joint Visual-Text Modeling for Multimedia Retrieval

49

Better visterms – better results

The model performs well on related tasks, e.g. retrieval of handwritten manuscripts:
  Visterms – word images
  Features computed over word images
  Annotations – the ASCII word, e.g. “you are to be particularly careful”
  Segmentation of words is easier, so visterms are better correlated with concepts.

So can we extend the analogy to this domain…

Page 50: Joint Visual-Text Modeling for Multimedia Retrieval

50

Segmentation Comparison

Pictures from Patrick Bouthemy’s Website, INRIA

a: Segmentation using only still image information

b: Segmentation using only motion information

Page 51: Joint Visual-Text Modeling for Multimedia Retrieval

51

Represent Shots, not Keyframes

Shot boundary detection: use standard techniques.
Segment moving objects, e.g. by finding outliers from the dominant (camera) motion.
Visual features for object and background.
Motion features for the object, e.g. trajectory information.
Motion features for the background: camera pan, zoom, …

Page 52: Joint Visual-Text Modeling for Multimedia Retrieval

52

Models

One approach: modify the relevance model to include motion information, and probabilistically annotate shots in the test set.
Other models, e.g. HMMs, are also possible.

P(c, (dv, dm)) = Σ_{S∈T} P(S) P(c|S) Π_{i=1}^{|d|} P(vi|S) P(mi|S)

T: training set, S: shots in the training set

(compare the still-image form: P(c, dv) = Σ_{J∈T} P(J) P(c|J) Π_{i=1}^{|dv|} P(vi|J))

Page 53: Joint Visual-Text Modeling for Multimedia Retrieval

53

Estimating P(vi|S), P(mi|S)

If visterms are discrete, use smoothed maximum-likelihood estimates; if continuous, use kernel density estimates.

Take advantage of repeated instances of the same object in shot.

Page 54: Joint Visual-Text Modeling for Multimedia Retrieval

54

Plan

Modify models to include dynamic information
Train on the TRECVID03 development dataset
Test on the TRECVID03 test dataset
Annotate the test set
Retrieve using TRECVID 2003 queries
Evaluate retrieval performance using mean average precision

Page 55: Joint Visual-Text Modeling for Multimedia Retrieval

55

Score Normalization Experiments

Presented by Desislava Petkova

Page 56: Joint Visual-Text Modeling for Multimedia Retrieval

56

Motivation for Score Normalization

Score probabilities are small, but there seems to be discriminating power.
Try to use likelihood ratios.

Page 57: Joint Visual-Text Modeling for Multimedia Retrieval

57

Bayes Optimal Decision Rule

P(w|s) = r(s) / (1 + r(s)),   where

r(s) = P(w|s) / P(w̄|s)
     = [P(w) P(s|w)] / [P(w̄) P(s|w̄)]
     = [p(w) pdf(s|w)] / [p(w̄) pdf(s|w̄)]

Page 58: Joint Visual-Text Modeling for Multimedia Retrieval

58

Estimating Class-Conditional PDFs

For each word:
Divide training images into positive and negative examples
Create a model to describe the score distribution of each set: Gamma, Beta, Normal, Lognormal

Revise word probabilities
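As a sketch of this procedure, the snippet below fits one of the candidate families (Gamma) to the positive and negative score sets with scipy and rescores with the resulting likelihood ratio; it is illustrative only, not the workshop implementation, and the function names are hypothetical.

```python
from scipy import stats

def likelihood_ratio_scorer(pos_scores, neg_scores):
    """Fit class-conditional Gamma pdfs to the scores of positive and negative
    training images for one word, then rescore with
    r(s) = p(w)*pdf(s|w) / p(w_bar)*pdf(s|w_bar)."""
    pos_params = stats.gamma.fit(pos_scores)
    neg_params = stats.gamma.fit(neg_scores)
    prior_pos = len(pos_scores) / (len(pos_scores) + len(neg_scores))
    prior_neg = 1.0 - prior_pos

    def ratio(s):
        num = prior_pos * stats.gamma.pdf(s, *pos_params)
        den = prior_neg * stats.gamma.pdf(s, *neg_params)
        return num / den if den > 0 else float("inf")

    return ratio
```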

Page 59: Joint Visual-Text Modeling for Multimedia Retrieval

59

Annotation Performance

Did not improve annotation performance on Corel or TREC

Page 60: Joint Visual-Text Modeling for Multimedia Retrieval

60

Proposal: Using Clustering to Improve Concept Annotation
Desislava Petkova, Mount Holyoke College, 17 August 2004

Page 61: Joint Visual-Text Modeling for Multimedia Retrieval

61

Automatically annotating images

Corel: 5000 images (4500 training, 500 testing)
Word vocabulary: 374 words
Annotations: 1-5 words
Image vocabulary: 500 visterms

Page 62: Joint Visual-Text Modeling for Multimedia Retrieval

62

Relevance models for annotation

A generative language modeling approach.
For a test image I = {v1, …, vm}, compute the joint distribution of each word w in the vocabulary with the visterms of I, by comparing I with training images J annotated with w:

P(w, I) = Σ_{J∈T} P(J) P(w, I | J) = Σ_{J∈T} P(J) P(w|J) Π_{i=1}^{m} P(vi|J)

Page 63: Joint Visual-Text Modeling for Multimedia Retrieval

63

Estimating P(w|J) and P(v|J)

Use maximum-likelihood estimates, smoothed with the entire training set T:

P(w|J) = (1−a) c(w,J)/|J| + a c(w,T)/|T|
P(v|J) = (1−b) c(v,J)/|J| + b c(v,T)/|T|
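A sketch of these smoothed estimates, assuming the per-image annotations and visterms are available as lists and the training-set counts as dictionaries (the names and default weights are hypothetical).

```python
def p_w_given_J(w, J_words, T_word_counts, T_total, a=0.1):
    """P(w|J) = (1-a)*c(w,J)/|J| + a*c(w,T)/|T|."""
    return (1 - a) * J_words.count(w) / len(J_words) + a * T_word_counts.get(w, 0) / T_total

def p_v_given_J(v, J_visterms, T_visterm_counts, T_total, b=0.1):
    """P(v|J) = (1-b)*c(v,J)/|J| + b*c(v,T)/|T|."""
    return (1 - b) * J_visterms.count(v) / len(J_visterms) + b * T_visterm_counts.get(v, 0) / T_total
```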

Page 64: Joint Visual-Text Modeling for Multimedia Retrieval

64

Motivation

Estimating the relevance model of a single image is a noisy process:
P(v|J): visterm distributions are sparse
P(w|J): human annotations are incomplete

Use clustering to get better estimates

Page 65: Joint Visual-Text Modeling for Multimedia Retrieval

65

Potential benefits of clustering

{cat, grass, tiger, water}

{cat, grass, tiger}{water}

{cat, grass, tiger, tree}

{grass, tiger, water}{cat}

Words in red are missing in the annotation

Page 66: Joint Visual-Text Modeling for Multimedia Retrieval

66

Relevance Models with Clustering

Cluster the training images using K-means, using both visterms and annotations
Compute the joint distribution of visterms and words in each cluster
Use clusters instead of individual images:

P(w, I) = Σ_{C} P(C) P(w|C) Π_{i=1}^{m} P(vi|C)
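A sketch of the clustering step, using scikit-learn's KMeans on concatenated word and visterm count vectors (both representations, as stated above); the 100-cluster setting mirrors the preliminary experiment on the next slide, and the input arrays are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_training_images(word_counts, visterm_counts, k=100):
    """Cluster training images on joint word + visterm count vectors.

    word_counts, visterm_counts: arrays of shape (n_images, vocabulary size).
    Returns the cluster id C(J) assigned to each training image J.
    """
    features = np.hstack([word_counts, visterm_counts]).astype(float)
    return KMeans(n_clusters=k, random_state=0).fit(features).labels_
```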

Page 67: Joint Visual-Text Modeling for Multimedia Retrieval

67

Preliminary results on annotation performance

Model                                                    mAP
Standard relevance model (4500 training examples)        0.14
Relevance model with clusters (100 training examples)    0.128

Page 68: Joint Visual-Text Modeling for Multimedia Retrieval

68

Cluster-based smoothing

Smooth maximum likelihood estimates for the training images based on the clusters they belong to:

P(w|J) = (1−a1−a2) c(w,J)/|J| + a1 c(w,C_J)/|C_J| + a2 c(w,T)/|T|
P(v|J) = (1−b1−b2) c(v,J)/|J| + b1 c(v,C_J)/|C_J| + b2 c(v,T)/|T|

Page 69: Joint Visual-Text Modeling for Multimedia Retrieval

69

Experiments

Optimize smoothing parameters
Divide the training set: 4000 training images, 500 validation images
Find the best set of clusters
Query-dependent clusters
Investigate soft clustering

Page 70: Joint Visual-Text Modeling for Multimedia Retrieval

70

Evaluation plan

Retrieval performance

Average precision and recall for one-word queries

Comparison with the standard relevance model

Page 71: Joint Visual-Text Modeling for Multimedia Retrieval

71

Hidden Markov Models for Image Annotations
Pavel Ircing, Sanjeev Khudanpur

Page 72: Joint Visual-Text Modeling for Multimedia Retrieval

72

Presentation Outline

[Diagram: query/document words-visterms matrix]

Translation (MT) models (Paola)
Relevance Models (Shaolei, Desislava)
Graphical Models (Pavel, Brock)
Text classification models (Matt)
Integration & Summary (Dietrich)

Page 73: Joint Visual-Text Modeling for Multimedia Retrieval

73

Model setup

[Figure: an image partitioned into blocks, annotated with the words tiger, ground, water, grass]

Training HMMs: a separate HMM for each training image, with states given by the manual annotations; image blocks are “generated” by annotation words.
The alignment between image blocks and annotation words is a hidden variable; models are trained using the EM algorithm (HTK toolkit).

The test HMM has |W| states; 2 scenarios:
(a) p(w’|w) uniform
(b) p(w’|w) from a co-occurrence LM

The posterior probability from the forward-backward pass is used for p(w|Image).

Page 74: Joint Visual-Text Modeling for Multimedia Retrieval

74

Challenges in HMM training

Inadequate annotations
There is no notion of order in the annotation words
Difficulties with automatic alignment between words and image regions
No linear order in image blocks (assume raster-scan)
Additional spatial dependence between block labels is missed; partially addressed via a more complex DBN (see later)

Page 75: Joint Visual-Text Modeling for Multimedia Retrieval

75

Inadequacy of the annotations

TRECVID database: annotation concepts capture mostly the semantics of the image and are not very suitable for describing visual properties
[Example frame annotated: car, transportation, vehicle, outdoors, non-studio setting, nature-non-vegetation, snow, man-made object]

Corel database: annotators often mark only interesting objects
[Example image annotated: beach, palm, people, tree]

Page 76: Joint Visual-Text Modeling for Multimedia Retrieval

76

Alignment problems

There is no notion of order in the annotation words

Difficulties with automatic alignment between words and image regions

Page 77: Joint Visual-Text Modeling for Multimedia Retrieval

77

Gradual Training

Identify a set of “background” words (sky, grass, water, ...)
In the initial stages of HMM training, allow only “background” states to have their individual emission probability distributions; all other objects share a single “foreground” distribution
Run several EM iterations
Gradually untie the “foreground” distribution and run more EM iterations

Page 78: Joint Visual-Text Modeling for Multimedia Retrieval

78

Gradual Training Results

Results:
Improved alignment of training images
Annotation performance on test images did not change significantly

Page 79: Joint Visual-Text Modeling for Multimedia Retrieval

79

Other training scenarios: models were forced to visit every state during training

huge models, marginal difference in performance

special states introduced to account for unlabelled background and unlabelled foreground, with different strategies for parameter tying

Page 80: Joint Visual-Text Modeling for Multimedia Retrieval

80

Annotation performance - Corel

Image features                        LM    mAP
Discrete                              No    0.120
Discrete                              Yes   0.150
Continuous (1 Gaussian per state)     No    0.140
Continuous (1 Gaussian per state)     Yes   0.157

Continuous features are better than discrete
The co-occurrence language model also gives a moderate improvement

Page 81: Joint Visual-Text Modeling for Multimedia Retrieval

81

Annotation performance - TRECVID

Model                     LM    mAP
1 Gaussian per state      No    0.094
12 Gaussians per state    No    0.145

Continuous features only, no language model

Page 82: Joint Visual-Text Modeling for Multimedia Retrieval

82

Annotation Performance on TREC

Page 83: Joint Visual-Text Modeling for Multimedia Retrieval

83

Summary: HMM-Based Annotation

Very encouraging preliminary results
The effort started this summer, was validated on Corel, and yielded competitive annotation results on TREC

Initial findings
Proper normalization of the features is crucial for system performance: bug found and fixed on Friday!
Simple HMMs seem to work best
More complex training topology didn’t really help
More complex parameter tying was only marginally helpful

Glaring gaps
Need a good way to incorporate a language model

Page 84: Joint Visual-Text Modeling for Multimedia Retrieval

84

Graphical Models for Image Annotation + Joint Segmentation and Labeling for Content Based Image Retrieval

Brock Pytlik, Johns Hopkins University, [email protected]

Page 85: Joint Visual-Text Modeling for Multimedia Retrieval

85

Outline

Graphical Models for Image Annotation
  Hidden Markov Models: Preliminary Results
  Two-Dimensional HMMs: Work in Progress

Joint Image Segmentation and Labeling
  Tree Structure Models of Image Segmentation: Proposed Research

Page 86: Joint Visual-Text Modeling for Multimedia Retrieval

86

Graphical Model Notation

[Figure: an image with regions labeled tiger, ground, water, grass; a chain of concept states C1, C2, C3, each ranging over {water, ground, grass, tiger} and emitting an observation O1, O2, O3 via p(o|c), with transitions p(c|c’) between states]

Page 87: Joint Visual-Text Modeling for Multimedia Retrieval

87

Graphical Model Notation

[Figure: the same model, with the label of the first state instantiated as “water”]

Page 88: Joint Visual-Text Modeling for Multimedia Retrieval

88

Graphical Model Notation

[Figure: the same model, with the labels of the first two states instantiated as “water”]

Page 89: Joint Visual-Text Modeling for Multimedia Retrieval

89

Graphical Model Notation

[Figure: the same model, with the first two states labeled “water” and the remaining state still undetermined]

Page 90: Joint Visual-Text Modeling for Multimedia Retrieval

90

Graphical Model Notation Simplified

[Figure: An HMM for a 24-block Image]

Page 91: Joint Visual-Text Modeling for Multimedia Retrieval

91

Graphical Model Notation Simplified

An HMM for a 24-block Image

Page 92: Joint Visual-Text Modeling for Multimedia Retrieval

92

Modeling Spatial Structure

An HMM for a 24-block Image

Page 93: Joint Visual-Text Modeling for Multimedia Retrieval

93

Modeling Spatial Structure

An HMM for a 24-block Image
Transition probabilities represent the spatial extent of objects

Page 94: Joint Visual-Text Modeling for Multimedia Retrieval

94

Modeling Spatial Structure

Transition probabilities represent spatial extent of objects

A Two-Dimensional Model for a 24-block Image

Page 95: Joint Visual-Text Modeling for Multimedia Retrieval

95

Modeling Spatial Structure

Transition probabilities represent spatial extent of objects

A Two-Dimensional Model for a 24-block Image

Model     Training Time Per Image   Training Time Per Iteration
1-D HMM   0.5 sec                   37.5 min
2-D HMM   110 sec                   8250 min = 137.5 hr

Page 96: Joint Visual-Text Modeling for Multimedia Retrieval

96

Bag-of-Annotations Training

Unlike ASR, annotation words are unordered.

Constraint on Ct for an image annotated “Tiger, Sky, Grass”:

p(Mt = 1) = 1 if ct ∈ {tiger, grass, sky}, 0 otherwise

Page 97: Joint Visual-Text Modeling for Multimedia Retrieval

97

Bag-of-Annotations Training (II): Forcing Annotation Words to Contribute

Mt(1) = Mt−1(1) ∨ (Ct = tiger)
Mt(2) = Mt−1(2) ∨ (Ct = grass)
Mt(3) = Mt−1(3) ∨ (Ct = sky)

Only permit paths that visit every annotation word.

Page 98: Joint Visual-Text Modeling for Multimedia Retrieval

98

Inference on Test Images: Forward Decoding

p(c | dv) = p(c, dv) / p(dv)

Page 99: Joint Visual-Text Modeling for Multimedia Retrieval

99

Inference on Test Images: Forward Decoding

p(c | dv) = p(c, dv) / p(dv)

p(c, dv) = Σ_{S∋c} [ Π_{i=1}^{N} p(vi|si) ] p(S)

Page 100: Joint Visual-Text Modeling for Multimedia Retrieval

100

Inference on Test Images: Forward Decoding

p(c | dv) = p(c, dv) / p(dv)

p(c, dv) = Σ_{S∋c} [ Π_{i=1}^{N} p(vi|si) ] p(S)
p(dv)    = Σ_{S}   [ Π_{i=1}^{N} p(vi|si) ] p(S)

Page 101: Joint Visual-Text Modeling for Multimedia Retrieval

101

Inference on Test Images

Forward Decoding:
p(c | dv) = p(c, dv) / p(dv),   with
p(c, dv) = Σ_{S∋c} [ Π_{i=1}^{N} p(vi|si) ] p(S)   and   p(dv) = Σ_{S} [ Π_{i=1}^{N} p(vi|si) ] p(S)

Viterbi Decoding: approximate the sum over all paths with the best path.

Page 102: Joint Visual-Text Modeling for Multimedia Retrieval

102

Annotation Performance on Corel Data

Image features           mAP
Discrete                 0.071
Discrete                 0.086
Continuous               0.074
Discrete / Continuous    (training, TBD)

Working with 2-D models needs further study
mAP is not yet on par with other models

Page 103: Joint Visual-Text Modeling for Multimedia Retrieval

103

Future Work

Improved training for two-dimensional models
Permits training horizontal and vertical chains separately:

p(c_{i,j} | c_{i-1,j}, c_{i,j-1}) ∝ p(c_{i,j} | c_{i-1,j}) p(c_{i,j} | c_{i,j-1})

Other variations could be investigated

Next idea: Joint Image Segmentation and Labeling

Page 104: Joint Visual-Text Modeling for Multimedia Retrieval

104

Joint Segmentation and Labeling

tiger, grass, sky

Page 105: Joint Visual-Text Modeling for Multimedia Retrieval

105

Joint Segmentation and Labeling

tiger, grass, sky

Page 106: Joint Visual-Text Modeling for Multimedia Retrieval

106

Joint Segmentation and Labeling

tiger, grass, sky

Page 107: Joint Visual-Text Modeling for Multimedia Retrieval

107

Joint Segmentation and Labeling

tiger, grass, sky

sky

tiger

grass

sky

tiger

grass

Page 108: Joint Visual-Text Modeling for Multimedia Retrieval

108

Research Proposal

A Generative Model for Joint Segmentation and Labeling
Tree construction by agglomerative clustering of image regions (blocks) based on visual similarity
Segmentation = a cut across the resulting tree
Labeling = assigning concepts to the resulting leaves

Page 109: Joint Visual-Text Modeling for Multimedia Retrieval

109

Model

General model:
p(c, dv) = Σ_{u∈cuts(tree(dv))} p(u) Π_{l∈leaves(u)} p(c_l | u, l) p(obs(l) | c_l)

Page 110: Joint Visual-Text Modeling for Multimedia Retrieval

110

Model

General model:
p(c, dv) = Σ_{u∈cuts(tree(dv))} p(u) Π_{l∈leaves(u)} p(c_l | u, l) p(obs(l) | c_l)

p(u): probability of the cut

Page 111: Joint Visual-Text Modeling for Multimedia Retrieval

111

Model

General model:
p(c, dv) = Σ_{u∈cuts(tree(dv))} p(u) Π_{l∈leaves(u)} p(c_l | u, l) p(obs(l) | c_l)

p(c_l | u, l): probability of the label given the cut and the leaf

Page 112: Joint Visual-Text Modeling for Multimedia Retrieval

112

Model

General model:
p(c, dv) = Σ_{u∈cuts(tree(dv))} p(u) Π_{l∈leaves(u)} p(c_l | u, l) p(obs(l) | c_l)

p(obs(l) | c_l): probability of the observation given the label

Page 113: Joint Visual-Text Modeling for Multimedia Retrieval

113

Model

General model:
p(c, dv) = Σ_{u∈cuts(tree(dv))} p(u) Π_{l∈leaves(u)} p(c_l | u, l) p(obs(l) | c_l)

With independent generation of observations given the label:
p(c, dv) = Σ_{u∈cuts(tree(dv))} p(u) Π_{l∈leaves(u)} p(c_l | u, l) Π_{o∈child(l)} p(o | c_l)

Page 114: Joint Visual-Text Modeling for Multimedia Retrieval

114

Estimating Model Parameters

Suitable independence assumptions may need to be made:
All cuts are equally likely?
Given a cut, leaf labels have a Markov dependence
Given a label, its image footprint is independent of neighboring image regions

Work out EM algorithm for this model

Page 115: Joint Visual-Text Modeling for Multimedia Retrieval

115

Estimating Cuts Given Topology

Uniform
  All cuts containing |c| leaves or more are equally likely
  Hypothesize the number of segments produced
  Hypothesize which possible segmentation is used

Greedy choice
  Pick the remaining node with the largest observation probability that produces a valid segmentation
  Repeat until all observations are accounted for
  Changes the model: no longer a distribution over cuts; affects valid labeling strategies

Page 116: Joint Visual-Text Modeling for Multimedia Retrieval

116

Estimating Labels Given Cuts

Uniform
  Like HMM training with fixed concept transitions

Number of children
  Sky often generates a large number of observations; canoe often generates a small number of observations

Co-occurrence language model
  Eliminates label independence given the cut
  Could do a two-pass model like the MT group did (not exponential)
  [Equation: two-pass estimate of p(c_l | u, ·), re-estimating each leaf label from the other leaves via concept co-occurrence probabilities p(c|a)]

Page 117: Joint Visual-Text Modeling for Multimedia Retrieval

117

Estimating Observations Given Labels

Label generates its observations independently

Problem: Product of Children at least as high as Parent Score

Label Generates Composite Observation at Node

Page 118: Joint Visual-Text Modeling for Multimedia Retrieval

118

Evaluation Plan

Evaluate on the Corel image set using mAP
TREC annotation task

Page 119: Joint Visual-Text Modeling for Multimedia Retrieval

119

Questions?

Page 120: Joint Visual-Text Modeling for Multimedia Retrieval

120

Predicting Visual Concepts From Text
Presented by Matthew Krause

Page 121: Joint Visual-Text Modeling for Multimedia Retrieval

121

Presentation Outline

[Diagram: query/document words-visterms matrix]

Translation (MT) models (Paola)
Relevance Models (Shaolei, Desislava)
Graphical Models (Pavel, Brock)
Text classification models (Matt)
Integration & Summary (Dietrich)

Page 122: Joint Visual-Text Modeling for Multimedia Retrieval

122

A Motivating Example

Page 123: Joint Visual-Text Modeling for Multimedia Retrieval

123

A Motivating Example

<Word stime="177.09" dur="0.22" conf="0.727"> IT'S </Word>
<Word stime="177.31" dur="0.25" conf="0.963"> MUCH </Word>
<Word stime="177.56" dur="0.11" conf="0.976"> THE </Word>
<Word stime="177.67" dur="0.29" conf="0.977"> SAME </Word>
<Word stime="177.96" dur="0.14" conf="0.980"> IN </Word>
<Word stime="178.10" dur="0.13" conf="0.603"> THE </Word>
<Word stime="178.38" dur="0.57" conf="0.953"> SUMMERTIME </Word>
<Word stime="178.95" dur="0.50" conf="0.976"> GLACIER </Word>
<Word stime="179.45" dur="0.60" conf="0.974"> AVALANCHE </Word>

Page 124: Joint Visual-Text Modeling for Multimedia Retrieval

124

Concepts

Assume there is a hidden variable c which generates query words from a document’s visterms.

p(qw | dv) = Σ_C p(qw | c, dv) p(c | dv) ≅ Σ_C p(qw | c) p(c | dv)

Page 125: Joint Visual-Text Modeling for Multimedia Retrieval

125

ASR Features Example

STEVE FOSSETT AND HIS BALLOON SOLO SPIRIT ARSENIDE OVER THE BLACK SEA DRIFTING SLOWLY TOWARDS THE COAST OF THE CAUCUSES HIS TEAM PLANS IF NECESSARY TO BRING HIM DOWN AFTER DAYLIGHT TOMORROW YOU THE CHECHEN CAPITAL OF GROZNY

Page 126: Joint Visual-Text Modeling for Multimedia Retrieval

126

Building Features

Insert Sentence Boundaries

Case Restoration

Noun Extraction
Named Entity Detection

WordNet Processing

Feature Set

Page 127: Joint Visual-Text Modeling for Multimedia Retrieval

127

ASR Features Example

STEVE FOSSETT AND HIS BALLOON SOLO SPIRIT ARSENIDE OVER THE BLACK SEA DRIFTING SLOWLY TOWARDS THE COAST OF THE CAUCUSES HIS TEAM PLANS IF NECESSARY TO BRING HIM DOWN AFTER DAYLIGHT TOMORROW YOU THE CHECHEN CAPITAL OF GROZNY

Page 128: Joint Visual-Text Modeling for Multimedia Retrieval

128

ASR Features Example

STEVE FOSSETT AND HIS BALLOON SOLO SPIRIT ARSENIDE.

OVER THE BLACK SEA DRIFTING SLOWLY TOWARDS THE COAST OF THE CAUCUSES.

HIS TEAM PLANS IF NECESSARY TO BRING HIM DOWN AFTER DAYLIGHT TOMORROW.

YOU THE CHECHEN CAPITAL OF GROZNY

Page 129: Joint Visual-Text Modeling for Multimedia Retrieval

129

ASR Features Example

Steve Fossett and his balloon Solo Spirit arsenide.

Over the Black Sea drifting slowly towards the coast of the caucuses.

His team plans if necessary to bring him down after daylight tomorrow.

you the Chechan capital of Grozny….

Page 130: Joint Visual-Text Modeling for Multimedia Retrieval

130

ASR Features Example

Steve Fossett and his balloon Solo Spirit arsenide.

Over the Black Sea drifting slowly towards the coast of the caucuses.

His team plans if necessary to bring him down after daylight tomorrow.

you the Chechan capital of Grozny.

Named Entities: Male Person, Location (Region)

Page 131: Joint Visual-Text Modeling for Multimedia Retrieval

131

ASR Features Example

Steve Fossett and his balloon Solo Spirit arsenide.

Over the Black Sea drifting slowly towards the coast of the caucuses.

His team plans if necessary to bring him down after daylight tomorrow.

you the Chechan capital of Grozny.

Named Entities: Male Person, Location (Region)

Page 132: Joint Visual-Text Modeling for Multimedia Retrieval

132

ASR Features Example

Steve Fossett and his balloon Solo Spirit arsenide.

Over the Black Sea drifting slowly towards the coast of the caucuses.

His team plans if necessary to bring him down after daylight tomorrow.

you the Chechan capital of Grozny.

Named Entities: Male Person, Location (Region)

Nouns: balloon, solo, spirit, coast, caucus, team, daylight, Chechan, capital, Grozny

WordNet: nature

Page 133: Joint Visual-Text Modeling for Multimedia Retrieval

133

Feature Selection

The basic feature set (nouns + NEs) has ~18,000 elements/shot
(6000 elements x {previous, this, next})
Using only a subset of the possible features may affect performance.
Two strategies for feature selection:
  Remove very rare words (18,000 -> 7902)
  Eliminate low-value features

Page 134: Joint Visual-Text Modeling for Multimedia Retrieval

134

Information Gain

Measures the change in entropy given the value of a single feature:

Gain(C, F) = H(C) − Σ_{w∈Values(F)} p(w) H(C | F = w)
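A direct, minimal sketch of this feature-selection criterion (illustrative code, not the workshop scripts):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(concept_labels, feature_values):
    """Gain(C, F) = H(C) - sum over values w of F of p(w) * H(C | F = w)."""
    n = len(concept_labels)
    gain = entropy(concept_labels)
    for w, count in Counter(feature_values).items():
        subset = [c for c, f in zip(concept_labels, feature_values) if f == w]
        gain -= (count / n) * entropy(subset)
    return gain
```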

Page 135: Joint Visual-Text Modeling for Multimedia Retrieval

135

Information Gain Results

Basketball:
1. (empty)  2. Location-city  3. (empty) (previous)  4. “game” (previous)  5. “game”  6. Person-male  7. “point” (previous)  8. “game” (next)  9. “basketball” (previous)  10. “win”  11. (empty) (next)  12. “basketball”  13. “point”  14. “title” (previous)  15. “win” (previous)

Sky:
1. Person-male (previous)  2. “car” (previous)  3. Person  4. Person-male  5. “jury”  6. Person (next)  7. (empty) (next)  8. “point”  9. “report”  10. “point” (next)  11. “change” (previous)  12. “research” (next)  13. “fiber” (previous)  14. “retirement” (next)  15. “look”

Page 136: Joint Visual-Text Modeling for Multimedia Retrieval

136

Choosing an optimal number of features

[Chart: AP (roughly 0.56–0.58) as a function of the number of features (250–7250)]

Page 137: Joint Visual-Text Modeling for Multimedia Retrieval

137

Classifiers

Naïve Bayes
Decision Trees
Support Vector Machines
Voted Perceptrons
Language Model
AdaBoosted Naïve Bayes & Decision Stumps
Maximum Entropy

Page 138: Joint Visual-Text Modeling for Multimedia Retrieval

138

Naïve Bayes

Build a binary classifier (present/absent) for each concept:

p(c | dw) = p(dw | c) p(c) / p(dw)

Page 139: Joint Visual-Text Modeling for Multimedia Retrieval

139

Language Modeling

Conceptually similar to Naïve Bayes, but:
Multinomial
Smoothed distributions
Different feature selection

Page 140: Joint Visual-Text Modeling for Multimedia Retrieval

140

Maximum Entropy Classification

Binary constraints

Single 75-concept model

Ranked list of concepts for each shot.

Page 141: Joint Visual-Text Modeling for Multimedia Retrieval

141

Results on the most common concepts

[Chart: AP (0 to 0.6) for the concepts text, non_studio, face, indoors, outdoors, people, person, comparing Chance, Lang Model, Naïve Bayes, and MaxEnt]

Page 142: Joint Visual-Text Modeling for Multimedia Retrieval

142

Results on selected concepts

[Chart: AP (0 to 0.6) for the concepts weather, basketball, face, sky, indoors, beach, vehicle, car, comparing Chance, Lang Model, Naïve Bayes, and MaxEnt]

Page 143: Joint Visual-Text Modeling for Multimedia Retrieval

143

Mean Average Precision

[Chart: mAP (0 to 0.14) comparing Chance, Language Model, SVM, Naïve Bayes, and Max Ent]

Page 144: Joint Visual-Text Modeling for Multimedia Retrieval

144

Will this help for retrieval?“Find shots of a person diving into some water.”

person, water_body, non-studio_setting, nature_non-vegetation, person_action, indoors

“Find shots of the front of the White House in the daytime with the fountain running.”

building, outdoors, sky, water_body, cityscape, house, nature_vegetation

“Find shots of Congressman Mark Souder.”person, face, indoors, briefing_room_setting, text_overlay

Page 145: Joint Visual-Text Modeling for Multimedia Retrieval

145

Will this help for retrieval?“Find shots of a person diving into some water.”

person, water_body, non-studio_setting, nature_non-vegetation, person_action, indoors

“Find shots of the front of the White House in the daytime with the fountain running.”

building, outdoors, sky, water_body, cityscape, house, nature_vegetation

“Find shots of Congressman Mark Souder.”person, face, indoors, briefing_room_setting, text_overlay

Page 146: Joint Visual-Text Modeling for Multimedia Retrieval

146

Performance on retrieval-relevant concepts

Concept          Importance   AP      Chance
outdoors         0.68         0.434   0.270
person           0.48         0.267   0.227
vehicle          0.36         0.106   0.043
man-made-obj.    0.28         0.190   0.156
sky              0.40         0.119   0.061
face             0.28         0.582   0.414
building         0.24         0.078   0.042
road             0.24         0.055   0.037
transportation   0.24         0.151   0.065
indoors          0.24         0.459   0.317

Page 147: Joint Visual-Text Modeling for Multimedia Retrieval

147

Summary

Predict visual concepts from ASR
Tried Naïve Bayes, SVMs, MaxEnt, Language Models, …
Expect improvements in retrieval

Page 148: Joint Visual-Text Modeling for Multimedia Retrieval

148

Joint Visual-Text Video OCR
Proposed by: Matthew Krause, Georgetown University

Page 149: Joint Visual-Text Modeling for Multimedia Retrieval

149

Motivation

TREC queries ask for:
specific persons
specific places
specific events
specific locations

Page 150: Joint Visual-Text Modeling for Multimedia Retrieval

150

Motivation

“Find shots of Congressman Mark Souder”

Page 151: Joint Visual-Text Modeling for Multimedia Retrieval

151

Motivation

“Find shots of a graphic of Dow Jones Industrial Average showing a rise for one day. The number of points risen that day must be visible.”

Page 152: Joint Visual-Text Modeling for Multimedia Retrieval

152

Motivation

Find shots of the Tomb of the Unknown Soldier in Arlington National Cemetery.

Page 153: Joint Visual-Text Modeling for Multimedia Retrieval

153

Motivation

WEIFll I1 NFWdJ TNNIF H

Page 154: Joint Visual-Text Modeling for Multimedia Retrieval

154

Joint Visual-Text Video OCRGoal: Improve video OCR accuracy by exploiting other information in the audio and video streams during recognition.

Page 155: Joint Visual-Text Modeling for Multimedia Retrieval

155

Why use video OCR?

“…. Sources tell C.N.N. there’s evidence that links those incidents with the January bombing of a women’s health clinic in Birmingham, Alabama. Pierre Thomas joins us now from Washington. He has more on the story in this live report…”

Page 156: Joint Visual-Text Modeling for Multimedia Retrieval

156

Why use video OCR?

Page 157: Joint Visual-Text Modeling for Multimedia Retrieval

157

Why use video OCR?

Page 158: Joint Visual-Text Modeling for Multimedia Retrieval

158

Why use video OCR?

“Those links are growing more intensive investigative focus toward fugitive Eric Rudolph who’s been charged in the Birmingham bombing which killed an off-duty policeman…”

Page 159: Joint Visual-Text Modeling for Multimedia Retrieval

159

Why use video OCR?

Text overlays provide high-precision information about query-relevant concepts in the current image.

Page 160: Joint Visual-Text Modeling for Multimedia Retrieval

160

Finding Text

Use existing tools and data from IBM/CMU.

Page 161: Joint Visual-Text Modeling for Multimedia Retrieval

161

Image Processing

Preprocessing
  Normalize the text region’s height

Feature extraction
  Color
  Edge strength and orientation

Page 162: Joint Visual-Text Modeling for Multimedia Retrieval

162

Proposal: HMM-based recognizer

[Figure: HMM states c1 c2 c3 c4 c5 c6 emitting the characters of an overlaid word, M A I T K]

Page 163: Joint Visual-Text Modeling for Multimedia Retrieval

163

Proposal: Cache-based LMs

Augment the recognizers with an interpolation of language models:
  Background language model
  Cache-based language model (ASR or closed caption text)
  “Interesting” words from the cache (Named Entities)

p(ci | h) = λ1 p_bg(ci | h) + λ2 p_cache(ci | h) + λ3 p_interest(ci | h)
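A sketch of the interpolation, reading the equation above as a linear mixture; the component models are passed in as callables and the weights are hypothetical values to be tuned on held-out data.

```python
def interpolated_lm(p_bg, p_cache, p_interest, lambdas=(0.7, 0.2, 0.1)):
    """p(ci|h) as an interpolation of background, cache, and 'interesting word' LMs."""
    l1, l2, l3 = lambdas

    def p(c, h):
        return l1 * p_bg(c, h) + l2 * p_cache(c, h) + l3 * p_interest(c, h)

    return p
```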

Page 164: Joint Visual-Text Modeling for Multimedia Retrieval

164

Evaluation

Evaluate on TRECVID data
Character Error Rate: compare vs. manual transcriptions
Mean Average Precision: NIST-provided relevance judgments

Page 165: Joint Visual-Text Modeling for Multimedia Retrieval

165

Summary

Information from text overlays appears to be useful for IR.
General character recognition is a hard problem.
Adding external knowledge sources via the LMs should improve accuracy.

Page 166: Joint Visual-Text Modeling for Multimedia Retrieval

166

Work Plan

1. Text Localization: IBM/CMU text finders + height normalization
2. Image Processing & Feature Extraction: begin with color and edge features
3. HMM-based Recognizer: train using TREC data with hand-labeled captions
4. Language Modeling: background, cache, and “interesting words”

Page 167: Joint Visual-Text Modeling for Multimedia Retrieval

167

Retrieval Experiments and Summary

Presented by Dietrich Klakow

Page 168: Joint Visual-Text Modeling for Multimedia Retrieval

168

Presentation Outline

[Diagram: query/document words-visterms matrix]

Translation (MT) models (Paola)
Relevance Models (Shaolei, Desislava)
Graphical Models (Pavel, Brock)
Text classification models (Matt)
Integration & Summary (Dietrich)

Page 169: Joint Visual-Text Modeling for Multimedia Retrieval

169

The Matrix

[Diagram: 2×2 matrix of query (words qw, visterms qv) against document (words dw, visterms dv)]

p(qw, qv | dw, dv)

Page 170: Joint Visual-Text Modeling for Multimedia Retrieval

170

The Matrix

[Diagram: the same 2×2 matrix, one component model per cell]

p(qw | dw)   p(qw | dv)
p(qv | dw)   p(qv | dv)

Page 171: Joint Visual-Text Modeling for Multimedia Retrieval

171

The Matrix

p(qw | dw)
p(qw | dv): MT, Relevance Models, HMM
p(qv | dw): Naïve Bayes, Max. Ent., LM, SVM, AdaBoost, …
p(qv | dv)

Page 172: Joint Visual-Text Modeling for Multimedia Retrieval

172

Retrieval Model I: p(q|d)

p(qw, qv | dw, dv) = p(qw | dw, dv) × p(qv | dw, dv)

p(qw | dw, dv) = λ p(qw | dw) + (1−λ) p(qw | dv)

Baseline: standard text retrieval — text query, image documents.

Page 173: Joint Visual-Text Modeling for Multimedia Retrieval

173

Retrieval Model I: p(q|d)

p(qw, qv | dw, dv) = [λw p(qw | dw) + (1−λw) p(qw | dv)] × [λv p(qv | dw) + (1−λv) p(qv | dv)]^α

Only minor improvements over baseline

Page 174: Joint Visual-Text Modeling for Multimedia Retrieval

174

Retrieval Model II: p(q|d)

We want to estimate p(qw, qv, dw, dv)
Assume pairwise marginals are given:
Σ_{qv,dv} p(qw, qv, dw, dv) = p(qw, dw)   (and similarly for the other pairs)

Setting: a Maximum Entropy problem with 4 constraints; 1 iteration of GIS gives:

p(qw, qv | dw, dv) ∝ p(qw | dw)^λ1 p(qw | dv)^λ2 p(qv | dw)^λ3 p(qv | dv)^λ4
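A sketch of how such a log-linear combination can be used for ranking; the component scorers stand for the per-cell models of the matrix and are assumed to be supplied by the caller.

```python
import math

def log_linear_score(query, doc, scorers, weights):
    """Rank by prod_k p_k(q|d)^lambda_k, i.e. a weighted sum of log component scores.

    scorers: the component models, e.g. p(qw|dw), p(qw|dv), p(qv|dw), p(qv|dv),
    each a function of (query, doc); weights: the corresponding lambda_k.
    """
    return sum(w * math.log(max(f(query, doc), 1e-12))
               for f, w in zip(scorers, weights))
```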

Page 175: Joint Visual-Text Modeling for Multimedia Retrieval

175

Baseline TRECVID: Text Retrieval

Uses only p(qw | dw)

Retrieval mAP: 0.131
(Best automatic run reported in the literature: 0.16)

Page 176: Joint Visual-Text Modeling for Multimedia Retrieval

176

Combination with visual model

Uses p(qw | dw) and p(qw | dv)

Text-only mAP: 0.131

Page 177: Joint Visual-Text Modeling for Multimedia Retrieval

177

Combination with visual model

Uses p(qw | dw) and p(qw | dv)

Concept annotation on images, mAP on TRECVID:
  MT 0.126
  Relevance Models 0.158
  HMM 0.145

MT gives the best overall retrieval performance so far:
Text-only mAP: 0.131 -> with MT concepts, retrieval mAP: 0.139

Page 178: Joint Visual-Text Modeling for Multimedia Retrieval

178

Combination with MT and ASR

Uses p(qw | dw), p(qw | dv) (MT), and p(qv | dw)

Concepts from ASR: mAP = 0.125

Concept annotation on images, mAP on TRECVID:
  MT 0.126
  Relevance Models 0.158
  HMM 0.145

Text-only mAP: 0.131 -> combined retrieval mAP: 0.149
(Best result reported in the literature: retrieval mAP = 0.162)

Page 179: Joint Visual-Text Modeling for Multimedia Retrieval

179

Recall-Precision Curve

[Chart: precision vs. recall for the best system and the baseline]

Improvements in the high-precision region

Page 180: Joint Visual-Text Modeling for Multimedia Retrieval

180

Difficulties and Limitations we faced

Annotations are inconsistent, sometimes abstract, …
Used plain vanilla features: color, texture, edge on keyframes; no time for exploration of alternatives
Uniform block segmentation of images
Upper bound for concepts from ASR

Page 181: Joint Visual-Text Modeling for Multimedia Retrieval

181

Future Work

Model
  Incompletely labelled images, inconsistent annotations (Desislava)
Get beyond the 75-concept bottleneck
  Larger concept set (+ training data), direct modelling
Better model for spatial and temporal dependencies in video (Shaolei and Brock)
Query-dependent processing
  E.g. image features, combination weights, OCR features (Matt)

Page 182: Joint Visual-Text Modeling for Multimedia Retrieval

182

Overall Summary

Concepts from image
  MT: CLIR with direct translation works best
  Relevance models: best numbers on the development test
  HMM: a novel, competitive approach for image annotation

Concepts from ASR: oh my god, it works

Fusion: adding multiple sources in a log-linear combination helped

Overall: 14% improvement

Page 183: Joint Visual-Text Modeling for Multimedia Retrieval

183

Acknowledgments

TREC for the data
BBN for NE-tagging
IBM for providing the features and close captioning alignment (Arnon Amir)
Help with GMTK: Jeff Bilmes and Karen Livescu
CLSP for the capitalizer (WS 03 MT team)
INRIA for the face detector
NSF, DARPA and NSA for the money
CLSP for hosting: Laura, Sue, Chris, Eiwe, John, Peter, Fred

Page 184: Joint Visual-Text Modeling for Multimedia Retrieval

184

From: http://www.nature.ca/notebooks/english/tiger.htm