Joint Visual-Text Modeling for Multimedia Retrieval


1

Joint Visual-Text Modeling for Multimedia Retrieval
Presented by: Giri Iyengar
JHU CLSP Summer Workshop 2004 Team Update, August 4th 2004


2

Team

Undergraduate Students
Desislava Petkova (Mt. Holyoke), Matthew Krause (Georgetown)

Graduate Students
Shaolei Feng (U. Mass), Brock Pytlik (JHU), Pavel Ircing (U. West Bohemia), Paola Virga (JHU)

Senior Researchers
Pinar Duygulu (Bilkent U., Turkey), Giri Iyengar (IBM Research), Sanjeev Khudanpur (CLSP, JHU), Dietrich Klakow (Uni. Saarland), R. Manmatha (CIIR, U. Mass Amherst)


3

Outline
• TRECVID description
• Our retrieval models
• Models for Image Annotation
• Models for Auto-Illustration
• Next Steps


4

TRECVID Corpus and Tasks

Corpus
• Broadcast news videos used for Hub4 evaluations

Tasks
• Shot-boundary detection
• News story segmentation (multimodal)
• Concept detection ("Annotation")
• Search task


5

Concept Detection Task
• NIST provides a list of benchmark concepts
• Sites build models and NIST evaluates them


6

Search Task

Find video clips ("shots") visually relevant to a multimedia statement of information need, e.g.:
• Find shots containing the Mercedes logo (star)
• Find shots with the Sphinx


7

Concept Detection vs. Search

The essential differences are:
• When are the topics made available?
• How much training data is available?


8

Alternate (Development) Corpus

COREL photograph database: 5000 high-quality photographs with captions.

Task: Annotation


9

Outline
• TRECVID description
• Our retrieval models
  - Linear
  - Log-Linear
• Models for Image Annotation
• Models for Auto-Illustration
• Next Steps


10

Retrieval Model I: p(q|d)

p(qw, qv | dw, dv) = p(qw | dw, dv) × p(qv | dw, dv)

p(qw | dw, dv) = λw p(qw | dw) + (1 − λw) p(qw | dv)

Baseline: standard text retrieval.


11

Retrieval Model I: p(q|d)

p(qw, qv | dw, dv) = [λw p(qw | dw) + (1 − λw) p(qw | dv)] × [λv p(qv | dw) + (1 − λv) p(qv | dv)]
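As a sketch, scoring a document under the Model I linear interpolation could look like this (the λ weights and the four component likelihoods are assumed given; function and parameter names are illustrative, not the workshop's code):

```python
def linear_model_score(p_qw_dw, p_qw_dv, p_qv_dw, p_qv_dv,
                       lam_w=0.5, lam_v=0.5):
    """p(qw, qv | dw, dv) as a product of two linear interpolations,
    one for the text query part and one for the visual query part."""
    text_part = lam_w * p_qw_dw + (1.0 - lam_w) * p_qw_dv
    visual_part = lam_v * p_qv_dw + (1.0 - lam_v) * p_qv_dv
    return text_part * visual_part
```

Setting lam_w = 1 and lam_v = 1 recovers scoring from the text and visual documents alone.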


12

Retrieval Model II: p(q|d)

We want to estimate p(qw, qv | dw, dv). Assume we can only estimate pairwise marginals reliably, e.g.:

∑_{qv, dv} p(qw, qv, dw, dv) = p(qw, dw)

Setting this up as a Maximum Entropy problem with 6 constraints, after 1 iteration of GIS we get the Log-Linear Model:

p(qw, qv | dw, dv) ∝ p(qw | dw)^λ1 p(qw | dv)^λ2 p(qv | dw)^λ3 p(qv | dv)^λ4
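The log-linear (Model II) combination can be sketched as a rank-equivalent score; the exponents λ1..λ4 and the component likelihoods are assumed given (illustrative names, not the workshop's code):

```python
import math

def log_linear_score(p_qw_dw, p_qw_dv, p_qv_dw, p_qv_dv,
                     lambdas=(0.25, 0.25, 0.25, 0.25)):
    """Unnormalized log-linear score:
    p(qw, qv | dw, dv) proportional to prod_i p_i ** lambda_i."""
    probs = (p_qw_dw, p_qw_dv, p_qv_dw, p_qv_dv)
    return math.exp(sum(l * math.log(p) for l, p in zip(lambdas, probs)))
```

With exponents summing to 1 and all components equal, the score equals that common value, which is a quick sanity check.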


13

Outline
• TRECVID description
• Our retrieval models
• Models for Image Annotation -- p(qw|dv)
  - Relevance Models
  - Machine Translation Models
  - Graphical Models
• Models for Auto-Illustration -- p(qv|dw)
• Next Steps


14

Component p(qw|dv): Image Annotation

• Build models for p(c|dv) using Relevance Models, MT & HMMs
• Given a textual query, get a distribution over a concept vocabulary, p(qw|c)
  - E.g., run text retrieval over the training corpus and build this model from the top R documents
• Baseline: rule-based system

p(qw | dv) = ∑_c p(qw | c) p(c | dv)
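The concept marginalization above is a simple sum; a minimal sketch, assuming p(qw|c) and p(c|dv) are available as dictionaries (hypothetical data layout):

```python
def p_qw_given_dv(query_word, p_qw_c, p_c_dv):
    """p(qw|dv) = sum over concepts c of p(qw|c) p(c|dv).
    p_qw_c maps (word, concept) -> prob; p_c_dv maps concept -> prob
    for one image document."""
    return sum(p_qw_c.get((query_word, c), 0.0) * pc
               for c, pc in p_c_dv.items())
```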


15

Relevance Models

Analogous to Cross-Lingual IR: retrieve documents in one language (images) with queries in a different language (English text).

Cross Media Relevance Model (CMRM): discrete model, with the visual vocabulary discretized into visterms:

P(c, v) = ∑_J P(J) P(c|J) ∏_{v ∈ v} P(v|J)

Continuous Relevance Model (CRM): P(r|J) is estimated using a kernel density function:

P(c, r) = ∑_J P(J) P(c|J) ∏_{i=1}^{m} P(ri|J)
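The discrete CMRM sum can be sketched directly; this assumes uniform P(J), per-document maximum-likelihood estimates, and a tiny smoothing floor to avoid zeros (all illustrative choices, not the workshop's estimator):

```python
from collections import Counter

def cmrm_joint(concept, visterms, training_set):
    """Discrete CMRM: P(c, v) = sum_J P(J) P(c|J) prod_v P(v|J).
    training_set: list of (caption_words, image_visterms) pairs."""
    eps = 1e-6
    total = 0.0
    for caption, image_terms in training_set:
        cw, tv = Counter(caption), Counter(image_terms)
        p_c = (cw[concept] + eps) / (len(caption) + eps)
        p_v = 1.0
        for v in visterms:
            p_v *= (tv[v] + eps) / (len(image_terms) + eps)
        total += p_c * p_v / len(training_set)  # uniform P(J)
    return total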


16

Continuous Relevance Model

A generative model:
• Concept words wj are generated by an i.i.d. sample from a multinomial
• Features ri are generated by a multi-variate (Gaussian) density


17

Normalized Continuous Relevance Model

• Pad annotations to a fixed length, then use the CRM
• Similar to using a Bernoulli model (rather than a multinomial) for words
• Accounts for annotation length (analogous to document length in text retrieval)


18

Extensions

New model:

P(c, r) = ∑_J P(J) P(c|J) ∏_{r ∈ r} P(r|w, J)

How does one estimate P(r|w, J)?
• Discrete case: maybe probabilities from IBM translation Model 1? Or estimate using images with word w as an annotation.
• Continuous case: a little trickier; not clear yet.


19

Extensions

Information Retrieval: query-based techniques usually work better (pseudo-relevance feedback, query expansion).
• Run an initial (text) ASR retrieval, then create text/image models from the top-ranked documents returned
• Similarly, run an initial image retrieval, then create text/image models from the top-ranked documents returned


20

Retrieval example
Query: Bill_Clinton

[Figure: top-ranked keyframes returned by CRM vs. Normalized-CRM]


21

Retrieval Results on Video Data

Data: 6 hrs of TRECVID03 (5200 keyframes)
Training: 3470 keyframes. Test: 1730 keyframes.


22

Machine Translation Approach

Pr(f|e) = ∑_a Pr(f, a|e)

Pr(w|v) = ∑_a Pr(w, a|v)


23

Image Annotation using IBM-Model 1

p0(c|J) ∝ ∑_{v ∈ J} p(c|v)
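Annotating an image with this Model 1 style score amounts to summing lexicon probabilities over the image's visterms; a minimal sketch, assuming a translation table of the form visterm -> {concept: p(c|v)} (illustrative layout, not the workshop's code):

```python
def annotate_model1(visterms, translation_table, top_k=5):
    """Rank concepts by p0(c|J) proportional to sum over v in J of p(c|v)."""
    scores = {}
    for v in visterms:
        for c, p in translation_table.get(v, {}).items():
            scores[c] = scores.get(c, 0.0) + p
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```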


24

Annotation Results (Corel set)

Actual annotation             Predicted annotation
field foals horses mare       tree horses foals mare field
birds grass plane zebra       grass tusks water plane zebra
flowers leaf petals stems     flowers leaf petals grass tulip
people pool swimmers water    swimmers pool people water sky
bear polar snow               beach polar grass bear snow
mountain sky snow water       sky mountain water clouds snow
jet plane sky                 sky plane jet tree clouds
people sand sky water         sky water beach people hills


25

Using Word Co-occurrences

Model 1 with word co-occurrence:
p1(c2|J) = ∑_{c1} p0(c1|J) p(c2|c1)

Normalization:
p2(c|J) ∝ log(N/k) p1(c|J)

where N is the number of images in the training data and k is the number of times concept c appears in the training data.
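The two steps above (co-occurrence smoothing, then the log(N/k) reweighting) can be sketched together; the dictionary layouts are assumed for illustration:

```python
import math

def rerank_with_cooccurrence(p0, cooc, df, n_images):
    """p1(c2|J) = sum_{c1} p0(c1|J) p(c2|c1), followed by the
    idf-style reweighting p2(c|J) proportional to log(N/k) p1(c|J).
    p0: {concept: p0(c|J)}; cooc: {c1: {c2: p(c2|c1)}};
    df: {concept: k, occurrences in training}; n_images: N."""
    p1 = {}
    for c1, p in p0.items():
        for c2, pt in cooc.get(c1, {}).items():
            p1[c2] = p1.get(c2, 0.0) + p * pt
    return {c: math.log(n_images / df[c]) * p for c, p in p1.items()}
```

The reweighting promotes rarer concepts, which counteracts frequent "stopword" concepts dominating every annotation.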


26

Evaluating Annotation Performance

Predict the 5 words with the highest frequency, and compare with the actual annotations.

Prediction performance = (1/N) ∑_I (# correct in I) / (# actual annotations in I)

Recall(c) = (number of times c is correctly predicted) / (number of times c appears as an annotation word)

Precision(c) = (number of times c is correctly predicted) / (number of times c is predicted)
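These definitions translate directly into code; a self-contained sketch over predicted/actual word sets per image (illustrative, not the evaluation script used at the workshop):

```python
def annotation_metrics(predicted, actual):
    """Prediction performance plus per-concept recall and precision,
    following the slide's definitions. predicted/actual: parallel
    lists of word sets, one pair per image."""
    perf = sum(len(p & a) / len(a) for p, a in zip(predicted, actual))
    perf /= len(actual)
    concepts = set().union(*actual, *predicted)
    recall, precision = {}, {}
    for c in concepts:
        correct = sum(1 for p, a in zip(predicted, actual)
                      if c in p and c in a)
        in_actual = sum(1 for a in actual if c in a)
        in_pred = sum(1 for p in predicted if c in p)
        recall[c] = correct / in_actual if in_actual else 0.0
        precision[c] = correct / in_pred if in_pred else 0.0
    return perf, recall, precision
```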


27

Results – Preliminary TRECVID

Model       Prediction performance   Recall           Precision        Num predicted   MAP
Model1      0.488                    0.0540 (0.272)   0.0666 (0.336)   23              0.088 (0.309)
Model1+WC   0.229                    0.0533 (0.326)   0.0573 (0.350)   19              0.096 (0.303)


28

Comparison on annotation performance

F1 = (recall + precision) / 2


29

Comparison on retrieval performance


30

HMMs for Image Annotation

[Figure: example image whose blocks align to the caption words tiger, ground, water, grass]

• Image blocks are "generated" by the caption words
• Alignment between image blocks and captions is a hidden variable
• Train via the EM algorithm
• Training HMMs: 3-5 known states, given by the caption
• 4500 training images
• 374-word caption vocabulary
• 1 pdf/pmf per state
• Test HMM has 374 states, with p(w'|w) estimated by a co-occurrence LM
• Posterior probability from the forward-backward pass is used for p(w|Image)


31

Challenges in HMM Training

• There is no notion of order in the caption words
• No linear order in image blocks (assume raster scan)
  - Additional spatial dependence between block labels is missed
  - Partially addressed via a more complex DBN (see next)
• Unsuitable (incomplete) human annotations
  - Annotators often mark only interesting objects, leaving large portions of the image unlabelled, in spite of having such objects (e.g., sky) in the caption vocabulary
  - This may be alleviated by inserting an optional "unlabelled background" state during training


32

Accounting for Unlabelled Background (Gradual Training)

• Identify a set of "background" words (sky, grass, water, ...)
• In the initial stages of HMM training, allow only "background" states to have their individual emission probability distributions; all other objects share a single "foreground" distribution
• Run several EM iterations
• Untie the foreground distribution and run more EM iterations
• Improved alignment of training images; retrieval performance on test images not so good yet ... work in progress

[Figure: training-image alignment with gradual training]


33

Results – HMM – Corel data

Model                  MAP      Prediction performance   Recall            Precision
Discrete – w/o LM      0.1199   0.1023                   0.1006 (0.3793)   0.0900 (0.3394)
Discrete – with LM     0.1501   0.3268                   0.1646 (0.4774)   0.0991 (0.2875)
Continuous – w/o LM    0.1100   0.1017                   0.1085 (0.4036)   0.0841 (0.3127)
Continuous – with LM   0.1231   0.2103                   0.1320 (0.3787)   0.0954 (0.2738)


34

Modeling Spatial Dependency

• Hidden states have spatial dependence beyond the left-neighbors modeled by a raster-scan HMM
• Construct a graphical model with as many hidden states as image blocks, and explicitly model neighborhood dependencies

[Figures: an HMM for a 24-block image; a DBN for a 24-block image]


35

GMTK Implementation of the DBN

Each hidden variable depends on 2 neighbors, which:
• Leads to larger "transition probability" tables
  - Data sparseness in training; may be alleviated by tying various probabilities
• Leads to a larger configuration (state) space
  - Running times go up during decoding

Work in progress ... struggling with GMTK issues.


36

Annotation using Cross-Lingual IR

Treat image annotation as a cross-lingual IR problem (like Relevance Models): an image comprises visterms (target language) and a query comprises concepts (source language).

P(c|dV) = λ (∑_v p(v|dV) p(c|v)) + (1 − λ) p(c|BC)

where BC is a background concept model.
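A sketch of this smoothed cross-lingual scoring, assuming a GIZA++-style dictionary laid out as visterm -> {concept: prob} and a background concept model (the names and layout are illustrative, not the workshop's implementation):

```python
from collections import Counter

def p_concept_given_image(concept, visterms, trans, bg, lam=0.8):
    """P(c|dV) = lam * sum_v p(v|dV) p(c|v) + (1 - lam) * p(c|BC),
    with p(v|dV) taken as the relative frequency of v in the image."""
    counts = Counter(visterms)
    n = len(visterms)
    doc_term = sum((k / n) * trans.get(v, {}).get(concept, 0.0)
                   for v, k in counts.items())
    return lam * doc_term + (1.0 - lam) * bg.get(concept, 0.0)
```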


37

Cross-Lingual IR using Lemur

What is needed?
• Source documents (concepts): 26608 documents, vocabulary size 75
• Target documents (visterms): 26608 documents, vocabulary size 2981
• Dictionary: GIZA++ p(concept|visterm), e.g. p(tiger| ) = 0.7
• Queries: all concepts (75 query concepts)

Work in progress.


38

Score Distribution Modeling

• Each model gives a "score" for a concept
• Obtain a histogram of scores when the concept is present (absent) in the training set
• Re-rank annotations based on the ratio of these two distributions
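The histogram-ratio re-ranking could be sketched as follows, with add-one smoothing over histogram bins so empty bins do not zero out the ratio (all choices here are illustrative assumptions):

```python
import bisect

def likelihood_ratio(score, present_scores, absent_scores, bin_edges):
    """Re-ranking weight P(score | concept present) / P(score | absent),
    read off histograms of training-set scores.
    bin_edges: sorted interior edges; scores fall into len(edges)+1 bins."""
    def bin_prob(s, samples):
        hist = [0] * (len(bin_edges) + 1)
        for x in samples:
            hist[bisect.bisect_right(bin_edges, x)] += 1
        b = bisect.bisect_right(bin_edges, s)
        return (hist[b] + 1) / (len(samples) + len(hist))
    return bin_prob(score, present_scores) / bin_prob(score, absent_scores)
```

A ratio above 1 pushes the concept up the annotation list; below 1 pushes it down.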


39

Score distribution modeling


40

Outline
• TRECVID description
• Our retrieval models
• Models for Image Annotation -- p(qw|dv)
• Models for Text Annotation -- p(qv|dw)
• Next Steps


41

Component p(qv|dw): Text Annotation

Given the visual part of the query, process it using one of the previous models to get p(qv|c).

p(qv | dw) = ∑_c p(qv | c) p(c | dw)


42

Text Annotation Models

• Preprocess ASR text to extract features: named entities, concrete nouns
• Feature selection: information gain
• Models tried: LM-based, Naïve Bayes, SVM, MaxEnt


43

Building Features

Insert sentence boundaries → Case restoration → Noun extraction + Named entity detection → Extract concrete nouns → Feature set


44

Language Model Based Annotation

Train two language models:
• PC(f1 … fm)
• PnC(f1 … fm)

Pick the one that minimizes perplexity on the test set.

Issue: picking features from ASR.
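The two-LM perplexity comparison can be sketched with a toy unigram model; the workshop's PC and PnC would be richer, and all names here are illustrative:

```python
import math
from collections import Counter

class UnigramLM:
    """Add-one-smoothed unigram LM over shot features (a stand-in for
    the slide's PC / PnC models)."""
    def __init__(self, docs):
        self.counts = Counter(f for d in docs for f in d)
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 bucket for unseen features

    def perplexity(self, feats):
        logp = sum(math.log((self.counts[f] + 1) / (self.total + self.vocab))
                   for f in feats)
        return math.exp(-logp / len(feats))

def has_concept(feats, lm_concept, lm_other):
    """Label the shot with the concept iff the concept LM assigns
    lower perplexity to its features than the complement LM."""
    return lm_concept.perplexity(feats) < lm_other.perplexity(feats)
```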


45

Important Features for "face"

• No annotation is a good indicator for the absence of "face"
• No single strong positive feature

Feature f         Rel. Position   Count   P(f,C) / (P(f) P(C))
Empty             Previous shot   2855    0.70
Empty             This shot       1251    0.56
Empty             Next shot        845    0.56
NE_Person-male    This shot       1773    1.30
NE_JobTitle       This shot        869    1.36
…                 …                  …    …

Total number of shots: 28054


46

Optimal Number of Features

A few thousand features are sufficient.

[Figure: average precision (0.49–0.52) vs. number of features (100–10,000), concept "Face"]


47

Temporal Correlation of Concepts

Weak temporal correlations (exception: the concept "monologue").

[Figure: autocorrelation function vs. relative shot id (1 = present shot), for concepts "face" and "text-overlay"]


48

Naïve Bayes Classifier Results for Select Concepts

[Figure: binary Naïve Bayes classification, random prediction vs. breakeven precision, for weather_news, basketball, face, sky, car]


49

Outline
• TRECVID description
• Our retrieval models
• Models for Image Annotation -- p(qw|dv)
• Models for Auto-Illustration -- p(qv|dw)
• Next Steps


50

Next Steps

Integration experiments:
• Evaluate linear and log-linear models on TRECVID03 queries
• Improve component models & iterate …