Matching Words and Pictures - micc.unifi.it · • Assuming an unknown one-one correspondence...
H. Dunlop
• Improvements in translation from visual to semantic representations lead to improvements in image access
• Numerous applications:
– Auto-annotation
– Region naming (aka object recognition)
– Browsing
– Searching
– Auto-illustration
From visual to semantic representations
Multimedia Translation
• Data:
Words are associated with images, but correspondences are unknown
sun sea sky
Statistical Machine Translation
• Assuming an unknown one-one correspondence between words, come up with a joint probability distribution linking words in the two languages
• Missing data problem: solution is Expectation Maximization (EM)
– Given the translation probabilities, estimate the correspondences
– Given the correspondences, estimate the translation probabilities
Overview
Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary, by
Pinar Duygulu, Kobus Barnard, Nando de Freitas, David Forsyth
Input Representation
• Segment with Normalized Cuts:
• Only use regions larger than a threshold (typically 5-10 per image)
• Form vector representation of each region
• Cluster regions with k-means to form blob tokens
sun sky waves sea
word tokens
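The blob-token step can be sketched with plain Lloyd's-algorithm k-means (the paper clusters with k-means; the function name, defaults, and initialization here are our choices, not the authors'):

```python
import numpy as np

def make_blob_tokens(region_features, k=500, iters=20, seed=0):
    """Cluster region feature vectors with k-means so that every
    region maps to a discrete 'blob token' (its cluster index)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(region_features, dtype=float)
    # initialize centers from randomly chosen regions
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each region to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned regions
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

At test time the same centers are reused: a new region gets the token of its nearest center.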
Input Representation
• Represent each region with a feature vector
– Size: portion of the image covered by the region
– Position: coordinates of center of mass
– Color: avg. and std. dev. of (R,G,B), (L,a,b) and (r=R/(R+G+B),g=G/(R+G+B))
– Texture: avg. and variance of 16 filter responses
– Shape: area / perimeter2, moment of inertia, region area / area of convex hull
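A minimal version of this descriptor, covering only the size, position, and color terms (the texture and shape terms are omitted for brevity, and the helper name is ours):

```python
import numpy as np

def region_features(rgb, mask):
    """Sketch of a per-region descriptor: size, position, color stats.
    rgb: HxWx3 float image in [0, 1]; mask: HxW boolean region mask."""
    H, W = mask.shape
    ys, xs = np.nonzero(mask)
    size = mask.sum() / (H * W)            # portion of image covered
    cy, cx = ys.mean() / H, xs.mean() / W  # center of mass, normalized
    px = rgb[mask]                         # Nx3 array of region pixels
    rgb_mean, rgb_std = px.mean(axis=0), px.std(axis=0)
    s = px.sum(axis=1, keepdims=True) + 1e-8
    chrom = (px / s)[:, :2]                # r = R/(R+G+B), g = G/(R+G+B)
    return np.concatenate([[size, cy, cx], rgb_mean, rgb_std,
                           chrom.mean(axis=0), chrom.std(axis=0)])
```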
Expectation Maximization
• Select word with highest probability to assign to each blob
p(w \mid b) = \prod_{n=1}^{N} \prod_{j=1}^{M} \sum_{i=1}^{L} p(a_{nj} = i)\, t(w = w_{nj} \mid b = b_{ni})

– p(a_{nj} = i): probability that blob b_ni translates to word w_nj
– t(w = w_nj | b = b_ni): probability of obtaining word w_nj given an instance of blob b_ni
– N: # of images; M: # of words; L: # of blobs
Expectation Maximization
• Initialize to blob-word co-occurrences:
• Iterate:
– Given the translation probabilities, estimate the correspondences
– Given the correspondences, estimate the translation probabilities
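The co-occurrence initialization and the two alternating steps can be sketched as an IBM-Model-1-style EM loop. This is a simplified sketch, assuming each image comes as a pair of word ids and blob ids; the function name and smoothing constants are ours:

```python
import numpy as np

def em_translation(images, n_words, n_blobs, iters=10):
    """EM for word-blob translation probabilities.
    images: list of (word_ids, blob_ids) pairs.
    Returns t[w, b] ~ p(word w | blob b), columns summing to 1."""
    # Initialize with blob-word co-occurrence counts
    t = np.full((n_words, n_blobs), 1e-6)
    for words, blobs in images:
        for w in words:
            for b in blobs:
                t[w, b] += 1.0
    t /= t.sum(axis=0, keepdims=True)
    for _ in range(iters):
        counts = np.zeros_like(t)
        for words, blobs in images:
            for w in words:
                # E-step: posterior over which blob generated word w
                p = t[w, blobs]
                p = p / p.sum()
                # M-step accumulation: expected counts
                counts[w, blobs] += p
        t = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1e-12)
    return t
```

Each iteration first distributes every word's count over the blobs in its image according to the current table (correspondences), then renormalizes the accumulated counts per blob (translation probabilities).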
Word Prediction
• On a new image:
– Segment
– For each region:
• Extract features
• Find the corresponding blob token using nearest neighbor
• Use the word posterior probabilities to predict words
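Given learned blob cluster centers and a translation table t[w, b] with columns p(word | blob), the per-region prediction steps might look like this (names are illustrative):

```python
import numpy as np

def predict_words(region_feats, centers, t):
    """For each region: nearest-neighbor blob token, then the word
    with the highest posterior under the translation table t[w, b]."""
    preds = []
    for f in np.asarray(region_feats, dtype=float):
        b = ((centers - f) ** 2).sum(axis=1).argmin()  # nearest blob token
        w = t[:, b].argmax()                           # most probable word
        preds.append((int(b), int(w), float(t[w, b])))
    return preds
```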
Refusing to Predict
• Require: p(word|blob) > threshold
– i.e., assign a null word to any blob whose best predicted word falls below the threshold
• Prunes vocabulary, so fit new lexicon
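The thresholding rule itself is a one-liner per blob; a sketch (the NULL_WORD sentinel and default threshold are our choices, not the paper's):

```python
NULL_WORD = -1  # sentinel for "refuse to predict"

def predict_or_refuse(word_posteriors, threshold=0.2):
    """Assign the best word per blob only when p(word | blob) clears
    the threshold; otherwise emit the null word."""
    out = []
    for p in word_posteriors:  # p: posterior over the vocabulary
        w = max(range(len(p)), key=lambda i: p[i])
        out.append(w if p[w] > threshold else NULL_WORD)
    return out
```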
Indistinguishable Words
• Visually indistinguishable:
– cat and tiger, train and locomotive
• Indistinguishable with our features:
– eagle and jet
• Entangled correspondence:
– polar – bear
– mare/foals – horse
• Solution: cluster similar words
– Obtain similarity matrix
– Compare words with symmetrised KL divergence
– Apply N-Cuts on matrix to get clusters
– Replace word with its cluster label
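The similarity-matrix step can be sketched as follows, comparing each word's blob distribution t[w, :] with the symmetrised KL divergence and converting divergences to affinities; the spectral-clustering (N-Cuts) step on the resulting matrix would follow (the kernel width sigma is our choice):

```python
import numpy as np

def symmetrised_kl(p, q, eps=1e-12):
    """Symmetrised KL divergence KL(p||q) + KL(q||p) between two
    discrete distributions, with smoothing to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum() + (q * np.log(q / p)).sum())

def word_similarity_matrix(t, sigma=1.0):
    """Affinity matrix over words from their blob distributions t[w, :],
    suitable as input to a spectral-clustering / N-Cuts step."""
    n = t.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = np.exp(-symmetrised_kl(t[i], t[j]) / sigma)
    return S
```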
Experiments
• Train with 4500 Corel images
– 4-5 words for each image
– 371 words in vocabulary
– 5-10 regions per image
– 500 blobs
• Test on 500 images
Auto-Annotation
• Determine most likely word for each blob
• If probability of word is greater than some threshold, use in annotation