Sentence generation


Transcript of Sentence generation

Page 1: Sentence generation

Every Picture Tells a Story: Generating Sentences from Images

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth

Proceedings of ECCV-2010

Page 2: Sentence generation

Motivation

Demonstrating how well automatic methods can link a description to a given image, or retrieve images that illustrate a given sentence.

Auto-annotation

Page 3: Sentence generation

Motivation

Demonstrating how well automatic methods can link a description to a given image, or retrieve images that illustrate a given sentence.

Auto-illustration

Page 4: Sentence generation

Contributions

Proposes a system that computes a score linking an image to a sentence, and vice versa.

Evaluates the methodology on a novel dataset of human-annotated images (the PASCAL Sentence Dataset).

Provides a quantitative evaluation of the quality of the predictions.

Page 5: Sentence generation

Overview

Page 6: Sentence generation

The Approach: Mapping Image to Meaning

Predicting the meaning triplet (object, action, scene) of an image involves solving a small multi-label Markov random field (MRF).

[Figure: the triplet MRF; the object, action, and scene nodes take 23, 16, and 29 labels respectively.]
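A minimal sketch of this MAP inference by exhaustive search, assuming the label-set sizes above and illustrative random potentials (the actual potential values come from the learned feature functions described next):

```python
# Sketch: triplet prediction as MAP inference over a small multi-label MRF.
# The label-set sizes (23 objects, 16 actions, 29 scenes) follow the slide;
# the potential values here are illustrative stand-ins.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_obj, n_act, n_scn = 23, 16, 29

# Node potentials: one score per candidate label for each node.
phi_obj = rng.random(n_obj)
phi_act = rng.random(n_act)
phi_scn = rng.random(n_scn)

# Edge potentials between pairs of nodes.
psi_oa = rng.random((n_obj, n_act))
psi_as = rng.random((n_act, n_scn))

def map_triplet():
    """Exhaustive search is feasible: 23 * 16 * 29 = 10,672 states."""
    best, best_score = None, -np.inf
    for o, a, s in itertools.product(range(n_obj), range(n_act), range(n_scn)):
        score = phi_obj[o] + phi_act[a] + phi_scn[s] + psi_oa[o, a] + psi_as[a, s]
        if score > best_score:
            best, best_score = (o, a, s), score
    return best, best_score

print(map_triplet())
```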

Page 7: Sentence generation

The Approach

Node potentials: computed as a linear combination of scores from several detectors and classifiers (the feature functions).

Edge potentials: estimated from the co-occurrence frequencies of the node labels.
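A minimal sketch of a node potential as a weighted combination of feature-function scores; the scores and weights below are hypothetical stand-ins (the paper learns the weights discriminatively):

```python
# Sketch: a node potential as a linear combination of feature functions
# (detector and classifier responses). Values here are illustrative.
import numpy as np

def node_potential(feature_scores: np.ndarray, weights: np.ndarray) -> float:
    """feature_scores[i] is the i-th detector/classifier response for a
    candidate label; weights are learned coefficients."""
    return float(weights @ feature_scores)

# e.g. three feature functions: a Felzenszwalb detector response,
# a Hoiem classifier response, and a Gist scene score.
scores = np.array([0.8, 0.3, 0.5])
weights = np.array([1.2, 0.7, 0.4])
print(node_potential(scores, weights))
```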

Page 8: Sentence generation

The Approach: Image Space

To provide information about the nodes of the MRF, we first need to construct image features.

Node Features:

• Felzenszwalb et al. detector responses
• Hoiem et al. classification responses
• Gist-based scene classification responses

Feature Functions: Node features, Similarity Features

Page 9: Sentence generation

The Approach: Image Space

Similarity Features:

• Average of the node features over the K nearest neighbours of the test image in the training set, matched using image features.

• Average of the node features over the K nearest neighbours of the test image in the training set, matched using the node features derived from the classifiers and detectors.
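A minimal sketch of one such similarity feature, assuming Euclidean matching and illustrative feature matrices:

```python
# Sketch: average the node features of the K nearest training images.
# Neighbours can be found either in raw image-feature space or in the
# space of detector/classifier responses. Data here is illustrative.
import numpy as np

def knn_averaged_features(test_vec, train_vecs, train_node_feats, k=5):
    """test_vec: matching features of the test image;
    train_vecs: the same features for N training images (N x D);
    train_node_feats: node features of the training images (N x F)."""
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    return train_node_feats[nearest].mean(axis=0)

rng = np.random.default_rng(0)
train_vecs = rng.random((100, 8))        # image features of 100 training images
train_node_feats = rng.random((100, 4))  # their node features
test_vec = rng.random(8)
print(knn_averaged_features(test_vec, train_vecs, train_node_feats, k=5))
```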

Page 10: Sentence generation

The Approach: Edge Potentials

Four estimates for the edge from node A to node B:
• The normalized frequency of the word A in our corpus, f(A).
• The normalized frequency of the word B in our corpus, f(B).
• The normalized frequency of A and B occurring together, f(A, B).
• f(A, B) / (f(A) f(B))

The edge potential is a linear combination of these estimates.
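A minimal sketch of the four estimates and their linear combination, computed over a hypothetical triplet corpus with illustrative weights:

```python
# Sketch: the four co-occurrence estimates for an edge (A, B) and their
# weighted combination. The corpus and weights are stand-ins; the paper
# learns the combination weights.
corpus = [("person", "ride", "street"), ("dog", "run", "park"),
          ("person", "ride", "park")]  # hypothetical triplet corpus

def edge_estimates(a, b, triplets):
    n = len(triplets)
    f_a = sum(a in t for t in triplets) / n              # f(A)
    f_b = sum(b in t for t in triplets) / n              # f(B)
    f_ab = sum(a in t and b in t for t in triplets) / n  # f(A, B)
    ratio = f_ab / (f_a * f_b) if f_a and f_b else 0.0   # f(A,B)/(f(A)f(B))
    return [f_a, f_b, f_ab, ratio]

def edge_potential(a, b, weights):
    return sum(w * e for w, e in zip(weights, edge_estimates(a, b, corpus)))

print(edge_potential("person", "ride", [0.25, 0.25, 0.25, 0.25]))
```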

Page 11: Sentence generation

The Approach: Sentence Space

• Extract triplets from sentences.
• Use Lin similarity to determine the semantic distance between two words.
• Determine which actions commonly co-occur.
• Compute sentence node potentials from these measures.
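As one concrete way to realise the Lin-similarity step, a minimal sketch using NLTK's WordNet interface (the paper's exact setup may differ):

```python
# Sketch: Lin similarity between two words via WordNet, one way to measure
# the semantic distance mentioned above.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content from Brown corpus
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(dog.lin_similarity(cat, brown_ic))  # higher = more similar
```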

Page 12: Sentence generation

Learning and Inference

1. Learning to predict triplets for images is done discriminatively, using a dataset of images labeled with their meaning triplets.

2. The potentials are computed as linear combinations of feature functions.

3. This casts learning as a search for the best set of weights on the linear combination of feature functions, so that the ground-truth triplet scores higher than any other triplet.

4. Inference involves finding argmax_y wᵀφ(x, y), where φ is the potential function, y is the triplet label, and w are the learned weights.
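A minimal sketch of this inference rule, plus a structured-perceptron-style update as a simple stand-in for the paper's actual learning procedure; the feature map phi and the candidates here are hypothetical:

```python
# Sketch: inference as argmax_y w . phi(x, y) over candidate triplets,
# and one perceptron-style weight update enforcing that the ground-truth
# triplet outscores the others.
import numpy as np

def infer(x, candidates, w, phi):
    """Return the triplet y maximizing the linear score w . phi(x, y)."""
    scores = [w @ phi(x, y) for y in candidates]
    return candidates[int(np.argmax(scores))]

def perceptron_step(x, y_true, candidates, w, phi, lr=1.0):
    """If another triplet outscores the ground truth, move w toward the
    truth. A simple stand-in for the paper's learning procedure."""
    y_hat = infer(x, candidates, w, phi)
    if y_hat != y_true:
        w = w + lr * (phi(x, y_true) - phi(x, y_hat))
    return w

# Tiny demo with a hypothetical two-dimensional feature map.
phi = lambda x, y: np.array([1.0 if y == ("person", "ride", "street") else 0.0, x])
cands = [("dog", "run", "park"), ("person", "ride", "street")]
w = np.zeros(2)
w = perceptron_step(0.5, ("person", "ride", "street"), cands, w, phi)
print(infer(0.5, cands, w, phi))
```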

Page 13: Sentence generation

Evaluation

Dataset

Experimental Settings

600 training images and 400 testing images.
50 closest triplets used for matching.

PASCAL Sentence Dataset: built from the PASCAL 2008 development kit; 50 images from each of 20 categories; Amazon's Mechanical Turk workers generated 5 captions for each image.

Page 14: Sentence generation

Evaluation

Scoring a match between images and sentences is done by ranking triplets in the opposite space and summing their contributions, weighted by the inverse rank of the triplets.
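A minimal sketch of one plausible reading of this scoring rule (the paper's exact formula may differ); the triplets are illustrative:

```python
# Sketch: score a match by summing inverse-rank contributions of the
# triplets shared between the image's and the sentence's ranked lists.
def match_score(image_triplets, sentence_triplets):
    """Both arguments are lists of triplets ordered best-first."""
    sentence_rank = {t: r + 1 for r, t in enumerate(sentence_triplets)}
    score = 0.0
    for t in image_triplets:
        if t in sentence_rank:
            score += 1.0 / sentence_rank[t]
    return score

print(match_score([("person", "ride", "street")],
                  [("person", "ride", "street"), ("dog", "run", "park")]))
```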

Distributional semantics: text information and a similarity measure are used to handle out-of-vocabulary words that occur in sentences but are not covered by any learned detector/classifier.

Page 15: Sentence generation

Evaluation: Quantitative Measures

Tree-F1 measure: A measure that reflects two important interacting components, accuracy and specificity.

Precision is the number of edges on the predicted path that match edges on the ground-truth path, divided by the total number of edges on the predicted path.

Recall is the number of edges on the predicted path that are in the ground-truth path, divided by the total number of edges on the ground-truth path.

BLEU measure: a measure of whether the generated triplet is logically valid. For example, (bottle, walk, street) is not valid. For that, we check whether the triplet ever appeared in our corpus.
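A minimal sketch of the Tree-F1 computation under the precision/recall definitions above; the taxonomy paths are illustrative:

```python
# Sketch: Tree-F1 over taxonomy paths, where a path is a sequence of node
# labels from root to leaf and edges are consecutive label pairs.
def tree_f1(pred_path, gt_path):
    pred_edges = set(zip(pred_path, pred_path[1:]))
    gt_edges = set(zip(gt_path, gt_path[1:]))
    overlap = len(pred_edges & gt_edges)
    precision = overlap / len(pred_edges) if pred_edges else 0.0
    recall = overlap / len(gt_edges) if gt_edges else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. hypothetical paths from the root of an object taxonomy to a leaf
print(tree_f1(["entity", "animal", "dog"], ["entity", "animal", "cat"]))
```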

Page 16: Sentence generation

Results: Auto-Annotation

Page 17: Sentence generation

Results: Auto-Illustration

Page 18: Sentence generation

Results: Examples of Failures

Page 19: Sentence generation

Discussion

• Sentences are not really generated from the image; they are retrieved from a pool of human-annotated descriptions.

• The intermediate meaning space in the model helps address the two-way problem and also benefits from distributional semantics.

• The way the system outputs a score to quantitatively evaluate the correlation between descriptions and images is interesting.