VIP: Finding Important People in Images
Clint Solomon Mathialagan, Andrew C. Gallagher, Dhruv Batra
CVPR 2015
1
Outline
• Introduction
• Approach
• Results
• Importance vs Saliency
• Application: Improving Im2Text
• Conclusions
2
Introduction
• Project: https://computing.ece.vt.edu/~mclint/vip/
• Demo: http://cloudcv.org/vip/
3
Introduction
• The goal of this paper is to automatically predict the importance of individuals in group photographs.
4
Introduction
• Who are the most important individuals in these pictures?
• Humans have a remarkable ability to understand social roles and identify important players, even without knowing the identities of the people in the images.
5
Introduction
• What is importance? Important according to whom?
1. the photographer
2. the subjects
3. neutral third-party human observers
• In this work, we rely on the wisdom of the crowd (neutral observers) to estimate the "ground-truth" importance of a person in an image.
6
Introduction
• Applications
- Im2Text (image captioning)
- Photo cropping algorithms
- Social networking sites and image search applications
• Contributions
- We learn a model for predicting the importance of individuals in photos.
- We collect two importance datasets.
- We show that we can automatically predict the importance of people with high accuracy, and that incorporating this predicted importance improves applications.
7
Approach
• Framework (diagram): Dataset (Image-Level, Corpus-Level) → Person Features (Distance, Scale, Sharpness, Face Pose, Face Occlusion) → Model, trained so that M(pi, pj) ≈ si − sj.
8
Approach
• We model importance in two ways:
1. Image-Level Importance: “Given an image, who is the most important individual?”
2. Corpus-Level Importance: “Given multiple images, in which image is a specific person most important?”
9
Approach(1)-Dataset Collection
• Image-Level Dataset: in this setting, we need a dataset of images containing at least three people with varying levels of importance; we collect such images from Flickr.
• Corpus-Level Dataset: in this setting, we need a dataset that has multiple pictures of the same person, and multiple sets of such photos; we use frames from a TV series ('Big Bang Theory').
10
Approach(2)-Importance Annotation
Annotation Interfaces used with MTurk
Image-Level Importance Annotation: Hovering over a button (A or B) highlights the person associated with it.
Corpus-Level Importance Annotation: Hovering over a frame shows where the person is located in that frame.
11
Approach(2)-Importance Annotation
• (pi, pj): each annotated pair of faces
• si, sj: the relative importance scores, in [0, +1]
• Note that si and sj are not absolute: they are not calibrated for comparison to another person, say pk from another pair.
12
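The relative scores above can be sketched from raw annotator votes. Note that the fraction-of-votes rule used here is an illustrative assumption, not the paper's exact aggregation:

```python
def relative_scores(votes_i, votes_j):
    """Convert raw annotator votes for a face pair (p_i, p_j) into
    relative importance scores s_i, s_j in [0, 1].

    Assumption: each score is the fraction of annotators who picked
    that face; this is a sketch, not the paper's aggregation rule.
    """
    total = votes_i + votes_j
    return votes_i / total, votes_j / total

# e.g. 8 of 10 annotators picked p_i over p_j:
si, sj = relative_scores(8, 2)   # si = 0.8, sj = 0.2
```

Scores produced this way are only comparable within a pair, which matches the slide's caveat that si and sj are not calibrated across pairs.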
Approach(2)-Importance Annotation
• The table shows a breakdown of both datasets along the magnitude of differences in importance.
13
Approach(3)-Importance Model
• The objective is to build a model M that regresses to the difference in ground-truth importance scores:
M(pi, pj) ≈ si − sj
• We use a linear model: M(pi, pj) = wᵀφ(pi, pj), where φ(pi, pj) are the features extracted for this pair and w are the regressor weights. We use ν-Support Vector Regression to learn these weights.
14
Approach(3)-Importance Model
• We compared two ways of composing these individual face features:
1. Using the difference of the two feature vectors
2. Concatenating the two individual feature vectors
15
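The two compositions can be sketched in a few lines of numpy (the feature values are made up):

```python
import numpy as np

# Two ways of composing per-person features f_i, f_j into a
# pairwise feature vector phi(p_i, p_j):
f_i = np.array([0.2, 0.9, 0.5])
f_j = np.array([0.4, 0.1, 0.5])

phi_diff = f_i - f_j                      # 1. difference of features
phi_concat = np.concatenate([f_i, f_j])   # 2. concatenation

# The difference form is antisymmetric by construction:
# phi(p_i, p_j) = -phi(p_j, p_i), matching M(p_i, p_j) ≈ s_i - s_j.
```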
Approach(4)-Person Features
• Distance Features
We first scale the image to unit size (1, 1), and compute two distance features:
1. Distance from the center
2. Weighted distance from the center
We compute two more features to capture how far a person is from the center of a group:
1. Normalized distance from the centroid
2. Normalized distance from the weighted centroid
(Figure: the distance d is measured from the image center (0.5, 0.5) and normalized by the largest dimension of the face box.)
The weighted centroid is the weighted average of the center points of the faces:
x_cm = (m1·x1 + m2·x2 + m3·x3) / (m1 + m2 + m3)
y_cm = (m1·y1 + m2·y2 + m3·y3) / (m1 + m2 + m3)
where the weight m of a face is the area of the head divided by the total area of faces in the image.
16
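A partial sketch of these distance features (center distance and weighted-centroid distance; box format and the use of box area as a proxy for head area are assumptions):

```python
import numpy as np

def distance_features(boxes):
    """Distance features per face, following the slides (a sketch).

    boxes: (N, 4) array-like of face boxes (x, y, w, h) in an image
    already scaled to unit size (1, 1), so the center is (0.5, 0.5).
    """
    boxes = np.asarray(boxes, dtype=float)
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0   # face center points
    areas = boxes[:, 2] * boxes[:, 3]             # proxy for head area
    weights = areas / areas.sum()                 # weight of each face

    # 1. distance from the image center (0.5, 0.5)
    d_center = np.linalg.norm(centers - 0.5, axis=1)

    # 2. distance from the weighted centroid:
    #    x_cm = (m1*x1 + m2*x2 + ...) / (m1 + m2 + ...)
    centroid = weights @ centers
    d_centroid = np.linalg.norm(centers - centroid, axis=1)

    return d_center, d_centroid

d_center, d_centroid = distance_features([[0.4, 0.4, 0.2, 0.2],
                                          [0.0, 0.0, 0.2, 0.2]])
```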
Approach(4)-Person Features
• Scale
• Sharpness
We apply a Sobel filter and compute the sum of the gradient energy in a face bounding box, normalized by the sum of the gradient energy in all the face bounding boxes in the image.
17
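The Sobel-based sharpness feature can be sketched with scipy (box format and the squared-gradient definition of energy are assumptions):

```python
import numpy as np
from scipy import ndimage

def sharpness_features(gray, boxes):
    """Per-face sharpness: Sobel gradient energy inside each face box,
    normalized by the energy summed over all face boxes (a sketch).

    gray:  2-D grayscale image array
    boxes: list of (x, y, w, h) integer face boxes
    """
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    energy = gx ** 2 + gy ** 2                 # gradient energy map

    per_face = np.array([energy[y:y + h, x:x + w].sum()
                         for (x, y, w, h) in boxes])
    return per_face / per_face.sum()           # normalized sharpness
```

A face box containing strong edges gets a higher score than one covering a flat, blurry region.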
Approach(4)-Person Features
• Face Pose Features
DPM face pose features:
- We resize the face bounding box patch from the image to 128×128 pixels.
- We run the face pose and landmark estimation algorithm of Zhu et al. [28].
- Our pose feature is this component id, which can range from 1 to 13.
[28] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. In CVPR, 2012.
18
Approach(4)-Person Features
• Face Pose Features
Aspect ratio: while the aspect ratio of a face is typically 1:1, this ratio can differentiate between some head poses.
DPM face pose difference: we compute the person's pose minus the average pose of every other person in the image.
19
Approach(4)-Person Features
• Face Occlusion
DPM face scores: we use the scores for each of the 13 components in the face detection model of [28] as a feature.
Face detection success: a binary feature indicating whether the face detection API [22] we used succeeded in detecting the face, or whether it required human annotation. The API achieved a nearly zero false-positive rate on our dataset.
[22] SkyBiometry. https://www.skybiometry.com/.
20
Results
• Baselines
We compare our proposed approach to three natural baselines: the center, scale, and sharpness baselines.
We also used the method of Harel et al. [10, 12] to produce saliency maps, and computed the fraction of saliency intensity inside each face as a measure of its importance.
We measure inter-human agreement in a leave-one-human-out manner.
[10] J. Harel. A saliency implementation in MATLAB. http://www.klab.caltech.edu/~harel/share/gbvs.php.
[12] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Advances in NIPS, 2006.
21
Results
• Metrics
We use mean squared error to measure the performance of our relative importance regressors.
In addition, we convert the regressor output into a binary classification by thresholding against zero.
For each pair of faces (pi, pj), we use a weighted classification accuracy measure, where the weight is the ground-truth importance score of the more important of the two, i.e. max{si, sj}.
22
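The weighted accuracy described above can be sketched as follows (the handling of exact ties is an assumption):

```python
import numpy as np

def weighted_pairwise_accuracy(pred, s_i, s_j):
    """Weighted classification accuracy over face pairs (a sketch).

    pred: regressor outputs per pair; thresholded at zero, a positive
          value means p_i is predicted more important than p_j.
    s_i, s_j: ground-truth relative importance scores per pair.
    Each pair is weighted by max(s_i, s_j), the score of the more
    important of the two faces.
    """
    pred = np.asarray(pred, dtype=float)
    s_i, s_j = np.asarray(s_i, dtype=float), np.asarray(s_j, dtype=float)

    correct = (pred > 0) == (s_i > s_j)    # binary classification by sign
    weights = np.maximum(s_i, s_j)
    return np.sum(weights * correct) / weights.sum()
```

Weighting by max{si, sj} makes mistakes on clearly-important people cost more than mistakes on near-ties.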
Results
• Image-Level Importance Results
Overall, we achieve an improvement of 3.17% over the best baseline (a 3.54% relative improvement). The mean squared error for our SVR is 0.1489.
23
Results
• Image-Level Importance Results
Table 4 shows a break-down of the accuracies into the three categories of annotations.
24
Results
• Corpus-Level Importance Results
(Table: our approach compared against the best baseline.)
25
Results
• Corpus-Level Importance Results
Table 6 shows the category breakdown.
26
Results
• Image-Level and Corpus-Level
Fig. 4 shows some qualitative results for the image-level and corpus-level experiments.
27
Results
• Image-Level and Corpus-Level
Table 7 reports results from an ablation study, which shows the impact of each feature on the final performance.
28
Importance vs Saliency
• We measured the correlation between the importance and saliency rankings using Kendall's Tau; it was 0.5256. The most salient face was also the most important person in 52.56% of the cases.
• Fig. 5 shows qualitative examples of individuals who are judged by humans to be salient but not important, important but not salient, both salient and important, and neither.
29
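Kendall's Tau between two rankings can be computed with scipy; the rankings below are made-up stand-ins, not the paper's data:

```python
# Rank correlation between importance and saliency rankings via
# Kendall's tau (scipy). Higher tau = more similar rankings.
from scipy.stats import kendalltau

importance_rank = [1, 2, 3, 4, 5]   # hypothetical importance order
saliency_rank   = [2, 1, 3, 5, 4]   # hypothetical saliency order

tau, p_value = kendalltau(importance_rank, saliency_rank)
# 2 of the 10 pairs are discordant, so tau = (8 - 2) / 10 = 0.6
```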
Application: Improving Im2Text
(Figure panels: Setup, Prediction, Results.)
30
Conclusions
• We proposed the task of automatically predicting the importance of individuals in group photographs.
• Compared to previous work in visual saliency, the proposed person importance is correlated but not identical.
• We showed that our method can successfully predict the importance of people from purely visual cues, and that incorporating predicted importance provides significant improvement in im2text.
31
References
• Narrow depth-of-field: https://goo.gl/EfxN2Q
• Sobel filter: http://goo.gl/BmBCx9
32