Best of both worlds: human-machine collaboration for object annotation
Olga Russakovsky1, Li-Jia Li2, Li Fei-Fei1
1 Stanford University 2 Snapchat (this work was done at Yahoo! Labs)
Image annotation state computer vision (+ human input)
within specified precision, utility, and/or budget
Given N detections (Region, Class, Probability) (R_1, C_1, p_1) … (R_N, C_N, p_N):
Expected utility is Σ_i p_i · f(R_i, C_i)
Expected precision is (Σ_i p_i) / N
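The two expectations above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the box format, the `uniform_importance` function, and the sample detections are assumptions made for the example, and returning 0.0 for an empty labeling is a convention chosen here.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int, int, int]       # assumed box format: (x_min, y_min, x_max, y_max)
Detection = Tuple[Region, str, float]    # (R_i, C_i, p_i)


def expected_utility(dets: List[Detection],
                     f: Callable[[Region, str], float]) -> float:
    """Sum of p_i * f(R_i, C_i) over all detections."""
    return sum(p * f(region, cls) for region, cls, p in dets)


def expected_precision(dets: List[Detection]) -> float:
    """Mean of p_i over the N detections (0.0 for an empty labeling, by convention)."""
    if not dets:
        return 0.0
    return sum(p for _, _, p in dets) / len(dets)


def uniform_importance(region: Region, cls: str) -> float:
    """Hypothetical importance function: every label is equally useful."""
    return 1.0


# Illustrative detections; the boxes are made up for this example.
detections: List[Detection] = [
    ((10, 20, 200, 180), "bed", 0.6),
    ((30, 40, 80, 90), "pillow", 0.9),
    ((5, 5, 50, 60), "plant", 0.8),
]

print(expected_utility(detections, uniform_importance))
print(expected_precision(detections))
```

With a uniform importance function, the expected utility is simply the expected number of correct labels.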
Output: final annotation
Bed
Dresser
Plant
Pillow Pillow
Curtains Curtains Fan
Input: image & constraints
A) Image to label:
B) Constraints:
1. utility: Given a function f(Region, Class) -> [0, 1] indicating the importance/usefulness of a label (R, C), the utility of a labeling (R_1, C_1) … (R_N, C_N) is Σ_i f(R_i, C_i)
2. precision: If N detections are returned and M are correct, then precision is M/N
3. and/or budget: Budget is human annotation time
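One way to see how the utility and precision constraints interact is to pick, from a set of candidate detections with probabilities, the labeling with the highest expected utility whose expected precision meets a target. The sketch below assumes the simplest importance function, f ≡ 1, so utility counts objects; under that assumption, taking the longest prefix of the probability-sorted list that still meets the target is optimal. The function name `select_labeling` and the placeholder region strings are hypothetical; the probabilities are the ones shown on the poster's example image.

```python
from typing import List, Tuple

Candidate = Tuple[str, str, float]   # (region id, class, probability); region ids are placeholders


def select_labeling(cands: List[Candidate], min_precision: float) -> List[Candidate]:
    """Largest set of candidates whose mean probability is >= min_precision.

    Assumes f(R, C) = 1 for every label, so expected utility is the sum of
    probabilities and the best set of any size k is the top-k by probability.
    """
    ranked = sorted(cands, key=lambda c: c[2], reverse=True)
    chosen: List[Candidate] = []
    total = 0.0
    for cand in ranked:
        # Stop as soon as adding the next candidate would break the precision target.
        if (total + cand[2]) / (len(chosen) + 1) < min_precision:
            break
        chosen.append(cand)
        total += cand[2]
    return chosen


candidates: List[Candidate] = [
    ("box-a", "bed", 0.6),
    ("box-b", "fan", 0.3),
    ("box-c", "plant", 0.8),
    ("box-d", "pillow", 0.9),
]

for region, cls, p in select_labeling(candidates, min_precision=0.7):
    print(region, cls, p)
```

With a non-uniform importance function this greedy prefix rule is no longer guaranteed to be optimal, so it should be read only as an illustration of the trade-off.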
Human tasks: 4 binary tasks, 3 open-ended tasks
Select optimal question within budget
Update image beliefs
Return optimal annotation
Compute image beliefs
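Read as a loop, the four boxes above are: compute image beliefs, repeatedly select a question within the remaining budget and update the beliefs with the answer, then return the best annotation. The skeleton below is only a sketch of that control flow. Every callable passed in (`compute_beliefs`, `select_question`, `ask_human`, `update_beliefs`, `best_annotation`) is a hypothetical placeholder, and the toy lambdas exist only to make the example runnable.

```python
from typing import Any, Callable, Optional, Tuple


def annotate(image: Any,
             budget_sec: float,
             compute_beliefs: Callable[[Any], Any],
             select_question: Callable[[Any, float], Optional[Any]],
             ask_human: Callable[[Any], Tuple[Any, float]],
             update_beliefs: Callable[[Any, Any, Any], Any],
             best_annotation: Callable[[Any], Any]) -> Tuple[Any, float]:
    """Human-in-the-loop annotation loop; returns (annotation, seconds spent)."""
    beliefs = compute_beliefs(image)              # computer vision models only
    spent = 0.0
    while True:
        # select_question is expected to return None when nothing affordable is left.
        question = select_question(beliefs, budget_sec - spent)
        if question is None:
            break
        answer, cost = ask_human(question)        # cost is human time in seconds
        spent += cost
        beliefs = update_beliefs(beliefs, question, answer)
    return best_annotation(beliefs), spent


# Toy stand-ins, for illustration only.
labels, spent = annotate(
    image=None,
    budget_sec=12.0,
    compute_beliefs=lambda img: {"sofa": 0.5},
    select_question=lambda b, remaining: "sofa" if remaining >= 5.0 else None,
    ask_human=lambda q: (True, 5.0),
    update_beliefs=lambda b, q, a: {**b, q: min(1.0, b[q] + 0.2)},
    best_annotation=lambda b: [c for c, p in b.items() if p >= 0.8],
)
print(labels, spent)
```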
Goal: To build a principled framework for accurately and efficiently localizing objects in an image
Introduction Method Results
[Figure: example annotation sequences. Starting from automatic object detection (e.g., dog, car, person), the system asks human questions such as Verify-box ("Is the yellow box tight around a car?" Answer: No), Draw-box ("Draw a box around a dog / a person / a sofa." Answer: yellow box), Name-image ("Name a new object." Answer: Sofa), and Verify-image ("Does the image contain a sofa?" Answer: Yes), ending in a final labeling (e.g., remote control, person, table, sofa).]
Summary: We developed a principled human-in-the-loop framework integrating state-of-the-art scene understanding models with state-of-the-art crowd engineering methods for detecting objects in images.
Key differences with prior work:
1. Complex, open-ended annotation task
2. Novel Markov Decision Process formulation
3. Using multiple types of human input and multiple types of computer vision models
4. Automatically trading off utility, precision, and cost of annotation
Computer vision conclusions:
- Current detectors recognize 1-2 objects per image
- Our system can collect large-scale datasets
Crowd engineering conclusions:
- It is good to combine multiple human tasks in a principled way
- Asking complex questions may be more efficient
Bed (0.6)
Objects in image: fan (0.3), plant (0.8), …
An object (0.9)
Another bed in image (0.2) Another pillow in image (0.9)
Pillow (0.9) An object (0.1)
Another object in image (0.8)
Classifiers: Krizhevsky 2012, Hoffman 2014, Caffe
Detectors: Girshick 2014
Objectness: Alexe 2012
Instance statistics: ILSVRC 2014
Object statistics: Greene 2014
Cost and error rates
Empirical, from MTurk
Error rates are: False Negative / False Positive / (Wrong on Attempt)
Note for future work: Distinguishing between “good” and “bad” bounding boxes is very difficult for humans
Cost
Crowd engineering is improving
Label quantity and quality per image
Automatic object detection: low cost, low accuracy, few objects
Best: combining the advantages of computer vision and manual annotation
Manual annotation: high accuracy, huge cost, many objects
Object detectors are improving
[Branson 2010, Jain 2013, Branson 2014, Lad 2014]
Select question: Markov Decision Process
State: set of image descriptions, with probabilities
Action: a question to ask humans
Reward: increase in estimated utility of labeling divided by cost
Transition probabilities correspond to expected user answer. Computed from:
1. Current estimate of correct response
2. Pre-computed human error rates
Optimization: 2-step lookahead search
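The reward and transition model above can be illustrated with a simplified, one-step version of the question selection (the poster uses a 2-step lookahead; the greedy version here is a deliberate simplification). Several things are assumptions made for the sketch: only binary verify-style questions about a single event are modeled, the state is a dictionary of event probabilities, and `utility` counts the expected number of correct labels among events whose probability reaches `min_prob`. The costs and error rates in the example are taken from the poster's table for "Is there a fan?" and "Is this a bed?".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Question:
    event: str        # event the binary question is about
    cost_sec: float   # average human time for this question
    fn_rate: float    # P(answer "no"  | event is true)
    fp_rate: float    # P(answer "yes" | event is false)


def posterior(prior: float, said_yes: bool, q: Question) -> float:
    """Bayes update of one event probability from one answer."""
    like_true = (1 - q.fn_rate) if said_yes else q.fn_rate
    like_false = q.fp_rate if said_yes else (1 - q.fp_rate)
    evidence = prior * like_true + (1 - prior) * like_false
    return prior * like_true / evidence if evidence > 0 else prior


def utility(beliefs: Dict[str, float], min_prob: float) -> float:
    """Assumed utility estimate: expected correct labels among confident events."""
    return sum(p for p in beliefs.values() if p >= min_prob)


def expected_reward(beliefs: Dict[str, float], q: Question, min_prob: float) -> float:
    """Expected increase in estimated utility, divided by the question's cost."""
    prior = beliefs[q.event]
    # Transition probability: current estimate combined with human error rates.
    p_yes = prior * (1 - q.fn_rate) + (1 - prior) * q.fp_rate
    expected_after = 0.0
    for said_yes, p_answer in ((True, p_yes), (False, 1 - p_yes)):
        updated = dict(beliefs)
        updated[q.event] = posterior(prior, said_yes, q)
        expected_after += p_answer * utility(updated, min_prob)
    return (expected_after - utility(beliefs, min_prob)) / q.cost_sec


def select_question(beliefs: Dict[str, float], questions: List[Question],
                    budget_sec: float, min_prob: float = 0.9) -> Optional[Question]:
    """Greedy (one-step) choice among the questions that fit in the budget."""
    affordable = [q for q in questions if q.cost_sec <= budget_sec]
    if not affordable:
        return None
    return max(affordable, key=lambda q: expected_reward(beliefs, q, min_prob))


beliefs = {"fan": 0.3, "bed": 0.6}
questions = [
    Question("fan", cost_sec=5.34, fn_rate=0.13, fp_rate=0.02),
    Question("bed", cost_sec=5.89, fn_rate=0.23, fp_rate=0.07),
]
print(select_question(beliefs, questions, budget_sec=30.0))
```

A 2-step lookahead would apply the same expectation recursively to each updated state before choosing the first question.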
Object detection
Is yellow box tight around a car?
Human answer: no
Draw a box around a person
Answer: drew yellow box
Final labeling
Annotation probability
Computer vision model
Event (e.g., “there is a cat in this box”) Image
User feedback
User model
User accuracy or error rate Number of users
Update image beliefs: Joint computer vision and user model
Our challenge: computing P(u_k | E) when E = “there is a cat in this box” and u_k is a response to a previous (potentially unrelated) question, e.g., “is there a cow in this image?”
[Branson 2010]
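For the simplest case, where every response u_k directly answers a yes/no question about the event E itself, the belief update can be sketched as a Bayes update in log-odds form. This does not address the harder case named above (responses to unrelated questions). It assumes independent users and fixed false negative / false positive rates, with the prior and both rates strictly between 0 and 1. The function name `update_belief` is hypothetical; the example rates are those listed on the poster for "Is this a bed?".

```python
import math
from typing import Iterable


def update_belief(prior: float, answers: Iterable[bool],
                  fn_rate: float, fp_rate: float) -> float:
    """P(E | answers) for independent yes/no answers about the event E.

    fn_rate = P(answer "no" | E true), fp_rate = P(answer "yes" | E false).
    Requires 0 < prior < 1 and 0 < fn_rate, fp_rate < 1.
    """
    log_odds = math.log(prior / (1 - prior))
    for said_yes in answers:
        if said_yes:
            log_odds += math.log((1 - fn_rate) / fp_rate)
        else:
            log_odds += math.log(fn_rate / (1 - fp_rate))
    return 1 / (1 + math.exp(-log_odds))


# Computer vision prior for "bed" is 0.6; one user says yes, then a second says no.
print(update_belief(0.6, [True], fn_rate=0.23, fp_rate=0.07))
print(update_belief(0.6, [True, False], fn_rate=0.23, fp_rate=0.07))
```

Because the users are treated as independent, each answer contributes one additive term, so the order of the answers does not matter.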
Partially sponsored by ONR MURI, by Yahoo! Labs, and by NVIDIA
[Alexe 2012] B. Alexe et al. Measuring the objectness of image windows. PAMI 2012.
[Branson 2010] S. Branson et al. Visual Recognition with Humans in the Loop. ECCV 2010.
[Branson 2014] S. Branson et al. Active annotation translation. CVPR 2014.
[Girshick 2014] R. Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
CV+H: random ordering
H-only: random ordering
ILSVRC-DET baseline CV-only
CV+H: only binary questions
Data: ILSVRC 2014-DET validation set, 200 object classes, 2216 images (≥4 instances per image)
Models: [Krizhevsky 2012, Hoffman 2014, Caffe, Girshick 2014, Alexe 2012, Greene 2014]
User model assumptions:
1) Users are independent
2) Error rates do not depend on image appearance
(1) Computer vision (CV) and human input (H) are mutually beneficial (at low budget)
(2) MDP is effective at selecting tasks
(3) Complex human tasks are necessary
(4) Our strategy is more effective than ILSVRC baseline
[Plot axes: Budget (seconds) vs. Avg. objects labeled]
[Hoffman 2014] J. Hoffman et al. LSDA: Large scale detection through adaptation. NIPS 2014.
[Jain 2013] S. Jain and K. Grauman. Predicting sufficient annotation strength for interactive foreground segmentation. ICCV 2013.
[ILSVRC] O. Russakovsky, J. Deng et al. ImageNet Large Scale Visual Recognition Challenge. IJCV 2015.
[Lad 2014] S. Lad and D. Parikh. Interactively guiding semi-supervised clustering via attribute-based explanations. ECCV 2014.
Object detection
Name a new object
Answer: sofa
Is there a sofa?
Answer: yes
Final labeling
Draw a box around a sofa
Answer: drew yellow box
Qualitative results
Quantitative results
Task                        Cost        Error (false negative / false positive / wrong on attempt)
1: Is there a fan?          5.34 sec    0.13 / 0.02
2: Is this a bed?           5.89 sec    0.23 / 0.07
3: Is this an object?       5.71 sec    0.29 / 0.04
4: Are there more pillows?  7.57 sec    0.25 / 0.26
5: Name this object.        9.67 sec    0.25 / 0.08 / 0.06
6: Draw another bed.        10.21 sec   0.28 / 0.16 / 0.29
7: Name another object.     9.46 sec    0.02 / 0.12 / 0.05
pillow, bed