Best of both worlds: human-machine collaboration for...


Best of both worlds: human-machine collaboration for object annotation
Olga Russakovsky (1), Li-Jia Li (2), Li Fei-Fei (1)
(1) Stanford University, (2) Snapchat (this work was done at Yahoo! Labs)

Image annotation state: computer vision (+ human input)

within specified precision, utility, and/or budget

Given N detections (Region, Class, Probability) $(R_1, C_1, p_1), \dots, (R_N, C_N, p_N)$:

Expected utility is $\sum_{i=1}^{N} p_i f(R_i, C_i)$
Expected precision is $\frac{1}{N} \sum_{i=1}^{N} p_i$
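The two expectations above can be computed directly from a list of detections. The following is a minimal sketch, not the authors' implementation: the `Detection` tuple layout, the box coordinates, and the constant importance function (every label counts equally) are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# A detection is (region, class_name, probability). The region is a
# hypothetical (x1, y1, x2, y2) box used only for illustration.
Region = Tuple[int, int, int, int]
Detection = Tuple[Region, str, float]


def expected_utility(detections: List[Detection],
                     f: Callable[[Region, str], float]) -> float:
    """Sum of p_i * f(R_i, C_i) over all returned detections."""
    return sum(p * f(region, cls) for region, cls, p in detections)


def expected_precision(detections: List[Detection]) -> float:
    """Mean detection probability, i.e. the sum of p_i divided by N."""
    if not detections:
        return 0.0
    return sum(p for _, _, p in detections) / len(detections)


# Illustrative values only: two detections, every label equally useful.
detections = [((10, 10, 50, 50), "bed", 0.6),
              ((60, 20, 90, 40), "pillow", 0.9)]
print(expected_utility(detections, lambda region, cls: 1.0))
print(expected_precision(detections))
```

With a constant f of 1, the expected utility is simply the expected number of correct labels, which is one natural reading of "number of objects labeled".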

Output: final annotation
[Figure: example output annotation with labels Bed, Dresser, Plant, Pillow, Pillow, Curtains, Curtains, Fan]

Input: image & constraints
A) Image to label
B) Constraints:
1. utility: Given a function $f(\text{Region}, \text{Class}) \to [0, 1]$ indicating the importance/usefulness of a label $(R, C)$, the utility of a labeling $(R_1, C_1), \dots, (R_N, C_N)$ is $\sum_{i=1}^{N} f(R_i, C_i)$
2. precision: If N detections are returned and M are correct, then precision is M/N
3. and/or budget: Budget is human annotation time
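To illustrate how the utility and precision constraints interact, here is a small sketch of picking a labeling under a precision floor. It rests on assumptions not stated in the poster: every label has importance f = 1, regions are omitted, and the expected precision of a labeling is the mean of its label probabilities. Under those assumptions, taking the most probable labels while the running mean stays above the floor maximizes the expected utility. The function name `best_labeling` is hypothetical.

```python
from typing import List, Tuple

Label = Tuple[str, float]  # (class name, probability); regions omitted


def best_labeling(candidates: List[Label], min_precision: float) -> List[Label]:
    """Largest top-k set of labels whose mean probability meets the floor."""
    ranked = sorted(candidates, key=lambda label: label[1], reverse=True)
    chosen: List[Label] = []
    total = 0.0
    for label in ranked:
        # The running mean of a descending list never increases, so the
        # first label that breaks the floor ends the search.
        if (total + label[1]) / (len(chosen) + 1) < min_precision:
            break
        chosen.append(label)
        total += label[1]
    return chosen


# Probabilities are illustrative beliefs for four candidate labels.
candidates = [("fan", 0.3), ("plant", 0.8), ("bed", 0.6), ("pillow", 0.9)]
print([name for name, _ in best_labeling(candidates, 0.7)])
```

With a general importance function f the selection is no longer a simple prefix of the ranked list, so this sketch should be read only as the equal-importance special case.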

Human tasks: 4 binary tasks, 3 open-ended tasks

Select optimal question within budget

Update image beliefs

Return optimal annotation

Compute image beliefs

Goal: To build a principled framework for accurately and efficiently localizing objects in an image

Introduction Method Results

[Figure: example annotation sequences, each running from Object Detection to Final Labeling. Detected labels shown: Dog, Car, Person. Final labeling shown: Remote Control, Person, Table, Sofa.]
Verify-box: Is the yellow box tight around a car? Answer: No
Draw-box: Draw a box around a dog. Answer: Yellow box below
Draw-box: Draw a box around a person. Answer: Yellow box below
Name-image: Name a new object. Answer: Sofa
Verify-image: Does the image contain a sofa? Answer: Yes
Draw-box: Draw a box around a sofa. Answer: Yellow box below





Summary: We developed a principled human-in-the-loop framework integrating state-of-the-art scene understanding models with state-of-the-art crowd engineering methods for detecting objects in images.

Key differences with prior work:
1. Complex, open-ended annotation task
2. Novel Markov Decision Process formulation
3. Using multiple types of human input and multiple types of computer vision models
4. Automatically trading off utility, precision, and cost of annotation

Computer vision conclusions:
- Current detectors recognize 1-2 objects per image
- Our system can collect large-scale datasets

Crowd engineering conclusions:
- Good to combine multiple human tasks in a principled way
- Asking complex questions may be more efficient

[Figure: example image beliefs]
Bed (0.6)
An object (0.9)
Pillow (0.9) An object (0.1)
Objects in image: fan (0.3), plant (0.8), …
Another bed in image (0.2)
Another pillow in image (0.9)
Another object in image (0.8)

Classifiers: Krizhevsky 2012, Hoffman 2014, Caffe
Detectors: Girshick 2014
Objectness: Alexe 2012
Instance statistics: ILSVRC2014
Object statistics: Greene 2014

Cost and error rates: empirical, from MTurk
Error rates are: False Negative / False Positive / (Wrong on Attempt)
Note for future work: Distinguishing between "good" and "bad" bounding boxes is very difficult for humans

[Figure: cost vs. label quantity and quality per image]
Automatic object detection: low cost, low accuracy, few objects. Object detectors are improving.
Manual annotation: high accuracy, huge cost, many objects. Crowd engineering is improving.
Best: combining the advantages of computer vision and manual annotation

[Branson 2010, Jain 2013, Branson 2014, Lad 2014]

Select question: Markov Decision Process
State: set of image descriptions, with probabilities
Action: a question to ask humans
Reward: increase in estimated utility of labeling divided by cost
Transition probabilities correspond to the expected user answer. Computed from:
1. Current estimate of correct response
2. Pre-computed human error rates

Optimization: 2-step lookahead search
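The reward and transition probabilities described above can be sketched for binary questions as follows. This is a one-step greedy simplification, not the 2-step lookahead search named in the poster. The `Question` class and the functions `prob_yes`, `expected_reward`, `toy_reward`, and `select_question` are hypothetical names, and the toy utilities (a "yes" adds one unit of utility, a "no" adds none) are assumptions made only for illustration. The beliefs, costs, and error rates are taken from the poster's example tasks, reading the two error rates as false negative and false positive.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Question:
    name: str
    cost: float       # seconds of human annotation time
    belief: float     # current estimate that the correct answer is "yes"
    false_neg: float  # P(user says no  | correct answer is yes)
    false_pos: float  # P(user says yes | correct answer is no)


def prob_yes(q: Question) -> float:
    """Transition probability of a 'yes' answer under the user error model."""
    return q.belief * (1.0 - q.false_neg) + (1.0 - q.belief) * q.false_pos


def expected_reward(q: Question, utility_now: float,
                    utility_if_yes: float, utility_if_no: float) -> float:
    """Expected increase in estimated utility, divided by the question cost."""
    p = prob_yes(q)
    expected_after = p * utility_if_yes + (1.0 - p) * utility_if_no
    return (expected_after - utility_now) / q.cost


def toy_reward(q: Question) -> float:
    # Assumption: a "yes" adds one unit-utility label, a "no" adds nothing.
    return expected_reward(q, utility_now=0.0,
                           utility_if_yes=1.0, utility_if_no=0.0)


def select_question(questions: List[Question], budget: float,
                    reward: Callable[[Question], float]) -> Optional[Question]:
    """Pick the affordable question with the highest expected reward."""
    affordable = [q for q in questions if q.cost <= budget]
    if not affordable:
        return None
    return max(affordable, key=reward)


questions = [
    Question("Is there a fan?", 5.34, 0.3, 0.13, 0.02),
    Question("Are there more pillows?", 7.57, 0.9, 0.25, 0.26),
]
best = select_question(questions, budget=20.0, reward=toy_reward)
print(best.name if best else "no affordable question")
```

A lookahead version would apply the same reward recursively to the belief state that results from each possible answer, which is where the belief update step comes in.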

Object detection

Is yellow box tight around a car?

Human answer: no

Draw a box around a person

Answer: drew yellow box

Final labeling

Update image beliefs: Joint computer vision and user model [Branson 2010]

The annotation probability of an event (e.g., "there is a cat in this box") combines:
- Computer vision model: the event given the image
- User model: user feedback given the event, based on user accuracy or error rate and the number of users

Our challenge: computing P(u_k | E) when E = "there is a cat in this box" and u_k is a response to a previous (potentially unrelated) question, e.g., "is there a cow in this image?"
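For the simplest case, where each user response directly answers the event in question, the joint model can be sketched as a Bayes update: the computer vision probability acts as the prior and each user answer contributes a likelihood term. This is a minimal sketch under the poster's stated user-model assumptions (independent users, error rates that do not depend on image appearance). The function name `update_belief` is hypothetical, and the harder case of responses to unrelated questions is not covered.

```python
from typing import Iterable


def update_belief(prior: float, answers: Iterable[bool],
                  false_neg: float, false_pos: float) -> float:
    """Posterior probability of a binary event E after yes/no user answers.

    prior      -- P(E | image) from the computer vision model
    answers    -- independent user responses to "is E true?" (True = yes)
    false_neg  -- P(user says no  | E is true)
    false_pos  -- P(user says yes | E is false)
    """
    like_true, like_false = prior, 1.0 - prior
    for said_yes in answers:
        like_true *= (1.0 - false_neg) if said_yes else false_neg
        like_false *= false_pos if said_yes else (1.0 - false_pos)
    total = like_true + like_false
    # Fall back to the prior if both hypotheses have zero likelihood.
    return like_true / total if total > 0 else prior


# "Is there a fan?": prior 0.3, error rates 0.13 / 0.02, one user says yes.
print(round(update_belief(0.3, [True], 0.13, 0.02), 3))
```

Each additional independent answer multiplies in one more likelihood term, so several agreeing users move the belief further than a single user does.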

Partially sponsored by ONR MURI, by Yahoo! Labs, and by NVIDIA

[Alexe 2012] B. Alexe et al. Measuring the objectness of image windows. PAMI 2012.
[Branson 2010] S. Branson et al. Visual Recognition with Humans in the Loop. ECCV 2010.
[Branson 2014] S. Branson et al. Active annotation translation. CVPR 2014.
[Girshick 2014] R. Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

[Plot legend: CV+H: random ordering; H-only: random ordering; ILSVRC-DET baseline; CV-only; CV+H: only binary questions]

Data: ILSVRC 2014-DET validation set, 200 object classes, 2216 images (≥4 instances per image)
Models: [Krizhevsky 2012, Hoffman 2014, Caffe, Girshick 2014, Alexe 2012, Greene 2014]

User model assumptions:
1) Users are independent
2) Error rates do not depend on image appearance

(1) Computer vision (CV) and human input (H) are mutually beneficial (at low budget)

(2) MDP is effective at selecting tasks

(3) Complex human tasks are necessary

[Plot axes: Budget (seconds) vs. Avg. objects labeled]

(4) Our strategy is more effective than ILSVRC baseline

[Hoffman 2014] J. Hoffman et al. LSDA: Large scale detection through adaptation. NIPS 2014.
[Jain 2013] S. Jain and K. Grauman. Predicting sufficient annotation strength for interactive foreground segmentation. ICCV 2013.
[ILSVRC] O. Russakovsky, J. Deng et al. ImageNet Large Scale Visual Recognition Challenge. IJCV 2015.
[Lad 2014] S. Lad and D. Parikh. Interactively guiding semi-supervised clustering via attribute-based explanations. ECCV 2014.





Object detection

Name a new object

Answer: sofa

Is there a sofa?

Answer: yes

Final labeling

Draw a box around a sofa

Answer: drew yellow box

Qualitative results

Quantitative results

#  Human task               Cost (sec)  Error rates
1  Is there a fan?          5.34        0.13 / 0.02
2  Is this a bed?           5.89        0.23 / 0.07
3  Is this an object?       5.71        0.29 / 0.04
4  Are there more pillows?  7.57        0.25 / 0.26
5  Name this object.        9.67        0.25 / 0.08 / 0.06
6  Draw another bed.        10.21       0.28 / 0.16 / 0.29
7  Name another object.     9.46        0.02 / 0.12 / 0.05

pillow, bed
