Best of both worlds: human-machine collaboration for object annotation


Transcript of "Best of both worlds: human-machine collaboration for object annotation" (poster: ai.stanford.edu/~olga/posters/cvpr15-poster.pdf)


Best of both worlds: human-machine collaboration for object annotation
Olga Russakovsky (1), Li-Jia Li (2), Li Fei-Fei (1)
1 Stanford University   2 Snapchat (this work was done at Yahoo! Labs)

Image annotation state: computer vision (+ human input), within specified precision, utility, and/or budget.

Given N detections (Region, Class, Probability) (R_1, C_1, p_1) … (R_N, C_N, p_N):
Expected utility is Σ_i p_i · f(R_i, C_i); expected precision is (Σ_i p_i) / N.
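As a concrete illustration, here is a minimal Python sketch of the two quantities (our code, not the authors'); a detection is a (region, cls, p) triple and f maps (region, cls) to [0, 1]:

```python
# Minimal sketch of the definitions above (not the authors' code).

def expected_utility(detections, f):
    """Sum of p_i * f(R_i, C_i) over all detections (region, cls, p)."""
    return sum(p * f(region, cls) for region, cls, p in detections)

def expected_precision(detections):
    """(Sum of p_i) / N: expected fraction of returned detections that are correct."""
    return sum(p for _, _, p in detections) / len(detections) if detections else 0.0
```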

Output: final annotation

[Example output labels: Bed, Dresser, Plant, Pillow (×2), Curtains (×2), Fan]

Input: image & constraints. A) Image to label. B) Constraints (any of):
1. Utility: given a function f(Region, Class) → [0, 1] indicating the importance/usefulness of a label (R, C), the utility of a labeling (R_1, C_1) … (R_N, C_N) is Σ_i f(R_i, C_i).
2. Precision: if N detections are returned and M are correct, then precision is M/N.
3. Budget: human annotation time.

Human tasks: 4 binary tasks, 3 open-ended tasks

Pipeline: compute image beliefs → select the optimal question within budget → update image beliefs (repeat) → return the optimal annotation; see the sketch below.
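The loop itself might look like the following sketch. This is our reconstruction with hypothetical helper names; compute_image_beliefs, select_question, ask_human, update_beliefs, and best_labeling are stand-ins injected as arguments, not the authors' API:

```python
# Hedged sketch of the annotation loop above; all helpers are hypothetical
# stand-ins passed in as arguments, not the authors' API.
def annotate(image, budget, compute_image_beliefs, select_question,
             ask_human, update_beliefs, best_labeling):
    beliefs = compute_image_beliefs(image)   # seed the state from CV models
    spent = 0.0
    while spent < budget:
        # MDP step: pick the question with the best expected utility gain per cost
        question = select_question(beliefs, remaining=budget - spent)
        if question is None:                 # no affordable question is worth asking
            break
        answer = ask_human(question)         # crowd task (binary or open-ended)
        beliefs = update_beliefs(beliefs, question, answer)
        spent += question.cost
    # final annotation maximizing expected utility within the precision target
    return best_labeling(beliefs)
```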

Goal: to build a principled framework for accurately and efficiently localizing objects in an image.

Introduction · Method · Results

[Figure: example annotation episodes. Verify-box: "Is the yellow box tight around a car?" → Answer: No. Draw-box: "Draw a box around a dog / a person / a sofa" → Answer: yellow box below. Verify-image: "Does the image contain a sofa?" → Answer: Yes. Name-image: "Name a new object" → Answer: Sofa. Final labelings include Dog, Car, Person, Remote Control, Table, Sofa.]

Summary: We developed a principled human-in-the-loop framework integrating state-of-the-art scene understanding models with state-of-the-art crowd engineering methods for detecting objects in images.

Key differences from prior work:
1. Complex, open-ended annotation task
2. Novel Markov Decision Process formulation
3. Multiple types of human input and multiple types of computer vision models
4. Automatic trade-off between utility, precision, and cost of annotation

Computer vision conclusions:
- Current detectors recognize 1-2 objects per image
- Our system can collect large-scale datasets

Crowd engineering conclusions:
- Combining multiple human tasks in a principled way pays off
- Asking complex questions may be more efficient

[Figure: example image beliefs, e.g., Bed (0.6), Pillow (0.9), An object (0.9) / An object (0.1), Another bed in image (0.2), Another pillow in image (0.9), Another object in image (0.8); objects in image: fan (0.3), plant (0.8), …]

Computer vision models:
- Classifiers: Krizhevsky 2012, Hoffman 2014, Caffe
- Detectors: Girshick 2014
- Objectness: Alexe 2012
- Instance statistics: ILSVRC 2014
- Object statistics: Greene 2014

Cost and error rates: empirical, from MTurk. Error rates are: False Negative / False Positive / (Wrong on Attempt). Note for future work: distinguishing between a "good" and a "bad" bounding box is very difficult for humans.

[Figure: cost vs. label quantity and quality per image. Automatic object detection: low cost, low accuracy, few objects; object detectors are improving. Manual annotation: high accuracy, huge cost, many objects; crowd engineering is improving [Branson 2010, Jain 2013, Branson 2014, Lad 2014]. Best: combining the advantages of computer vision and manual annotation.]

Select question: Markov Decision Process
- State: set of image descriptions, with probabilities
- Action: a question to ask humans
- Reward: increase in estimated utility of the labeling, divided by cost
- Transition probabilities correspond to the expected user answer, computed from (1) the current estimate of the correct response and (2) pre-computed human error rates
- Optimization: 2-step lookahead search
A sketch of the selection step follows.
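For illustration, here is a greedy one-step version of the selection (our simplification; the poster's optimization is a two-step lookahead search). The answer_distribution and apply methods are hypothetical parts of a question object, and expected_utility is the helper sketched earlier:

```python
# Our simplified sketch: one-step expected reward per question
# (the poster uses a 2-step lookahead search instead).
def one_step_reward(beliefs, question, f):
    """Expected increase in labeling utility divided by question cost."""
    base = expected_utility(beliefs, f)
    gain = 0.0
    # transition probabilities: expectation over the user's possible answers
    for answer, p_answer in question.answer_distribution(beliefs):
        next_beliefs = question.apply(beliefs, answer)  # hypothetical belief update
        gain += p_answer * (expected_utility(next_beliefs, f) - base)
    return gain / question.cost

def select_question(beliefs, questions, f, remaining_budget):
    affordable = [q for q in questions if q.cost <= remaining_budget]
    return max(affordable, key=lambda q: one_step_reward(beliefs, q, f), default=None)
```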

Example episode: object detection → Q: "Is the yellow box tight around a car?" Human answer: no → Q: "Draw a box around a person." Answer: drew yellow box → final labeling.

[Figure: joint model of annotation probability. Computer vision model: ties an event (e.g., "there is a cat in this box") to the image. User model: ties user feedback to the event, parameterized by user accuracy or error rate and the number of users.]

Update image beliefs: joint computer vision and user model.

Our challenge: computing P(u_k | E) when E = "there is a cat in this box" and u_k is a response to a previous (potentially unrelated) question, e.g., "is there a cow in this image?" [Branson 2010]. A sketch of the simpler single-event update follows.
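For a single event and repeated answers to the same question, the update is standard Bayes under the user-model assumptions listed in Results (independent users, fixed per-task error rates). Here is a minimal sketch (ours); the hard case the poster highlights, answers to different but correlated questions, is not handled by this simplification:

```python
# Minimal Bayes-update sketch for one binary event E, assuming independent
# users and fixed per-task error rates (the poster's user-model assumptions).
def posterior(prior, answers):
    """answers: list of (said_yes, false_negative_rate, false_positive_rate)."""
    p_true, p_false = prior, 1.0 - prior
    for said_yes, fn, fp in answers:
        p_true *= (1.0 - fn) if said_yes else fn     # P(answer | E true)
        p_false *= fp if said_yes else (1.0 - fp)    # P(answer | E false)
    return p_true / (p_true + p_false)

# Example with the "Is there a fan?" rates measured below (FN 0.13, FP 0.02):
# posterior(0.3, [(True, 0.13, 0.02), (True, 0.13, 0.02)]) ≈ 0.999
```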

Partially sponsored by ONR MURI, by Yahoo! Labs, and by NVIDIA.

Bibliography:
[Alexe 2012] B. Alexe et al. Measuring the objectness of image windows. PAMI 2012.
[Branson 2010] S. Branson et al. Visual Recognition with Humans in the Loop. ECCV 2010.
[Branson 2014] S. Branson et al. Active annotation translation. CVPR 2014.
[Girshick 2014] R. Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

[Plot legend: CV+H: random ordering; H-only: random ordering; ILSVRC-DET baseline; CV-only; CV+H: only binary questions.]

Data: ILSVRC 2014-DET validation set, 200 object classes, 2216 images (≥4 instances per image). Models: [Krizhevsky 2012, Hoffman 2014, Caffe, Girshick 2014, Alexe 2012, Greene 2014].

User model assumptions: (1) users are independent; (2) error rates do not depend on image appearance.

(1) Computer vision (CV) and human input (H) are mutually beneficial (at low budget)
(2) The MDP is effective at selecting tasks
(3) Complex human tasks are necessary
(4) Our strategy is more effective than the ILSVRC baseline
[Plot axes: x = budget (seconds), y = avg. objects labeled.]

[Hoffman 2014] J. Hoffman et al. LSDA: Large scale detection through adaptation. NIPS 2014.
[Jain 2013] S. Jain and K. Grauman. Predicting sufficient annotation strength for interactive foreground segmentation. ICCV 2013.
[ILSVRC] O. Russakovsky, J. Deng et al. ImageNet Large Scale Visual Recognition Challenge. IJCV 2015.
[Lad 2014] S. Lad and D. Parikh. Interactively guiding semi-supervised clustering via attribute-based explanations. ECCV 2014.


Qualitative results
Example episode: object detection → Q: "Name a new object." Answer: sofa → Q: "Is there a sofa?" Answer: yes → Q: "Draw a box around a sofa." Answer: drew yellow box → final labeling.

Quantitative results: per-task cost and error rates (error rates as defined above: FN / FP / (Wrong on Attempt)):

  Task                        Cost (sec)   Error rates
  1: Is there a fan?          5.34         0.13 / 0.02
  2: Is this a bed?           5.89         0.23 / 0.07
  3: Is this an object?       5.71         0.29 / 0.04
  4: Are there more pillows?  7.57         0.25 / 0.26
  5: Name this object.        9.67         0.25 / 0.08 / 0.06
  6: Draw another bed.        10.21        0.28 / 0.16 / 0.29
  7: Name another object.     9.46         0.02 / 0.12 / 0.05

pillow, bed
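These measurements plug directly into the MDP's cost term; one possible encoding of the table as data (our encoding, with "<class>" as a placeholder for the queried object class):

```python
# The measured per-task costs (seconds) and error rates above, as data
# (our encoding; error tuples are FN / FP / optional wrong-on-attempt).
TASKS = {
    "Is there a <class>?":      (5.34,  (0.13, 0.02)),
    "Is this a <class>?":       (5.89,  (0.23, 0.07)),
    "Is this an object?":       (5.71,  (0.29, 0.04)),
    "Are there more <class>?":  (7.57,  (0.25, 0.26)),
    "Name this object.":        (9.67,  (0.25, 0.08, 0.06)),
    "Draw another <class>.":    (10.21, (0.28, 0.16, 0.29)),
    "Name another object.":     (9.46,  (0.02, 0.12, 0.05)),
}

def task_cost(task):
    """Seconds of annotator time, measured empirically on MTurk."""
    return TASKS[task][0]
```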