Towards a Watson That Sees: Language-Guided Action...
Transcript of Towards a Watson That Sees: Language-Guided Action...
![Page 1: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/1.jpg)
Towards a Watson That Sees:Language-Guided Action Recognition for Robots
Ching L. Teo, Yezhou Yang, Hal Daume III, Cornelia Fermullerand
Yiannis Aloimonos
University of Maryland Institute for Advanced Computer Studies
Apr 9, 2012
![Page 2: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/2.jpg)
Outline
1 Why Vision and Language
2 Action Recognition
3 Future Work
4 Conclusions
![Page 3: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/3.jpg)
Motivation
Computer Vision seeks toachieve perception throughvisual signals.
Current approaches toperception: extract edges,texture, motion, color,surfaces.
What comes next? Felzenszwalb, P. et al, Objectdetection with discriminativelytrained part-based models, PAMI2010
![Page 4: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/4.jpg)
This is not enough: high-levelknowledge is required.
High-level knowledge: semantics,function, attributes, associations.
Enables reasoning from perception.
Approach: use language as aresource of high-level informationto solve problems in ComputerVision. Jia Deng et al, ImageNet: A
Large-Scale Hierarchical ImageDatabase, CVPR 2009
![Page 5: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/5.jpg)
Why Language?
Language provides:
1 A lexicon (dictionary) that encodes contextual relationshipbetween entities: e.g. ladle
occurs−−−−→ kitchen.
2 Prior knowledge: e.g. knifecuts−−→ cucumber.
3 Generalizes to beyond what is “seen” (non-visual): e.g. sun,
beaches, peoplerelates−−−−→ vacation.
Computational Linguistics have provided tools and relationaldatabases that encodes relationships such as: is-a, cause-effect,performs-functions, motivated by etc.
![Page 6: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/6.jpg)
Approach: The Cognitive Dialog
A model of a reasoning process that involves the VisualExecutive (VE) and Language Executive (LE).
VE provides: observations, low-level feature extraction.
LE provides: constraints, plausibility of observations from VE.
Each iteration of the “dialog” seeks to optimize a globalfunction related to the task: e.g. scene/object/actionrecognition, object segmentation.
![Page 7: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/7.jpg)
Action Recognition is Hard
Goal: Recognizing various actionsinvolving hand-tools.
Challenge: ambiguity of actionssimply from the trajectories of thehand alone.
Key idea is to use language assource of prior knowledge thatrelates tools with the actionsperformed.
![Page 8: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/8.jpg)
Corpus-Guided Action Recognition
Input: set of unlabeled videos.Goal is to estimate a model thatassigns clusters of videos to thecorrect actions.
VE: detects tools and actionfeatures (noisy).
LE: provides a language model forcomputing likelihoods oftool-action co-occurrence.
Strategy: EM to update the actionassignment model at each iteration.
(a) Training the language model from alarge text corpus. (b) Detected tools arequeried into the language model. (c)Language model returns prediction ofaction. (d) Action features are comparedand beliefs updated.
![Page 9: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/9.jpg)
Active Tool Detection
Overview of the tool detection strategy: (1) Optical flow is first computedfrom the input video frames. (2) We train a CRF segmentation model basedon optical flow + skin color. (3) Guided by the flow computations, wesegment out hand-like regions (and removed faces if necessary) to obtain thehand regions that are moving (the active hand that is holding the tool). (4)The active hand region is where the tool is localized. Using the PLS detector(5), we compute a detection score for the presence of a tool.
![Page 10: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/10.jpg)
Extracting Action Features
Detected hands are trackedusing a Kalman Filter acrossthe entire video sequence.
Creates a 20-dim featurevector consisting of Fouriercoefficients and normalizedaveraged velocities.
Detected hand trajectories. x and ycoordinates are denoted as red and bluecurves respectively.
![Page 11: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/11.jpg)
Action-Tool Language Model
Computes the conditionalprobability that an actionoccured given that a toolhas been detected:PL(vj |ni ) =
#(vj ,ni )#(ni )
.
Uses the Gigaword Corpuscontaining 10 years worth ofNYT newswire data.
Gigaword co-occurrence matrix of toolsand predicted actions.
![Page 12: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/12.jpg)
Guiding Action Recognition Via EM
E step: compute the expectation of the assignment variablefor action j , tool i , video d : Wijd ∝ PI (i)PL(j |i)Pen(d |j)M step: update the model parameter C via:
arg maxC
(−∑
i ,j ,dWijd ||Fd − Cj ||2)
![Page 13: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/13.jpg)
Experiments and Results
Evaluated over a dataset of4 humans making sushi –UMD-Sushi MakingDataseta.
2 experiments: (1)Unsupervised Clustering(K-Means vs EM) and (2)Supervised Classification.
ahttp://www.umiacs.umd.edu/research/POETICON/umd sushi/
(a) Unsupervised recognition accuracy: nolanguage (K-Means) versus language(EM). (b) Classification accuracy: nolanguage versus language. All reportedresults have variances within ±0.5%.
![Page 14: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/14.jpg)
Future Work
Integrating the framework into arobotic agent – completes theCognitive Dialog.
Better language modelling – usingNER or shallow parsing methods toobtain better PL estimates.
Using Kinect as input to reduceviewpoint dependency.
Using attributes for more robusttool and action detection.
![Page 15: Towards a Watson That Sees: Language-Guided Action ...users.umiacs.umd.edu/~cteo/public-shared/ICRA12_Action...Towards a Watson That Sees: Language-Guided Action Recognition for Robots](https://reader035.fdocuments.in/reader035/viewer/2022071411/61078673b52ee7100f11e2be/html5/thumbnails/15.jpg)
Conclusions
When used properly, language can be exploited to solveimportant problems in vision.
In this talk, used a Cognitive Dialog framework for actionrecognition.
Promising results point to feasibility of using Language ineven more Computer Vision problems.