
Understanding Video and Text by Joint Spatial, Temporal, and Causal Inference

Song-Chun Zhu

Center for Vision, Cognition, Learning and Arts, University of California, Los Angeles

Stanford Workshop on AI and Knowledge, April 16, 2014

Question in a 5th Grade Test

Answering it requires joint reasoning using Vision + Language + Cognition (physics, causality).

1, Understanding Scene by Joint Spatial, Temporal, Causal and Text Parsing

Click this youtube video to watch the demo https://www.youtube.com/watch?v=FmFK52WwSQg#t=89

Joint video-text parsing

Ref: K. Tu et al., Joint Video and Text Parsing for Understanding Events and Answering Queries, IEEE Multimedia, 2014. [pdf]


2, Answering User Queries on What, Who, Where, When and Why

We convert the joint parse graph into RDF format and feed it into a query engine.

Click this youtube video to watch the demo: https://www.youtube.com/watch?v=FnbYODNEgM8
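The slides do not show the query mechanics, so here is a minimal sketch of the idea, assuming the joint parse graph has been exported as RDF triples: who/where/when style questions can then be answered with SPARQL. The ex: vocabulary (ex:agent, ex:location, etc.) and the use of rdflib are illustrative assumptions, not the system's actual schema or engine.

```python
# Illustrative only: a tiny RDF export of a joint parse graph and a SPARQL
# "who / where / when" query. The ex: vocabulary is hypothetical.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/parse#")
g = Graph()
g.bind("ex", EX)

# One event node from a made-up joint parse graph: "person1 picks up a cup".
g.add((EX.event1, EX.actionType, EX.PickUp))
g.add((EX.event1, EX.agent, EX.person1))
g.add((EX.event1, EX.patient, EX.cup3))
g.add((EX.event1, EX.location, Literal("conference room")))
g.add((EX.event1, EX.time, Literal("15:52:10")))

query = """
PREFIX ex: <http://example.org/parse#>
SELECT ?who ?where ?when WHERE {
    ?e ex:actionType ex:PickUp ;
       ex:agent ?who ;
       ex:location ?where ;
       ex:time ?when .
}
"""
for row in g.query(query):
    print(row.who, row.where, row.when)   # who picked something up, where, and when
```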

3, A Restricted Turing Test on Understanding Object, Scene & Event

3 Areas, 30+ cameras (ground, tower, mobile), 3,000,000 frames (1 TB).

Ontology: objects, attributes, scenes, actions, group activities, spatial-temporal relations.

For example:

Location: Conference Room

Time: 15:47:00 - 16:19:00 [32 minute duration] [Video and Question are prepared by SIG]

Q: Is there at least one chair in the conference room that no one ever sits in?

Q: Is there a person putting food into the mouth?

Q: Are the upper and lower legs of a person in a white shirt occluded from the camera's view by a table? …

Agent = Object + Attributes/Fluents + Actions + Mind/Intents
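As a minimal sketch of this decomposition (not the system's actual data structures), an agent record could bundle the four components; all field names and values below are hypothetical.

```python
# Hypothetical container mirroring the decomposition above:
# Agent = Object + Attributes/Fluents + Actions + Mind/Intents.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Agent:
    object_id: str                                              # grounding to a tracked object
    attributes: Dict[str, str] = field(default_factory=dict)    # static, e.g. {"shirt": "white"}
    fluents: List[Tuple[str, bool, float]] = field(default_factory=list)  # (name, value, time)
    actions: List[str] = field(default_factory=list)            # detected actions over time
    intents: List[str] = field(default_factory=list)            # inferred goals / plans

a = Agent("person_07", attributes={"shirt": "white"})
a.actions.append("sit_down")
a.fluents.append(("seated", True, 15.3))
```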

4, Knowledge representation: the Spatial, Temporal, Causal And-Or Graph (STC-AOG)

Type 1: Objects

[Figure: an OBJECT node, with Parent(s) (Is_Part_Of), Children (Has_Parts), Fluents (f1, f2), Attributes, Actions, and Or-Nodes; it terminates (grounding) in a hybrid image template.]

Compositional hierarchy: the Spatial And-Or Graph (S-AOG).

Refs:

S.C. Zhu and D. Mumford, A Stochastic Grammar of Images, 2006. [pdf]

Z.Z. Si and S.C. Zhu, Learning And-Or Templates for Object Modeling and Recognition, PAMI, 2013. [pdf]
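To make the And-Or structure concrete, here is a toy sketch (not the cited papers' implementation): And-nodes decompose a concept into all of its parts, Or-nodes choose one alternative, and terminal nodes ground out, e.g. in an image template. Branching probabilities and the relations between nodes are omitted, and the "chair" fragment is made up.

```python
# Toy And-Or graph: AND = composition of parts, OR = alternative configurations,
# TERMINAL = grounded leaf (e.g. a hybrid image template). Illustrative only.
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class AOGNode:
    name: str
    kind: str                                  # "AND", "OR", or "TERMINAL"
    children: List["AOGNode"] = field(default_factory=list)

def sample_parse(node: AOGNode) -> List[str]:
    """Sample one parse tree from the AOG by expanding AND-nodes fully
    and picking a single branch at each OR-node."""
    if node.kind == "TERMINAL":
        return [node.name]
    if node.kind == "OR":
        return sample_parse(random.choice(node.children))
    parse = [node.name]                        # AND-node: keep every part
    for child in node.children:
        parse.extend(sample_parse(child))
    return parse

# A made-up fragment of an S-AOG for "chair".
leg = AOGNode("leg", "TERMINAL")
back = AOGNode("back", "OR", [AOGNode("straight_back", "TERMINAL"),
                              AOGNode("curved_back", "TERMINAL")])
chair = AOGNode("chair", "AND", [AOGNode("seat", "TERMINAL"), back, leg])
print(sample_parse(chair))   # e.g. ['chair', 'seat', 'curved_back', 'leg']
```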

Type 2: Actions / Events

[Figure: an ACTION node, with Parent(s) (Event), Children (Sub-Actions), and Fluents; it terminates (grounding) in Object and Human Pose.]

Compositional hierarchy: the Temporal And-Or Graph (T-AOG).

Ref: Pei et al., Video Event Parsing and Learning with Goal and Intent Prediction, Computer Vision and Image Understanding, vol. 117, no. 10, pp. 1369-1383, 2013. [pdf]
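The T-AOG adds temporal ordering to the And-Or structure. The toy check below only tests whether an event template's sub-actions occur in order inside a detected action stream; the actual system in Pei et al. uses a stochastic event grammar with probabilistic, online parsing and goal/intent prediction. The event and action names are made up.

```python
# Toy temporal check: does a detected action stream contain an event's
# sub-actions in the required order (other actions may be interleaved)?
from typing import List

MAKE_COFFEE = ["approach_machine", "insert_cup", "press_button", "pick_up_cup"]  # hypothetical

def contains_event(detected: List[str], template: List[str]) -> bool:
    stream = iter(detected)
    # Each template step must be found after the previous one in the stream.
    return all(any(action == step for action in stream) for step in template)

detected = ["enter_room", "approach_machine", "insert_cup",
            "check_phone", "press_button", "pick_up_cup", "leave"]
print(contains_event(detected, MAKE_COFFEE))   # True
```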

Type 3: Fluents

[Figure: a FLUENT node, owned by an Object or a Mind and changed by Actions; it terminates (grounding) in change detection.]

Ref: A. Fire and S.C. Zhu, Using Causal Induction in Humans to Learn and Infer Causality from Video, CogSci, 2013. [pdf]
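A crude stand-in for this idea (not the method of Fire and Zhu): detect when an observed fluent changes value and attribute the change to the action whose time interval is nearest, which is roughly the kind of co-occurrence evidence that causal induction builds on. All timestamps and names below are made up.

```python
# Toy fluent-change attribution: pair each detected fluent change with the
# temporally closest action. Illustrative only; the cited work learns
# cause-effect links from video via causal induction.
from typing import List, Tuple

fluent_obs = [(0.0, "off"), (4.0, "off"), (5.0, "on"), (9.0, "on")]    # (time, value)
actions = [(1.0, 2.0, "walk_to_door"), (4.5, 5.2, "flip_switch")]      # (start, end, name)

def attribute_changes(obs: List[Tuple[float, str]],
                      acts: List[Tuple[float, float, str]]):
    changes = []
    for (t0, v0), (t1, v1) in zip(obs, obs[1:]):
        if v0 != v1:                            # a fluent change was detected
            t_change = (t0 + t1) / 2.0
            cause = min(acts, key=lambda a: abs((a[0] + a[1]) / 2.0 - t_change))
            changes.append((t_change, f"{v0}->{v1}", cause[2]))
    return changes

print(attribute_changes(fluent_obs, actions))   # [(4.5, 'off->on', 'flip_switch')]
```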

Type 4: Intents / Minds

[Figure: a MIND node connects Intents to a Planning Function and Domain Knowledge (the STC-AOG); Perception produces a spatial-temporal parse graph (ST-PG); objects embody the mind, and multiple agents form a society of minds.]

5, Augmenting Visual Knowledge with Commonsense

To achieve a deeper understanding of objects, scenes, and events, one needs to consider many other aspects:

1, Function and affordance: scenes are often defined by the activities they host in space; objects are often defined by how they can be used.

2, Physics: material, center of mass, velocity, force, torque, work, temperature, state (solid, liquid).

3, Intents: the intentions and goals of agents in the scenes and events.

4, Causality: cause-effect relations, laws, and equations.

Example 1: Augmenting Traditional Image Parsing with Functionality

[Figure: a traditional parse tree vs. an augmented parse graph, with augmented object affordances and augmented contextual relations.]

Ref: Y.B. Zhao and S.C. Zhu, NIPS 2011; CVPR 2013. [pdf]

Example 2: Augmenting Traditional Image Parsing with Physics

[Figure: a traditional parse tree (scene → ground, ceiling, walls, foreground objects) vs. an augmented parse graph, with augmented physical properties (material, friction, mass, velocity) and augmented physical relations (supporting, attaching, hanging).]

Ref: B. Zheng et al., CVPR 2013; ICRA 2014. [pdf]

Below is an example that uses physical constraints to help scene parsing, i.e., a valid parse (interpretation) must be physically plausible.

The voxels (captured by a depth sensor) are grouped into geometric solids (parts) and then into objects (segmentation), so as to minimize physical instability and maximize functionality for serving humans.

Ref: B. Zheng et al., CVPR 2013.
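As a one-line caricature of this "physically plausible parse" constraint (the real system minimizes an instability energy over voxel groupings), an object hypothesis can be rejected when its center of mass does not project onto its supporting footprint; the 1-D numbers below are made up.

```python
# Minimal 1-D stability test: an object hypothesis is kept only if its
# center of mass lies above its region of support. Illustrative only;
# Zheng et al. minimize a global instability energy over voxel groupings.
def is_stable(com_x: float, support_min_x: float, support_max_x: float) -> bool:
    return support_min_x <= com_x <= support_max_x

# A segment hypothesized to hang far past its supporting table edge is implausible.
print(is_stable(com_x=0.5, support_min_x=0.0, support_max_x=1.0))   # True  -> keep
print(is_stable(com_x=1.4, support_min_x=0.0, support_max_x=1.0))   # False -> unstable, reject
```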

Video example I: making a tool to reach food.

What is commonsense? In nature, lower animals, like the crow, have astonishing commonsense knowledge which goes far beyond current computer intelligence.

In this process, the crow must understand the scene, know the material properties of the metal stick, make the hook by applying torque, use the hook, and so on; a lot of physics is involved.

Click this youtube video https://www.youtube.com/watch?v=dbwRHIuXqMU

Video example II: cracking nuts using vehicles at a crosswalk.

In this process, the crow demonstrates profound scene understanding capabilities: the dynamics of humans and vehicles, causality, the physical properties of objects, …

https://www.youtube.com/watch?v=BGPGknpq3e0

The crow videos prove that there exists a solution that is:

--- small volume: it could be embedded in your smart phones and wearable devices;

--- low power: < 0.1 Watt (the human brain is upper bounded by ~10 Watts; the crow's brain is ~100 times smaller).

A unifying math foundation for visual knowledge: regimes of representations

[Figure: a spectrum of representation regimes spanning coding, recognition, cognition, and reasoning: Markov/Gibbs fields (high-dimensional manifolds, textures), sparse coding (low-dimensional manifolds, textons), stochastic grammar (spatial, temporal, causal), and probabilistic predicate logic (commonsense, domain knowledge).]