Newtonian Scene Understanding: Unfolding the Dynamics of...
Transcript of Newtonian Scene Understanding: Unfolding the Dynamics of...
Newtonian Image Understanding:
Unfolding the Dynamics of Objects in Static Images
Roozbeh Mottaghi1 Hessam Bagherinezhad2 Mohammad Rastegari1 Ali Farhadi1,2
1Allen Institute for Artificial Intelligence (AI2)2University of Washington
Abstract
In this paper, we study the challenging problem of pre-
dicting the dynamics of objects in static images. Given a
query object in an image, our goal is to provide a physi-
cal understanding of the object in terms of the forces acting
upon it and its long term motion as response to those forces.
Direct and explicit estimation of the forces and the motion
of objects from a single image is extremely challenging. We
define intermediate physical abstractions called Newtonian
scenarios and introduce Newtonian Neural Network (N3)
that learns to map a single image to a state in a Newto-
nian scenario. Our evaluations show that our method can
reliably predict dynamics of a query object from a single
image. In addition, our approach can provide physical rea-
soning that supports the predicted dynamics in terms of ve-
locity and force vectors. To spur research in this direction
we compiled Visual Newtonian Dynamics (VIND) dataset
that includes more than 6000 videos aligned with Newto-
nian scenarios represented using game engines, and more
than 4500 still images with their ground truth dynamics.
1. Introduction
A key capability in human perception is the ability to
proactively predict what happens next in a scene [4]. Hu-
mans reliably use these predictions for planning their ac-
tions, making everyday decisions, and even correcting vi-
sual interpretations [15]. Examples include predictions in-
volved in passing a busy street, catching a frisbee, or hitting
a tennis ball with a racket. Performing these tasks require
a rich understanding of the dynamics of objects moving in
a scene. For example, hitting a tennis ball with a racket re-
quires knowing the dynamics of the ball, when it hits the
ground, how it bounces back from the ground, and what
form of motion it follows.
Rich physical understanding of human perception even
allows predictions of dynamics on only a single image.
Most people, for example, can reliably predict the dynam-
query
objectV
F
Figure 1. Given a static image, our goal is to infer the dynamics
of a query object (forces that are acting upon the object and the
expected motion of the object as a response to those forces). In
this paper, we show an algorithm that learns to map an image to
a state in a physical abstraction called a Newtonian scenario. Our
method provides a rich physical understanding of an object in an
image that allows prediction of long term motion of the object and
reasoning about the direction of net force and velocity vectors.
ics of the volleyball shown in Figure 1. Theories in per-
ception and cognition attribute this capability, among many
explanations, to previous experience [9] and existence of an
underlying physical abstraction [14].
In this paper, we address the problem of physical un-
derstanding of objects in images in terms of the forces ac-
tioning upon them and their long term motions as their re-
sponses to those forces. Our goal is to unfold the dynamics
of objects in still images. Figure 1 shows an example of a
long term motion predicted by our approach along with the
physical reasoning that supports the predicted dynamics.
Motion of objects and its relations to various physical
quantities (mass, friction, external forces, geometry, etc.)
has been extensively studied in Mechanics. In schools, clas-
sical mechanics is taught using basic Newtonian scenarios
that explain a large number of simple motions in real world:
inclined surfaces, falling, swinging, external forces, projec-
tiles, etc. To infer the dynamics of an object, students need
to figure out the Newtonian scenario that explains the situ-
3521
(1)(2) (3) (4) (5) (6)
(7) (8) (9) (10) (11) (12)
Figure 2. Newtonian Scenarios are defined according to different physical quantities: direction of motion, forces, etc. We use 12 scenarios
that are depicted here. The circle represents the object, and the arrow shows the direction of its motion.
ation, find the physical quantities that contribute to the mo-
tion, and then plug them into the corresponding equations
that relate contributing physical quantities to the motion.
Estimating physical quantities from an image is an ex-
tremely challenging problem. For example, computer vi-
sion literature does not provide a reliable solution to direct
estimation of mass, friction, the angle of an inclined plane,
etc. from an image. Instead of direct estimation of the phys-
ical quantities from images, we formulate the problem of
physical understanding as a mapping from an image to a
physical abstraction. We follow the same principles of clas-
sical Mechanics and use Newtonian scenarios as our phys-
ical abstraction. These scenarios are depicted in Figure 2.
We chose to learn this mapping in the visual space and thus
render the Newtonian scenarios using game engines.
Mapping a single image to a state in a Newtonian sce-
nario allows us to borrow the rich Newtonian interpretation
offered by game engines. This enables predicting the long
term motion of the object along with rich physical reasoning
that supports the predicted motion in terms of velocity and
force vectors1. Learning such a mapping requires reason-
ing about subtle visual and contextual cues, and common
knowledge of motion. For example, to predict the expected
motion of the ball in Figure 1 one needs to rely on previous
experience, visual cues (subtle hand posture of the player on
the net, the line of sight of other players, their pose, scene
configuration), and the knowledge about how objects move
in a volleyball scene. To perform this mapping, we adopt a
data driven approach and introduce Newtonian Neural Net-
works (N3) that learns the complex interplay between vi-
sual cues and motions of objects.
To facilitate research in this challenging direction, we
compiled VIND, VIsual Newtonian Dynamics dataset, that
contains 6806 videos, with the corresponding game engine
videos for training and 4516 still images with the predicted
motions for testing.
Our experimental evaluations show promising results in
Newtonian understanding of objects in images and enable
1Throughout this paper we refer to force and velocity vector as normal-
ized unit vectors that show the direction of force or velocity.
prediction of long-term motions of objects backed by ab-
stract Newtonian explanations of the predicted dynamics.
This allows us to unfold the dynamics of moving objects in
static images. Our experimental evaluations also show the
benefits of using an intermediate physical abstraction com-
pared to competitive baselines that make direct predictions
of the motion.
2. Related Work
Cognitive studies: Recent studies in computational cog-
nitive science show that humans approximate the principles
of Newtonian dynamics and simulate the future states of the
world using these principles [14, 5]. Our use of Newtonian
scenarios as an intermediate representation is inspired by
these studies.
Motion prediction: The problem of predicting future
movements and trajectories has been tackled from different
perspectives. Data-driven approaches have been proposed
in [37, 25] to predict motion field in a single image. Fu-
ture trajectories of people are inferred in [19]. [33] pro-
posed to infer the most likely path for objects. In contrast,
our method focuses on the physics of the motion and esti-
mates a 3D long-term motion for objects. There are recent
methods that address prediction of optical flow in static im-
ages [28, 34]. Flow does not carry semantics and represents
very short-term motions in 2D whereas our method can in-
fer long term 3D motions using force and velocity informa-
tion. Physic-based human motion modeling was studied by
[8, 6, 7, 32]. They employed human movement dynamics to
predict future pose of humans. In contrast, we estimate the
dynamics of objects.
Scene understanding: Reasoning about the stability of
a scene has been addressed in [18] that use physical con-
straints to reason about the stability of objects that are mod-
eled by 3D volumes. Our work is different in that we rea-
son about the dynamics of stable and moving objects. The
approach of [38] computes the probability that an object
falls based on inferring disturbances caused naturally or by
human actions. In contrast, we do not explicitly encode
physics equations and we rely on images and direct percep-
3522
Figure 3. Viewpoint annotation. We ask the annotators to choose the game engine video (among 8 different views of the Newtonian
scenario) that best describes the view of the object in the image. The object in the game engine video is shown in red, and its direction of
movement is shown in yellow. The video with a green border is the selected viewpoint. These videos correspond to Newtonian scenario
(1).
tion. The early work of Mann et al. [26] studies the percep-
tion of scene dynamics to interpret image sequences. Their
method, unlike ours, requires complete geometric specifica-
tion of the scene. A rich set of experiments are performed
by [35] on sliding motion in the lab settings to estimate ob-
ject mass and friction coefficients. Our method is not lim-
ited to sliding and works on a wide range of physical sce-
narios in various types of scenes.
Action Recognition: Early prediction of activities has
been discussed in [29, 27, 16, 23]. Our work is quite differ-
ent since we estimate long-term motions as opposed to the
class of actions.
Human object interaction: Prediction of human action
based on object interactions has been studied in [20]. Pre-
diction of the behavior of humans based on functional ob-
jects in a scene has been explored in [36]. Relative motion
of objects in a scene are inferred in [13]. Our work is related
to this line of thought in terms of predicting future events
from still images. But our objective is quite different. We
do not predict the next action, we care about understanding
the underlying physics that justifies future motions in still
images.
Tracking: Note that our approach is quite different from
tracking [17, 11, 10] since tracking methods are not destined
for single image reasoning. [32] incorporates simulations to
properly model human motion and prevent physically im-
possible hypotheses during tracking.
3. Problem Statement & Overview
Given a static image, our goal is to reason about the
expected long-term motion of a query object in 3D. To
this end, we use an intermediate physical abstraction called
Newtonian scenarios (Figure 2) rendered by a game engine.
We learn a mapping from a single image to a state in a New-
tonian scenario by our proposed Newtonian Neural Net-
work (N3). A state in a Newtonian scenario corresponds
to a specific moment in the video generated by the game
engine and includes a set of rich physical quantities (force,
velocity, 3D motion) for that moment. Mapping to a state in
a Newtonian scenario allows us to borrow the correspond-
ing physical quantities and use them to make predictions
about the long term motion of the query object in a single
image.
Mapping from a single image to a state in a Newtonian
scenario involves solving two problems: (a) figuring out
which Newtonian scenario explains the dynamics of the im-
age best; (b) finding the correct moment in the scenario that
matches the state of the object in motion. There are strong
contextual and visual cues that can help to solve the first
problem. However, the second problem involves reasoning
about subtle visual cues and is even hard for human anno-
tators. For example, to predict the expected motion and the
current state of the ball in Figure 1 one needs to reason from
previous experiences, visual cues, and knowledge about the
motion of the object. N3 adopts a data driven approach
to use visual cues and the abstract knowledge of motion to
learn (a) and (b) at the same time. To encode the visual
cues N3 uses 2D Convolutional Neural Networks (CNN) to
represent the image. To learn about motions N3 uses 3D
CNNs to represent game engine videos of Newtonian sce-
narios. By joint embedding N3 learns to map visual cues to
exact states in Newtonian scenarios.
4. VIND Dataset
We collect VIsual Newtonian Dynamics (VIND) dataset,
which contains game engine videos, natural videos and
static images corresponding to the Newtonian scenarios.
The Newtonian scenarios that we consider are inspired by
the way Mechanics is taught in school and cover commonly
seen simple motions of objects (Figure 2). Few factors dis-
tinguish these scenarios from each other: (a) the path of the
object, e.g. scenario (3) describes a projectile motion, while
scenario (4) describes a linear motion, (b) whether the ap-
plied force is continuous or not, e.g., in scenario (8), the
external force is continuously applied, while in scenario (4)
the force is applied only in the beginning. (c) whether the
object has contact with a support surface or not, e.g., this is
the factor that distinguishes scenario (10) from scenario (4).
Newtonian Scenarios: Representing a Newtonian scenario
by a natural video is not ideal due to the noise caused by
3523
4x2
27x2
27
96x5
5x5
5
256x2
7x2
7
384x1
3x1
3
384x1
3x1
3
256x1
3x1
3
4096x1 4096x1
11x11 5x5 3x3 3x3 3x3
Strid
e:4
ReL
U
Max
Poolin
g:2x2
ReL
U
Max
Poolin
g:2x2
Den
se: 2
56 x
13
x 13
Res
hape
Max
Poolin
g:2x2
ReL
U
ReL
U
ReL
U
10x1
0x2
56x2
56
3x3x3 3x3x3 3x3x3 3x3x3 3x3x3 3x3x3
10x6
4x2
56x2
56
66 GE Videos
66 GE Videos
C x
F x
H x
W
C x
H x
W
10x6
4x1
28x1
28
10x6
4x6
4x6
4
10x6
4x3
2x3
2
10x6
4x1
6x1
6
10x6
4x8
x8
10
x 4
09
6
10
x 4
09
6
Max
Poolin
g:1x2
x2
ReL
U
Max
Poolin
g:1x2
x2
ReL
U
Max
Poolin
g:1x2
x2
ReL
U Max
Poolin
g:1x2
x2
ReL
U M
axPoolin
g:1x2
x2
ReL
U
Max
Poolin
g:1x2
x2
ReL
U
1x10
1x10
Cosine Similarity
SoftMax
SoftMax 66 x 1
(1)
(8)
(2)
(4)
1x10 SoftMax
1x10 SoftMax
66 x 1
den
se
den
se
Den
se: 6
4 x
8 x
8
Bat
ch-N
ormal
izat
ion
Res
hape
66 x 1
Figure 4. Newtonian Neural Network (N3): This figure illustrates a schematic view of our proposed neural network model. The first
row (referred to as image row), processes the static image augmented by an extra channel that shows the localization of the query object
with a Gaussian-smoothed binary mask. Image row has the same architecture as AlexNet [21] for image classification. The larger cubes
in the row indicate the convolutional outputs. The dimensions for convolutional outputs are Channels, Height, Width. The smaller cubes
inside them indicate 2D convolutional filters, which are convolved across Width and Height. The second row (referred to as motion row),
processes the video inputs from game engine. This row has similar architecture to C3D [31]. The dimensions for convolutional outputs
in this row are Channels, Frames, Height, Width. The filters in the motion row are convolved across Frames, Width and Height. These
two rows meet by a cosine similarity layer that measures the similarities between the input image and each frame in the game engine
videos. The maximum value of these similarities, in each Newtonian scenario is used as the confidence score for that scenario describing
the motion of the object in the input image.
camera motion, object clutter, irrelevant visual nuisances,
etc. To abstract away the Newtonian dynamics from noise
and clutter in real world, we construct the Newtonian sce-
narios (shown in Figure 2) using a game engine. A game
engine takes a scene configuration as input (e.g. a ball above
the ground plane) and simulates it forward in time accord-
ing to laws of motion in physics. For each Newtonian sce-
nario, we render its corresponding game engine scenario
from different viewpoints. In total, we obtain 66 game en-
gine videos. For each game engine video, we store its depth
map, surface normals and optical flow information in ad-
dition to the RGB image. In total each frame in the game
engine video has 10 channels.
Images and Videos: We also collect a dataset of natural
videos and images depicting moving objects. The current
datasets for action or object recognition are not suitable for
our task as they either show complicated movements that go
beyond classical dynamics (e.g. head massage or make up
in UCF-101 [30], HMDB-51 [22]) or they show no motion
(most images in PASCAL [12] or COCO [24]).
Annotations. We provide three types of annotations for
each image/frame: (1) bounding box annotations for the
objects that are described by at least one of our Newtonian
scenarios, (2) viewpoint information i.e. which viewpoint of
the game engine videos best describes the direction of the
movements in the image/video, (3) state annotations. By
state, we mean how far the object has moved on the ex-
pected scenario (e.g. is it at the beginning of the projectile
motion? or is it at the peak point?). More details about
the collection of the dataset and the annotation procedure
can be found in Section 6 and the supplementary material.
Example game engine videos corresponding to Newtonian
scenario (1) are shown in Figure 3.
5. Newtonian Neural Network
N3 is shaped by two parallel convolutional neural net-
works (CNNs); one to encode visual cues and another to
represent Newtonian motions. The input to N3 is a static
image with four channels (RGBM; where M is the object
mask channel that specifies the location of the query object
by a bounding-box mask smoothed with a Gaussian ker-
nel) and 66 videos of Newtonian scenarios2(as described in
Section 4) where each video has 10 frames (equally-spaced
frames sampled from the entire video) and each frame has
10 channels (RGB, flow, depth, and surface normal). The
output of N3 is a 66 dimensional vector where each dimen-
sion shows the confidence of the input image being assigned
2From now on, we refer to the game engine videos rendered for New-
tonian scenarios as Newtonian scenarios.
3524
to a viewpoint of a Newtonian scenario. N3 learns the map-
ping by enforcing similarities between the vector represen-
tations of static images and that of video frames correspond-
ing to Newtonian scenarios. The state prediction is achieved
by finding the most similar frame to the static image in the
Newtonian space.
Figure 4 depicts a schematic illustration of N3. The
first row resembles the standard CNN architecture for im-
age classification introduced by [21]. We refer to this row
as image row. Image row has five 2D CONV layers (convo-
lutional layers) and two FC layers (fully connected layers).
The second row is a volumetric convolutional neural net-
work inspired by [31]. We refer to this row as motion row.
Motion row has six 3D CONV layers and one FC. The input
to the motion row is a batch of 66 videos (corresponding to
66 Newtonian scenarios rendered by game engines). The
motion row generates a 4096x10 matrix as output for each
video, where a column in this matrix can be seen as a de-
scriptor for a frame in the video. To preserve the same num-
ber of frames in the output, we eliminate MaxPooling over
the temporal dimension for all CONV layers in the motion
row. The two rows are joined by a matching layer that uses
cosine similarity as a matching measure. The input to the
image row is an RGBM image and the output is a 4096 di-
mensional vector (values after FC7 layer). This vector can
be seen as a visual descriptor for the input image.
The matching layer takes the output of the image row and
the output of the motion row as input and computes the co-
sine similarity between the image descriptors and all of the
10 frames’ descriptors in each video in the batch. Therefore,
the output of matching layer are 66 vectors where each vec-
tor has 10 dimensions. The dimension with maximum sim-
ilarity value indicates the state of dynamics for each New-
tonian scenario. For example, if the third dimension has the
maximum value, it means, the input image has maximum
similarity with the third frame of the game engine video,
thus it must have the same state as that of the third frame
in the corresponding game engine video. SoftMax layers
are appended after the cosine similarity layer to pick the
maximum similarity as a confidence score for each Newto-
nian scenario. This enables N3 to learn the state prediction
without any state level annotations. This is an advantage
for N3 that can implicitly learn the state of the motion by
directly optimizing for the prediction of Newtonian scenar-
ios. These confidence scores are linearly combined with the
confidence scores from the image row to produce the final
scores. This linear combination is controlled by a parame-
ter λ ∈ [0, 1] that weights the effect of motion for the final
score.
Training: In order to train N3, we feed the input by
picking a batch of random images from the training set and
a batch of game engine videos that cover all Newtonian sce-
narios (66 videos). Each iteration involves a forward and
a backward pass through the network. We use cross en-
tropy as our loss function:E = − 1
n
Pn
i=1[pi log pi + (1 −
pi) log (1− pi)], where pi is the ground truth label of the
input image (the value is 1 for the ground truth class and 0
otherwise) and pi is the predicted probability obtained by
taking SoftMax over the output of N3. In each iteration,
we feed a random batch of images to the network, but a
fixed batch of videos across all iterations. This enables N3
to penalize the error over all of the Newtonian scenarios at
each iteration. The other option could be passing a pair of
a random image and a game engine video, then predicting
a binary output showing whether the image corresponds to
the Newtonian scenario or not. This requires a lot more
iterations to see all the possible positive and negative pair-
ings for an image and has shown to be less effective for our
problem.
Testing: At test time, the 4096x10 descriptors for ab-
stract motions can be pre-computed from the motion row of
N3 after CONV6 layer. For each test, we only feed a single
RGBM image as input and obtain the underlying Newtonian
scenario h and its matching state sh. The predicted scenario
(h) is the scenario with maximum confidence in the output.
The matching state sh is achieved by
sh = argmaxi
{Sim(x, vih)} (1)
where x is the 4096x1 image descriptor, vih is the
4096x10 video descriptor for Newtonian scenario h and
i ∈ {1, 2, .., 10} indicates the frame index in the video.
Sim(., .) is the standard cosine similarity between two vec-
tors. Given h and sh, a long-term 3D motion path can be
drawn for the query object by borrowing the game engine
parameters (e.g. direction of velocity and force, 3D motion,
and camera view point) from the state sh of Newtonian sce-
nario h.
6. Experiments
We compare our method with a number of baselines in
predicting the motion of a query object in an image and
provide an ablation study that examines the utility of differ-
ent components in our method. We further show qualitative
results for motion prediction and estimation of force and
velocity directions. We also show the benefits of estimat-
ing optical flow from our long term motions predicted by
our method. Additionally, we show the generalization to
unseen scene types.
6.1. Settings
Network: We implemented our proposed neural net-
work N3 in Torch [2]. We use a machine with a 3.5GHz
Intel Xeon CPU and GeForce TITAN X GPU to train
and test our model. To train N3, we initialized the im-
3525
(a)(b)
(e)
(c)
(d)(f)
(g) (h) (i)
Figure 5. The expected motion of the object in the static image is shown in orange. We have visualized the 3D motion of the object (red
sphere) and its superposition on the image (left image). We also show failure cases in the red box, where the red and green curves represent
our prediction and ground truth, respectively.
age row (refer to Figure 4) by a publicly available 3 pre-
trained CNN model. We initialize the fourth channel (M)
by random values drawn from a Gaussian distribution (µ =0,σ = 10
filter size). The motion row was initialized ran-
domly, where the random parameters came from a Gaus-
sian distribution (µ = 0,σ = 10
filter size). For training, we
use batches of 128 input images in the image row and 66
videos in the motion row. We run the forward and backward
passes for 5000 iterations4. We started by the learning rate
of 10−1 and gradually decreased it down to 10−4. In order
to prevent the numerical instability of the cosine similarity
function, we use the smooth version of cosine similarity,
which is defined as: S(x, y) = x.y|x||y|+✏
, where ✏ = 10−5.
Dataset details: We use Blender [1] game engine to ren-
der the game engine videos corresponding to the 12 Newto-
nian scenarios. We factor out the effect of force magnitude
and camera distance.
The Newtonian scenarios are rendered from 8 different
azimuth angles. Scenarios 6, 7, and 11 in Figure 2 are
symmetric across different azimuth angles and we there-
fore render them from 3 different elevations of the cam-
era. The Newtonian scenarios 2 and 12 are the same across
viewpoints with 180◦ azimuth difference. We consider four
views for those scenarios. For stability (scenario (5)), we
consider only 1 viewpoint (there is no motion). In total, we
obtain 66 videos for all 12 Newtonian scenarios.
Our new dataset (VIND) contains more than 6000 video
clips in natural scenes. These videos contain more than
200,000 frames in total. For training, we use frames
randomly sampled from these video clips. To train our
3https://github.com/BVLC/caffe/tree/master/models/bvlc alexnet4 In our experiments the loss values start converging after 5K iterations.
model, we use bounding box information of query objects
and viewpoint annotations for the corresponding Newtonian
scenario (the procedure for viewpoint annotations is shown
in Figure 3).
The image portion of our dataset includes 4516 images
that are divided into 1458 and 3058 images for validation
and testing, respectively. We tune our parameters using the
validation set and report our results on the test subset. For
evaluation, each image has bounding box, viewpoint and
state annotations. The details of the annotation and collec-
tion process is described in the supplementary material.
6.2. Estimating the motion of query objects
Given a single image and a query object, we evaluate
how well our method can estimate the motion of the object.
We compare the resulting 3D curves from our method with
that of the ground truth.
Evaluation Metric. We use an evaluation metric which
is similar to the F-measure used for comparing contours
(e.g. [3]). The 3D curve of groundtruth and the estimated
motion are in XY Z space. However, the two curves do
not necessarily have the same length. We slide the shorter
curve over the longer curve to find an alignment with the
minimum distance. We then compute precision and recall
by thresholding the distance between corresponding points
on the curves.
We also report results using the Modified Hausdorff Dis-
tance (MHD), however the F-measure is more interpretable
since it is a number between 0 and 100.
Baselines. A set of comparisons with a number of base-
lines are presented in Table 1. The first baseline, called Di-
rect Regression, is a direct regression from images to the
trajectories in the 3D space (groundtruth curves are rep-
3526
(1) (2) (3) (4)(5) (6) (7) (8) (9) (10) (11) (12)Avg.
Direct Regression 32.7 59.9 12.4 16.1 84.6 48.8 8.2 20.2 1.6 13.8 49.0 16.4 30.31
Direct Regression - Nearest 52.7 38.4 17.3 23.5 64.9 69.2 18.1 36.2 3.2 20.4 76.5 24.2 37.05
N3(ours) 60.8 64.7 39.4 37.6 95.4 54.1 50.3 76.9 9.4 38.1 72.1 72.4 55.96
Table 1. Estimation of the motion of the objects in 3D. F-measure is used as the evaluation metric.
resented by B-splines with 1200 knots). For this base-
line, we modify AlexNet architecture to regress each im-
age to its corresponding 3D curve. More specifically, we
replace the classification loss layer with a Mean Squared
Error (MSE) loss layer. Table 1 shows that N3 signif-
icantly outperforms this baseline that aims at directly re-
gressing the motion from visual data. We postulate that this
is mainly due to the dimensionality of the output and the
complex interplay between subtle visual cues and the 3D
motion of objects. To further probe that if the direct regres-
sion can even roughly estimate the shape of the trajectory
we build an even stronger baseline. For this new baseline,
called Direct Regression-Nearest, we use the output of the
direct regression baseline above to find the most similar 3D
curve among Newtonian scenarios (based on normalized
Euclidean distance between the B-spline representations).
Table 1 shows that N3 also outperforms this competitive
baseline. In terms of the MHD metric, N3 also outperforms
the baselines (5.59 versus 5.97 and 7.32 for the baseline
methods; lower is better).
Figure 5 shows qualitative results in estimating the ex-
pected motion of the object in still images. When N3 pre-
dicts a 3D curve for an image it also estimates the view-
point. This allows us to project the 3D curve back onto the
image. Figure 5 shows examples of these estimated mo-
tions. For example, N3 correctly predicts the motion of the
football thrown (Figure 5(f)), and estimates the right mo-
tion for the ping pong ball falling (Figure 5(e)). Note that
N3 cannot reason about possible future collisions with other
elements in the scene. For example Figure 5(a) shows a pre-
dicted motion that goes through soccer players. This figure
also shows some examples of failures. The mistake in Fig-
ure 5(h) can be attributed to the large distance between the
player and the basketball. Note that when we project 3D
curves to images we need to make assumptions about the
distance to the camera and the 2D projected curves might
have inconsistent scales.
Ablation studies. To study our method in further details,
we test two variations of our method. In the first variation, λ
(defined in Section 5) is set to 1, which means that we are ig-
noring the motion row in the network. We refer to this vari-
ation as N3 −NV in Table 2. N3 outperforms N3 −NV ,
indicating that the motion abstraction is an important factor
in N3. To study the effectiveness of N3 in state prediction,
in the second variation, we measure the utility of providing
state supervision for training N3. We modified the output
Ablations N3 −NV N3 N3 + SS
F-measure 52.67 55.96 56.10
Table 2. Ablation study of 3D motion estimation. The average
across 12 Newtonian scenarios is reported.
layer of N3 to learn the exact state of the motion from the
groundtruth augmented by state level annotations. This case
is referred to as N3+SS in Table 2. The small gap between
the results in N3 and N3 + SS shows that N3 can reliably
predict the correct state without state supervision. The pro-
cedure of annotating states (i.e. specifying which frame of
the game engine video corresponds to the state of the object
in the image) is explained in the supplementary material.
Another ablation is to study the effectiveness of N3
in classifying images into 66 classes corresponding to 12
Newtonian scenarios rendered from different viewpoints.
In this ablation, shown in Table 3, we compare N3 to
N3 − NV with and without state supervision (SS) in a
classification setting (not prediction of the motion). Also,
our experiments show that N3 and N3 − NV make dif-
ferent types of mistakes since fusing these variations in an
optimal way (by an oracle) results in an improvement in
classification (25.87).
Ablations N3−NV N
3−NV + SS N
3N
3+ SS
Avg. Accuracy 20.37 19.32 21.71 21.94
Table 3. Estimation of Newtonian scenario and viewpoint (no state
estimation).
Short-term flow estimation. Our method is designed to
predict long-term motions in 3D, yet it can estimate short
term motions by projecting the long term 3D motion onto
the image. We compare the effectiveness of N3 in esti-
mating the flow with the state of the art methods explic-
itly trained to predict short-term flow from a single im-
age. In particular, we compare with the recent method of
Predictive-CNN [34]. For each query object, we average
the dense flow predicted by [34] over the pixels in the ob-
ject box and obtain a single flow vector. The evaluation
metric is angular error (we do not compute flow magnitude).
As shown in Table 4, our method outperforms [34] on our
dataset.
Method Angular Err.
Predictive-CNN [34] 1.53
N3 (ours) 1.29Table 4. Short-term flow prediction in a single image. The evalua-
tion metric is angular error.
3527
Figure 6. Visualization of the direction of net force and object velocity. The velocity is shown in green and the net force is shown in
magenta. The corresponding Newtonian scenario is shown above each image.
Method F-measure
Direct Regression 25.76
N3 (ours) 36.40Table 5. Generalization to unseen scene types.
Force and velocity estimation. It is interesting to see
that N3 can predict the direction of the net force and ve-
locity in a static image for a query object! Figure 6 shows
qualitative examples. For example, it is exciting to show
that N3 can predict the friction in the bowling example, and
the gravity in the basketball example. The net force applied
to the chair in the bottom row (left) is zero since the normal
force from the floor cancels the gravity.
Generalization to unseen scene types. We also evalu-
ate how well our model generalizes to unseen scene types.
We remove all images that represent the same scene type
(e.g., all images that show a billiard scene in scenario (4))
from our training data and test how well we can estimate
the motion of the object in images that show those scene
types. Our method outperforms the baseline method (Ta-
ble 5). The reported result is the average over 12 Newto-
nian scenarios, where we remove one scene type from each
Newtonian scenario during training. The complete list of
the removed scene types is available in the supp. material.
7. ConclusionsIn this paper we address the challenging problem of
Newtonian understanding of objects in static images. Nu-
merous physical quantities contribute to shaping the dy-
namics of objects in a scene. Direct estimation of those
quantities is extremely challenging. In this paper, we as-
sume intermediate physical abstractions, Newtonian scenar-
ios and introduce a model that can map from a single image
to a state in a Newtonian scenario. This mapping needs to
learn subtle visual and contextual cues to be able to rea-
son about the correct Newtonian scenario, state, viewpoint,
etc. Rich physical predictions about the dynamics of ob-
jects in an images can then be made by borrowing informa-
tion through the established correspondences to Newtonian
scenarios. This allows us to predict the motion and reason
about it in terms of velocity and force directions for a query
object in a still image.
Our current solution can only reason about simple mo-
tions of rigid bodies and cannot handle complex and com-
pound motions, specially when it is affected by other exter-
nal elements in the scene (e.g. the motion of thrown ball
would change if there is a wall in front of it in the scene).
In addition, our method does not provide estimates for mag-
nitude of the force and velocity vectors. We postulate that
there might be very subtle visual cues that can contribute
tho those estimates.
Rich physical understanding of images is an important
building block towards deeper understanding of images, en-
ables visual reasoning, and opens several new and excit-
ing research directions in scene understanding. Reasoning
about how objects move in an image is tightly coupled with
semantic and geometric scene understanding. Explicit joint
reasoning about these interactions is an exciting research
direction.
Acknowledgments: This research was partially supported
by ONR N00014-13-1-0720, NSF IIS-1338054, and Allen
Distinguished Investigator Award.
3528
References
[1] Blender. http://www.blender.org/. 6
[2] Torch7. http://torch.ch. 5
[3] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour
detection and hierarchical image segmentation. PAMI, 2011.
6
[4] M. Bar. The proactive brain: memory for predictions. Royal
Society of London. Series B, Biological sciences, 2009. 1
[5] P. Battaglia, J. Hamrick, and J. B. Tenenbaum. Simulation as
an engine of physical scene understanding. PNAS, 2013. 2
[6] M. A. Brubaker and D. J. Fleet. The kneed walker for human
pose tracking. In CVPR, 2008. 2
[7] M. A. Brubaker, D. J. Fleet, and A. Hertzmann. Physics-
based person tracking using simplified lower-body dynam-
ics. In CVPR, 2007. 2
[8] M. A. Brubaker, L. Sigal, and D. J. Fleet. Estimating contact
dynamics. In ICCV, 2009. 2
[9] O. Cheung and M. Bar. Visual prediction and perceptual
expertise. Intl. J. of Psychophysiology, 2012. 1
[10] R. Collins, Y. Liu, and M. Leordeanu. On-line selection of
discriminative tracking features. PAMI, 2005. 3
[11] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object
tracking. PAMI, 2003. 3
[12] M. Everingham, L. Gool, C. K. Williams, J. Winn, and
A. Zisserman. The pascal visual object classes (voc) chal-
lenge. IJCV, 2010. 4
[13] D. F. Fouhey and C. Zitnick. Predicting object dynamics in
scenes. In CVPR, 2014. 3
[14] J. Hamrick, P. Battaglia, and J. B. Tenenbaum. Internal
physics models guide probabilistic judgments about object
dynamics. Annual Meeting of the Cognitive Science Societ,
2011. 1, 2
[15] J. Hawkins and S. Blakeslee. On Intelligence. Times Books,
2004. 1
[16] M. Hoai and F. De la Torre. Max-margin early event detec-
tors. In CVPR, 2012. 3
[17] M. Isard and A. Blake. Condensation conditional density
propagation for visual tracking. IJCV, 1998. 3
[18] Z. Jia, A. Gallagher, A. Saxena, and T. Chen. 3d-based rea-
soning with blocks, support, and stability. In CVPR, 2013.
2
[19] K. M. Kitani, B. D. Ziebart, J. A. D. Bagnell, and M. Hebert.
Activity forecasting. In ECCV, 2012. 2
[20] H. Koppula and A. Saxena. Anticipating human activities
using object affordances for reactive robotic response. In
RSS, 2013. 3
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In
NIPS, 2012. 4, 5
[22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre.
Hmdb: a large video database for human motion recognition.
In ICCV, 2011. 4
[23] T. Lan, T. Chen, and S. Savarese. A hierarchical representa-
tion for future action prediction. In ECCV, 2014. 3
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-
manan, P. Dollr, and C. L. Zitnick. Microsoft coco: Common
objects in context. In ECCV, 2014. 4
[25] C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense corre-
spondence across scenes and its applications. PAMI, 2011.
2
[26] R. Mann, A. Jepson, and J. Siskind. The computational per-
ception of scene dynamics. CVIU, 1997. 3
[27] M. Pei, Y. Jia, and S.-C. Zhu. Parsing video events with goal
inference and intent prediction. In ICCV, 2011. 3
[28] S. L. Pintea, J. C. van Gemert, and A. W. M. Smeulders. Deja
vu: - motion prediction in static images. In ECCV, 2014. 2
[29] M. S. Ryoo. Human activity prediction: Early recognition of
ongoing activities from streaming videos. In ICCV, 2011. 3
[30] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of
101 human action classes from videos in the wild. Technical
Report CRCV-TR-12-01, 2012. 4
[31] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri.
Learning spatiotemporal features with 3d convolutional net-
works. In ICCV, 2015. 4, 5
[32] M. Vondrak, L. Sigal, and O. C. Jenkins. Physical simulation
for probabilistic motion tracking. In CVPR, 2008. 2, 3
[33] J. Walker, A. Gupta, and M. Hebert. Patch to the future:
Unsupervised visual prediction. In CVPR, 2014. 2
[34] J. Walker, A. Gupta, and M. Hebert. Dense optical flow pre-
diction from a static image. In ICCV, 2015. 2, 7
[35] J. Wu, I. Yildirim, J. J. Lim, W. T. Freeman, and J. B. Tenen-
baum. Galileo: Perceiving physical object properties by inte-
grating a physics engine with deep learning. In NIPS, 2015.
3
[36] D. Xie, S. Todorovic, and S.-C. Zhu. Inferring dark matter
and dark energy from videos. In ICCV, 2013. 3
[37] J. Yuen and A. Torralba. A data-driven approach for event
prediction. In ECCV, 2010. 2
[38] B. Zheng, Y. Zhao, J. C. Yu, K. Ikeuchi, and S.-C. Zhu. De-
tecting potential falling objects by inferring human action
and natural disturbance. In ICRA, 2014. 2
3529