Transcript of "Deep Declarative Networks: A New Hope" (users.cecs.anu.edu.au/~sgould/papers/declarativeNetworks...A...)
Deep Declarative Networks: A New Hope
A/Prof. Stephen Gould
Research School of Computer Science
The Australian National University
2019
slide credit: Dylan Campbell
What did we gain?
• Better-than-human performance on closed-world classification tasks
• Very fast inference (with the help of GPU acceleration), versus very slow iterative optimization procedures
• Common tools and software frameworks for sharing research code
• Robustness to variations in real-world data, if the training set is sufficiently large and diverse
What did we lose?
• Clear mathematical models; separation between algorithm and objective (loss function)
• Theoretical performance guarantees
• Interpretability and robustness to adversarial attacks
• Ability to enforce hard constraints
• Intuition guided by physical models
• Parsimony: capacity is consumed learning what we already know
S Gould | 3
What if we could have the best of both worlds?
S Gould | 4
Deep learning models
• Linear transforms (e.g., convolutions)
• Elementwise non-linear transforms
• Spatial/global pooling
S Gould | 5
Deep learning layer
x: input
y: output
θ: local parameters
f: forward function
J: global error / loss
[Figure: a processing node computes y = f(x; θ) in the forward pass; in the backward pass it receives ∂J/∂y and produces ∂J/∂x and ∂J/∂θ]
S Gould | 6
End-to-end computation graph
[Figure: a computation graph with nodes f1, ..., f8; the input x feeds f1, whose output splits into two branches, f2 → f3 → f4 and f5 → f6 → f7, which are merged by f8 to produce the output y]
y = f8( f4(f3(f2(f1(x)))), f7(f6(f5(f1(x)))) )
S Gould | 7
End-to-end learning
• Learning is about finding parameters that maximize performance, argmax_θ performance(model(θ))
• To do so we need to understand how the model output changes as a function of its input and parameters
• Gradient-based learning incrementally updates parameters based on a signal back-propagated from the output of the network
• This requires calculation of gradients
For a node y = f(x; θ) that receives ∂J/∂y from downstream, the chain rule gives
  ∂J/∂x = (∂J/∂y)(∂y/∂x)   and   ∂J/∂θ = (∂J/∂y)(∂y/∂θ)
S Gould | 8
Example: Back-propagation through a node
Consider the following implementation of a node:
  fwd_fcn(x):
    y_0 = ½ x
    for t = 1, ..., T do
      y_t ← ½ (y_{t−1} + x / y_{t−1})
    return y_T
We can back-propagate gradients as
  ∂y_t/∂y_{t−1} = ½ (1 − x / y_{t−1}²),   ∂y_t/∂x = 1/(2 y_{t−1}) + (∂y_t/∂y_{t−1})(∂y_{t−1}/∂x)
It turns out that this node implements the Babylonian algorithm, which computes y = √x.
As such we can compute its derivative directly as
  ∂y/∂x = 1/(2√x) = 1/(2y)
giving the backward function:
  bck_fcn(x, y):
    return 1/(2y)
The chain rule then gives ∂J/∂x from ∂J/∂y (input) and ∂y/∂x (computed).
S Gould | 9
Separation of forward and backward operations
[Figure: processing node with input x and output y; gradient ∂J/∂y flows in and ∂J/∂x flows out]
forward function:
  y_0 = ½ x
  for t = 1, ..., T do
    y_t ← ½ (y_{t−1} + x / y_{t−1})
  return y_T
backward function:
  return (1/(2y)) ∂J/∂y
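To make this separation concrete, here is a minimal runnable sketch (my own illustration, assuming PyTorch) of the same square-root node as a custom autograd function: the forward pass runs the Babylonian iteration and caches y, while the backward pass uses only 1/(2y).

import torch

class BabylonianSqrt(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, iters=10):
        y = 0.5 * x                      # y_0 = x/2
        for _ in range(iters):
            y = 0.5 * (y + x / y)        # y_t = (y_{t-1} + x/y_{t-1}) / 2
        ctx.save_for_backward(y)         # cache the result for the backward pass
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        # dy/dx = 1/(2y), regardless of how many forward iterations were run
        return grad_output / (2.0 * y), None

x = torch.tensor([4.0], requires_grad=True)
y = BabylonianSqrt.apply(x)
y.backward()
print(y.item(), x.grad.item())           # approximately 2.0 and 0.25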
S Gould | 10
Deep declarative networks (DDNs)
In an imperative node the implementation of the forward processing function f̃ is explicitly defined. The output is then
  y = f̃(x; θ)
where x is the input and θ are the parameters of the node.
In a declarative node the input-output relationship is specified as the solution to an optimization problem
  y ∈ argmin_{u ∈ C} f(x, u; θ)
where f is the objective and C is the constraint set.
[Gould et al., 2019]
[Figure: an imperative node computing y = f̃(x; θ) and a declarative node computing y = argmin_{u ∈ C} f(x, u; θ); both map x to y in the forward pass and D_y J to D_x J and D_θ J in the backward pass]
S Gould | 11
Imperative vs. declarative node example: global average pooling
Imperative specification:
  y = (1/N) Σ_{i=1}^{N} x_i
Declarative specification:
  y = argmin_{u ∈ ℝ^n} Σ_{i=1}^{N} ‖u − x_i‖²
for inputs {x_i ∈ ℝ^n | i = 1, ..., N} ⊂ ℝ^n:
"the vector u that is of minimum distance to all input vectors x_i"
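To check that the two specifications agree, set the gradient of the declarative objective to zero: ∇_u Σ_{i=1}^{N} ‖u − x_i‖² = 2 Σ_{i=1}^{N} (u − x_i) = 0, which gives u = (1/N) Σ_{i=1}^{N} x_i, exactly the imperative specification.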
S Gould | 12
Deep declarative nodes: special cases
• Unconstrained (e.g., robust pooling):
  y(x) ∈ argmin_{u ∈ ℝ^n} f(x, u)
• Equality constrained (e.g., projection onto the Lp-sphere):
  y(x) ∈ argmin_{u ∈ ℝ^n} f(x, u) subject to h(x, u) = 0
• Inequality constrained (e.g., projection onto the Lp-ball):
  y(x) ∈ argmin_{u ∈ ℝ^n} f(x, u) subject to h(x, u) ≤ 0
[Figure: declarative node y = argmin_{u ∈ C} f(x, u; θ), with forward pass x → y and backward pass D_y J → D_x J, D_θ J]
[Gould et al., 2019]
S Gould | 13
Imperative and declarative nodes can co-exist
[Figure: the computation graph from before, with two of the nodes (f3 and f7) replaced by declarative argmin nodes]
y = f8( f4( argmin_u f3(f2(f1(x)), u) ), argmin_u f7(f6(f5(f1(x))), u) )
S Gould | 14
Learning as bi-level optimization
Learning problem:
  minimize (over x) objective(x)
  subject to constraints(x)
Declarative node problem:
  minimize (over y) objective(x, y)
  subject to constraints(y)
Bi-level learning problem:
  minimize (over x) objective(x, y)
  subject to constraints(x), where y solves the declarative node problem
S Gould | 15
A game theoretic perspective
• Consider two players, a leader and a follower
• The market dictates the price it is willing to pay for some goods based on supply, i.e., the quantity produced by both players, P(q1 + q2)
• Each player has a cost structure associated with producing goods, Ci(qi), and wants to maximize profits, qi P(q1 + q2) − Ci(qi)
• The leader picks a quantity of goods to produce knowing that the follower will respond optimally. In other words, the leader solves
  maximize (over q1)  q1 P(q1 + q2) − C1(q1)
  subject to  q2 ∈ argmax_q  q P(q1 + q) − C2(q)
[Stackelberg, 1930s]
S Gould | 16
Solving bi-level optimization problems
  minimize (over x) J(x, y)
  subject to y ∈ argmin_u f(x, u)
• Closed-form lower-level problem: substitute for y in the upper problem,
  minimize (over x) J(x, y(x))
• May result in a difficult (single-level) optimization problem
[Bard, 1998; Dempe and Franke, 2015]
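As a toy illustration (not from the slides): if the lower-level problem is y(x) = argmin_u (u − x)², then y(x) = x in closed form and the bi-level problem collapses to the single-level problem minimize (over x) J(x, x).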
S Gould | 17
Solving bi-level optimization problems
  minimize (over x) J(x, y)
  subject to y ∈ argmin_u f(x, u)
• Convex lower-level problem: replace the lower problem with sufficient conditions for optimality (e.g., the KKT conditions), written here as h(y) = 0,
  minimize (over x, y) J(x, y)
  subject to h(y) = 0
• May result in a non-convex problem if the KKT conditions are not convex
[Bard, 1998; Dempe and Franke, 2015]
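Continuing the toy illustration above: for the lower-level problem argmin_u (u − x)², the optimality condition is h(y) = y − x = 0, so the bi-level problem becomes: minimize (over x, y) J(x, y) subject to y − x = 0.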
S Gould | 18
Solving bi-level optimization problems
  minimize (over x) J(x, y)
  subject to y ∈ argmin_u f(x, u)
• Gradient descent: compute the gradient with respect to x and update
  x ← x − η ( ∂J(x, y)/∂x + (∂J(x, y)/∂y)(dy/dx) )
• But this requires computing the gradient of y (itself a function of x)
[Bard, 1998; Dempe and Franke, 2015]
S Gould | 19
Algorithm for solving bi-level optimization
SolveBiLevelOptimization:
  initialize x
  repeat until convergence:
    solve y ∈ argmin_u f(x, u)
    compute J(x, y)
    compute dJ/dx = ∂J(x, y)/∂x + (∂J(x, y)/∂y)(dy/dx)
    update x ← x − η dJ/dx
  return x
[Bard, 1998; Dempe and Franke, 2015; Gould et al., 2019]
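A minimal numerical sketch of this loop (my own toy example, assuming NumPy): the lower-level problem y(x) = argmin_u ½(u − x)² + ¼u⁴ is solved with Newton's method, dy/dx = 1/(1 + 3y²) follows from the implicit differentiation result on the next slides, and the upper-level objective is J(x, y) = ½(y − 1)².

import numpy as np

def solve_lower(x, iters=50):
    # Newton's method on f(x, u) = 0.5*(u - x)**2 + 0.25*u**4
    u = x
    for _ in range(iters):
        grad = (u - x) + u ** 3          # df/du
        hess = 1.0 + 3.0 * u ** 2        # d2f/du2
        u = u - grad / hess
    return u

x = 0.0                                   # parameter being learned
eta = 0.5
for step in range(100):
    y = solve_lower(x)                    # forward pass: solve the declarative node
    dJdy = y - 1.0                        # upper-level objective J = 0.5*(y - 1)**2
    dydx = 1.0 / (1.0 + 3.0 * y ** 2)     # implicit differentiation: -(d2f/du2)^-1 d2f/dxdu
    x = x - eta * dJdy * dydx             # dJ/dx = (dJ/dy)(dy/dx); the direct term is zero here
print(x, solve_lower(x))                  # x tends towards 2, where y(x) = 1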
S Gould | 20
[Gould et al., 2019]
How do we compute (d/dx) argmin_{u ∈ C} f(x, u) ?
S Gould | 21
Implicit differentiation
Let f : ℝ × ℝ → ℝ be a twice differentiable function and let
  y(x) = argmin_u f(x, u)
The derivative of f (with respect to u) vanishes at (x, y). By Dini's implicit function theorem (1878),
  dy(x)/dx = − (∂²f/∂y²)⁻¹ (∂²f/∂x∂y)
The result extends to vector-valued functions, vector-argument functions and (equality) constrained problems. See [Gould et al., 2019].
[Krantz and Parks, 2013; Dontchev and Rockafellar, 2014; Gould et al., 2016; Gould et al., 2019]
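As a quick sanity check (my own example, not from the slides): take f(x, u) = (u − x²)², whose minimizer is y(x) = x². Here ∂²f/∂y² = 2 and ∂²f/∂x∂y = −4x, so the formula gives dy/dx = −(2)⁻¹(−4x) = 2x, which matches differentiating y = x² directly.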
S Gould | 22
Proof sketch
  y ∈ argmin_u f(x, u)  ⟹  ∂f(x, y)/∂y = 0
Differentiate both sides of this optimality condition with respect to x:
  LHS: d/dx [∂f(x, y)/∂y] = ∂²f(x, y)/∂x∂y + (∂²f(x, y)/∂y²)(dy/dx)
  RHS: d/dx [0] = 0
Rearranging gives dy/dx = −(∂²f/∂y²)⁻¹ (∂²f/∂x∂y).
[Figure: the objective f(x, u) plotted against u, with the minimizer y marked]
[Gould et al., 2019]
S Gould | 23
Deep declarative nodes: what do we need?
• Forward pass:
  • A method to solve the optimization problem
• Backward pass:
  • Specification of the objective and constraints
  • (And the cached result from the forward pass)
  • We do not need to know how the problem was solved
[Gould et al., 2019]
[Figure: declarative node y = argmin_{u ∈ C} f(x, u; θ), with forward pass x → y and backward pass D_y J → D_x J, D_θ J; inset shows the objective f over u with minimizer y]
S Gould | 24
examples
S Gould | 25
Global average pooling
[Figure: a data generating network produces features {x_i ∈ ℝ^n | i = 1, ..., N} ⊂ ℝ^n, which are pooled into y = (1/N) Σ_{i=1}^{N} x_i, equivalently y = argmin_u Σ_{i=1}^{N} ½‖u − x_i‖², followed by further processing (e.g., classification) to give the loss J]
[Gould et al., 2019]
S Gould | 26
Robust penalty functions, φ
• Quadratic: ½z²  (closed-form, convex, smooth, unique solution)
• Pseudo-Huber: α²(√(1 + (z/α)²) − 1)  (convex, smooth, unique solution)
• Huber: ½z² for |z| ≤ α, else α(|z| − ½α)  (convex, non-smooth, non-isolated solutions)
• Welsch: 1 − exp(−z²/(2α²))  (non-convex, smooth, isolated solutions)
• Truncated quadratic: ½z² for |z| ≤ α, ½α² otherwise  (non-convex, non-smooth, isolated solutions)
[Gould et al., 2019]
S Gould | 27
Example: robust pooling
[Figure: a data generating network produces features x, which are pooled by y = argmin_u Σ_{i=1}^{N} φ(u − x_i; α) and then scored by the loss J = ½‖y‖²]
The corresponding bi-level learning problem is
  minimize (over x) J(x, y) = ½‖y‖²
  subject to y ∈ argmin_u Σ_{i=1}^{N} φ(u − x_i; α)
[Gould et al., 2019]
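A small end-to-end sketch of this node (my own illustration, assuming autograd and SciPy are available): the forward pass solves a one-dimensional robust pooling problem numerically with the pseudo-Huber penalty, and the backward pass applies the implicit differentiation formula from slide 22.

import autograd.numpy as np
from autograd import grad
from scipy.optimize import minimize_scalar

alpha = 1.0

def objective(x, u):
    # f(x, u) = sum_i phi(u - x_i; alpha), with the pseudo-Huber penalty
    z = u - x
    return np.sum(alpha ** 2 * (np.sqrt(1.0 + (z / alpha) ** 2) - 1.0))

def solve(x):
    # forward pass: any solver will do
    return minimize_scalar(lambda u: objective(x, u)).x

def gradient(x, y):
    # backward pass: dy/dx = -(d2f/dy2)^-1 (d2f/dxdy)
    fY = grad(objective, 1)
    fYY = grad(fY, 1)
    fXY = grad(fY, 0)
    return -fXY(x, y) / fYY(x, y)

x = np.array([1.0, 2.0, 10.0])    # the last input is an outlier
y = solve(x)
print(y)                           # a robust estimate, pulled only slightly by the outlier
print(gradient(x, y))              # the outlier receives a small weight in dy/dx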
S Gould | 28
Example: Euclidean projection
Projection onto the L1, L2 and L∞ balls/spheres: the L2 case has a closed-form, smooth, unique solution*; the L1 and L∞ cases are non-smooth with isolated solutions.
[Gould et al., 2019]
S Gould | 30
Example: quadratic programs
  argmin_{u ∈ ℝ^n} ½ uᵀ P u + qᵀ u + r
  subject to A u = b, G u ≤ h
Can be differentiated with respect to its parameters:
  P ∈ ℝ^{n×n}, q ∈ ℝ^n, A ∈ ℝ^{m×n}, b ∈ ℝ^m
[Amos and Kolter, 2017]
S Gould | 31
Example: convex programs
  argmin_{u ∈ ℝ^n} cᵀ u
  subject to b − A u ∈ K
Can be differentiated with respect to its parameters:
  A ∈ ℝ^{m×n}, b ∈ ℝ^m, c ∈ ℝ^n
[Agrawal et al., 2019]
S Gould | 32
Implementing deep declarative nodes
• Need: the objective and constraint functions, and a solver to obtain y
• Gradient by automatic differentiation:
import autograd.numpy as np
from autograd import grad, jacobian

def gradient(x, y, f):
    fY = grad(f, 1)           # derivative of the objective with respect to y
    fYY = jacobian(fY, 1)     # second derivative d2f/dy2
    fXY = jacobian(fY, 0)     # mixed second derivative d2f/dxdy
    return -1.0 * np.linalg.solve(fYY(x, y), fXY(x, y))

which implements
  dy(x)/dx = − (∂²f/∂y²)⁻¹ (∂²f/∂x∂y)
[Gould et al., 2019]
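A small usage check (my own example): for the least-squares node f(x, y) = ‖y − x‖², whose solution is y = x, the helper above should return (approximately) the identity matrix.

f = lambda x, y: np.sum((y - x) ** 2)
x = np.array([1.0, 2.0, 3.0])
y = x.copy()                  # the optimal solution (from any forward solver)
print(gradient(x, y, f))      # approximately the 3x3 identity matrix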
S Gould | 33
cvxpylayers
• Disciplined convex optimization: a subset of optimization problems
• Write the problem using cvxpy; the solver and gradient are computed automatically!
[Agrawal et al., 2019]
import cvxpy as cp
from cvxpylayers.torch import CvxpyLayer

n = 3                        # problem dimension (assumed for this snippet)
x = cp.Parameter(n)
y = cp.Variable(n)
obj = cp.Minimize(cp.sum_squares(x - y))
cons = [y >= 0]
prob = cp.Problem(obj, cons)
layer = CvxpyLayer(prob, parameters=[x], variables=[y])
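A possible usage sketch for the layer defined above (assuming the PyTorch backend of cvxpylayers): it behaves like any other differentiable module, here computing a projection onto the non-negative orthant.

import torch

x_t = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
y_t, = layer(x_t)            # forward pass solves the convex problem
y_t.sum().backward()         # backward pass differentiates through the solution
print(y_t, x_t.grad)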
S Gould | 34
applications
S Gould | 35
Robust point cloud classification
[Gould et al., 2019]
S Gould | 36
Robust point cloud classification
[Gould et al., 2019]
S Gould | 37
Video activity recognition
Stand Up Sit Down
[Marszalek et al., 2009; Carreira et al., 2019]
S Gould | 38
Frame encoding
[Figure: frames of a "stand up" clip plotted in a 2D space of abstract feature 1 vs. abstract feature 2]
[Fernando and Gould, 2016]
S Gould | 39
[Figure: clips of several actions (stand up, sit down, answerphone) traced in the same abstract feature space (abstract feature 1 vs. abstract feature 2)]
[Fernando and Gould, 2016]
S Gould | 40
Video clip classification pipeline
[Figure: each frame (frame 1, frame 2, ..., frame t) is passed through a shared CNN (Conv-1/ReLU-1/Pool-1 through Conv-5/ReLU-5/Pool-5, then FC-6/ReLU-6, FC-7/ReLU-7 and an L2-norm); the per-frame features are combined by temporal pooling (T-Pool), followed by L2-norm and Softmax]
[Fernando and Gould, 2016]
S Gould | 41
Temporal pooling
• Max/avg/robust pooling summarizes an unstructured set of objects {x_i | i = 1, ..., N} ⊂ ℝ^n
• Rank pooling summarizes a structured (ordered) sequence of objects ⟨x_t | t = 1, ..., T⟩ ⊂ ℝ^n
[Fernando et al., 2015; Fernando and Gould, 2016]
S Gould | 42
Rank pooling
• Find a ranking function r : ℝ^n → ℝ such that r(x_t) < r(x_q) for t < q
• In our case we assume that r : x ↦ uᵀx is a linear function
• Use u as the representation
[Fernando et al., 2015; Fernando and Gould, 2016]
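The original work solves a ranking objective (e.g., with a ranking SVM); as a rough sketch of the idea (my own simplification, assuming NumPy), a ridge-regression surrogate regresses the frame index t onto the frame feature x_t and uses the weight vector u as the clip representation.

import numpy as np

def rank_pool(X, lam=1.0):
    # X: T x d matrix of frame features, rows ordered by time
    T, d = X.shape
    t = np.arange(1, T + 1, dtype=float)               # target ordering scores
    # surrogate: u = argmin_u sum_t (u^T x_t - t)^2 + lam * ||u||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

X = np.cumsum(np.random.randn(20, 8), axis=0)           # a toy "video" of 20 frames
u = rank_pool(X)
print(X @ u)                                             # projections roughly increase over time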
S Gould | 43
Experimental results
Method Accuracy (%)
Max-Pool + SVM 66
Avg-Pool + SVM 67
Rank-Pool + SVM 66
Max-Pool-CNN (end-to-end) 71
Avg-Pool-CNN (end-to-end) 70
Rank-Pool-CNN (end-to-end) 87
Improved trajectory features + Fisher vectors + rank-pooling 87
21% improvement!
150 video clips from BBC and ESPN footage; 10 sports actions [Rodriguez et al., 2008]
[Fernando and Gould, 2016]
S Gould | 44
Visual attribute ranking
1. Order a collection of images according to a given attribute
2. Recover the original image from shuffled image patches
[Santa Cruz et al., 2017]
S Gould | 45
Birkhoff polytope
• Permutation matrices form discrete points in Euclidean space, which imposes difficulties for gradient-based optimizers
• The Birkhoff polytope B_n is the convex hull of the set of n × n permutation matrices
• This coincides exactly with the set of n × n doubly stochastic matrices
• We relax our visual permutation learning problem over permutation matrices to a problem over doubly stochastic matrices
[Figure: the Birkhoff polytope B_n, with permutation matrices at its vertices]
[Santa Cruz et al., 2017]
S Gould | 46
End-to-end visual permutation learning
[Figure: a shuffled image sequence is passed through per-patch CNNs; the network outputs a positive matrix, which is normalized to a doubly stochastic matrix approximating the ground-truth permutation matrix, e.g. ground truth [0 1 0 0; 0 0 1 0; 0 0 0 1; 1 0 0 0] vs. prediction [0 0.7 0.2 0.1; 0 0.1 0.8 0; 0.1 0.1 0 0.8; 0.9 0.1 0 0]]
[Santa Cruz et al., 2017]
S Gould | 47
Sinkhorn normalization or projection onto B_n
sinkhorn_fcn(A):
  M = A
  for t = 1, ..., T do
    M_{i,j} ← M_{i,j} / Σ_i M_{i,j}   (normalize columns)
    M_{i,j} ← M_{i,j} / Σ_j M_{i,j}   (normalize rows)
  return M
Alternatively, define a deep declarative module.
[Santa Cruz et al., 2017]
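A minimal differentiable sketch of this unrolled normalization (my own illustration, assuming PyTorch); gradients are obtained simply by back-propagating through the iterations, whereas the declarative alternative treats the normalization as an optimization problem and differentiates its solution.

import torch

def sinkhorn(A, iters=10):
    M = A.clone()
    for _ in range(iters):
        M = M / M.sum(dim=0, keepdim=True)    # normalize columns
        M = M / M.sum(dim=1, keepdim=True)    # normalize rows
    return M

A = torch.rand(4, 4, requires_grad=True)
P = sinkhorn(A + 0.1)                          # keep entries strictly positive
P.trace().backward()                           # gradients flow back through the iterations
print(P.sum(dim=0), P.sum(dim=1))              # both approximately one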
S Gould | 48
Visual attribute learning results
[Santa Cruz et al., 2017]
S Gould | 49
Blind perspective-n-point
[Campbell et al., 2019 (unpublished)]
S Gould | 50
Blind perspective-n-point
[Figure: pipeline with a 2D feature extraction network (features F), a 3D feature extraction network (features P), a pairwise distance layer, Sinkhorn normalization (a declarative node), and weighted RANSAC / blind PnP producing the camera pose (R, t) (a declarative node)]
[Campbell et al., 2019 (unpublished)]
S Gould | 51
Blind perspective-n-point
[Campbell et al., 2019 (unpublished)]
S Gould | 52
code and tutorials at http://deepdeclarativenetworks.com
CVPR 2020 Workshop (http://cvpr2020.deepdeclarativenetworks.com)
S Gould | 53