David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf ·...
Transcript of David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf ·...
![Page 1: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/1.jpg)
TTIC 31230, Fundamentals of Deep Learning
David McAllester, Winter 2018
Computation Graphs and Backpropagation
The Educational Framework (EDF)
Minibatching
Explicit Index Notation
1
![Page 2: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/2.jpg)
Computation Graphs
A computation graph (sometimes called a “computational graph”)is a sequence of assignment statements.
L0[j] = Relu
∑i
W 0[j, i] x[i]
+ b0[j]
L1[y] = σ
∑j
W 1[y, j] L0[j]
+ b1[y]
P (y) =
1
ZeL
1[y]
2
![Page 3: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/3.jpg)
Computation Graphs
A simpler example:
` =
√x2 + y2
u = x2
v = y2
r = u + v
` =√r
A computation graph defines a DAG (a directed acyclic graph)where the variables are the nodes. Each assignment determinesone or more directed edges from the left hand variable to theright hand variables.
3
![Page 4: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/4.jpg)
Computation Graphs
1. u = x2
2. w = y2
3. r = u + w4. ` =
√r
For each variable z we can consider ∂`/∂z.
Gradients are computed in the reverse order.
(4) ∂`/∂r = 12√r
(3) ∂`/∂u = ∂`/∂r(3) ∂`/∂w = ∂`/∂r(2) ∂`/∂y = (∂`/∂w) ∗ (2y)(1) ∂`/∂x = (∂`/∂u) ∗ (2x)
4
![Page 5: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/5.jpg)
A More Abstract Example
y = f (x)
z = g(y, x)
u = h(z)
` = u
For now assume all values are scalars.
We will “backpopagate” the assignments the reverse order.
5
![Page 6: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/6.jpg)
Backpropagation
y = f (x)
z = g(y, x)
u = h(z)
` = u
∂`/∂u = 1
6
![Page 7: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/7.jpg)
Backpropagation
y = f (x)
z = g(y, x)
u = h(z)
` = u
∂`/∂u = 1
∂`/∂z = (∂`/∂u) (∂h/∂z) (this uses the value of z)
7
![Page 8: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/8.jpg)
Backpropagation
y = f (x)
z = g(y, x)
u = h(z)
` = u
∂`/∂u = 1
∂`/∂z = (∂`/∂u) (∂h/∂z)
∂`/∂y = (∂`/∂z) (∂g/∂y) (this uses the value of y and x)
8
![Page 9: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/9.jpg)
Backpropagation
y = f (x)
z = g(y, x)
u = h(z)
` = u
∂`/∂u = 1
∂`/∂z = (∂`/∂u) (∂h/∂z)
∂`/∂y = (∂`/∂z) (∂g/∂y)
∂`/∂x = ??? Oops, we need to add up multiple occurrences.
9
![Page 10: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/10.jpg)
Backpropagation
y = f (x)
z = g(y, x)
u = h(z)
` = u
We let x.grad be an attribute (as in Python) of node x.
We will initialize x.grad to zero.
During backpropagation we will accumulate contributions to∂`/∂x into x.grad.
10
![Page 11: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/11.jpg)
Backpropagation
y = f (x)
z = g(y, x)
u = h(z)
` = u
z.grad = y.grad = x.grad = 0
u.grad = 1
Loop Invariant: For any variable w whose definition has notyet been processed we have that w.grad is ∂`/∂w as definedby the set of assignments already processed.
11
![Page 12: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/12.jpg)
Backpropagation
y = f (x)
z = g(y, x)
u = h(z)
` = u
z.grad = y.grad = x.grad = 0
u.grad = 1
Loop Invariant: For any variable w whose definition has notyet been processed we have that w.grad is ∂`/∂w as definedby the set of assignments already processed.
z.grad += u.grad ∗ ∂h/∂z
12
![Page 13: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/13.jpg)
Backpropagation
y = f (x)
z = g(y, x)
u = h(z)
` = u
z.grad = y.grad = x.grad = 0
u.grad = 1
Loop Invariant: For any variable w whose definition has notyet been processed we have that w.grad is ∂`/∂w as definedby the set of assignments already processed.
z.grad += u.grad ∗ ∂h/∂zy.grad += z.grad ∗ ∂g/∂yx.grad += z.grad ∗ ∂g/∂x
![Page 14: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/14.jpg)
Backpropagation
y = f (x)
z = g(y, x)
u = h(z)
` = u
z.grad = y.grad = x.grad = 0
u.grad = 1
z.grad += u.grad ∗ ∂h/∂zy.grad += z.grad ∗ ∂g/∂yx.grad += z.grad ∗ ∂g/∂xx.grad += y.grad ∗ ∂f/∂x
14
![Page 15: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/15.jpg)
The Vector-Valued Case
y = f (x)
z = g(y, x)
u = h(z)
` = u
Now suppose the variables can be vector-valued.
The loss ` is still a scalar.
In this casex.grad = ∇x `
These are now vectors of the same dimension with
x.grad[i] = ∂`/∂x[i]
15
![Page 16: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/16.jpg)
The Tensor-Valued Case
y = f (x)
z = g(y, x)
u = h(z)
` = u
Now suppose the variables can be tensor-valued (the values aremulti-dimensional arrays). The loss is still a scalar.
The computation is now a “tensor flow”.
16
![Page 17: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/17.jpg)
The Tensor-Valued Case
For the tensor case we simply view tensors as a special case ofvectors. The indeces of x.grad are the same as the indeces ofx.
x.grad.shape = x.shape
The backpropagation equation for y = f (x) is
x.grad[i1, . . . , in] = (y.grad)[j1, . . . , jk]∇xf (x)[j1, . . . , jk, i1, . . . , in]
j1, . . . , jk are indeces of y and i1, . . . , in are indeces of x.
Indeces not on the left of the equation are implicitly summed.
17
![Page 18: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/18.jpg)
The EDF Framework
The educational frameword (EDF) is a simple Python-NumPyimplementation of a deep learning framework.
In EDF we write
y = F (x)
z = G(y, x)
u = H(z)
` = u
This is Python code where variables are bound to objects.
18
![Page 19: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/19.jpg)
The EDF Framework
y = F (x)
z = G(y, x)
u = H(z)
` = u
This is Python code where variables are bound to objects.
x is an object in the class Input.
y is an object in the class F .
z is an object in the class G.
u and ` are the same object in the class H .
![Page 20: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/20.jpg)
y = F (x)
class F (CompNode):
def init (self, x):CompNodes.append(self)self.x = x
def forward(self):self.value = f(self.x.value)
def backward(self):self.x.addgrad(self.grad ∗ ∇x f (x)) #needs x.value
20
![Page 21: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/21.jpg)
Nodes of the Computation Graph
There are three kinds of nodes in a computation graph —inputs, parameters and computation nodes.
class Input:def init (self):
passdef addgrad(self, delta):
pass
class CompNode: #initialization is handled by the subclassdef addgrad(self, delta):
self.grad += delta
21
![Page 22: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/22.jpg)
class Parameter:
def init (self,value):Parameters.append(self)self.value = value
def addgrad(self, delta):#sums over the minibatchself.grad += np.sum(delta, axis = 0)
def SGD(self):self.value -= learning rate*self.grad
22
![Page 23: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/23.jpg)
MLP in EDF
The following Python code constructs the computation graphof a multi-layer perceptron (NLP) with one hidden layer.
L1 = Relu(Affine(Phi1,x))
Q = Softmax(Sigmoid(Affine(Phi2,L1))
ell = LogLoss(Q,y)
Here x and y are input computation nodes whose value havebeen set. Here Phi1 and Phi2 are “parameter packages” (amatrix and a bias vector in this case). We have computationnode classes Affine, Relu, Sigmoid, LogLoss each of whichhas a forward and a backward method.
![Page 24: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/24.jpg)
class Affine(CompNode):
def __init__(self,Phi,x):
CompNodes.append(self)
self.x = x
self.Phi = Phi
def forward(self):
self.value = (np.matmul(self.x.value,
self.Phi.w.value)
+ self.Phi.b.value)
![Page 25: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/25.jpg)
def backward(self):
self.x.addgrad(
np.matmul(self.grad,
self.Phi.w.value.transpose()))
self.Phi.b.addgrad(self.grad)
self.Phi.w.addgrad(self.x.value[:,:,np.newaxis]
* self.grad[:,np.newaxis,:])
25
![Page 26: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/26.jpg)
The Core of EDF
def Forward():
for c in CompNodes: c.forward()
def Backward(loss):
for c in CompNodes + Parameters: c.grad = 0
loss.grad = 1/nBatch
for c in CompNodes[::-1]: c.backward()
def SGD():
for p in Parameters:
p.SGD()
26
![Page 27: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/27.jpg)
Minibatching
Ther running time of EDF, and of any framework, is greatlyimproved by minibatching.
Minibatching: We run some number of instances together(or in parallel) and then do a parameter update based on theaverage gradients of the instances of the batch.
For NumPy minibatching is not so much about parallelism asabout making the vector operations larger so that the vectoroperations dominate the slowness of Python. On a GPU mini-batching allows parallelism over the batch elements.
27
![Page 28: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/28.jpg)
Minibatching
With minibatching each input value and each computed valueis actually a batch of values.
We add a batch index as an additional first tensor dimensionforeach input and computed node.
For example, if a given input is a D-dimensional vector thenthe value of an input node has shape (B,D) where B is sizeof the minibatch.
Parameters do not have a batch index.
![Page 29: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/29.jpg)
class Parameter:
def init (self,value):Parameters.append(self)self.value = value
def addgrad(self, delta):#sums over the minibatchself.grad += np.sum(delta, axis = 0)
def SGD(self):self.value -= learning rate*self.grad
29
![Page 30: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/30.jpg)
forward and backward must handle minibatching
The forward and backward methods must be written to handleminibatching. We will consider some examples.
30
![Page 31: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/31.jpg)
An MLP
L1 = Relu(Affine(Phi1,x))
P = Softmax(Sigmoid(Affine(Phi2,L1))
ell = LogLoss(P,y)
31
![Page 32: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/32.jpg)
The Sigmoid Class
The Sigmoid and Relu classes just work.
class Sigmoid:
def __init__(self,x):
CompNodes.append(self)
self.x = x
def forward(self):
self.value = 1. / (1. + np.exp(-self.x.value))
def backward(self):
self.x.grad += self.grad
* self.value
* (1.-self.value)
![Page 33: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/33.jpg)
y = Affine([W,B],x)
forward:
y.value[b,j] = x.value[b,i]W.value[i,j]
y.value[b,j] += B.value[j]
backward:
x.grad[b,i] += y.grad[b,j]W.value[i,j]
W.grad[i,j] += y.grad[b,j]x.value[i]
B.grad[j] += y.grad[b,j]
![Page 34: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/34.jpg)
class Affine(CompNode):
def __init__(self,Phi,x):
CompNodes.append(self)
self.x = x
self.Phi = Phi
def forward(self):
# y.value[b,j] = x.value[b,i]W.value[i,j]
# y.value[b,j] += B.value[j]
self.value = (np.matmul(self.x.value,
self.Phi.W.value)
+ self.Phi.B.value)
![Page 35: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/35.jpg)
def backward(self):
self.x.addgrad(
# x.grad[b,i] += y.grad[b,j]W.value[i,j]
np.matmul(self.grad,
self.Phi.W.value.transpose()))
# B.grad[j] += y.grad[b,j]
self.Phi.B.addgrad(self.grad)
# W.grad[i,j] += y.grad[b,j]x.value[b,i]
self.Phi.W.addgrad(self.x.value[:,:,np.newaxis]
* self.grad[:,np.newaxis,:])
35
![Page 36: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/36.jpg)
Explicit Index Notation
A[x1, . . . , xn] += e[x1, . . . , xn, y1, . . . , yk]
abbreviates
for x1, . . . , xn, y1, . . . , yk : A[x1, . . . , xn] += e[x1, . . . , xn, y1, . . . , yk]
A[x1, . . . , xn] = e[x1, . . . , xn, y1, . . . , yk]
abbreviates
for x1, . . . , xn : A[x1, . . . , xn] = 0
A[x1, . . . , xn] += e[x1, . . . , xn, y1, . . . , yk]
36
![Page 37: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/37.jpg)
Affine Transformation with Batch Index
y[b, i] = x[b, j]W [j, i]
y[b, i] += B[i]
37
![Page 38: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/38.jpg)
The Swap Rule for Tensor Products
For backprop swap an input with the output and add “grad”.
y[b, i] = x[b, j]W [j, i]
x.grad[b, j] += y.grad[b, i]W [j, i]
W.grad[i, j] += x[b, j]y.grad[b, i]
y[b, i] += B[i]
B.grad[i] += y.grad[b, i]
![Page 39: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/39.jpg)
The Swap Rule for Functions of Tensor Products
For backprop swap an input with the output and add “grad”and f .
y[b, i] = f (x[b, j]W [j, i])
x.grad[b, j] += y.grad[b, i] f W [j, i]
W.grad[i, j] += y.grad[b, i] f x[b, j]
Here f is scalar to scalar.
39
![Page 40: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/40.jpg)
NumPy: Reshaping Tensors
For an ndarray x (tensor) we have that x.shape is a tuple ofdimensions. The product of the dimensions is the number ofnumbers.
In NumPy an ndarray (tensor) x can be reshaped into anyshape with the same number of numbers.
40
![Page 41: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/41.jpg)
NumPy: Broadcasting
Shapes can contain dimensions of size 1.
Dimensions of size 1 are treated as “wild card” dimensions inoperations on tensors.
x.shape = (5, 1)
y.shape = (1, 10)
z = x ∗ yz.shape = (5, 10)
z[i, j] = x[i, 0] ∗ y[0, j]
![Page 42: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/42.jpg)
class Affine(CompNode):
...
def backward(self):
...
# W.grad[i,j] += y.grad[b,j]x.value[b,i]
self.Phi.W.addgrad(self.grad[:,np.newaxis,:]
*self.x.value[:,:,np.newaxis])
class Parameter:
...
def addgrad(self, delta):
self.grad += np.sum(delta, axis = 0)
42
![Page 43: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/43.jpg)
NumPy: Broadcasting
When a scalar is added to a matrix the scalar is reshaped toshape (1, 1) so that it is added to each element of the matrix.
When a vector of shape (k) is added to a matrix the vector isreshaped to (1, k) so that it is added to each row of the matrix.
In general when two tensors of different order (number of di-mensions) are added, unit dimensions are prepended to theshape of the tensor of smaller order to make the orders match.
43
![Page 44: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/44.jpg)
Appendix: The Jacobian Matrix
Consider y = f (x) in vector-valued case.
In the vector-valued case the backpropagation equation is
x.grad += (y.grad)> ∇xf (x)
where
(∇xf (x))[i, j] = J [i, j] = ∂f (x)[i]/∂x[j]
The matrix J [i, j] is the Jacobian of f .
Index Types: j is an x-index and i is a y-index.
44
![Page 45: David McAllester, Winter 2018 Computation Graphs and ...dmcallester/DeepClass18/02MLP/MLPb.pdf · TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2018 Computation](https://reader036.fdocuments.in/reader036/viewer/2022070110/604a525400563549036318ae/html5/thumbnails/45.jpg)
END