CS 4803 / 7643: Deep Learning
Dhruv Batra
Georgia Tech
Topics:
• Automatic Differentiation
  – (Finish) Forward mode vs Reverse mode AD
  – Patterns in backprop
Administrivia
• HW1 Reminder
  – Due: 09/09, 11:59pm
• HW2 out on 09/10
• Schedule: https://www.cc.gatech.edu/classes/AY2021/cs7643_fall/
• Project discussion next class
Recap from last time
How do we compute gradients?
• Analytic or “Manual” Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
  – Forward mode AD
  – Reverse mode AD (aka “backprop”)
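Of these, numerical differentiation is the easiest to implement but the slowest and least accurate, so in practice it serves mainly as a sanity check on gradients computed another way. A minimal sketch (not from the slides), assuming a NumPy-style function of a flat float array:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Estimate df/dx with central differences: (f(x+h) - f(x-h)) / (2h).
    Costs one pair of function evaluations per input dimension, and is
    accurate only up to truncation and floating-point error."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        h = np.zeros_like(x)
        h.flat[i] = eps
        grad.flat[i] = (f(x + h) - f(x - h)) / (2 * eps)
    return grad

# Example: f(x) = sum(x**2), whose analytic gradient is 2x.
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))  # approximately [ 2. -4.  6.]
```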
Vector/Matrix Derivatives Notation
Vector Derivative Example
Extension to Tensors
Chain Rule: Composite Functions
Chain Rule: Scalar Case
Chain Rule: Vector Case
Chain Rule: Jacobian view
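The slide figures are not reproduced in this text; as a minimal reconstruction of the Jacobian view, for a composition y = f(x), z = g(y) the chain rule composes Jacobians by matrix multiplication:

```latex
% For x \in \mathbb{R}^n, y = f(x) \in \mathbb{R}^m, z = g(y) \in \mathbb{R}^k,
% the Jacobian of the composite is the product of the Jacobians:
\frac{\partial z}{\partial x}
  = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x},
\qquad
\left[\frac{\partial z}{\partial x}\right]_{ij}
  = \sum_{l=1}^{m}
    \frac{\partial z_i}{\partial y_l}\,
    \frac{\partial y_l}{\partial x_j}
```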
Chain Rule: Graphical view
Chain Rule: Cascaded
Deep Learning = Differentiable Programming
• Computation = Graph
  – Input = Data + Parameters
  – Output = Loss
  – Scheduling = Topological ordering
• Auto-Diff
  – A family of algorithms for implementing the chain rule on computation graphs
Directed Acyclic Graphs (DAGs)
• Exactly what the name suggests
  – Directed edges
  – No (directed) cycles
  – Underlying undirected cycles okay
Directed Acyclic Graphs (DAGs)
• Concept: Topological Ordering
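Since scheduling a computation graph amounts to topologically ordering its nodes, here is a minimal sketch (not from the slides) of one standard way to compute such an ordering, via depth-first search; the node names and example graph are illustrative:

```python
def topological_order(graph):
    """Return the nodes of a DAG so that every edge u -> v has u before v.
    graph: dict mapping each node to a list of its successors."""
    order, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for succ in graph.get(node, []):
            visit(succ)
        order.append(node)  # appended only after all of its successors

    for node in graph:
        visit(node)
    return order[::-1]  # reversed post-order is a topological order

# Example: x1 and x2 feed a multiply; x1 also feeds sin; both results feed an add.
graph = {"x1": ["mul", "sin"], "x2": ["mul"],
         "mul": ["add"], "sin": ["add"], "add": []}
print(topological_order(graph))  # ['x2', 'x1', 'sin', 'mul', 'add']
```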
Computational Graphs
• Notation
Forward mode AD
[Figure: a generic computation graph with an intermediate node g; derivatives are carried forward alongside the values, from the inputs toward the output]

Reverse mode AD
[Figure: the same graph with node g; derivatives are carried backward, from the output toward the inputs]
Plan for Today
• Automatic Differentiation
  – (Finish) Forward mode vs Reverse mode AD
  – Backprop
  – Patterns in backprop
Example: Forward mode AD
[Figure: a computation graph over inputs x1 and x2 with a sin( ) node, a * node, and a + node producing the output]

Q: What happens if there’s another input variable x3?
A: A more sophisticated graph; d+1 “passes” for d+1 input variables.

Q: What happens if there’s another output variable f2?
A: A more sophisticated graph; still d “passes” for d input variables (the number of forward passes depends only on the number of inputs).
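The slides carry this example in figures; as a minimal code sketch of forward mode AD, the graph above can be evaluated with dual numbers, propagating a (value, derivative) pair through every node, one pass per input. The concrete function f(x1, x2) = sin(x1) + x1*x2 is an assumption chosen to match the node types in the figure:

```python
import math

class Dual:
    """A (value, derivative) pair pushed through the graph in one forward pass."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):  # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dual_sin(x):  # chain rule through sin
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x1, x2):
    return dual_sin(x1) + x1 * x2

# One pass per input: seed dot = 1 on the input we differentiate with respect to.
x1, x2 = 2.0, 3.0
df_dx1 = f(Dual(x1, 1.0), Dual(x2, 0.0)).dot  # = cos(x1) + x2
df_dx2 = f(Dual(x1, 0.0), Dual(x2, 1.0)).dot  # = x1
print(df_dx1, df_dx2)
```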
Example: Reverse mode AD
[Figure: the same computation graph over x1 and x2 with sin( ), *, and + nodes, now traversed from the output back toward the inputs]
Gradients add at branches
[Figure: a variable that fans out to multiple consumers receives the sum of the gradients flowing back from each branch]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Duality in Fprop and Bprop
[Figure: the duality between forward and backward passes]

FPROP | BPROP
SUM   | COPY
COPY  | SUM
Example: Reverse mode AD
[Figure: the computation graph over x1 and x2, annotated with the backward pass]

Q: What happens if there’s another input variable x3?
A: A more sophisticated graph; a single “backward prop” still yields the gradients for all inputs.

Q: What happens if there’s another output variable f2?
A: A more sophisticated graph; c “backward props” for c output variables.
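As a complementary sketch (again not from the slides), reverse mode AD for the same assumed function f(x1, x2) = sin(x1) + x1*x2 records the forward computation and then makes a single backward sweep; note how x1's gradient accumulates with +=, because gradients add at branches:

```python
import math

def f_and_grad(x1, x2):
    """Reverse mode AD written out by hand for f = sin(x1) + x1 * x2."""
    # Forward pass: record the intermediate values.
    a = math.sin(x1)
    b = x1 * x2
    f = a + b

    # Backward pass: one sweep from the output to the inputs.
    df = 1.0                 # seed: df/df = 1
    da = df * 1.0            # + gate distributes the upstream gradient
    db = df * 1.0
    dx1 = da * math.cos(x1)  # path through sin
    dx1 += db * x2           # path through *; x1 branches, so gradients add
    dx2 = db * x1
    return f, (dx1, dx2)

print(f_and_grad(2.0, 3.0))  # gradient is (cos(2) + 3, 2) = (2.5839..., 2.0)
```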
Forward Pass vs Forward mode AD vs Reverse Mode AD
[Figure: three copies of the computation graph over x1 and x2, annotated for the plain forward pass, for forward mode AD, and for reverse mode AD]
Forward mode vs Reverse Mode
• What are the differences?
[Figure: the forward mode and reverse mode graphs shown side by side]
Forward mode vs Reverse Mode
• x → Graph → L
• Intuition of the Jacobian
Forward mode vs Reverse Mode
• What are the differences?
• Which one is faster to compute?
  – Forward or backward?
• Which one is more memory efficient (less storage)?
  – Forward or backward?
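The slides pose these as questions; as a hedged summary of the standard answer: forward mode computes Jacobian-vector products (one pass per input direction), reverse mode computes vector-Jacobian products (one pass per output direction). With a scalar loss and many parameters, reverse mode therefore needs only one backward pass, at the cost of storing the forward intermediates for that pass:

```latex
% x \in \mathbb{R}^d (inputs), y \in \mathbb{R}^c (outputs),
% J = \partial y / \partial x \in \mathbb{R}^{c \times d}.
% Forward mode: one pass per input direction v (a Jacobian-vector product).
\dot{y} = J v
% Reverse mode: one pass per output direction u (a vector-Jacobian product).
\bar{x}^{\top} = u^{\top} J
% Scalar loss (c = 1): a single backward pass yields the full gradient \nabla_x L.
```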
Plan for Today
• Automatic Differentiation
  – (Finish) Forward mode vs Reverse mode AD
  – Backprop
  – Patterns in backprop
Neural Network Computation Graph
Figure Credit: Andrea Vedaldi
Backprop
Any DAG of differentiable modules is allowed!
Slide Credit: Marc'Aurelio Ranzato
Computational Graph
Key Computation: Forward-Prop
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Key Computation: Back-Prop
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
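The slides show these two key computations diagrammatically; here is a minimal sketch of the module API they imply. The class name Linear and the input-caching convention are illustrative assumptions, not the course's reference implementation:

```python
import numpy as np

class Linear:
    """One differentiable module. Forward-prop computes y = x @ W;
    back-prop turns dL/dy into dL/dW (for the update) and dL/dx (sent upstream)."""
    def __init__(self, in_dim, out_dim):
        self.W = 0.01 * np.random.randn(in_dim, out_dim)

    def forward(self, x):
        self.x = x          # cache the input; the backward pass needs it
        return x @ self.W

    def backward(self, dy):
        self.dW = self.x.T @ dy   # gradient w.r.t. the parameters
        return dy @ self.W.T      # gradient w.r.t. the input
```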
Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients wrt parameters [B-Pass]
• Step 3: Use gradient to update parameters
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
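Strung together, the three steps form the inner loop of SGD; a minimal sketch reusing the illustrative Linear module (and numpy import) from the sketch above, with made-up data and shapes:

```python
np.random.seed(0)
model = Linear(in_dim=3, out_dim=1)
lr = 0.1
X, Y = np.random.randn(32, 3), np.random.randn(32, 1)  # one mini-batch

for step in range(100):
    pred = model.forward(X)              # Step 1: forward pass ...
    loss = np.mean((pred - Y) ** 2)      # ... and mini-batch loss
    dpred = 2.0 * (pred - Y) / len(X)    # dL/dpred for the squared error
    model.backward(dpred)                # Step 2: backward pass fills model.dW
    model.W -= lr * model.dW             # Step 3: SGD parameter update
```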
Backpropagation: a simple example
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
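The worked numbers from the slide are not in the extracted text; as a hedged reconstruction, the simple example these CS 231n slides typically use is f(x, y, z) = (x + y) * z at the point (-2, 5, -4), backpropagated by hand:

```python
# Assumed example: f(x, y, z) = (x + y) * z at the usual CS 231n point.
x, y, z = -2.0, 5.0, -4.0
q = x + y            # forward: q = 3
f = q * z            # forward: f = -12

df = 1.0             # seed at the output
dq = df * z          # * gate: each side gets the other operand -> -4
dz = df * q          # -> 3
dx = dq * 1.0        # + gate distributes the gradient -> -4
dy = dq * 1.0        # -> -4
print(dx, dy, dz)    # -4.0 -4.0 3.0
```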
Patterns in backward flow
• Q: What is an add gate? A: a gradient distributor
• Q: What is a max gate? A: a gradient router
• Q: What is a mul gate? A: a gradient switcher
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
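A minimal numeric sketch (not from the slides) of the three patterns on scalars, with an upstream gradient d arriving at each gate:

```python
d = 2.0  # upstream gradient dL/df arriving at each gate

# add gate, f = a + b: a distributor; both inputs get d unchanged.
da_add, db_add = d * 1.0, d * 1.0

# max gate, f = max(a, b): a router; only the larger input gets d.
a, b = 3.0, 5.0
da_max, db_max = (d if a > b else 0.0), (d if b > a else 0.0)

# mul gate, f = a * b: a switcher; each input gets d scaled by the other input.
da_mul, db_mul = d * b, d * a

print(da_add, db_add)  # 2.0 2.0
print(da_max, db_max)  # 0.0 2.0
print(da_mul, db_mul)  # 10.0 6.0
```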