CS 4803 / 7643: Deep Learning
Dhruv Batra
Georgia Tech
Topics:
• Automatic Differentiation
  – (Finish) Forward mode vs Reverse mode AD
  – Patterns in backprop
Administrivia
• HW1 Reminder
  – Due: 09/09, 11:59pm
• HW2 out on 09/10
• Schedule: https://www.cc.gatech.edu/classes/AY2021/cs7643_fall/
• Project discussion next class
Recap from last time
How do we compute gradients?
• Analytic or “Manual” Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
  – Forward mode AD
  – Reverse mode AD (aka “backprop”)
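Of these, numerical differentiation is the easiest to implement but the slowest and least accurate, so in practice it serves mainly as a sanity check on gradients computed another way. A minimal sketch (not from the slides), assuming a NumPy-style function of a flat float array:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Estimate df/dx with central differences: (f(x+h) - f(x-h)) / (2h).
    Costs one pair of function evaluations per input dimension, and is
    accurate only up to truncation and floating-point error."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        h = np.zeros_like(x)
        h.flat[i] = eps
        grad.flat[i] = (f(x + h) - f(x - h)) / (2 * eps)
    return grad

# Example: f(x) = sum(x**2), whose analytic gradient is 2x.
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))  # approximately [ 2. -4.  6.]
```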
Vector/Matrix Derivatives Notation
Vector Derivative Example
Extension to Tensors
Chain Rule: Composite Functions
Chain Rule: Scalar Case
Chain Rule: Vector Case
Chain Rule: Jacobian view
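The slide figures are not reproduced in this text; as a minimal reconstruction of the Jacobian view, for a composition y = f(x), z = g(y) the chain rule composes Jacobians by matrix multiplication:

```latex
% For x \in \mathbb{R}^n, y = f(x) \in \mathbb{R}^m, z = g(y) \in \mathbb{R}^k,
% the Jacobian of the composite is the product of the Jacobians:
\frac{\partial z}{\partial x}
  = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x},
\qquad
\left[\frac{\partial z}{\partial x}\right]_{ij}
  = \sum_{l=1}^{m}
    \frac{\partial z_i}{\partial y_l}\,
    \frac{\partial y_l}{\partial x_j}
```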
Chain Rule: Graphical view
Chain Rule: Cascaded
Deep Learning = Differentiable Programming
• Computation = Graph
  – Input = Data + Parameters
  – Output = Loss
  – Scheduling = Topological ordering
• Auto-Diff
  – A family of algorithms for implementing the chain rule on computation graphs
Directed Acyclic Graphs (DAGs)
• Exactly what the name suggests
  – Directed edges
  – No (directed) cycles
  – Underlying undirected cycles okay
Directed Acyclic Graphs (DAGs)
• Concept: Topological Ordering
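Since scheduling a computation graph amounts to topologically ordering its nodes, here is a minimal sketch (not from the slides) of one standard way to compute such an ordering, via depth-first search; the node names and example graph are illustrative:

```python
def topological_order(graph):
    """Return the nodes of a DAG so that every edge u -> v has u before v.
    graph: dict mapping each node to a list of its successors."""
    order, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for succ in graph.get(node, []):
            visit(succ)
        order.append(node)  # appended only after all of its successors

    for node in graph:
        visit(node)
    return order[::-1]  # reversed post-order is a topological order

# Example: x1 and x2 feed a multiply; x1 also feeds sin; both results feed an add.
graph = {"x1": ["mul", "sin"], "x2": ["mul"],
         "mul": ["add"], "sin": ["add"], "add": []}
print(topological_order(graph))  # ['x2', 'x1', 'sin', 'mul', 'add']
```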
Computational Graphs
• Notation
Forward mode AD
[Figure: a generic computation graph with an intermediate node g; derivatives are carried forward alongside the values, from the inputs toward the output]

Reverse mode AD
[Figure: the same graph with node g; derivatives are carried backward, from the output toward the inputs]
Plan for Today
• Automatic Differentiation
  – (Finish) Forward mode vs Reverse mode AD
  – Backprop
  – Patterns in backprop
Example: Forward mode AD
[Figure: a computation graph over inputs x1 and x2 with a sin( ) node, a * node, and a + node producing the output]

Q: What happens if there’s another input variable x3?
A: A more sophisticated graph; d+1 “passes” for d+1 input variables.

Q: What happens if there’s another output variable f2?
A: A more sophisticated graph; still d “passes” for d input variables (the number of forward passes depends only on the number of inputs).
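The slides carry this example in figures; as a minimal code sketch of forward mode AD, the graph above can be evaluated with dual numbers, propagating a (value, derivative) pair through every node, one pass per input. The concrete function f(x1, x2) = sin(x1) + x1*x2 is an assumption chosen to match the node types in the figure:

```python
import math

class Dual:
    """A (value, derivative) pair pushed through the graph in one forward pass."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):  # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dual_sin(x):  # chain rule through sin
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x1, x2):
    return dual_sin(x1) + x1 * x2

# One pass per input: seed dot = 1 on the input we differentiate with respect to.
x1, x2 = 2.0, 3.0
df_dx1 = f(Dual(x1, 1.0), Dual(x2, 0.0)).dot  # = cos(x1) + x2
df_dx2 = f(Dual(x1, 0.0), Dual(x2, 1.0)).dot  # = x1
print(df_dx1, df_dx2)
```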
Example: Reverse mode AD
[Figure: the same computation graph over x1 and x2 with sin( ), *, and + nodes, now traversed from the output back toward the inputs]
Gradients add at branches
[Figure: a variable that fans out to multiple consumers receives the sum of the gradients flowing back from each branch]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Duality in Fprop and Bprop
[Figure: the duality between forward and backward passes]

FPROP | BPROP
SUM   | COPY
COPY  | SUM
Example: Reverse mode AD
[Figure: the computation graph over x1 and x2, annotated with the backward pass]

Q: What happens if there’s another input variable x3?
A: A more sophisticated graph; a single “backward prop” still yields the gradients for all inputs.

Q: What happens if there’s another output variable f2?
A: A more sophisticated graph; c “backward props” for c output variables.
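As a complementary sketch (again not from the slides), reverse mode AD for the same assumed function f(x1, x2) = sin(x1) + x1*x2 records the forward computation and then makes a single backward sweep; note how x1's gradient accumulates with +=, because gradients add at branches:

```python
import math

def f_and_grad(x1, x2):
    """Reverse mode AD written out by hand for f = sin(x1) + x1 * x2."""
    # Forward pass: record the intermediate values.
    a = math.sin(x1)
    b = x1 * x2
    f = a + b

    # Backward pass: one sweep from the output to the inputs.
    df = 1.0                 # seed: df/df = 1
    da = df * 1.0            # + gate distributes the upstream gradient
    db = df * 1.0
    dx1 = da * math.cos(x1)  # path through sin
    dx1 += db * x2           # path through *; x1 branches, so gradients add
    dx2 = db * x1
    return f, (dx1, dx2)

print(f_and_grad(2.0, 3.0))  # gradient is (cos(2) + 3, 2) = (2.5839..., 2.0)
```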
Forward Pass vs Forward mode AD vs Reverse Mode AD
[Figure: three copies of the computation graph over x1 and x2, annotated for the plain forward pass, for forward mode AD, and for reverse mode AD]
Forward mode vs Reverse Mode
• What are the differences?
[Figure: the forward mode and reverse mode graphs shown side by side]
Forward mode vs Reverse Mode
• x → Graph → L
• Intuition of the Jacobian
Forward mode vs Reverse Mode
• What are the differences?
• Which one is faster to compute?
  – Forward or backward?
• Which one is more memory efficient (less storage)?
  – Forward or backward?
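The slides pose these as questions; as a hedged summary of the standard answer: forward mode computes Jacobian-vector products (one pass per input direction), reverse mode computes vector-Jacobian products (one pass per output direction). With a scalar loss and many parameters, reverse mode therefore needs only one backward pass, at the cost of storing the forward intermediates for that pass:

```latex
% x \in \mathbb{R}^d (inputs), y \in \mathbb{R}^c (outputs),
% J = \partial y / \partial x \in \mathbb{R}^{c \times d}.
% Forward mode: one pass per input direction v (a Jacobian-vector product).
\dot{y} = J v
% Reverse mode: one pass per output direction u (a vector-Jacobian product).
\bar{x}^{\top} = u^{\top} J
% Scalar loss (c = 1): a single backward pass yields the full gradient \nabla_x L.
```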
Plan for Today
• Automatic Differentiation
  – (Finish) Forward mode vs Reverse mode AD
  – Backprop
  – Patterns in backprop
Neural Network Computation Graph
Figure Credit: Andrea Vedaldi
Backprop
Any DAG of differentiable modules is allowed!
Slide Credit: Marc'Aurelio Ranzato
Computational Graph
Key Computation: Forward-Prop
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Key Computation: Back-Prop
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
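The slides show these two key computations diagrammatically; here is a minimal sketch of the module API they imply. The class name Linear and the input-caching convention are illustrative assumptions, not the course's reference implementation:

```python
import numpy as np

class Linear:
    """One differentiable module. Forward-prop computes y = x @ W;
    back-prop turns dL/dy into dL/dW (for the update) and dL/dx (sent upstream)."""
    def __init__(self, in_dim, out_dim):
        self.W = 0.01 * np.random.randn(in_dim, out_dim)

    def forward(self, x):
        self.x = x          # cache the input; the backward pass needs it
        return x @ self.W

    def backward(self, dy):
        self.dW = self.x.T @ dy   # gradient w.r.t. the parameters
        return dy @ self.W.T      # gradient w.r.t. the input
```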
Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients wrt parameters [B-Pass]
• Step 3: Use gradient to update parameters
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
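Strung together, the three steps form the inner loop of SGD; a minimal sketch reusing the illustrative Linear module (and numpy import) from the sketch above, with made-up data and shapes:

```python
np.random.seed(0)
model = Linear(in_dim=3, out_dim=1)
lr = 0.1
X, Y = np.random.randn(32, 3), np.random.randn(32, 1)  # one mini-batch

for step in range(100):
    pred = model.forward(X)              # Step 1: forward pass ...
    loss = np.mean((pred - Y) ** 2)      # ... and mini-batch loss
    dpred = 2.0 * (pred - Y) / len(X)    # dL/dpred for the squared error
    model.backward(dpred)                # Step 2: backward pass fills model.dW
    model.W -= lr * model.dW             # Step 3: SGD parameter update
```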
Backpropagation: a simple example
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
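The worked numbers from the slide are not in the extracted text; as a hedged reconstruction, the simple example these CS 231n slides typically use is f(x, y, z) = (x + y) * z at the point (-2, 5, -4), backpropagated by hand:

```python
# Assumed example: f(x, y, z) = (x + y) * z at the usual CS 231n point.
x, y, z = -2.0, 5.0, -4.0
q = x + y            # forward: q = 3
f = q * z            # forward: f = -12

df = 1.0             # seed at the output
dq = df * z          # * gate: each side gets the other operand -> -4
dz = df * q          # -> 3
dx = dq * 1.0        # + gate distributes the gradient -> -4
dy = dq * 1.0        # -> -4
print(dx, dy, dz)    # -4.0 -4.0 3.0
```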
Patterns in backward flow
• Q: What is an add gate? A: a gradient distributor
• Q: What is a max gate? A: a gradient router
• Q: What is a mul gate? A: a gradient switcher
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
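A minimal numeric sketch (not from the slides) of the three patterns on scalars, with an upstream gradient d arriving at each gate:

```python
d = 2.0  # upstream gradient dL/df arriving at each gate

# add gate, f = a + b: a distributor; both inputs get d unchanged.
da_add, db_add = d * 1.0, d * 1.0

# max gate, f = max(a, b): a router; only the larger input gets d.
a, b = 3.0, 5.0
da_max, db_max = (d if a > b else 0.0), (d if b > a else 0.0)

# mul gate, f = a * b: a switcher; each input gets d scaled by the other input.
da_mul, db_mul = d * b, d * a

print(da_add, db_add)  # 2.0 2.0
print(da_max, db_max)  # 0.0 2.0
print(da_mul, db_mul)  # 10.0 6.0
```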