Computing Gradient
Hung-yi Lee (李宏毅)
Introduction
• Backpropagation: an efficient way to compute the gradient
• Prerequisite
• Backpropagation for feedforward net:
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
• Simple version: https://www.youtube.com/watch?v=ibJpTrp5mcE
• Backpropagation through time for RNN: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
• Understanding backpropagation by computational graph
• Tensorflow, Theano, CNTK, etc.
Computational Graph
Computational Graph
• A “language” describing a function
• Node: variable (scalar, vector, tensor, ……)
• Edge: operation (simple function)
a = f(b, c)
[Graph: nodes b and c are connected by the edge f to node a]
Example: y = f(g(h(x)))
u = h(x), v = g(u), y = f(v)
[Graph: x →h→ u →g→ v →f→ y]
Computational Graph
• Example: e = (a+b) ∗ (b+1)
c = a + b
d = b + 1
e = c ∗ d
[Graph: a and b feed the “+” node that produces c; b and the constant 1 feed the “+” node that produces d; c and d feed the “∗” node that produces e. With a = 2 and b = 1: c = 3, d = 2, e = 6.]
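A minimal sketch of how such a graph can be represented and evaluated forward (my own illustration in plain Python; the Node class and names are hypothetical, not from the lecture):

```python
# Representing e = (a+b)*(b+1) as a tiny computational graph
# and evaluating it with a forward pass.
class Node:
    def __init__(self, op=None, parents=(), value=None):
        self.op = op            # function applied to the parent values
        self.parents = parents  # upstream nodes (the incoming edges)
        self.value = value      # value after the forward pass

    def forward(self):
        if self.op is not None:
            self.value = self.op(*[p.forward() for p in self.parents])
        return self.value

a = Node(value=2.0)
b = Node(value=1.0)
c = Node(op=lambda x, y: x + y, parents=(a, b))   # c = a + b
d = Node(op=lambda x: x + 1.0, parents=(b,))      # d = b + 1
e = Node(op=lambda x, y: x * y, parents=(c, d))   # e = c * d

print(e.forward())  # 6.0, matching the values on the slide (a = 2, b = 1)
```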
Review: Chain Rule
• Case 1: z = h(y), y = g(x)   (graph: x → y → z)
dz/dx = (dz/dy)(dy/dx)
• Case 2: z = k(x, y), x = g(s), y = h(s)   (graph: s → x and s → y, both feeding z)
dz/ds = (∂z/∂x)(dx/ds) + (∂z/∂y)(dy/ds)
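A quick worked instance of Case 2 (my own example, not from the slides): take z = k(x, y) = xy with x = g(s) = s² and y = h(s) = s³. Then
dz/ds = (∂z/∂x)(dx/ds) + (∂z/∂y)(dy/ds) = y·2s + x·3s² = 2s⁴ + 3s⁴ = 5s⁴,
which matches differentiating z = s²·s³ = s⁵ directly.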
Computational Graph
• Example: e = (a+b) ∗ (b+1)
Compute ∂e/∂b: sum over all paths from b to e
Edge derivatives: ∂c/∂a = 1, ∂c/∂b = 1, ∂d/∂b = 1, ∂e/∂c = d = b+1, ∂e/∂d = c = a+b
∂e/∂b = (∂e/∂c)(∂c/∂b) + (∂e/∂d)(∂d/∂b) = 1×(b+1) + 1×(a+b)
Computational Graph
• Example: e = (a+b) ∗ (b+1)
Compute ∂e/∂b with a = 3, b = 2
Forward pass values: c = 5, d = 3, e = 15
Edge derivatives: ∂c/∂a = 1, ∂c/∂b = 1, ∂d/∂b = 1, ∂e/∂c = d = 3, ∂e/∂d = c = 5
Computational Graph
• Example: e = (a+b) ∗ (b+1), with a = 3, b = 2
Compute ∂e/∂b (reverse mode): start with 1 at the output node e and move backward, multiplying the edge derivatives. The value arriving at c is 1×3 (= d) and the value arriving at d is 1×5 (= c); at b the two paths sum to 3×1 + 5×1 = 8 = ∂e/∂b.
Computational Graph
• Example: e = (a+b) ∗ (b+1), with a = 3, b = 2
Compute ∂e/∂a (reverse mode): again start with 1 at e. The value arriving at c is 3 (= d), and the only path to a contributes 3×1 = 3 = ∂e/∂a.
Computational Graph
• Example: e = (a+b) ∗ (b+1), with a = 3, b = 2
Compute ∂e/∂a and ∂e/∂b together (reverse mode): start with 1 at e and sweep the graph backward once. The values 3 (= d) and 5 (= c) arrive at c and d, giving ∂e/∂a = 3 and ∂e/∂b = 3 + 5 = 8.
What is the benefit? A single backward pass yields the derivative of the output with respect to every input, which is exactly what gradient descent needs.
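Below is a minimal reverse-mode sketch (my own illustration, not the lecture's code; the Var/add/mul/backward names are hypothetical) showing that one backward pass starting with 1 at e produces ∂e/∂a and ∂e/∂b at the same time:

```python
# Reverse-mode differentiation of e = (a+b)*(b+1).
class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        self._backward = lambda: None

def add(x, y):
    out = Var(x.value + y.value)
    def _backward():
        x.grad += out.grad        # d(x+y)/dx = 1
        y.grad += out.grad        # d(x+y)/dy = 1
    out._backward = _backward
    out.parents = (x, y)
    return out

def mul(x, y):
    out = Var(x.value * y.value)
    def _backward():
        x.grad += y.value * out.grad   # d(xy)/dx = y
        y.grad += x.value * out.grad   # d(xy)/dy = x
    out._backward = _backward
    out.parents = (x, y)
    return out

def backward(root):
    # topological order, then propagate "start with 1" from the output
    order, seen = [], set()
    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for p in getattr(v, "parents", ()):
            visit(p)
        order.append(v)
    visit(root)
    root.grad = 1.0
    for v in reversed(order):
        v._backward()

a, b = Var(3.0), Var(2.0)
one = Var(1.0)
e = mul(add(a, b), add(b, one))   # e = (a+b)*(b+1)
backward(e)
print(a.grad, b.grad)             # 3.0 8.0, as on the slides
```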
Computational Graph
• Parameter sharing: the same parameter appears in different nodes
y = x·e^(x²)
[Graph: three copies of x appear as separate nodes; two of them are multiplied to give u = x², exp gives e^(x²), and the result is multiplied by the third x to give y.]
Differentiating through each copy separately:
through the outer x: ∂y/∂x = e^(x²)
through each of the two x’s inside x²: ∂y/∂x = x·e^(x²)·x
Summing the contributions of all copies: ∂y/∂x = e^(x²) + x·e^(x²)·2x
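A quick numerical check of this shared-parameter rule (my own sketch, assuming numpy; the value of x is made up):

```python
# y = x * exp(x^2); each copy of x contributes one term and the terms are summed.
import numpy as np

x = 1.5
path1 = np.exp(x**2)              # through the outer x
path2 = x * np.exp(x**2) * x      # through the first x inside x^2
path3 = x * np.exp(x**2) * x      # through the second x inside x^2
analytic = path1 + path2 + path3  # = exp(x^2) + x*exp(x^2)*2x

eps = 1e-6                        # central finite difference as a reference
numeric = ((x + eps) * np.exp((x + eps)**2)
           - (x - eps) * np.exp((x - eps)**2)) / (2 * eps)
print(analytic, numeric)          # the two values agree
```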
Computational Graph for Feedforward Net
Review: Backpropagation
∂C/∂w_ij^l = (∂z_i^l/∂w_ij^l)(∂C/∂z_i^l)
Forward pass: ∂z_i^l/∂w_ij^l = a_j^(l−1)
z^1 = W^1 x + b^1, a^1 = σ(z^1), and in general z^l = W^l a^(l−1) + b^l
Backward pass: error signal δ_i^l = ∂C/∂z_i^l
δ^L = σ′(z^L) ⊙ ∇_y C
δ^l = σ′(z^l) ⊙ (W^(l+1))^T δ^(l+1), e.g. δ^(L−1) = σ′(z^(L−1)) ⊙ (W^L)^T δ^L
Review: Backpropagation
[Last layer: the output layer L has neurons 1, 2, …, n; layer L−1 has neurons 1, 2, …, m.]
δ^L = σ′(z^L) ⊙ ∇_y C, where ∇_y C = [∂C/∂y_1, ∂C/∂y_2, …, ∂C/∂y_n]^T
δ^(L−1) = σ′(z^(L−1)) ⊙ (W^L)^T δ^L
∂C/∂w_ij^l = (∂z_i^l/∂w_ij^l)(∂C/∂z_i^l), with error signal δ_i^l = ∂C/∂z_i^l
(we do not use softmax here)
Feedforward Network
[Graph: x →(W^1, b^1)→ z^1 →σ→ a^1 →(W^2, b^2)→ z^2 →σ→ … → y]
z^1 = W^1 x + b^1, z^2 = W^2 a^1 + b^2
y = σ(W^L ⋯ σ(W^2 σ(W^1 x + b^1) + b^2) ⋯ + b^L)
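A small numpy sketch of this forward pass (my own illustration; the layer sizes and random values are made up):

```python
# Forward pass of the two-layer feedforward net on the slide.
import numpy as np

def sigma(z):                         # element-wise sigmoid
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.normal(size=(3,))            # input
W1 = rng.normal(size=(4, 3)); b1 = np.zeros(4)
W2 = rng.normal(size=(2, 4)); b2 = np.zeros(2)

z1 = W1 @ x + b1
a1 = sigma(z1)
z2 = W2 @ a1 + b2
y  = sigma(z2)                        # y = sigma(W2 sigma(W1 x + b1) + b2)
print(y.shape)                        # (2,)
```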
Loss Function of Feedforward Network
[Same graph as above, with the target ŷ and the cost node C appended after y.]
z^1 = W^1 x + b^1, z^2 = W^2 a^1 + b^2
C = L(y, ŷ)
Gradient of Cost Function
C = L(y, ŷ)
To compute the gradient ∂C/∂W^1, ∂C/∂W^2, ∂C/∂b^1, ∂C/∂b^2 …
Compute the partial derivative on each edge of the graph, then use reverse mode (the output C is always a scalar).
∂a^1/∂z^1 = ? Both a^1 and z^1 are vectors, so this derivative is a Jacobian matrix.
Jacobian Matrix
y = f(x), x = [x_1, x_2, x_3]^T, y = [y_1, y_2]^T
∂y/∂x = [ ∂y_1/∂x_1  ∂y_1/∂x_2  ∂y_1/∂x_3 ;  ∂y_2/∂x_1  ∂y_2/∂x_2  ∂y_2/∂x_3 ]
(number of rows = size of y, number of columns = size of x)
Example: y = f(x) = [x_1 + x_2·x_3, 2·x_3]^T
∂y/∂x = [ 1  x_3  x_2 ;  0  0  2 ]
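The example Jacobian can be checked with finite differences (my own sketch, assuming numpy; the test point is made up):

```python
# f(x) = [x1 + x2*x3, 2*x3]; df/dx should be [[1, x3, x2], [0, 0, 2]].
import numpy as np

def f(x):
    return np.array([x[0] + x[1] * x[2], 2.0 * x[2]])

x = np.array([1.0, 2.0, 3.0])
eps = 1e-6
J = np.zeros((2, 3))
for j in range(3):                     # one column per input dimension
    d = np.zeros(3); d[j] = eps
    J[:, j] = (f(x + d) - f(x - d)) / (2 * eps)
print(np.round(J, 4))                  # [[1. 3. 2.] [0. 0. 2.]]
```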
Gradient of Cost Function
Last layer: C = L(y, ŷ), with one-hot target ŷ = [0, 0, …, 1, …]^T (the 1 is at position r)
Cross entropy: C = −log y_r
∂C/∂y_i = ?
i ≠ r: ∂C/∂y_i = 0
i = r: ∂C/∂y_r = −1/y_r
∂C/∂y = [0 ⋯ −1/y_r ⋯]
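A small sketch of this gradient vector (my own illustration, assuming numpy; the values of y are made up):

```python
# dC/dy for the cross entropy C = -log(y_r) with a one-hot target.
import numpy as np

y     = np.array([0.1, 0.7, 0.2])    # network output (already a distribution)
y_hat = np.array([0.0, 1.0, 0.0])    # one-hot target, so r = 1
r = int(np.argmax(y_hat))

dC_dy = np.zeros_like(y)
dC_dy[r] = -1.0 / y[r]                # every other entry stays 0
print(dC_dy)                          # [ 0.  -1.428...  0. ]
```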
Gradient of Cost Function
y = σ(z^2), i.e. y_i = σ(z_i^2)
∂y/∂z^2 is a Jacobian matrix (i-th row, j-th column: ∂y_i/∂z_j^2)
i ≠ j: ∂y_i/∂z_j^2 = 0
i = j: ∂y_i/∂z_i^2 = σ′(z_i^2)
So ∂y/∂z^2 is a square, diagonal matrix with σ′(z_i^2) on the diagonal.
How about softmax? ☺
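A sketch of the resulting diagonal Jacobian (my own illustration, assuming numpy and the sigmoid for σ):

```python
# dy/dz2 for an element-wise sigmoid is a diagonal Jacobian.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

z2 = np.array([0.5, -1.0, 2.0])
dy_dz2 = np.diag(sigma(z2) * (1.0 - sigma(z2)))   # sigma'(z_i) on the diagonal
print(dy_dz2)
```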
Gradient of Cost Function
z^2 = W^2 a^1   (graph: a^1 →(W^2)→ z^2 →σ→ y → C, with target ŷ)
Needed on the edges: ∂y/∂z^2, ∂z^2/∂a^1, ∂z^2/∂W^2
∂z^2/∂a^1 is a Jacobian matrix (i-th row, j-th column: ∂z_i^2/∂a_j^1)
z_i^2 = w_i1^2 a_1^1 + w_i2^2 a_2^1 + ⋯ + w_in^2 a_n^1
∂z_i^2/∂a_j^1 = W_ij^2, so ∂z^2/∂a^1 = W^2
Gradient of Cost Function
z^2 = W^2 a^1, where z^2 has m components and W^2 is m×n
Considering W^2 as an m×n vector (flattened row by row), ∂z^2/∂W^2 is an m × (m·n) Jacobian.
z_i^2 = w_i1^2 a_1^1 + w_i2^2 a_2^1 + ⋯ + w_in^2 a_n^1
∂z_i^2/∂W_jk^2 = ?
i = j: ∂z_i^2/∂W_ik^2 = a_k^1
i ≠ j: ∂z_i^2/∂W_jk^2 = 0
The entry ∂z_i^2/∂W_jk^2 sits in row i, column (j−1)×n+k.
(Considering ∂z^2/∂W^2 as a tensor makes things easier.)
Gradient of Cost Function
∂z^2/∂W^2 written out: row i contains the entries of a^1 in the columns of block j = i and zeros in the other blocks, e.g.
row i = 1: [ a_1^1 a_2^1 ⋯ a_n^1 | 0 ⋯ 0 | 0 ⋯ 0 | ⋯ ]
row i = 2: [ 0 ⋯ 0 | a_1^1 a_2^1 ⋯ a_n^1 | 0 ⋯ 0 | ⋯ ]
(column blocks j = 1, 2, 3, …; the whole matrix is m × (m·n), with W^2 considered as an m×n vector)
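A sketch of this flattened-W² Jacobian for small m and n (my own illustration, assuming numpy; Python's 0-based indexing turns the column (j−1)×n+k into the slice i·n:(i+1)·n):

```python
# dz2/dW2 when W2 is flattened to a length m*n vector:
# row i has a1 in the columns of block j = i and zeros elsewhere.
import numpy as np

m, n = 2, 3
a1 = np.array([1.0, 2.0, 3.0])                  # a^1, length n
J = np.zeros((m, m * n))                        # dz2/dW2, shape m x (m*n)
for i in range(m):
    J[i, i * n:(i + 1) * n] = a1                # dz_i/dW_ik = a_k, zero for j != i
print(J)
# [[1. 2. 3. 0. 0. 0.]
#  [0. 0. 0. 1. 2. 3.]]
```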
Gradient of Cost Function
[Full graph: x →(W^1)→ z^1 →σ→ a^1 →(W^2)→ z^2 →σ→ y → C, with target ŷ]
Edge derivatives: ∂z^1/∂W^1, ∂a^1/∂z^1, ∂z^2/∂a^1 = W^2, ∂z^2/∂W^2, ∂y/∂z^2, ∂C/∂y
∂C/∂W^1 = (∂C/∂y)(∂y/∂z^2) W^2 (∂a^1/∂z^1)(∂z^1/∂W^1) = [⋯ ∂C/∂W_ij^1 ⋯]
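Putting the pieces together, here is a sketch that chains these edge derivatives from the scalar C back to W¹ and W² (my own illustration, assuming numpy, sigmoid activations, and the cross-entropy loss from the earlier slide; sizes and values are made up):

```python
# Multiply the edge Jacobians from the output toward the inputs.
# Because C is a scalar, every intermediate product stays a vector.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=(3,))
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# forward
z1 = W1 @ x + b1;  a1 = sigma(z1)
z2 = W2 @ a1 + b2; y  = sigma(z2)
C = -np.log(y[1])                               # cross entropy, target class r = 1

# backward
dC_dy  = np.array([0.0, -1.0 / y[1]])           # dC/dy
dC_dz2 = dC_dy * sigma(z2) * (1 - sigma(z2))    # times the diagonal dy/dz2
dC_dW2 = np.outer(dC_dz2, a1)                   # dz2/dW2 puts a1 into each row block
dC_da1 = W2.T @ dC_dz2                          # dz2/da1 = W2
dC_dz1 = dC_da1 * sigma(z1) * (1 - sigma(z1))   # times the diagonal da1/dz1
dC_dW1 = np.outer(dC_dz1, x)

print(dC_dW1.shape, dC_dW2.shape)               # (4, 3) (2, 4)
```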
Question
Q: Is there only a backward pass for a computational graph?
Q: Do we get the same results from the two different approaches?
Computational Graph for Recurrent Network
Recurrent Network
[Unrolled network: the same cell f is applied at every time step t = 1, 2, 3, …, mapping (x^t, h^(t−1)) to (y^t, h^t).]
y^t, h^t = f(x^t, h^(t−1); W_i, W_h, W_o)
h^t = σ(W_i x^t + W_h h^(t−1))
y^t = softmax(W_o h^t)
(biases are ignored here)
Recurrent Network
y^t, h^t = f(x^t, h^(t−1); W_i, W_h, W_o)
h^t = σ(W_i x^t + W_h h^(t−1)), y^t = softmax(W_o h^t)
[Computational graph of one step: W_i x^1 and W_h h^0 are produced by two “×” nodes and summed to z^1 by a “+” node; h^1 = σ(z^1); a third “×” node gives W_o h^1, and softmax gives y^1.]
Recurrent Network
[The one-step graph is abstracted into a single node f with inputs x^1, h^0, W_i, W_h, W_o and outputs y^1, h^1.]
y^t, h^t = f(x^t, h^(t−1); W_i, W_h, W_o)
h^t = σ(W_i x^t + W_h h^(t−1)), y^t = softmax(W_o h^t)
Recurrent Network
[The cell f is unrolled for three steps. Step t takes x^t and h^(t−1), produces y^t and h^t, and y^t is compared with the target ŷ^t to give the cost C^t. The parameters W_i, W_h, W_o are shared by all steps.]
C = C^1 + C^2 + C^3
[Backward pass on the unrolled graph: start with 1 at C (and hence at each C^t, since C = C^1 + C^2 + C^3), then multiply the edge derivatives ∂C^1/∂y^1, ∂C^2/∂y^2, ∂C^3/∂y^3, ∂y^1/∂h^1, ∂y^2/∂h^2, ∂y^3/∂h^3, ∂h^3/∂h^2, ∂h^2/∂h^1, and ∂h^1/∂W_h, ∂h^2/∂W_h, ∂h^3/∂W_h along every path from W_h to C. Because W_h is shared, each unrolled copy contributes its own term to ∂C/∂W_h.]
Recurrent Network
Contribution of the first copy of W_h:
∂C/∂W_h^(1) = (∂h^1/∂W_h) [ (∂C^1/∂y^1)(∂y^1/∂h^1) + (∂C^2/∂y^2)(∂y^2/∂h^2)(∂h^2/∂h^1) + (∂C^3/∂y^3)(∂y^3/∂h^3)(∂h^3/∂h^2)(∂h^2/∂h^1) ]
∂C/∂W_h^(2) = ⋯   ∂C/∂W_h^(3) = ⋯   (same pattern, starting from h^2 and h^3)
∂C/∂W_h = ∂C/∂W_h^(1) + ∂C/∂W_h^(2) + ∂C/∂W_h^(3)
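A sketch of this accumulation over the unrolled steps (my own illustration, assuming numpy; for brevity the output layer is replaced by a stand-in gradient of ones arriving at each h^t):

```python
# Backpropagation through time: the same Wh is used at every step,
# so dC/dWh is the sum of the contributions from each unrolled copy.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d = 3                                   # hidden size (= input size here, for brevity)
Wi, Wh = rng.normal(size=(d, d)), rng.normal(size=(d, d))
xs = [rng.normal(size=(d,)) for _ in range(3)]

# forward pass, keeping what the backward pass needs
hs, zs = [np.zeros(d)], []
for x in xs:
    z = Wi @ x + Wh @ hs[-1]            # z^t = Wi x^t + Wh h^(t-1)
    zs.append(z)
    hs.append(sigma(z))                 # h^t = sigma(z^t)

# backward pass: pretend dC/dh^t = 1 at every step (stand-in for dC^t/dy^t * dy^t/dh^t)
dWh = np.zeros_like(Wh)
dh = np.zeros(d)
for t in reversed(range(3)):
    dh = dh + np.ones(d)                        # gradient arriving at h^t
    dz = dh * sigma(zs[t]) * (1 - sigma(zs[t])) # through the sigmoid
    dWh += np.outer(dz, hs[t])                  # contribution of step t to dC/dWh
    dh = Wh.T @ dz                              # propagate to h^(t-1)
print(dWh)
```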
Reference
• Textbook: Deep Learning
• Chapter 6.5
• Calculus on Computational Graphs: Backpropagation
• https://colah.github.io/posts/2015-08-Backprop/
• On chain rule, computational graphs, and backpropagation
• http://outlace.com/Computational-Graph/
Acknowledgement
• Thanks to 翁丞世 for finding the errors in these slides.