Formal Computational Skills
Week 4: Differentiation
Overview
Will talk about differentiation, how derivatives are calculated, and their uses (especially for optimisation). Will end with how differentiation is used to train neural networks
By the end you should:
• Know what dy/dx means
• Know how partial differentiation works
• Know how derivatives are calculated
• Be able to differentiate simple functions
• Intuitively understand the chain rule
• Understand the role of differentiation in optimisation
• Be able to use the gradient descent algorithm
Derivatives
The derivative of a function y = f(x) is dy/dx (or df/dx or df(x)/dx)
Read as: “dy by dx”
How a variable changes with respect to (w.r.t.) other variables, ie the rate of change of y w.r.t. x

It is a function of x, and for any given value of x it calculates the gradient at that point, ie how much/how fast y changes as x changes

Examples: if x is space/distance, dy/dx tells us the slope or steepness; if x is time, dy/dx tells us speeds and accelerations etc
Essential to any calculation in a dynamic world
If dy/dx > 0 the function is increasing
If dy/dx < 0 the function is decreasing
If dy/dx = 0 the function is constant
The bigger the absolute size of dy/dx (written as |dy/dx|), the faster the change is happening, and such a rapidly changing y is usually harder to work with
In general, dy/dx changes as x changes

[Figure: four plots of y against x showing gradients of different signs and sizes]
Partial Differentiation
For functions of 2 variables, eg z = x exp(-x^2 - y^2), the partial derivatives ∂z/∂x and ∂z/∂y tell us how z varies wrt each parameter

[Figure: surface plot of z = x exp(-x^2 - y^2); the partial derivatives go along the lines of the surface, ∂z/∂x along lines of constant y and ∂z/∂y along lines of constant x]

Calculated by treating the other variables as if they are constant

eg z = xy: to get ∂z/∂x, treat y as if it is a constant, so ∂z/∂x = y (and similarly ∂z/∂y = x)

And for a function of n variables y = f(x1, x2, ..., xn), ∂y/∂xi tells us how y changes if only xi varies
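To make this concrete, here is a minimal sketch (not from the slides) that estimates both partial derivatives of the example surface z = x exp(-x^2 - y^2) numerically, nudging one variable while holding the other constant, and compares them with the analytic results:

```python
import math

# The example surface from the slide: z = x exp(-x^2 - y^2)
def z(x, y):
    return x * math.exp(-x**2 - y**2)

# Central-difference approximations to the partial derivatives:
# nudge one variable while treating the other as constant.
def dz_dx(x, y, h=1e-6):
    return (z(x + h, y) - z(x - h, y)) / (2 * h)

def dz_dy(x, y, h=1e-6):
    return (z(x, y + h) - z(x, y - h)) / (2 * h)

# Compare with the analytic results at an arbitrary point (0.5, 0.5):
# dz/dx = (1 - 2x^2) exp(-x^2 - y^2), dz/dy = -2xy exp(-x^2 - y^2)
x0, y0 = 0.5, 0.5
print(dz_dx(x0, y0), (1 - 2*x0**2) * math.exp(-x0**2 - y0**2))
print(dz_dy(x0, y0), -2*x0*y0 * math.exp(-x0**2 - y0**2))
```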
Higher Dimensions
[Figure: the error surface E(w) plotted over weights w1 and w2]

In more dimensions, the gradient is a vector called grad, represented by an upside-down triangle:

∇E = (∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wn)

Grad tells us steepness and direction

Elements are partial derivatives wrt each variable

Eg at the point shown on the surface, ∂E/∂w1 = 0 and ∂E/∂w2 = 1, so ∇E = (∂E/∂w1, ∂E/∂w2) = (0, 1)
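As an illustration, a small sketch that builds the grad vector for an assumed example function E(w), taking a numerical partial derivative w.r.t. each weight in turn:

```python
# An assumed example error function of n weights: E(w) = sum of squares
def E(w):
    return sum(wi**2 for wi in w)

# grad E: the vector of partial derivatives, each estimated by nudging
# one weight while holding the others constant.
def grad(E, w, h=1e-6):
    g = []
    for i in range(len(w)):
        w_plus = list(w);  w_plus[i] += h
        w_minus = list(w); w_minus[i] -= h
        g.append((E(w_plus) - E(w_minus)) / (2 * h))
    return g

print(grad(E, [1.0, -2.0, 0.5]))  # analytic gradient is (2, -4, 1)
```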
Vector Fields
Since grad is a vector, it is often useful to view it as a vector field, with the vectors drawn as arrows

Note that grad points in the steepest direction uphill

Similarly, minus grad points in the steepest direction downhill
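If matplotlib is available, one way to draw such a field (for an assumed example surface E = w1^2 + w2^2) is a quiver plot of -grad, whose arrows all point downhill towards the minimum:

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot -grad E as a field of arrows for the assumed example surface
# E(w1, w2) = w1^2 + w2^2, whose minimum is at the origin.
w1, w2 = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
dE_dw1, dE_dw2 = 2 * w1, 2 * w2        # analytic partial derivatives
plt.quiver(w1, w2, -dE_dw1, -dE_dw2)   # arrows show -grad (downhill)
plt.xlabel("w1"); plt.ylabel("w2")
plt.show()
```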
Calculating the Derivative
dy/dx = rate of change of y = change in y / change in x (if dy/dx remained constant)

For a straight line, y = mx + c, it is easy: since the gradient remains constant, divide the change in y by the change in x

Eg y = 2x: between x = 1 and x = 3, Δx = 3 - 1 = 2 and Δy = 6 - 2 = 4

So: dy/dx = Δy/Δx = 4/2 = 2

Thus dy/dx is constant and equals m

Aside: different sorts of d's:
Δ means "a change in"
δ means "a small change in"
d means "an infinitesimally small change in"
However, if the derivative is not constant, we want the gradient at a point, ie the gradient of a tangent to the curve at that point

A tangent is a line that touches the curve at that point

It is approximated by the gradient of a chord = Δy/Δx (a chord is a straight line from one point to another on a curve)

[Figure: a curve with a chord from (x, y) to (x + Δx, y + Δy)]

The smaller the chord, the better the approximation. For an infinitesimally small chord, Δy/Δx = dy/dx and the approximation is exact
Eg y = x^2

[Figure: a chord on the curve y = x^2 from (x, y(x)) to (x + h, y(x + h))]

dy/dx ≈ Δy/Δx = (y(x + h) - y(x)) / h    (here Δx = h)

Now y(x + h) = (x + h)^2 = x^2 + 2xh + h^2, so:

Δy/Δx = (x^2 + 2xh + h^2 - x^2) / h = (2xh + h^2) / h = 2x + h

Now do this process mathematically, letting the chord shrink to nothing:

dy/dx = lim_{h→0} (y(x + h) - y(x)) / h = lim_{h→0} (2x + h) = 2x

Can get derivatives in this way but have rules for most
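A quick numerical check of this limit: the chord gradient for y = x^2 tends to 2x as h shrinks. A minimal sketch (the point x = 3 is an arbitrary choice):

```python
# The chord gradient (y(x+h) - y(x))/h for y = x^2 approaches the
# true derivative dy/dx = 2x as the chord shrinks (h -> 0).
def y(x):
    return x**2

x = 3.0
for h in [1.0, 0.1, 0.01, 0.001]:
    chord = (y(x + h) - y(x)) / h
    print(f"h = {h}: chord gradient = {chord}")  # tends to 2x = 6
```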
Function | Derivative
y = constant (eg y = 3) | dy/dx = 0 (as no change)
y = x^n | dy/dx = n x^(n-1)
y = e^x | dy/dx = e^x
y = ln(x) | dy/dx = 1/x
y = sin(x) | dy/dx = cos(x)
y = cos(x) | dy/dx = -sin(x)

Can prove all, eg why is dy/dx = e^x if y = e^x? Write e^x as the series Σ_{i=0..∞} x^i / i! and differentiate term by term:

dy/dx = Σ_{i=1..∞} i x^(i-1) / i! = Σ_{i=1..∞} x^(i-1) / (i-1)! = e^x

NB n! is n factorial and means eg 5! = 5x4x3x2x1
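If sympy is available, the table entries (including the e^x result) can be checked symbolically; a small sketch:

```python
import sympy as sp

# Check the table of standard derivatives symbolically.
x, n = sp.symbols('x n')
for f in [x**n, sp.exp(x), sp.ln(x), sp.sin(x), sp.cos(x)]:
    print(f, "->", sp.diff(f, x))
# eg x**n -> n*x**(n - 1), exp(x) -> exp(x), log(x) -> 1/x,
#    sin(x) -> cos(x), cos(x) -> -sin(x)
```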
y = k f(x): dy/dx = k df/dx, eg y = 3 sin(x), dy/dx = 3 cos(x)
y = f(x) + g(x): dy/dx = df/dx + dg/dx, eg y = x^3 + e^x, dy/dx = 3x^2 + e^x
Other useful rules:

Product Rule: if y = u(x) v(x) then dy/dx = u dv/dx + v du/dx

Quotient Rule: if y = u(x) / v(x) then dy/dx = (v du/dx - u dv/dx) / v^2

eg: y = x sin(x). Take u = x, v = sin(x), so du/dx = 1 · x^0 = 1 and dv/dx = cos(x), giving dy/dx = u dv/dx + v du/dx = x cos(x) + 1 · sin(x)
Chain Rule
Also known as function of a function. The one you will see most. EASY IN PRACTICE (honestly)

If y = u(v(x)) then dy/dx = (du/dv)(dv/dx)

eg: y = sin(x^2). Take u = sin(v), v = x^2, so du/dv = cos(v) and dv/dx = 2x, giving dy/dx = (du/dv)(dv/dx) = cos(v) · 2x = 2x cos(x^2)
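A quick numerical sanity check of this chain-rule result (the evaluation point 1.3 is an arbitrary choice for illustration):

```python
import math

# Numerically check dy/dx = 2x cos(x^2) for y = sin(x^2).
def y(x):
    return math.sin(x**2)

x, h = 1.3, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)
analytic = 2 * x * math.cos(x**2)
print(numeric, analytic)  # the two should agree closely
```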
Some examples:

y1 = 3x1: dy1/dx1 = 3

y1 = w11 x1: use partial d's, as y1 is a function of 2 variables (x1 and w11): ∂y1/∂x1 = w11 and ∂y1/∂w11 = x1

y1 = w11 x1 + w12 x2: ∂y1/∂x1 = 0 + w11 = w11 (treating the w12 x2 term as constant)

Similarly, if y1 = w11 x1 + w12 x2 + … + w1n xn: ∂y1/∂x6 = w16

E = u^2: dE/du = 2u. E = u^2 + 2u + 20: dE/du = 2u + 2

Sigmoid function: y = 1 / (1 + e^(-a)). Can show dy/da = y(1 - y)
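A small numerical check of the sigmoid identity dy/da = y(1 - y) (the evaluation point is an arbitrary choice):

```python
import math

# Check dy/da = y(1 - y) for the sigmoid y = 1 / (1 + exp(-a)).
def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

a, h = 0.7, 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
y = sigmoid(a)
print(numeric, y * (1 - y))  # should agree closely
```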
EG: Networks

[Figure: a one-layer network: inputs x1 and x2 feed output y1 along weights w11 and w12, with y1 = w11 x1 + w12 x2]

How does a change in x1 affect the network output? Intuitively, the bigger the weight on the connection, the more effect a change in the input along that connection will make. So:

∂y1/∂x1 = ∂(w11 x1)/∂x1 + ∂(w12 x2)/∂x1 = w11 + 0 = w11

Similarly, w12 affects y1 by:

∂y1/∂w12 = ∂(w11 x1)/∂w12 + ∂(w12 x2)/∂w12 = 0 + x2 = x2

So intuitively, changing w12 has a larger effect the stronger the signal travelling along it

Note also that by following the path of x2 through the network, you can see which variables it affects
Now suppose we want to train the network so that it outputs a target t1 when the input is x. Need an error function E to evaluate how far the output y1 is from the target t1:

E = (y1 - t1)^2

Now how does E change if the output changes? To calculate this we need the chain rule, as E is a function of a function of y1: E = u(v(y1)) where v = y1 - t1 and u = v^2, so

dE/dy1 = (du/dv)(dv/dy1)

du/dv = 2v = 2(y1 - t1) and dv/dy1 = 1, and so dE/dy1 = 1 · 2(y1 - t1) = 2(y1 - t1)

We train the network by adjusting the weights, so we need to know how the error varies if eg w12 varies, ie ∂E/∂w12

The chain of events (w12 affecting y1, affecting E) indicates the chain rule, so:

∂E/∂w12 = (∂E/∂y1)(∂y1/∂w12)

To calculate this, note that w12 affects y1 by ∂y1/∂w12 = x2, which in turn affects E by ∂E/∂y1 = 2(y1 - t1), so:

∂E/∂w12 = x2 · 2(y1 - t1)

And in general: ∂E/∂wij = (∂E/∂yi)(∂yi/∂wij) = 2(yi - ti) · xj (cf backprop)
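A minimal numerical check of ∂E/∂w12 = x2 · 2(y1 - t1) for the one-output linear network, with assumed example values for the inputs, weights and target:

```python
# Check dE/dw12 = x2 * 2(y1 - t1) for y1 = w11*x1 + w12*x2 and
# E = (y1 - t1)^2. All the numbers below are assumed examples.
x1, x2, t1 = 0.5, -1.5, 1.0
w11, w12 = 0.3, 0.8

def error(w12):
    y1 = w11 * x1 + w12 * x2
    return (y1 - t1)**2

h = 1e-6
numeric = (error(w12 + h) - error(w12 - h)) / (2 * h)
y1 = w11 * x1 + w12 * x2
print(numeric, x2 * 2 * (y1 - t1))  # should agree closely
```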
Back to 2 (or even N) outputs. Consider the route (or more accurately, the variables) by which a change in w12 affects E:

[Figure: a two-output network: inputs x1, x2 connect to outputs y1, y2 via weights w11, w21, w12, w22]

Now E = Σ_{i=1..2} (yi - ti)^2

See that w12 still affects E via x2 and y1, so we still have:

∂E/∂w12 = (∂E/∂y1)(∂y1/∂w12) = 2(y1 - t1) · x2

and in general: ∂E/∂wij = (∂E/∂yi)(∂yi/∂wij) = 2(yi - ti) · xj

Visualise the partial derivatives working along the connections

How about how E is changed by x2, ie ∂E/∂x2? Again look at the routes along which x2 travels:

x2 changes y1 along w12, which changes E
x2 also changes y2 along w22, which changes E

Therefore we need to somehow combine the contributions along these 2 routes …
Chain rule for partial differentiation
If y is a function of u and v, which are both functions of x, then:

dy/dx = (∂y/∂u)(du/dx) + (∂y/∂v)(dv/dx)

Quite intuitive: sum the contributions along each path affecting E

So, for E = Σ_{i=1..2} (yi - ti)^2:

∂E/∂x2 = (∂E/∂y1)(∂y1/∂x2) + (∂E/∂y2)(∂y2/∂x2) = 2(y1 - t1) · w12 + 2(y2 - t2) · w22

and over all outputs: ∂E/∂x2 = Σ_i 2(yi - ti) · wi2
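The sum over the two routes can be checked numerically too; a sketch with assumed example numbers for the weights, inputs and targets:

```python
# Check dE/dx2 = 2(y1-t1)*w12 + 2(y2-t2)*w22 for the two-output
# linear network. All the numbers below are assumed examples.
w11, w12 = 0.3, 0.8
w21, w22 = -0.4, 0.6
x1, t1, t2 = 0.5, 1.0, -0.5

def error(x2):
    y1 = w11 * x1 + w12 * x2
    y2 = w21 * x1 + w22 * x2
    return (y1 - t1)**2 + (y2 - t2)**2

x2, h = -1.5, 1e-6
numeric = (error(x2 + h) - error(x2 - h)) / (2 * h)
y1 = w11 * x1 + w12 * x2
y2 = w21 * x1 + w22 * x2
print(numeric, 2*(y1 - t1)*w12 + 2*(y2 - t2)*w22)  # should agree
```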
What about another layer?

[Figure: a two-layer network: inputs u1, u2 feed x1, x2 via weights v11, v21, v12, v22, which in turn feed outputs y1, y2 via weights w11, w21, w12, w22]

E = Σ_{i=1..2} (yi - ti)^2

We need to know how E changes wrt the v's and the w's. The calculation for the w's is the same, but what about the v's?

v's affect x's, which affect y's, which affect E. So we need ∂E/∂vji, which involves finding ∂xj/∂vji and ∂E/∂xj

Thus: ∂E/∂vji = (∂E/∂xj)(∂xj/∂vji) = [Σ_{k=1..2} (∂E/∂yk)(∂yk/∂xj)] · ui

Seems complicated but can do via matrix multiplication
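A minimal sketch of the matrix version for this two-layer linear network (the weights and data below are assumed example values): the forward pass is two matrix-vector products, and the backward pass reuses ∂E/∂y to get all the w- and v-gradients at once:

```python
import numpy as np

# Two-layer linear network: x = V u, y = W x, E = sum((y - t)^2).
rng = np.random.default_rng(0)
V = rng.normal(size=(2, 2))   # V[j, i] connects u_i to x_j
W = rng.normal(size=(2, 2))   # W[k, j] connects x_j to y_k
u = np.array([0.5, -1.0])     # assumed example input
t = np.array([1.0, 0.0])      # assumed example targets

x = V @ u                     # forward pass, layer 1
y = W @ x                     # forward pass, layer 2

dE_dy = 2 * (y - t)           # dE/dy_k = 2(y_k - t_k)
dE_dW = np.outer(dE_dy, x)    # dE/dw_kj = dE/dy_k * x_j
dE_dx = W.T @ dE_dy           # dE/dx_j = sum_k dE/dy_k * w_kj
dE_dV = np.outer(dE_dx, u)    # dE/dv_ji = dE/dx_j * u_i
print(dE_dW, dE_dV, sep="\n")
```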
Finding Maxima/Minima (Optima)
At a maximum or minimum (optimum, stationary point etc) the gradient is flat, ie dy/dx = 0

To find max/min, calculate dy/dx and solve dy/dx = 0

[Figure: a curve with local maxima and local minima marked; the global minimum is the lowest point and the global maximum is the highest point]

Similarly for a function of n variables y = f(x1, x2, ..., xn), at optima ∂y/∂xi = 0 for i = 1, …, n

Eg: y = x^2, dy/dx = 2x, so at optima: dy/dx = 0 => 2x = 0, ie x = 0
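If sympy is available, this solve step can be done symbolically; a one-line sketch:

```python
import sympy as sp

# Find the optima of y = x^2 by solving dy/dx = 0.
x = sp.symbols('x')
print(sp.solve(sp.diff(x**2, x), x))  # [0]: the single optimum at x = 0
```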
[Figure: error surface E plotted against weight w11]

Why do we want to find optima? Error minimisation

In neural network training we have the error function E = Σ(yi - ti)^2

At the global minimum of E, outputs are as close as possible to targets, so the network is 'trained'

Therefore, to train a network, 'simply' set dE/dw = 0 and solve
However, in many cases (especially neural network training) we cannot solve dE/dw = 0

In such cases we can use gradient descent to change the weights in such a way that the error decreases
Gradient Descent
Analogy is being on a foggy hillside where we can only see the area around us

To get to the valley floor a good plan is to take a step downhill. To get there quickly, take a step in the steepest direction downhill, ie in the direction of -grad

As we can't see very far and only know the gradient locally, we only take a small step
Algorithm (a minimal implementation sketch follows this list)
• Start at a random position in weight space, w
• Calculate ∇E(w)
• Update w via: wnew = wold - η∇E(w)
• Repeat from step 2

η is the learning rate and governs the size of the step taken. It is set to a low value and often changed adaptively, eg if the landscape seems 'easy' (not jagged or steep), increase the step-size etc
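The sketch below runs the algorithm on an assumed example error surface E(w) = w1^2 + w2^2, whose true minimum is at the origin:

```python
import numpy as np

def E(w):
    return np.sum(w**2)       # assumed example error surface

def grad_E(w):
    return 2 * w              # analytic grad of the example surface

eta = 0.1                     # learning rate: size of each step
w = np.random.randn(2)        # start at a random position in weight space
for step in range(100):
    w = w - eta * grad_E(w)   # step in the direction of -grad
print(w, E(w))                # w should end up close to (0, 0)
```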
[Figure: one gradient descent step on the error surface E(w) over w1, w2: from the original point in weight space, w, to the new point w - η∇E(w)]
Problems of local minima: gradient descent goes to the closest local minimum and stays there:

[Figure: a curve with several minima; two different starting points each end up in their nearest local minimum]

Other problems: length of time to get a solution, especially if there are flat regions in the search space; it can be difficult to calculate grad E

Many variations which help with these problems, eg:
• Introduce more stochasticity to 'escape' local minima
• Use higher order gradients to improve convergence
• Approximate grad or use a proxy, eg take a step, see if lower down or not and if so, stay there (eg hill-climbing etc)
Higher order derivatives
Rates of rates of change. Eg if velocity dy/dx is 1st order, acceleration d^2y/dx^2 is 2nd order

In 1 dimension: d^2y/dx^2, d^3y/dx^3, ..., d^ny/dx^n. In many dimensions, a vector

Also used to tell whether a function (eg a mapping produced by a network) is smooth, as the 2nd derivative measures the curvature at a point: |d^2y/dx^2| is greater for the first mapping than the 2nd

The higher the order, the more information known about y, and so used eg to improve gradient descent search
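Higher-order derivatives can also be estimated numerically; a sketch using the standard central-difference formula for the 2nd derivative:

```python
import math

# Estimate the 2nd derivative (curvature) at a point with the
# central-difference formula (y(x+h) - 2y(x) + y(x-h)) / h^2.
def second_derivative(y, x, h=1e-4):
    return (y(x + h) - 2 * y(x) + y(x - h)) / h**2

# For y = sin(x), d^2y/dx^2 = -sin(x); check at the arbitrary point x = 1:
print(second_derivative(math.sin, 1.0), -math.sin(1.0))
```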