Formal Computational Skills
Week 4: Differentiation
Overview
Will talk about differentiation, how derivatives are calculated, and their uses (especially for optimisation). Will end with how differentiation is used to train neural networks
By the end you should:
• Know what dy/dx means
• Know how partial differentiation works
• Know how derivatives are calculated
• Be able to differentiate simple functions
• Intuitively understand the chain rule
• Understand the role of differentiation in optimisation
• Be able to use the gradient descent algorithm
Derivatives
The derivative of a function y = f(x) is dy/dx (or df/dx or df(x)/dx)
Read as: “dy by dx”
How a variable changes with respect to (w.r.t.) other variables, ie the rate of change of y w.r.t. x

It is a function of x, and for any given value of x it calculates the gradient at that point, ie how much/how fast y changes as x changes

Examples: if x is space/distance, dy/dx tells us the slope or steepness; if x is time, dy/dx tells us speeds and accelerations etc
Essential to any calculation in a dynamic world
If dy/dx > 0 the function is increasing
If dy/dx < 0 the function is decreasing
If dy/dx = 0 the function is constant
The bigger the absolute size of dy/dx (written as |dy/dx|), the faster the change is happening, and such a rapidly changing y is usually harder to work with
In general, dy/dx changes as x changes

[Figure: four plots of y against x showing gradients of different signs and sizes]
Partial Differentiation
For functions of 2 variables, eg z = x exp(-x^2 - y^2), the partial derivatives ∂z/∂x and ∂z/∂y tell us how z varies wrt each parameter

[Figure: surface plot of z = x exp(-x^2 - y^2); the partial derivatives go along the lines of the surface, ∂z/∂x along lines of constant y and ∂z/∂y along lines of constant x]

Calculated by treating the other variables as if they are constant

eg z = xy: to get ∂z/∂x, treat y as if it is a constant, so ∂z/∂x = y (and similarly ∂z/∂y = x)

And for a function of n variables y = f(x1, x2, ..., xn), ∂y/∂xi tells us how y changes if only xi varies
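To make this concrete, here is a minimal sketch (not from the slides) that estimates both partial derivatives of the example surface z = x exp(-x^2 - y^2) numerically, nudging one variable while holding the other constant, and compares them with the analytic results:

```python
import math

# The example surface from the slide: z = x exp(-x^2 - y^2)
def z(x, y):
    return x * math.exp(-x**2 - y**2)

# Central-difference approximations to the partial derivatives:
# nudge one variable while treating the other as constant.
def dz_dx(x, y, h=1e-6):
    return (z(x + h, y) - z(x - h, y)) / (2 * h)

def dz_dy(x, y, h=1e-6):
    return (z(x, y + h) - z(x, y - h)) / (2 * h)

# Compare with the analytic results at an arbitrary point (0.5, 0.5):
# dz/dx = (1 - 2x^2) exp(-x^2 - y^2), dz/dy = -2xy exp(-x^2 - y^2)
x0, y0 = 0.5, 0.5
print(dz_dx(x0, y0), (1 - 2*x0**2) * math.exp(-x0**2 - y0**2))
print(dz_dy(x0, y0), -2*x0*y0 * math.exp(-x0**2 - y0**2))
```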
Higher Dimensions
[Figure: the error surface E(w) plotted over weights w1 and w2]

In more dimensions, the gradient is a vector called grad, represented by an upside-down triangle:

∇E = (∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wn)

Grad tells us steepness and direction

Elements are partial derivatives wrt each variable

Eg at the point shown on the surface, ∂E/∂w1 = 0 and ∂E/∂w2 = 1, so ∇E = (∂E/∂w1, ∂E/∂w2) = (0, 1)
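As an illustration, a small sketch that builds the grad vector for an assumed example function E(w), taking a numerical partial derivative w.r.t. each weight in turn:

```python
# An assumed example error function of n weights: E(w) = sum of squares
def E(w):
    return sum(wi**2 for wi in w)

# grad E: the vector of partial derivatives, each estimated by nudging
# one weight while holding the others constant.
def grad(E, w, h=1e-6):
    g = []
    for i in range(len(w)):
        w_plus = list(w);  w_plus[i] += h
        w_minus = list(w); w_minus[i] -= h
        g.append((E(w_plus) - E(w_minus)) / (2 * h))
    return g

print(grad(E, [1.0, -2.0, 0.5]))  # analytic gradient is (2, -4, 1)
```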
Vector Fields
Since grad is a vector, it is often useful to view it as a vector field, with the vectors drawn as arrows

Note that grad points in the steepest direction uphill

Similarly, minus grad points in the steepest direction downhill
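If matplotlib is available, one way to draw such a field (for an assumed example surface E = w1^2 + w2^2) is a quiver plot of -grad, whose arrows all point downhill towards the minimum:

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot -grad E as a field of arrows for the assumed example surface
# E(w1, w2) = w1^2 + w2^2, whose minimum is at the origin.
w1, w2 = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
dE_dw1, dE_dw2 = 2 * w1, 2 * w2        # analytic partial derivatives
plt.quiver(w1, w2, -dE_dw1, -dE_dw2)   # arrows show -grad (downhill)
plt.xlabel("w1"); plt.ylabel("w2")
plt.show()
```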
Calculating the Derivative
dy/dx = rate of change of y = change in y / change in x (if dy/dx remained constant)

For a straight line, y = mx + c, it is easy: since the gradient remains constant, divide the change in y by the change in x

Eg y = 2x: between x = 1 and x = 3, Δx = 3 - 1 = 2 and Δy = 6 - 2 = 4

So: dy/dx = Δy/Δx = 4/2 = 2

Thus dy/dx is constant and equals m

Aside: different sorts of d's:
Δ means "a change in"
δ means "a small change in"
d means "an infinitesimally small change in"
However, if the derivative is not constant, we want the gradient at a point, ie the gradient of a tangent to the curve at that point

A tangent is a line that touches the curve at that point

It is approximated by the gradient of a chord = Δy/Δx (a chord is a straight line from one point to another on a curve)

[Figure: a curve with a chord from (x, y) to (x + Δx, y + Δy)]

The smaller the chord, the better the approximation. For an infinitesimally small chord, Δy/Δx = dy/dx and the approximation is exact
Eg y = x^2

[Figure: a chord on the curve y = x^2 from (x, y(x)) to (x + h, y(x + h))]

dy/dx ≈ Δy/Δx = (y(x + h) - y(x)) / h    (here Δx = h)

Now y(x + h) = (x + h)^2 = x^2 + 2xh + h^2, so:

Δy/Δx = (x^2 + 2xh + h^2 - x^2) / h = (2xh + h^2) / h = 2x + h

Now do this process mathematically, letting the chord shrink to nothing:

dy/dx = lim_{h→0} (y(x + h) - y(x)) / h = lim_{h→0} (2x + h) = 2x

Can get derivatives in this way but have rules for most
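A quick numerical check of this limit: the chord gradient for y = x^2 tends to 2x as h shrinks. A minimal sketch (the point x = 3 is an arbitrary choice):

```python
# The chord gradient (y(x+h) - y(x))/h for y = x^2 approaches the
# true derivative dy/dx = 2x as the chord shrinks (h -> 0).
def y(x):
    return x**2

x = 3.0
for h in [1.0, 0.1, 0.01, 0.001]:
    chord = (y(x + h) - y(x)) / h
    print(f"h = {h}: chord gradient = {chord}")  # tends to 2x = 6
```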
Function | Derivative
y = constant (eg y = 3) | dy/dx = 0 (as no change)
y = x^n | dy/dx = n x^(n-1)
y = e^x | dy/dx = e^x
y = ln(x) | dy/dx = 1/x
y = sin(x) | dy/dx = cos(x)
y = cos(x) | dy/dx = -sin(x)

Can prove all, eg why is dy/dx = e^x if y = e^x? Write e^x as the series Σ_{i=0..∞} x^i / i! and differentiate term by term:

dy/dx = Σ_{i=1..∞} i x^(i-1) / i! = Σ_{i=1..∞} x^(i-1) / (i-1)! = e^x

NB n! is n factorial and means eg 5! = 5x4x3x2x1
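If sympy is available, the table entries (including the e^x result) can be checked symbolically; a small sketch:

```python
import sympy as sp

# Check the table of standard derivatives symbolically.
x, n = sp.symbols('x n')
for f in [x**n, sp.exp(x), sp.ln(x), sp.sin(x), sp.cos(x)]:
    print(f, "->", sp.diff(f, x))
# eg x**n -> n*x**(n - 1), exp(x) -> exp(x), log(x) -> 1/x,
#    sin(x) -> cos(x), cos(x) -> -sin(x)
```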
y = k f(x): dy/dx = k df/dx, eg y = 3 sin(x), dy/dx = 3 cos(x)
y = f(x) + g(x): dy/dx = df/dx + dg/dx, eg y = x^3 + e^x, dy/dx = 3x^2 + e^x
Other useful rules:

Product Rule: if y = u(x) v(x) then dy/dx = u dv/dx + v du/dx

Quotient Rule: if y = u(x) / v(x) then dy/dx = (v du/dx - u dv/dx) / v^2

eg: y = x sin(x). Take u = x, v = sin(x), so du/dx = 1 · x^0 = 1 and dv/dx = cos(x), giving dy/dx = u dv/dx + v du/dx = x cos(x) + 1 · sin(x)
Chain Rule
Also known as function of a function. The one you will see most. EASY IN PRACTICE (honestly)

If y = u(v(x)) then dy/dx = (du/dv)(dv/dx)

eg: y = sin(x^2). Take u = sin(v), v = x^2, so du/dv = cos(v) and dv/dx = 2x, giving dy/dx = (du/dv)(dv/dx) = cos(v) · 2x = 2x cos(x^2)
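A quick numerical sanity check of this chain-rule result (the evaluation point 1.3 is an arbitrary choice for illustration):

```python
import math

# Numerically check dy/dx = 2x cos(x^2) for y = sin(x^2).
def y(x):
    return math.sin(x**2)

x, h = 1.3, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)
analytic = 2 * x * math.cos(x**2)
print(numeric, analytic)  # the two should agree closely
```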
Some examples:

y1 = 3x1: dy1/dx1 = 3

y1 = w11 x1: use partial d's, as y1 is a function of 2 variables (x1 and w11): ∂y1/∂x1 = w11 and ∂y1/∂w11 = x1

y1 = w11 x1 + w12 x2: ∂y1/∂x1 = 0 + w11 = w11 (treating the w12 x2 term as constant)

Similarly, if y1 = w11 x1 + w12 x2 + … + w1n xn: ∂y1/∂x6 = w16

E = u^2: dE/du = 2u. E = u^2 + 2u + 20: dE/du = 2u + 2

Sigmoid function: y = 1 / (1 + e^(-a)). Can show dy/da = y(1 - y)
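A small numerical check of the sigmoid identity dy/da = y(1 - y) (the evaluation point is an arbitrary choice):

```python
import math

# Check dy/da = y(1 - y) for the sigmoid y = 1 / (1 + exp(-a)).
def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

a, h = 0.7, 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
y = sigmoid(a)
print(numeric, y * (1 - y))  # should agree closely
```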
EG: Networks

[Figure: a one-layer network: inputs x1 and x2 feed output y1 along weights w11 and w12, with y1 = w11 x1 + w12 x2]

How does a change in x1 affect the network output? Intuitively, the bigger the weight on the connection, the more effect a change in the input along that connection will make. So:

∂y1/∂x1 = ∂(w11 x1)/∂x1 + ∂(w12 x2)/∂x1 = w11 + 0 = w11

Similarly, w12 affects y1 by:

∂y1/∂w12 = ∂(w11 x1)/∂w12 + ∂(w12 x2)/∂w12 = 0 + x2 = x2

So intuitively, changing w12 has a larger effect the stronger the signal travelling along it

Note also that by following the path of x2 through the network, you can see which variables it affects
Now suppose we want to train the network so that it outputs a target t1 when the input is x. Need an error function E to evaluate how far the output y1 is from the target t1:

E = (y1 - t1)^2

Now how does E change if the output changes? To calculate this we need the chain rule, as E is a function of a function of y1: E = u(v(y1)) where v = y1 - t1 and u = v^2, so

dE/dy1 = (du/dv)(dv/dy1)

du/dv = 2v = 2(y1 - t1) and dv/dy1 = 1, and so dE/dy1 = 1 · 2(y1 - t1) = 2(y1 - t1)

We train the network by adjusting the weights, so we need to know how the error varies if eg w12 varies, ie ∂E/∂w12

The chain of events (w12 affecting y1, affecting E) indicates the chain rule, so:

∂E/∂w12 = (∂E/∂y1)(∂y1/∂w12)

To calculate this, note that w12 affects y1 by ∂y1/∂w12 = x2, which in turn affects E by ∂E/∂y1 = 2(y1 - t1), so:

∂E/∂w12 = x2 · 2(y1 - t1)

And in general: ∂E/∂wij = (∂E/∂yi)(∂yi/∂wij) = 2(yi - ti) · xj (cf backprop)
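A minimal numerical check of ∂E/∂w12 = x2 · 2(y1 - t1) for the one-output linear network, with assumed example values for the inputs, weights and target:

```python
# Check dE/dw12 = x2 * 2(y1 - t1) for y1 = w11*x1 + w12*x2 and
# E = (y1 - t1)^2. All the numbers below are assumed examples.
x1, x2, t1 = 0.5, -1.5, 1.0
w11, w12 = 0.3, 0.8

def error(w12):
    y1 = w11 * x1 + w12 * x2
    return (y1 - t1)**2

h = 1e-6
numeric = (error(w12 + h) - error(w12 - h)) / (2 * h)
y1 = w11 * x1 + w12 * x2
print(numeric, x2 * 2 * (y1 - t1))  # should agree closely
```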
Back to 2 (or even N) outputs. Consider the route (or more accurately, the variables) by which a change in w12 affects E:

[Figure: a two-output network: inputs x1, x2 connect to outputs y1, y2 via weights w11, w21, w12, w22]

Now E = Σ_{i=1..2} (yi - ti)^2

See that w12 still affects E via x2 and y1, so we still have:

∂E/∂w12 = (∂E/∂y1)(∂y1/∂w12) = 2(y1 - t1) · x2

and in general: ∂E/∂wij = (∂E/∂yi)(∂yi/∂wij) = 2(yi - ti) · xj

Visualise the partial derivatives working along the connections

How about how E is changed by x2, ie ∂E/∂x2? Again look at the routes along which x2 travels:

x2 changes y1 along w12, which changes E
x2 also changes y2 along w22, which changes E

Therefore we need to somehow combine the contributions along these 2 routes …
Chain rule for partial differentiation
If y is a function of u and v, which are both functions of x, then:

dy/dx = (∂y/∂u)(du/dx) + (∂y/∂v)(dv/dx)

Quite intuitive: sum the contributions along each path affecting E

So, for E = Σ_{i=1..2} (yi - ti)^2:

∂E/∂x2 = (∂E/∂y1)(∂y1/∂x2) + (∂E/∂y2)(∂y2/∂x2) = 2(y1 - t1) · w12 + 2(y2 - t2) · w22

and over all outputs: ∂E/∂x2 = Σ_i 2(yi - ti) · wi2
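The sum over the two routes can be checked numerically too; a sketch with assumed example numbers for the weights, inputs and targets:

```python
# Check dE/dx2 = 2(y1-t1)*w12 + 2(y2-t2)*w22 for the two-output
# linear network. All the numbers below are assumed examples.
w11, w12 = 0.3, 0.8
w21, w22 = -0.4, 0.6
x1, t1, t2 = 0.5, 1.0, -0.5

def error(x2):
    y1 = w11 * x1 + w12 * x2
    y2 = w21 * x1 + w22 * x2
    return (y1 - t1)**2 + (y2 - t2)**2

x2, h = -1.5, 1e-6
numeric = (error(x2 + h) - error(x2 - h)) / (2 * h)
y1 = w11 * x1 + w12 * x2
y2 = w21 * x1 + w22 * x2
print(numeric, 2*(y1 - t1)*w12 + 2*(y2 - t2)*w22)  # should agree
```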
What about another layer?

[Figure: a two-layer network: inputs u1, u2 feed x1, x2 via weights v11, v21, v12, v22, which in turn feed outputs y1, y2 via weights w11, w21, w12, w22]

E = Σ_{i=1..2} (yi - ti)^2

We need to know how E changes wrt the v's and the w's. The calculation for the w's is the same, but what about the v's?

v's affect x's, which affect y's, which affect E. So we need ∂E/∂vji, which involves finding ∂xj/∂vji and ∂E/∂xj

Thus: ∂E/∂vji = (∂E/∂xj)(∂xj/∂vji) = [Σ_{k=1..2} (∂E/∂yk)(∂yk/∂xj)] · ui

Seems complicated but can do via matrix multiplication
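A minimal sketch of the matrix version for this two-layer linear network (the weights and data below are assumed example values): the forward pass is two matrix-vector products, and the backward pass reuses ∂E/∂y to get all the w- and v-gradients at once:

```python
import numpy as np

# Two-layer linear network: x = V u, y = W x, E = sum((y - t)^2).
rng = np.random.default_rng(0)
V = rng.normal(size=(2, 2))   # V[j, i] connects u_i to x_j
W = rng.normal(size=(2, 2))   # W[k, j] connects x_j to y_k
u = np.array([0.5, -1.0])     # assumed example input
t = np.array([1.0, 0.0])      # assumed example targets

x = V @ u                     # forward pass, layer 1
y = W @ x                     # forward pass, layer 2

dE_dy = 2 * (y - t)           # dE/dy_k = 2(y_k - t_k)
dE_dW = np.outer(dE_dy, x)    # dE/dw_kj = dE/dy_k * x_j
dE_dx = W.T @ dE_dy           # dE/dx_j = sum_k dE/dy_k * w_kj
dE_dV = np.outer(dE_dx, u)    # dE/dv_ji = dE/dx_j * u_i
print(dE_dW, dE_dV, sep="\n")
```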
Finding Maxima/Minima (Optima)
At a maximum or minimum (optimum, stationary point etc) the gradient is flat, ie dy/dx = 0

To find max/min, calculate dy/dx and solve dy/dx = 0

[Figure: a curve with local maxima and local minima marked; the global minimum is the lowest point and the global maximum is the highest point]

Similarly for a function of n variables y = f(x1, x2, ..., xn), at optima ∂y/∂xi = 0 for i = 1, …, n

Eg: y = x^2, dy/dx = 2x, so at optima: dy/dx = 0 => 2x = 0, ie x = 0
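If sympy is available, this solve step can be done symbolically; a one-line sketch:

```python
import sympy as sp

# Find the optima of y = x^2 by solving dy/dx = 0.
x = sp.symbols('x')
print(sp.solve(sp.diff(x**2, x), x))  # [0]: the single optimum at x = 0
```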
[Figure: error surface E plotted against weight w11]

Why do we want to find optima? Error minimisation

In neural network training we have the error function E = Σ(yi - ti)^2

At the global minimum of E, outputs are as close as possible to targets, so the network is 'trained'

Therefore, to train a network, 'simply' set dE/dw = 0 and solve
However, in many cases (especially neural network training) we cannot solve dE/dw = 0

In such cases we can use gradient descent to change the weights in such a way that the error decreases
Gradient Descent
Analogy is being on a foggy hillside where we can only see the area around us

To get to the valley floor a good plan is to take a step downhill. To get there quickly, take a step in the steepest direction downhill, ie in the direction of -grad

As we can't see very far and only know the gradient locally, we only take a small step
Algorithm (a minimal implementation sketch follows this list)
• Start at a random position in weight space, w
• Calculate ∇E(w)
• Update w via: wnew = wold - η∇E(w)
• Repeat from step 2

η is the learning rate and governs the size of the step taken. It is set to a low value and often changed adaptively, eg if the landscape seems 'easy' (not jagged or steep), increase the step-size etc
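The sketch below runs the algorithm on an assumed example error surface E(w) = w1^2 + w2^2, whose true minimum is at the origin:

```python
import numpy as np

def E(w):
    return np.sum(w**2)       # assumed example error surface

def grad_E(w):
    return 2 * w              # analytic grad of the example surface

eta = 0.1                     # learning rate: size of each step
w = np.random.randn(2)        # start at a random position in weight space
for step in range(100):
    w = w - eta * grad_E(w)   # step in the direction of -grad
print(w, E(w))                # w should end up close to (0, 0)
```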
[Figure: one gradient descent step on the error surface E(w) over w1, w2: from the original point in weight space, w, to the new point w - η∇E(w)]
Problems of local minima: gradient descent goes to the closest local minimum and stays there:

[Figure: a curve with several minima; two different starting points each end up in their nearest local minimum]

Other problems: length of time to get a solution, especially if there are flat regions in the search space; it can be difficult to calculate grad E

Many variations which help with these problems, eg:
• Introduce more stochasticity to 'escape' local minima
• Use higher order gradients to improve convergence
• Approximate grad or use a proxy, eg take a step, see if lower down or not and if so, stay there (eg hill-climbing etc)
Higher order derivatives
Rates of rates of change. Eg if velocity dy/dx is 1st order, acceleration d^2y/dx^2 is 2nd order

In 1 dimension: d^2y/dx^2, d^3y/dx^3, ..., d^ny/dx^n. In many dimensions, a vector

Also used to tell whether a function (eg a mapping produced by a network) is smooth, as the 2nd derivative measures the curvature at a point: |d^2y/dx^2| is greater for the first mapping than the 2nd

The higher the order, the more information known about y, and so used eg to improve gradient descent search
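Higher-order derivatives can also be estimated numerically; a sketch using the standard central-difference formula for the 2nd derivative:

```python
import math

# Estimate the 2nd derivative (curvature) at a point with the
# central-difference formula (y(x+h) - 2y(x) + y(x-h)) / h^2.
def second_derivative(y, x, h=1e-4):
    return (y(x + h) - 2 * y(x) + y(x - h)) / h**2

# For y = sin(x), d^2y/dx^2 = -sin(x); check at the arbitrary point x = 1:
print(second_derivative(math.sin, 1.0), -math.sin(1.0))
```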