Dr. David A. Clifton
Group Leader – Computational Health Informatics (CHI) Laboratory
Associate Director – Oxford Centre for Affordable Healthcare Technology
College Lecturer in Engineering Science, Balliol College
Artificial Neural Networks B1 Machine Learning
“Though they sink through the sea
they shall rise again
and Death shall have no dominion.” - Dylan Thomas
I have a neural network processor.
Just when you thought it was safe to go back in the water…
Artificial neural networks were state-of-the-art in the mid 1990s
The move towards probabilistic systems gained speed in the late 90s, offering improved robustness, more principled inference, etc.; ANNs came to be seen as a quaint, heuristic subset of such techniques
ANNs suffer from training difficulties (local minima, etc.) that SVMs avoid
“Deep learning” is currently taking up a lot of time at good conferences, where “Deep Convnets” have been mostly applied to image classification
Some background activity in ANN research has persisted: neural spike-trains
Tellingly, IEEE Transactions on Neural Networks changed its name (to IEEE Transactions on Neural Networks and Learning Systems) in 2012
Neural networks, as seen by neuroscientists and by ML researchers
The notion is that we have a massively connected network of simple nodes
There is a rich literature on Hebbian learning, reinforcement learning, unsupervised learning, often overlapping with actual neuroscience
Lots of dubious claims arose about “emergent behaviour”
We are concerned with the ML variety: mappings from inputs to outputs
The “input layer” is rather a misnomer: it simply represents the input vector, comprising elements $y_i$
The “hidden units” are connected to each input node by weights, where input element $y_i$ is connected to hidden node $j$ via weight $w_{ij}$
We sum the weighted inputs to get the overall activation $a_j$ at node $j$: $a_j = \sum_i w_{ij} y_i$
The output $y_j$ at node $j$ is obtained by passing the activation $a_j$ through some activation function: $y_j = f_a(a_j)$
We then have a new vector of elements $y_j$ in the hidden layer, and we can perform the same process all over again on these $y_j$ to get to the output layer $y_k$; hidden node $j$ is connected to output node $k$ via weight $w_{jk}$
Find the activation $a_k$ at each output node by summing $a_k = \sum_j w_{jk} y_j$, and then use the activation function: $y_k = f_a(a_k)$
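As a minimal MATLAB sketch of this forward pass (illustrative sizes, random weights, bias terms omitted for brevity):

% Forward pass through one hidden layer (illustrative values only).
y  = [0.5; -1.2];                 % input vector, elements y_i (d = 2)
W1 = randn(3, 2);                 % weights w_ij: row j (hidden node), column i (input)
W2 = randn(1, 3);                 % weights w_jk: row k (output node), column j (hidden)
fa = @(a) 1 ./ (1 + exp(-a));     % a sigmoidal activation function
aj = W1 * y;                      % a_j = sum_i w_ij y_i at each hidden node
yj = fa(aj);                      % hidden-layer outputs y_j = fa(a_j)
ak = W2 * yj;                     % a_k = sum_j w_jk y_j at each output node
yk = fa(ak);                      % network output y_k = fa(a_k)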
We need a function that maps some activation value $a$ onto an output $y = f_a(a)$
We would like to squash extremal values so that they do not dominate the network; the function should saturate in the tails
Classic examples include:
The sigmoid: $f_a(a) = \dfrac{1}{1 + e^{-a}}$
The softmax: $f_a(a_k) = \dfrac{e^{a_k}}{\sum_{k'} e^{a_{k'}}}$
The hyperbolic tangent: $f_a(a) = \tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
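These are one-liners in MATLAB; a quick sketch to see the saturation:

% The classic activation functions as anonymous functions.
sigmoid = @(a) 1 ./ (1 + exp(-a));      % squashes any a into (0, 1)
softmax = @(a) exp(a) ./ sum(exp(a));   % vector-valued a; outputs sum to 1
% tanh is built in, and squashes into (-1, 1)
a = linspace(-5, 5, 101);
plot(a, sigmoid(a), a, tanh(a));        % both saturate in the tails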
So if we choose a sigmoidal activation function, our network is simply a weighted sum of inputs being treated like…
a weighted sum of logistic regressions
[Figure: two scatter plots of BR (rpm) against HR (bpm)]
[Worked example: an MLP with $I = 2$ inputs, $J$ hidden nodes, and $K = 1$ output, trained on $D^{train}$ and evaluated on the test sets for classes $C_1$ and $C_2$. The activation at hidden node $j$ is $a_j = \sum_{i=1}^{d} x_i w_{ij} + x_0 w_{0,j}$, for $i = 1, \dots, d$ and $j = 1, \dots, J$, where $x_0 w_{0,j}$ is a bias term and $w_{ij}$ connects input $i$ to hidden node $j$; each $a_j$ is passed through the activation function $f$. On $D^{test}_1$ the optimal decision threshold was $T_{opt} = 0.35$, and results compare logistic regression (LR) against the MLP.]
1. Calculate output – forward-propagate the inputs through the network
2. Measure error – using the MSE cost function, $E = \frac{1}{2} \sum_k (y_k - t_k)^2$, where $t_k$ are the target outputs
3. Update weights – move each weight so as to reduce $E$
4. Repeat from (1) until the error converges to a limit
Gradient descent
Given a cost function $E$, we update each element $w_{ij}$ in iteration $\tau$: $w_{ij}^{(\tau+1)} = w_{ij}^{(\tau)} - \eta \frac{\partial E}{\partial w_{ij}}$
This makes a step change in $w_{ij}$ in the direction of greatest decrease of $E$
$\eta$ is the learning rate (typically 0.01–0.1), and allows us to control convergence…
We don’t want to update the weights in increments that are too large, nor too small…
The derivatives $\frac{\partial E}{\partial w_{ij}}$ can be found straightforwardly using “Backprop”
Randomly seed the weights, … and minimize the cost function
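A minimal end-to-end sketch of steps (1)–(4) on hypothetical toy data (sigmoid activations at both layers, biases omitted, gradients via backprop):

% One-hidden-layer MLP trained by gradient descent with backprop (MSE cost).
X = randn(2, 100);  T = double(sum(X) > 0);        % toy inputs (d x N) and 0/1 targets
W1 = 0.1 * randn(5, 2);  W2 = 0.1 * randn(1, 5);   % randomly seeded weights
eta = 0.05;                                        % learning rate
for cycle = 1:500
    % 1. Calculate output (forward pass)
    A1 = W1 * X;   Y1 = 1 ./ (1 + exp(-A1));
    A2 = W2 * Y1;  Y2 = 1 ./ (1 + exp(-A2));
    % 2. Measure error (MSE cost function)
    E = 0.5 * sum((Y2 - T).^2);
    % 3. Update weights: deltas found by backprop
    d2 = (Y2 - T) .* Y2 .* (1 - Y2);               % output-layer delta
    d1 = (W2' * d2) .* Y1 .* (1 - Y1);             % hidden-layer delta
    W2 = W2 - eta * (d2 * Y1');                    % dE/dW2 = d2 * Y1'
    W1 = W1 - eta * (d1 * X');                     % dE/dW1 = d1 * X'
end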
Note that this does not tell you when to end training…
At some point the error on your training set will continue to drop, but the error on an independent (validation) set will start to rise …
Now you are overfitting on the training data
The learned function fits the training data closely but it does not generalise well - it cannot cope with unseen data for the same task
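A sketch of early stopping to guard against this; trainOneCycle and networkError are hypothetical helpers standing in for one gradient-descent pass and an error evaluation:

% Early stopping: monitor validation error and halt when it starts to rise.
maxCycles = 200;  bestValErr = inf;  patience = 10;  bad = 0;
for cycle = 1:maxCycles
    net = trainOneCycle(net, Xtrain, Ttrain);      % hypothetical: one training pass
    valErr = networkError(net, Xval, Tval);        % hypothetical: held-out error
    if valErr < bestValErr
        bestValErr = valErr;  bestNet = net;  bad = 0;   % still improving
    else
        bad = bad + 1;                             % validation error rising
        if bad >= patience, break; end             % overfitting: stop here
    end
end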
Consider three types of fit to the same data:
Piecewise linear
Quadratic regression
Linear regression
Suppose we set aside some of our training data for validation…
Train on the subset left for training, then find the error on the validation set:
Piecewise linear: MSE = 2.4
Quadratic regression: MSE = 0.9
Linear regression: MSE = 2.2
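A sketch of this procedure using polynomial fits on toy data (a high-order polynomial stands in for the piecewise-linear overfit; the MSE values will differ from the slide's 2.4 / 0.9 / 2.2, which come from the lecture's own dataset):

% Compare fits of increasing flexibility on a held-out validation split.
x = linspace(0, 1, 40)';  y = 2*x.^2 - x + 0.1*randn(40, 1);   % toy data
idx = randperm(40);  tr = idx(1:30);  va = idx(31:40);         % 75/25 split
for order = [1 2 9]                          % linear, quadratic, high-order
    p = polyfit(x(tr), y(tr), order);        % train on the training subset
    mse = mean((polyval(p, x(va)) - y(va)).^2);   % error on validation set
    fprintf('order %d: validation MSE = %.3f\n', order, mse);
end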
Parameters you can adjust: nin=[]; % Number of input nodes
nhidden=[]; % Number of hidden nodes
nout=[]; % Number of output nodes
alpha = []; % Weight decay or momentum
ncycles = []; % Number of training cycles before halt
activfn = {'linear','logistic','softmax'}; % Choice of activation function on output layer
optType = {'quasinew','conjgrad','scg'}; % Choice of algorithm for optimisation
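A minimal sketch of how these parameters might be used, assuming the Netlab toolbox (whose mlp, netopt, and mlpfwd functions take the options listed above); Xtrain, Ttrain, and Xtest are hypothetical data matrices with one row per example:

% Assumes the Netlab toolbox is on the path.
nin = 2;  nhidden = 4;  nout = 1;  alpha = 0.01;  ncycles = 100;
net = mlp(nin, nhidden, nout, 'logistic', alpha);   % build the network
options = zeros(1, 18);  options(14) = ncycles;     % optimiser settings
net = netopt(net, options, Xtrain, Ttrain, 'scg');  % train with scaled conjugate gradients
Ytest = mlpfwd(net, Xtest);                         % forward-propagate test data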
The number of input nodes is fixed by the dimensionality of your data
The number of output nodes is fixed by the number of classes that you have
for classification, this often means one output node per class
for regression, this means one node per output variable
The number of hidden nodes is interesting - how should we select our architecture?
Is there a “right” choice when it comes to architecture?
If classes are not separable in the J input dimensions, then set K > J
If data can be dimensionally reduced (K < J) then we may use PCA (see Friday’s lecture) to estimate the number of hidden nodes, K
We typically find that the eigenvalues of the covariance of our N x J data matrix rapidly decrease in size, suggesting that we might like to try setting K accordingly
Here, perhaps K = 6
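A sketch of this eigenvalue check on a toy N x J data matrix:

% Scree plot of covariance eigenvalues to suggest the number of hidden nodes K.
X = randn(200, 10) * randn(10, 10);      % toy N x J data matrix (N = 200, J = 10)
C = cov(X);                              % J x J covariance matrix
lambda = sort(eig(C), 'descend');        % eigenvalues, largest first
plot(lambda, 'o-');  xlabel('component');  ylabel('eigenvalue');
% Choose K near the point where the eigenvalues flatten out.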
Cross-validation is a commonly-accepted means of estimating the complexity of an algorithm (Tuesday, Week 1)
Here, we might perform cross-validation by reporting the error rate (FP + FN) for increasing numbers of hidden nodes
Variability on the error may be estimated by performing validation on randomly-selected subsets
Then, examine the median error over the subsets
[Figure: FP + FN error count against the number of hidden nodes (1–8), plotted as median ± 1 error bar; the minimum lies at around 4 hidden nodes, labelled J = 4 on the slide]
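A sketch of this cross-validation, again assuming Netlab; X (N x d inputs) and T (N x 1 vector of 0/1 labels) are hypothetical data:

% Cross-validate the number of hidden nodes over randomly-selected subsets.
Ks = 1:8;  nRuns = 20;                       % candidate hidden-node counts
err = zeros(nRuns, numel(Ks));               % FP + FN per run, per K
N = size(X, 1);
for r = 1:nRuns
    idx = randperm(N);                       % random 80/20 split
    tr = idx(1:round(0.8*N));  va = idx(round(0.8*N)+1:end);
    for k = 1:numel(Ks)
        net = mlp(size(X, 2), Ks(k), 1, 'logistic');
        options = zeros(1, 18);  options(14) = 100;    % 100 training cycles
        net = netopt(net, options, X(tr,:), T(tr), 'scg');
        pred = mlpfwd(net, X(va,:)) > 0.5;   % classify the validation subset
        err(r, k) = sum(pred ~= T(va));      % FP + FN count
    end
end
plot(Ks, median(err));                       % then examine the median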
[Slides: sweeping the decision threshold $T$ on the network output from $T = 0.1$ to $T = 0.9$, measuring the classification error $\epsilon$ for $C_1$ and $C_2$ on the test data $D^{test}_1$, and selecting the optimal threshold $T_{opt}$]
[Figure axis: number of hidden nodes, K]
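A sketch of such a threshold sweep; y_val and t_val are hypothetical network outputs and true 0/1 labels on a validation set:

% Sweep the decision threshold T on the network output.
Ts = 0.1:0.05:0.9;  eps_T = zeros(size(Ts));
for n = 1:numel(Ts)
    pred = y_val > Ts(n);                   % classify at threshold T
    eps_T(n) = mean(pred ~= t_val);         % misclassification rate
end
[~, best] = min(eps_T);  Topt = Ts(best);   % e.g. Topt = 0.35 on the slides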
What if…
…we dispensed with the notion that we are limited to a single hidden layer
…or even that we are limited to several hidden layers
…and we just kept going one layer deeper?
The Deep Convnet of Krizhevsky, Sutskever, and Hinton had 60 million parameters…
The input space is a 2-D image
Regions of the image are convolved with learned filters in a convolution layer to give a local response
We might see image edges and other filters “light up” on the lower layers
The network builds up convolution layers and regular layers
Needless to say, there are practical considerations
These work well with 2-D images, because the input data map nicely onto a convolution layer
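A sketch of a single convolution-layer response using base MATLAB's conv2 (the Sobel-style filter and rectified activation are illustrative choices, not a network's learned filters):

% Convolve an image with a small filter to get a local response map.
img  = rand(64, 64);                      % toy greyscale image
filt = [1 0 -1; 2 0 -2; 1 0 -1];          % Sobel-style vertical-edge filter
resp = conv2(img, filt, 'valid');         % local response at each image region
act  = max(resp, 0);                      % a rectified (ReLU-style) activation
imagesc(act);                             % edges "light up" in the response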
The types of problem that deep networks typically tackle are large: e.g., 1000 classes, 1.2m training images, 50k validation images, 150k test images (the ImageNet challenge)
Current implementations are optimised for highly-parallel computations (the convolution layer is spatial…), which lend themselves to GPUs
Even so, they take up to a week to train