Artificial Neural Networks - University of Oxford


Page 1:

Dr. David A. Clifton

Group Leader – Computational Health Informatics (CHI) Laboratory

Associate Director – Oxford Centre for Affordable Healthcare Technology

College Lecturer in Engineering Science, Balliol College

Artificial Neural Networks B1 Machine Learning

“Though they sink through the sea

they shall rise again

and Death shall have no dominion.” - Dylan Thomas

Page 2:

I have a neural network processor.

Page 3:

Just when you thought it was safe to go back in the water…

Artificial neural networks were state-of-the-art in the mid 1990s

The move towards probabilistic systems gained speed in the late 90s, which offered improved robustness, more principled inference, etc. …ANNs were thought to be a quaint / heuristic subset of such techniques

ANNs suffer from training difficulties (local minima, etc.) that SVMs avoid

Page 4:

“Deep learning” is currently taking up a lot of time at good conferences, where “Deep Convnets” have been mostly applied to image classification

Some background activity in ANN research has persisted: neural spike-trains

Tellingly, IEEE Transactions on Neural Networks changed its name in 2010

[Image: cartoon labelled “Neural networks” and “ML researchers”]

Page 5:

The notion is that we have a massively connected network of simple nodes

There is a rich literature on Hebbian learning, reinforcement learning, unsupervised learning, often overlapping with actual neuroscience

Lots of dubious claims arose about “emergent behaviour”

We are concerned with the ML variety: mappings from inputs to outputs

Page 6:

The “input layer” is rather a misnomer: it simply represents the input vector, comprising elements $y_i$

The “hidden units” are connected to each input node by weights, where input element $y_i$ is connected to hidden node $j$ via weight $w_{ij}$

We sum the weighted inputs to get the overall activation $a_j$ at node $j$: $a_j = \sum_i w_{ij} y_i$

Page 7:

The output $y_j$ at node $j$ is obtained by passing the activation $a_j$ through some activation function: $y_j = f_a(a_j)$

We then have a new vector of elements $y_j$ in the hidden layer, and we can perform the same process all over again on these $y_j$ to get to the output layer $y_k$ – hidden node $y_j$ is connected to output node $y_k$ via weight $w_{jk}$

Find the activation $a_k$ at each output node by summing $a_k = \sum_j w_{jk} y_j$, and then use the activation function: $y_k = f_a(a_k)$
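As a concrete illustration, here is a minimal sketch of this forward pass in MATLAB (the weights, biases, and input values below are assumed for illustration, not taken from the lecture):

x = [0.5; -1.2];                           % input vector, d = 2 (values assumed)
W1 = 0.1*randn(3, 2); b1 = zeros(3, 1);    % weights w_ij and biases for J = 3 hidden nodes
W2 = 0.1*randn(1, 3); b2 = 0;              % weights w_jk and bias for K = 1 output node
fa = @(a) 1 ./ (1 + exp(-a));              % a logistic (sigmoidal) activation function
a_hidden = W1*x + b1;                      % a_j = sum_i w_ij y_i (plus a bias term)
y_hidden = fa(a_hidden);                   % y_j = f_a(a_j)
a_out = W2*y_hidden + b2;                  % a_k = sum_j w_jk y_j (plus a bias term)
y_out = fa(a_out)                          % y_k = f_a(a_k)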

Page 8:

We need a function that maps some activation value a onto an output y = f(a)

We would like to squash extremal values such that they do not dominate the network – the function should saturate in the tails

Classic examples include:

The sigmoid: $f(a) = \frac{1}{1 + e^{-a}}$

The softmax: $f(a_k) = \frac{e^{a_k}}{\sum_{k'} e^{a_{k'}}}$

Hyperbolic tangent: $f(a) = \tanh(a)$
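A minimal sketch of these activation functions as MATLAB anonymous functions (the range of activation values is assumed for illustration):

a = linspace(-5, 5, 11);                   % a range of activation values (assumed)
sigmoid = @(a) 1 ./ (1 + exp(-a));         % squashes activations into (0, 1)
htan    = @(a) tanh(a);                    % squashes activations into (-1, 1)
softmax = @(a) exp(a) ./ sum(exp(a));      % normalises a set of activations to sum to 1
[sigmoid(a); htan(a); softmax(a)]          % extreme values saturate in the tails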

Page 9:

So if we choose a sigmoidal activation function, our network is simply a weighted sum of inputs being treated like…

a weighted sum of logistic regressions

[Figure: two panels, labelled LR and MLP, plotting BR (rpm) against HR (bpm) and comparing the decision boundaries learned by logistic regression and by an MLP on training and test data ($D^{train}_1$, $D^{test}_1$; classes $C^{train}_1$, $C^{train}_2$, $C^{test}_1$, $C^{test}_2$), with decision threshold $T_{opt} = 0.35$]

[Diagram: a network with $I = 2$ inputs, $J$ hidden nodes, and $K = 1$ output; each hidden node computes $a_j = \sum_{i=1}^{d} x_i w_{ij} + x_0 w_{0,j}$ for $i = 1, \dots, d$ and $j = 1, \dots, J$, where weight $w_{ij}$ connects input $i$ to hidden node $j$ and $a_j$ is passed through the activation function $f$]

Page 10:

1. Calculate output

2. Measure error, using an MSE cost function: $E = \frac{1}{2} \sum_n (y_n - t_n)^2$

3. Update weights, using gradient descent

4. Repeat from (1) until the error converges to a limit

Page 11:

Given a cost function $E$, we update each element $w_{ij}$ at iteration $\tau + 1$: $w_{ij}^{(\tau+1)} = w_{ij}^{(\tau)} - \eta \, \frac{\partial E}{\partial w_{ij}}$

This makes a step change in $w_{ij}$ in the direction of greatest decrease of $E$

$\eta$ is the learning rate (0.01 - 0.1), and allows us to control convergence…

We don’t want to update the weights in increments that are too large, nor too small…

The gradients $\frac{\partial E}{\partial w_{ij}}$ can be found straightforwardly using “Backprop”
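Putting steps (1)-(4) together, here is a minimal, self-contained sketch of gradient-descent training with backprop in MATLAB; the toy dataset, learning rate, number of hidden nodes, and number of cycles are all assumed for illustration:

X = [0 0; 0 1; 1 0; 1 1]; t = [0; 1; 1; 0];       % toy inputs and targets (assumed)
eta = 0.1; J = 4;                                  % learning rate and number of hidden nodes
W1 = 0.1*randn(size(X, 2), J); b1 = zeros(1, J);   % input-to-hidden weights w_ij and biases
W2 = 0.1*randn(J, 1); b2 = 0;                      % hidden-to-output weights w_jk and bias
sig = @(a) 1 ./ (1 + exp(-a));                     % logistic activation function
for cycle = 1:5000
    % 1. Calculate output (forward pass)
    A1 = X*W1 + repmat(b1, size(X, 1), 1);  Z1 = sig(A1);
    A2 = Z1*W2 + b2;                        Y  = sig(A2);
    % 2. Measure error (MSE cost function)
    E = 0.5 * sum((Y - t).^2);
    % 3. Update weights (gradients found by backprop)
    d2 = (Y - t) .* Y .* (1 - Y);                  % delta at the output layer
    d1 = (d2 * W2') .* Z1 .* (1 - Z1);             % delta at the hidden layer
    W2 = W2 - eta * (Z1' * d2);   b2 = b2 - eta * sum(d2);
    W1 = W1 - eta * (X'  * d1);   b1 = b1 - eta * sum(d1, 1);
end
% 4. In practice, repeat from (1) until E converges (here a fixed number of cycles is used)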

Page 12:
Page 13:

Randomly seed the weights, … and minimize the cost function

Note that this does not tell you when to end training…

Page 14:

At some point the error on your training set will still be dropping, but the error on an independent (validation) set will start to rise…

Now you are overfitting on the training data

The learned function fits the training data closely but it does not generalise well - it cannot cope with unseen data for the same task
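A minimal sketch of early stopping in MATLAB (the model, toy data, and stopping rule below are assumed for illustration; they are not the lecture's):

rng(1);
x = linspace(0, 1, 40)'; y = sin(2*pi*x) + 0.2*randn(40, 1);   % toy data (assumed)
idx = randperm(40); trn = idx(1:30); val = idx(31:40);         % training / validation split
Phi = @(x) [ones(size(x)) x x.^2 x.^3 x.^4 x.^5];              % a flexible (polynomial) model
w = zeros(6, 1); eta = 0.05; best = inf; wbest = w; worse = 0;
for cycle = 1:20000
    g = Phi(x(trn))' * (Phi(x(trn))*w - y(trn)) / numel(trn);  % gradient of the training MSE
    w = w - eta * g;                                           % gradient-descent step
    val_err = mean((Phi(x(val))*w - y(val)).^2);               % error on the validation set
    if val_err < best
        best = val_err; wbest = w; worse = 0;                  % best weights seen so far
    else
        worse = worse + 1;
        if worse > 200, break; end                             % validation error rising: stop
    end
end
% wbest holds the weights from the cycle with the lowest validation error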

Page 15:

Consider three types of fit to the same data:

Piecewise linear

Quadratic regression

Linear regression

Page 16:

Suppose we set aside some of our training data for validation…

Page 17:

Train on the subset left for training, then find the error on the validation set

MSE = 2.4

Page 18:

Train on the subset left for training, then find the error on the validation set

MSE = 0.9

Page 19:

Train on the subset left for training, then find the error on the validation set

MSE = 2.2
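A sketch of this comparison in MATLAB, using assumed toy data rather than the lecture's; the validation MSE values it produces are illustrative only:

rng(0);
x = linspace(0, 10, 30)'; y = 3 + 0.5*x + randn(30, 1);     % toy data (assumed)
idx = randperm(30); trn = idx(1:20); val = idx(21:30);      % training / validation split
p1 = polyfit(x(trn), y(trn), 1);                            % linear regression
p2 = polyfit(x(trn), y(trn), 2);                            % quadratic regression
mse = @(yhat) mean((yhat - y(val)).^2);                     % validation-set MSE
mse_linear    = mse(polyval(p1, x(val)));
mse_quadratic = mse(polyval(p2, x(val)));
[xs, order] = sort(x(trn));                                 % piecewise linear: join training points
mse_piecewise = mse(interp1(xs, y(trn(order)), x(val), 'linear', 'extrap'));
[mse_piecewise, mse_quadratic, mse_linear]                  % compare the three fits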

Page 20:

Parameters you can adjust:

nin = [];      % Number of input nodes
nhidden = [];  % Number of hidden nodes
nout = [];     % Number of output nodes
alpha = [];    % Weight decay or momentum
ncycles = [];  % Number of training cycles before halt
activfn = {'linear','logistic','softmax'};  % Choice of activation function on the output layer
optType = {'quasinew','conjgrad','scg'};    % Choice of algorithm for optimisation
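A hedged usage sketch, assuming these parameters follow the Netlab toolbox (where mlp, netopt, and mlpfwd are the network-creation, optimisation, and forward-propagation functions); x and t are assumed training inputs and targets:

nin = size(x, 2); nhidden = 4; nout = 1;
net = mlp(nin, nhidden, nout, 'logistic');        % create the network (logistic output units)
options = zeros(1, 18);
options(1)  = 1;                                  % display error values during training
options(14) = 100;                                % ncycles: training cycles before halt
net = netopt(net, options, x, t, 'scg');          % optimise with scaled conjugate gradients
y = mlpfwd(net, x);                               % forward-propagate to get network outputs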

Page 21:

The number of input nodes is fixed by the dimensionality of your data

The number of output nodes is fixed by the number of classes that you have

for classification, this often means one output node per class

for regression, this means one node per output variable

The number of hidden nodes is interesting - how should we select our architecture?

Is there a “right” choice when it comes to architecture?

Page 22:

If classes are not separable in the J input dimensions, then set K > J

If data can be dimensionally reduced (K < J) then we may use PCA (see Friday’s lecture) to estimate the number of hidden nodes, K

We typically find that the eigenvalues of the covariance of our N x J data matrix rapidly decrease in size, suggesting that we might like to try setting K accordingly

Here, perhaps K = 6
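A minimal sketch of this eigenvalue check in MATLAB, on assumed toy data with low intrinsic dimensionality (the 1% threshold is an illustrative choice, not the lecture's):

rng(2);
N = 500; J = 10;                        % an N x J data matrix (dimensions assumed)
Z = randn(N, 3) * randn(3, J);          % toy data lying close to a 3-D subspace
X = Z + 0.05 * randn(N, J);
ev = sort(eig(cov(X)), 'descend');      % eigenvalues of the data covariance matrix
disp(ev');                              % the eigenvalues rapidly decrease in size
K = sum(ev > 0.01 * ev(1))              % e.g. count those above 1% of the largest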

Page 23:

Cross-validation is a commonly-accepted means of estimating the complexity of an algorithm (Tuesday, Week 1)

Here, we might perform cross-validation by reporting the error rate (FP + FN) for increasing numbers of hidden nodes

Variability on the error may be estimated by performing validation on randomly-selected subsets

Then, examine the median (and its variability) over these subsets

[Figure: error rate (FP + FN) plotted against the number of hidden nodes (1 to 8), showing the median error and its variability over randomly-selected subsets; the minimum lies at around 4 hidden nodes]
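A hedged sketch of this cross-validation loop, again assuming the Netlab toolbox, with assumed inputs x (N x d) and binary targets t (N x 1); the split fraction, number of repeats, and range of hidden-node counts are illustrative choices:

Ks = 1:8; nreps = 10;
err = zeros(nreps, numel(Ks));
options = zeros(1, 18); options(14) = 100;              % training cycles per fit
for ki = 1:numel(Ks)
    for rep = 1:nreps
        idx = randperm(size(x, 1));                     % randomly-selected subset
        ntrn = round(0.8 * size(x, 1));
        trn = idx(1:ntrn); val = idx(ntrn+1:end);
        net = mlp(size(x, 2), Ks(ki), 1, 'logistic');   % network with Ks(ki) hidden nodes
        net = netopt(net, options, x(trn, :), t(trn), 'scg');
        yhat = mlpfwd(net, x(val, :)) > 0.5;            % threshold the output at 0.5
        err(rep, ki) = sum(yhat ~= t(val));             % FP + FN on the held-out subset
    end
end
med = median(err, 1);                                   % examine the median over repeats
[~, best] = min(med); K = Ks(best)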

Page 24:

What if we dispensed with the notion that we are limited to a single hidden layer…

…or even that we are limited to several hidden layers

…and we just kept going one layer deeper?

The Deep Convnet of Krizhevsky, Sutskever, and Hinton had 60 million parameters…

Page 25:

The input space is a 2-D image

Regions of the image are convolved with filters in a convolution layer to give a local response

We might see filters for image edges and other low-level features “light up” on the lower layers

The network builds up convolution layers and regular layers
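A minimal sketch in MATLAB of a single convolution with an edge filter, using an assumed toy image; this illustrates a convolution layer's local response, not the network from the lecture:

img = zeros(32); img(:, 17:end) = 1;     % toy 32 x 32 image with a vertical edge (assumed)
filt = [1 0 -1; 2 0 -2; 1 0 -1];         % a Sobel-style filter that responds to vertical edges
response = conv2(img, filt, 'same');     % local response of the filter over the image
response = max(response, 0);             % rectification, as in a typical convnet layer
disp(max(response(:)))                   % the filter "lights up" along the edge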

Page 26:

Needless to say, there are considerations

These work well with 2-D images, because the input data map nicely onto a convolution layer

The types of problems that deep networks typically tackle include 1000 classes, 1.2m training images, 50k validation images, 150k test images

Current implementations are optimised for highly-parallel computations (the convolution layer is spatial…), which lend themselves to GPUs

Even so, they take up to a week to train

Page 27:

Dr. David A. Clifton

Group Leader – Computational Health Informatics (CHI) Laboratory

Associate Director – Oxford Centre for Affordable Healthcare Technology

College Lecturer in Engineering Science, Balliol College

Artificial Neural Networks B1 Machine Learning

“Though they sink through the sea

they shall rise again

and Death shall have no dominion.” - Dylan Thomas