Machine learning - Institute of Nuclear Physics Polish ... · 26.02.2019 M. Wolter, Machine...

Machine learning, Lecture 4. Marcin Wolter, IFJ PAN, 26 February 2019. Training - cross-validation. Optimization of hyperparameters. Deep learning.


26.02.2019 1M. Wolter, Machine Learning

Machine learning Lecture 4

Marcin Wolter

IFJ PAN

26 February 2019

● Training - cross-validation.

● Optimization of hyperparameters.

● Deep learning


Supervised training


Overtraining

[Figure panels: “Correct” and “Overtraining”]

● Overtraining – the algorithm “learns” the particular events, not the rules.

● This effect appears for all ML algorithms.

● Remedy – checking with another, independent dataset.

[Figure: “test” and “training” curves with the “STOP” point marked]

Example of using a Neural Network.

[Figure panels: “Training sample” and “Test sample”]

BUT WE USE ONLY A PART OF THE DATA FOR TRAINING!
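A minimal sketch of the overtraining effect, assuming a toy one-dimensional regression problem that is not part of the lecture. Polynomials of increasing degree are fitted to a small noisy training sample and then checked on an independent test sample. The helper names `make_sample` and `fit_and_score`, the sine target and the noise level are illustrative assumptions.

```python
import numpy as np

def make_sample(n, rng):
    """Noisy samples of a smooth target (hypothetical toy data)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

def fit_and_score(degree, train, test):
    """Fit a polynomial on the training sample, return (train MSE, test MSE)."""
    x_tr, y_tr = train
    x_te, y_te = test
    coeffs = np.polyfit(x_tr, y_tr, degree)

    def mse(x, y):
        return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

    return mse(x_tr, y_tr), mse(x_te, y_te)

rng = np.random.default_rng(0)
train = make_sample(15, rng)      # small training sample
test = make_sample(200, rng)      # independent test sample
scores = {d: fit_and_score(d, train, test) for d in (1, 3, 9)}
for d, (e_tr, e_te) in scores.items():
    print(f"degree {d}: train MSE {e_tr:.3f}, test MSE {e_te:.3f}")
```

The training error keeps falling as the model gets more flexible, so only the independent sample can show whether the fit has started to follow the noise.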


How to train an ML algorithm?

● How to avoid overtraining while learning?

● Should we use one sample for training and another for validating?

● Then we increase the error, because only a part of the data is used for training.

● Second remark: to avoid overtraining and to find the performance of the trained algorithm we should use one more, third data sample to measure the final performance of the ML algorithm.

● How to optimize the hyperparameters of the ML algorithm (number of trees and their depth for BDT, number of hidden layers and nodes for a Neural Network)?
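The three samples mentioned above can be sketched as a random split of the event indices. This is only an illustration: the function name `split_indices` and the 60/20/20 fractions are assumptions, not values from the lecture.

```python
import numpy as np

def split_indices(n_events, frac_train=0.6, frac_valid=0.2, seed=0):
    """Randomly split event indices into training, validation and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_events)
    n_train = int(frac_train * n_events)
    n_valid = int(frac_valid * n_events)
    train = order[:n_train]                      # used to fit the model
    valid = order[n_train:n_train + n_valid]     # used to tune hyperparameters
    test = order[n_train + n_valid:]             # used once, for the final performance
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```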


Validation


Cross-validation

● We have an independent training sample L_n and a test sample T_m.

● Error level of the classifier built on the training sample L_n

● An estimator using "recycled data" (the same data for training and for error calculation) is biased.

● Reduction of bias: division of the data into two parts (training & validation). But then we use only half of the information.

● Cross-validation – out of the sample L_n we remove just one event j, train the classifier, and validate on the single event j. We repeat this n times and get the estimator as the average of all n single-event estimators.

● We get an estimator which is unbiased (in the limit of large n), but the training is CPU demanding.
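The leave-one-out procedure can be sketched as follows. The nearest-centroid classifier and the two-Gaussian toy data are stand-ins chosen only to keep the example short; any classifier could take their place.

```python
import numpy as np

def nearest_centroid_predict(x_train, y_train, x_new):
    """Toy classifier: assign the class whose training centroid is closest."""
    classes = np.unique(y_train)
    centroids = np.array([x_train[y_train == c].mean(axis=0) for c in classes])
    distances = np.linalg.norm(centroids - x_new, axis=1)
    return classes[np.argmin(distances)]

def leave_one_out_error(x, y):
    """Average misclassification rate over n trainings, each leaving out one event."""
    n = len(y)
    errors = 0
    for j in range(n):
        keep = np.arange(n) != j                   # remove event j from the sample
        prediction = nearest_centroid_predict(x[keep], y[keep], x[j])
        errors += int(prediction != y[j])          # validate on the single event j
    return errors / n

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(3.0, 1.0, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
loo_error = leave_one_out_error(x, y)
print(f"leave-one-out error estimate: {loo_error:.3f}")
```

The loop makes the CPU cost visible: the classifier is trained n times, once per removed event.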


Cross-validation

● Intermediate solution – k-fold cross-validation

● The sample is divided into k subsamples; k-1 of them are used for training and the remaining one for validation. The procedure is repeated k times, each time with a different subsample used for validation.

● Lower CPU usage compared to leave-one-out cross-validation.

● Recommended k ~ 10.

● In the simplest case the resulting classifier might be an average of all k classifiers (or they might be combined in another way).
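A minimal sketch of k-fold cross-validation with k = 10. The helper names `k_fold_indices` and `cross_validate` are hypothetical, and the "model" is deliberately trivial (the mean of the training targets) so that the example stays short.

```python
import numpy as np

def k_fold_indices(n_events, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_events), k)
    for i in range(k):
        valid_idx = folds[i]                       # one subsample for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx

def cross_validate(x, y, k, fit, error):
    """Return the validation error of each of the k trainings."""
    results = []
    for train_idx, valid_idx in k_fold_indices(len(y), k):
        model = fit(x[train_idx], y[train_idx])
        results.append(error(model, x[valid_idx], y[valid_idx]))
    return np.array(results)

# Toy usage: the "model" is just the mean of the training targets.
rng = np.random.default_rng(2)
x = rng.normal(size=(100, 1))
y = 2.0 * x[:, 0] + rng.normal(scale=0.1, size=100)
errors = cross_validate(
    x, y, k=10,
    fit=lambda x_tr, y_tr: y_tr.mean(),
    error=lambda m, x_va, y_va: float(np.mean((y_va - m) ** 2)),
)
print(f"CV error: {errors.mean():.3f} +/- {errors.std():.3f}")
```

The spread of the k per-fold errors gives a rough uncertainty on the cross-validated error.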


Cross-validation

● 4-fold cross-validation

● Finding the dependence of the CV score on alpha (medical data)

● We can draw the mean and the standard deviation. In the next plot we draw the dependence for each fold.

● As a result we can estimate the error of the cross-validated classifier.


https://www.slideshare.net/0xdata/top-10-data-science-practitioner-pitfalls

Model performance


Train vs Test vs Valid


Hyperparameter optimization

● Nearly every ML method has a few hyperparameters (structure of the Neural Net, number of trees and their depth for BDT, etc.).

● They should be optimized for a given problem.

● Task: for a given data sample find a set of hyperparameters such that the estimated error of the given method is minimized.

● Looks like a typical minimization problem (like fitting), but:

– Getting each measurement is costly

– High noise

– We can get the value of the minimized function (that is, our error) at a point x of the hyperparameter space, but we can't easily get its derivative.


Optimization of hyperparameters

● How to optimize:

– “Grid search” - scan over a grid of parameter values.

– “Random search”

– Some type of fitting…

● A popular method is “Bayesian optimization”:

– Build the probability model

– Take “a priori” distributions of parameters

– Find the point in the hyperparameter space where you can maximally improve your model

– Find the value of the error at this point

– Find the “a posteriori” probability distribution

– Repeat
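Grid search and random search can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the function `toy_cv_error` stands in for a costly, noisy cross-validation error, and the two hyperparameters (`depth`, `learning_rate`) and their ranges are hypothetical.

```python
import itertools
import numpy as np

def toy_cv_error(depth, learning_rate, rng):
    """Hypothetical noisy 'CV error' with a minimum near depth=4, learning_rate=0.1."""
    clean = (depth - 4) ** 2 / 16.0 + (np.log10(learning_rate) + 1.0) ** 2
    return clean + rng.normal(0.0, 0.01)

rng = np.random.default_rng(3)

# Grid search: evaluate every combination on a fixed grid.
depths = [2, 4, 6, 8]
rates = [0.001, 0.01, 0.1, 1.0]
grid = [(toy_cv_error(d, r, rng), d, r) for d, r in itertools.product(depths, rates)]
best_grid = min(grid)

# Random search: the same budget of 16 evaluations, drawn at random.
trials = []
for _ in range(16):
    d = int(rng.integers(2, 9))                 # depth in 2..8
    r = 10.0 ** rng.uniform(-3.0, 0.0)          # log-uniform learning rate
    trials.append((toy_cv_error(d, r, rng), d, r))
best_random = min(trials)

print("grid search best (error, depth, rate):  ", best_grid)
print("random search best (error, depth, rate):", best_random)
```

Both methods evaluate their points independently of each other, which is why they are easy to run in parallel.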


How does it work in practice?

● Straight line y(x, w) = w_0 + w_1 x fit to the data.

1) Gaussian prior, no data used

2) First data point. We find the likelihood based on this point (left plot) and multiply: prior × likelihood. We get the posterior distribution (right plot).

3) We add the second point and repeat the procedure.

4) Adding all the points one by one.

Remark: the data are noisy.
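The four steps can be sketched with the standard conjugate update for a Gaussian prior and Gaussian noise of known variance. The "true" line, the noise level and the prior width below are assumptions made for the example, not values taken from the slides.

```python
import numpy as np

def update_posterior(mean, cov, x, t, noise_var):
    """One Bayesian update of the weights (w0, w1) with a single point (x, t).

    A Gaussian prior N(mean, cov) times a Gaussian likelihood gives a Gaussian posterior.
    """
    phi = np.array([1.0, x])                       # basis functions (1, x)
    prior_precision = np.linalg.inv(cov)
    post_precision = prior_precision + np.outer(phi, phi) / noise_var
    post_cov = np.linalg.inv(post_precision)
    post_mean = post_cov @ (prior_precision @ mean + phi * t / noise_var)
    return post_mean, post_cov

rng = np.random.default_rng(4)
true_w = np.array([-0.3, 0.5])                     # assumed "true" line
noise_sigma = 0.2                                  # assumed noise level
xs = rng.uniform(-1.0, 1.0, 50)
ts = true_w[0] + true_w[1] * xs + rng.normal(0.0, noise_sigma, 50)

mean, cov = np.zeros(2), np.eye(2) * 2.0           # 1) Gaussian prior, no data
for x, t in zip(xs, ts):                           # 2)-4) add points one by one
    mean, cov = update_posterior(mean, cov, x, t, noise_sigma ** 2)
print("posterior mean of (w0, w1):", mean)
```

Each posterior serves as the prior for the next point, and the covariance shrinks as points are added.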


Starting point

Unknown function (with noise), four observations. Where should we do the next costly probing?

https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf

We are searching for a MAXIMUM


A posteriori distribution

The a posteriori distribution of possible functions, that is, of the functions which could have generated the observed data points.

A set of functions


A posteriori functions – Gaussian Processes (GP)

These functions should be somehow parametrized, for example by a Gaussian process, in which the function values at any finite set of points follow a joint Gaussian distribution.
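A minimal sketch of a Gaussian process posterior computed from four observations, assuming a zero-mean prior and a squared-exponential kernel with unit prior variance. The kernel, its length scale and the observed values are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two sets of 1-D points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise_var=1e-6, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP at the points x_new."""
    k_oo = rbf_kernel(x_obs, x_obs, length_scale) + noise_var * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new, length_scale)
    solved = np.linalg.solve(k_oo, k_on)            # K_oo^{-1} K_on
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_on * solved, axis=0)       # prior variance k(x, x) = 1
    return mean, np.maximum(var, 0.0)

x_obs = np.array([-2.0, -0.5, 1.0, 2.5])            # four observations
y_obs = np.sin(x_obs)                               # assumed observed values
x_new = np.linspace(-3.0, 3.0, 61)
mu, var = gp_posterior(x_obs, y_obs, x_new)
print("largest posterior variance at x =", x_new[np.argmax(var)])
```

The posterior variance is close to zero at the observed points and grows away from them, which is the information used to decide where to probe next.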


Acquisition function

● The posterior GP (Gaussian Process) gives us the mean of the GP functions μ(x) and their variance σ²(x).

– Exploration – searching where the variance is large

– Exploitation – searching for the smallest/greatest (depending on sign and convention) value of the mean μ(x)

● The acquisition policy has to balance these two approaches:

– Probability of Improvement (Kushner 1964)

– Expected Improvement (Mockus 1978)

– GP Upper Confidence Bound (Srinivas et al. 2010), written here as a lower confidence bound for minimization:

• a_LCB(x) = μ(x) - κσ(x)
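The three acquisition functions can be sketched in their commonly used closed forms. The slides do not show the formulas for Probability of Improvement and Expected Improvement, so the expressions below are the standard textbook versions for maximization, given as an assumption; `best` denotes the best value observed so far and `sigma` must be positive.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, best):
    """P(f(x) > best) under the GP posterior N(mu, sigma^2) (maximization)."""
    return normal_cdf((mu - best) / sigma)

def expected_improvement(mu, sigma, best):
    """E[max(f(x) - best, 0)] under N(mu, sigma^2) (maximization)."""
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)

def lower_confidence_bound(mu, sigma, kappa=2.0):
    """a_LCB(x) = mu(x) - kappa * sigma(x), to be minimized."""
    return mu - kappa * sigma

# Same posterior mean, different uncertainty: the more uncertain point scores higher.
print(expected_improvement(0.0, 1.0, 0.0), expected_improvement(0.0, 2.0, 0.0))
```

The parameter κ sets the balance: a large κ favours exploration, a small κ favours exploitation.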


Here we are searching for a minimum


Where to put the next point?

● Our next chosen point x should have a high mean (exploitation) & a high variance (exploration).


We choose the next x where the acquisition function u(x) has its maximum.


We perform the sampling and repeat the procedure...
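The whole loop (posterior, acquisition function, probing, repeat) can be sketched in a few lines. This is a toy illustration under stated assumptions: a zero-mean GP with a squared-exponential kernel, an upper-confidence-bound acquisition function with κ = 2, a search on a fixed grid, and a hypothetical cheap `objective` standing in for the costly function. Here we search for a maximum.

```python
import numpy as np

def kernel(a, b, length_scale=0.5):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def posterior(x_obs, y_obs, x_grid, noise_var=1e-4):
    """GP posterior mean and standard deviation on the grid."""
    k_oo = kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    k_og = kernel(x_obs, x_grid)
    solved = np.linalg.solve(k_oo, k_og)
    mu = solved.T @ y_obs
    var = np.maximum(1.0 - np.sum(k_og * solved, axis=0), 0.0)
    return mu, np.sqrt(var)

def objective(x):
    """Hypothetical costly function; its maximum on [0, 1] is at x = 0.7."""
    return -(x - 0.7) ** 2

x_grid = np.linspace(0.0, 1.0, 201)
x_obs = np.array([0.0, 1.0])                  # starting observations
y_obs = objective(x_obs)
for t in range(10):
    mu, sigma = posterior(x_obs, y_obs, x_grid)
    u = mu + 2.0 * sigma                      # acquisition function u(x)
    x_next = x_grid[np.argmax(u)]             # choose the next x at its maximum
    x_obs = np.append(x_obs, x_next)          # costly probing at the chosen point
    y_obs = np.append(y_obs, objective(x_next))
best_x = x_obs[np.argmax(y_obs)]
print("best x found:", best_x)
```

Each iteration needs the result of the previous probing, so the loop cannot be parallelized in this simple form.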


[Figure: iterations t=2, t=3, t=4 and t=5; the MAXIMUM is marked]


Limitations

● Bayesian optimization depends on:

– the parameters chosen

– the acquisition function

– the prior selected...

● It’s sequential.

● There are alternative methods which can be run in parallel, like Random Search or the Tree-structured Parzen Estimator (TPE) used by the HyperOpt package https://github.com/hyperopt/hyperopt.


Implementations

● Python

– Spearmint https://github.com/JasperSnoek/spearmint

– GPyOpt https://github.com/SheffieldML/GPyOpt

– RoBO https://github.com/automl/RoBO

– Scikit-optimize https://github.com/MechCoder/scikit-optimize

● C++

– MOE https://github.com/yelp/MOE


Articles

● Brochu, E., Cora, V. M., and De Freitas, N. (2010). A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.

● Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. (2016). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175.

● Nice tutorial:

https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf


Deep Learning


● What does “deep learning” mean?

● Why does it give better results than other methods in pattern recognition, speech recognition and others?

Deep Learning


Short answer: ‘Deep Learning’ means using a neural network with many hidden layers.

A series of hidden layers first identifies the features and then processes them in a chain of operations: feature identification → further feature identification → … → selection

But NNs have been well known since the 1980s???

We always had good algorithms to train NNs with one or two hidden layers.

But they were failing for more layers.

NEW: algorithms for training deep networks, huge computing power


How do we train a NN

[Figure: a single neuron with inputs 1.4, -2.5, -0.06, weights W1, W2, W3 and activation function f(x)]


[Figure: a neuron with weights 2.7, -8.6, 0.002, inputs 1.4, -2.5, -0.06 and activation function f(x)]

x = (-0.06)×2.7 + (-2.5)×(-8.6) + 1.4×0.002 = 21.34
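The weighted sum can be checked in a few lines. The slide only names the activation function f(x); the logistic sigmoid used below is an assumed choice for illustration.

```python
import math

inputs = [-0.06, -2.5, 1.4]
weights = [2.7, -8.6, 0.002]

# Weighted sum of the inputs, as on the slide.
x = sum(w * v for w, v in zip(weights, inputs))

def sigmoid(x):
    """Logistic activation; an assumed choice, the slide only says f(x)."""
    return 1.0 / (1.0 + math.exp(-x))

output = sigmoid(x)
print(f"x = {x:.2f}, f(x) = {output:.6f}")
```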


NN training

Inputs          Class
1.4  2.7  1.9   0
3.8  3.4  3.2   0
6.4  2.8  1.7   1
4.1  0.1  0.2   0
etc …


Training data: the table of inputs and classes from the previous slide.

Initialization with random weights


Training data: the same table of inputs and classes.

Reading the data

Processing them by the network

Result compared with the true value

[Figure: inputs 1.4, 2.7, 1.9 give the network output 0.8; the true class is 0, so the error is 0.8]

The weights are modified, based on this error.

We repeat many times, each time modifying the weights.

Training algorithms take care that the error gets smaller and smaller.
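The training cycle can be sketched on the four rows of the table. To keep it short, the "network" here is a single sigmoid neuron without hidden layers, trained with plain gradient descent on the cross-entropy loss; the learning rate, the number of steps and the random seed are assumptions.

```python
import numpy as np

# The four example rows from the training table: three inputs and a class label.
X = np.array([[1.4, 2.7, 1.9],
              [3.8, 3.4, 3.2],
              [6.4, 2.8, 1.7],
              [4.1, 0.1, 0.2]])
y = np.array([0.0, 0.0, 1.0, 0.0])

def forward(X, w, b):
    """Network output in (0, 1) for each row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def loss(X, y, w, b):
    """Cross-entropy between the network output and the true class."""
    p = np.clip(forward(X, w, b), 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

rng = np.random.default_rng(5)
w = rng.normal(0.0, 0.1, 3)                  # initialization with random weights
b = 0.0
history = [loss(X, y, w, b)]
for step in range(500):
    error = forward(X, w, b) - y             # result compared with the true value
    w -= 0.05 * (X.T @ error) / len(y)       # weights modified based on the error
    b -= 0.05 * float(np.mean(error))
    history.append(loss(X, y, w, b))
print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```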


Remark

● If the activation function is non-linear, then a neural network with one hidden layer (and enough neurons) can in principle classify any problem (fit any function).

● There exists a set of weights which allows it to do that. However, the problem is to find this set of weights...


How to identify the features?


What is this neuron doing?


Neurons in the hidden layers are the self-organizing feature detectors

[Figure: input pixels (numbered 1 … 63) feeding one hidden neuron; connections marked as “Huge weight” or “Small weight”]


What can it detect?

[Figure: input pixels (numbered 1 … 63) feeding one hidden neuron; connections marked as “Huge weight” or “Small weight”]

It sends a strong signal when it finds a horizontal line in the top row of pixels and ignores anything else.
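A minimal sketch of such a neuron, assuming an 8 × 8 pixel image and illustrative weight values (1.0 for the "huge" weights, 0.01 for the "small" ones). Neither the image size nor the weight values are taken from the slides.

```python
import numpy as np

SIDE = 8                                   # assumed 8 x 8 pixel image

# Huge weights on the top row of pixels, small weights everywhere else.
weights = np.full((SIDE, SIDE), 0.01)
weights[0, :] = 1.0

def neuron_signal(image):
    """Weighted sum of the pixel values, before any activation function."""
    return float(np.sum(weights * image))

top_line = np.zeros((SIDE, SIDE))
top_line[0, :] = 1.0                       # horizontal line in the top row
bottom_line = np.zeros((SIDE, SIDE))
bottom_line[-1, :] = 1.0                   # the same line in the bottom row

print(neuron_signal(top_line), neuron_signal(bottom_line))
```

The same line in a different row gives almost no signal, which shows how strongly such a detector depends on the position of the feature.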


What can it detect?

[Figure: input pixels (numbered 1 … 63) feeding one hidden neuron; connections marked as “Huge weight” or “Small weight”]

It sends a strong signal if a dark region is found in the upper left corner.


What features should a neural network recognizing handwriting detect?


Vertical lines


Horizontal lines


Circles

And what about the sensitivity to the position of these features???


Next layers can learn the higher level features

Detecting lines in given places, etc. …

↓

Higher level detectors, etc. …


Deep network

Each hidden layer is an automatic feature detector

an auto-encoder


An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs.

The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. If there is a structure in the data, then it should find features.
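A minimal sketch of the idea, assuming the simplest possible case: a linear autoencoder (no activation function) with a one-unit bottleneck, trained by gradient descent with the inputs as targets. The toy data, the layer sizes and the learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
# Toy data with structure: 4-dimensional points lying close to a 1-dimensional line.
direction = np.array([1.0, -1.0, 0.5, 2.0])
direction /= np.linalg.norm(direction)
X = rng.normal(size=(200, 1)) * direction + rng.normal(scale=0.05, size=(200, 4))

n_hidden = 1                                   # bottleneck: the learned encoding
W_enc = rng.normal(scale=0.1, size=(4, n_hidden))
W_dec = rng.normal(scale=0.1, size=(n_hidden, 4))

def reconstruction_loss(X, W_enc, W_dec):
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial = reconstruction_loss(X, W_enc, W_dec)
for step in range(3000):
    code = X @ W_enc                           # encoding
    residual = code @ W_dec - X                # target values equal the inputs
    grad_out = 2.0 * residual / residual.size
    grad_dec = code.T @ grad_out               # backpropagation through the decoder
    grad_enc = X.T @ (grad_out @ W_dec.T)      # ... and through the encoder
    W_enc -= 0.2 * grad_enc
    W_dec -= 0.2 * grad_dec
final = reconstruction_loss(X, W_enc, W_dec)
print(f"reconstruction loss: {initial:.4f} -> {final:.4f}")
```

No class labels are used anywhere: the single hidden unit has to capture the direction along which the data vary in order to reconstruct the inputs.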


Hidden layers are trained to identify features


The last layer performs the actual classification


Such an organised network makes sense...

Our brains probably work in a similar way


Unfortunately, until recent (10?) years we didn’t know how to train a deep network


Machine Learning and Deep Learning

● Traditional ML (BDT, NN etc) – the scientist finds good, well discriminating variables (~10), called “features”, and performs classification using them as inputs for the ML algorithm.

● Deep Learning – thousands or millions of input variables (like pixels of a photo), the features are automagically extracted during training.


Deeper network?

Traditional Neural Networks have one or two hidden layers.

Deep Neural Network: a stack of sequentially trained autoencoders, which recognize different features (more complicated in each layer) and automatically prepare a new representation of the data. This is how our brains are organized.

But how to train such a stack?

http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/


Training a Deep Neural Network

● In the early 2000s, attempts to train deep neural networks were frustrated by the apparent failure of the well-known training algorithms (backpropagation, gradient descent). Many researchers claimed that NNs were finished and that only Support Vector Machines and Boosted Decision Trees should be used!

● In 2006, Hinton, Osindero and Teh¹ succeeded for the first time in training a deep neural network by first initializing its parameters sequentially, layer by layer. Each layer was trained to produce a representation of its inputs that served as the training data for the next layer. Then the network was tweaked using gradient descent (the standard algorithm).

● There was a common belief that Deep NN training requires careful initialization of parameters and sophisticated machine learning algorithms.

¹ Hinton, G. E., Osindero, S. and Teh, Y. (2006), A fast learning algorithm for deep belief nets, Neural Computation 18, 1527-1554.


Training with brute force

● In 2010, a surprising counterexample to the conventional wisdom was demonstrated¹.

● A deep neural network was trained to classify the handwritten digits in the MNIST² data set, which comprises 60,000 images of 28 × 28 = 784 pixels for training and 10,000 images for testing.

● They showed that a plain DNN with architecture (784, 2500, 2000, 1500, 1000, 500, 10 – HUGE!!!), trained using standard stochastic gradient descent (Minuit on steroids!), outperformed all other methods that had been applied to the MNIST data set as of 2010. The error rate of this ~12 million parameter DNN was 35 images out of 10,000.

The training images were randomly and slightly deformed before every training epoch. The entire set of 60,000 undeformed images could be used as the validation set during training, since none were used as training data.
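The quoted size of about 12 million parameters can be checked from the architecture, assuming fully connected layers with one bias per unit (the bias terms are an assumption here; they change the total only marginally).

```python
layers = [784, 2500, 2000, 1500, 1000, 500, 10]

# Each pair of consecutive layers is fully connected: n_in * n_out weights,
# plus one bias per unit of the receiving layer (biases are an assumption here).
n_weights = sum(n_in * n_out for n_in, n_out in zip(layers[:-1], layers[1:]))
n_biases = sum(layers[1:])
print(n_weights, n_biases, n_weights + n_biases)
```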

¹ Ciresan D. C., Meier U., Gambardella L. M., Schmidhuber J., Deep, big, simple neural nets for handwritten digit recognition. Neural Comput. 2010 Dec; 22 (12): 3207-20.

² http://yann.lecun.com/exdb/mnist/


Why didn’t it work before?

● More data, clusters of GPU/CPU (computing power!)

● The particular non-linear activation function chosen for neurons in a neural net has a big impact on performance, and the one often used by default is not a good choice.

● The old vanishing gradient problem happens, basically, because backpropagation involves a sequence of multiplications that invariably result in smaller derivatives for earlier layers. That is, unless weights are chosen with different scales according to the layer they are in - making this simple change results in significant improvements.
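The sequence of multiplications can be illustrated along a single path through the network, assuming a sigmoid activation (whose derivative is at most 0.25) and one weight per layer. This is a deliberately simplified picture, not a full backpropagation.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def gradient_factor(n_layers, weight=1.0, pre_activation=0.0):
    """Product of the per-layer factors weight * sigmoid'(x) along one path."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= weight * sigmoid_derivative(pre_activation)
    return factor

for n in (1, 2, 5, 10):
    print(n, gradient_factor(n))

# With a weight scale that compensates the derivative, the factor no longer shrinks.
print(gradient_factor(10, weight=4.0))
```

With unit weights the factor falls by at least a factor of four per layer, so the earliest layers of a deep network receive almost no gradient. Choosing the weight scale to compensate for the derivative keeps the factor of order one.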


An example – pattern recognition with KERAS and TensorFlow

● CIFAR10 small image classification. Dataset of 50,000 32x32 color training images, labeled over 10 categories, and 10,000 test images.


Deep Neural Network

Training: ~24 h on a 12-core machine on our Cloud cluster.

200 epochs


Results

[Figure: example images labelled “Recognized as” and “Really was”]


Confusion matrix


That’s all for today

Applet showing the performance of deep NN:

http://cs.stanford.edu/people/karpathy/convnetjs/


Backup


New method of training (basic info)

Train this layer

Then this

Then this

Then this

At the end this