Machine learning - Institute of Nuclear Physics Polish ... · 26.02.2019 M. Wolter, Machine...

26.02.2019 1M. Wolter, Machine Learning

Machine learning Lecture 4

Marcin Wolter

IFJ PAN

26 February 2019

● Training - cross-validation.● Optimalization of hyperparameters.● Deep learning


Supervised training


Overtraining

Correct

Overtraining

● Overtraining – algorithm “learns” the particular events, not the rules.

● This effect appears for all ML algorithms.● Remedy – checking with another, independent dataset.

test

training

STOP

Example of using Neural Network.

Training sample

Test sample

BUT WE USE JUST ONLY A PART OF DATA

FOR TRAINING!!!


How to train a ML algorithm?

● How to avoid overtraining while learning?

● Should we use one sample for training and another for validating?

● Then we increase the error – we use just a part of data for training.

● Second remark: to avoid ovetraining and find the performance of the trained algorithm we should use one more, third data sample to measure the final performance of the ML algorithm.

● How to optimize the hyperparameters of the ML algorithm (number of trees and their depth for BDT, number of hidden layers, nodes for Neural Network)?


Validation


Cross-validation

● We have independant training sample Ln and a test sample T

m.

● Error level of the classifier built on the training sample Ln

● Estimator using "recycled data" (the same data for training and for error calculation) is biased.

● Reduction of bias: division of data into two parts (training & validation). But than we use just a half of information only.

● Cross-validation – out of sample Ln we remove just one event j, train

classifier, validate on single event j. We repeat n times and get the estimator (an average of all n estimators):

● We get an estimator, which is unbiased (in limit of huge n), but training is CPU demanding.


Cross-validation

● Intermediate solution – k-fold cross-validation

● The sample is divided into k subsamples, k-1 of them we use for training, the one for validation. Then the procedure is repeated with other subsamples and the procedure is repeated k times.

● Smaller CPU usage comparing to cross-validation.

● Recommended k ~ 10 .

● Resulting classifier might be in the simplest case an average of all k classifiers (or they might be joined together in another way) .


Cross-validation

● 4-times folding

● Finding a dependence of CV from alpha (medical data)

● We can draw the mean and a standard deviation. In the next plot we draw the dependence for each folding.

● As a result we can estimate an error of the cross-validated classifier.


https://www.slideshare.net/0xdata/top-10-data-science-practitioner-pitfalls

Model performance

https://www.slideshare.net/0xdata/top-10-data-science-practitioner-pitfalls


Train vs Test vs Valid


Hyperparameter optimization

● Nearly each ML method has few hyperparameters (structure of the Neural Net, number of trees and their depth for BDT etc).

● They should be optimized for a given problem.

● Task: for a given data sample find a set of hyperparameters, that the estimated error of the given method is minimized.

● Looks like a typical minimization problem (fitting like), but:

– Getting each measurement is costly

– High noise

– We can get the value of the minimized function (so our error) in the pont x of the hyperparameter space, but we can't get the differential easily.


Optimization of hyperparameters

● How to optimize:

– „Grid search” - scan over all possible values of parameters.

– „Random search”

– Some type of fitting…● Popular method is the „bayesian optimization”

– Build the probability model

– Take „a priori” distributions of parameters

– Find, for which point in the hyperparameter space you can maximally improve your model

– Find the value of error

– Find the „a posteriori” probability distribution

– Repeat


How does it work in practice?

● Straight line fitting

y(x, w) = w0 + w

1x fit to the data.

1) Gaussian prior, no data used

2) First data point. We find the likelihood based on this point (left plot)and multiply: priori*likelihood. We get the posterior distribution (right plot).

3) We add the second point and repeat the procedure.

4) Adding all the points one by one.

Remark: data are noisy.


Starting point

Unknown function (with noise), four observations.Where should we do the next costly probing?

https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf

We are searching for a MAXIMUM



A posteriori distribution

The a posteriori distribution of possible functions, those functions could generate the observed data points.

A set of functions


A posteriori functions – Gaussian Processes (GP)

These functions should be somehow parametrized, for example they could be Gaussian functions.


Acquisition function

● Posterior GP (Gaussian Processes) give us the mean of GP functions μ(x) and their expected variation σ2(x).

– Exploration – searching for huge variation

– Exploitation – searching for a smallest/greatest (depends on sign and convention) value of mean μ(x)

● The acquisition policy has to balance these two approaches:

– Probability of Improvement (Kushner 1964):

– Expected Improvement (Mockus 1978)

– GP Upper Confidence Bound (Srinivas et al. 2010):

• aLCB

(x) = μ(x) - κσ(x)


Here we are searching for minimum


Where to put the next point?

● Our next chosen point( x ) should has high mean (exploitation) & high variance (exploration).


We choose next x u(x) – acqusition function (finding maximum)


Dokonujemy próbkowania i powtarzamy procedurę...


t=2 t=3

t=4 t=5

MAXIMUM


Limitations

● Bayesian optimization depends on the parameters chosen

● On the acquisition function

● On the prior selected....

● It’s sequential.

● There are alternative methods, which can be done in parallel (like Random Search or Tree of Parzen Estimators (TPE) used by the HyperOpt package https://github.com/hyperopt/hyperopt).


Implementations

● Python

– Spearmint https://github.com/JasperSnoek/spearmint

– GPyOpt https://github.com/SheffieldML/GPyOpt

– RoBO https://github.com/automl/RoBO

– Scikit-optimize https://github.com/MechCoder/scikit-optimize

● C++

– MOE https://github.com/yelp/MOE


Articles

● Brochu, E., Cora, V. M., and De Freitas, N. (2010). A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.

● Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. (2016). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175.

● Nice tutorial:





Deep Learning


● What does “deep learning” mean?● Why does it give better results than other methods in pattern recognition,

speach recognition and others?

Deep Learning


Short answer:‘Deep Learning’ - using a neural network with many hidden layersA series of hidden layers makes the feature identification first and processes them in the chain of operations: feature identification→ further feature identification → ……. →selection

But NN are well known starting from 80-ties???

We always had good algorithms to train NN with one or two hidden layers.

But they were failing for more layers

NEW: algorithms for training deep networks

huge computing power


W1

W2

W3

f(x)

1.4

-2.5

-0.06

Activation function

How do we train a NN


2.7

-8.6

0.002

f(x)

1.4

-2.5

-0.06

x = -0.06×2.7 + 2.5×8.6 + 1.4×0.002 = 21.34


NN trainingInputs Class1.4 2.7 1.9 03.8 3.4 3.2 06.4 2.8 1.7 14.1 0.1 0.2 0etc …


Training dataInputs Class1.4 2.7 1.9 03.8 3.4 3.2 06.4 2.8 1.7 14.1 0.1 0.2 0etc …

Initialization with random weights


Training dataInputs Class1.4 2.7 1.9 03.8 3.4 3.2 06.4 2.8 1.7 14.1 0.1 0.2 0etc …

Reading dataProcessing them by networkResult compared with the true value

1.4

2.7

1.9 0.80Errror 0.8

The weights are modified. ModificationBased on this error.

We repeat many times, each time modifying the weightsTraining algorithms take care, that the error is smaller and smaller


Remark

● If the activation function is non-linear, than a neural network with one hidden layer can classify any problem (fits any function).

● There exists a set of weights, which allows to do that. However, the problem is to find this set of weights...


How to identify the features?


What is this neuron doing?


Neurons in the hidden layers are the self-organizing feature detectors

…

1

63

1 5 10 15 20 25 …

Huge weight

Small weight


What can it detect?

…

1

63

1 5 10 15 20 25 …

Huge weight

Small weight

It sends a strong signal, when it finds a horizontal line in the top row of pixels, ignores anything else.


What can it detect?

1

63

1 5 10 15 20 25 …

Small weight

Sends strong signal, if a dark region in the leftupper corner is found.

…

Huge weight

…


What feature should detect a neural network recognizing the handwriting?


63

1

Vertical lines


63

1

Horizontal lines


63

1

Circles

And what about the sensitivity onposition of these features???


Next layers can learn the higher level features

etc …Detecting lines

in given places

v

Higher level detectors

etc …

15.05.2018 51M. Wolter

Deep network

Each hidden layer is an automatic feature detector

an auto-encoder

15.05.2018 52M. Wolter

An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs.

The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. If there is a structure in the data, than it should find features.

15.05.2018 53M. Wolter

Hidden layers are trained to identify features

15.05.2018 54M. Wolter

The last layer performs the actual classification


Such an organised network makes sense...

Our brains probably work in a similar way


Unfortunately, until recent (10?) years we didn’tknow how to train a deep network

15.05.2018 57M. Wolter

Machine Learning and Deep Learning

● Traditional ML (BDT, NN etc) – the scientist finds good, well discriminating variables (~10), called “features”, and performs classification using them as inputs for the ML algorithm.

● Deep Learning – thousands or millions of input variables (like pixels of a photo), the features are automagically extracted during training.

15.05.2018 58M. Wolter

Deeper network?

Traditional Neural Networks have one or two hidden layers.

Deep Neural Network: a stack of sequentially trained auto encoders, which recognize different features (more complicated in each layer) and automatically prepare a new representation of data. This is how our brains are organized.

But how to train such a stack?

http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/

http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/

15.05.2018 59M. Wolter

Training a Deep Neural Network

● In the early 2000s, attempts to train deep neural networks were frustrated by the apparent failure of the well known back-propagation algorithms (backpropagation, gradient descent). Many researchers claimed NN are gone, only Support Vector Machines and Boosted Decision Trees should be used!

● In 2006, Hinton, Osindero and Teh1 first time succeeded in training a deep neural network by first initializing its parameters sequentially, layer by layer. Each layer was trained to produce a representation of its inputs that served as the training data for the next layer. Then the network was tweaked using gradient descent (standard algorithm).

● There was a common belief that Deep NN training requires careful initialization of parameters and sophisticated machine learning algorithms.

1Hinton, G. E., Osindero, S. and Teh, Y., A fast learning algorithm for deep belief nets, Neural Computation 18, 1527-1554.

15.05.2018 60M. Wolter

Training with a brute force● In 2010, a surprising counter example to the conventional wisdom was

demonstrated1.● Deep neural network was trained to classify the handwritten digits in the

MNIST2 data set, which comprises 60,000 28 × 28 = 784 pixel images for training and 10,000 images for testing.

● They showed that a plain DNN with architecture (784, 2500, 2000, 1500, 1000, 500, 10 – HUGE!!!), trained using standard stochastic gradient descent (Minuit on steroids!), outperformed all other methods that had been applied to the MNIST data set as of 2010. The error rate of this 12 million parameter ∼DNN was 35 images out of 10,000.

The training images were randomly and slightly deformed before every training epoch. The entire set of 60,000 undeformed images could be used as the validation set during training, since none were used as training data.

1 Ciresan DC, Meier U, Gambardella LM, Schmidhuber J. ,Deep, big, simple neural nets for handwritten digit recognition. Neural Comput. 2010 Dec; 22 (12): 3207-20.2 http://yann.lecun.com/exdb/mnist/

http://yann.lecun.com/exdb/mnist/

15.05.2018 61M. Wolter

Why it didn’t work before?● More data, clusters of GPU/CPU (computing power!)

● The particular non-linear activation function chosen for neurons in a neural net makes a big impact on performance, and the one often used by default is not a good choice.

● The old vanishing gradient problem happens, basically, because backpropagation involves a sequence of multiplications that invariably result in smaller derivatives for earlier layers. That is, unless weights are chosen with difference scales according to the layer they are in - making this simple change results in significant improvements.


An example – pattern recognition with KERAS and TensorFlow

● CIFAR10 small image classification. Dataset of 50,000 32x32 color training images, labeled over 10 categories, and 10,000 test images.


Deep Neural Network

Training: ~24 h on 12 core machine on our Cloud cluster.

200 epochs


ResultsRecognized as Really was


Confusion matrix


That’s all for today

Applet showing the performance of deep NN:

http://cs.stanford.edu/people/karpathy/convnetjs/

http://cs.stanford.edu/people/karpathy/convnetjs/

15.05.2018 68M. Wolter

Backup

15.05.2018 69M. Wolter

New method of training (basic info)

15.05.2018 70M. Wolter


Train this layer

15.05.2018 71M. Wolter


Train this layer

Then this

15.05.2018 72M. Wolter


Trenuj tę warstwę

Then this

Then this

15.05.2018 73M. Wolter


Train this layer

Then this

Then this

Then this

15.05.2018 74M. Wolter


Train this layer

Then this

Then this

Then thisAt the end this

Machine learning - Institute of Nuclear Physics Polish ... · 26.02.2019 M. Wolter, Machine...

Documents

Transcript of Machine learning - Institute of Nuclear Physics Polish ... · 26.02.2019 M. Wolter, Machine...