Machine learning - Institute of Nuclear Physics Polish ... · 26.02.2019 M. Wolter, Machine...
Transcript of Machine learning - Institute of Nuclear Physics Polish ... · 26.02.2019 M. Wolter, Machine...
26.02.2019 1M. Wolter, Machine Learning
Machine learning Lecture 4
Marcin Wolter
IFJ PAN
26 February 2019
● Training - cross-validation.● Optimalization of hyperparameters.● Deep learning
26.02.2019 2M. Wolter, Machine Learning
Supervised training
26.02.2019 3M. Wolter, Machine Learning
Overtraining
Correct
Overtraining
● Overtraining – algorithm “learns” the particular events, not the rules.
● This effect appears for all ML algorithms.● Remedy – checking with another, independent dataset.
test
training
STOP
Example of using Neural Network.
Training sample
Test sample
BUT WE USE JUST ONLY A PART OF DATA
FOR TRAINING!!!
26.02.2019 4M. Wolter, Machine Learning
How to train a ML algorithm?
● How to avoid overtraining while learning?
● Should we use one sample for training and another for validating?
● Then we increase the error – we use just a part of data for training.
● Second remark: to avoid ovetraining and find the performance of the trained algorithm we should use one more, third data sample to measure the final performance of the ML algorithm.
● How to optimize the hyperparameters of the ML algorithm (number of trees and their depth for BDT, number of hidden layers, nodes for Neural Network)?
26.02.2019 5M. Wolter, Machine Learning
Validation
26.02.2019 6M. Wolter, Machine Learning
Cross-validation
● We have independant training sample Ln and a test sample T
m.
● Error level of the classifier built on the training sample Ln
● Estimator using "recycled data" (the same data for training and for error calculation) is biased.
● Reduction of bias: division of data into two parts (training & validation). But than we use just a half of information only.
● Cross-validation – out of sample Ln we remove just one event j, train
classifier, validate on single event j. We repeat n times and get the estimator (an average of all n estimators):
● We get an estimator, which is unbiased (in limit of huge n), but training is CPU demanding.
26.02.2019 7M. Wolter, Machine Learning
Cross-validation
● Intermediate solution – k-fold cross-validation
● The sample is divided into k subsamples, k-1 of them we use for training, the one for validation. Then the procedure is repeated with other subsamples and the procedure is repeated k times.
● Smaller CPU usage comparing to cross-validation.
● Recommended k ~ 10 .
● Resulting classifier might be in the simplest case an average of all k classifiers (or they might be joined together in another way) .
26.02.2019 8M. Wolter, Machine Learning
Cross-validation
● 4-times folding
● Finding a dependence of CV from alpha (medical data)
● We can draw the mean and a standard deviation. In the next plot we draw the dependence for each folding.
● As a result we can estimate an error of the cross-validated classifier.
26.02.2019 9M. Wolter, Machine Learning
https://www.slideshare.net/0xdata/top-10-data-science-practitioner-pitfalls
Model performance
26.02.2019 10M. Wolter, Machine Learning
Train vs Test vs Valid
26.02.2019 11M. Wolter, Machine Learning
Hyperparameter optimization
● Nearly each ML method has few hyperparameters (structure of the Neural Net, number of trees and their depth for BDT etc).
● They should be optimized for a given problem.
● Task: for a given data sample find a set of hyperparameters, that the estimated error of the given method is minimized.
● Looks like a typical minimization problem (fitting like), but:
– Getting each measurement is costly
– High noise
– We can get the value of the minimized function (so our error) in the pont x of the hyperparameter space, but we can't get the differential easily.
26.02.2019 12M. Wolter, Machine Learning
Optimization of hyperparameters
● How to optimize:
– „Grid search” - scan over all possible values of parameters.
– „Random search”
– Some type of fitting…● Popular method is the „bayesian optimization”
– Build the probability model
– Take „a priori” distributions of parameters
– Find, for which point in the hyperparameter space you can maximally improve your model
– Find the value of error
– Find the „a posteriori” probability distribution
– Repeat
26.02.2019 13M. Wolter, Machine Learning
How does it work in practice?
● Straight line fitting
y(x, w) = w0 + w
1x fit to the data.
1) Gaussian prior, no data used
2) First data point. We find the likelihood based on this point (left plot)and multiply: priori*likelihood. We get the posterior distribution (right plot).
3) We add the second point and repeat the procedure.
4) Adding all the points one by one.
Remark: data are noisy.
26.02.2019 14M. Wolter, Machine Learning
Starting point
Unknown function (with noise), four observations.Where should we do the next costly probing?
https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
We are searching for a MAXIMUM
26.02.2019 15M. Wolter, Machine Learning
A posteriori distribution
The a posteriori distribution of possible functions, those functions could generate the observed data points.
A set of functions
26.02.2019 16M. Wolter, Machine Learning
A posteriori functions – Gaussian Processes (GP)
These functions should be somehow parametrized, for example they could be Gaussian functions.
26.02.2019 17M. Wolter, Machine Learning
Acquisition function
● Posterior GP (Gaussian Processes) give us the mean of GP functions μ(x) and their expected variation σ2(x).
– Exploration – searching for huge variation
– Exploitation – searching for a smallest/greatest (depends on sign and convention) value of mean μ(x)
● The acquisition policy has to balance these two approaches:
– Probability of Improvement (Kushner 1964):
– Expected Improvement (Mockus 1978)
– GP Upper Confidence Bound (Srinivas et al. 2010):
• aLCB
(x) = μ(x) - κσ(x)
26.02.2019 18M. Wolter, Machine Learning
Here we are searching for minimum
26.02.2019 19M. Wolter, Machine Learning
Where to put the next point?
● Our next chosen point( x ) should has high mean (exploitation) & high variance (exploration).
26.02.2019 20M. Wolter, Machine Learning
We choose next x u(x) – acqusition function (finding maximum)
26.02.2019 21M. Wolter, Machine Learning
Dokonujemy próbkowania i powtarzamy procedurę...
26.02.2019 22M. Wolter, Machine Learning
t=2 t=3
t=4 t=5
MAXIMUM
26.02.2019 23M. Wolter, Machine Learning
Limitations
● Bayesian optimization depends on the parameters chosen
● On the acquisition function
● On the prior selected....
● It’s sequential.
● There are alternative methods, which can be done in parallel (like Random Search or Tree of Parzen Estimators (TPE) used by the HyperOpt package https://github.com/hyperopt/hyperopt).
26.02.2019 24M. Wolter, Machine Learning
Implementations
● Python
– Spearmint https://github.com/JasperSnoek/spearmint
– GPyOpt https://github.com/SheffieldML/GPyOpt
– RoBO https://github.com/automl/RoBO
– Scikit-optimize https://github.com/MechCoder/scikit-optimize
● C++
– MOE https://github.com/yelp/MOE
26.02.2019 25M. Wolter, Machine Learning
Articles
● Brochu, E., Cora, V. M., and De Freitas, N. (2010). A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
● Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. (2016). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175.
● Nice tutorial:
https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
26.02.2019 26M. Wolter, Machine Learning
Deep Learning
26.02.2019 27M. Wolter, Machine Learning
● What does “deep learning” mean?● Why does it give better results than other methods in pattern recognition,
speach recognition and others?
Deep Learning
26.02.2019 28M. Wolter, Machine Learning
Short answer:‘Deep Learning’ - using a neural network with many hidden layersA series of hidden layers makes the feature identification first and processes them in the chain of operations: feature identification→ further feature identification → ……. →selection
But NN are well known starting from 80-ties???
We always had good algorithms to train NN with one or two hidden layers.
But they were failing for more layers
NEW: algorithms for training deep networks
huge computing power
26.02.2019 29M. Wolter, Machine Learning
W1
W2
W3
f(x)
1.4
-2.5
-0.06
Activation function
How do we train a NN
26.02.2019 30M. Wolter, Machine Learning
2.7
-8.6
0.002
f(x)
1.4
-2.5
-0.06
x = -0.06×2.7 + 2.5×8.6 + 1.4×0.002 = 21.34
26.02.2019 31M. Wolter, Machine Learning
NN trainingInputs Class1.4 2.7 1.9 03.8 3.4 3.2 06.4 2.8 1.7 14.1 0.1 0.2 0etc …
26.02.2019 32M. Wolter, Machine Learning
Training dataInputs Class1.4 2.7 1.9 03.8 3.4 3.2 06.4 2.8 1.7 14.1 0.1 0.2 0etc …
Initialization with random weights
26.02.2019 33M. Wolter, Machine Learning
Training dataInputs Class1.4 2.7 1.9 03.8 3.4 3.2 06.4 2.8 1.7 14.1 0.1 0.2 0etc …
Reading dataProcessing them by networkResult compared with the true value
1.4
2.7
1.9 0.80Errror 0.8
The weights are modified. ModificationBased on this error.
We repeat many times, each time modifying the weightsTraining algorithms take care, that the error is smaller and smaller
26.02.2019 34M. Wolter, Machine Learning
26.02.2019 35M. Wolter, Machine Learning
26.02.2019 36M. Wolter, Machine Learning
26.02.2019 37M. Wolter, Machine Learning
26.02.2019 38M. Wolter, Machine Learning
26.02.2019 39M. Wolter, Machine Learning
26.02.2019 40M. Wolter, Machine Learning
Remark
● If the activation function is non-linear, than a neural network with one hidden layer can classify any problem (fits any function).
● There exists a set of weights, which allows to do that. However, the problem is to find this set of weights...
26.02.2019 41M. Wolter, Machine Learning
How to identify the features?
26.02.2019 42M. Wolter, Machine Learning
What is this neuron doing?
26.02.2019 43M. Wolter, Machine Learning
Neurons in the hidden layers are the self-organizing feature detectors
…
1
63
1 5 10 15 20 25 …
Huge weight
Small weight
26.02.2019 44M. Wolter, Machine Learning
What can it detect?
…
1
63
1 5 10 15 20 25 …
Huge weight
Small weight
It sends a strong signal, when it finds a horizontal line in the top row of pixels, ignores anything else.
26.02.2019 45M. Wolter, Machine Learning
What can it detect?
1
63
1 5 10 15 20 25 …
Small weight
Sends strong signal, if a dark region in the leftupper corner is found.
…
Huge weight
…
26.02.2019 46M. Wolter, Machine Learning
What feature should detect a neural network recognizing the handwriting?
26.02.2019 47M. Wolter, Machine Learning
63
1
Vertical lines
26.02.2019 48M. Wolter, Machine Learning
63
1
Horizontal lines
26.02.2019 49M. Wolter, Machine Learning
63
1
Circles
And what about the sensitivity onposition of these features???
26.02.2019 50M. Wolter, Machine Learning
Next layers can learn the higher level features
etc …Detecting lines
in given places
v
Higher level detectors
etc …
15.05.2018 51M. Wolter
Deep network
Each hidden layer is an automatic feature detector
an auto-encoder
15.05.2018 52M. Wolter
An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs.
The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. If there is a structure in the data, than it should find features.
15.05.2018 53M. Wolter
Hidden layers are trained to identify features
15.05.2018 54M. Wolter
The last layer performs the actual classification
26.02.2019 55M. Wolter, Machine Learning
Such an organised network makes sense...
Our brains probably work in a similar way
26.02.2019 56M. Wolter, Machine Learning
Unfortunately, until recent (10?) years we didn’tknow how to train a deep network
15.05.2018 57M. Wolter
Machine Learning and Deep Learning
● Traditional ML (BDT, NN etc) – the scientist finds good, well discriminating variables (~10), called “features”, and performs classification using them as inputs for the ML algorithm.
● Deep Learning – thousands or millions of input variables (like pixels of a photo), the features are automagically extracted during training.
15.05.2018 58M. Wolter
Deeper network?
Traditional Neural Networks have one or two hidden layers.
Deep Neural Network: a stack of sequentially trained auto encoders, which recognize different features (more complicated in each layer) and automatically prepare a new representation of data. This is how our brains are organized.
But how to train such a stack?
http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/
15.05.2018 59M. Wolter
Training a Deep Neural Network
● In the early 2000s, attempts to train deep neural networks were frustrated by the apparent failure of the well known back-propagation algorithms (backpropagation, gradient descent). Many researchers claimed NN are gone, only Support Vector Machines and Boosted Decision Trees should be used!
● In 2006, Hinton, Osindero and Teh1 first time succeeded in training a deep neural network by first initializing its parameters sequentially, layer by layer. Each layer was trained to produce a representation of its inputs that served as the training data for the next layer. Then the network was tweaked using gradient descent (standard algorithm).
● There was a common belief that Deep NN training requires careful initialization of parameters and sophisticated machine learning algorithms.
1Hinton, G. E., Osindero, S. and Teh, Y., A fast learning algorithm for deep belief nets, Neural Computation 18, 1527-1554.
15.05.2018 60M. Wolter
Training with a brute force● In 2010, a surprising counter example to the conventional wisdom was
demonstrated1.● Deep neural network was trained to classify the handwritten digits in the
MNIST2 data set, which comprises 60,000 28 × 28 = 784 pixel images for training and 10,000 images for testing.
● They showed that a plain DNN with architecture (784, 2500, 2000, 1500, 1000, 500, 10 – HUGE!!!), trained using standard stochastic gradient descent (Minuit on steroids!), outperformed all other methods that had been applied to the MNIST data set as of 2010. The error rate of this 12 million parameter ∼DNN was 35 images out of 10,000.
The training images were randomly and slightly deformed before every training epoch. The entire set of 60,000 undeformed images could be used as the validation set during training, since none were used as training data.
1 Ciresan DC, Meier U, Gambardella LM, Schmidhuber J. ,Deep, big, simple neural nets for handwritten digit recognition. Neural Comput. 2010 Dec; 22 (12): 3207-20.2 http://yann.lecun.com/exdb/mnist/
15.05.2018 61M. Wolter
Why it didn’t work before?● More data, clusters of GPU/CPU (computing power!)
● The particular non-linear activation function chosen for neurons in a neural net makes a big impact on performance, and the one often used by default is not a good choice.
● The old vanishing gradient problem happens, basically, because backpropagation involves a sequence of multiplications that invariably result in smaller derivatives for earlier layers. That is, unless weights are chosen with difference scales according to the layer they are in - making this simple change results in significant improvements.
26.02.2019 62M. Wolter, Machine Learning
26.02.2019 63M. Wolter, Machine Learning
An example – pattern recognition with KERAS and TensorFlow
● CIFAR10 small image classification. Dataset of 50,000 32x32 color training images, labeled over 10 categories, and 10,000 test images.
26.02.2019 64M. Wolter, Machine Learning
Deep Neural Network
Training: ~24 h on 12 core machine on our Cloud cluster.
200 epochs
26.02.2019 65M. Wolter, Machine Learning
ResultsRecognized as Really was
26.02.2019 66M. Wolter, Machine Learning
Confusion matrix
26.02.2019 67M. Wolter, Machine Learning
That’s all for today
Applet showing the performance of deep NN:
http://cs.stanford.edu/people/karpathy/convnetjs/
15.05.2018 68M. Wolter
Backup
15.05.2018 69M. Wolter
New method of training (basic info)
15.05.2018 70M. Wolter
New method of training (basic info)
Train this layer
15.05.2018 71M. Wolter
New method of training (basic info)
Train this layer
Then this
15.05.2018 72M. Wolter
New method of training (basic info)
Trenuj tę warstwę
Then this
Then this
15.05.2018 73M. Wolter
New method of training (basic info)
Train this layer
Then this
Then this
Then this
15.05.2018 74M. Wolter
New method of training (basic info)
Train this layer
Then this
Then this
Then thisAt the end this