Neural Networks. Overview
Neural networks. Overview
Oleksandr Baiev, PhD
Senior Engineer
Samsung R&D Institute Ukraine
Neural networks. Overview
• Common principles
– Structure
– Learning
• Shallow and Deep NN
• Additional methods
– Conventional
– Voodoo
Canonical/Typical tasks
Solutions in general
$x_d = (x_1, x_2, x_3, x_4, \dots, x_n), \quad x \in X$
$y_d = (y_1, y_2, \dots, y_m), \quad y \in Y$
$F: X \to Y$
Classification (the subscript is the index of the sample in the dataset; targets are one-hot):
$y_1 = (1,0,0)$: sample of class "0"
$y_2 = (0,0,1)$: sample of class "2"
$y_3 = (0,1,0)$: sample of class "1"
$y_4 = (0,1,0)$: sample of class "1"
Regression
$y_1 = 0.3, \quad y_2 = 0.2, \quad y_3 = 1.0, \quad y_4 = 0.65$
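Not from the slides: a minimal NumPy sketch of building such targets; the label values and array names are illustrative assumptions.

```python
import numpy as np

# hypothetical labels for the four samples above: classes 0, 2, 1, 1
labels = np.array([0, 2, 1, 1])
num_classes = 3

# classification targets: one-hot rows, y_d = one_hot[d]
one_hot = np.eye(num_classes)[labels]
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 1. 0.]]

# regression targets are plain real numbers
y_regression = np.array([0.3, 0.2, 1.0, 0.65])
```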
What is an artificial Neural Network? Is it biology?
Simulating biological neural networks (synapses, axons, chains, layers, etc.) is a good abstraction for understanding the topology.
The biological NN is only an inspiration and an illustration. Nothing more!
What is an artificial Neural Network? Let's imagine a black box!
[diagram: a black box F turning inputs and params into outputs]
General form: $outputs = F(inputs, params)$
Steps: 1) choose the "form" of F; 2) find the params
What is an artificial Neural Network? It's simple math!
Output of the i-th neuron:
$s_i = \sum_{j=1}^{n} w_{ij} x_j + b_i$, where the weights $w_{ij}$ and biases $b_i$ are the free parameters
$y_i = f(s_i)$, where $f$ is the activation function
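A minimal sketch of one neuron in NumPy; the sigmoid is an assumed choice of $f$ (it matches the activation used on the next slides).

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def neuron_output(x, w_i, b_i, f=sigmoid):
    """y_i = f(s_i) with s_i = sum_j w_ij * x_j + b_i."""
    s_i = np.dot(w_i, x) + b_i
    return f(s_i)
```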
What is an artificial Neural Network? It's simple math!
activation: $y = f(wx + b) = \mathrm{sigmoid}(wx + b)$
[series of slides with plots of the activation]
What is an artificial Neural Network? It's simple math!
n inputs, m neurons in the hidden layer
Output of the i-th neuron:
$s_i = \sum_{j=1}^{n} w_{ij} x_j + b_i, \quad y_i = f(s_i)$
Output of the k-th layer:
1) $S_k = W_k X_k + B_k = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \cdots & \cdots & \cdots & \cdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{pmatrix}_k \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}_k + \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_m \end{pmatrix}_k$
2) $Y_k = f_k(S_k)$, with $f_k$ applied element-wise
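The same two steps as a vectorized NumPy sketch; the layer sizes and the sigmoid activation are assumptions for illustration.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def layer_forward(x, W, b, f=sigmoid):
    """Step 1: S = W x + b; step 2: apply f element-wise."""
    return f(W @ x + b)

# a network is a composition of layers: n = 4 inputs, m = 5 hidden, 3 outputs
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
y = layer_forward(layer_forward(x, W1, b1), W2, b2)
```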
Form of F: the Kolmogorov & Arnold function superposition
Neural networks. Overview
• Common principles
– Structure
– Learning
• Shallow and Deep NN
• Additional methods
– Conventional
– Voodoo
How to find the parameters W and B?
Supervised learning:
Training set (pairs of variables and responses): $(X; Y)_d, \; d = 1..D$
Find: $W^*, B^* = \operatorname{argmin}_{W,B} L(F(X), Y)$
Cost function (loss, error):
logloss: $L(F(X), Y) = -\frac{1}{D} \sum_{d=1}^{D} \sum_{j=1}^{m} y_{d,j} \log p_{d,j}$
where $y_{d,j}$ is "1" if the d-th sample is of class $j$, else "0", and the outputs are previously scaled: $p_{d,j} = s_{d,j} / \sum_j s_{d,j}$
rmse: $L(F(X), Y) = \frac{1}{D} \sum_{d=1}^{D} (F(X_d) - Y_d)^2$
These are just examples. The cost function depends on the problem (classification, regression) and on domain knowledge.
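A sketch of both costs in NumPy, following the slide's notation (for "rmse" the slide's formula, without the square root, is implemented); the epsilon guard is an added assumption.

```python
import numpy as np

def logloss(P, Y):
    """P[d, j]: scaled scores; Y[d, j]: one-hot targets ('1' if sample d is class j)."""
    eps = 1e-12                                   # guard against log(0)
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

def mse(F_X, Y):
    """The slide's 'rmse': mean of squared prediction errors."""
    return np.mean((F_X - Y) ** 2)
```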
Training or optimization algorithm
So, we have the model cost $L$ (or the error of prediction),
and we want to update the weights in order to minimize $L$:
$w^* = w + \alpha \Delta w$
In accordance with gradient descent: $\Delta w = -\nabla L$
It's clear for a network with only one layer (we have the predicted outputs and the targets, so we can evaluate $L$).
But how to find $\nabla L$ for the hidden layers?
Meet "Error Back Propagation"
Find $\Delta w$ for each layer, from the last to the first, as the influence of the weights on the cost:
$\Delta w_{i,j} = \frac{\partial L}{\partial w_{i,j}}$, and:
$\frac{\partial L}{\partial w_{i,j}} = \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial s_i} \frac{\partial s_i}{\partial w_{i,j}}$
Error Back Propagation. Details
$\frac{\partial L}{\partial w_{i,j}} = \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial s_i} \frac{\partial s_i}{\partial w_{i,j}}, \qquad \delta_i = \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial s_i}$
$\delta_i = L'(F(X), Y) \, f'(s_i)$ for the output layer
$\delta_i = \left( \sum_{k \in \text{next layer}} \delta_k w_{k,i} \right) f'(s_i)$ for the hidden layers
$\Delta w_{i,j} = -\alpha \, \delta_i x_j$
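A minimal sketch of one backprop step for a one-hidden-layer network with sigmoid activations and a squared-error cost; the shapes and the loss choice are assumptions, but the deltas follow the formulas above.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(x, y, W1, b1, W2, b2, alpha=0.1):
    """One update; x: (n,) input, y: (m,) target."""
    # forward pass, keeping the layer outputs
    y1 = sigmoid(W1 @ x + b1)
    y2 = sigmoid(W2 @ y1 + b2)

    # output layer: delta = L'(F(X), Y) * f'(s); for the sigmoid, f'(s) = y (1 - y)
    d2 = (y2 - y) * y2 * (1.0 - y2)
    # hidden layer: delta_i = (sum over next layer of delta_k * w_ki) * f'(s_i)
    d1 = (W2.T @ d2) * y1 * (1.0 - y1)

    # delta w_ij = -alpha * delta_i * x_j, done for all i, j via outer products
    W2 -= alpha * np.outer(d2, y1); b2 -= alpha * d2
    W1 -= alpha * np.outer(d1, x);  b1 -= alpha * d1
    return W1, b1, W2, b2
```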
Gradient Descent. In real life
Recall gradient descent: $w^* = w + \alpha \Delta w$
$\alpha$ is a "step" coefficient; in ML terms, the learning rate. Typical: $\alpha = 0.01..0.1$
Recall the cost function: $L = \frac{1}{D} \sum_{d=1}^{D} \dots$
It sums over all samples. And what if $D = 10^6$ or more?
GD modification: update $w$ for each sample.
Gradient Descent. Stochastic & Minibatch
"Batch" GD ($L$ over the full set): needs a lot of memory
Stochastic GD ($L$ for each sample): fast, but fluctuates
Minibatch GD ($L$ over subsets): less memory & fewer fluctuations
The size of the minibatch depends on the HW. Typical: minibatch = 32…256 (see the training-loop sketch below)
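A hedged sketch of one epoch of minibatch GD; `update_on_batch` is a hypothetical function standing in for the per-batch weight update.

```python
import numpy as np

def minibatch_epoch(X, Y, batch_size, update_on_batch, rng):
    """Shuffle the D samples, then update on consecutive slices of batch_size."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        update_on_batch(X[idx], Y[idx])   # hypothetical per-batch update
```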
Termination criteria
By epochs count: a maximum number of iterations over the whole data set. Typical: epochs = 50…200
By the value of the gradient: a gradient equal to 0 means a minimum, but a small gradient => very slow learning
When the cost didn't change during several epochs: if the error does not change, the training procedure is not converging
Early stopping: stop when the "validation" score starts to increase even while the "train" score continues to decrease (a sketch follows below)
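A sketch of a patience-based early-stopping check, one common way to implement the criterion above; the `patience` window is an assumed hyper-parameter, not from the slides.

```python
def should_stop(val_errors, patience=10):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_epoch = val_errors.index(min(val_errors))
    return len(val_errors) - 1 - best_epoch >= patience
```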
Neural networks. Overview
• Common principles
– Structure
– Learning
• Shallow and Deep NN
• Additional methods
– Conventional
– Voodoo
What about the "form" of F? Network topology
"Shallow" networks: 1-2 hidden layers => not enough parameters => poor separation abilities
A "deep" network is a NN with 2..10 layers
A "very deep" network is a NN with >10 layers
Deep learning. Problems
• Big networks => too large a separating ability => overfitting
• Vanishing gradient problem during training
• Complex error surface => local minima
• Curse of dimensionality => memory & computations: for layer sizes $n^{(k-1)}$ and $n^{(k)}$, $\dim W^{(k)} = n^{(k-1)} \cdot n^{(k)}$
Neural networks. Overview
• Common principles
– Structure
– Learning
• Shallow and Deep NN
• Additional methods
– Conventional
– Voodoo
Additional methods. Conventional
• Momentum (damps the oscillations on the error surface):
$\Delta w^{(t)} = -\alpha \nabla L(w^{(t)}) + \beta \Delta w^{(t-1)}$, where the term $\beta \Delta w^{(t-1)}$ is the momentum
Typical: $\beta = 0.9$
• LR decay (make smaller steps near the optimum):
$\alpha^{(t)} = k \alpha^{(t-1)}, \quad 0 < k < 1$
Typical: apply LR decay ($k = 0.1$) every 10..100 epochs
• Weight decay (prevents the weights from growing, and smooths F):
$L^* = L + \lambda \lVert w^{(t)} \rVert$
L1 or L2 regularization is often used. Typical: L2 with $\lambda = 0.0005$ (a combined update sketch follows below)
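A sketch combining the three tricks in one update; `alpha0` and `decay_every` are illustrative assumptions, while the typical constants come from the slide.

```python
import numpy as np

def sgd_update(w, grad, velocity, t, alpha0=0.1, beta=0.9,
               k=0.1, decay_every=50, lam=0.0005):
    """w, grad, velocity: arrays for one layer's weights; t: current epoch."""
    alpha = alpha0 * k ** (t // decay_every)     # LR decay every decay_every epochs
    grad = grad + lam * w                        # L2 weight decay term of dL*/dw
    velocity = -alpha * grad + beta * velocity   # momentum: dw(t)
    return w + velocity, velocity                # w* = w + dw(t)
```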
Neural networks. Overview
• Common principles
– Structure
– Learning
• Shallow and Deep NN
• Additional methods
– Conventional
– Voodoo
Additional methods. Contemporary
Dropout / DropConnect:
– ensembles of networks
– $2^n$ networks in one: for each example, hide neuron outputs randomly ($p = 0.5$); see the sketch below
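A sketch of inverted dropout, one common variant; the train-time rescaling by $1/(1-p)$ is an assumption beyond the slide, chosen so nothing needs to change at test time.

```python
import numpy as np

def dropout(y, p=0.5, train=True, rng=None):
    """Hide each neuron's output with probability p for each example."""
    if not train:
        return y
    rng = rng or np.random.default_rng()
    mask = (rng.random(y.shape) >= p) / (1.0 - p)  # rescale to keep E[y] fixed
    return y * mask
```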
Additional methods. Contemporary
Data augmentation: more data covering all available cases (a sketch follows below):
– affine transformations, flips, crops, contrast, noise, scale
– pseudo-labeling
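A sketch of a few of the listed transforms on a single grayscale image with values in [0, 1]; the jitter ranges are illustrative assumptions.

```python
import numpy as np

def augment(img, rng):
    """img: (H, W) array; returns one randomly perturbed copy."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                         # horizontal flip
    img = img * rng.uniform(0.8, 1.2)              # contrast jitter
    img = img + rng.normal(0.0, 0.02, img.shape)   # additive noise
    return np.clip(img, 0.0, 1.0)
```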
Additional methods. Contemporary
New activation functions (sketches follow below):
– Linear: $y_i = f(s_i) = a s_i$
– ReLU: $y_i = \max(s_i, 0)$
– Leaky ReLU: $y_i = s_i$ if $s_i > 0$, else $a s_i$; typical: $a = 0.01$
– Maxout: $y_i = \max(s_{1,i}, s_{2,i}, \dots, s_{k,i})$; typical: $k = 2..3$
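The listed activations as NumPy one-liners (a sketch; for maxout, the $k$ linear pieces are assumed stacked along the first axis).

```python
import numpy as np

def linear(s, a=1.0):
    return a * s

def relu(s):
    return np.maximum(s, 0.0)

def leaky_relu(s, a=0.01):
    return np.where(s > 0, s, a * s)

def maxout(S):
    """S: (k, m) pre-activations of k linear pieces; y_i = max over the pieces."""
    return np.max(S, axis=0)
```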
Additional methods. Contemporary
Pre-training:
– train layer-by-layer,
– re-train an "other" network
Sources
• Geoffrey Hinton's course "Neural Networks for Machine Learning" [http://www.coursera.org/course/neuralnets]
• Ian Goodfellow, Yoshua Bengio and Aaron Courville, "Deep Learning" [http://www.deeplearningbook.org/]
• http://neuralnetworksanddeeplearning.com
• CS231n: Convolutional Neural Networks for Visual Recognition [http://cs231n.stanford.edu/]
• CS224d: Deep Learning for Natural Language Processing [http://cs224d.stanford.edu/]
• Schmidhuber, "Deep Learning in Neural Networks: An Overview"
• kaggle.com competitions and forums