Neural Networks. Overview


Page 1: Neural Networks. Overview

Neural networks. Overview

Oleksandr Baiev, PhD

Senior Engineer

Samsung R&D Institute Ukraine

Page 2: Neural Networks. Overview

Neural networks. Overview

• Common principles
  – Structure
  – Learning
• Shallow and Deep NN
• Additional methods
  – Conventional
  – Voodoo

Page 3: Neural Networks. Overview

Neural networks. Overview

• Common principles
  – Structure
  – Learning
• Shallow and Deep NN
• Additional methods
  – Conventional
  – Voodoo

Page 4: Neural Networks. Overview

Canonical/Typical tasks

Page 5: Neural Networks. Overview

Solutions in general

๐‘ฅ๐‘— = ๐‘ฅ1, ๐‘ฅ2, ๐‘ฅ3, ๐‘ฅ4, โ€ฆ , ๐‘ฅ๐‘– , โ€ฆ ๐‘— โˆˆ ๐‘‹

๐‘ฆ๐‘— = ๐‘ฆ1, ๐‘ฆ2, โ€ฆ , ๐‘ฆ๐‘˜ , โ€ฆ ๐‘— โˆˆ ๐‘Œ

๐น: ๐‘‹ โ†’ ๐‘ŒClassification

๐‘ฆ1 = 1,0,0๐‘ฆ2 = 0,0,1๐‘ฆ3 = 0,1,0๐‘ฆ4 = 0,1,0

Index of sample in dataset

sample of class โ€œ0โ€

sample of class โ€œ2โ€

sample of class โ€œ2โ€

sample of class โ€œ1โ€

Regression

๐‘ฆ1 = 0.3๐‘ฆ2 = 0.2๐‘ฆ3 = 1.0๐‘ฆ4 = 0.65

Page 6: Neural Networks. Overview

What is an artificial neural network? Is it biology?

Simulation of biological neural networks (synapses, axons, chains, layers, etc.) is a good abstraction for understanding the topology.

A biological NN is only an inspiration and an illustration. Nothing more!

Page 7: Neural Networks. Overview

What is an artificial neural network? Let's imagine a black box!

[Diagram: a black box F, with inputs and params going in and outputs coming out]

General form: $outputs = F(inputs, params)$

Steps: 1) choose the "form" of F; 2) find the params.

Page 8: Neural Networks. Overview

What is an artificial neural network? It's simple math!

Output of the i-th neuron:

$s_i = \sum_{j=1}^{n} w_{ij} x_j + b_i$

$y_i = f(s_i)$

where the weights $w_{ij}$ and biases $b_i$ are the free parameters and $f$ is the activation function.
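A minimal sketch of this formula, assuming NumPy (the weights and inputs below are made up for illustration):

```python
import numpy as np

def neuron_output(x, w, b):
    """Output of the i-th neuron: y_i = f(s_i), s_i = sum_j w_ij * x_j + b_i.

    Sigmoid is used as the activation f, as on the following slides.
    """
    s = np.dot(w, x) + b              # weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-s))   # f(s) = sigmoid(s)

x = np.array([0.5, -1.0, 2.0])   # n = 3 inputs
w = np.array([0.1, 0.4, -0.2])   # one weight per input
print(neuron_output(x, w, b=0.05))
```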

Page 9: Neural Networks. Overview

What is an artificial neural network? It's simple math!

activation: $y = f(wx + b) = \mathrm{sigmoid}(wx + b)$

[Pages 10–13 repeat this slide with accompanying plots of the activation; the figures are not preserved.]

Page 14: Neural Networks. Overview

What is an artificial neural network? It's simple math!

$n$ inputs, $m$ neurons in the hidden layer.

Output of the i-th neuron:

$s_i = \sum_{j=1}^{n} w_{ij} x_j + b_i, \quad y_i = f(s_i)$

Output of the k-th layer:

1) $S_k = W_k X_k + B_k = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \cdots & \cdots & \cdots & \cdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{pmatrix}_k \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}_k + \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_m \end{pmatrix}_k$

2) $Y_k = f_k(S_k)$, with $f_k$ applied element-wise.

Form of F: the whole network is a superposition of such layer functions (cf. the Kolmogorov–Arnold superposition theorem).
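A minimal sketch of this matrix form, assuming NumPy (shapes and random weights are illustrative):

```python
import numpy as np

def layer_forward(X, W, B, f=np.tanh):
    """Output of one layer: Y_k = f_k(W_k X_k + B_k).

    W has shape (m, n): m neurons, each with n input weights.
    tanh stands in for any element-wise activation f_k.
    """
    S = W @ X + B   # step 1: affine map S_k = W_k X_k + B_k
    return f(S)     # step 2: element-wise nonlinearity

rng = np.random.default_rng(0)
X = rng.normal(size=3)                         # n = 3 inputs
W1, B1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer, m = 4
W2, B2 = rng.normal(size=(2, 4)), np.zeros(2)  # output layer
Y = layer_forward(layer_forward(X, W1, B1), W2, B2)
print(Y)   # the superposition F(X) of two layer functions
```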

Page 15: Neural Networks. Overview

Neural networks. Overview

• Common principles
  – Structure
  – Learning
• Shallow and Deep NN
• Additional methods
  – Conventional
  – Voodoo

Page 16: Neural Networks. Overview

How to find parameters W and B?

Supervised learning:

Training set (pairs of variables and responses): $(X; Y)_i, \; i = 1..N$

Find: $W^*, B^* = \operatorname*{argmin}_{W,B} L(F(X), Y)$

Cost function (loss, error):

logloss: $L(F(X), Y) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log \hat{f}_{i,j}$

where $y_{i,j}$ is "1" if the i-th sample belongs to class j, else "0", and the predictions are previously scaled: $\hat{f}_{i,j} = f_{i,j} / \sum_j f_{i,j}$

rmse: $L(F(X), Y) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(F(X_i) - Y_i\right)^2}$

These are just examples; the cost function depends on the problem (classification, regression) and on domain knowledge.
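Both cost functions are short in code; a sketch assuming NumPy, with F holding raw non-negative class scores and Y the targets:

```python
import numpy as np

def logloss(F, Y, eps=1e-12):
    """Cross-entropy: -1/N * sum_i sum_j y_ij * log(f_ij_hat).

    F: (N, M) non-negative class scores; Y: (N, M) one-hot targets.
    Rows of F are scaled to sum to 1 first, as on the slide.
    """
    P = F / F.sum(axis=1, keepdims=True)
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

def rmse(F, Y):
    """Root-mean-square error for regression targets."""
    return np.sqrt(np.mean((F - Y) ** 2))
```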

Page 17: Neural Networks. Overview

Training or optimization algorithm

So, we have the model cost $L$ (the error of prediction).

And we want to update the weights in order to minimize $L$:

$w^* = w + \alpha \Delta w$

In accordance with gradient descent: $\Delta w = -\nabla L$

This is clear for a network with only one layer (we have predicted outputs and targets, so we can evaluate $L$).

But how to find $\Delta w$ for the hidden layers?

Page 18: Neural Networks. Overview

Meet "Error Back Propagation"

Find $\Delta w$ for each layer, from the last to the first, as the influence of the weights on the cost:

$\Delta w_{i,j} = \frac{\partial L}{\partial w_{i,j}}$

and, by the chain rule:

$\frac{\partial L}{\partial w_{i,j}} = \frac{\partial L}{\partial f_j} \cdot \frac{\partial f_j}{\partial s_j} \cdot \frac{\partial s_j}{\partial w_{i,j}}$

Page 19: Neural Networks. Overview

Error Back Propagation: details

$\frac{\partial L}{\partial w_{i,j}} = \frac{\partial L}{\partial f_j} \frac{\partial f_j}{\partial s_j} \frac{\partial s_j}{\partial w_{i,j}}, \qquad \delta_j = \frac{\partial L}{\partial f_j} \frac{\partial f_j}{\partial s_j}$

$\delta_j = \begin{cases} L'(F(X), Y)\, f'(s_j), & \text{output layer} \\ \left(\sum_{l \in \text{next layer}} \delta_l w_{j,l}\right) f'(s_j), & \text{hidden layers} \end{cases}$

$\Delta w_{i,j} = -\alpha\, \delta_j\, x_i$

Page 20: Neural Networks. Overview

Gradient Descent in real life

Recall gradient descent: $w^* = w + \alpha \Delta w$

$\alpha$ is a "step" coefficient; in ML terms, the learning rate. Typical: $\alpha = 0.01..0.1$

Recall the cost function: $L = \frac{1}{N} \sum_{i=1}^{N} \dots$ It sums over all samples. And what if $N = 10^6$ or more?

GD modification: update $w$ for each sample.

Page 21: Neural Networks. Overview

Gradient Descent: Stochastic & Minibatch

• "Batch" GD ($L$ over the full set): needs a lot of memory

• Stochastic GD ($L$ for each sample): fast, but fluctuates

• Minibatch GD ($L$ over subsets): less memory & fewer fluctuations. The minibatch size depends on the HW. Typical: minibatch = 32…256
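A sketch of the minibatch data flow; grad_L, X_train, and Y_train are hypothetical placeholders for the pieces sketched on the previous slides:

```python
import numpy as np

def minibatches(X, Y, batch_size=64, rng=np.random.default_rng()):
    """Yield random minibatches: a compromise between batch GD
    (whole set, heavy on memory) and stochastic GD (one sample, noisy)."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], Y[sel]

# for Xb, Yb in minibatches(X_train, Y_train, batch_size=64):
#     W -= alpha * grad_L(W, Xb, Yb)   # one update per minibatch
```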

Page 22: Neural Networks. Overview

Termination criteria

• By epoch count: a maximum number of iterations over the whole dataset. Typical: epochs = 50…200

• By the value of the gradient: a gradient equal to 0 means a minimum, but a small gradient means very slow learning.

• When the cost has not changed for several epochs: if the error does not change, the training procedure is not converging any further.

• Early stopping: stop when the "validation" score starts to increase even while the "train" score continues to decrease.
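Early stopping is simple to express; a sketch where run_epoch and val_loss are hypothetical hooks into the training loop sketched earlier:

```python
def train(run_epoch, val_loss, max_epochs=200, patience=10):
    """Stop on the epoch budget, or when the validation loss has not
    improved for `patience` consecutive epochs (early stopping)."""
    best = float("inf")
    stale = 0
    for epoch in range(max_epochs):
        run_epoch()                 # one pass over the training set
        score = val_loss()
        if score < best:
            best, stale = score, 0  # validation still improving
        else:
            stale += 1
            if stale >= patience:   # train loss may still be falling,
                break               # but validation has turned: stop
    return best
```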

Page 23: Neural Networks. Overview

Neural networks. Overview

• Common principles
  – Structure
  – Learning
• Shallow and Deep NN
• Additional methods
  – Conventional
  – Voodoo

Page 24: Neural Networks. Overview

What about the "form" of F? Network topology

"Shallow" networks: 1–2 hidden layers => not enough parameters => poor separation abilities

"Deep" networks are NNs with 2..10 layers

"Very deep" networks are NNs with >10 layers

Page 25: Neural Networks. Overview

Deep learning. Problems

• Big networks => too much separating ability => overfitting

• Vanishing gradient problem during training

• Complex error surface => local minima

• Curse of dimensionality => memory & computation: $\dim W^{(i)} = m^{(i-1)} \cdot m^{(i)}$, where $m^{(i)}$ is the number of neurons in layer $i$

Page 26: Neural Networks. Overview

Neural networks. Overview

• Common principles
  – Structure
  – Learning
• Shallow and Deep NN
• Additional methods
  – Conventional
  – Voodoo

Page 27: Neural Networks. Overview

Additional methods: Conventional

• Momentum (damps the oscillations on the error surface):
$\Delta w^{(t)} = -\alpha \nabla L(w^{(t)}) + \beta \Delta w^{(t-1)}$, where $\beta \Delta w^{(t-1)}$ is the momentum term. Typical: $\beta = 0.9$

• LR decay (make smaller steps near the optimum):
$\alpha^{(t)} = k \alpha^{(t-1)}, \; 0 < k < 1$. Typical: apply LR decay with $k = 0.1$ every 10..100 epochs

• Weight decay (prevents weight growth and smooths F):
$L^* = L + \lambda \lVert w^{(t)} \rVert$; L1 or L2 regularization is often used. Typical: L2 with $\lambda = 0.0005$

Page 28: Neural Networks. Overview

Neural networks. Overview

• Common principles
  – Structure
  – Learning
• Shallow and Deep NN
• Additional methods
  – Conventional
  – Voodoo

Page 29: Neural Networks. Overview

Additional methods: Contemporary

Dropout/DropConnect

– ensembles of networks

– $2^N$ networks in one: for each example, hide neuron outputs randomly ($P = 0.5$)

Page 30: Neural Networks. Overview

Additional methods: Contemporary

Data augmentation: more data covering all available cases:

– affine transformations, flips, crops, contrast, noise, scale

– pseudo-labeling
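Two of the listed augmentations as a sketch on a NumPy image array (the noise level is an illustrative choice):

```python
import numpy as np

def augment(img, rng=np.random.default_rng()):
    """Random horizontal flip plus light Gaussian noise on an (H, W) array.
    Affine transforms, crops, and contrast changes follow the same pattern."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                         # horizontal flip
    return img + rng.normal(0.0, 0.01, img.shape)  # additive noise
```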

Page 31: Neural Networks. Overview

Additional methods: Contemporary

New activation functions:

– Linear: $y_i = f(s_i) = a s_i$

– ReLU: $y_i = \max(s_i, 0)$

– Leaky ReLU: $y_i = \begin{cases} s_i, & s_i > 0 \\ a s_i, & \text{otherwise} \end{cases}$ (typical: $a = 0.01$)

– Maxout: $y_i = \max(s_{1,i}, s_{2,i}, \dots, s_{k,i})$ (typical: $k = 2..3$)

Page 32: Neural Networks. Overview

Additional methods: Contemporary

Pre-training

– train layer-by-layer,

– re-train an "other", already trained network

Page 33: Neural Networks. Overview

Sources

• Geoffrey Hinton's course "Neural Networks for Machine Learning" [http://www.coursera.org/course/neuralnets]

• Ian Goodfellow, Yoshua Bengio and Aaron Courville, "Deep Learning" [http://www.deeplearningbook.org/]

• http://neuralnetworksanddeeplearning.com

• CS231n: Convolutional Neural Networks for Visual Recognition [http://cs231n.stanford.edu/]

• CS224d: Deep Learning for Natural Language Processing [http://cs224d.stanford.edu/]

• Schmidhuber, "Deep Learning in Neural Networks: An Overview"

• kaggle.com competitions and forums