Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks


Reference: We will be referring to sections etc. of 'Deep Learning' by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville

https://youtu.be/4PtaZVUbilI?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa&t=1187

Recap: Non-linearity via Kernelization (Eg., LR)

The Regularized (Logistic) Cross-Entropy Loss function (minimized wrt w ∈ ℝ^p):

E(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log f_w\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - f_w\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2m} \|w\|_2^2    (1)
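To make (1) concrete, here is a minimal NumPy sketch of the primal objective; the function names and the use of the sigmoid for f_w are my own illustrative choices.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def primal_loss(w, X, y, lam):
    """Regularized cross-entropy of Eq. (1). X is (m, p), y has entries in {0, 1}."""
    m = X.shape[0]
    f = sigmoid(X @ w)                                   # f_w(x^(i)) for every example
    ce = -(y * np.log(f) + (1 - y) * np.log(1 - f)).mean()
    return ce + (lam / (2 * m)) * np.dot(w, w)           # + (lambda / 2m) ||w||^2
```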

Equivalent dual kernelized objective¹ (minimized wrt α ∈ ℝ^m):

E_D(\alpha) = \sum_{i=1}^{m} \left[ \sum_{j=1}^{m} \left( -y^{(i)} K\left(x^{(i)}, x^{(j)}\right) \alpha_j + \frac{\lambda}{2} \alpha_i K\left(x^{(i)}, x^{(j)}\right) \alpha_j \right) + \log\left( 1 + \exp\left( \sum_{j=1}^{m} \alpha_j K\left(x^{(i)}, x^{(j)}\right) \right) \right) \right]    (2)

Decision function: f_w(x) = \frac{1}{1 + \exp\left( -\sum_{j=1}^{m} \alpha_j K\left(x, x^{(j)}\right) \right)}

¹ Representer Theorem and http://perso.telecom-paristech.fr/~clemenco/Projets_ENPC_files/kernel-log-regression-svm-boosting.pdf
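In the same spirit, a sketch of the dual objective (2) and the kernelized decision function; the RBF kernel and all helper names here are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def dual_objective(alpha, K, y, lam):
    """E_D(alpha) of Eq. (2); K is the m x m Gram matrix, y has entries in {0, 1}."""
    s = K @ alpha                                # s_i = sum_j alpha_j K(x^(i), x^(j))
    data_term = (-y * s + np.log1p(np.exp(s))).sum()
    reg_term = 0.5 * lam * alpha @ K @ alpha     # (lambda / 2) * alpha^T K alpha
    return data_term + reg_term

def decision_function(alpha, X_train, x, gamma=1.0):
    """f_w(x) = sigmoid(sum_j alpha_j K(x, x^(j)))."""
    k = rbf_kernel(x[None, :], X_train, gamma)[0]
    return 1.0 / (1.0 + np.exp(-k @ alpha))
```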

Story so Far

Perceptron
Kernel Perceptron
Logistic Regression
Kernelized Logistic Regression

Neural Networks: Universal Approximation Properties and Depth (Section 6.4)

With a single hidden layer of a sufficient size and a reasonable choice of nonlinearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can represent any smooth function to any desired accuracy.
The greater the required accuracy, the more hidden units are required.
No free lunch theorems
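As an illustrative companion to this statement (not part of the lecture), the sketch below fits sin(x) with one-hidden-layer tanh networks of increasing width, trained by plain gradient descent; every name and hyperparameter here is my own arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)[:, None]    # inputs, shape (200, 1)
t = np.sin(x)                                   # a smooth target function

def fit_one_hidden_layer(width, steps=20000, lr=0.01):
    """Fit a 1-hidden-layer tanh network to (x, t) by gradient descent on squared error."""
    W1 = rng.normal(0, 1, (1, width)); b1 = np.zeros(width)
    W2 = rng.normal(0, 1, (width, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        h = np.tanh(x @ W1 + b1)                # hidden activations
        yhat = h @ W2 + b2                      # linear output layer
        err = yhat - t
        gW2 = h.T @ err / len(x); gb2 = err.mean(0)          # output-layer gradients
        dh = (err @ W2.T) * (1 - h ** 2)                     # backprop through tanh
        gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return float(np.mean(err ** 2))

for width in (1, 3, 10):                        # the fit improves as width grows
    print(width, fit_one_hidden_layer(width))
```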

Problem in Perspective

Given data points x_i, i = 1, 2, . . . , m
Possible class choices: c_1, c_2, . . . , c_k
Wish to generate a mapping/classifier

f : x → {c_1, c_2, . . . , c_k}

to get class labels y_1, y_2, . . . , y_m

Problem in Perspective

In general, a series of mappings

x \xrightarrow{f(\cdot)} y \xrightarrow{g(\cdot)} z \xrightarrow{h(\cdot)} \{c_1, c_2, \ldots, c_k\}

where y, z are in some latent space.

https://playground.tensorflow.org

Other non-linear activation functions?

Consider classification: f(x) = g(w^T φ(x)), i.e., sign(w^T φ(x)) is replaced by g(w^T φ(x)), where g(s) is a

1. step function: g(s) = 1 if s ∈ [θ, ∞) and g(s) = 0 otherwise, OR
2. sigmoid function: g(s) = 1/(1 + e^{-s}), with possible thresholding using some θ (such as 1/2),
3. Rectified Linear Unit (ReLU): g(s) = max(0, s): a most popular activation function,
4. Softplus: g(s) = ln(1 + e^s).

Options 2 and 4 have the thresholding step deferred. The threshold changes as the bias is changed. (A short NumPy sketch of these four choices appears after this slide.)

Demonstration at https://www.desmos.com/calculator

Neural Networks: Cascade of layers of perceptrons giving you non-linearity. Check out https://playground.tensorflow.org/
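A minimal NumPy sketch of the four activation choices listed above (the function names are mine):

```python
import numpy as np

def step(s, theta=0.5):
    """Step/threshold unit: 1 if s >= theta, else 0."""
    return (s >= theta).astype(float)

def sigmoid(s):
    """Sigmoid: squashes s into (0, 1); can be thresholded at e.g. 1/2."""
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    """Rectified Linear Unit: max(0, s)."""
    return np.maximum(0.0, s)

def softplus(s):
    """Softplus: ln(1 + e^s), a smooth version of ReLU."""
    return np.log1p(np.exp(s))

s = np.linspace(-3, 3, 7)
print(step(s), sigmoid(s), relu(s), softplus(s), sep="\n")
```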

Recall: Logistic Extension to multi-class

1. Each class c = 1, 2, . . . , K − 1 can have a different weight vector w_c = [w_{c,1}, w_{c,2}, . . . , w_{c,k}, . . . , w_{c,K−1}] and

p(Y = c | φ(x)) = \frac{e^{-(w_c)^T \phi(x)}}{1 + \sum_{k=1}^{K-1} e^{-(w_k)^T \phi(x)}}   for c = 1, . . . , K − 1,

so that

p(Y = K | φ(x)) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{-(w_k)^T \phi(x)}}

Softmax: (Equivalent) LR extension to multi-class

1. Each class c = 1, 2, . . . , K can have a different weight vector w_c = [w_{c,1}, w_{c,2}, . . . , w_{c,K}] and

p(Y = c | φ(x)) = \frac{e^{-(w_c)^T \phi(x)}}{\sum_{k=1}^{K} e^{-(w_k)^T \phi(x)}}   for c = 1, . . . , K.

2. This has one additional (redundant) set of weight-vector parameters.
3. Tutorial 7: Show the (simple) equivalence between the two formulations (a numerical sanity check is sketched below).
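The numerical sanity check referred to above (a sketch, not Tutorial 7's solution): fixing the redundant w_K to the zero vector in the softmax form recovers the (K − 1)-vector form, using the slides' e^{-(w_c)^T φ(x)} convention.

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 4, 3
phi = rng.normal(size=p)                      # feature vector phi(x)
W = rng.normal(size=(K - 1, p))               # weight vectors w_1 ... w_{K-1}

# (K - 1)-vector logistic formulation
scores = np.exp(-W @ phi)                     # e^{-(w_c)^T phi(x)}, c = 1..K-1
p_logistic = np.append(scores, 1.0) / (1.0 + scores.sum())   # classes 1..K-1, then K

# Softmax formulation with the redundant w_K fixed to the zero vector
W_soft = np.vstack([W, np.zeros(p)])
scores_soft = np.exp(-W_soft @ phi)
p_softmax = scores_soft / scores_soft.sum()

print(np.allclose(p_logistic, p_softmax))     # True: the two formulations agree
```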

Multi-layer Perceptron/LR (Neural Network) and VC Dimension

Measure for (non)separability using a classifier?

Aspect 1: Number of functions that can be represented.
Tutorial 3: Given n boolean variables, how many of the 2^{2^n} boolean functions can be represented by the classifier? Eg, with perceptron we saw that for n = 2 it is 14, for n = 3 it is 104, and for n = 4 it is 1882. (A brute-force count for n = 2 is sketched below.)
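The brute-force count referred to above, for n = 2 (the two functions a perceptron cannot represent are XOR and XNOR); the weight grid is an arbitrary choice of mine, dense enough for this tiny case.

```python
import itertools
import numpy as np

n = 2
inputs = np.array(list(itertools.product([0, 1], repeat=n)))    # the 2^n input points

def representable(labels, grid=np.arange(-2, 2.25, 0.25)):
    """True if some perceptron realizes these labels via (w . x + b >= 0)."""
    for w in itertools.product(grid, repeat=n):
        for b in grid:
            if np.array_equal((inputs @ np.array(w) + b >= 0).astype(int), labels):
                return True
    return False

count = sum(representable(np.array(lab))
            for lab in itertools.product([0, 1], repeat=2 ** n))  # all 2^(2^n) labelings
print(count)   # 14
```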

Measure for (non)separability using a classifier?

Aspect 2: Cardinality of the largest set of points that can be shattered.
VC (Vapnik-Chervonenkis) dimension ⇒ richness of the space of functions learnable by a statistical classification algorithm.

A classification function f_w is said to shatter a set of data points (x_1, x_2, . . . , x_n) if, for all assignments of labels to those points, there exists a w such that f_w makes no errors when evaluating that set of data points. The cardinality of the largest set of points that f_w can shatter is its VC-dimension (see extra & optional material on VC-dimension).

VC dimension: Examples

Three points can be shattered using linear classifiers.
[Figure: the labelings of three points in general position, each realized by a line]

Four points can be shattered using axis-parallel rectangles.
[Figure: the labelings of four points, each positive subset enclosed by an axis-parallel rectangle]

Measure for (non)separability using a classifier?

Recall: f_w shatters (x_1, x_2, . . . , x_n) if every assignment of labels to those points is realized without error by some w; the VC-dimension of f is the cardinality of the largest such set.

Eg: For f as a threshold classifier (in ℝ), VC dimension = 1
Eg: For f as an interval classifier (in ℝ), VC dimension = 2
Eg: For f as a perceptron (in 2-dimensional ℝ²), VC dimension = 3 (a brute-force check is sketched below)
Eg: For f as a perceptron (in ℝ^n), VC dimension = n + 1 (see extra slides for proof)
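The brute-force check referred to above (my own illustration): three points in general position in ℝ² can be shattered by a perceptron, while the four XOR-style corner points cannot.

```python
import itertools
import numpy as np

def shattered(points, grid=np.arange(-2, 2.25, 0.25)):
    """True if every labeling of `points` is realized by some (w . x + b >= 0)."""
    for labels in itertools.product([0, 1], repeat=len(points)):
        ok = any(
            np.array_equal((points @ np.array(w) + b >= 0).astype(int), np.array(labels))
            for w in itertools.product(grid, repeat=2) for b in grid
        )
        if not ok:
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR configuration
print(shattered(three), shattered(four))                           # True False
```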

Neural Networks

Great expressive power (recall VC dimension). Keys to non-linearity:
1. non-linearity in activation functions (Tutorial 7)
2. cascade of non-linear activation functions

Varied activation functions: f(x) = g(w^T φ(x)), where g(s) can be any of the following:
1. sign/step function: g(s) = sign(s), or g(s) = 1 if s ∈ [θ, ∞) and g(s) = 0 otherwise
2. sigmoid function: g(s) = 1/(1 + e^{-s}), with possible thresholding using some θ (such as 1/2); also tanh(s) = 2 sigmoid(2s) − 1
3. Rectified Linear Unit (ReLU): g(s) = max(0, s): a most popular activation function
4. Softplus: g(s) = ln(1 + e^s): a smooth version of ReLU
5. Others: leaky ReLU, RBF, Maxout, Hard tanh, absolute value rectification (Section 6.2.1)

Neural Networks: Cascade of layers of perceptrons giving you non-linearity

The 4 Design Choices in a Neural Network (Section 6.1)

Some activation functions

Derivatives of some activation functions
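The plots from these two slides are not reproduced here; as a stand-in, a short sketch of the standard derivative formulas for the activations discussed earlier (these are standard facts, not taken from the missing figures):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Standard derivative formulas for the activations discussed above.
d_sigmoid = lambda s: sigmoid(s) * (1.0 - sigmoid(s))   # sigma'(s) = sigma(s)(1 - sigma(s))
d_tanh = lambda s: 1.0 - np.tanh(s) ** 2                # tanh'(s) = 1 - tanh(s)^2
d_relu = lambda s: (s > 0).astype(float)                # ReLU'(s) = 1 for s > 0, else 0
d_softplus = sigmoid                                    # softplus'(s) = sigma(s)

s = np.linspace(-2, 2, 5)
print(d_sigmoid(s), d_tanh(s), d_relu(s), d_softplus(s), sep="\n")
```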

Some interesting visualizations

https://distill.pub/2018/building-blocks/

https://distill.pub/2017/feature-visualization/

http://colah.github.io/posts/2015-01-Visualizing-Representations/

http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Simple OR using (step) perceptron

[Diagram: step perceptron with inputs x, y and bias input b; weights 1, 1, −0.25; threshold θ = 1/2; output x ∨ y]

AND using (step) perceptron

[Diagram: step perceptron with inputs x, y and bias input b; weights 1, 1, −1.25; threshold θ = 1/2; output x ∧ y]

NOT using perceptron

[Diagram: step perceptron with input x and bias input b; weights −1, 0.75; threshold θ = 1/2; output ¬x]
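The three gate constructions above can be verified directly; a small sketch using the weights and θ = 1/2 from the diagrams (the helper name step_perceptron is mine):

```python
import numpy as np

def step_perceptron(weights, theta=0.5):
    """Step unit: fires 1 iff the weighted sum (bias input fixed at 1) reaches theta."""
    w = np.array(weights, dtype=float)
    return lambda *inputs: int(w @ np.append(inputs, 1.0) >= theta)

OR = step_perceptron([1, 1, -0.25])      # weights for x, y and the bias input
AND = step_perceptron([1, 1, -1.25])
NOT = step_perceptron([-1, 0.75])

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, OR(x, y), AND(x, y), NOT(x))
```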

Feed-forward Neural Nets

[Diagram: inputs x_1, x_2, . . . , x_n and a constant bias input 1, connected by weights w_{01}, w_{11}, w_{21}, . . . , w_{n1} and w_{02}, w_{12}, w_{22}, . . . , w_{n2} to hidden units z_1 = g(Σ ·) and z_2 = g(Σ ·), with unit outputs f_1 = g(·) and f_2 = g(·)]

Eg: Feed-forward Neural Net for XOR (θ = 0)

[Diagram: inputs x_1, x_2 and a constant bias input 1.
Hidden unit z_1 = g(x_1 + x_2 − 0.25), i.e. weights 1, 1, −0.25 (an OR of the inputs).
Hidden unit z_2 = g(−x_1 − x_2 + 1.25), i.e. weights −1, −1, 1.25 (a NAND of the inputs).
Output f_w = g(z_1 + z_2 − 1.25), i.e. weights 1, 1, −1.25 (an AND of z_1 and z_2), so f_w computes x_1 XOR x_2; a check is sketched below.]
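A minimal check of the XOR network above, with the step nonlinearity thresholded at θ = 0 (the structure and weights are from the slide; the helper names are mine):

```python
def g(s, theta=0.0):
    """Step activation with threshold theta = 0, as on the slide."""
    return 1 if s >= theta else 0

def xor_net(x1, x2):
    z1 = g(1 * x1 + 1 * x2 - 0.25)      # OR of the inputs
    z2 = g(-1 * x1 - 1 * x2 + 1.25)     # NAND of the inputs
    return g(1 * z1 + 1 * z2 - 1.25)    # AND of z1 and z2  ->  XOR(x1, x2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))      # prints 0, 1, 1, 0
```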