Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks


Reference: We will be referring to sections etc. of 'Deep Learning' by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville

https://youtu.be/4PtaZVUbilI?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa&t=1187

Recap: Non-linearity via Kernelization (Eg., LR)

The Regularized (Logistic) Cross-Entropy Loss function (minimized wrt w ∈ ℝ^p):

E(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log f_w\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - f_w\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2m} \|w\|_2^2    (1)
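To make (1) concrete, here is a minimal NumPy sketch of the primal objective; the function names and the use of the sigmoid for f_w are my own illustrative choices.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def primal_loss(w, X, y, lam):
    """Regularized cross-entropy of Eq. (1). X is (m, p), y has entries in {0, 1}."""
    m = X.shape[0]
    f = sigmoid(X @ w)                                   # f_w(x^(i)) for every example
    ce = -(y * np.log(f) + (1 - y) * np.log(1 - f)).mean()
    return ce + (lam / (2 * m)) * np.dot(w, w)           # + (lambda / 2m) ||w||^2
```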

Equivalent dual kernelized objective¹ (minimized wrt α ∈ ℝ^m):

E_D(\alpha) = \sum_{i=1}^{m} \left[ \sum_{j=1}^{m} \left( -y^{(i)} K\left(x^{(i)}, x^{(j)}\right) \alpha_j + \frac{\lambda}{2} \alpha_i K\left(x^{(i)}, x^{(j)}\right) \alpha_j \right) + \log\left( 1 + \exp\left( \sum_{j=1}^{m} \alpha_j K\left(x^{(i)}, x^{(j)}\right) \right) \right) \right]    (2)

Decision function: f_w(x) = \frac{1}{1 + \exp\left( -\sum_{j=1}^{m} \alpha_j K\left(x, x^{(j)}\right) \right)}

¹ Representer Theorem and http://perso.telecom-paristech.fr/~clemenco/Projets_ENPC_files/kernel-log-regression-svm-boosting.pdf
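In the same spirit, a sketch of the dual objective (2) and the kernelized decision function; the RBF kernel and all helper names here are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def dual_objective(alpha, K, y, lam):
    """E_D(alpha) of Eq. (2); K is the m x m Gram matrix, y has entries in {0, 1}."""
    s = K @ alpha                                # s_i = sum_j alpha_j K(x^(i), x^(j))
    data_term = (-y * s + np.log1p(np.exp(s))).sum()
    reg_term = 0.5 * lam * alpha @ K @ alpha     # (lambda / 2) * alpha^T K alpha
    return data_term + reg_term

def decision_function(alpha, X_train, x, gamma=1.0):
    """f_w(x) = sigmoid(sum_j alpha_j K(x, x^(j)))."""
    k = rbf_kernel(x[None, :], X_train, gamma)[0]
    return 1.0 / (1.0 + np.exp(-k @ alpha))
```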

Story so Far

Perceptron
Kernel Perceptron
Logistic Regression
Kernelized Logistic Regression

Neural Networks: Universal Approximation Properties and Depth (Section 6.4)

With a single hidden layer of a sufficient size and a reasonable choice of nonlinearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can represent any smooth function to any desired accuracy.
The greater the required accuracy, the more hidden units are required.
No free lunch theorems
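As an illustrative companion to this statement (not part of the lecture), the sketch below fits sin(x) with one-hidden-layer tanh networks of increasing width, trained by plain gradient descent; every name and hyperparameter here is my own arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)[:, None]    # inputs, shape (200, 1)
t = np.sin(x)                                   # a smooth target function

def fit_one_hidden_layer(width, steps=20000, lr=0.01):
    """Fit a 1-hidden-layer tanh network to (x, t) by gradient descent on squared error."""
    W1 = rng.normal(0, 1, (1, width)); b1 = np.zeros(width)
    W2 = rng.normal(0, 1, (width, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        h = np.tanh(x @ W1 + b1)                # hidden activations
        yhat = h @ W2 + b2                      # linear output layer
        err = yhat - t
        gW2 = h.T @ err / len(x); gb2 = err.mean(0)          # output-layer gradients
        dh = (err @ W2.T) * (1 - h ** 2)                     # backprop through tanh
        gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return float(np.mean(err ** 2))

for width in (1, 3, 10):                        # the fit improves as width grows
    print(width, fit_one_hidden_layer(width))
```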

Problem in Perspective

Given data points x_i, i = 1, 2, . . . , m
Possible class choices: c_1, c_2, . . . , c_k
Wish to generate a mapping/classifier

f : x → {c_1, c_2, . . . , c_k}

to get class labels y_1, y_2, . . . , y_m

Problem in Perspective

In general, a series of mappings

x \xrightarrow{f(\cdot)} y \xrightarrow{g(\cdot)} z \xrightarrow{h(\cdot)} \{c_1, c_2, \ldots, c_k\}

where y, z are in some latent space.

https://playground.tensorflow.org

Other non-linear activation functions?

Consider classification: f(x) = g(w^T φ(x)), i.e., sign(w^T φ(x)) is replaced by g(w^T φ(x)), where g(s) is a

1. step function: g(s) = 1 if s ∈ [θ, ∞) and g(s) = 0 otherwise, OR
2. sigmoid function: g(s) = 1/(1 + e^{-s}), with possible thresholding using some θ (such as 1/2),
3. Rectified Linear Unit (ReLU): g(s) = max(0, s): a most popular activation function,
4. Softplus: g(s) = ln(1 + e^s).

Options 2 and 4 have the thresholding step deferred. The threshold changes as the bias is changed. (A short NumPy sketch of these four choices appears after this slide.)

Demonstration at https://www.desmos.com/calculator

Neural Networks: Cascade of layers of perceptrons giving you non-linearity. Check out https://playground.tensorflow.org/
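A minimal NumPy sketch of the four activation choices listed above (the function names are mine):

```python
import numpy as np

def step(s, theta=0.5):
    """Step/threshold unit: 1 if s >= theta, else 0."""
    return (s >= theta).astype(float)

def sigmoid(s):
    """Sigmoid: squashes s into (0, 1); can be thresholded at e.g. 1/2."""
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    """Rectified Linear Unit: max(0, s)."""
    return np.maximum(0.0, s)

def softplus(s):
    """Softplus: ln(1 + e^s), a smooth version of ReLU."""
    return np.log1p(np.exp(s))

s = np.linspace(-3, 3, 7)
print(step(s), sigmoid(s), relu(s), softplus(s), sep="\n")
```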

Recall: Logistic Extension to multi-class

1. Each class c = 1, 2, . . . , K − 1 can have a different weight vector w_c = [w_{c,1}, w_{c,2}, . . . , w_{c,k}, . . . , w_{c,K−1}] and

p(Y = c | φ(x)) = \frac{e^{-(w_c)^T \phi(x)}}{1 + \sum_{k=1}^{K-1} e^{-(w_k)^T \phi(x)}}   for c = 1, . . . , K − 1,

so that

p(Y = K | φ(x)) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{-(w_k)^T \phi(x)}}

Softmax: (Equivalent) LR extension to multi-class

1. Each class c = 1, 2, . . . , K can have a different weight vector w_c = [w_{c,1}, w_{c,2}, . . . , w_{c,K}] and

p(Y = c | φ(x)) = \frac{e^{-(w_c)^T \phi(x)}}{\sum_{k=1}^{K} e^{-(w_k)^T \phi(x)}}   for c = 1, . . . , K.

2. This has one additional (redundant) set of weight-vector parameters.
3. Tutorial 7: Show the (simple) equivalence between the two formulations (a numerical sanity check is sketched below).
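The numerical sanity check referred to above (a sketch, not Tutorial 7's solution): fixing the redundant w_K to the zero vector in the softmax form recovers the (K − 1)-vector form, using the slides' e^{-(w_c)^T φ(x)} convention.

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 4, 3
phi = rng.normal(size=p)                      # feature vector phi(x)
W = rng.normal(size=(K - 1, p))               # weight vectors w_1 ... w_{K-1}

# (K - 1)-vector logistic formulation
scores = np.exp(-W @ phi)                     # e^{-(w_c)^T phi(x)}, c = 1..K-1
p_logistic = np.append(scores, 1.0) / (1.0 + scores.sum())   # classes 1..K-1, then K

# Softmax formulation with the redundant w_K fixed to the zero vector
W_soft = np.vstack([W, np.zeros(p)])
scores_soft = np.exp(-W_soft @ phi)
p_softmax = scores_soft / scores_soft.sum()

print(np.allclose(p_logistic, p_softmax))     # True: the two formulations agree
```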

Multi-layer Perceptron/LR (Neural Network) and VC Dimension

Measure for (non)separability using a classifier?

Aspect 1: Number of functions that can be represented.
Tutorial 3: Given n boolean variables, how many of the 2^{2^n} boolean functions can be represented by the classifier? Eg, with perceptron we saw that for n = 2 it is 14, for n = 3 it is 104, and for n = 4 it is 1882. (A brute-force count for n = 2 is sketched below.)
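The brute-force count referred to above, for n = 2 (the two functions a perceptron cannot represent are XOR and XNOR); the weight grid is an arbitrary choice of mine, dense enough for this tiny case.

```python
import itertools
import numpy as np

n = 2
inputs = np.array(list(itertools.product([0, 1], repeat=n)))    # the 2^n input points

def representable(labels, grid=np.arange(-2, 2.25, 0.25)):
    """True if some perceptron realizes these labels via (w . x + b >= 0)."""
    for w in itertools.product(grid, repeat=n):
        for b in grid:
            if np.array_equal((inputs @ np.array(w) + b >= 0).astype(int), labels):
                return True
    return False

count = sum(representable(np.array(lab))
            for lab in itertools.product([0, 1], repeat=2 ** n))  # all 2^(2^n) labelings
print(count)   # 14
```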

Measure for (non)separability using a classifier?

Aspect 2: Cardinality of the largest set of points that can be shattered.
VC (Vapnik-Chervonenkis) dimension ⇒ richness of the space of functions learnable by a statistical classification algorithm.

A classification function f_w is said to shatter a set of data points (x_1, x_2, . . . , x_n) if, for all assignments of labels to those points, there exists a w such that f_w makes no errors when evaluating that set of data points. The cardinality of the largest set of points that f_w can shatter is its VC-dimension (see extra & optional material on VC-dimension).

VC dimension: Examples

Three points can be shattered using linear classifiers.
[Figure: the labelings of three points in general position, each realized by a line]

Four points can be shattered using axis-parallel rectangles.
[Figure: the labelings of four points, each positive subset enclosed by an axis-parallel rectangle]

Measure for (non)separability using a classifier?

Recall: f_w shatters (x_1, x_2, . . . , x_n) if every assignment of labels to those points is realized without error by some w; the VC-dimension of f is the cardinality of the largest such set.

Eg: For f as a threshold classifier (in ℝ), VC dimension = 1
Eg: For f as an interval classifier (in ℝ), VC dimension = 2
Eg: For f as a perceptron (in 2-dimensional ℝ²), VC dimension = 3 (a brute-force check is sketched below)
Eg: For f as a perceptron (in ℝ^n), VC dimension = n + 1 (see extra slides for proof)
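The brute-force check referred to above (my own illustration): three points in general position in ℝ² can be shattered by a perceptron, while the four XOR-style corner points cannot.

```python
import itertools
import numpy as np

def shattered(points, grid=np.arange(-2, 2.25, 0.25)):
    """True if every labeling of `points` is realized by some (w . x + b >= 0)."""
    for labels in itertools.product([0, 1], repeat=len(points)):
        ok = any(
            np.array_equal((points @ np.array(w) + b >= 0).astype(int), np.array(labels))
            for w in itertools.product(grid, repeat=2) for b in grid
        )
        if not ok:
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR configuration
print(shattered(three), shattered(four))                           # True False
```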

Neural Networks

Great expressive power (recall VC dimension). Keys to non-linearity:
1. non-linearity in activation functions (Tutorial 7)
2. cascade of non-linear activation functions

Varied activation functions: f(x) = g(w^T φ(x)), where g(s) can be any of the following:
1. sign/step function: g(s) = sign(s), or g(s) = 1 if s ∈ [θ, ∞) and g(s) = 0 otherwise
2. sigmoid function: g(s) = 1/(1 + e^{-s}), with possible thresholding using some θ (such as 1/2); also tanh(s) = 2 sigmoid(2s) − 1
3. Rectified Linear Unit (ReLU): g(s) = max(0, s): a most popular activation function
4. Softplus: g(s) = ln(1 + e^s): a smooth version of ReLU
5. Others: leaky ReLU, RBF, Maxout, Hard tanh, absolute value rectification (Section 6.2.1)

Neural Networks: Cascade of layers of perceptrons giving you non-linearity

The 4 Design Choices in a Neural Network (Section 6.1)

Some activation functions

Derivatives of some activation functions
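The plots from these two slides are not reproduced here; as a stand-in, a short sketch of the standard derivative formulas for the activations discussed earlier (these are standard facts, not taken from the missing figures):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Standard derivative formulas for the activations discussed above.
d_sigmoid = lambda s: sigmoid(s) * (1.0 - sigmoid(s))   # sigma'(s) = sigma(s)(1 - sigma(s))
d_tanh = lambda s: 1.0 - np.tanh(s) ** 2                # tanh'(s) = 1 - tanh(s)^2
d_relu = lambda s: (s > 0).astype(float)                # ReLU'(s) = 1 for s > 0, else 0
d_softplus = sigmoid                                    # softplus'(s) = sigma(s)

s = np.linspace(-2, 2, 5)
print(d_sigmoid(s), d_tanh(s), d_relu(s), d_softplus(s), sep="\n")
```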

Some interesting visualizations

https://distill.pub/2018/building-blocks/

https://distill.pub/2017/feature-visualization/

http://colah.github.io/posts/2015-01-Visualizing-Representations/

http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Simple OR using (step) perceptron

[Diagram: step perceptron with inputs x, y and bias input b; weights 1, 1, −0.25; threshold θ = 1/2; output x ∨ y]

AND using (step) perceptron

[Diagram: step perceptron with inputs x, y and bias input b; weights 1, 1, −1.25; threshold θ = 1/2; output x ∧ y]

NOT using perceptron

[Diagram: step perceptron with input x and bias input b; weights −1, 0.75; threshold θ = 1/2; output ¬x]
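The three gate constructions above can be verified directly; a small sketch using the weights and θ = 1/2 from the diagrams (the helper name step_perceptron is mine):

```python
import numpy as np

def step_perceptron(weights, theta=0.5):
    """Step unit: fires 1 iff the weighted sum (bias input fixed at 1) reaches theta."""
    w = np.array(weights, dtype=float)
    return lambda *inputs: int(w @ np.append(inputs, 1.0) >= theta)

OR = step_perceptron([1, 1, -0.25])      # weights for x, y and the bias input
AND = step_perceptron([1, 1, -1.25])
NOT = step_perceptron([-1, 0.75])

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, OR(x, y), AND(x, y), NOT(x))
```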

Feed-forward Neural Nets

[Diagram: inputs x_1, x_2, . . . , x_n and a constant bias input 1, connected by weights w_{01}, w_{11}, w_{21}, . . . , w_{n1} and w_{02}, w_{12}, w_{22}, . . . , w_{n2} to hidden units z_1 = g(Σ ·) and z_2 = g(Σ ·), with unit outputs f_1 = g(·) and f_2 = g(·)]

Eg: Feed-forward Neural Net for XOR (θ = 0)

[Diagram: inputs x_1, x_2 and a constant bias input 1.
Hidden unit z_1 = g(x_1 + x_2 − 0.25), i.e. weights 1, 1, −0.25 (an OR of the inputs).
Hidden unit z_2 = g(−x_1 − x_2 + 1.25), i.e. weights −1, −1, 1.25 (a NAND of the inputs).
Output f_w = g(z_1 + z_2 − 1.25), i.e. weights 1, 1, −1.25 (an AND of z_1 and z_2), so f_w computes x_1 XOR x_2; a check is sketched below.]
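A minimal check of the XOR network above, with the step nonlinearity thresholded at θ = 0 (the structure and weights are from the slide; the helper names are mine):

```python
def g(s, theta=0.0):
    """Step activation with threshold theta = 0, as on the slide."""
    return 1 if s >= theta else 0

def xor_net(x1, x2):
    z1 = g(1 * x1 + 1 * x2 - 0.25)      # OR of the inputs
    z2 = g(-1 * x1 - 1 * x2 + 1.25)     # NAND of the inputs
    return g(1 * z1 + 1 * z2 - 1.25)    # AND of z1 and z2  ->  XOR(x1, x2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))      # prints 0, 1, 1, 0
```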