Machine Learning I 80-629A Apprentissage Automatique I 80-629

Transcript
Page 1: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Neural Networks— Week #5

Machine Learning I 80-629A

Apprentissage Automatique I 80-629

Page 2:

Laurent Charlin — 80-629

This lecture

• Neural Networks

A. Modeling

B. Fitting

C. Deep neural networks

D. In practice

Some of today's material is (adapted) from Joelle Pineau's slides.

Page 3:

From Linear Classification to Neural Networks

Page 4:

Recall Linear Classification

(Figure: labelled data points in the x_1–x_2 plane.)

Pages 5–6:

Recall Linear Classification

f(x) = w^T x + w_0

Decision: predict c_1 if (w^T x + w_0) > 0, and c_2 if (w^T x + w_0) < 0.

Page 7:

What if the data is not linearly separable?

(Figure: the Exclusive OR (XOR) problem in the x_1–x_2 plane.)

Page 8:

What if the data is not linearly separable?

Use the joint decision of several linear classifiers?

Page 9:

What if the data is not linearly separable?

Use the joint decision of several linear classifiers? Two linear models, f(x) = w^T x + w_0 and f'(x) = w'^T x + w'_0, each provide one decision boundary (Exclusive OR (XOR)).

Pages 10–13:

Combining models

Two linear classifiers, each with its own decision rule:

f(x): predict c_1 if (w^T x + w_0) > 0, and c_2 if (w^T x + w_0) < 0

f'(x): predict c_1 if (w'^T x + w'_0) > 0, and c_2 if (w'^T x + w'_0) < 0

1. Evaluate each model

2. Combine the output of the models:

f''(x) = threshold(w''_1 f(x) + w''_2 f'(x) + w''_0)

The joint decision (e.g., f(x) = c_1 AND f'(x) = c_1) carves out regions that no single linear classifier can.
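The two-classifier combination can be sketched in code. A minimal illustration with hand-picked (hypothetical) weights, assuming numpy; the two linear units and the combining threshold unit together compute XOR:

```python
import numpy as np

def threshold(a):
    """Hard threshold: 1 if the pre-activation is positive, else 0."""
    return (a > 0).astype(int)

# Two hand-picked linear classifiers over inputs x = (x1, x2):
# f(x)  fires when x1 + x2 > 0.5 (at least one input is on)
# f'(x) fires when x1 + x2 > 1.5 (both inputs are on)
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([-0.5, -1.5])

def combined(x):
    h = threshold(W @ x + b)                       # 1. evaluate each model
    # 2. combine: XOR = f(x) AND NOT f'(x), as one more threshold unit
    return threshold(np.array([1.0, -2.0]) @ h - 0.5)

for x, target in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    assert combined(np.array(x, dtype=float)) == target
```

The weights are one choice among many; any pair of linear boundaries whose joint decision separates the XOR classes would do.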

Pages 14–20:

Laurent Charlin — 80-629

Combining models (graphical view)

(Figure: the inputs x_1 and x_2 feed two units computing f(x) and f'(x); their outputs feed a third unit computing f''(x), whose output is a label in {c_1, c_2}.)

Each unit is a Perceptron/Neuron; the whole graph is a Neural Network.

Page 21:

Feed-forward neural network

• Each arrow denotes a connection

• Each connection carries a signal, scaled by a weight

• Each node computes the weighted sum of its inputs followed by a non-linear activation

• Connections go left to right

• No connections within a layer

• No backward (recurrent) connections

Input Layer | Hidden Layer(s) | Output Layer

A hidden unit computes σ(∑_i w_i x_i); an output unit computes σ(∑_i w'_i o_i), where the o_i are the hidden-unit outputs and a constant input of 1 provides the bias.

Page 22:

Feed-forward neural network

1. An input layer

• Its size is the number of inputs + 1 (the extra unit is the constant 1 used for the bias)

2. One or more hidden layer(s)

• Their size is a hyper-parameter

3. An output layer

• Its size is the number of outputs

Page 23:

Compute a prediction (forward pass)

1. Compute the hidden units from the inputs: o_i = σ(∑_j w_j x_j)

2. Compute the output units from the hidden units: ŷ = σ(∑_i w'_i o_i)
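The two-step forward pass can be sketched as follows; the layer sizes and random weights are illustrative assumptions, with numpy assumed:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Forward pass: hidden activations first, then the output."""
    o = sigmoid(W1 @ x + b1)       # step 1: hidden layer
    y_hat = sigmoid(W2 @ o + b2)   # step 2: output layer
    return y_hat

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # 2 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # 3 hidden -> 1 output
y_hat = forward(np.array([0.5, -1.0]), W1, b1, W2, b2)
assert 0.0 < y_hat[0] < 1.0   # a sigmoid output always lies in (0, 1)
```

The explicit bias vectors b1, b2 play the role of the constant "1" input on the slide.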

Page 24:

Neural Networks

• A flexible model class

• Highly non-linear models

• Good for regression/classification/density estimation

• The models behind "Deep Learning"

• Historical aspects

Page 25:

Learning the Parameters of a Neural Network

Pages 26–31:

Fitting a neural network

How do we estimate the model's parameters?

• No closed-form solution

• Gradient-based optimization

• Threshold functions are not differentiable

• Replace them with the sigmoid (inverse logit), a soft threshold:

sigmoid(a) := 1 / (1 + exp(−a))
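A quick sketch of why the sigmoid works as a soft, differentiable threshold. The derivative identity σ′(a) = σ(a)(1 − σ(a)) is standard; the finite-difference check below is only illustrative:

```python
import math

def sigmoid(a):
    """Soft threshold: smoothly interpolates between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    """d/da sigmoid(a) = sigmoid(a) * (1 - sigmoid(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

# Far from 0 the sigmoid behaves like the hard threshold...
assert sigmoid(0.0) == 0.5
assert sigmoid(10.0) > 0.9999 and sigmoid(-10.0) < 0.0001

# ...but unlike the threshold it is differentiable everywhere.
eps = 1e-6
numeric = (sigmoid(0.3 + eps) - sigmoid(0.3 - eps)) / (2 * eps)
assert abs(numeric - sigmoid_grad(0.3)) < 1e-8
```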

Page 32:

Laurent Charlin — 80-629

Fit the parameters w (backward pass)

• Derive the gradient of the loss with respect to the parameters w

• Back-propagation starts from the output node(s) and works toward the input(s)

• In practice, the order of the computation is important

For the squared error (y − ŷ)², expanding ŷ layer by layer:

∂(y − ŷ)² / ∂w_j = ∂(y − σ(∑_i w'_i o_i))² / ∂w_j = ∂(y − σ(∑_i w'_i σ(∑_j w_j x_j)))² / ∂w_j
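The chain-rule gradient can be verified numerically. A minimal sketch for the output-layer weights only, with illustrative shapes and numpy assumed:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, w_out, x, y):
    """Squared error of a tiny 2-layer net: (y - yhat)^2."""
    o = sigmoid(w @ x)                 # hidden activations
    y_hat = sigmoid(w_out @ o)
    return (y - y_hat) ** 2

def grad_w_out(w, w_out, x, y):
    """Chain rule, starting at the loss and moving backward:
    d loss/d yhat = -2 (y - yhat); d yhat/d (w_out.o) = yhat (1 - yhat)."""
    o = sigmoid(w @ x)
    y_hat = sigmoid(w_out @ o)
    return -2.0 * (y - y_hat) * y_hat * (1.0 - y_hat) * o

rng = np.random.default_rng(1)
w, w_out = rng.normal(size=(3, 2)), rng.normal(size=3)
x, y = np.array([0.2, -0.4]), 1.0

g = grad_w_out(w, w_out, x, y)
eps = 1e-6
for j in range(3):   # finite-difference check, coordinate by coordinate
    d = np.zeros(3); d[j] = eps
    numeric = (loss(w, w_out + d, x, y) - loss(w, w_out - d, x, y)) / (2 * eps)
    assert abs(numeric - g[j]) < 1e-6
```

The gradient for the hidden-layer weights applies the same chain rule one layer further back.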

Page 33:

Gradient descent

• No closed-form formula

• Repeat the following steps (for t = 0, 1, 2, … until convergence):

1. Calculate a gradient ∇w_ij^t

2. Apply the update w_ij^{t+1} = w_ij^t − α ∇w_ij^t

• Stochastic gradient descent: one example at a time

• Batch gradient descent: all examples at a time
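Both update rules can be sketched on a toy least-squares problem; the data, learning rates, and iteration counts are illustrative assumptions:

```python
import numpy as np

def batch_step(w, X, Y, alpha):
    """One update from the gradient averaged over ALL examples."""
    grad = 2 * ((X @ w - Y)[:, None] * X).mean(axis=0)
    return w - alpha * grad

def sgd_step(w, x, y, alpha):
    """One update from a SINGLE example (stochastic gradient descent)."""
    return w - alpha * 2 * (w @ x - y) * x

# Toy problem: recover w* = (2, -1) for y = w* . x under squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = X @ np.array([2.0, -1.0])

w_batch = np.zeros(2)
for _ in range(200):
    w_batch = batch_step(w_batch, X, Y, alpha=0.1)

w_sgd = np.zeros(2)
for i in rng.integers(len(X), size=2000):
    w_sgd = sgd_step(w_sgd, X[i], Y[i], alpha=0.05)

assert np.allclose(w_batch, [2.0, -1.0], atol=1e-3)
assert np.allclose(w_sgd, [2.0, -1.0], atol=1e-2)
```

SGD trades noisier steps for many more of them per pass over the data.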

Page 34:

What can an MLP learn?

1. A single unit (neuron)

• A linear classifier with a non-linear output

2. A network with a single hidden layer

• Any continuous function (but it may require exponentially many hidden units as a function of the inputs)

3. A network with two (or more) hidden layers

• Any function can be approximated with arbitrary accuracy

Page 35:

The Importance of Representations

Pages 36–38:

From Neural Networks to Deep Neural Networks

A neural network → a deep neural network (more hidden layers)

"Modern deep learning provides a powerful framework for supervised learning. By adding more layers and more units within a layer, a deep network can represent functions of increasing complexity."

— Deep Learning, Part II, p. 163, http://www.deeplearningbook.org/contents/part_practical.html

Page 39:

Another view of deep learning

• Representations are important

[From: Honglak Lee (and Graham Taylor)]

Pages 40–43:

Representations

[From: Honglak Lee (and Graham Taylor)]

Page 44:

Data → Layer 1 → Layer 2 → … → Classifier → Output

Page 45:

Machine Translation

(Figure: French, English, and Spanish encoders map sentences into a Universal Sentence Representation; French, English, and Spanish decoders map it back out.)

Pages 46–47:

Machine learning: make machines that can learn.

Deep learning: a set of machine learning techniques based on neural networks.

Representation learning: a machine learning paradigm to discover data representations.

Idea: Hugo Larochelle

Page 48:

Deep neural networks

• Several layers of hidden nodes

• Parameters at different levels of representation

Figure 1: We would like the raw input image to be transformed into gradually higher levels of representation, representing more and more abstract functions of the raw input, e.g., edges, local shapes, object parts, etc. In practice, we do not know in advance what the "right" representation should be for all these levels of abstraction, although linguistic concepts might help in guessing what the higher levels should implicitly represent.

[Figure from Yoshua Bengio]

Page 49:

[Figures from Yoshua Bengio]

Page 50:

Neural Network Hyper-parameters

Page 51:

Hyperparameters

1. Model specific

• Activation functions (output & hidden), network size

2. Optimization objective

• Regularization, early stopping, dropout

3. Optimization procedure

• Momentum, adaptive learning rates

Pages 52–56:

Activation Functions

• Non-linear functions that transform the weighted sum of the inputs, e.g. f(∑_i w_i x_i)

• Non-linearities increase model representation power

• Non-linearities increase the difficulty of optimization

• Different functions are used for hidden units and output units
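One way to see why the non-linearity matters: without it, stacked layers collapse into a single linear map, so depth adds no representational power. A small numpy sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # layer 1: 3 inputs -> 4 units
W2 = rng.normal(size=(2, 4))   # layer 2: 4 units  -> 2 outputs
x = rng.normal(size=3)

# Without an activation, two layers equal ONE linear map (W2 @ W1):
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# With an elementwise non-linearity f in between, the composition is no
# longer linear: doubling the input does not double the output.
f = np.tanh
y1 = W2 @ f(W1 @ x)
y2 = W2 @ f(W1 @ (2 * x))
assert not np.allclose(2 * y1, y2)
```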

Pages 57–59:

Activation functions — hidden units

• Traditional

• Logistic-like units: f(z) = logit⁻¹(z) = 1 / (1 + exp(−z)), or f(z) = tanh(z)

• Saturate on both sides

• Differentiable everywhere

• Rectified linear units (ReLU): f(z) = max{0, z}

• Non-differentiable at a single point

• Now standard: better results / faster training

• Shuts off units

• Leaky ReLU: g(z) = max{0, z} + α min{0, z}

[https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40811.pdf]
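The ReLU and leaky-ReLU formulas can be written directly; a minimal numpy sketch (the leaky slope α = 0.01 is an illustrative choice):

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max{0, z}. Negative inputs shut the unit off."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: max{0, z} + alpha * min{0, z} keeps a small negative slope,
    so 'dead' units still receive a gradient."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
assert np.allclose(relu(z), [0.0, 0.0, 0.0, 0.5, 2.0])
assert np.allclose(leaky_relu(z), [-0.02, -0.005, 0.0, 0.5, 2.0])
```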

Page 60:

Activation functions — output units

σ(a) = 1 / (1 + exp(−a))    softmax(a)_i = exp(a_i) / ∑_j exp(a_j)

Output type | Output unit | Equivalent statistical distribution
Binary (0, 1) | sigmoid σ | Bernoulli
Categorical (0, 1, 2, 3, …, k) | softmax | Multinoulli
Continuous | Identity(z) = z | Gaussian
Multi-modal | mean, (co-)variance, components | Mixture of Gaussians
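The two output units can be sketched directly from their formulas; subtracting max(a) before exponentiating is a standard numerical-stability trick, not part of the definition:

```python
import numpy as np

def sigmoid(a):
    """Binary output: a valid Bernoulli parameter in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Categorical output: softmax(a)_i = exp(a_i) / sum_j exp(a_j)."""
    e = np.exp(a - np.max(a))   # shift by max(a) for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)   # a valid categorical distribution
assert p[0] > p[1] > p[2]         # ordering follows the pre-activations
assert 0.0 < sigmoid(0.3) < 1.0
```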

Page 61:

Regularization

• Weight decay on the parameters

• An L2 penalty on the parameters

• Early stopping of the optimization procedure

• Monitor the validation loss and terminate when it stops improving

• Number of hidden layers and hidden units per layer
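Early stopping can be sketched as a small loop over validation losses. A minimal illustration; the `patience` parameter and the example losses are assumptions, and a real training loop would also checkpoint the best weights:

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping once
    `patience` epochs pass with no improvement."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # validation loss stopped improving: terminate
    return best_epoch

# Validation loss improves, then starts climbing: stop at its minimum (epoch 3).
assert early_stopping([1.0, 0.8, 0.7, 0.65, 0.7, 0.8, 0.9]) == 3
```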

Page 62: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Laurent Charlin — 80-629

Momentum

• Pro: Can allow you to jump over small local optima

• Pro: Goes faster through flat areas by using acquired speed

• Con: Can also jump over global optima

• Con: One more hyper-parameter to tune

• More advanced adaptive steps: adagrad, adam

32

w_{t+1} = w_t − α ∇_{w_t}   (Gradient Descent)

v = β v − α ∇_{w_t}   (Gradient Descent w/ momentum)

w_{t+1} = w_t + v
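The momentum update can be sketched in a few lines of NumPy. The loss here is a simple quadratic chosen for illustration; α and β values are typical but hypothetical:

```python
import numpy as np

def grad(w):
    # Gradient of the illustrative quadratic loss f(w) = 0.5 * ||w||^2.
    return w

w = np.array([2.0, -3.0])
v = np.zeros_like(w)
alpha, beta = 0.1, 0.9  # learning rate and momentum coefficient

for _ in range(200):
    v = beta * v - alpha * grad(w)  # accumulate velocity
    w = w + v                       # take the step with momentum

# w spirals in toward the minimum at the origin
```

Setting β = 0 recovers plain gradient descent; larger β carries more "speed" through flat regions, at the cost of possibly overshooting.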

Page 63: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Wide or Deep?

33

Page 64: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Wide or Deep?

33

[Figure 6.6, Deep Learning, book]

Page 65: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Wide or Deep?

34

[Figure 6.7, Deep Learning, book]

Page 66: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Dropout

35 [Figure 7.6, Deep Learning]

• Standard regularization technique

• At training, drop a percentage of the units

• Used for non-output layers

• Prevents co-adaptation / Bagging

• At test: use the full network

• Normalize the weights
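A minimal sketch of this scheme (the drop rate here is a hypothetical choice): at training time each unit is zeroed out with probability `p_drop`; at test time the full network is used and activations are scaled by the keep probability, which is equivalent to normalizing the weights:

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5  # illustrative fraction of units dropped at training time

def dropout_train(h):
    # Each unit is kept independently with probability 1 - p_drop.
    mask = rng.random(h.shape) >= p_drop
    return h * mask

def dropout_test(h):
    # Full network at test time; scale by the keep probability so
    # expected activations match those seen during training.
    return h * (1.0 - p_drop)

h = np.ones(1000)
h_train = dropout_train(h)  # roughly half the units are zeroed
h_test = dropout_test(h)    # every unit scaled by 0.5
```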

Page 67: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Page 68: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Neural Network Takeaways

Page 69: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Neural Networks takeaways

• Very flexible models

• Composed of simple units (neurons)

• Adapt to different types of data

• (Highly) non-linear models

• E.g., Can learn to order/rank inputs easily

• Scale to very large datasets

• May require “fiddling” with model architecture + optimization hyper-parameters

• Standardizing data can be very important

38

Page 70: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Where do NNs shine

• Input is high-dimensional discrete or real-valued

• Output is discrete or real-valued, or a vector of values

• Possibly noisy data

• Form of target function is unknown

• Human interpretability is not important

• The computation of the output based on the input has to be fast

39

Page 71: Machine Learning I 80-629A Apprentissage Automatique I 80-629

40

Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples.

Other tasks, that cannot be described as associating one vector to another, or that are difficult enough that a person would require time to think and reflect in order to accomplish the task, remain beyond the scope of deep learning for now.

Deep Learning — Part II, p.163 http://www.deeplearningbook.org/contents/part_practical.html

Page 72: Machine Learning I 80-629A Apprentissage Automatique I 80-629

Neural Networks in Practice

Page 73: Machine Learning I 80-629A Apprentissage Automatique I 80-629

In practice

• Software now derives gradients automatically

• You specify the architecture of the network

• Connection pattern

• Number of hidden layers

• Number of hidden units per layer

• Activation functions

• Learning rate (learning rate updates)

• Dropout

• For intuitions: https://playground.tensorflow.org

42
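The architecture choices above (connection pattern, number of layers, units, activation) can be expressed as plain data in a tiny NumPy forward pass. The sizes and activation here are hypothetical examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Architecture as data: input width, two hidden layers, output width.
layer_sizes = [4, 8, 8, 3]
activation = np.tanh  # hidden-layer activation (illustrative choice)

# One (weight matrix, bias vector) pair per fully connected layer.
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    # Apply each layer; no activation on the final (output) layer here.
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = activation(x)
    return x

out = forward(rng.normal(size=(5, 4)))  # 5 examples, 4 features -> (5, 3)
```

In practice a framework builds exactly this kind of computation graph from your specification and derives the gradients for you.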

Page 74: Machine Learning I 80-629A Apprentissage Automatique I 80-629

A selection of standard tools (in python)

• Scikit-learn

• Machine learning toolbox

• Feed-forward neural networks

• Neural network specific tools

• PyTorch, TensorFlow

• Keras

• More specific tools for specific tasks:

• Caffe for computer vision, PySpark for distributed computations, NLTK for natural language processing

43
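As a concrete example of the first tool on the list, a scikit-learn feed-forward network fits in a few lines. The dataset and hyper-parameters here are illustrative:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Toy non-linearly-separable data (two interleaving half-moons).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16, 16),  # two hidden layers
                    activation="relu",
                    learning_rate_init=0.01,
                    max_iter=1000,
                    random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # held-out accuracy, well above chance here
```

A linear classifier cannot separate this data; the hidden layers are what make the non-linear decision boundary possible.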