Page 1:

The Emergence Theory of Representation Learning

Stefano Soatto, Alessandro Achille
UCLA & AWS


October 3, 2019

Page 2:

Menu

• Part I: What is a representation? Desiderata

• What variational principle(s) define optimal representations?

• Part II: What does Deep Learning have to do with it? Generalization and the Information Lagrangian

• Duality and the Emergence Bound

• Where is the information in deep neural networks?

• Part III: What if the task is not known completely?

• Critical Learning Periods

• The space of learning tasks

• Task Topology, Task Reachability


Page 3:

Desiderata of Representations

data x · task y · representation z · nuisances n

y = x (compression, auto-encoding, prediction); encompasses supervised, unsupervised, self-supervised, semi-supervised, …

Page 4:

Desiderata of Representations

data x · task y · representation z · nuisances n

• Sufficient (for the task): I(y; z) = I(y; x)

• Invariant (to nuisances): n ⊥ y ⇒ I(n; z) = 0

• Minimal: I(x; z) = minimal

Page 5:

A Variational Principle?

data x · task y · representation z · nuisances n

• Sufficient (for the task): I(y; z) = I(y; x)

• Invariant (to nuisances)

• Minimal (information): I(z; x) smallest

• Easy to work with?

Information Bottleneck (IB) [Tishby-Bialek-Pereira ’99]:

min_{q(z|x)} L ≐ H_{p,q}(y|z) + β I(z; x),  equivalently  min_{q(z|x)} L ≐ −I(y; z) + β I(z; x)
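In practice this Lagrangian is optimized with a variational bound. Below is a minimal sketch in the style of the variational Information Bottleneck (not part of the talk; layer sizes, β, and names are illustrative): a Gaussian encoder q(z|x) whose KL to a fixed N(0, I) prior upper-bounds I(z; x), plus the usual cross-entropy for H_{p,q}(y|z).

```python
# Sketch: IB Lagrangian L = H_{p,q}(y|z) + beta * I(z;x), optimized via the
# variational bound I(z;x) <= E_x[ KL(q(z|x) || N(0,I)) ].
import torch
import torch.nn.functional as F

class VIB(torch.nn.Module):
    def __init__(self, d_in=784, d_z=32, n_classes=10):
        super().__init__()
        self.enc = torch.nn.Linear(d_in, 2 * d_z)   # outputs (mu, log-variance)
        self.dec = torch.nn.Linear(d_z, n_classes)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # z ~ q(z|x)
        kl = 0.5 * (mu**2 + logvar.exp() - logvar - 1).sum(-1)   # KL to N(0,I)
        return self.dec(z), kl.mean()

def ib_loss(model, x, y, beta=1e-3):
    logits, kl = model(x)
    # cross-entropy estimates H_{p,q}(y|z); kl upper-bounds I(z;x)
    return F.cross_entropy(logits, y) + beta * kl
```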

Page 6:

• Claim: if z is sufficient and n is a nuisance, then

I(z; n) ≤ I(z; x) − I(x; y)
(invariance) (minimality) (constant)

• and there exists a nuisance for which equality holds

A. Achille and S. Soatto, On the Emergence of Invariance and Disentangling in Deep Representations, JMLR 2018; arXiv:1706.01350

A Variational Principle?

Page 7:

Examples

• Nuisances have a group structure: Maximal Invariance

• Localization (SLAM)

• Diffeo/homeomorphisms of the domain and range of an image:

• General viewpoint and illumination invariants (Attributed Reeb Trees) [Sundaramoorthi, Varadarajan, etc.] 2005-2009

• Local affine domain & range transformations:

• DSP-SIFT [Dong] 2011-2015

• Non-invertible nuisances:

• Occlusions, Scale… Give up on Maximal Invariance


Page 8:

This Information Bottleneck is wishful thinking

• The task is a function of (test) data we have not yet seen!

• The Information Bottleneck is a statement of desire

z ∼ p(z|x)

min_{q(z|x)} L ≐ H_{p,q}(y|z) + β I(z; x)

A. Saxe et al., On the Information Bottleneck Theory of Deep Learning, ICLR 2018

Page 9:

Desiderata of Deep Learning

dataset D = {x_i, y_i} ∼ p, i = 1, …, N;  model: q_w(y|x) → p(y|x), trained by SGD

q_w(y|x) = argmin_w H_{p,q}(D|w) ≐ Σ_{i=1}^N −log q_w(y_i|x_i)

generalization, PAC-Bayes bound (Catoni, 2008; McAllester, 2013):

L_test(q_w(y|x)) ≤ [1 / (N(1 − 1/2β))] [H_{p,q}(D|w) + β KL(q(w|D) ‖ p(w))]

(H_{p,q}(D|w): empirical cross-entropy?; the KL term relates to + β I(D; w))

Page 10:

The Information Lagrangian

dataset D = {x_i, y_i} ∼ p, i = 1, …, N;  model: q_w(y|x) → p(y|x)

Past data (training set), PAC-Bayes:
L_test(q_w(y|x)) ≤ [1 / (N(1 − 1/2β))] [H_{p,q}(D|w) + β KL(q(w|D) ‖ p(w))]
⇒ Information Lagrangian: L_N(w) ≐ H_{p,q}(D|w) + β I(D; w)

Future data (test sample), Information Bottleneck:
min_{q(z|x)} L ≐ H_{p,q}(y|z) + β I(z; x)

? (what is the relation between the two?)

Page 11:

A few questions (preview)

• What is the relation between the two bottlenecks?

• What “information”? The weights are fixed, and there is only one dataset!

• What is the “prior”? and the “posterior”?

• The second term of the Information Lagrangian is not there in practice!

Page 12:

Measuring Information by Adding Noise


Idea: We can estimate the amount of information contained in the weights by corrupting them with noise and measuring the decrease in performance.

C. Shannon, Prediction and Entropy of Printed English, Bell System Technical Journal, 1951

Example: Shannon (1951) estimates the information content of the English language by corrupting random letters and measuring the reconstruction error of English speakers.

“Thif is a vevy moisy party” → “This is a very noisy party”
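The same idea transfers directly to a network's weights. A minimal sketch (assuming PyTorch; `model`, `data_loader`, and `loss_fn` are placeholders, not from the talk): perturb the weights with Gaussian noise of increasing variance and record how quickly the loss degrades; weights whose corruption barely hurts performance store little information.

```python
# Sketch: estimate how much information the weights carry by corrupting them
# with Gaussian noise and measuring the increase in loss.
import copy
import torch

@torch.no_grad()
def loss_under_noise(model, data_loader, loss_fn, sigma):
    noisy = copy.deepcopy(model)
    for p in noisy.parameters():
        p.add_(sigma * torch.randn_like(p))          # w -> w + sigma * eps
    total, n = 0.0, 0
    for x, y in data_loader:
        total += loss_fn(noisy(x), y).item() * len(y)
        n += len(y)
    return total / n

# Sweep sigma and look at the loss-vs-noise curve: a flat curve means the
# weights tolerate corruption, i.e. they store little information.
# curve = [loss_under_noise(model, loader, loss_fn, s) for s in (0.0, 0.01, 0.1)]
```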

[Image: first page of C. E. Shannon, “Prediction and Entropy of Printed English” (manuscript received Sept. 15, 1950). From the abstract: “A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.”]

L(w) = H_{p,q}(D|w) + β KL(q(w|D) ‖ p(w))

Page 13:

The Information in a Deep Neural Network

L(w) = H_{p,q}(D|w) + β KL(q(w|D) ‖ p(w))

(q(w|D): output of training; p(w): fixed prior)

A. Achille et al., The Information Complexity of Learning Tasks, their Structure and their Distance, arXiv 2019
H. Li et al., Visualizing the Loss Landscape of Neural Nets, ICLR 2018

Fisher Information: with a Gaussian prior,

KL = ‖w‖²/σ² + log |2σ²N F + I|

where F is the Fisher Information Matrix, i.e. the curvature of the loss landscape. ⇒ Implicitly minimized by SGD.

[Figure: loss vs. weight configuration, contrasting sharp minima with a flat minimum.]


* Hochreiter and Schmidhuber, Flat Minima, Neural Computation, 1997

Shannon Information: adapted prior p(w) := E_D[q(w|D)]

E_D[KL] = I(w; D)

Page 14:

A few questions (preview)

• The second term of the Information Lagrangian is not there in practice!

• (inductive bias of SGD)

• Now that we can compute the Info in the Weights, what does it look like as we learn?

• (critical learning periods in deep networks)

L(w) = H_{p,q}(D|w) + β KL(q(w|D) ‖ p(w))

[Figure: Information in the Weights vs. Training Epoch.]

Page 15:

Relation between Fisher and Shannon


SGD minimizes the Fisher Information of the Weights. However, generalization is governed by the Shannon Information.

Proposition. Assuming the dataset is parametrized in a differentiable way, we have:

I(w; D) ≈ H(D) − E_D[ log( (2πe)^k / |∇_D w* F(w*) ∇_D w*ᵀ| ) ]

where w* = w*(D) is the result of running SGD on dataset D and F(w) is the Fisher Information Matrix at w.

Achille and Soatto, Where is the Information in a Deep Network?, 2018

Page 16:

Emergence Bound: Simple weights → simple activations

Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018
Achille and Soatto, Where is the Information in a Deep Neural Network?, 2019

Definition. Let z = f_w(x) be a layer of a network, and let z_n be the representation obtained by adding noise to the weights. Define the effective information in the activations as I_eff(x; z) = I(x; z_n).

Theorem (Emergence Bound). Let z = f_w(x) be a layer of a network. To first order, the effective information in the activations is

I(x; z) ≈ H(x) − log( (2πe)^k / |∇_x f_w(x)ᵀ J_fᵀ F(w) J_f ∇_x f_w(x)| )

where F(w) is the Fisher Information of the weights and J_f is the Jacobian of f_w w.r.t. w: the information in the activations is controlled by the Fisher of the weights.
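The definition of effective information suggests a direct numerical probe, sketched below (assuming PyTorch; the layer handle, noise scale, and function names are illustrative, and a full estimate of I_eff would need more than this mean-squared shift): add noise to the weights, recompute the layer's activations, and compare.

```python
# Sketch: probe the effective information of a layer by comparing its
# activations z = f_w(x) with z_n = f_{w + noise}(x).
import copy
import torch

@torch.no_grad()
def activation_shift(model, layer_name, x, sigma=0.01):
    acts = {}
    def save(key):
        def hook(module, inputs, output):
            acts[key] = output.detach()
        return hook

    h = dict(model.named_modules())[layer_name].register_forward_hook(save("z"))
    model(x); h.remove()

    noisy = copy.deepcopy(model)
    for p in noisy.parameters():
        p.add_(sigma * torch.randn_like(p))           # perturb the weights
    h = dict(noisy.named_modules())[layer_name].register_forward_hook(save("z_n"))
    noisy(x); h.remove()

    # small shift = activations robust to weight noise = low effective information
    return (acts["z"] - acts["z_n"]).pow(2).mean().item()
```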

Page 17:

Two Bottlenecks

Information Bottleneck (future data; x → z → y: data → activations → label; invariance):

min_{q(z|x)} L ≐ H_{p,q}(y|z) + β I(z; x)

Information Lagrangian (past data; D → w → p(y|x): dataset → weights → real distribution; generalization):

L_N(w) ≐ H_{p,q}(D|w) + β KL(q(w|D) ‖ p(w))

WHERE IS THE INFORMATION? WHAT PRIORS? WHAT POSTERIOR? ANY RELATION BETWEEN THESE TWO?

Page 18:

[Diagram: Training Set { … }, labels (car, horse, deer, …) → Weights; Test Image → Minimal Invariant Representation.]

Page 19:

Emergence Bound

• A sufficient representation that minimizes the information the weights contain about past data maximizes the invariance of the representation of future data.

• Pertains to the combination of DNNs (sufficient capacity to overfit) and SGD (inductive bias)

Page 20:

Phase transition

Using the regularized loss L(w) = H_{p,q}(D|w) + β KL(q(w|D) ‖ p(w)):

β < 1 ⇒ overfitting
β > 1 ⇒ underfitting

For random labels there is a transition between over- and under-fitting at β = 1.

[Plot: train error (0%-100%) vs. value of β (10⁻² to 10²) for All-CNN, ResNet, and Small AlexNet.]

Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018
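A minimal sketch of training with this regularized loss (assuming PyTorch; a factorized Gaussian posterior q(w|D), a fixed Gaussian prior p(w), and illustrative sizes), of the kind one would sweep over β to reproduce the transition:

```python
# Sketch: L(w) = H_{p,q}(D|w) + beta * KL(q(w|D) || p(w)) with a factorized
# Gaussian posterior over the weights (reparameterization trick).
import torch
import torch.nn.functional as F

class BayesLinear(torch.nn.Module):
    def __init__(self, d_in, d_out, prior_sigma=1.0):
        super().__init__()
        self.mu = torch.nn.Parameter(0.01 * torch.randn(d_out, d_in))
        self.rho = torch.nn.Parameter(torch.full((d_out, d_in), -5.0))
        self.prior_sigma = prior_sigma

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)   # w ~ q(w|D)
        # closed-form KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), summed
        self.kl = (torch.log(self.prior_sigma / sigma)
                   + (sigma**2 + self.mu**2) / (2 * self.prior_sigma**2)
                   - 0.5).sum()
        return F.linear(x, w)

# For each beta in torch.logspace(-2, 2, 9): train with
#   loss = F.cross_entropy(model(x), y) + (beta / N) * sum_of_layer_kl_terms
# then plot final train error vs. beta to look for the transition at beta = 1.
```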

Page 21:


What’s next?

1. This addresses what is an optimal representation for a given task

2. Even an optimal representation may be useless (garbage-in/garbage-out)

3. What if the task is not known ahead of time?

4. When are two tasks close? What is the distance between two tasks?

5. Can one predict if a model pre-trained on a task will perform well on another?

Page 22:

A Topology on the Space of Tasks


Distance between tasks:

d(D₁ → D₂) = I(D₁ D₂; w) − I(D₁; w)
(complexity of learning the two together, minus the complexity of learning one)

Achille, Paolini, Mbeng, Soatto, The Information Complexity of Learning Tasks, their Structure and their Distance, 2019

Notice that this is an asymmetric distance.

Page 23:


A Topology on the Space of Tasks

Kolmogorov (asymmetric) distance between tasks:

d(D₁ → D₂) = K(D₂ | D₁)

How much more structure do we need to learn?

• Difficult task to easy task
• Easy task to difficult task
• Similar tasks cluster together

Page 24:

[Figure: task distances among MNIST, Fashion-MNIST, SVHN, CIFAR-10, ImageNet, and KITTI.]

Page 25:

TASK2VEC: Embedding tasks in a metric space


Recovers species taxonomy on iNaturalist
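In sketch form (assuming PyTorch and following the diagonal-Fisher recipe of the TASK2VEC paper; the probe/head split and function names are illustrative): embed a task as the diagonal Fisher Information of a fixed probe network computed on the task's data, and compare tasks by cosine distance.

```python
# Sketch: TASK2VEC-style embedding = diagonal Fisher Information of a fixed
# probe network, computed on the task's data; compare tasks by cosine distance.
import torch
import torch.nn.functional as F

def task_embedding(probe, head, loader, n_batches=20):
    # probe: shared feature extractor; head: task-specific classifier
    fisher = [torch.zeros_like(p) for p in probe.parameters()]
    for i, (x, y) in enumerate(loader):
        if i == n_batches:
            break
        probe.zero_grad(); head.zero_grad()
        F.cross_entropy(head(probe(x)), y).backward()
        for f, p in zip(fisher, probe.parameters()):
            if p.grad is not None:
                f += p.grad.detach() ** 2
    return torch.cat([f.flatten() for f in fisher]) / n_batches

def task_distance(e1, e2):
    # 1 - cosine similarity between Fisher embeddings
    return 1.0 - F.cosine_similarity(e1, e2, dim=0).item()
```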

[Figure: TASK2VEC task embeddings vs. domain embeddings, colored by iNaturalist taxon (Actinopterygii, Amphibia, Arachnida, Aves, Fungi, Insecta, Mammalia, Mollusca, Plantae, Protozoa, Reptilia) and by iMaterialist attribute (Category, Color, Gender, Material, Neckline, Pants, Pattern, Shoes).]

Recovers a meaningful topology on hundreds of tasks

Achille et al., TASK2VEC: Task Embedding for Meta-Learning, 2019

Page 26:

Proposing an optimal expert for the task


Figure 3: TASK2VEC often selects the best available experts. Violin plot of the test-error distribution (shaded) on tasks from the CUB-200 dataset (columns), obtained by training a linear classifier over several expert feature extractors (points). Most specialized feature extractors perform similarly on a given task, similar to or worse than a generic feature extractor pre-trained on ImageNet (blue triangles). However, in some cases a carefully chosen expert, trained on a related task, can greatly outperform all others (long lower whiskers). The model-selection algorithm based on TASK2VEC can predict an expert to use for the task (red cross, lower is better) and often recommends the optimal, or near-optimal, feature extractor without performing an expensive brute-force training and evaluation over all available experts. Columns are ordered by the norm of the task-embedding vector: tasks with lower embedding norm have lower error, and more “complex” tasks (higher embedding norm) tend to benefit more from a specialized expert.

…results: Tasks that are correlated in the dataset, such as binary classes corresponding to the same categorical attribute, may end up far away from each other and close to other tasks that are semantically more similar (e.g., the jeans category task is close to the ripped attribute and the denim material). In the visualization, this non-trivial grouping is reflected in the mixture of colors of semantically related nearby tasks.

We also compare the TASK2VEC embedding with a domain-embedding baseline, which only exploits the input distribution p(x) rather than the task distribution p(x, y). While some tasks are highly correlated with their domain (e.g., tasks from iNaturalist), other tasks differ only in the labels (e.g., all the attribute tasks of iMaterialist, which share the same clothes domain). Accordingly, the domain embedding recovers similar clusters on iNaturalist. However, on iMaterialist the domain embedding collapses all tasks to a single uninformative cluster (not a single point due to slight noise in the embedding computation).

Task embedding encodes task difficulty. The scatter plot in Fig. 3 compares the norm of embedding vectors vs. the performance of the best expert (or a task-specific model for the cases where we have the diagonal computed). As suggested by the analysis for the two-layer model, the norm of the task embedding also correlates with the complexity of the task on real tasks and architectures.

4.2. Model Selection

Given a task, our aim is to select an expert feature extractor that maximizes the classification performance on that task. We propose two strategies: (1) embed the task and select the feature extractor trained on the most similar task, and (2) jointly embed the models and tasks, and select a model using the learned metric (see Section 3.4). Notice that (1) does not use knowledge of the model performance on various tasks, which makes it more widely applicable but requires that we know what task a model was trained for, and may ignore the fact that models trained on slightly different tasks may still provide an overall better feature extractor (for example by over-fitting less to the task they were trained on).

In Table 1 we compare the overall results of the various proposed metrics on the model-selection meta-tasks. On both the iNat+CUB and Mixed meta-tasks, the Asymmetric TASK2VEC model selection is close to the ground-truth optimal, and significantly improves over both chance and over using a generic ImageNet expert. Notice that our method has O(1) complexity, while searching over a collection of N experts is O(N).

Error distribution. In Fig. 3 we show in detail the error distribution of the experts on multiple tasks. It is interesting to observe that the classification error obtained using most experts clusters around some mean value, and little improvement is observed over using a generic expert. […]

This allows selecting the best expert for a task, substantially reducing error and training time.

Page 27:


A snag: Critical Periods

Two almost identical tasks, yet it is not possible to fine-tune from one to the other.

Excursus: Critical Periods for learning

Follow-up: Task reachability. Complexity is physical.

Page 28:


Critical periods

Critical periods: a time period in early development during which sensory deficits can permanently impair the acquisition of a skill

Examples: monocular deprivation, cataracts, imprinting, language acquisition

[Timeline: deficit from day 0 to ~180, followed by normal training; the kitten does not recover vision in the covered eye. Hubel and Wiesel; image from Cnops et al., 2008.]

Page 29:


Critical periods in Deep Networks

Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018

[Timeline: deficit from epoch 0 to ~160, followed by normal training.]

Show the network blurred images to simulate a cataract. The network does not classify correctly if the deficit is removed too late.

A short deficit at epoch ~40 is enough to permanently damage the network!
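The deficit experiment is straightforward to sketch (assuming PyTorch/torchvision; the blur parameters, epoch counts, and function names are illustrative): train on blurred inputs for the first deficit_epochs, then switch to clean images and check whether accuracy recovers.

```python
# Sketch: simulate a "cataract" by blurring inputs for an initial window,
# then remove the deficit and continue normal training.
import torch
import torchvision.transforms as T

blur = T.GaussianBlur(kernel_size=9, sigma=4.0)

def train_with_deficit(model, loader, opt, loss_fn,
                       epochs=160, deficit_epochs=40):
    for epoch in range(epochs):
        for x, y in loader:
            if epoch < deficit_epochs:
                x = blur(x)                 # low-level deficit: blurred images
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    # Compare final test accuracy across deficit_epochs values: per the talk,
    # a long enough deficit permanently reduces accuracy even after removal.
```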

Page 30:


Critical learning periods and Information in Weights

Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018

Sensitivity to deficits peaks when the network is absorbing information, and is minimal when the network is consolidating information.

Page 31:


Solutions Move after Critical Periods

The move is not for the better: it does not affect performance.

A. Details of the Experiments

We use the standard ResNet-18 architecture [11] in all the experiments, unless stated otherwise. We train all the networks using SGD (momentum = 0.9) with a batch size of 128 for 200 epochs, except when regularization is applied late during training, where we train for an extra 50 epochs to ensure the convergence of the networks. For the experiments with ResNet-18, we use an initial learning rate of 0.1, with a learning-rate decay factor of 0.97 per epoch and a weight decay coefficient of 0.0005. In the piecewise-constant learning-rate experiment, we use an initial learning rate of 0.1 and decay it by a factor of 0.1 every 60 epochs, while in the constant learning-rate experiment we fix it to 0.001. For the All-CNN [31] experiments, we use an initial learning rate of 0.05 with a weight decay coefficient of 0.001 and a learning-rate decay of 0.97 per epoch. For All-CNN, we do not use dropout and instead add Batch Normalization to all layers.

A.1 Path Plotting

We follow the method proposed by [23] to plot the training trajectories of the DNNs for varying durations of regularization (Figure 2A). More precisely, we combine the weights of the network (stored at regular intervals during training) for different durations of regularization into a single matrix M, and then project them onto the first two principal components of M.

Figure 6: PCA projection of the training paths for All-CNN on CIFAR-10.

B. Additional Experiments

Figure 7: Trace of the FIM as a function of training epochs, for delayed and sliding-window application of weight decay, for ResNet-18 on CIFAR-10.

Page 32:


High-level deficits do not have a critical period

Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018
Picture from “The world is upside down” — The Innsbruck Goggle Experiments of Theodor Erismann and Ivo Kohler, Sachse et al.

Deficits that only change high-level statistics of the data do not show a critical period.

[Excerpt from Sachse et al., Cortex 92 (2017) 222-232, on the Innsbruck Goggle Experiments: participants wearing reversing spectacles gradually adapted, to the point of drawing and moving as if no goggles were worn; after taking the glasses off, the reversed vision (a negative after-effect) lasted only a few minutes. Successful adaptation required active exploration of and interaction with the environment. Fig. 4: reversing spectacles by Erismann from 1947, with a metal mirror reversing top-bottom orientation. Fig. 5: participant with prismatic (top/bottom) goggles in Innsbruck, Austria.]

Low-level deficits exhibit a critical period; high-level deficits do not.

Page 33:

!33

Information is physical

Idea: When using SGD, the Fisher Information adds a drag term controlled by the batch size

$$V_{\text{eff}} \;=\; \underbrace{U}_{\text{real loss}} \;+\; \underbrace{\frac{k}{B}\,\log|F|}_{\text{drag term}}$$

where $B$ is the batch size and $F$ is the Fisher Information of the weights.

How can the Fisher Information affect the learning dynamics?

SGD MINIMIZES THE FISHER INFORMATION OF THE WEIGHTS (INDUCTIVE BIAS OF SGD)
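As a toy numerical reading of the drag term, here is a minimal sketch, assuming a diagonal approximation of the Fisher Information and an illustrative constant k; the logistic model, data, and k are hypothetical stand-ins, not the experiments behind these slides.

```python
# Minimal sketch: effective potential V_eff = U + (k/B) log|F| for a toy
# logistic-regression model, with |F| approximated by the product of the
# diagonal entries of the (empirical) Fisher Information of the weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))                   # toy inputs
y = (X @ rng.normal(size=5) > 0).astype(float)  # toy binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    """Real loss U(w): average cross-entropy on the toy dataset."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def log_det_fisher_diag(w):
    """log|F| under a diagonal approximation, F_ii = E[(d/dw_i log p)^2]."""
    p = sigmoid(X @ w)
    per_sample_grads = (p - y)[:, None] * X     # per-sample gradient of the NLL
    diag_F = np.mean(per_sample_grads**2, axis=0)
    return np.sum(np.log(diag_F + 1e-12))

w = rng.normal(size=5)
k = 0.1  # illustrative constant linking SGD noise to the drag term
for B in (8, 32, 128):
    print(f"B = {B:4d}: U = {loss(w):.3f}, "
          f"V_eff = {loss(w) + (k / B) * log_det_fisher_diag(w):.3f}")
```

Smaller batches weigh the log|F| term more heavily, so minimizing V_eff pushes the weights toward low Fisher Information.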

Page 34

!34

A path-integral approximation

1) Approximate SGD with gradient descent plus white noise, and use the MSR (Martin-Siggia-Rose) formalism to obtain the probability of following a path $w(t)$:

$$p(w(t) \mid w_0, t_0) = e^{-\frac{1}{D} \int \mathcal{L}(w, \dot w)\, dt}$$

2) Assume most paths are perturbations of distinct "critical" paths.

3) Approximate the loss function quadratically along the critical paths, and integrate out the perturbations to find the total probability of crossing a bottleneck:

$$p(w_f, t_f \mid w_0, t_0) \;=\; \underbrace{e^{-\beta\, L(w;\,\mathcal{D})}}_{\text{static part}} \;\underbrace{\int_{w_0}^{w_f} e^{-\frac{1}{2D} \int_{t_0}^{t_f} \frac{1}{2}\dot u(t)^2 + V(u(t))\, dt}\; du(t)}_{\text{dynamic part}}$$

Static part: depends only on the IBL (the Information Bottleneck Lagrangian) at the initial and final points.
Dynamic part: depends on the existence of a likely path between the two.
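Why the probability splits this way can be sketched with a standard Onsager-Machlup computation (our sketch, under the white-noise approximation of step 1; the constants are illustrative): expanding the square in the path action, the cross term depends only on the endpoints,

$$\frac{1}{4D}\int_{t_0}^{t_f} \big(\dot w + \nabla L(w)\big)^2\, dt \;=\; \underbrace{\frac{L(w_f) - L(w_0)}{2D}}_{\text{static: endpoints only}} \;+\; \underbrace{\frac{1}{4D}\int_{t_0}^{t_f} \dot w^2 + \big(\nabla L(w)\big)^2\, dt}_{\text{dynamic: whole path}}$$

since $\int \dot w \cdot \nabla L(w)\, dt = \int dL = L(w_f) - L(w_0)$.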

SGD EFFECTIVELY MINIMIZES THE IBL FOR THE WEIGHTS

Achille, Mbeng, Soatto, Dynamics of learning, arXiv 2018

Page 35

!35

Path Integral Approximation and Task Reachability

$$p(w_f, t_f \mid w_0, t_0) \;=\; \underbrace{e^{-\beta\, L(w;\,\mathcal{D})}}_{\text{static part}} \;\underbrace{\int_{w_0}^{w_f} e^{-\frac{1}{2D} \int_{t_0}^{t_f} \frac{1}{2}\dot u(t)^2 + V(u(t))\, dt}\; du(t)}_{\text{dynamic part}}$$

Static part ↔ Information Lagrangian. Dynamic part ↔ Critical Periods, Reachability.

SGD EFFECTIVELY MINIMIZES THE IBL FOR THE WEIGHTS

Achille, Mbeng, Soatto, The Dynamic Distance Between Tasks, NeurIPS Workshop 2018
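To build intuition for how the dynamic part governs reachability, here is a toy simulation; the 1-D double-well loss, the noise level D (standing in for SGD noise), and the time horizon are our illustrative assumptions, not the setup of the cited paper.

```python
# Toy reachability experiment: gradient descent plus white noise on a
# double-well loss L(w) = (w^2 - 1)^2 / 4, with minima at w = -1 and w = +1
# and a bottleneck (barrier) at w = 0. We estimate the probability that a
# path starting at w = -1 crosses the bottleneck within a time budget.
import numpy as np

rng = np.random.default_rng(0)

def grad_L(w):
    return w * (w**2 - 1)  # gradient of the double-well loss

def crossing_probability(D, n_paths=2000, dt=1e-3, T=20.0):
    w = np.full(n_paths, -1.0)
    crossed = np.zeros(n_paths, dtype=bool)
    for _ in range(int(T / dt)):
        w = w - grad_L(w) * dt + np.sqrt(2 * D * dt) * rng.normal(size=n_paths)
        crossed |= (w > 0)
    return crossed.mean()

for D in (0.02, 0.05, 0.1):
    print(f"D = {D:.2f}: crossing probability ~ {crossing_probability(D):.2f}")
```

Larger noise D (in SGD, smaller batches) makes a likely path across the bottleneck more probable; with too little noise the second minimum is effectively unreachable within the given budget.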


Page 36

Information Plasticity in Deep Networks

!36

[Figure: normalized Fisher Information in the weights per layer (Layers 0-7) vs. training epoch (0-150). Top row: no deficit, blur deficit until epoch 40, blur deficit until epoch 100. Bottom row: the same for a flip deficit. Y-axis: Norm. Info in Weights (0.0-1.5); x-axis: Epoch.]

High-level deficits do not change layer organization

Introducing a blur deficit changes layer organization

Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018
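For readers who want to reproduce curves like those above, here is a minimal sketch of the per-layer quantity, assuming PyTorch; `model` and `loader` are hypothetical placeholders, and the estimator (a sampled diagonal-Fisher proxy from squared minibatch gradients) is an illustrative choice, not necessarily the one used in the cited paper.

```python
# Minimal sketch: per-layer (per-parameter-tensor) Fisher Information in the
# weights, normalized so the values sum to one. Assumes a PyTorch classifier
# `model` and a data loader `loader`.
import torch
import torch.nn.functional as F

def layer_fisher_traces(model, loader, n_batches=10):
    traces = {name: 0.0 for name, _ in model.named_parameters()}
    for i, (x, _) in enumerate(loader):
        if i >= n_batches:
            break
        logp = torch.log_softmax(model(x), dim=-1)
        # Sample labels from the model's own predictions, as in the
        # definition of the Fisher Information (not the empirical Fisher).
        y = torch.distributions.Categorical(logits=logp).sample()
        model.zero_grad()
        F.nll_loss(logp, y).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                # Squared minibatch gradient: a crude diagonal-Fisher proxy.
                traces[name] += p.grad.pow(2).sum().item()
    total = sum(traces.values()) or 1.0
    return {name: t / total for name, t in traces.items()}
```

Tracking `layer_fisher_traces` across epochs, with and without an early blur or flip deficit, would produce curves analogous to the figure.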

Page 37

!37

Summary

1. Emergence Theory addresses optimal representations for a given task.

2. Tasks live in a complex space, where “distances” depend not just on the geometry of the residual landscape (static component), but also on the direction of travel (asymmetric distance).

3. Critical Periods expose the importance of the transient phase of learning; we introduced the notion of "Information Plasticity".

4. Learning Dynamics: tasks may or may not be reachable depending on the dynamics of learning; this motivates a dynamic distance between tasks and a notion of reachability.

Page 38

references

• A. Achille and S. Soatto, On the Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018

• A. Achille and S. Soatto, Information Dropout, PAMI 2018

• A. Achille and S. Soatto, A Separation Principle for Control in the Age of Deep Learning, Annual Reviews, 2018 (also arXiv)

• A. Achille et al., Critical Learning Periods in Deep Neural Networks, ICLR 2019

• A. Achille et al., Where is the Information in a Deep Neural Network?, arXiv 2019

!38