Efficient Training in high-dimensional weight space
Page 1

Efficient Training in high-dimensional weight space

Michael Biehl, Christoph Bunzmann, Robert Urbanczik

Theoretische Physik und Astrophysik
Computational Physics
Julius-Maximilians-Universität Würzburg
Am Hubland, D-97074 Würzburg, Germany
http://theorie.physik.uni-wuerzburg.de/~biehl

Wiskunde & Informatica
Intelligent Systems
Rijksuniversiteit Groningen, Postbus 800,
NL-9718 DD Groningen, The Netherlands
[email protected], www.cs.rug.nl/~biehl
Page 2

Efficient training in high-dimensional weight space

· Learning from examples
  a model situation: layered neural networks, student teacher scenario
· The dynamics of on-line learning
  on-line gradient descent; delayed learning, plateau states
· Efficient training of multilayer networks
  learning by Principal Component Analysis: idea, analysis, results
· Summary, Outlook
  selected further topics, prospective projects
Page 3

Learning from examples

Supervised learning: the choice of adjustable parameters in
adaptive information processing systems.

· based on example data, e.g. input/output pairs in
  classification tasks, time series prediction, regression problems
· parameterizes a hypothesis,
  e.g. for an unknown classification or regression task
· guided by the optimization of an appropriate objective or cost function,
  e.g. the performance with respect to the example data
· results in generalization ability,
  e.g. the successful classification of novel data
Page 4

Theory of learning processes

· description of specific applications, e.g. hand-written digit recognition
  - given real world problem
  - particular training scheme
  - special set of example data ...

· typical properties of model scenarios, e.g. learning curves
  - network architecture
  - statistics of data, noise
  - learning algorithm
  → understanding/prediction of relevant phenomena, algorithm design

· general results, e.g. performance bounds, independent of
  - statistical properties of data
  - specific task
  - details of training procedure ...

trade off: general validity vs. applicability
Page 5

A two-layered network: the soft committee machine

input data  ξ ∈ R^N
adaptive weights  w_k ∈ R^N,  hidden units k = 1, 2, ..., K
sigmoidal hidden activation, e.g. g(x) = erf(a x)

input/output relation R^N → R:

    σ(ξ) = Σ_{k=1}^{K} g( w_k · ξ )

( fixed hidden to output weights )

SCM + adaptive thresholds: universal approximator
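The input/output relation above can be sketched numerically; a minimal illustration (the helper names `g` and `scm_output` and the default a = 1.0 are ours, not from the talk):

```python
import numpy as np
from math import erf

def g(x, a=1.0):
    # sigmoidal hidden activation g(x) = erf(a x), applied elementwise
    return np.vectorize(lambda t: erf(a * t))(np.asarray(x, dtype=float))

def scm_output(W, xi):
    # soft committee machine sigma(xi) = sum_k g(w_k . xi);
    # W has shape (K, N); hidden-to-output weights are fixed to 1
    return float(np.sum(g(W @ xi)))
```

With W = 0 the output is 0, since g(0) = erf(0) = 0; each hidden unit contributes one term of the sum.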
Page 6

Student teacher scenario

adaptive student with K hidden units:

    σ(ξ) = Σ_{k=1}^{K} g( w_k · ξ )

teacher with M hidden units, the (best) parameterization of the rule:

    τ(ξ) = Σ_{m=1}^{M} g( w*_m · ξ )

M = K   ideal situation: perfectly matching complexity
M > K   unlearnable rule
M < K   over-sophisticated student

relevant cases, interesting effects
Page 7

input/output pairs  D = { ξ^μ, τ(ξ^μ) }_{μ=1}^{P} :
(reliable) examples for the unknown function or rule τ(ξ)

training based on the performance w.r.t. the example data, e.g.

    E = (1/P) Σ_{μ=1}^{P} e^μ ,    e^μ = ½ [ σ(ξ^μ) − τ(ξ^μ) ]²

evaluation after training: the generalization error

    e_G = ⟨ e(ξ) ⟩_ξ ,

the expected error for a novel input { ξ, τ } ∉ D,
w.r.t. the density of inputs / a set of test inputs
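A small numerical sketch of the two error measures, estimating e_G by Monte Carlo sampling over isotropic Gaussian inputs (all function names, and the choice g(x) = erf(x), are our own assumptions):

```python
import numpy as np
from math import erf

_g = np.vectorize(erf)  # hidden activation, here g(x) = erf(x)

def scm_batch(W, Xi):
    # network outputs for a batch of inputs (rows of Xi); W is (K, N)
    return _g(Xi @ W.T).sum(axis=1)

def training_error(W_student, W_teacher, Xi):
    # E = (1/P) sum_mu 1/2 (sigma(xi^mu) - tau(xi^mu))^2
    diff = scm_batch(W_student, Xi) - scm_batch(W_teacher, Xi)
    return 0.5 * float(np.mean(diff ** 2))

def generalization_error(W_student, W_teacher, N, n_test=2000, seed=0):
    # Monte Carlo estimate of e_G = <e(xi)> over isotropic test inputs
    # with zero-mean, unit-variance components
    rng = np.random.default_rng(seed)
    Xi = rng.standard_normal((n_test, N))
    return training_error(W_student, W_teacher, Xi)
```

A perfect student (identical weights) gives e_G = 0 exactly, since every term of the average vanishes.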
Page 8

Statistical Physics approach

· consider large systems, in the thermodynamic limit N → ∞ (K, M « N)
  N: dimension of the input data, number of adjustable parameters

· perform averages
  ⟨...⟩_T over the stochastic training process
  ⟨...⟩_D over randomized example data, quenched disorder
  (technically) simplest case: reliable teacher outputs,
  isotropic input density: independent components
  with zero mean / unit variance

· evaluate typical properties,
  e.g. the learning curve  ⟨ e_G ⟩_{T,D}  vs. P

· description in terms of macroscopic quantities,
  e.g. the overlap parameters (student/teacher similarity measures)

      R_jm = w_j · w*_m ,    Q_ij = w_i · w_j

next: e_G
Page 9

The generalization error

    e_G = ⟨ e(ξ) ⟩ = ½ ⟨ [ Σ_{k=1}^{K} g(x_k) − Σ_{m=1}^{M} g(x*_m) ]² ⟩

with the fields  x_k = w_k · ξ ,  x*_m = w*_m · ξ
(sums of many random numbers)

Central Limit Theorem: correlated Gaussians for large N,
with first and second moments

    ⟨ x_k ⟩ = ⟨ x*_m ⟩ = 0
    ⟨ x_k x_j ⟩ = w_k · w_j = Q_kj
    ⟨ x_j x*_m ⟩ = w_j · w*_m = R_jm
    ⟨ x*_m x*_n ⟩ = w*_m · w*_n = δ_mn

averages over ξ → integrals over { x_k, x*_m }:

    e_G(w) = ⟨ e(ξ) ⟩_ξ = e_G( { R_jm, Q_jk } )

microscopic: K·N weights  →  macroscopic: ½(K²+K) + K·M overlaps
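The overlap parameters are plain scalar products, so the reduction from microscopic weights to macroscopic order parameters is easy to sketch (function names are ours):

```python
import numpy as np

def order_parameters(W_student, W_teacher):
    # macroscopic overlaps:  R_jm = w_j . w*_m ,  Q_ij = w_i . w_j
    R = W_student @ W_teacher.T   # shape (K, M)
    Q = W_student @ W_student.T   # shape (K, K), symmetric
    return R, Q

def n_macroscopic(K, M):
    # number of order parameters: (K^2 + K)/2 independent Q's plus K*M R's
    return (K * K + K) // 2 + K * M
```

For K = M = 2 this gives 7 macroscopic quantities, regardless of the input dimension N.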
Page 10

Dynamics of on-line gradient descent

presentation of single examples;
w_j^{μ−1}: weights after presentation of (μ−1) examples

novel, random example  { ξ^μ, τ(ξ^μ) } ,

    e^μ = ½ [ σ(ξ^μ) − τ(ξ^μ) ]²

On-line learning step:

    w_j^μ = w_j^{μ−1} − (η/N) ∇_{w_j} e^μ

μ: number of examples, discrete learning time

practical advantages:
· no explicit storage of all examples D required
· little computational effort per example

mathematical ease: the typical dynamics of learning can be evaluated on
average over a randomized sequence of examples
→ coupled ODEs for { R_jm, Q_ij } in the time  α = P/(KN)
Page 11

projections:

    Q_jk(μ) = w_j^μ · w_k^μ ,    R_km(μ) = w_k^μ · w*_m

recursions, e.g.

    R_km(μ) − R_km(μ−1) = (η/N) ( τ^μ − σ^μ ) g'(x_k^μ) x*_m^μ

large N:
· average over the latest example → Gaussian fields x_k^μ, x*_m^μ
· mean recursions → coupled ODEs in the continuous time

    α = μ/(KN)    training time  ~  examples per weight

learning curve:  Q_jk(α), R_km(α)  →  e_G(α)
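A single on-line learning step can be sketched as follows; we assume the common choice g(x) = erf(x/√2), which is an assumption on our part rather than something fixed on this slide:

```python
import numpy as np
from math import erf, exp, sqrt, pi

def g(x):
    # assumed hidden activation g(x) = erf(x / sqrt(2))
    return erf(x / sqrt(2.0))

def g_prime(x):
    # its derivative:  g'(x) = sqrt(2/pi) * exp(-x^2 / 2)
    return sqrt(2.0 / pi) * exp(-0.5 * x * x)

def online_step(W_student, W_teacher, xi, eta):
    # one gradient step on e = 1/2 (sigma - tau)^2 for a single example:
    #   w_k <- w_k + (eta / N) * (tau - sigma) * g'(x_k) * xi
    N = xi.size
    x_student = W_student @ xi            # fields x_k = w_k . xi
    x_teacher = W_teacher @ xi            # fields x*_m = w*_m . xi
    sigma = sum(g(x) for x in x_student)
    tau = sum(g(x) for x in x_teacher)
    delta = tau - sigma
    for k in range(W_student.shape[0]):
        W_student[k] += (eta / N) * delta * g_prime(x_student[k]) * xi
    return W_student
```

Iterating this over a random sequence of examples and recording R_km(μ), Q_jk(μ) reproduces the learning curves discussed next; if student and teacher coincide, τ − σ = 0 and the weights stay fixed.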
Page 12

learning curve

example: K = M = 2, η = 1.5, R_ij(0) ≈ 0
[figure: e_G vs. α = P/(KN), for 0 ≤ α ≤ 300, 0 ≤ e_G ≤ 0.05]

· fast initial decrease of e_G
· quasi-stationary plateau states with all R_ij ≈ R
  (unspecialized student weights) dominate the learning process
· finally perfect generalization, e_G → 0

Biehl, Riegler, Wöhler, J. Phys. A (1996) 4769
Page 13

evolution of overlap parameters

example: K = M = 2, T_mn = δ_mn, η = 1, R_ij(0) ≈ 0
[figure: R_11, R_22, Q_11, Q_22 and R_12, R_21, Q_12 = Q_21 vs. α,
 for 0 ≤ α ≤ 300]

plateau: permutation symmetry of the branches in the student network
(student vectors w_1, w_2 vs. teacher vectors w*_1, w*_2)
Page 14

Monte Carlo simulations: self-averaging

[figure: mean and standard deviation of the quantity Q_jm,
 plotted against 1/N]

the macroscopic quantities are self-averaging: their fluctuations
vanish in the thermodynamic limit N → ∞
Page 15

Plateau length

assume randomized initialization of the weight vectors,
R_jk(0) = O(1/√N) (self-avg.):

    α_plat ∝ ln N ,   i.e.  P ∝ K N ln N  examples are needed
    for successful learning!

α_plat → ∞ if all R_jk(0) = R(0) exactly

hidden unit specialization R_jj > R_jm only for P » K N;
avoiding the plateau requires a priori knowledge
(initial macroscopic overlaps)

property of the learning scenario, a necessary phase of training,
or an artifact of the training prescription ???
Page 16

S.J. Hanson, in Y. Chauvin & D. Rumelhart (eds.), Backpropagation:
Theory, Architectures, and Applications
[figure: test error E_test vs. training time t]
Page 17

Training by Principal Component Analysis

problem: delayed specialization in the (K·N)-dimensional weight space

idea:
A) identification (approximation) of the subspace of the w*_m
B) actual training within this low-dimensional space

example: soft committee teacher (K = M), isotropic input density

modified correlation matrix:

    C = ⟨ τ(ξ)² ξ ξ^T ⟩ ,    C_ij = ⟨ τ(ξ)² ξ_i ξ_j ⟩

eigenvalues and eigenvectors ( λ_Σ > λ_o > λ_Δ ):

    1 eigenvector    w_Σ ∝ Σ_{m=1}^{M} w*_m                  eigenvalue λ_Σ
    (K−1) e.v.       Δ_m ∝ w*_1 − w*_m  (m = 2, 3, ..., M)   eigenvalue λ_Δ
    (N−K) e.v.       u ⊥ all w*_m                            eigenvalue λ_o
Page 18

empirical estimate from a limited data set:

    C^P_ij = (1/P) Σ_{μ=1}^{P} τ(ξ^μ)² ξ_i^μ ξ_j^μ

note: the required memory ∝ N² does not increase with P

· determine the eigenvector Δ_1^P ∝ w_Σ of the largest eigenvalue and
  the eigenvectors Δ_k^P (k = 2, ..., K) of the (K−1) smallest eigenvalues

· representation of the student weights:

    w_j = Σ_{k=1}^{K} a_jk Δ_k^P    ( j = 1, 2, ..., K )

· optimization of E w.r.t. the a_jk  ( K² « K·N coefficients,
  # of examples P = α N K » K² )

B) specialization in the K-dimensional space spanned by the Δ_k^P
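The estimate C^P and the extraction of the candidate teacher subspace can be sketched with standard linear algebra (function names are ours; `numpy.linalg.eigh` returns eigenvalues in ascending order with orthonormal eigenvectors):

```python
import numpy as np

def modified_correlation_matrix(Xi, tau):
    # empirical estimate  C^P_ij = (1/P) sum_mu tau(xi^mu)^2 xi_i^mu xi_j^mu
    # Xi: (P, N) inputs, tau: (P,) teacher outputs; memory is N x N,
    # independent of P
    return (Xi * (tau ** 2)[:, None]).T @ Xi / Xi.shape[0]

def estimated_teacher_subspace(C, K):
    # the eigenvector of the largest eigenvalue plus the (K-1) eigenvectors
    # of the smallest eigenvalues span the approximate teacher space
    eigvals, eigvecs = np.linalg.eigh(C)    # ascending eigenvalues
    basis = np.hstack([eigvecs[:, -1:], eigvecs[:, : K - 1]])
    return basis                            # shape (N, K), orthonormal columns
```

The subsequent low-dimensional training then only has to optimize the K² coefficients a_jk of the student weights in this basis.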
Page 19

typical properties: given a random set D of P = α N K examples

formal partition sum:

    Z = ∫ dw exp[ −β P E(w; D) ]

quenched free energy ⟨ ln Z ⟩_D via the replica trick,
saddle point integration in the limit N → ∞

A) typical overlap with the teacher weights, ⟨ ρ ⟩_{T,D} with

    ρ² = Σ_i ( w · w*_i )² ,

measures the success of the teacher space identification

B) given ρ, determine the optimal e_G achievable by a linear
combination of the Δ_i
Page 20

Results: K = 3, Statistical Physics theory and Monte Carlo simulations,
N = 400 and N = 1600 (•)
[figure: A) overlap ρ and B) optimal e_G vs. α, with P = α K N examples]

A) phase transition at α_c from unspecialized to specialized students:

    α_c(K=2) = 4.49 ,    α_c(K=3) = 8.70
    large-K theory:  α_c(K) ~ 2.94 K    (N-independent!)

→ specialization without a priori knowledge ( α_c independent of N )

B) given ρ, the optimal e_G achievable by a linear combination
of the Δ_i

Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)
Page 22

spectrum of the matrix C^P, teacher with M = 7 hidden units
[figure: the K−1 = 6 smallest eigenvalues split off from the bulk]

· the algorithm requires no prior knowledge of M
· the PCA spectrum hints at the required model complexity

potential application: model selection
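A hypothetical heuristic in this spirit: locate the largest multiplicative gap in the sorted spectrum of C^P and read off M from the size of the low-eigenvalue band. This specific rule is our illustration, not the authors' prescription:

```python
import numpy as np

def estimate_hidden_units(eigvals):
    # sketch of a model-selection heuristic: the (M-1) eigenvalues
    # lambda_Delta split off below the bulk; find the largest
    # multiplicative jump in the sorted spectrum and count the band
    # below it, then estimate M = band size + 1
    v = np.sort(np.asarray(eigvals, dtype=float))
    ratios = v[1:] / np.maximum(v[:-1], 1e-300)
    band = int(np.argmax(ratios)) + 1   # size of the low-eigenvalue band
    return band + 1
```

For a spectrum with six clearly separated small eigenvalues this returns 7, matching the M = 7 teacher in the figure above.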
Page 23

Summary

· model situation, supervised learning
  - the soft committee machine
  - student teacher scenario
  - randomized training data

· dynamics of on-line gradient descent
  - delayed learning due to symmetry breaking,
    necessary specialization processes

· statistical physics inspired approach
  - large systems
  - thermal (training) and disorder (data) averages
  - typical, macroscopic properties

· efficient training
  - PCA based learning algorithm reduces the dimensionality of the problem
  - specialization without a priori knowledge
Page 24

Further topics

· perceptron training (single layer):
  optimal stability classification, dynamics of learning
· unsupervised learning:
  principal component analysis; competitive learning, clustered data
· specialization processes:
  discontinuous learning curves; delayed learning, plateau states
· dynamics of on-line training:
  perceptron, unsupervised learning, two-layered feed-forward networks
· algorithm design:
  variational method, optimal algorithms; construction algorithm
· non-trivial statistics of data:
  learning from noisy data; time-dependent rules
Page 25

Selected Prospective Projects

· unsupervised learning
  density estimation, feature detection, clustering,
  (Learning) Vector Quantization, compression, self-organizing maps

· application relevant architectures and algorithms
  Local Linear Model Trees, Learning Vector Quantization,
  Support Vector Machines

· model selection
  estimate the complexity of a rule or mixture density

· algorithm design
  variational optimization, e.g. of an alternative correlation matrix

      C_ij ∝ Σ_μ F( τ(ξ^μ) ) ξ_i^μ ξ_j^μ