Efficient Training in high-dimensional weight space
Page 1

Efficient Training in high-dimensional weight space

Michael Biehl, Christoph Bunzmann, Robert Urbanczik

Theoretische Physik und Astrophysik
Computational Physics
Julius-Maximilians-Universität Würzburg
Am Hubland, D-97074 Würzburg, Germany
http://theorie.physik.uni-wuerzburg.de/~biehl

Wiskunde & Informatica
Intelligent Systems
Rijksuniversiteit Groningen, Postbus 800,
NL-9718 DD Groningen, The Netherlands
[email protected], www.cs.rug.nl/~biehl
Page 2

Efficient training in high-dimensional weight space

· Learning from examples
  a model situation: layered neural networks, student teacher scenario
· The dynamics of on-line learning
  on-line gradient descent; delayed learning, plateau states
· Efficient training of multilayer networks
  learning by Principal Component Analysis: idea, analysis, results
· Summary, Outlook
  selected further topics, prospective projects
Page 3

Learning from examples

Supervised learning: the choice of adjustable parameters in
adaptive information processing systems.

· based on example data, e.g. input/output pairs in
  classification tasks, time series prediction, regression problems
· parameterizes a hypothesis,
  e.g. for an unknown classification or regression task
· guided by the optimization of an appropriate objective or cost function,
  e.g. the performance with respect to the example data
· results in generalization ability,
  e.g. the successful classification of novel data
Page 4

Theory of learning processes

· description of specific applications, e.g. hand-written digit recognition
  - given real world problem
  - particular training scheme
  - special set of example data ...

· typical properties of model scenarios, e.g. learning curves
  - network architecture
  - statistics of data, noise
  - learning algorithm
  → understanding/prediction of relevant phenomena, algorithm design

· general results, e.g. performance bounds, independent of
  - statistical properties of data
  - specific task
  - details of training procedure ...

trade off: general validity vs. applicability
Page 5

A two-layered network: the soft committee machine

input data  ξ ∈ R^N
adaptive weights  w_k ∈ R^N,  hidden units k = 1, 2, ..., K
sigmoidal hidden activation, e.g. g(x) = erf(a x)

input/output relation R^N → R:

    σ(ξ) = Σ_{k=1}^{K} g( w_k · ξ )

( fixed hidden to output weights )

SCM + adaptive thresholds: universal approximator
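The input/output relation above can be sketched numerically; a minimal illustration (the helper names `g` and `scm_output` and the default a = 1.0 are ours, not from the talk):

```python
import numpy as np
from math import erf

def g(x, a=1.0):
    # sigmoidal hidden activation g(x) = erf(a x), applied elementwise
    return np.vectorize(lambda t: erf(a * t))(np.asarray(x, dtype=float))

def scm_output(W, xi):
    # soft committee machine sigma(xi) = sum_k g(w_k . xi);
    # W has shape (K, N); hidden-to-output weights are fixed to 1
    return float(np.sum(g(W @ xi)))
```

With W = 0 the output is 0, since g(0) = erf(0) = 0; each hidden unit contributes one term of the sum.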
Page 6

Student teacher scenario

adaptive student with K hidden units:

    σ(ξ) = Σ_{k=1}^{K} g( w_k · ξ )

teacher with M hidden units, the (best) parameterization of the rule:

    τ(ξ) = Σ_{m=1}^{M} g( w*_m · ξ )

M = K   ideal situation: perfectly matching complexity
M > K   unlearnable rule
M < K   over-sophisticated student

relevant cases, interesting effects
Page 7

input/output pairs  D = { ξ^μ, τ(ξ^μ) }_{μ=1}^{P} :
(reliable) examples for the unknown function or rule τ(ξ)

training based on the performance w.r.t. the example data, e.g.

    E = (1/P) Σ_{μ=1}^{P} e^μ ,    e^μ = ½ [ σ(ξ^μ) − τ(ξ^μ) ]²

evaluation after training: the generalization error

    e_G = ⟨ e(ξ) ⟩_ξ ,

the expected error for a novel input { ξ, τ } ∉ D,
w.r.t. the density of inputs / a set of test inputs
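A small numerical sketch of the two error measures, estimating e_G by Monte Carlo sampling over isotropic Gaussian inputs (all function names, and the choice g(x) = erf(x), are our own assumptions):

```python
import numpy as np
from math import erf

_g = np.vectorize(erf)  # hidden activation, here g(x) = erf(x)

def scm_batch(W, Xi):
    # network outputs for a batch of inputs (rows of Xi); W is (K, N)
    return _g(Xi @ W.T).sum(axis=1)

def training_error(W_student, W_teacher, Xi):
    # E = (1/P) sum_mu 1/2 (sigma(xi^mu) - tau(xi^mu))^2
    diff = scm_batch(W_student, Xi) - scm_batch(W_teacher, Xi)
    return 0.5 * float(np.mean(diff ** 2))

def generalization_error(W_student, W_teacher, N, n_test=2000, seed=0):
    # Monte Carlo estimate of e_G = <e(xi)> over isotropic test inputs
    # with zero-mean, unit-variance components
    rng = np.random.default_rng(seed)
    Xi = rng.standard_normal((n_test, N))
    return training_error(W_student, W_teacher, Xi)
```

A perfect student (identical weights) gives e_G = 0 exactly, since every term of the average vanishes.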
Page 8

Statistical Physics approach

· consider large systems, in the thermodynamic limit N → ∞ (K, M « N)
  N: dimension of the input data, number of adjustable parameters

· perform averages
  ⟨...⟩_T over the stochastic training process
  ⟨...⟩_D over randomized example data, quenched disorder
  (technically) simplest case: reliable teacher outputs,
  isotropic input density: independent components
  with zero mean / unit variance

· evaluate typical properties,
  e.g. the learning curve  ⟨ e_G ⟩_{T,D}  vs. P

· description in terms of macroscopic quantities,
  e.g. the overlap parameters (student/teacher similarity measures)

      R_jm = w_j · w*_m ,    Q_ij = w_i · w_j

next: e_G
Page 9

The generalization error

    e_G = ⟨ e(ξ) ⟩ = ½ ⟨ [ Σ_{k=1}^{K} g(x_k) − Σ_{m=1}^{M} g(x*_m) ]² ⟩

with the fields  x_k = w_k · ξ ,  x*_m = w*_m · ξ
(sums of many random numbers)

Central Limit Theorem: correlated Gaussians for large N,
with first and second moments

    ⟨ x_k ⟩ = ⟨ x*_m ⟩ = 0
    ⟨ x_k x_j ⟩ = w_k · w_j = Q_kj
    ⟨ x_j x*_m ⟩ = w_j · w*_m = R_jm
    ⟨ x*_m x*_n ⟩ = w*_m · w*_n = δ_mn

averages over ξ → integrals over { x_k, x*_m }:

    e_G(w) = ⟨ e(ξ) ⟩_ξ = e_G( { R_jm, Q_jk } )

microscopic: K·N weights  →  macroscopic: ½(K²+K) + K·M overlaps
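The overlap parameters are plain scalar products, so the reduction from microscopic weights to macroscopic order parameters is easy to sketch (function names are ours):

```python
import numpy as np

def order_parameters(W_student, W_teacher):
    # macroscopic overlaps:  R_jm = w_j . w*_m ,  Q_ij = w_i . w_j
    R = W_student @ W_teacher.T   # shape (K, M)
    Q = W_student @ W_student.T   # shape (K, K), symmetric
    return R, Q

def n_macroscopic(K, M):
    # number of order parameters: (K^2 + K)/2 independent Q's plus K*M R's
    return (K * K + K) // 2 + K * M
```

For K = M = 2 this gives 7 macroscopic quantities, regardless of the input dimension N.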
Page 10

Dynamics of on-line gradient descent

presentation of single examples;
w_j^{μ−1}: weights after presentation of (μ−1) examples

novel, random example  { ξ^μ, τ(ξ^μ) } ,

    e^μ = ½ [ σ(ξ^μ) − τ(ξ^μ) ]²

On-line learning step:

    w_j^μ = w_j^{μ−1} − (η/N) ∇_{w_j} e^μ

μ: number of examples, discrete learning time

practical advantages:
· no explicit storage of all examples D required
· little computational effort per example

mathematical ease: the typical dynamics of learning can be evaluated on
average over a randomized sequence of examples
→ coupled ODEs for { R_jm, Q_ij } in the time  α = P/(KN)
Page 11

projections:

    Q_jk(μ) = w_j^μ · w_k^μ ,    R_km(μ) = w_k^μ · w*_m

recursions, e.g.

    R_km(μ) − R_km(μ−1) = (η/N) ( τ^μ − σ^μ ) g'(x_k^μ) x*_m^μ

large N:
· average over the latest example → Gaussian fields x_k^μ, x*_m^μ
· mean recursions → coupled ODEs in the continuous time

    α = μ/(KN)    training time  ~  examples per weight

learning curve:  Q_jk(α), R_km(α)  →  e_G(α)
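A single on-line learning step can be sketched as follows; we assume the common choice g(x) = erf(x/√2), which is an assumption on our part rather than something fixed on this slide:

```python
import numpy as np
from math import erf, exp, sqrt, pi

def g(x):
    # assumed hidden activation g(x) = erf(x / sqrt(2))
    return erf(x / sqrt(2.0))

def g_prime(x):
    # its derivative:  g'(x) = sqrt(2/pi) * exp(-x^2 / 2)
    return sqrt(2.0 / pi) * exp(-0.5 * x * x)

def online_step(W_student, W_teacher, xi, eta):
    # one gradient step on e = 1/2 (sigma - tau)^2 for a single example:
    #   w_k <- w_k + (eta / N) * (tau - sigma) * g'(x_k) * xi
    N = xi.size
    x_student = W_student @ xi            # fields x_k = w_k . xi
    x_teacher = W_teacher @ xi            # fields x*_m = w*_m . xi
    sigma = sum(g(x) for x in x_student)
    tau = sum(g(x) for x in x_teacher)
    delta = tau - sigma
    for k in range(W_student.shape[0]):
        W_student[k] += (eta / N) * delta * g_prime(x_student[k]) * xi
    return W_student
```

Iterating this over a random sequence of examples and recording R_km(μ), Q_jk(μ) reproduces the learning curves discussed next; if student and teacher coincide, τ − σ = 0 and the weights stay fixed.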
Page 12

learning curve

example: K = M = 2, η = 1.5, R_ij(0) ≈ 0
[figure: e_G vs. α = P/(KN), for 0 ≤ α ≤ 300, 0 ≤ e_G ≤ 0.05]

· fast initial decrease of e_G
· quasi-stationary plateau states with all R_ij ≈ R
  (unspecialized student weights) dominate the learning process
· finally perfect generalization, e_G → 0

Biehl, Riegler, Wöhler, J. Phys. A (1996) 4769
Page 13

evolution of overlap parameters

example: K = M = 2, T_mn = δ_mn, η = 1, R_ij(0) ≈ 0
[figure: R_11, R_22, Q_11, Q_22 and R_12, R_21, Q_12 = Q_21 vs. α,
 for 0 ≤ α ≤ 300]

plateau: permutation symmetry of the branches in the student network
(student vectors w_1, w_2 vs. teacher vectors w*_1, w*_2)
Page 14

Monte Carlo simulations: self-averaging

[figure: mean and standard deviation of the quantity Q_jm,
 plotted against 1/N]

the macroscopic quantities are self-averaging: their fluctuations
vanish in the thermodynamic limit N → ∞
Page 15

Plateau length

assume randomized initialization of the weight vectors,
R_jk(0) = O(1/√N) (self-avg.):

    α_plat ∝ ln N ,   i.e.  P ∝ K N ln N  examples are needed
    for successful learning!

α_plat → ∞ if all R_jk(0) = R(0) exactly

hidden unit specialization R_jj > R_jm only for P » K N;
avoiding the plateau requires a priori knowledge
(initial macroscopic overlaps)

property of the learning scenario, a necessary phase of training,
or an artifact of the training prescription ???
Page 16

S.J. Hanson, in Y. Chauvin & D. Rumelhart (eds.), Backpropagation:
Theory, Architectures, and Applications
[figure: test error E_test vs. training time t]
Page 17

Training by Principal Component Analysis

problem: delayed specialization in the (K·N)-dimensional weight space

idea:
A) identification (approximation) of the subspace of the w*_m
B) actual training within this low-dimensional space

example: soft committee teacher (K = M), isotropic input density

modified correlation matrix:

    C = ⟨ τ(ξ)² ξ ξ^T ⟩ ,    C_ij = ⟨ τ(ξ)² ξ_i ξ_j ⟩

eigenvalues and eigenvectors ( λ_Σ > λ_o > λ_Δ ):

    1 eigenvector    w_Σ ∝ Σ_{m=1}^{M} w*_m                  eigenvalue λ_Σ
    (K−1) e.v.       Δ_m ∝ w*_1 − w*_m  (m = 2, 3, ..., M)   eigenvalue λ_Δ
    (N−K) e.v.       u ⊥ all w*_m                            eigenvalue λ_o
Page 18

empirical estimate from a limited data set:

    C^P_ij = (1/P) Σ_{μ=1}^{P} τ(ξ^μ)² ξ_i^μ ξ_j^μ

note: the required memory ∝ N² does not increase with P

· determine the eigenvector Δ_1^P ∝ w_Σ of the largest eigenvalue and
  the eigenvectors Δ_k^P (k = 2, ..., K) of the (K−1) smallest eigenvalues

· representation of the student weights:

    w_j = Σ_{k=1}^{K} a_jk Δ_k^P    ( j = 1, 2, ..., K )

· optimization of E w.r.t. the a_jk  ( K² « K·N coefficients,
  # of examples P = α N K » K² )

B) specialization in the K-dimensional space spanned by the Δ_k^P
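The estimate C^P and the extraction of the candidate teacher subspace can be sketched with standard linear algebra (function names are ours; `numpy.linalg.eigh` returns eigenvalues in ascending order with orthonormal eigenvectors):

```python
import numpy as np

def modified_correlation_matrix(Xi, tau):
    # empirical estimate  C^P_ij = (1/P) sum_mu tau(xi^mu)^2 xi_i^mu xi_j^mu
    # Xi: (P, N) inputs, tau: (P,) teacher outputs; memory is N x N,
    # independent of P
    return (Xi * (tau ** 2)[:, None]).T @ Xi / Xi.shape[0]

def estimated_teacher_subspace(C, K):
    # the eigenvector of the largest eigenvalue plus the (K-1) eigenvectors
    # of the smallest eigenvalues span the approximate teacher space
    eigvals, eigvecs = np.linalg.eigh(C)    # ascending eigenvalues
    basis = np.hstack([eigvecs[:, -1:], eigvecs[:, : K - 1]])
    return basis                            # shape (N, K), orthonormal columns
```

The subsequent low-dimensional training then only has to optimize the K² coefficients a_jk of the student weights in this basis.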
Page 19

typical properties: given a random set D of P = α N K examples

formal partition sum:

    Z = ∫ dw exp[ −β P E(w; D) ]

quenched free energy ⟨ ln Z ⟩_D via the replica trick,
saddle point integration in the limit N → ∞

A) typical overlap with the teacher weights, ⟨ ρ ⟩_{T,D} with

    ρ² = Σ_i ( w · w*_i )² ,

measures the success of the teacher space identification

B) given ρ, determine the optimal e_G achievable by a linear
combination of the Δ_i
Page 20

Results: K = 3, Statistical Physics theory and Monte Carlo simulations,
N = 400 and N = 1600 (•)
[figure: A) overlap ρ and B) optimal e_G vs. α, with P = α K N examples]

A) phase transition at α_c from unspecialized to specialized students:

    α_c(K=2) = 4.49 ,    α_c(K=3) = 8.70
    large-K theory:  α_c(K) ~ 2.94 K    (N-independent!)

→ specialization without a priori knowledge ( α_c independent of N )

B) given ρ, the optimal e_G achievable by a linear combination
of the Δ_i

Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)
Page 22

spectrum of the matrix C^P, teacher with M = 7 hidden units
[figure: the K−1 = 6 smallest eigenvalues split off from the bulk]

· the algorithm requires no prior knowledge of M
· the PCA spectrum hints at the required model complexity

potential application: model selection
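A hypothetical heuristic in this spirit: locate the largest multiplicative gap in the sorted spectrum of C^P and read off M from the size of the low-eigenvalue band. This specific rule is our illustration, not the authors' prescription:

```python
import numpy as np

def estimate_hidden_units(eigvals):
    # sketch of a model-selection heuristic: the (M-1) eigenvalues
    # lambda_Delta split off below the bulk; find the largest
    # multiplicative jump in the sorted spectrum and count the band
    # below it, then estimate M = band size + 1
    v = np.sort(np.asarray(eigvals, dtype=float))
    ratios = v[1:] / np.maximum(v[:-1], 1e-300)
    band = int(np.argmax(ratios)) + 1   # size of the low-eigenvalue band
    return band + 1
```

For a spectrum with six clearly separated small eigenvalues this returns 7, matching the M = 7 teacher in the figure above.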
Page 23

Summary

· model situation, supervised learning
  - the soft committee machine
  - student teacher scenario
  - randomized training data

· dynamics of on-line gradient descent
  - delayed learning due to symmetry breaking,
    necessary specialization processes

· statistical physics inspired approach
  - large systems
  - thermal (training) and disorder (data) averages
  - typical, macroscopic properties

· efficient training
  - PCA based learning algorithm reduces the dimensionality of the problem
  - specialization without a priori knowledge
Page 24

Further topics

· perceptron training (single layer):
  optimal stability classification, dynamics of learning
· unsupervised learning:
  principal component analysis; competitive learning, clustered data
· specialization processes:
  discontinuous learning curves; delayed learning, plateau states
· dynamics of on-line training:
  perceptron, unsupervised learning, two-layered feed-forward networks
· algorithm design:
  variational method, optimal algorithms; construction algorithm
· non-trivial statistics of data:
  learning from noisy data; time-dependent rules
Page 25

Selected Prospective Projects

· unsupervised learning
  density estimation, feature detection, clustering,
  (Learning) Vector Quantization, compression, self-organizing maps

· application relevant architectures and algorithms
  Local Linear Model Trees, Learning Vector Quantization,
  Support Vector Machines

· model selection
  estimate the complexity of a rule or mixture density

· algorithm design
  variational optimization, e.g. of an alternative correlation matrix

      C_ij ∝ Σ_μ F( τ(ξ^μ) ) ξ_i^μ ξ_j^μ