Herding: The Nonlinear Dynamics of Learning. Max Welling, SCIVI Lab, UC Irvine.
Yes, All Models Are Wrong, But… from a CS/ML perspective this may not be as big a problem.
• Training: We want to gain an optimal amount of predictive accuracy per unit time.
• Testing: We want to engage the model that results in optimal accuracy within the time allowed to make a decision.
• Computer scientists are mostly interested in prediction. Example: ML researchers do not care about identifiability (as long as the model predicts well).
• Computer scientists care a lot about computation. Example: ML researchers are willing to trade off estimation bias for computation (if this means we can handle bigger datasets, e.g. variational inference vs. MCMC).
Fight or flight
Not Bayesian, nor frequentist, but Mandelbrotist…
Is there a deep connection between learning, computation and chaos theory?
Perspective

Data: $x_n \sim p(x)$, plus a model / inductive bias.
The standard pipeline runs learning, then inference, to arrive at integration / prediction:
$$\int f(x)\, p(x)\, dx \;\approx\; \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$
Herding takes a shortcut: it converts the data directly into pseudo-samples, which feed the same integration / prediction step.
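To make the integration / prediction step concrete, here is a minimal Monte Carlo sketch (a standard Gaussian $p(x)$ and $f(x) = x^2$ are illustrative choices, not from the talk):

```python
# Minimal sketch of the approximation above:
# E_p[f] = ∫ f(x) p(x) dx ≈ (1/N) Σ_n f(x_n),  x_n ~ p(x).
# p(x) = N(0,1) and f(x) = x^2 are assumed here purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_normal(N)      # draw x_n ~ p(x)
estimate = np.mean(x ** 2)      # (1/N) Σ_n f(x_n)
print(estimate)                 # ≈ 1.0, the exact value of E[x^2]
```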
Herding

Herding is a nonlinear dynamical system that takes the empirical moments $\hat{E}_P[f]$ and generates pseudo-samples "S" such that
$$f(S_{\text{herding}}) \;\approx\; \hat{E}_P[f]$$
Predictions $g(S_{\text{herding}})$ computed from these pseudo-samples are then consistent with the data moments.
Herding

$$S_t \;=\; \arg\max_{S} \sum_k W_k\, f_k(S)$$
$$W_k \;\leftarrow\; W_k + \hat{E}_P[f_k] - f_k(S_t)$$

• Weights do not converge, but Monte Carlo sums do.
• The maximization does not have to be perfect (see the Perceptron Cycling Theorem below).
• Deterministic.
• No step-size.
• Only very simple operations (no exponentiation, logarithms, etc.)
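A minimal numerical sketch of these two updates, on an assumed four-state space whose features are simply the state coordinates, with made-up target moments: the weights keep wandering, yet the running average of the features locks onto the targets.

```python
# Minimal herding sketch (the state space, feature map, and target
# moments below are illustrative assumptions, not from the talk).
import numpy as np

states = np.array([[ 1,  1], [ 1, -1], [-1,  1], [-1, -1]])  # candidate states S
target = np.array([0.4, -0.2])   # Ê_P[f_k]: empirical moments to be matched
W = target.copy()                # a common initialization: W_0 = Ê_P[f]

samples = []
for t in range(10_000):
    S = states[np.argmax(states @ W)]  # S_t = argmax_S Σ_k W_k f_k(S)
    W = W + target - S                 # W_k += Ê_P[f_k] - f_k(S_t), no step-size
    samples.append(S)

print(np.mean(samples, axis=0))  # → close to [0.4, -0.2]: the sums converge
```

Note that the loop uses only comparisons, additions, and subtractions, matching the bullet points above.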
Ising/Hopfield Model Network

Features: $\sum_k w_k f_k(S)$, instantiated with pairwise and single-site terms:
$$\sum_{ij} w_{ij}\, s_i s_j \;+\; \sum_i w_i\, s_i$$
Maximization (neuron update): $s_i^* = \operatorname{sign}\big(\sum_j w_{ij}\, s_j + w_i\big)$
Weight updates:
$$W_{ij} \;\leftarrow\; W_{ij} + \hat{E}_p[s_i s_j] - s_i^*\, s_j^*$$
$$W_i \;\leftarrow\; W_i + \hat{E}_p[s_i] - s_i^*$$

• Neuron fires if its input exceeds the threshold.
• Synapse depresses if pre- & postsynaptic neurons fire.
• Threshold depresses after the neuron fires.
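A sketch of these updates on a small, fully observed network (the network size and the stand-in data are assumptions; the inner maximization runs only a few sweeps of the neuron rule, which by the PCT does not have to be perfect):

```python
# Sketch of herding an Ising/Hopfield network. Sizes and "data" are
# illustrative assumptions; the argmax is approximated by greedy sweeps.
import numpy as np

rng = np.random.default_rng(1)
n = 8
data = rng.choice([-1, 1], size=(500, n))  # stand-in ±1 training data
E_ss = data.T @ data / len(data)           # Ê_p[s_i s_j]
E_s = data.mean(axis=0)                    # Ê_p[s_i]

W = E_ss.copy()                            # pairwise weights w_ij
b = E_s.copy()                             # biases / thresholds w_i
s = np.ones(n)
sum_s = np.zeros(n)

T = 5000
for t in range(T):
    for _ in range(3):                     # a few sweeps of the neuron rule:
        for i in range(n):                 # s_i* = sign(Σ_j w_ij s_j + w_i)
            s[i] = 1.0 if W[i] @ s - W[i, i] * s[i] + b[i] >= 0 else -1.0
    W += E_ss - np.outer(s, s)             # W_ij += Ê_p[s_i s_j] - s_i* s_j*
    b += E_s - s                           # W_i  += Ê_p[s_i]    - s_i*
    sum_s += s

print(np.abs(sum_s / T - E_s).max())       # moment error; shrinks as T grows
```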
Herding as a Dynamical System

Weight update (a deterministic map driven by the data):
$$W_{k,t+1} \;=\; F_k(W_t) \;=\; W_{k,t} + \hat{E}_P[f_k] - f_k\big(S_t(W_t)\big)$$
State selection (a piecewise constant function of $W$):
$$S_t(W_t) \;=\; \arg\max_{S} \sum_k W_{k,t}\, f_k(S)$$
Viewed in $W$, this is a Markov process; viewed in $S$, it is an infinite-memory process:
$$S_t \;=\; G(S_1, S_2, \ldots, S_{t-1}), \qquad W_{k,t} \;=\; W_{k,0} + t\,\hat{E}_P[f_k] - \sum_{i=1}^{t-1} f_k(S_i)$$
Convergence

Translation: choose $S_t$ such that
$$\sum_k W_{k,t}\,\big(\hat{E}_P[f_k] - f_k(S_t)\big) \;\le\; 0$$
Then:
$$\Big|\,\frac{1}{T}\sum_{t=1}^{T} f_k(S_t) - \hat{E}_P[f_k]\,\Big| \;\sim\; O\Big(\frac{1}{T}\Big)$$
[Figure: the weights trace a bounded orbit while emitting the pseudo-sample sequence s = [1, 1, 2, 5, 2, ...]
Equivalent to the "Perceptron Cycling Theorem" (Minsky '68).
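A quick numerical check of the $O(1/T)$ rate against the $O(1/\sqrt{T})$ rate of IID sampling, on an assumed three-state toy problem with a single feature:

```python
# Numerical check of the O(1/T) herding rate vs. the O(1/sqrt(T)) IID rate.
# The three states, their f values, and the distribution P are assumptions.
import numpy as np

rng = np.random.default_rng(2)
vals = np.array([1.0, 2.0, 5.0])   # f(S) on the three candidate states
p = np.array([0.5, 0.3, 0.2])      # target distribution P
mean = vals @ p                    # Ê_P[f] = 2.1

T = 100_000
w, herd_sum = mean, 0.0
for t in range(T):
    f_s = vals.max() if w > 0 else vals.min()  # scalar argmax_S w·f(S)
    w += mean - f_s                            # W += Ê_P[f] - f(S_t)
    herd_sum += f_s
print(abs(herd_sum / T - mean))    # herding error: decays as O(1/T)

iid = rng.choice(vals, size=T, p=p)
print(abs(iid.mean() - mean))      # IID error: decays as O(1/sqrt(T)),
                                   # orders of magnitude larger here
```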
Period Doubling

The logistic map
$$W_{t+1} \;=\; R\, W_t\,(1 - W_t)$$
changes its number of fixed points as $R$ varies. Analogously, herding can be run at a temperature $T$:
$$W_{k,t+1} \;=\; W_{k,t} + \hat{E}[f_k] \;-\; \frac{\sum_x f_k(x)\, \exp\big(\sum_{k'} W_{k',t}\, f_{k'}(x) / T\big)}{\sum_x \exp\big(\sum_{k'} W_{k',t}\, f_{k'}(x) / T\big)}$$
$T = 0$ recovers herding: dynamics at the "edge of chaos".
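The logistic-map side of the analogy is easy to reproduce; the sketch below (the values of $R$ are conventional illustrative choices) prints the attractor as it doubles from a fixed point to period 2, period 4, and chaos:

```python
# Sketch of the logistic map W_{t+1} = R·W_t·(1 - W_t): as R grows, the
# attractor period-doubles on its way to chaos.
def attractor(R, w=0.3, burn=1000, keep=8):
    for _ in range(burn):                # discard the transient
        w = R * w * (1.0 - w)            # W_{t+1} = R W_t (1 - W_t)
    orbit = []
    for _ in range(keep):                # record the attractor values
        w = R * w * (1.0 - w)
        orbit.append(round(w, 4))
    return sorted(set(orbit))

for R in (2.8, 3.2, 3.5, 3.9):           # fixed point → period 2 → 4 → chaos
    print(R, attractor(R))
```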
Applications
• Classification
• Compression
• Modeling Default Swaps
• Monte Carlo Integration
• Image Segmentation
• Natural Language Processing
• Social Networks
Example

Classifier from local image features:
P(Object Category | Local Image Information)
Classifier from boundary detection:
P(Object Categories are Different across Boundary | Boundary Information)
Combine the two with herding: herding will generate samples such that the local probabilities are respected as much as possible (a projection onto the marginal polytope).
Topological Entropy

Theorem [Goetz00]: Call $W(T)$ the number of possible subsequences of length $T$; then the topological entropy for herding is
$$h_{\text{top}} \;=\; \lim_{T\to\infty} \frac{\log W(T)}{T} \;=\; \lim_{T\to\infty} \frac{w \log T}{T} \;=\; 0$$
However, we are interested in the sub-extensive entropy [Nemenman et al.]:
$$h_{\text{subtop}} \;=\; \lim_{T\to\infty} \frac{\log W(T)}{\log T} \;=\; \lim_{T\to\infty} \frac{w \log T}{\log T} \;=\; w$$
Theorem: $h_{\text{subtop}} \le K$ (K = nr. of parameters)
Conjecture: $h_{\text{subtop}} = K$ (for typical herding systems)
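The subsequence count $W(T)$ can be probed numerically; the sketch below (a binary single-feature herding system, an assumption made for illustration) compares the number of distinct length-$L$ windows in a herding sequence with those in an IID sequence of the same mean:

```python
# Sketch probing the theorem above: count distinct length-L windows in a
# herding sequence vs. an IID one (binary toy system; an assumption).
import numpy as np

rng = np.random.default_rng(3)
pi, w, herd = 0.37, 0.37, []         # herd a binary x toward Ê[x] = 0.37
for _ in range(50_000):
    x = 1 if w > 0 else 0            # x_t = argmax_x W_t·x
    w += pi - x                      # W_{t+1} = W_t + Ê[x] - x_t
    herd.append(x)
iid = (rng.random(50_000) < pi).astype(int).tolist()

def count_windows(seq, L):           # W(L): number of distinct windows
    return len({tuple(seq[i:i + L]) for i in range(len(seq) - L)})

for L in (4, 8, 12):
    print(L, count_windows(herd, L), count_windows(iid, L))
# herding: W(L) grows slowly (log W(L) ~ w log L, so h_top = 0);
# IID:     W(L) grows like 2^L until limited by the sequence length.
```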
Learning Systems

Bayesian evidence: $\log P(X) \;\approx\; \text{extensive terms} \;-\; \frac{K}{2}\log(N)$, so $H[p(\tilde{X}|X)] \sim \frac{K}{2}\log(N)$ is the information we learn from the random IID data.

Herding is not random and not IID due to negative auto-correlations. The information in its sequence is $\sim K\log(N)$.

We can therefore represent the original (random) data sample by a much smaller subset without loss of information content ($N$ instead of $N^2$ samples).

These shorter herding sequences can be used to efficiently approximate averages by Monte Carlo sums.
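A sketch of the compression claim (binary data and the sizes below are illustrative assumptions): roughly $\sqrt{N}$ herding samples preserve a data moment about as well as all $N$ IID samples, which is the same scaling as "$N$ instead of $N^2$ samples".

```python
# Sketch: a herding sequence of length T = sqrt(N) matches the data moment
# about as well as the full N-sample IID set (binary toy data; an assumption).
import numpy as np

rng = np.random.default_rng(4)
N = 10_000
data = rng.random(N) < 0.3          # N IID binary "data" points
p_hat = data.mean()                 # the moment to preserve

T = int(np.sqrt(N))                 # only sqrt(N) = 100 herding samples
w, total = p_hat, 0
for _ in range(T):
    x = 1 if w > 0 else 0           # herd a single binary feature
    w += p_hat - x
    total += x

print(abs(total / T - p_hat))       # herding error: O(1/T), here ~1e-2
print(data.std() / np.sqrt(N))      # IID error scale at N samples: comparable
```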
Conclusions
• Herding is an efficient alternative for learning in MRFs.
• Edge of chaos dynamics provides more efficient information processing than random sampling.
• A general principle that underlies information processing in the brain?
• We advocate exploring potentially interesting connections between computation, learning, and the theory of nonlinear dynamical systems and chaos. What can we learn from viewing learning as a nonlinear dynamical process?