
Deep Boltzmann Machines

Ruslan Salakhutdinov and Geoffrey E. Hinton

Amish Goel

University of Illinois Urbana-Champaign

agoel10@illinois.edu

December 2, 2016


Overview

1 Introduction
   Representation of the model

2 Learning in Boltzmann Machines
   Variational Lower Bound - Mean Field Approximation
   Stochastic Approximation Procedure - Persistent Markov Chains

3 Additional Tricks for DBM
   Greedy Pretraining of the Model
   Discriminative Finetuning

4 Simulation results


Introduction

A Boltzmann Machine is a pairwise Markov random field: some of the binary random variables are treated as latent, i.e. hidden (h), and the others as visible (v).

The probability distribution for binary random variables is given by

\[
P_\theta(\mathbf{v}, \mathbf{h}) = \frac{1}{Z_\theta}\, e^{-E_\theta(\mathbf{v},\mathbf{h})}, \qquad \theta = \{L, J, W\},
\]
\[
E_\theta(\mathbf{v},\mathbf{h}) = -\frac{1}{2}\mathbf{v}^T L \mathbf{v} - \frac{1}{2}\mathbf{h}^T J \mathbf{h} - \mathbf{v}^T W \mathbf{h}.
\]

Figure: Model for Boltzmann Machines
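As a concrete illustration of the energy function above, here is a minimal NumPy sketch (the toy sizes, parameter values, and names are illustrative; L and J are taken symmetric with zero diagonals, and the bias terms are omitted as in the slide):

```python
import numpy as np

def energy(v, h, L, J, W):
    """E_theta(v, h) = -1/2 v^T L v - 1/2 h^T J h - v^T W h."""
    return -0.5 * v @ L @ v - 0.5 * h @ J @ h - v @ W @ h

def unnormalized_prob(v, h, L, J, W):
    """exp(-E_theta(v, h)); dividing by Z_theta would give P_theta(v, h)."""
    return np.exp(-energy(v, h, L, J, W))

rng = np.random.default_rng(0)
nv, nh = 3, 2                                  # 3 visible, 2 hidden binary units
W = 0.1 * rng.standard_normal((nv, nh))        # visible-hidden couplings
L = np.zeros((nv, nv))                         # visible-visible couplings (zero diagonal)
J = np.zeros((nh, nh))                         # hidden-hidden couplings (zero diagonal)
v = rng.integers(0, 2, nv).astype(float)
h = rng.integers(0, 2, nh).astype(float)
print(energy(v, h, L, J, W), unnormalized_prob(v, h, L, J, W))
```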


Representation

While the Boltzmann Machine is a powerful model of the data, it is computationally expensive to learn, so one considers several approximations to it.

Figure: Boltzmann Machines vs RBM

A Deep Boltzmann Machine arranges the hidden nodes in several layers, where a layer is a group of units with no direct connections among them.

Figure: Model for Deep Boltzmann Machines

Learning in Boltzmann Machines

The model can be trained by maximum likelihood. The gradient of the log-likelihood takes the following form:

\[
\ln L_\theta(\mathbf{v}) = \ln p_\theta(\mathbf{v}) = \ln\Big(\sum_{\mathbf{h}} p_\theta(\mathbf{v},\mathbf{h})\Big)
= \ln \sum_{\mathbf{h}} \exp\big(-E_\theta(\mathbf{v},\mathbf{h})\big) - \ln \sum_{\mathbf{v},\mathbf{h}} \exp\big(-E_\theta(\mathbf{v},\mathbf{h})\big)
\]
\[
\frac{\partial \ln L_\theta(\mathbf{v})}{\partial \theta}
= -\underbrace{\sum_{\mathbf{h}} p(\mathbf{h}\mid\mathbf{v})\,\frac{\partial E_\theta(\mathbf{v},\mathbf{h})}{\partial \theta}}_{\text{data-dependent expectation}}
+ \underbrace{\sum_{\mathbf{v},\mathbf{h}} p(\mathbf{v},\mathbf{h})\,\frac{\partial E_\theta(\mathbf{v},\mathbf{h})}{\partial \theta}}_{\text{model-dependent expectation}}
\tag{1}
\]



Learning in Boltzmann Machines

Substituting E_θ(v, h) into the gradient obtained in the previous equation and applying gradient ascent, one obtains the updates for the respective parameters:

\[
\begin{aligned}
\Delta W &= \alpha\big(\mathbb{E}_{P_\text{data}}[\mathbf{v}\mathbf{h}^T] - \mathbb{E}_{P_\text{model}}[\mathbf{v}\mathbf{h}^T]\big),\\
\Delta L &= \alpha\big(\mathbb{E}_{P_\text{data}}[\mathbf{v}\mathbf{v}^T] - \mathbb{E}_{P_\text{model}}[\mathbf{v}\mathbf{v}^T]\big),\\
\Delta J &= \alpha\big(\mathbb{E}_{P_\text{data}}[\mathbf{h}\mathbf{h}^T] - \mathbb{E}_{P_\text{model}}[\mathbf{h}\mathbf{h}^T]\big),\\
\Delta \mathbf{b} &= \alpha\big(\mathbb{E}_{P_\text{data}}[\mathbf{v}] - \mathbb{E}_{P_\text{model}}[\mathbf{v}]\big),\\
\Delta \mathbf{c} &= \alpha\big(\mathbb{E}_{P_\text{data}}[\mathbf{h}] - \mathbb{E}_{P_\text{model}}[\mathbf{h}]\big)
\end{aligned}
\tag{2}
\]

These maximum-likelihood parameter updates are very costly: computing both expectations requires summing over an exponential number of configurations. One needs approximations.
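To make the update rule concrete, here is a hedged sketch in which each expectation in equation (2) is replaced by an empirical average over samples (the array and function names are illustrative; obtaining the model samples is the hard part addressed in the following slides):

```python
import numpy as np

def boltzmann_updates(v_data, h_data, v_model, h_model, alpha=0.01):
    """One gradient-ascent step of equation (2).  Rows of each array are samples,
    columns are units; the data-dependent and model-dependent expectations are
    approximated by empirical averages over the supplied samples."""
    n_d, n_m = len(v_data), len(v_model)
    dW = alpha * (v_data.T @ h_data / n_d - v_model.T @ h_model / n_m)
    dL = alpha * (v_data.T @ v_data / n_d - v_model.T @ v_model / n_m)
    dJ = alpha * (h_data.T @ h_data / n_d - h_model.T @ h_model / n_m)
    db = alpha * (v_data.mean(axis=0) - v_model.mean(axis=0))
    dc = alpha * (h_data.mean(axis=0) - h_model.mean(axis=0))
    return dW, dL, dJ, db, dc
```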


Approximate Maximum Likelihood Learning in Boltzmann Machines

One approximation is to use a variational lower bound on the log-likelihood:

\[
\ln p_\theta(\mathbf{v}) = \ln\Big(\sum_{\mathbf{h}} p_\theta(\mathbf{v},\mathbf{h})\Big)
= \ln\Big(\sum_{\mathbf{h}} \frac{q_\mu(\mathbf{h}\mid\mathbf{v})}{q_\mu(\mathbf{h}\mid\mathbf{v})}\, p_\theta(\mathbf{v},\mathbf{h})\Big)
\geq \sum_{\mathbf{h}} q_\mu(\mathbf{h}\mid\mathbf{v}) \ln p_\theta(\mathbf{v},\mathbf{h}) + H(q_\mu) = \mathcal{L}(q_\mu,\theta)
\tag{3}
\]

where \(q_\mu(\mathbf{h}\mid\mathbf{v})\) is an approximate (variational) posterior distribution and \(H(\cdot)\) is the entropy, taken with the natural logarithm.

The goal is to find the tightest lower bound on the log-likelihood by optimizing over the distributions \(q_\mu\) and the parameters \(\theta\).
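On a model small enough to enumerate, the bound in equation (3) can be checked numerically; the sketch below does this for a toy machine (sizes and names are illustrative, and L, J and the biases are dropped for brevity):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
nv, nh = 2, 3
W = 0.5 * rng.standard_normal((nv, nh))

def energy(v, h):                        # L, J and biases omitted for brevity
    return -v @ W @ h

states_v = np.array(list(product([0, 1], repeat=nv)), dtype=float)
states_h = np.array(list(product([0, 1], repeat=nh)), dtype=float)

log_Z = np.log(sum(np.exp(-energy(sv, sh)) for sv in states_v for sh in states_h))
v = np.array([1.0, 0.0])
log_pv = np.log(sum(np.exp(-energy(v, sh)) for sh in states_h)) - log_Z

# any fully factorized q(h|v) with q(h_j = 1) = mu_j gives a lower bound (eq. 3)
mu = rng.uniform(0.05, 0.95, nh)
q = np.prod(states_h * mu + (1 - states_h) * (1 - mu), axis=1)   # q(h) for each h
log_pvh = np.array([-energy(v, sh) for sh in states_h]) - log_Z  # ln p(v, h)
elbo = np.sum(q * log_pvh) - np.sum(q * np.log(q))               # E_q[ln p] + H(q)
print(log_pv, elbo)   # elbo <= log_pv for every choice of mu
```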



Variational Learning for Boltzmann Machines

For Boltzmann Machines, the lower bound can be rewritten as (ignoring the bias terms):

\[
\mathcal{L}(q_\mu,\theta) = \sum_{\mathbf{h}} q_\mu(\mathbf{h}\mid\mathbf{v})\,\big(-E_\theta(\mathbf{v},\mathbf{h})\big) - \ln Z_\theta + H(q_\mu)
\tag{4}
\]

Using the mean-field approximation, \(q_\mu(\mathbf{h}\mid\mathbf{v}) = \prod_{j=1}^{M} q(h_j \mid \mathbf{v})\), and one assumes \(q(h_j = 1) = \mu_j\). (M is the number of hidden units.)

\[
\begin{aligned}
\mathcal{L}(q_\mu,\theta) &= \sum_{\mathbf{h}} \prod_{j=1}^{M} q_\mu(h_j \mid \mathbf{v})\Big(\frac{1}{2}\mathbf{v}^T L \mathbf{v} + \frac{1}{2}\mathbf{h}^T J \mathbf{h} + \mathbf{v}^T W \mathbf{h}\Big) - \ln Z_\theta + H(q_\mu)\\
&= \frac{1}{2}\mathbf{v}^T L \mathbf{v} + \frac{1}{2}\boldsymbol{\mu}^T J \boldsymbol{\mu} + \mathbf{v}^T W \boldsymbol{\mu} - \ln Z_\theta + \sum_{j=1}^{M} H(\mu_j)
\end{aligned}
\tag{5}
\]



Variational EM Learning for Boltzmann Machines

Maximize the lower bound by alternating maximization over the variational parameters µ and the model parameters θ: the typical EM learning idea.

\[
\text{E-step:}\quad \sup_{\mu}\, \mathcal{L}(q_\mu,\theta) = \sup_{\mu}\; \frac{1}{2}\mathbf{v}^T L \mathbf{v} + \frac{1}{2}\boldsymbol{\mu}^T J \boldsymbol{\mu} + \mathbf{v}^T W \boldsymbol{\mu} - \ln Z_\theta + \sum_{j=1}^{M} H(\mu_j)
\]

Using coordinate-wise (alternating) maximization over each variational parameter, one gets the update

\[
\mu_j \leftarrow \sigma\Big(\sum_i W_{ij}\, v_i + \sum_{m \neq j} J_{mj}\,\mu_m\Big),
\]

where σ(.) denotes the sigmoid function.

Iterating these updates, the variational parameters µ converge to a fixed point.
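A sketch of this E-step as code: the slide's update sweeps over the coordinates µ_j one at a time, while the version below applies the update to all coordinates in parallel, a common simplification (names and sizes are illustrative, biases omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, J, n_iters=30):
    """Fixed-point iteration of mu_j <- sigma(sum_i W_ij v_i + sum_{m != j} J_mj mu_m).
    J is symmetric with a zero diagonal, so mu @ J already excludes the m = j term."""
    mu = np.full(W.shape[1], 0.5)      # initialize the variational parameters
    for _ in range(n_iters):
        mu = sigmoid(v @ W + mu @ J)
    return mu
```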



Stochastic Approximations or Persistent Markov Chains

\[
\text{M-step:}\quad \sup_{\theta}\, \mathcal{L}(q_\mu,\theta) = \sup_{\theta}\; \frac{1}{2}\mathbf{v}^T L \mathbf{v} + \frac{1}{2}\boldsymbol{\mu}^T J \boldsymbol{\mu} + \mathbf{v}^T W \boldsymbol{\mu} - \ln Z_\theta + \sum_{j=1}^{M} H(\mu_j)
\]

MCMC sampling with persistent Markov chains is used to approximate the gradient of the log-partition function \(\ln Z_\theta\).

The parameter updates for one training example can be written as,

\[
\begin{aligned}
\Delta W &= \alpha_t\Big(\mathbf{v}\boldsymbol{\mu}^T - \frac{1}{N}\sum_{i=1}^{N} \tilde{\mathbf{v}}_i \tilde{\mathbf{h}}_i^T\Big),\\
\Delta L &= \alpha_t\Big(\mathbf{v}\mathbf{v}^T - \frac{1}{N}\sum_{i=1}^{N} \tilde{\mathbf{v}}_i \tilde{\mathbf{v}}_i^T\Big),\\
\Delta J &= \alpha_t\Big(\boldsymbol{\mu}\boldsymbol{\mu}^T - \frac{1}{N}\sum_{i=1}^{N} \tilde{\mathbf{h}}_i \tilde{\mathbf{h}}_i^T\Big),
\end{aligned}
\tag{6}
\]

where \((\tilde{\mathbf{v}}_i, \tilde{\mathbf{h}}_i)\) denote the current states of the persistent Markov chains.
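The persistent chains themselves are advanced with Gibbs sampling; a sketch of one sweep for a fully general Boltzmann Machine is below (biases omitted, names illustrative; for a DBM the sweep simplifies to the layer-parallel form shown later):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h, L, J, W, rng):
    """One Gibbs sweep over a fantasy particle (v, h): every binary unit is
    resampled from its conditional distribution given all the others.
    L and J are symmetric with zero diagonals, so the self-terms vanish."""
    for j in range(len(h)):
        p = sigmoid(v @ W[:, j] + h @ J[:, j])
        h[j] = float(rng.random() < p)
    for i in range(len(v)):
        p = sigmoid(v @ L[:, i] + W[i] @ h)
        v[i] = float(rng.random() < p)
    return v, h
```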



Overall Algorithm for Training Boltzmann Machines

Data: training set \(S_N\) of N binary data vectors \(\{\mathbf{v}\}\) and M, the number of persistent Markov chains.

Initialize the parameter vector \(\theta^0\) and M samples \(\{\tilde{\mathbf{v}}^{0,1}, \tilde{\mathbf{h}}^{0,1}\}, \ldots, \{\tilde{\mathbf{v}}^{0,M}, \tilde{\mathbf{h}}^{0,M}\}\).
for t = 0 to T (number of iterations) do
    for each \(\mathbf{v}^n \in S_N\) do
        Randomly initialize \(\boldsymbol{\mu}^n\) and run the mean-field updates until convergence:
        \(\mu_j \leftarrow \sigma\big(\sum_i W_{ij} v_i + \sum_{m \neq j} J_{mj}\,\mu_m\big)\)
    end
    for m = 1 to M (number of persistent Markov chains) do
        Sample \((\tilde{\mathbf{v}}^{t+1,m}, \tilde{\mathbf{h}}^{t+1,m})\) given \((\tilde{\mathbf{v}}^{t,m}, \tilde{\mathbf{h}}^{t,m})\) by running the Gibbs sampler.
    end
    Update \(\theta\) using equation (6) (adjusted for batch data) and decrease the learning rate \(\alpha_t\).
end
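As a rough end-to-end sketch of this algorithm, the loop below combines the mean-field E-step with persistent Gibbs chains and an update in the spirit of equation (6); to keep it short it treats the RBM-like special case L = J = 0 with no biases and uses the full data set as one batch (all names, sizes, and schedules are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bm(data, n_hidden, n_chains=100, n_iters=1000, alpha0=0.05, rng=None):
    """Simplified training loop: mean-field E-step on the data, persistent Gibbs
    chains for the model expectation, and a stochastic gradient step on W."""
    rng = rng or np.random.default_rng(0)
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    # persistent fantasy particles, initialized at random
    v_f = rng.integers(0, 2, (n_chains, n_vis)).astype(float)
    h_f = rng.integers(0, 2, (n_chains, n_hidden)).astype(float)
    for t in range(n_iters):
        alpha = alpha0 / (1.0 + t / 100.0)                 # decreasing learning rate
        # E-step: with J = 0 the mean-field posterior is exact after one pass
        mu = sigmoid(data @ W)
        # advance each persistent chain by one block-Gibbs sweep
        h_f = (rng.random(h_f.shape) < sigmoid(v_f @ W)).astype(float)
        v_f = (rng.random(v_f.shape) < sigmoid(h_f @ W.T)).astype(float)
        # M-step: data-dependent minus model-dependent statistics
        W += alpha * (data.T @ mu / len(data) - v_f.T @ h_f / n_chains)
    return W

# example use on synthetic binary data:
# W = train_bm((np.random.rand(500, 20) > 0.5).astype(float), n_hidden=10)
```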


Learning for Deep Boltzmann Machines

For Deep Boltzmann Machines, L = 0 and J has many zero blocks, since hidden units interact only across adjacent layers. Several computations are therefore simplified.

The Gibbs sampling procedure also simplifies: all units in one layer can be sampled in parallel (see the sketch below).

Even so, learning was observed to be slow, and greedy pretraining can result in faster convergence of the parameters.
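For a DBM with two hidden layers, the layer-parallel Gibbs step looks roughly like this (W1 connects v to h1 and W2 connects h1 to h2; the names and the omission of biases are illustrative simplifications):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    """One alternating Gibbs sweep in a two-hidden-layer DBM.  The odd layer (h1)
    and the even layers (v, h2) are conditionally independent given each other,
    so each whole layer is sampled in parallel."""
    h1 = (rng.random(h1.shape) < sigmoid(v @ W1 + h2 @ W2.T)).astype(float)
    v  = (rng.random(v.shape)  < sigmoid(h1 @ W1.T)).astype(float)
    h2 = (rng.random(h2.shape) < sigmoid(h1 @ W2)).astype(float)
    return v, h1, h2
```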


Pretraining in Deep Boltzmann Machines

Each layer is pretrained as a separate RBM, with some weight scaling applied when the RBMs are composed into the DBM.

Figure: Greedy Layerwise Pretraining for DBM
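A hedged sketch of the layerwise pretraining: each RBM in the stack is trained on the hidden activations of the one below it, here with one-step contrastive divergence; the specific weight-scaling the paper uses when composing the RBMs into a DBM is not reproduced here, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, n_epochs=10, alpha=0.05, rng=None):
    """One-step contrastive-divergence training of a single RBM (no biases)."""
    rng = rng or np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(n_epochs):
        h_prob = sigmoid(data @ W)                       # positive phase
        h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_samp @ W.T)                  # one reconstruction step
        h_recon = sigmoid(v_recon @ W)                   # negative phase
        W += alpha * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    return W

# stacking: the second RBM is trained on the first layer's hidden probabilities
# W1 = train_rbm_cd1(train_images, 500)
# W2 = train_rbm_cd1(sigmoid(train_images @ W1), 1000)
```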


Discriminative Finetuning in Deep Boltzmann Machines

Further, an additional finetuning step is considered to improve performance.

For example, for a two-hidden-layer DBM, an approximate posterior is used as an augmented input to a neural network whose weights are initialized from the parameters of the DBM.

Figure: Finetuning the parameters of DBM
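A sketch of how such an augmented input can be formed for a two-hidden-layer DBM: the mean-field posterior over the top hidden layer is computed and appended to the visible vector before being passed to the feed-forward network (the exact wiring, the names, and the omission of biases are assumptions made for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def finetune_input(v, W1, W2, n_mf_iters=10):
    """Builds an augmented input [v, q(h2)] for discriminative finetuning.
    q(h2) is the mean-field posterior over the top hidden layer; the resulting
    vector is then fed to a feed-forward net initialized from W1, W2."""
    mu1 = sigmoid(v @ W1)                      # crude initialization of q(h1)
    mu2 = sigmoid(mu1 @ W2)
    for _ in range(n_mf_iters):                # mean-field updates for the posterior
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        mu2 = sigmoid(mu1 @ W2)
    return np.concatenate([v, mu2])
```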


Some Experimental Results and Observations

A DBM is trained to model handwritten digits from the MNIST dataset.

(a) DBM Model used for Training (b) Examples of handwritten digits

Figure: An example of a DBM used for MNIST data generation, with training done on 60,000 examples

Some interesting observations: without greedy pretraining, the models did not produce good results.

With discriminative finetuning, the DBM gave 99.5% accuracy, the best recognition result on the MNIST dataset at that time.


Thank You
