
Deep Boltzmann Machines

Ruslan Salakhutdinov and Geoffrey E. Hinton

Amish Goel

University of Illinois Urbana-Champaign

agoel10@illinois.edu

December 2, 2016


Overview

1 Introduction
   Representation of the model

2 Learning in Boltzmann Machines
   Variational Lower Bound - Mean Field Approximation
   Stochastic Approximation Procedure - Persistent Markov Chains

3 Additional Tricks for DBM
   Greedy Pretraining of the Model
   Discriminative Finetuning

4 Simulation results


Introduction

A Boltzmann Machine is a pairwise Markov random field: some of the binary random variables are treated as latent, i.e. hidden (h), and the others as visible (v).

The probability distribution for binary random variables is given by

\[
P_\theta(\mathbf{v}, \mathbf{h}) = \frac{1}{Z_\theta}\, e^{-E_\theta(\mathbf{v},\mathbf{h})}, \qquad \theta = \{L, J, W\},
\]
\[
E_\theta(\mathbf{v},\mathbf{h}) = -\frac{1}{2}\mathbf{v}^T L \mathbf{v} - \frac{1}{2}\mathbf{h}^T J \mathbf{h} - \mathbf{v}^T W \mathbf{h}.
\]

Figure: Model for Boltzmann Machines
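As a concrete illustration of the energy function above, here is a minimal NumPy sketch (the toy sizes, parameter values, and names are illustrative; L and J are taken symmetric with zero diagonals, and the bias terms are omitted as in the slide):

```python
import numpy as np

def energy(v, h, L, J, W):
    """E_theta(v, h) = -1/2 v^T L v - 1/2 h^T J h - v^T W h."""
    return -0.5 * v @ L @ v - 0.5 * h @ J @ h - v @ W @ h

def unnormalized_prob(v, h, L, J, W):
    """exp(-E_theta(v, h)); dividing by Z_theta would give P_theta(v, h)."""
    return np.exp(-energy(v, h, L, J, W))

rng = np.random.default_rng(0)
nv, nh = 3, 2                                  # 3 visible, 2 hidden binary units
W = 0.1 * rng.standard_normal((nv, nh))        # visible-hidden couplings
L = np.zeros((nv, nv))                         # visible-visible couplings (zero diagonal)
J = np.zeros((nh, nh))                         # hidden-hidden couplings (zero diagonal)
v = rng.integers(0, 2, nv).astype(float)
h = rng.integers(0, 2, nh).astype(float)
print(energy(v, h, L, J, W), unnormalized_prob(v, h, L, J, W))
```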


Representation

While the Boltzmann Machine is a powerful model of the data, it is computationally expensive to learn, so one considers several approximations to it.

Figure: Boltzmann Machines vs RBM

A Deep Boltzmann Machine arranges the hidden nodes in several layers, where a layer is a group of units with no direct connections among them.

Figure: Model for Deep Boltzmann Machines

Learning in Boltzmann Machines

The model can be trained by maximum likelihood. The gradient of the log-likelihood takes the following form:

\[
\ln L_\theta(\mathbf{v}) = \ln p_\theta(\mathbf{v}) = \ln\Big(\sum_{\mathbf{h}} p_\theta(\mathbf{v},\mathbf{h})\Big)
= \ln \sum_{\mathbf{h}} \exp\big(-E_\theta(\mathbf{v},\mathbf{h})\big) - \ln \sum_{\mathbf{v},\mathbf{h}} \exp\big(-E_\theta(\mathbf{v},\mathbf{h})\big)
\]
\[
\frac{\partial \ln L_\theta(\mathbf{v})}{\partial \theta}
= -\underbrace{\sum_{\mathbf{h}} p(\mathbf{h}\mid\mathbf{v})\,\frac{\partial E_\theta(\mathbf{v},\mathbf{h})}{\partial \theta}}_{\text{data-dependent expectation}}
+ \underbrace{\sum_{\mathbf{v},\mathbf{h}} p(\mathbf{v},\mathbf{h})\,\frac{\partial E_\theta(\mathbf{v},\mathbf{h})}{\partial \theta}}_{\text{model-dependent expectation}}
\tag{1}
\]



Learning in Boltzmann Machines

Substituting E_θ(v, h) into the gradient obtained in the previous equation and applying gradient ascent, one obtains the updates for the respective parameters:

\[
\begin{aligned}
\Delta W &= \alpha\big(\mathbb{E}_{P_\text{data}}[\mathbf{v}\mathbf{h}^T] - \mathbb{E}_{P_\text{model}}[\mathbf{v}\mathbf{h}^T]\big),\\
\Delta L &= \alpha\big(\mathbb{E}_{P_\text{data}}[\mathbf{v}\mathbf{v}^T] - \mathbb{E}_{P_\text{model}}[\mathbf{v}\mathbf{v}^T]\big),\\
\Delta J &= \alpha\big(\mathbb{E}_{P_\text{data}}[\mathbf{h}\mathbf{h}^T] - \mathbb{E}_{P_\text{model}}[\mathbf{h}\mathbf{h}^T]\big),\\
\Delta \mathbf{b} &= \alpha\big(\mathbb{E}_{P_\text{data}}[\mathbf{v}] - \mathbb{E}_{P_\text{model}}[\mathbf{v}]\big),\\
\Delta \mathbf{c} &= \alpha\big(\mathbb{E}_{P_\text{data}}[\mathbf{h}] - \mathbb{E}_{P_\text{model}}[\mathbf{h}]\big)
\end{aligned}
\tag{2}
\]

These maximum-likelihood parameter updates are very costly: computing both expectations requires summing over an exponential number of configurations. One needs approximations.
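To make the update rule concrete, here is a hedged sketch in which each expectation in equation (2) is replaced by an empirical average over samples (the array and function names are illustrative; obtaining the model samples is the hard part addressed in the following slides):

```python
import numpy as np

def boltzmann_updates(v_data, h_data, v_model, h_model, alpha=0.01):
    """One gradient-ascent step of equation (2).  Rows of each array are samples,
    columns are units; the data-dependent and model-dependent expectations are
    approximated by empirical averages over the supplied samples."""
    n_d, n_m = len(v_data), len(v_model)
    dW = alpha * (v_data.T @ h_data / n_d - v_model.T @ h_model / n_m)
    dL = alpha * (v_data.T @ v_data / n_d - v_model.T @ v_model / n_m)
    dJ = alpha * (h_data.T @ h_data / n_d - h_model.T @ h_model / n_m)
    db = alpha * (v_data.mean(axis=0) - v_model.mean(axis=0))
    dc = alpha * (h_data.mean(axis=0) - h_model.mean(axis=0))
    return dW, dL, dJ, db, dc
```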


Approximate Maximum Likelihood Learning in Boltzmann Machines

One approximation is to use a variational lower bound on the log-likelihood:

\[
\ln p_\theta(\mathbf{v}) = \ln\Big(\sum_{\mathbf{h}} p_\theta(\mathbf{v},\mathbf{h})\Big)
= \ln\Big(\sum_{\mathbf{h}} \frac{q_\mu(\mathbf{h}\mid\mathbf{v})}{q_\mu(\mathbf{h}\mid\mathbf{v})}\, p_\theta(\mathbf{v},\mathbf{h})\Big)
\geq \sum_{\mathbf{h}} q_\mu(\mathbf{h}\mid\mathbf{v}) \ln p_\theta(\mathbf{v},\mathbf{h}) + H(q_\mu) = \mathcal{L}(q_\mu,\theta)
\tag{3}
\]

where \(q_\mu(\mathbf{h}\mid\mathbf{v})\) is an approximate (variational) posterior distribution and \(H(\cdot)\) is the entropy, taken with the natural logarithm.

The goal is to find the tightest lower bound on the log-likelihood by optimizing over the distributions \(q_\mu\) and the parameters \(\theta\).
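On a model small enough to enumerate, the bound in equation (3) can be checked numerically; the sketch below does this for a toy machine (sizes and names are illustrative, and L, J and the biases are dropped for brevity):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
nv, nh = 2, 3
W = 0.5 * rng.standard_normal((nv, nh))

def energy(v, h):                        # L, J and biases omitted for brevity
    return -v @ W @ h

states_v = np.array(list(product([0, 1], repeat=nv)), dtype=float)
states_h = np.array(list(product([0, 1], repeat=nh)), dtype=float)

log_Z = np.log(sum(np.exp(-energy(sv, sh)) for sv in states_v for sh in states_h))
v = np.array([1.0, 0.0])
log_pv = np.log(sum(np.exp(-energy(v, sh)) for sh in states_h)) - log_Z

# any fully factorized q(h|v) with q(h_j = 1) = mu_j gives a lower bound (eq. 3)
mu = rng.uniform(0.05, 0.95, nh)
q = np.prod(states_h * mu + (1 - states_h) * (1 - mu), axis=1)   # q(h) for each h
log_pvh = np.array([-energy(v, sh) for sh in states_h]) - log_Z  # ln p(v, h)
elbo = np.sum(q * log_pvh) - np.sum(q * np.log(q))               # E_q[ln p] + H(q)
print(log_pv, elbo)   # elbo <= log_pv for every choice of mu
```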



Variational Learning for Boltzmann Machines

For Boltzmann Machines, the lower bound can be rewritten as (ignoring the bias terms):

\[
\mathcal{L}(q_\mu,\theta) = \sum_{\mathbf{h}} q_\mu(\mathbf{h}\mid\mathbf{v})\,\big(-E_\theta(\mathbf{v},\mathbf{h})\big) - \ln Z_\theta + H(q_\mu)
\tag{4}
\]

Using the mean-field approximation, \(q_\mu(\mathbf{h}\mid\mathbf{v}) = \prod_{j=1}^{M} q(h_j \mid \mathbf{v})\), and one assumes \(q(h_j = 1) = \mu_j\). (M is the number of hidden units.)

\[
\begin{aligned}
\mathcal{L}(q_\mu,\theta) &= \sum_{\mathbf{h}} \prod_{j=1}^{M} q_\mu(h_j \mid \mathbf{v})\Big(\frac{1}{2}\mathbf{v}^T L \mathbf{v} + \frac{1}{2}\mathbf{h}^T J \mathbf{h} + \mathbf{v}^T W \mathbf{h}\Big) - \ln Z_\theta + H(q_\mu)\\
&= \frac{1}{2}\mathbf{v}^T L \mathbf{v} + \frac{1}{2}\boldsymbol{\mu}^T J \boldsymbol{\mu} + \mathbf{v}^T W \boldsymbol{\mu} - \ln Z_\theta + \sum_{j=1}^{M} H(\mu_j)
\end{aligned}
\tag{5}
\]



Variational EM Learning for Boltzmann Machines

Maximize the lower bound by alternating maximization over the variational parameters µ and the model parameters θ: the typical EM learning idea.

\[
\text{E-step:}\quad \sup_{\mu}\, \mathcal{L}(q_\mu,\theta) = \sup_{\mu}\; \frac{1}{2}\mathbf{v}^T L \mathbf{v} + \frac{1}{2}\boldsymbol{\mu}^T J \boldsymbol{\mu} + \mathbf{v}^T W \boldsymbol{\mu} - \ln Z_\theta + \sum_{j=1}^{M} H(\mu_j)
\]

Using coordinate-wise (alternating) maximization over each variational parameter, one gets the update

\[
\mu_j \leftarrow \sigma\Big(\sum_i W_{ij}\, v_i + \sum_{m \neq j} J_{mj}\,\mu_m\Big),
\]

where σ(.) denotes the sigmoid function.

Iterating these updates, the variational parameters µ converge to a fixed point.
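A sketch of this E-step as code: the slide's update sweeps over the coordinates µ_j one at a time, while the version below applies the update to all coordinates in parallel, a common simplification (names and sizes are illustrative, biases omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, J, n_iters=30):
    """Fixed-point iteration of mu_j <- sigma(sum_i W_ij v_i + sum_{m != j} J_mj mu_m).
    J is symmetric with a zero diagonal, so mu @ J already excludes the m = j term."""
    mu = np.full(W.shape[1], 0.5)      # initialize the variational parameters
    for _ in range(n_iters):
        mu = sigmoid(v @ W + mu @ J)
    return mu
```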



Stochastic Approximations or Persistent Markov Chains

\[
\text{M-step:}\quad \sup_{\theta}\, \mathcal{L}(q_\mu,\theta) = \sup_{\theta}\; \frac{1}{2}\mathbf{v}^T L \mathbf{v} + \frac{1}{2}\boldsymbol{\mu}^T J \boldsymbol{\mu} + \mathbf{v}^T W \boldsymbol{\mu} - \ln Z_\theta + \sum_{j=1}^{M} H(\mu_j)
\]

MCMC sampling with persistent Markov chains is used to approximate the gradient of the log-partition function \(\ln Z_\theta\).

The parameter updates for one training example can be written as,

\[
\begin{aligned}
\Delta W &= \alpha_t\Big(\mathbf{v}\boldsymbol{\mu}^T - \frac{1}{N}\sum_{i=1}^{N} \tilde{\mathbf{v}}_i \tilde{\mathbf{h}}_i^T\Big),\\
\Delta L &= \alpha_t\Big(\mathbf{v}\mathbf{v}^T - \frac{1}{N}\sum_{i=1}^{N} \tilde{\mathbf{v}}_i \tilde{\mathbf{v}}_i^T\Big),\\
\Delta J &= \alpha_t\Big(\boldsymbol{\mu}\boldsymbol{\mu}^T - \frac{1}{N}\sum_{i=1}^{N} \tilde{\mathbf{h}}_i \tilde{\mathbf{h}}_i^T\Big),
\end{aligned}
\tag{6}
\]

where \((\tilde{\mathbf{v}}_i, \tilde{\mathbf{h}}_i)\) denote the current states of the persistent Markov chains.
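The persistent chains themselves are advanced with Gibbs sampling; a sketch of one sweep for a fully general Boltzmann Machine is below (biases omitted, names illustrative; for a DBM the sweep simplifies to the layer-parallel form shown later):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h, L, J, W, rng):
    """One Gibbs sweep over a fantasy particle (v, h): every binary unit is
    resampled from its conditional distribution given all the others.
    L and J are symmetric with zero diagonals, so the self-terms vanish."""
    for j in range(len(h)):
        p = sigmoid(v @ W[:, j] + h @ J[:, j])
        h[j] = float(rng.random() < p)
    for i in range(len(v)):
        p = sigmoid(v @ L[:, i] + W[i] @ h)
        v[i] = float(rng.random() < p)
    return v, h
```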



Overall Algorithm for Training Boltzmann Machines

Data: training set \(S_N\) of N binary data vectors \(\{\mathbf{v}\}\) and M, the number of persistent Markov chains.

Initialize the parameter vector \(\theta^0\) and M samples \(\{\tilde{\mathbf{v}}^{0,1}, \tilde{\mathbf{h}}^{0,1}\}, \ldots, \{\tilde{\mathbf{v}}^{0,M}, \tilde{\mathbf{h}}^{0,M}\}\).
for t = 0 to T (number of iterations) do
    for each \(\mathbf{v}^n \in S_N\) do
        Randomly initialize \(\boldsymbol{\mu}^n\) and run the mean-field updates until convergence:
        \(\mu_j \leftarrow \sigma\big(\sum_i W_{ij} v_i + \sum_{m \neq j} J_{mj}\,\mu_m\big)\)
    end
    for m = 1 to M (number of persistent Markov chains) do
        Sample \((\tilde{\mathbf{v}}^{t+1,m}, \tilde{\mathbf{h}}^{t+1,m})\) given \((\tilde{\mathbf{v}}^{t,m}, \tilde{\mathbf{h}}^{t,m})\) by running the Gibbs sampler.
    end
    Update \(\theta\) using equation (6) (adjusted for batch data) and decrease the learning rate \(\alpha_t\).
end
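As a rough end-to-end sketch of this algorithm, the loop below combines the mean-field E-step with persistent Gibbs chains and an update in the spirit of equation (6); to keep it short it treats the RBM-like special case L = J = 0 with no biases and uses the full data set as one batch (all names, sizes, and schedules are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bm(data, n_hidden, n_chains=100, n_iters=1000, alpha0=0.05, rng=None):
    """Simplified training loop: mean-field E-step on the data, persistent Gibbs
    chains for the model expectation, and a stochastic gradient step on W."""
    rng = rng or np.random.default_rng(0)
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    # persistent fantasy particles, initialized at random
    v_f = rng.integers(0, 2, (n_chains, n_vis)).astype(float)
    h_f = rng.integers(0, 2, (n_chains, n_hidden)).astype(float)
    for t in range(n_iters):
        alpha = alpha0 / (1.0 + t / 100.0)                 # decreasing learning rate
        # E-step: with J = 0 the mean-field posterior is exact after one pass
        mu = sigmoid(data @ W)
        # advance each persistent chain by one block-Gibbs sweep
        h_f = (rng.random(h_f.shape) < sigmoid(v_f @ W)).astype(float)
        v_f = (rng.random(v_f.shape) < sigmoid(h_f @ W.T)).astype(float)
        # M-step: data-dependent minus model-dependent statistics
        W += alpha * (data.T @ mu / len(data) - v_f.T @ h_f / n_chains)
    return W

# example use on synthetic binary data:
# W = train_bm((np.random.rand(500, 20) > 0.5).astype(float), n_hidden=10)
```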


Learning for Deep Boltzmann Machines

For Deep Boltzmann Machines, L = 0 and J has many zero blocks, since hidden units interact only across adjacent layers. Several computations are therefore simplified.

The Gibbs sampling procedure also simplifies: all units in one layer can be sampled in parallel (see the sketch below).

Even so, learning was observed to be slow, and greedy pretraining can result in faster convergence of the parameters.
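For a DBM with two hidden layers, the layer-parallel Gibbs step looks roughly like this (W1 connects v to h1 and W2 connects h1 to h2; the names and the omission of biases are illustrative simplifications):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    """One alternating Gibbs sweep in a two-hidden-layer DBM.  The odd layer (h1)
    and the even layers (v, h2) are conditionally independent given each other,
    so each whole layer is sampled in parallel."""
    h1 = (rng.random(h1.shape) < sigmoid(v @ W1 + h2 @ W2.T)).astype(float)
    v  = (rng.random(v.shape)  < sigmoid(h1 @ W1.T)).astype(float)
    h2 = (rng.random(h2.shape) < sigmoid(h1 @ W2)).astype(float)
    return v, h1, h2
```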


Pretraining in Deep Boltzmann Machines

Each layer is pretrained as a separate RBM, with some weight scaling applied when the RBMs are composed into the DBM.

Figure: Greedy Layerwise Pretraining for DBM
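A hedged sketch of the layerwise pretraining: each RBM in the stack is trained on the hidden activations of the one below it, here with one-step contrastive divergence; the specific weight-scaling the paper uses when composing the RBMs into a DBM is not reproduced here, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, n_epochs=10, alpha=0.05, rng=None):
    """One-step contrastive-divergence training of a single RBM (no biases)."""
    rng = rng or np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(n_epochs):
        h_prob = sigmoid(data @ W)                       # positive phase
        h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_samp @ W.T)                  # one reconstruction step
        h_recon = sigmoid(v_recon @ W)                   # negative phase
        W += alpha * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    return W

# stacking: the second RBM is trained on the first layer's hidden probabilities
# W1 = train_rbm_cd1(train_images, 500)
# W2 = train_rbm_cd1(sigmoid(train_images @ W1), 1000)
```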


Discriminative Finetuning in Deep Boltzmann Machines

Further, an additional finetuning step is considered to improve performance.

For example, for a two-hidden-layer DBM, an approximate posterior is used as an augmented input to a neural network whose weights are initialized from the parameters of the DBM.

Figure: Finetuning the parameters of DBM
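A sketch of how such an augmented input can be formed for a two-hidden-layer DBM: the mean-field posterior over the top hidden layer is computed and appended to the visible vector before being passed to the feed-forward network (the exact wiring, the names, and the omission of biases are assumptions made for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def finetune_input(v, W1, W2, n_mf_iters=10):
    """Builds an augmented input [v, q(h2)] for discriminative finetuning.
    q(h2) is the mean-field posterior over the top hidden layer; the resulting
    vector is then fed to a feed-forward net initialized from W1, W2."""
    mu1 = sigmoid(v @ W1)                      # crude initialization of q(h1)
    mu2 = sigmoid(mu1 @ W2)
    for _ in range(n_mf_iters):                # mean-field updates for the posterior
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        mu2 = sigmoid(mu1 @ W2)
    return np.concatenate([v, mu2])
```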


Some Experimental Results and Observations

A DBM is trained to model handwritten digits from the MNIST dataset.

(a) DBM Model used for Training (b) Examples of handwritten digits

Figure: An example of a DBM used for MNIST data generation, with training done on 60,000 examples

Some interesting observations: without greedy pretraining, the models did not produce good results.

With discriminative finetuning, the DBM gave 99.5% accuracy, the best recognition result on the MNIST dataset at that time.


Thank You
