d-VMP: Distributed Variational Message Passing

Page 1:

d-VMP: Distributed Variational Message Passing

Andrés R. Masegosa¹, Ana M. Martínez², Helge Langseth¹, Thomas D. Nielsen², Antonio Salmerón³, Darío Ramos-López³, Anders L. Madsen²,⁴

¹ Department of Computer Science, Aalborg University, Denmark
² Department of Computer and Information Science, The Norwegian University of Science and Technology, Norway
³ Department of Mathematics, University of Almería, Spain
⁴ Hugin Expert A/S, Aalborg, Denmark

Int. Conf. on Probabilistic Graphical Models, Lugano, Sept. 6–9, 2016

Page 2:

Outline

1 Motivation

2 Variational Message Passing

3 d-VMP

4 Experimental results

5 Conclusions

Page 3:

Outline

1 Motivation

2 Variational Message Passing

3 d-VMP

4 Experimental results

5 Conclusions

Page 4:

Motivation

- Large
- Imbalanced
- Missing ("?") values
- Complex distributions

Page 5:

Motivation

- Large
- Imbalanced
- Missing ("?") values
- Complex distributions

[Figure: density over the attribute range for SVI 1%, SVI 5%, SVI 10%, and VMP 1%]

Page 6:

Motivation

- Goal: learn a generative model for a financial dataset to monitor the customers and make predictions for a single customer.

[Plate model: observed attributes X_i with local hidden variables H_i, global parameters θ with prior α; plate i = 1, …, N]

Page 7:

Popular existing approach: SVI

- Stochastic Variational Inference: iteratively updates the model parameters based on subsampled data batches.
- No estimation of all the local hidden variables of the model.
- No computation of the lower bound.
- Poor fit if a data batch is not representative of the full data.

Page 8:

Our contribution:

- d-VMP: a distributed message passing scheme.
- Defined for a broader class of models (than SVI).
- Better and faster convergence results compared to SVI.
- Posterior over all local latent variables, and a lower bound.

Page 9:

Outline

1 Motivation

2 Variational Message Passing

3 d-VMP

4 Experimental results

5 Conclusions

Page 10:

Models:

- Bayesian learning on iid data using conjugate exponential BN models:

  ln p(X) = ln h_X + s_X · η − A_X(η)

[Plate model: observed X_i, local hidden H_i, global parameters θ with prior α; plate i = 1, …, N]

- We want to calculate p(θ, H | D).
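The conjugate exponential form above can be made concrete with a univariate Gaussian of known variance. This is an illustrative sketch, not code from the paper; the function names are mine, and the cross-check against the standard N(μ, σ²) density shows the two forms agree.

```python
import math

# Exponential family form  ln p(x) = ln h(x) + s(x)·η − A(η)  for N(μ, σ²)
# with known σ²: sufficient statistics s(x) = (x, x²).

def gaussian_natural_params(mu, sigma2):
    # η = (μ/σ², −1/(2σ²))
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def gaussian_log_density(x, mu, sigma2):
    eta1, eta2 = gaussian_natural_params(mu, sigma2)
    log_h = -0.5 * math.log(2.0 * math.pi)                      # ln h(x)
    s_dot_eta = eta1 * x + eta2 * x * x                         # s(x)·η
    log_A = mu * mu / (2.0 * sigma2) + 0.5 * math.log(sigma2)   # A(η)
    return log_h + s_dot_eta - log_A

def gaussian_log_density_direct(x, mu, sigma2):
    # standard form, used only as a sanity check
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2)
```

The conjugacy exploited by VMP rests on exactly this structure: products of such densities only add natural parameters.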

Page 11:

Variational Inference:

I Approximate p(✓,H |D) (often intractable) by finding tractableposterior distributions q 2 Q by minimizing:

minq(✓,H)2Q

KL(q(✓,H)|p(✓,H |D)),

I In the mean field variational approach, Q is assumed to fullyfactorize:

q(✓,H) =MY

k=1

q(✓k)NY

i=1

JY

j=1

q(Hi ,j),

Page 12:

Variational Inference:

- Approximate p(θ, H | D) (often intractable) by finding a tractable posterior distribution q ∈ Q that minimizes:

  min_{q(θ,H) ∈ Q} KL(q(θ, H) ‖ p(θ, H | D))

- In the mean-field variational approach, Q is assumed to fully factorize:

  q(θ, H) = ∏_{k=1}^{M} q(θ_k) ∏_{i=1}^{N} ∏_{j=1}^{J} q(H_{i,j})

Page 13:

Variational Inference:

- Variational Inference exploits the decomposition:

  ln P(D) [constant] = L(q(θ, H)) [maximize] + KL(q(θ, H) ‖ p(θ, H | D)) [minimize]

- Iterative coordinate ascent on the variational distributions.
- The update of a variable's variational distribution only involves the variables in its Markov blanket.
- The coordinate ascent algorithm is formulated as a message passing scheme.
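The decomposition ln P(D) = L(q) + KL(q ‖ p(·|D)) can be verified numerically on a toy two-state model. All numbers below are made up for illustration; the point is that ln P(D) is constant in q, so maximizing the lower bound L is the same as minimizing the KL term.

```python
import math

# Toy model: hidden z ∈ {0, 1}, one observation D.
prior = {0: 0.3, 1: 0.7}        # p(z)
lik = {0: 0.2, 1: 0.9}          # p(D | z)
evidence = sum(prior[z] * lik[z] for z in prior)                 # P(D)
posterior = {z: prior[z] * lik[z] / evidence for z in prior}     # p(z | D)

q = {0: 0.5, 1: 0.5}            # an arbitrary variational distribution

# lower bound  L(q) = E_q[ln p(D, z)] − E_q[ln q(z)]
elbo = sum(q[z] * (math.log(prior[z] * lik[z]) - math.log(q[z])) for z in q)
# KL(q ‖ p(z | D))
kl = sum(q[z] * (math.log(q[z]) - math.log(posterior[z])) for z in q)
# elbo + kl reproduces ln P(D) for any q
```

Since KL ≥ 0, the lower bound L(q) can never exceed ln P(D), which is why it is reported as a convergence diagnostic later in the talk.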

Page 14:

Variational Message Passing, VMP:

- Message from parent to child: moment parameters (expectation of the sufficient statistics).
- Message from child to parent: natural parameters (based on the messages received from the co-parents).
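A minimal sketch of the child-to-parent direction for a toy model x_i ~ N(θ, σ²) with known σ² and a conjugate Gaussian prior on θ. The model, the function names, and the fixed σ² are illustrative assumptions, not the paper's; only the natural-parameter messages to θ are shown.

```python
SIGMA2 = 1.0        # known observation variance (assumed for the example)

def msg_child_to_parent(x):
    # natural parameters one observation contributes to θ: (x/σ², −1/(2σ²))
    return (x / SIGMA2, -1.0 / (2.0 * SIGMA2))

def update_parent(prior_nat, messages):
    # conjugate update: posterior natural params = prior + sum of child messages
    e1 = prior_nat[0] + sum(m[0] for m in messages)
    e2 = prior_nat[1] + sum(m[1] for m in messages)
    return (e1, e2)

def nat_to_moments(eta):
    # back to (mean, variance): σ² = −1/(2η₂), μ = η₁·σ²
    var = -1.0 / (2.0 * eta[1])
    return (eta[0] * var, var)
```

With prior θ ~ N(0, 1) and observations {1.0, 3.0}, this reproduces the textbook posterior N(4/3, 1/3), which is the closed-form conjugate result.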

Page 15:

Outline

1 Motivation

2 Variational Message Passing

3 d-VMP

4 Experimental results

5 Conclusions

Page 16:

Distributed optimization of the lower bound:

[Figure: a master node holding the global parameters θ (prior α); three slave nodes, each with a copy of the model over its own data shard and local posterior q^(t)(H_1), q^(t)(H_2), q^(t)(H_3)]

q^(t)(θ) is broadcast to all the slave nodes.

max_{q ∈ Q} L(q(θ, H))

Page 17:

Distributed optimization of the lower bound:

[Figure: master node with q^(t)(θ) and three slave nodes with local posteriors q^(t)(H_1), q^(t)(H_2), q^(t)(H_3)]

q^(t+1)(H) = argmax_{q(H)} L(q(H), q^(t)(θ))

max_{q ∈ Q} L(q(θ, H))

Page 18:

Distributed optimization of the lower bound:

[Figure: master node with q^(t)(θ) and three slave nodes; each slave n updates its local posterior on its own shard]

q^(t+1)(H_n) = argmax_{q(H_n)} L_n(q(H_n), q^(t)(θ))

max_{q ∈ Q} L(q(θ, H))

Page 19:

Distributed optimization of the lower bound:

[Figure: master node with q^(t)(θ) and three slave nodes with local posteriors q^(t)(H_1), q^(t)(H_2), q^(t)(H_3)]

q^(t+1)(θ) = argmax_{q(θ)} L(q^(t)(H), q(θ))   (the global parameters may be coupled)

max_{q ∈ Q} L(q(θ, H))
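The alternating scheme above can be sketched on a toy hierarchical model x_ij ~ N(H_i, 1), H_i ~ N(θ, 1), θ ~ N(0, 1): the slaves update their local q(H_i) given the broadcast q(θ), and the master then refits q(θ) from the slaves' summaries. The model, the shard layout, and all names are illustrative, not the paper's financial model.

```python
from concurrent.futures import ThreadPoolExecutor

shards = [[1.0, 1.0], [3.0, 3.0]]   # one data shard (one local H_i) per slave

def slave_update(shard, m_theta):
    # mean-field update of q(H_i) given the broadcast mean of q(θ):
    # precision = 1 (link to θ) + J (observations), all variances fixed to 1
    v = 1.0 / (1.0 + len(shard))
    return v * (m_theta + sum(shard))   # posterior mean, sent to the master

m_theta = 0.0                           # prior mean of θ
with ThreadPoolExecutor() as pool:
    for _ in range(100):                # alternate slave and master updates
        local_means = list(pool.map(lambda s: slave_update(s, m_theta), shards))
        # master refit of q(θ): precision = 1 (prior) + N (local variables)
        m_theta = sum(local_means) / (1.0 + len(shards))
```

For this linear-Gaussian toy, the coordinate updates form a contraction, so the loop converges to the exact mean-field fixed point (here m_theta → 8/7).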

Page 20:

Candidate solutions:

- Resort to a generalized mean-field approximation, as SVI does: it does not factorize over the global parameters.
  - Prohibitive for models with a large number of (coupled) global parameters, e.g. linear regression.
- Our proposal: VMP as a distributed projected natural gradient ascent algorithm (PNGA).

Page 21:

d-VMP as a projected natural gradient ascent

- Insight 1: VMP can be expressed as a projected natural gradient ascent algorithm:

  η_X^(t+1) = η_X^(t) + ρ_{X,t} [∇_η L(η^(t))]⁺_X    (1)

- [·]⁺ is the projection operator.
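A sketch of update (1) at a single conjugate node. In one common parameterisation (Hoffman et al.-style natural gradients, assumed here and not necessarily the paper's exact operator), the natural gradient is (η_prior + Σ child messages) − η^(t), so a unit step ρ = 1 lands exactly on the classical VMP coordinate update.

```python
def pnga_step(eta, eta_prior, child_messages, rho):
    # natural gradient at a conjugate node (assumed parameterisation):
    # (prior natural params + sum of child-to-parent messages) − current params
    msg_sum = [sum(coords) for coords in zip(*child_messages)]
    grad = [p + m - e for p, m, e in zip(eta_prior, msg_sum, eta)]
    # gradient step; the projection back onto valid natural parameters
    # is a no-op for this toy example
    return [e + rho * g for e, g in zip(eta, grad)]
```

With ρ = 1 the result is eta_prior plus the message sum, i.e. the conjugate VMP update; smaller ρ interpolates towards it, which is the knob d-VMP uses for coupled parameter blocks.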

Page 22:

d-VMP as a projected natural gradient ascent

- Insight 2: The natural gradient of the lower bound can be expressed as follows:

  ∇_{η_θ} L = m_{Pa(θ)→θ} + Σ_i m_{H_i→θ}

- The gradient can be computed in parallel.
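Because the child-message sum in the expression above splits over data shards, each slave can reduce its own shard and the master only adds one partial result per slave. A sketch with made-up two-dimensional natural-parameter messages:

```python
def reduce_shard(messages):
    # per-slave partial sum of child-to-parent messages
    return [sum(coords) for coords in zip(*messages)]

shard1 = [[1.0, -0.5], [2.0, -0.5]]     # messages m_{H_i→θ} on slave 1
shard2 = [[3.0, -0.5]]                  # messages on slave 2
m_prior = [0.0, -0.5]                   # m_{Pa(θ)→θ}, computed at the master

partials = [reduce_shard(s) for s in (shard1, shard2)]
# master combines: prior message + sum of the per-slave partial sums
grad = [p + sum(col) for p, col in zip(m_prior, zip(*partials))]
```

This map-reduce shape is what lets the gradient be computed with one pass over each shard, with communication proportional to the number of slaves rather than the number of samples.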

Page 23:

d-VMP as a projected natural gradient ascent

- Insight 3: Global parameters are "coupled" only if they belong to each other's Markov blanket.
- Define a disjoint partition of the global parameters:

  R = {J_1, …, J_S}
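The partition R can be obtained by grouping parameters that are transitively coupled. A sketch using breadth-first search over an assumed coupling graph; the parameter names below are made up for illustration.

```python
from collections import defaultdict, deque

def coupled_blocks(params, coupled_pairs):
    # Build an undirected coupling graph: an edge means the two global
    # parameters lie in each other's Markov blanket.
    adj = defaultdict(set)
    for a, b in coupled_pairs:
        adj[a].add(b)
        adj[b].add(a)
    seen, blocks = set(), []
    for p in params:
        if p in seen:
            continue
        block, queue = set(), deque([p])   # BFS for one connected component
        while queue:
            u = queue.popleft()
            if u in seen:
                continue
            seen.add(u)
            block.add(u)
            queue.extend(adj[u] - seen)
        blocks.append(block)
    return blocks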

Page 24:

d-VMP as a projected natural gradient ascent

- d-VMP is based on performing independent global updates over the global parameters of each partition block:

  η_{J_r}^(t+1) = η_{J_r}^(t) + ρ_{r,t} [∇_η L(η^(t))]⁺_{J_r}

- ρ_{r,t} is the learning rate. If |J_r| = 1 then ρ_{r,t} = 1.

Page 25:

d-VMP as a distributed PNGA algorithm:

[Figure: master node (global parameters θ, prior α) combining the slaves' contributions into the global update η^(t+1)_{J_r}; each slave n computes its contribution η^(t+1)_n on its own data shard]

η^(t+1)_{J_r} = η^(t)_{J_r} + ρ_{r,t} [∇_η L(η^(t))]⁺_{J_r}   for all J_r ∈ R

Page 26:

Outline

1 Motivation

2 Variational Message Passing

3 d-VMP

4 Experimental results

5 Conclusions

Page 27:

Model fit to the data

[Figure: plate model with observed X_ij, local hidden H_ij, Y_i and H_i, global parameters θ; plates j = 1, …, J and i = 1, …, N. Density plot over the attribute range comparing SVI 1%, SVI 5%, SVI 10%, and VMP 1%]

- Representative sample of 55K clients (N) and 33 attributes (J).
- "Unrolled" model of more than 3.5M nodes (75% latent variables).

Page 28:

Model fit to the data

Page 29:

Model fit to the data

[Figure: global lower bound vs. time (seconds); legend "Alg. BS(data%)/LR": SVI 1%/0.55, 1%/0.75, 1%/0.99, 5%/0.55, 5%/0.75, 5%/0.99, 10%/0.55, 10%/0.75, 10%/0.99, and d-VMP]

Page 30:

Test marginal log-likelihood

Alg.    BS (% data)   LR     Log-Likel.
SVI     1             0.55   -180902.87
SVI     1             0.75   -298564.03
SVI     1             0.99   -426979.52
SVI     5             0.55   -177302.24
SVI     5             0.75   -333264.16
SVI     5             0.99   -628105.70
SVI     10            0.55   -347035.22
SVI     10            0.75   -397525.45
SVI     10            0.99   -538087.13
d-VMP                 1.0    67265.34

Page 31:

Mixtures of learnt posteriors for one attribute

[Figure: density over the attribute range (−10000 to 20000) for SVI 1%, SVI 5%, SVI 10%, and VMP 1%]

Page 32:

Scalability settings

- Generated data set of 42 million samples per client and 12 variables.
- "Unrolled" model of more than 1 billion (10⁹) nodes (75% latent variables).
- AMIDST Toolbox with Apache Flink.
- Amazon Web Services (AWS) as the distributed computing environment.

Page 33:

Scalability results

Page 34:

Outline

1 Motivation

2 Variational Message Passing

3 d-VMP

4 Experimental results

5 Conclusions

Page 35:

Conclusions

- Variational methods can be scaled using distributed computation instead of sampling techniques.
- Bayesian learning in a model with more than 1 billion nodes (75% of them hidden).

Page 36:

Thank you for your attention

Questions?

You can download our open source Java toolbox: amidsttoolbox.com

Acknowledgments: This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 619209.
