Streaming Variational Bayes - MIT...

40
Streaming Variational Bayes Tamara Broderick, Nick Boyd, Andre Wibisono, Ashia C. Wilson, Michael I. Jordan

Transcript of Streaming Variational Bayes - MIT...

Page 1: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Streaming Variational Bayes

Tamara Broderick, Nick Boyd, Andre Wibisono, Ashia C. Wilson, Michael I. Jordan

Page 2: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Overview• Big Data inference generally non-Bayesian

• Why Bayes? Complex models, coherent treatment of uncertainty, etc.

• We deliver: SDA-Bayes, a framework for Streaming, Distributed, Asynchronous Bayesian inference

• Experiments on streaming topic discovery (Wikipedia: 3.6M docs, Nature: 350K docs)

Page 3: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Overview• Big Data inference generally non-Bayesian

• Why Bayes? Complex models, coherent treatment of uncertainty, etc.

• We deliver: SDA-Bayes, a framework for Streaming, Distributed, Asynchronous Bayesian inference

• Experiments on streaming topic discovery (Wikipedia: 3.6M docs, Nature: 350K docs)

Page 4: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Overview• Big Data inference generally non-Bayesian

• Why Bayes? Complex models, coherent treatment of uncertainty, etc.

• We deliver: SDA-Bayes, a framework for Streaming, Distributed, Asynchronous Bayesian inference

• Experiments on streaming topic discovery (Wikipedia: 3.6M docs, Nature: 350K docs)

Page 5: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Overview• Big Data inference generally non-Bayesian

• Why Bayes? Complex models, coherent treatment of uncertainty, etc.

• We deliver: SDA-Bayes, a framework for Streaming, Distributed, Asynchronous Bayesian inference

• Experiments on streaming topic discovery (Wikipedia: 3.6M docs, Nature: 350K docs)

Page 6: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Overview• Big Data inference generally non-Bayesian

• Why Bayes? Complex models, coherent treatment of uncertainty, etc.

• We deliver: SDA-Bayes, a framework for Streaming, Distributed, Asynchronous Bayesian inference

• Experiments on streaming topic discovery (Wikipedia: 3.6M docs, Nature: 350K docs)

Page 7: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Background• Posterior: belief about unobserved parameters 𝜃

after observing data 𝑥

• Variational Bayes (VB): approximate posterior by solving optimization problem (min KL divergence)

• Batch VB: solves VB using coordinate descent

• Stochastic Variational Inference (SVI): solves VB using stochastic gradient descent

Page 8: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Background• Posterior: belief about unobserved parameters 𝜃

after observing data 𝑥

• Variational Bayes (VB): approximate posterior by solving optimization problem (min KL divergence)

• Batch VB: solves VB using coordinate descent

• Stochastic Variational Inference (SVI): solves VB using stochastic gradient descent

Page 9: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Background• Posterior: belief about unobserved parameters 𝜃

after observing data 𝑥

• Variational Bayes (VB): approximate posterior by solving optimization problem (min KL divergence)

• Batch VB: solves VB using coordinate descent

• Stochastic Variational Inference (SVI): solves VB using stochastic gradient descent

Page 10: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Background• Posterior: belief about unobserved parameters 𝜃

after observing data 𝑥

• Variational Bayes (VB): approximate posterior by solving optimization problem (min KL divergence)

• Batch VB: solves VB using coordinate descent

• Stochastic Variational Inference (SVI): solves VB using stochastic gradient descent

Page 11: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Background• Posterior: belief about unobserved parameters 𝜃

after observing data 𝑥

• Variational Bayes (VB): approximate posterior by solving optimization problem (min KL divergence)

• Batch VB: solves VB using coordinate descent

• Stochastic Variational Inference (SVI): solves VB using stochastic gradient descent

Page 12: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Background• Posterior: belief about unobserved parameters 𝜃

after observing data 𝑥

• Variational Bayes (VB): approximate posterior by solving optimization problem (min KL divergence)

• Batch VB: solves VB using coordinate descent

• Stochastic Variational Inference (SVI): solves VB using stochastic gradient descent

NOT STREAMING

Page 13: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

A

x

p(✓)prior

data

q(✓) ⇡ p(✓ | x) posterior

batch alg

• Posterior update is iterative: !

• Choose any posterior approximation 𝐴:

!

• Iterate approximation if matches prior form:

SDA-Bayes: Streaming

p(✓ | xold

, x

new

) / p(✓ | xold

) · p(xnew

| ✓)

Page 14: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

A

x

p(✓)prior

data

q(✓) ⇡ p(✓ | x) posterior

batch alg

• Posterior update is iterative: !

• Choose any posterior approximation 𝐴:

!

• Iterate approximation if matches prior form:

SDA-Bayes: Streaming

p(✓ | xold

, x

new

) / p(✓ | xold

) · p(xnew

| ✓)

Page 15: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

A

x

p(✓)prior

data

q(✓) ⇡ p(✓ | x) posterior

batch alg

• Posterior update is iterative: !

• Choose any posterior approximation 𝐴:

!

• Iterate approximation if matches prior form:

SDA-Bayes: Streaming

p(✓ | xold

, x

new

) / p(✓ | xold

) · p(xnew

| ✓)

Page 16: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

A

x

p(✓)prior

data

q(✓) ⇡ p(✓ | x) posterior

batch alg

• Posterior update is iterative: !

• Choose any posterior approximation 𝐴:

!

• Iterate approximation if matches prior form:

SDA-Bayes: Streaming

p(✓ | xold

, x

new

) / p(✓ | xold

) · p(xnew

| ✓)

Page 17: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

A

x

p(✓)prior

data

q(✓) ⇡ p(✓ | x) posterior

batch alg

• Posterior update is iterative: !

• Choose any posterior approximation 𝐴:

!

• Iterate approximation if matches prior form:

SDA-Bayes: Streaming

p(✓ | xold

, x

new

) / p(✓ | xold

) · p(xnew

| ✓)

Page 18: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

A

x

p(✓)prior

data

q(✓) ⇡ p(✓ | x) posterior

batch alg

• Posterior update is iterative: !

• Choose any posterior approximation 𝐴:

!

• Iterate approximation if matches prior form:

SDA-Bayes: Streaming

p(✓ | xold

, x

new

) / p(✓ | xold

) · p(xnew

| ✓)

Page 19: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

A

x

p(✓)prior

data

q(✓) ⇡ p(✓ | x) posterior

batch alg

• Posterior update is iterative: !

• Choose any posterior approximation 𝐴:

!

• Iterate approximation if matches prior form:

SDA-Bayes: Streaming

p(✓ | xold

, x

new

) / p(✓ | xold

) · p(xnew

| ✓)

Page 20: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

• Can calculate posteriors in parallel and combine with Bayes’ Rule:

!

!

• Could substitute approximation found by 𝐴 instead

• Update is just addition if prior and approximate posterior are in same exponential family:

SDA-Bayes: Distributed

p(✓ | x1, . . . , xN )

/"

NY

n=1

p(xn | ✓)#

p(✓) /"

NY

n=1

p(✓ | xn) p(✓)�1

#p(✓)

p(✓ | x1, . . . , xN ) ⇡ q(✓) / exp

("⇠0 +

NX

n=1

(⇠n � ⇠0)

#· T (✓)

)

Page 21: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

• Can calculate posteriors in parallel and combine with Bayes’ Rule:

!

!

• Could substitute approximation found by 𝐴 instead

• Update is just addition if prior and approximate posterior are in same exponential family:

SDA-Bayes: Distributed

p(✓ | x1, . . . , xN )

/"

NY

n=1

p(xn | ✓)#

p(✓) /"

NY

n=1

p(✓ | xn) p(✓)�1

#p(✓)

p(✓ | x1, . . . , xN ) ⇡ q(✓) / exp

("⇠0 +

NX

n=1

(⇠n � ⇠0)

#· T (✓)

)

Page 22: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

• Can calculate posteriors in parallel and combine with Bayes’ Rule:

!

!

• Could substitute approximation found by 𝐴 instead

• Update is just addition if prior and approximate posterior are in same exponential family:

SDA-Bayes: Distributed

p(✓ | x1, . . . , xN )

/"

NY

n=1

p(xn | ✓)#

p(✓) /"

NY

n=1

p(✓ | xn) p(✓)�1

#p(✓)

p(✓ | x1, . . . , xN ) ⇡ q(✓) / exp

("⇠0 +

NX

n=1

(⇠n � ⇠0)

#· T (✓)

)

Page 23: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

• Can calculate posteriors in parallel and combine with Bayes’ Rule:

!

!

• Could substitute approximation found by 𝐴 instead

• Update is just addition if prior and approximate posterior are in same exponential family:

SDA-Bayes: Distributed

p(✓ | x1, . . . , xN )

/"

NY

n=1

p(xn | ✓)#

p(✓) /"

NY

n=1

p(✓ | xn) p(✓)�1

#p(✓)

p(✓ | x1, . . . , xN ) ⇡ q(✓) / exp

("⇠0 +

NX

n=1

(⇠n � ⇠0)

#· T (✓)

)

Page 24: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

• Can calculate posteriors in parallel and combine with Bayes’ Rule:

!

!

• Could substitute approximation found by 𝐴 instead

• Update is just addition if prior and approximate posterior are in same exponential family:

SDA-Bayes: Distributed

p(✓ | x1, . . . , xN )

/"

NY

n=1

p(xn | ✓)#

p(✓) /"

NY

n=1

p(✓ | xn) p(✓)�1

#p(✓)

p(✓ | x1, . . . , xN ) ⇡ q(✓) / exp

("⇠0 +

NX

n=1

(⇠n � ⇠0)

#· T (✓)

)

Page 25: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

• Can calculate posteriors in parallel and combine with Bayes’ Rule:

!

!

• Could substitute approximation found by 𝐴 instead

• Update is just addition if prior and approximate posterior are in same exponential family:

SDA-Bayes: Distributed

p(✓ | x1, . . . , xN )

/"

NY

n=1

p(xn | ✓)#

p(✓) /"

NY

n=1

p(✓ | xn) p(✓)�1

#p(✓)

p(✓ | x1, . . . , xN ) ⇡ q(✓) / exp

("⇠0 +

NX

n=1

(⇠n � ⇠0)

#· T (✓)

)

Page 26: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

• Can calculate posteriors in parallel and combine with Bayes’ Rule:

!

!

• Could substitute approximation found by 𝐴 instead

• Update is just addition if prior and approximate posterior are in same exponential family:

SDA-Bayes: Distributed

p(✓ | x1, . . . , xN )

/"

NY

n=1

p(xn | ✓)#

p(✓) /"

NY

n=1

p(✓ | xn) p(✓)�1

#p(✓)

p(✓ | x1, . . . , xN ) ⇡ q(✓) / exp

("⇠0 +

NX

n=1

(⇠n � ⇠0)

#· T (✓)

)

Page 27: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

• Each worker iterates:

!

!

!

• Each time the master receives from a worker, it updates synchronously:

SDA-Bayes: Asynchronous

1. Collect a new data point x.

2. Copy the master posterior parameter locally: ⇠

(local) ⇠

(post)

3. Compute the local approximate posterior parameter ⇠ usingA with ⇠

(local)

as the prior parameter

4. Return �⇠ := ⇠ � ⇠

(local)

�⇠

⇠(post) ⇠(post) +�⇠

Page 28: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Case Study: LDA

• Topic: theme potentially shared by multiple documents

• Latent Dirichlet Allocation (LDA): a topic model

• (Unsupervised) inference problem: discover the topics and identify which topics occur in which documents

Page 29: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Case Study: LDA

• Topic: theme potentially shared by multiple documents

• Latent Dirichlet Allocation (LDA): a topic model

• (Unsupervised) inference problem: discover the topics and identify which topics occur in which documents

Page 30: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Case Study: LDA

• Topic: theme potentially shared by multiple documents

• Latent Dirichlet Allocation (LDA): a topic model

• (Unsupervised) inference problem: discover the topics and identify which topics occur in which documents

Page 31: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Case Study: LDA

• Topic: theme potentially shared by multiple documents

• Latent Dirichlet Allocation (LDA): a topic model

• (Unsupervised) inference problem: discover the topics and identify which topics occur in which documents

Page 32: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Experimental Setup• SDA-Bayes with batch VB for 𝐴 vs. SVI (not

designed for streaming)

• Training: 3.6M Wikipedia, 350K Nature

• Testing: 10K Wikipedia, 1K Nature

• Performance measure: log predictive probability on held-out words in held-out testing documents; higher is better

Page 33: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Experimental Setup• SDA-Bayes with batch VB for 𝐴 vs. SVI (not

designed for streaming)

• Training: 3.6M Wikipedia, 350K Nature

• Testing: 10K Wikipedia, 1K Nature

• Performance measure: log predictive probability on held-out words in held-out testing documents; higher is better

Page 34: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Experimental Setup• SDA-Bayes with batch VB for 𝐴 vs. SVI (not

designed for streaming)

• Training: 3.6M Wikipedia, 350K Nature

• Testing: 10K Wikipedia, 1K Nature

• Performance measure: log predictive probability on held-out words in held-out testing documents; higher is better

Page 35: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Experimental Setup• SDA-Bayes with batch VB for 𝐴 vs. SVI (not

designed for streaming)

• Training: 3.6M Wikipedia, 350K Nature

• Testing: 10K Wikipedia, 1K Nature

• Performance measure: log predictive probability on held-out words in held-out testing documents; higher is better

Page 36: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Results• SDA-Bayes (streaming) as good as SVI (not

streaming); 32 threads and 1 thread shown

Page 37: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Results• More threads in SDA improves runtime and

performance

Page 38: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Results• More threads in SDA improves runtime and

performance

Page 39: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Results• SVI is sensitive to the pre-specified number of

documents D

Page 40: Streaming Variational Bayes - MIT CSAILpeople.csail.mit.edu/tbroderick/files/broderick_2014_streaming_variational.pdfOverview • Big Data inference generally non-Bayesian • Why

Further information• Streaming, distributed Bayesian learning without

performance loss

• Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. Streaming variational Bayes. NIPS 2013

!

!

• Code and slides at www.tamarabroderick.com