MCMC algorithms for sampling from
multimodal and changing distributions
Holden Lee
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Mathematics
Adviser: Sanjeev Arora
June 2019
© Copyright by Holden Lee, 2019.
All Rights Reserved
Abstract
The problem of sampling from a probability distribution is a fundamental problem in
Bayesian statistics and machine learning, with applications throughout the sciences.
One common algorithmic framework for solving this problem is Markov Chain Monte
Carlo (MCMC). However, a large gap exists between simple settings where MCMC
has been proven to work, and complex settings arising in practice. In this thesis, I
make progress towards closing this gap, focusing on two hurdles in particular.
In Chapter 2, I consider the problem of sampling from multimodal distributions.
Many distributions arising in practice, from simple mixture models to deep genera-
tive models, are multimodal, so any Markov chain which makes local moves will get
stuck in one mode. Although a variety of temperature heuristics are used to address
this problem, their theoretical guarantees are not well-understood even for simple
multimodal distributions. I analyze an algorithm combining Langevin diffusion with
simulated tempering, a heuristic which speeds up mixing by transitioning between
different temperatures of the distribution. I develop a general method to prove mix-
ing time using “soft decompositions” of Markov processes, and use it to prove rapid
mixing for (polynomial) mixtures of log-concave distributions.
In Chapter 3, I address the problem of sampling from the distributions 𝑝𝑡(𝑥) ∝ 𝑒^{−∑_{𝑘=0}^{𝑡} 𝑓𝑘(𝑥)}
for each epoch 1 ≤ 𝑡 ≤ 𝑇 in an online manner, given a sequence of (convex)
functions 𝑓0, . . . , 𝑓𝑇 . This problem arises in large-scale Bayesian inference (for
instance, online logistic regression) where instead of obtaining all the observations at
once, one constantly acquires new data, and must continuously update the distribu-
tion. All previous results for this problem imply a bound on the number of gradient
evaluations at each epoch 𝑡 that grows at least linearly in 𝑡. For this problem, I
show that a certain variance-reduced SGLD (stochastic gradient Langevin dynamics)
algorithm solves the online sampling problem to fixed TV-error 𝜀 using an almost
constant number of gradient evaluations per epoch.
Acknowledgements
I would like to thank my adviser Sanjeev Arora for supporting me throughout my
Ph.D., pointing me to relevant problems in the field, and helping with presentations;
Allan Sly for being a reader; and Weinan E and Ramon van Handel for serving on
my committee. Thanks also to Zeev Dvir for support during the first two years.
I would like to thank all my co-authors: Sanjeev Arora, Rong Ge, Elad Hazan,
Tengyu Ma, Oren Mangoubi, Andrej Risteski, Karan Singh, Nisheeth Vishnoi, Cyril
Zhang, and Yi Zhang. I’ve had many engaging discussions with them, with other
members of the research group including Orestis Plevrakis, Nikunj Saunshi, and Kiran
Vodrahalli, and with the “Machine Learning Rant Group.” I would especially like to
thank Andrej and Cyril for their dedication to our research projects, and for always
being enthusiastic about sharing ideas. Andrej was always ready to give advice as
well as feedback on drafts and presentations. Cyril made sure all our collaborators
were on the same page, and even when our proofs were falling apart, kept calm and
carried on.
Thanks to Jill LeClair and Mitra Kelly for administrative support throughout my
Ph.D.
I thank 2D for being a warm, loving community and for keeping me well-fed for the
past four years, and Arch & Arrow and Graduate Improv for providing a much-needed
creative outlet and community.
I would like to dedicate this thesis to my father, T.Y. Lee, for getting me started
on my mathematical journey, and teaching me that good character is more important
than achievement. Finally I thank my mother, Ching-An Lee, for her continual love
and care.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
1 Introduction: MCMC algorithms 1
1.1 The problem of sampling . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Applications of sampling . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Bayesian modeling . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Theoretical computer science . . . . . . . . . . . . . . . . . . 9
1.3 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 New results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Sampling from multimodal distributions . . . . . . . . . . . . 11
1.4.2 Sampling from changing distributions . . . . . . . . . . . . . . 12
2 Sampling from multimodal distributions using simulated tempering 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Our results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Overview of algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Langevin dynamics . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Simulated tempering . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 Main algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Overview of the proof techniques . . . . . . . . . . . . . . . . . . . . 22
2.4 Theorem statements . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Simulated tempering . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Markov process decomposition theorems . . . . . . . . . . . . . . . . 31
2.6.1 Simple density decomposition theorem . . . . . . . . . . . . . 33
2.6.2 General density decomposition theorem . . . . . . . . . . . . . 36
2.6.3 Theorem for simulated tempering . . . . . . . . . . . . . . . . 42
2.7 Simulated tempering for gaussians with equal variance . . . . . . . . 48
2.7.1 Mixtures of gaussians all the way down . . . . . . . . . . . . . 48
2.7.2 Comparing to the actual chain . . . . . . . . . . . . . . . . . . 53
2.8 Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.9 Proof of main theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3 Online sampling from log-concave distributions 69
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2 Our results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.2 Result in the online setting . . . . . . . . . . . . . . . . . . . . 75
3.2.3 Result in the offline setting . . . . . . . . . . . . . . . . . . . . 77
3.2.4 Application to Bayesian logistic regression . . . . . . . . . . . 79
3.3 Algorithm and proof techniques . . . . . . . . . . . . . . . . . . . . . 81
3.3.1 Overview of online algorithm . . . . . . . . . . . . . . . . . . 81
3.3.2 Overview of offline algorithm . . . . . . . . . . . . . . . . . . 83
3.4 Proof overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4.1 Online problem . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4.2 Offline problem . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.5 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.6 Proof of online theorem (Theorem 3.2.4) . . . . . . . . . . . . . . . . 91
3.6.1 Bounding the variance of the stochastic gradient . . . . . . . . 92
3.6.2 Bounding the escape time from a ball . . . . . . . . . . . . . . 93
3.6.3 Bounding the TV error . . . . . . . . . . . . . . . . . . . . . . 96
3.6.4 Setting the constants; Proof of main theorem . . . . . . . . . . 101
3.7 Proof of offline theorem (Theorem 3.2.6) . . . . . . . . . . . . . . . . 106
3.8 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.9 Discussion and future work . . . . . . . . . . . . . . . . . . . . . . . . 112
A Background on Markov chains and processes 129
A.1 Markov chains and processes . . . . . . . . . . . . . . . . . . . . . . . 129
A.2 Langevin diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
B Appendix for Chapter 2 135
B.1 General log-concave densities . . . . . . . . . . . . . . . . . . . . . . 135
B.1.1 Simulated tempering for log-concave densities . . . . . . . . . 135
B.1.2 Proof of main theorem for log-concave densities . . . . . . . . 137
B.2 Perturbation tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . 140
B.2.1 Simulated tempering for distribution with perturbation . . . . 140
B.2.2 Proof of main theorem with perturbations . . . . . . . . . . . 140
B.3 Continuous version of decomposition theorem . . . . . . . . . . . . . 144
B.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
B.5 Lower bound when Gaussians have different variance . . . . . . . . . 149
B.5.1 Construction of 𝑔 and closeness of two functions . . . . . . . . 153
C Appendix for Chapter 3 157
C.1 Proof for logistic regression application . . . . . . . . . . . . . . . . . 157
C.1.1 Theorem for general posterior sampling, and application to logistic regression . . . . . . 157
C.1.2 Proof of Theorem C.1.1 . . . . . . . . . . . . . . . . . . . . . 160
C.1.3 Online logistic regression: Proof of Lemma C.1.2 and Theorem 3.2.2 . . . . . . 164
D Calculations on probability distributions 169
D.1 Chi-squared and KL inequalities . . . . . . . . . . . . . . . . . . . . . 169
D.2 Chi-squared divergence calculations for log-concave distributions . . . 173
D.3 A probability ratio calculation . . . . . . . . . . . . . . . . . . . . . . 181
D.4 Other facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Chapter 1
Introduction: MCMC algorithms
1.1 The problem of sampling
In this thesis, we consider the question of sampling from a probability distribution
𝑃 whose density function is specified up to a partition function (normalizing con-
stant) [LV06],
𝑝(𝑥) = 𝑒^{−𝑓(𝑥)} / ∫_Ω 𝑒^{−𝑓(𝑥)} 𝑑𝑥. (1.1)
The domain Ω could be a continuous domain, such as R𝑑 or a subset of R𝑑, or a discrete
domain such as the boolean cube {0, 1}^𝑑. We are interested in designing efficient
algorithms for the high-dimensional setting, i.e., algorithms that scale polynomially
in the dimension 𝑑.
Problem 1.1.1 (Sampling). Let Ω = R𝑑. Given query access to 𝑓(𝑥) and ∇𝑓(𝑥),
sample from a distribution 𝑃̃ that is 𝜀-close (in TV-distance or another distance) to
the distribution 𝑃 with density function (1.1).
For general functions 𝑓 , the problem is intractable (#P-hard), so we will need
more assumptions about the structure of 𝑓 to provide provable guarantees.
As we describe in Section 1.2, sampling is a fundamental problem in statistics,
machine learning, and theoretical computer science. It also has applications to simu-
lation of physical systems. The main approach is Markov Chain Monte Carlo [HH13;
Liu08; Bro+11; CSI12], which we describe in Section 1.3.
However, guarantees for MCMC do not cover many practical problems of interest.
We describe our progress in theoretical guarantees for MCMC methods in Section 1.4.
1.2 Applications of sampling
1.2.1 Bayesian modeling
In Bayesian statistics and machine learning, one starts by assuming that observations
are generated by a fixed probabilistic model with unknown parameters 𝜃. Below, we
describe several key tasks in this framework.
The problem of learning the parameters is to find the posterior distribution
of the parameters 𝜃, given the observations. One assumes a prior distribution on
the parameters, 𝑝(𝜃), and fixes a probabilistic model of how the observed random
variables 𝑌 are generated from 𝜃 (and perhaps some other information 𝑋), 𝑝(·|𝑥, 𝜃).
By Bayes’s Rule, the posterior distribution of the parameters 𝜃 is
𝑝(𝜃|𝑦, 𝑥) = 𝑝(𝑦|𝑥, 𝜃)𝑝(𝜃) / 𝑝(𝑦) ∝ 𝑝(𝑦|𝑥, 𝜃)𝑝(𝜃). (1.2)
The problem of inferring the latent variables arises in latent variable models.
A latent variable model is a model for observations 𝑌 that is simple when it is
conditioned on the parameter 𝜃 and some latent (hidden) variable 𝐻 that is not
observed, given by 𝑝𝜃(·|ℎ). Here, 𝐻 is a random variable whose distribution depends
in a known way on 𝜃. The goal is to “infer” the probability distribution on the hidden
variable 𝐻 from the observations 𝑌 . Again by Bayes’s Rule, if 𝜃 is known,
𝑝𝜃(ℎ|𝑥) = 𝑝𝜃(ℎ)𝑝𝜃(𝑥|ℎ) / 𝑝𝜃(𝑥) ∝ 𝑝𝜃(ℎ)𝑝𝜃(𝑥|ℎ). (1.3)
Although the numerator is easy to evaluate, the denominator 𝑝𝜃(𝑥) =∫𝑝𝜃(ℎ)𝑝𝜃(𝑥|ℎ) 𝑑ℎ
can be NP-hard to approximate even for simple models like topic models [SR11].
A probability distribution may be difficult to work with even in a fully observed
model that does not involve latent variables. Even if 𝜃 is known, if 𝑝𝜃(·) is only given
up to a normalization constant (similar to the situation in both tasks above), to
obtain a sample from this distribution, we must implicitly solve a counting problem,
which can also be intractable in general.
In each of the tasks, one desires to understand a certain probability distribution.
However, in general this distribution has no succinct description that allows us to
extract useful information for downstream tasks. We may want to calculate marginals,
or calculate E𝜃∼𝑃 𝑔(𝜃) for some function 𝑔; this includes the mean, variance, and
proportion within a given set. There are two main approaches to understanding the
probability distribution; these comprise the main approaches to Bayesian modeling.
One method is Markov Chain Monte Carlo (MCMC). The idea of MCMC is to design
a Markov chain which has the desired distribution as the stationary distribution. By
running the Markov chain, one obtains samples from the probability distribution;
these samples can then be used to estimate desired quantities like E𝜃∼𝑃 𝑔(𝜃). Another
approach is variational inference [W+08], which seeks to approximate the distribution
with a distribution from a family of distributions that is easier to optimize over, such
as product distributions. We will focus on MCMC in this work.
We now give examples of probabilistic models where MCMC can be used.
Logistic regression: Logistic regression is a fundamental and widely used model
in Bayesian statistics [AC93]. In Bayesian logistic regression, one models the data
(𝑢𝑡 ∈ R𝑑, 𝑦𝑡 ∈ {−1, 1}) as follows: there is some unknown 𝜃0 ∈ R𝑑 such that given
𝑢𝑡 (the independent variable), for all 𝑡 ∈ {1, . . . , 𝑇} the dependent variable 𝑦𝑡 follows
a Bernoulli distribution with “success” probability 𝜑(𝑢𝑡^⊤𝜃) (𝑦𝑡 = 1 with probability
𝜑(𝑢𝑡^⊤𝜃) and −1 otherwise), where 𝜑(𝑥) = 1/(1 + 𝑒^{−𝑥}) is the logistic function:
𝜃 ∼ 𝑃
𝑦𝑡 ∼ Bernoulli(𝜑(𝑢𝑡^⊤𝜃)) ∀ 1 ≤ 𝑡 ≤ 𝑇.
Given the prior distribution 𝑝(𝜃), the posterior distribution is given by
𝑝(𝜃|(𝑢𝑡, 𝑦𝑡)_{𝑡=1}^{𝑇}) ∝ 𝑝(𝜃) 𝑝((𝑦𝑡)_{𝑡=1}^{𝑇}|(𝑢𝑡)_{𝑡=1}^{𝑇}, 𝜃) (1.4)
= 𝑝(𝜃) ∏_{𝑡=1}^{𝑇} 𝜑(𝑦𝑡𝑢𝑡^⊤𝜃) = 𝑝(𝜃) ∏_{𝑡=1}^{𝑇} 1/(1 + 𝑒^{−𝑦𝑡𝑢𝑡^⊤𝜃}). (1.5)
Hence this problem fits in the framework of Problem 1.1.1 with 𝑓(𝜃) = − ln 𝑝(𝜃) + ∑_{𝑡=1}^{𝑇} ln(1 + 𝑒^{−𝑦𝑡𝑢𝑡^⊤𝜃}).
Here, 𝑢𝑡 is a “feature vector,” and the parameter 𝜃 specifies the effects that the
different features have on predicting 𝑦𝑡. More generally, other functions can be used
for the link function 𝜑. Once 𝑝(𝜃|(𝑢𝑡, 𝑦𝑡)_{𝑡=1}^{𝑇}) is estimated, one can then estimate the
response probability for a new data point: 𝑝(𝑦|𝑢, (𝑢𝑡, 𝑦𝑡)_{𝑡=1}^{𝑇}) = ∫ 𝜑(𝑦𝑢^⊤𝜃) 𝑝(𝜃|(𝑢𝑡, 𝑦𝑡)_{𝑡=1}^{𝑇}) 𝑑𝜃.
While logistic regression is a simple model, Bayesian inference is still a practical chal-
lenge for large or streaming datasets [HCB16]. We will apply our online sampling
algorithm to this problem in Chapter 3.
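To make the gradient-query model concrete, the potential 𝑓(𝜃) and ∇𝑓(𝜃) for the logistic posterior can be evaluated directly. A minimal sketch in Python (the standard Gaussian prior 𝑝(𝜃) ∝ exp(−‖𝜃‖²/2), the function names, and the toy data are illustrative assumptions, not from the text):

```python
import numpy as np

def logistic_potential(theta, U, y):
    """f(theta) = -ln p(theta) + sum_t ln(1 + exp(-y_t u_t^T theta)),
    with a standard Gaussian prior p(theta) ∝ exp(-||theta||^2 / 2)."""
    margins = y * (U @ theta)                      # y_t * u_t^T theta
    # ln(1 + e^{-m}) computed stably as logaddexp(0, -m)
    return 0.5 * theta @ theta + np.sum(np.logaddexp(0.0, -margins))

def logistic_potential_grad(theta, U, y):
    """Gradient: theta - sum_t y_t * phi(-y_t u_t^T theta) * u_t."""
    margins = y * (U @ theta)
    return theta - U.T @ (y / (1.0 + np.exp(margins)))

# Toy data: 5 observations with 2-d features and labels in {-1, +1}.
rng = np.random.default_rng(0)
U = rng.standard_normal((5, 2))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
print(logistic_potential(np.zeros(2), U, y))       # = 5 ln 2 ≈ 3.466
```

The log-sum-exp form `np.logaddexp(0, -m)` avoids overflow when the margins are large and negative.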
Latent Dirichlet allocation: Latent Dirichlet allocation (LDA) [BNJ03] is per-
haps the most common probabilistic model for collections of documents. It falls into
the general framework of topic modeling [SS09], a latent variable model where documents
are generated as follows: there is a collection of topics (which are unobserved),
and each topic is associated with a probability distribution over words. A document
is a mixture of topics, and to generate a word from the document, a topic in the mix-
ture is drawn, and a word is drawn from the word distribution of the topic. In LDA,
the relative frequencies of words and topics is a draw from a Dirichlet distribution.
Formally, let 𝐾 be the number of topics, 𝑉 be the number of words in the vocab-
ulary. For hyperparameters 𝛼 and 𝛽, LDA is given by
word frequencies for topics 𝜙𝑘 ∼ Dirichlet_𝑉(𝛽) ∀ 1 ≤ 𝑘 ≤ 𝐾
topic frequencies for documents 𝜃𝑑 ∼ Dirichlet_𝐾(𝛼) ∀ 1 ≤ 𝑑 ≤ 𝑀
topics for words in document 𝑧_{𝑑,𝑖} ∼ Multinomial_𝐾(𝜃𝑑) ∀ 1 ≤ 𝑑 ≤ 𝑀, 1 ≤ 𝑖 ≤ 𝑁𝑑
words in document 𝑤_{𝑑,𝑖} ∼ Multinomial_𝑉(𝜙_{𝑧_{𝑑,𝑖}}) ∀ 1 ≤ 𝑑 ≤ 𝑀, 1 ≤ 𝑖 ≤ 𝑁𝑑.
Note that LDA is an unsupervised learning problem; an algorithm for LDA allows us
to learn the parameters and infer the topics from the documents without a labeled
dataset. It is useful for document classification and information retrieval, and also
challenging in the setting of large, streaming data.
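For concreteness, the LDA generative process above can be sampled directly; a sketch under small illustrative sizes (the function name and hyperparameter values are assumptions):

```python
import numpy as np

def lda_generate(M, K, V, alpha, beta, doc_lengths, rng):
    """Sample a small corpus from the LDA generative process above."""
    phi = rng.dirichlet(np.full(V, beta), size=K)     # word dist. per topic
    theta = rng.dirichlet(np.full(K, alpha), size=M)  # topic dist. per doc
    docs = []
    for d in range(M):
        z = rng.choice(K, size=doc_lengths[d], p=theta[d])   # topic per word
        docs.append(np.array([rng.choice(V, p=phi[k]) for k in z]))
    return phi, theta, docs

rng = np.random.default_rng(0)
phi, theta, docs = lda_generate(M=3, K=2, V=10, alpha=0.5, beta=0.1,
                                doc_lengths=[8, 5, 6], rng=rng)
```

Inference runs this process in reverse: only `docs` is observed, and 𝜙, 𝜃, and the topic assignments must be recovered.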
Dirichlet process mixture model: A Dirichlet process mixture model [Nea00]
is the most commonly used probabilistic model for mixtures of distributions. It is a
latent variable model: observations are generated from one distribution in a mixture,
but which mixture is not revealed. It is defined as follows.
Let 𝐹 (𝜃) be a family of probability distributions indexed by 𝜃 ∈ Θ. Let 𝐺0 be the
prior distribution on 𝜃. The Dirichlet process mixture model on 𝐹 with 𝑘 components,
parameter 𝛼, and prior 𝐺0 is as follows:
mixture components 𝜃1, . . . , 𝜃𝑘 ∼ 𝐺0
mixture proportions 𝑝 ∼ Dirichlet𝑘(𝛼)
class assignments 𝑐𝑖 ∼ Multinomial(𝑝) ∀1 ≤ 𝑖 ≤ 𝑁
observations 𝑦𝑖 ∼ 𝐹 (𝜃𝑐𝑖) ∀1 ≤ 𝑖 ≤ 𝑁.
For example, in the case of Gaussian mixtures, 𝜃 = (𝜇, Σ), 𝐹(𝜇, Σ) = 𝑁(𝜇, Σ), and
one choice for 𝐺0 is 𝜇 ∼ 𝑁(0, 𝜎_0^2 𝐼) and Σ^{−1} ∼ Gamma(1, 𝜎_0^{−2}). In both LDA and the
Dirichlet mixture model, the difficulty comes from the fact that the latent variables
(topics/classes) are not observed, and hence to find what settings of 𝜃 are likely, one
must integrate or sum over all possible topic or class assignments.
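The finite-𝑘 generative process above, instantiated for Gaussian mixtures with Σ = 𝐼, can be sketched as follows (the function name and parameter choices are illustrative):

```python
import numpy as np

def dirichlet_mixture_generate(N, k, alpha, sigma0, d, rng):
    """Sample from the finite Dirichlet mixture of Gaussians above, with
    F(mu) = N(mu, I) and prior G0 = N(0, sigma0^2 I) on the component means."""
    mus = rng.normal(0.0, sigma0, size=(k, d))     # mixture components
    p = rng.dirichlet(np.full(k, alpha))           # mixture proportions
    c = rng.choice(k, size=N, p=p)                 # class assignments
    y = mus[c] + rng.standard_normal((N, d))       # observations
    return mus, p, c, y

rng = np.random.default_rng(0)
mus, p, c, y = dirichlet_mixture_generate(N=50, k=3, alpha=1.0,
                                          sigma0=5.0, d=2, rng=rng)
```

The inference problem observes only `y`; the assignments `c` are the latent variables that must be summed over.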
The posterior distribution is naturally multimodal, even in very simple cases. For
example, fixing the proportions 𝑝 = (1/𝑘, . . . , 1/𝑘) and fixing the variances to be Σ = 𝐼_𝑑,
𝜎_0^2 = 1, the posterior after seeing 𝑦1, . . . , 𝑦𝑁 is
∝ ∏_{𝑛=1}^{𝑁} ( ∑_{𝑗=1}^{𝑘} (1/𝑘) exp(−‖𝑦𝑛 − 𝜇𝑗‖²/2) ),
which when expanded, is seen to be a mixture of exponentially many gaussians. The
naive Markov chain can be torpidly mixing [TD14]. We make progress towards sam-
pling from such multimodal distributions in Chapter 2.
Markov random fields: A Markov random field is a probabilistic model for a
collection of variables whose dependency structure is described as a graph. The state
space is Ω = 𝑆𝑉 , where 𝑆 is the possible states for each variable, and 𝑉 is the set of
vertices of the graph 𝐺. The probability distribution is given by
𝑝(𝑥) ∝ exp( ∑_{𝑐∈Cl(𝐺)} 𝜑𝑐(𝑥𝑐) ) (1.6)
where Cl(𝐺) is the set of cliques of 𝐺. The Ising model is the special case where
the possible states are 𝑆 = {−1, 1}, all interactions are pairwise, and depend on just
whether the two variables are the same:
𝑝(𝑥) ∝ exp( ∑_{(𝑖,𝑗)∈𝐸(𝐺)} 𝐽_{𝑖,𝑗} 𝑥𝑖𝑥𝑗 ). (1.7)
Note that when the interaction strengths 𝐽𝑖,𝑗 (or 𝜑𝑐) are known, there is no hidden
variable; yet obtaining basic information about the distribution (such as calculating
the marginals, or obtaining a sample) can be computationally difficult. These tasks
are roughly equivalent to the problem of computing the partition function, which is
#P-hard in general.
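To illustrate, the unnormalized Ising log-density (1.7) and one step of Glauber (heat-bath) dynamics, a standard Markov chain for this model, can be sketched as follows (the names and edge-list representation are illustrative choices):

```python
import numpy as np

def ising_energy(x, J, edges):
    """The exponent sum_{(i,j) in E} J_ij x_i x_j of the Ising density (1.7)."""
    return sum(J[i, j] * x[i] * x[j] for i, j in edges)

def glauber_step(x, J, edges, rng):
    """One Glauber (heat-bath) update: resample a uniformly random spin x_i
    from its conditional distribution given the rest; stationary for (1.7)."""
    i = rng.integers(len(x))
    # local field at i: sum of J_ab * x_other over edges touching i
    h = sum(J[a, b] * x[b] if a == i else (J[a, b] * x[a] if b == i else 0.0)
            for a, b in edges)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * h))  # P(x_i = +1 | rest)
    x = x.copy()
    x[i] = 1 if rng.random() < p_plus else -1
    return x

# Two spins with a single ferromagnetic edge of strength J_01 = 1.
J = np.zeros((2, 2)); J[0, 1] = 1.0
edges = [(0, 1)]
x = np.array([1, -1])
```

Each update touches one spin, so the chain makes only local moves; this is exactly the kind of chain that can mix torpidly at low temperature.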
The Ising model was introduced as a statistical physics model for magnets; how-
ever, it has since then found many other applications. For example, the Ising model
can be used to define Bayesian neural networks, such as the deep Boltzmann machine
(DBM). In this setting, the nodes are divided into layers and 𝐺 has edges between
adjacent layers; the coefficients are unknown and are to be learned from data.
Finally, we note that one interesting line of research is to automate, as much as
possible, Bayesian inference in the same way that automatic differentiation has au-
tomated optimization. The field of probabilistic programming [FRT14] aims to
combine algorithm and programming language design to make this possible. However,
in terms of theoretical guarantees for convergence, the building blocks of Bayesian in-
ference (like MCMC) are less well-understood than the building block of optimization
algorithms (gradient descent and its variants).
1.2.2 Optimization
Machine learning approaches can roughly be classified into a “probabilistic modeling”
(Bayesian) approach and “optimization” approach, with plenty of overlap; this par-
allels the Bayesian and frequentist approaches to statistics. The Bayesian approach
produces probability distributions as answers—for the parameters of the model and
for the predictions—and the hope is that it accurately captures the uncertainty and
the spread in the problem; it often relies on designing the probabilistic model in such
a way such that the parameters have explanatory power. The optimization approach,
on the other hand, will choose one setting of parameters, and one function giving
the predictions—the one that minimizes the loss function. Feedforward neural net-
works are a hallmark success of this approach. Training these models is often more
efficient and straightforward, but broadly speaking, the results are thought to be less
interpretable than those obtained using probabilistic models.
However, there are many connections between optimization and sampling. Many
optimization algorithms rely implicitly on some kind of sampling algorithm, or include
techniques that are inspired by sampling algorithms.
The primary connection between optimization and sampling is the following: if
we sample from the probability distribution 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥), we are more likely to
obtain samples 𝑥 where 𝑓(𝑥) is small. We can take this to the extreme: In the limit
as the inverse temperature 𝛽 → ∞, the distribution 𝑝𝛽(𝑥) ∝ 𝑒^{−𝛽𝑓(𝑥)} is peaked at
exactly the global minima. In this sense, optimization is “just” a sampling problem.
In reality, however, we often cannot take 𝛽 → ∞, so there is a tradeoff between the
ability to sample and the quality of the solution.
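This tradeoff can be seen numerically: on a discretized double-well potential, the mass of 𝑝𝛽 near the global minimum grows with 𝛽 (a toy illustration; the potential, grid, and function name are arbitrary choices, not from the text):

```python
import numpy as np

def mass_near_min(f, beta, grid, radius):
    """Fraction of the mass of p_beta(x) ∝ exp(-beta f(x)) lying within
    `radius` of the global minimizer, on a discretized 1-d grid."""
    v = f(grid)
    w = np.exp(-beta * (v - v.min()))        # shift by min for stability
    w /= w.sum()
    x_star = grid[np.argmin(v)]
    return w[np.abs(grid - x_star) <= radius].sum()

f = lambda x: (x**2 - 1.0)**2 + 0.3 * x      # double well; global min near -1
grid = np.linspace(-2.5, 2.5, 4001)
m1 = mass_near_min(f, 1.0, grid, 0.3)
m10 = mass_near_min(f, 10.0, grid, 0.3)      # m10 > m1: mass concentrates
```

At 𝛽 = 1 both wells carry comparable mass; at 𝛽 = 10 nearly all the mass sits in the global well, but such a peaked distribution is also harder to sample.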
Sampling algorithms have been used in nonconvex optimization to escape local
minima. The most well-studied such algorithm is Langevin Monte Carlo, which is
essentially gradient descent plus noise. Many researchers have heuristically explained
the success (or limitations) of stochastic gradient descent (SGD) using Langevin dy-
namics [WT11; MHB16]; Langevin dynamics has also been rigorously analyzed for
non-convex optimization [ZLC17]. Langevin dynamics has also inspired temperature-
based methods such as entropy-SGD for optimizing neural networks [Cha+16; YZM17].
Other algorithms are not explicitly based on LMC, but also add noise in judicious
ways to help escape saddle points or local minima [Ge+15; AG16; Jin+17; Jin+18].
Thus, a better understanding of sampling algorithms, and stronger guarantees for
them, will translate to a better understanding of and stronger guarantees for opti-
mization algorithms.
Secondly, sampling algorithms arise as a subroutine in online decision problems.
One generic algorithm for online optimization, called multiplicative weights, is to sam-
ple a point 𝑥𝑡 from the exponential of the (suitably weighted) negative loss ([CL06;
AHK12], [NR17, Lemma 10]). Indeed, there are settings such as online logistic re-
gression in which the only known way to achieve optimal regret is through a Bayesian
sampling approach [Fos+18], with lower bounds known for the naive convex opti-
mization approach [HKL14]. The celebrated Cover’s algorithm [Cov11] is an optimal
algorithm for portfolio management based on posterior sampling. Another applica-
tion is for Thompson sampling for bandit problems [CL11; AG12; AG13b; AG13a;
Rus+18; DFE18].
1.2.3 Theoretical computer science
It is of great interest in theoretical computer science to develop efficient algorithms
that approximately count objects such as independent sets or matchings in graphs,
or the number of solutions to constraint satisfaction problems. One way to develop
algorithms for counting is to first develop algorithms for sampling, because the prob-
lems of approximate counting and sampling are equivalent for self-reducible problems.
If we can define a rapidly mixing Markov chain on the set of solutions to a self-reducible
counting problem, then the counting problem has a fully polynomial-time randomized
approximation scheme (FPRAS).
This strategy has been used to obtain a FPRAS for the partition function of
the ferromagnetic Ising model [JS93], for counting the number of perfect matchings
in a graph, and for approximating the permanent of a matrix with nonnegative entries [JSV04].
In the continuous setting, Markov chains have been used for volume computation
of convex sets [DFK91].
1.3 Markov Chain Monte Carlo
A Markov chain 𝑀 on a state space Ω is a random process 𝑋0, 𝑋1, . . . where the
probability law for the next state depends on just the previous state, in a time-
invariant fashion: There is a fixed transition kernel 𝑇 , where 𝑇 (𝑥, ·) is a measure on
Ω for each 𝑥 ∈ Ω, such that
P(𝑋𝑡+1 ∈ 𝐴|𝑋0, . . . , 𝑋𝑡) = P(𝑋𝑡+1 ∈ 𝐴|𝑋𝑡) = 𝑇 (𝑋𝑡, 𝐴). (1.8)
A stationary measure 𝑃 for the Markov chain is a measure such that if 𝑋0 ∼ 𝑃, then 𝑋1 ∼ 𝑃
and hence 𝑋𝑡 ∼ 𝑃 for all 𝑡. A Markov chain is ergodic if the stationary measure 𝑃
is unique and for any starting measure, the probability measure 𝜋𝑡 of 𝑋𝑡 converges
to 𝑃 in TV-distance. Define the 𝜀-mixing time of 𝑀 to be the smallest 𝑡 such that
for all starting measures 𝜋0,
𝑑𝑇𝑉 (𝜋𝑡, 𝑃 ) ≤ 𝜀, (1.9)
and informally say that 𝑀 has rapid mixing if 𝑡 is small (polynomial in relevant
parameters) and torpid mixing if 𝑡 is large.
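These definitions can be checked on a small discrete chain; a sketch (the lazy random walk on a 3-cycle and the function names are illustrative choices):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between distributions on a finite state space."""
    return 0.5 * np.abs(p - q).sum()

def mixing_time(T, pi0, pi, eps=0.25, t_max=10_000):
    """Smallest t such that d_TV(pi0 T^t, pi) <= eps (T row-stochastic)."""
    p = pi0.copy()
    for t in range(t_max + 1):
        if tv_distance(p, pi) <= eps:
            return t
        p = p @ T                      # one step of the chain
    return None

# Lazy random walk on a 3-cycle; the stationary distribution is uniform.
T = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
pi = np.full(3, 1 / 3)
t_mix = mixing_time(T, np.array([1.0, 0.0, 0.0]), pi, eps=0.01)
# for this chain d_TV shrinks by a factor of 4 each step, so t_mix == 4
```

Here convergence is geometric with rate given by the second-largest eigenvalue of 𝑇 (here 1/4), illustrating the spectral methods mentioned below.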
The theory of Markov chains is a beautiful and well-developed area of mathemat-
ics; however, there is a gap between the simple settings where rapid mixing has been
established (often with sharp rates), and the settings in which MCMC is often used.
Techniques for bounding the mixing time of Markov chains include spectral meth-
ods, functional inequalities inspired from geometry (such as Cheeger’s inequality and
canonical paths), coupling, and Harris recurrence; see [Dia09; Dia11] for a survey. We
give more background on Markov chains in Appendix A.
MCMC has been used in diverse fields, including physics (magnetization, simu-
lation of gases and fluids), chemistry (simulation of molecules), biology (genetics),
engineering (feedback control systems), medicine, economics, social science, food sci-
ence (because there is no such thing as too much bread), and forensics (such as
figuring out who killed Granny [Bal19]).
Key challenges in MCMC include overcoming torpid mixing for multimodal distri-
butions, designing efficient algorithms for big datasets and for streaming data (where
one pass over the data can be expensive), and assessing the quality of samples ob-
tained.
1.4 New results
Despite the extensive literature on Markov Chain Monte Carlo (MCMC) methods,
two shortcomings of current methods limit their applicability to real-world settings:
probability distributions in practice are multimodal, and applications often require
that the distribution be updated in an online fashion in response to streaming data.
1.4.1 Sampling from multimodal distributions
Many distributions that arise in practice, from simple mixture models to deep gener-
ative models, are multimodal, and hence any Markov chain which makes local moves
will get stuck in one mode. Although a variety of temperature heuristics are used to
address this problem in practice, their theoretical guarantees are not well-understood
even for simple multimodal distributions.
A natural Markov chain called Langevin diffusion mixes in polynomial time for
log-concave distributions [BE85], but can take exponential time otherwise (including
for multimodal distributions) [Bov+04]. In Chapter 2, I address this problem by
combining Langevin diffusion with simulated tempering. The resulting Markov
chain mixes more rapidly by transitioning between different temperatures of the dis-
tribution. To show rapid mixing, I develop a general method to prove mixing
times using “soft decompositions” of Markov chains, and use this theory to show that
our algorithm [GLR18] provably samples from polynomial mixtures of log-concave
distributions in polynomial time.
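A schematic of the combined chain (not the exact algorithm of Chapter 2: here the temperature-level weights are taken equal, which amounts to assuming the partition-function ratios are known, and the step size and level schedule are arbitrary):

```python
import numpy as np

def tempered_langevin(grad_f, f, betas, x0, eta=0.02, n_steps=5000, rng=None):
    """Sketch of simulated tempering over discretized Langevin dynamics:
    alternate a Langevin step at the current inverse temperature betas[i]
    with a Metropolis move between adjacent temperature levels."""
    rng = rng or np.random.default_rng(0)
    x, i, L = x0, 0, len(betas)
    samples = []
    for _ in range(n_steps):
        # Langevin step targeting e^{-betas[i] * f(x)}
        x = x - eta * betas[i] * grad_f(x) \
            + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
        # propose a move to an adjacent temperature level
        j = i + rng.choice([-1, 1])
        if 0 <= j < L and rng.random() < np.exp(min(0.0, -(betas[j] - betas[i]) * f(x))):
            i = j
        if i == L - 1:                 # record samples at the target level
            samples.append(x.copy())
    return np.array(samples)

# Target: standard Gaussian, f(x) = ||x||^2 / 2, with levels 0.25 < 0.5 < 1.
f = lambda x: 0.5 * float(x @ x)
grad_f = lambda x: x
samples = tempered_langevin(grad_f, f, [0.25, 0.5, 1.0], np.zeros(1))
```

The hotter levels flatten the distribution so the chain can cross between modes; samples are kept only when the chain is at the target temperature.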
1.4.2 Sampling from changing distributions
Likewise motivated by sampling problems which arise in practice, in Chapter 3, I
address the problem of obtaining independent samples from the distributions 𝑝𝑡(𝑥) ∝
𝑒^{−∑_{𝑘=0}^{𝑡} 𝑓𝑘(𝑥)} for each epoch 1 ≤ 𝑡 ≤ 𝑇 in an online manner, given a sequence of
(convex) functions 𝑓0, . . . , 𝑓𝑇 . This problem arises in large-scale Bayesian inference
where instead of obtaining all the observations at once, one constantly acquires new
data, and must continuously update the distribution. All previous results for this
problem imply a bound on the number of gradient evaluations at each epoch 𝑡 that
grows at least linearly in 𝑡. This is an inherent limitation of any Markov chain which
incorporates all data at each time step.
To overcome this problem, I show that a certain variance-reduced SGLD (stochas-
tic gradient Langevin dynamics) algorithm [LMV19] solves the online sampling prob-
lem to fixed TV-error 𝜀 using an almost constant number of gradient evaluations
per epoch. We make mild assumptions that are motivated by applications such as
online logistic regression. As described in Section 1.2.2, our work on online sampling
has applications to online optimization.
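A schematic of the variance-reduced gradient estimate involved (not the exact algorithm of [LMV19]; the anchor choice, step size, and batch size below are illustrative assumptions):

```python
import numpy as np

def vr_sgld_epoch(grads, theta, theta_anchor, g_anchor, eta, n_inner, batch, rng):
    """One epoch of variance-reduced SGLD (a schematic): the full gradient
    sum_k grad_k(theta) is estimated by
        g_anchor + (t/|B|) * sum_{k in B} (grad_k(theta) - grad_k(theta_anchor)),
    where g_anchor = sum_k grad_k(theta_anchor) was computed once at the anchor,
    so each inner step costs O(batch) gradient evaluations, not O(t)."""
    t = len(grads)
    for _ in range(n_inner):
        B = rng.integers(t, size=batch)          # random minibatch of indices
        g = g_anchor + (t / batch) * sum(
            grads[k](theta) - grads[k](theta_anchor) for k in B)
        theta = theta - eta * g + np.sqrt(2 * eta) * rng.standard_normal(theta.shape)
    return theta

# Toy instance: f_k(theta) = (theta - a_k)^2 / 2, so p_t = N(mean(a), 1/t).
rng = np.random.default_rng(0)
a = rng.standard_normal(10)
grads = [(lambda th, ak=ak: th - ak) for ak in a]
anchor = np.array([a.mean()])
g_anchor = sum(gk(anchor) for gk in grads)
theta = vr_sgld_epoch(grads, anchor.copy(), anchor, g_anchor,
                      eta=0.01, n_inner=200, batch=2, rng=rng)
```

The estimator is unbiased for the full gradient, and its variance is small whenever 𝜃 stays close to the anchor point, which is what keeps the per-epoch gradient cost nearly constant.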
Chapter 2
Sampling from multimodal
distributions using simulated
tempering
This chapter is based on work in [GLR18].
2.1 Introduction
Dealing with multimodal distributions is a core challenge in Markov Chain Monte
Carlo. The naive Markov chain often does not mix rapidly, and we obtain samples
from only one part of the support of the distribution.
Practitioners have dealt with this problem through a variety of heuristics. A
popular family of approaches involve changing the temperature of the distribution.
However, there has been little theoretical analysis of such methods. We give provable
guarantees for a temperature-based method called simulated tempering when it is
combined with Langevin diffusion.
Previous theoretical results on sampling have focused on log-concave distributions,
i.e., distributions of the form 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥) for a convex function 𝑓(𝑥). This is analogous
to convex optimization where the objective function 𝑓(𝑥) is convex. Recently,
there has been renewed interest in analyzing a popular Markov Chain for sampling
from such distributions, when given gradient access to 𝑓—a natural setup for the pos-
terior sampling task described above. In particular, a Markov chain called Langevin
Monte Carlo (see Section 2.2.1), popular with Bayesian practitioners, has been proven
to work, with various rates depending on the precise properties of 𝑓 [Dal16; DM16;
Dal17; CB18b; DMM18].
Yet, just as many interesting optimization problems are nonconvex, many interest-
ing sampling problems are not log-concave. A log-concave distribution is necessarily
uni-modal: its density function has only one local maximum, which is necessarily a
global maximum. This fails to capture many interesting scenarios. Many simple pos-
terior distributions are neither log-concave nor uni-modal, for instance, the posterior
distribution of the means for a mixture of gaussians, given a sample of points from the
mixture of gaussians. In a more practical direction, complicated posterior distribu-
tions associated with deep generative models [RMW14] and variational auto-encoders
[KW13] are believed to be multimodal as well.
In this work we initiate an exploration of provable methods for sampling “beyond
log-concavity,” in parallel to optimization “beyond convexity”. As worst-case results
are precluded by hardness results, we must make assumptions on the distributions of
interest. As a first step, we consider a mixture of strongly log-concave distributions
of the same shape. This class of distributions captures the prototypical multimodal
distribution, a mixture of Gaussians with the same covariance matrix. Our result
is also robust in the sense that even if the actual distribution has density that is
only close to a mixture that we can handle, our algorithm can still sample from
the distribution in polynomial time. Note that the requirement that all Gaussians
have the same covariance matrix is in some sense necessary: in Appendix B.5 we show that even if the covariances of two components differ by a constant factor, no algorithm (with query access to 𝑓 and ∇𝑓) can achieve the same robustness guarantee in polynomial time.
2.1.1 Problem statement
We formalize the problem of interest as follows.
Problem 2.1.1. Let 𝑓 : R𝑑 → R be a function. Given query access to ∇𝑓(𝑥) and 𝑓(𝑥)
at any point 𝑥 ∈ R𝑑, sample from the probability distribution with density function
𝑝(𝑥) ∝ 𝑒−𝑓(𝑥).
In particular, consider the case where 𝑒−𝑓(𝑥) is the density function of a mixture
of strongly log-concave distributions that are translates of each other. That is, there is
a base function 𝑓0 : R𝑑 → R, centers 𝜇1, 𝜇2, . . . , 𝜇𝑚 ∈ R𝑑, and weights 𝑤1, 𝑤2, . . . , 𝑤𝑚 (∑_{𝑖=1}^{𝑚} 𝑤𝑖 = 1) such that

𝑓(𝑥) = − log(∑_{𝑖=1}^{𝑚} 𝑤𝑖 𝑒^{−𝑓0(𝑥−𝜇𝑖)}).  (2.1)

For notational convenience, we will define 𝑓𝑖(𝑥) = 𝑓0(𝑥 − 𝜇𝑖).
The function 𝑓0 specifies a basic “shape” around the modes, and the means 𝜇𝑖
indicate the locations of the modes.
Without loss of generality we assume the mode of the distribution 𝑒−𝑓0(𝑥) is at 0
(∇𝑓0(0) = 0). We also assume 𝑓0 is twice differentiable, and that for any 𝑥 the Hessian
is sandwiched as 𝜅𝐼 ⪯ ∇2𝑓0(𝑥) ⪯ 𝐾𝐼. Such functions are called 𝜅-strongly-convex,
𝐾-smooth functions. The corresponding distributions 𝑒−𝑓0(𝑥) are strongly log-concave
distributions.1

1 On a first read, we recommend concentrating on the case 𝑓0(𝑥) = ‖𝑥‖^2/(2𝜎^2). This corresponds to the case where all the components are spherical Gaussians with mean 𝜇𝑖 and covariance matrix 𝜎^2𝐼.
2.1.2 Our results
We show that there is an efficient algorithm that can sample from this distribution
given just access to 𝑓(𝑥) and ∇𝑓(𝑥).
Theorem 2.1.2 (Main). Given 𝑓(𝑥) as defined in Equation (2.1), where the base
function 𝑓0 satisfies for any 𝑥, 𝜅𝐼 ⪯ ∇2𝑓0(𝑥) ⪯ 𝐾𝐼, and ‖𝜇𝑖‖ ≤ 𝐷 for all 𝑖 ∈ [𝑚],
there is an algorithm (given as Algorithm 2 with appropriate setting of parameters)
with running time poly(1/𝑤min, 𝐷, 𝑑, 1/𝜀, 1/𝜅, 𝐾), which given query access to ∇𝑓 and 𝑓,
outputs a sample from a distribution within TV-distance 𝜀 of 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥).
Note that importantly the algorithm does not have direct access to the mixture
parameters 𝜇𝑖, 𝑤𝑖, 𝑖 ∈ [𝑚] (otherwise the problem would be trivial). Sampling from this
mixture is thus non-trivial: algorithms that are based on making local steps (such
as the ball-walk [LS93; Vem05] and Langevin Monte Carlo) cannot move between
different components of the gaussian mixture when the gaussians are well-separated.
In the algorithm we use simulated tempering (see Section 2.2.2), which is a technique
that adjusts the “temperature” of the distribution in order to move between different
components.
Of course, requiring the distribution to be exactly a mixture of log-concave distri-
butions is a very strong assumption. Our results can be generalized to all functions
that are “close” to a mixture of log-concave distributions.
More precisely, assume the function 𝑓 satisfies the following properties:

∃𝑓̃ : R𝑑 → R where ‖𝑓 − 𝑓̃‖∞ ≤ ∆, ‖∇𝑓 − ∇𝑓̃‖∞ ≤ 𝜏, and ‖∇2𝑓(𝑥) − ∇2𝑓̃(𝑥)‖2 ≤ 𝜏 ∀𝑥 ∈ R𝑑,  (2.2)

and 𝑓̃(𝑥) = − log(∑_{𝑖=1}^{𝑚} 𝑤𝑖 𝑒^{−𝑓0(𝑥−𝜇𝑖)})  (2.3)

where ∇𝑓0(0) = 0, and ∀𝑥, 𝜅𝐼 ⪯ ∇2𝑓0(𝑥) ⪯ 𝐾𝐼.  (2.4)
That is, 𝑒−𝑓 is within an 𝑒^∆ multiplicative factor of an (unknown) mixture of log-concave densities. Our theorem can be generalized to this case.
Theorem 2.1.3 (general case). For function 𝑓(𝑥) that satisfies Equations (2.2),
(2.3), and (2.4), there is an algorithm (given as Algorithm 2 with appropriate setting
of parameters) that runs in time poly(1/𝑤min, 𝐷, 𝑑, 1/𝜀, 𝑒^∆, 𝜏, 1/𝜅, 𝐾), which given query
access to ∇𝑓 and 𝑓 , outputs a sample 𝑥 from a distribution that has TV-distance at
most 𝜀 from 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥).
Both main theorems may seem simple. In particular, one might conjecture that it is easy to use local search algorithms to find all the modes. However, in Section B.4 we give a few examples to show that such simple heuristics do not work (e.g., random initialization is not enough to find all the modes).
The assumption that all the mixture components share the same 𝑓0 (hence when
applied to Gaussians, all Gaussians have the same covariance) is also necessary. In Section B.5, we give an example where for a mixture of two Gaussians, even if the covariances only differ by a constant factor, any algorithm that achieves similar guarantees to Theorem 2.1.3 must take exponential time. The limiting factor is approximately finding all the mixture components. We note that when the approximate
locations of the mixture are known, there are heuristic ways to temper them differ-
ently; see [TRR18].
2.2 Overview of algorithm
Our algorithm combines Langevin diffusion, a chain for sampling from distributions of the form 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥) given only gradient access to 𝑓, with simulated tempering, a heuristic used for tackling multimodality. We briefly define both techniques and recall what is known about them. For technical prerequisites on Markov chains, the reader can refer to Appendix A.
The basic idea to keep in mind is the following: A Markov chain with local
moves such as Langevin diffusion gets stuck in a local mode. Creating a “meta-
Markov chain” which changes the temperature (the simulated tempering chain) can
exponentially speed up mixing.
2.2.1 Langevin dynamics
Langevin Monte Carlo is an algorithm for sampling from 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥) given access
to the gradient of the log-pdf, ∇𝑓 .
The continuous version, overdamped Langevin diffusion (often simply called Langevin
diffusion), is a stochastic process described by the stochastic differential equation
(henceforth SDE)
𝑑𝑋𝑡 = −∇𝑓(𝑋𝑡) 𝑑𝑡 + √2 𝑑𝑊𝑡  (2.5)
where 𝑊𝑡 is the Wiener process (Brownian motion). For us, the crucial fact is that
Langevin dynamics converges to the stationary distribution given by 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥).
We will always assume that ∫_{R𝑑} 𝑒−𝑓(𝑥) 𝑑𝑥 < ∞ and 𝑓 ∈ 𝐶2(R𝑑).
Substituting 𝛽𝑓 for 𝑓 in (2.5) gives the Langevin diffusion process for inverse
temperature 𝛽, which has stationary distribution ∝ 𝑒−𝛽𝑓(𝑥). Equivalently we can
consider the temperature as changing the magnitude of the noise:
𝑑𝑋𝑡 = −∇𝑓(𝑋𝑡) 𝑑𝑡 + √(2𝛽^{−1}) 𝑑𝑊𝑡.
Of course algorithmically we cannot run a continuous-time process, so we run a
discretized version of the above process: namely, we run a Markov chain where the
random variable at time 𝑡 is described as

𝑋𝑡+1 = 𝑋𝑡 − 𝜂∇𝑓(𝑋𝑡) + √(2𝜂) 𝜉𝑡, 𝜉𝑡 ∼ 𝑁(0, 𝐼)  (2.6)

where 𝜂 is the step size. (The reason for the √𝜂 scaling is that running Brownian motion for time 𝜂 scales the variance by 𝜂, and hence the standard deviation by √𝜂.) This is analogous to how gradient descent is a discretization of gradient flow.
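For concreteness, the discretized update (2.6) is a few lines of code. The following is a minimal sketch, not the algorithm analyzed in this chapter; the target (a standard Gaussian) and all parameter values are chosen purely for illustration.

```python
import numpy as np

def langevin_mc(grad_f, x0, eta, n_steps, rng):
    """Discretized (unadjusted) Langevin: x <- x - eta*grad_f(x) + sqrt(2*eta)*xi."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        xi = rng.standard_normal(x.shape)
        x = x - eta * grad_f(x) + np.sqrt(2 * eta) * xi
    return x

# Sample from a 1D standard Gaussian: f(x) = x^2/2, so grad_f(x) = x.
rng = np.random.default_rng(0)
samples = np.array([langevin_mc(lambda x: x, [5.0], 0.01, 2000, rng)[0]
                    for _ in range(200)])
print(samples.mean(), samples.std())  # roughly 0 and 1
```

Because the chain only makes local moves, running the same code on a well-separated bimodal 𝑓 would leave it stuck near one mode; this is exactly the failure that simulated tempering addresses below.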
Prior work on Langevin dynamics
For Langevin dynamics, convergence to the stationary distribution is a classic result
[Bha78]. Fast mixing for log-concave distributions is also a classic result: [BE85;
Bak+08] show that log-concave distributions satisfy a Poincare and log-Sobolev inequality, which characterize the rate of convergence: if 𝑓 is 𝛼-strongly convex, then the mixing time is on the order of 1/𝛼. Of course, algorithmically, one can only run
a “discretized” version of the Langevin dynamics. Analyses of the discretization
are more recent: [Dal16; DM16; Dal17; DK17; CB18b; DMM18] give running time bounds for sampling from a log-concave distribution over R𝑑, and [BEL18] give an algorithm to sample from a log-concave distribution restricted to a convex set by incorporating a projection. We note that these analyses, like ours, are for the simplest kind of
Langevin dynamics, the overdamped case; better rates are known for underdamped
dynamics ([Che+17]), if a Metropolis-Hastings rejection step is used ([Dwi+18]), and
for Hamiltonian Monte Carlo which takes into account momentum ([MS17]).
[RRT17; Che+18; VW19] carefully analyze the effect of discretization for arbi-
trary non-log-concave distributions with certain regularity properties, but the mixing
time is exponential in general; furthermore, it has long been known that transition-
ing between different modes can take exponentially long, a phenomenon known as
meta-stability [Bov+02; Bov+04; BGK05]. The Holley-Stroock Theorem (see e.g.
[BGL13]) shows that guarantees for mixing extend to distributions 𝑒−𝑓(𝑥) where 𝑓(𝑥)
is a “nice” function that is close to a convex function in 𝐿∞ distance; however, this
does not address more global deviations from convexity. [MV17] consider a more
general model with multiplicative noise.
2.2.2 Simulated tempering
For distributions that are far from being log-concave and have many deep modes,
additional techniques are necessary. One proposed heuristic, out of many, is simu-
lated tempering, which swaps between Markov chains that are different temperature
variants of the original chain. The intuition is that the Markov chains at higher tem-
perature can move between modes more easily, and hence, the higher-temperature
chain acts as a “bridge” to move between modes.
Indeed, Langevin dynamics corresponding to a higher temperature distribution (with 𝛽𝑓 rather than 𝑓, where 𝛽 < 1) mixes faster. (Here, we use terminology from statistical physics, letting 𝜏 denote the temperature and 𝛽 = 1/𝜏 denote the inverse temperature.) A high temperature flattens out the distribution. However, we can’t
simply run Langevin at a higher temperature because the stationary distribution is
wrong; the simulated tempering chain combines Markov chains at different tempera-
tures in a way that preserves the stationary distribution.
We can define simulated tempering with respect to any sequence of Markov chains
𝑀𝑖 on the same space Ω. Think of 𝑀𝑖 as the Markov chain corresponding to temper-
ature 𝑖, with stationary distribution 𝑒−𝛽𝑖𝑓 .
Then we define the simulated tempering Markov chain as follows.
∙ The state space is [𝐿] × Ω: 𝐿 copies of the state space (in our case R𝑑), one
copy for each temperature.
∙ The evolution is defined as follows.
1. If the current point is (𝑖, 𝑥), then evolve according to the 𝑖th chain 𝑀𝑖.
2. Propose swaps with some rate 𝜆. When a swap is proposed, attempt to move to a neighboring chain, 𝑖′ = 𝑖 ± 1. With probability min{𝑝𝑖′(𝑥)/𝑝𝑖(𝑥), 1}, the transition is successful. Otherwise, stay at the same point. This is a Metropolis-Hastings step; its purpose is to preserve the stationary distribution.2
The crucial fact to note is that the stationary distribution is a “mixture” of the
distributions corresponding to the different temperatures. Namely:
Proposition 2.2.1. [MP92; Nea96] If the 𝑀𝑘, 1 ≤ 𝑘 ≤ 𝐿 are reversible Markov
chains with stationary distributions 𝑝𝑘, then the simulated tempering chain 𝑀 is a
reversible Markov chain with stationary distribution
𝑝(𝑖, 𝑥) = (1/𝐿) 𝑝𝑖(𝑥).
The typical setting of simulated tempering is as follows. The Markov chains come
from a smooth family of Markov chains with parameter 𝛽 ≥ 0, and 𝑀𝑖 is the Markov
chain with parameter 𝛽𝑖, where 0 ≤ 𝛽1 ≤ . . . ≤ 𝛽𝐿 = 1. We are interested in sampling
from the distribution when 𝛽 is large (𝜏 is small). However, the chain suffers from
torpid mixing in this case, because the distribution is more peaked. The simulated
tempering chain uses smaller 𝛽 (larger 𝜏) to help with mixing. For us, the stationary
distribution at inverse temperature 𝛽 is ∝ 𝑒−𝛽𝑓(𝑥).
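As an illustration of the swap rule, here is a toy discrete-time version (a simplification for illustration, not the continuous-time chain defined later): following the discrete-time convention of footnote 2, a swap is proposed with some probability at each step, inner moves are discretized Langevin steps at the current inverse temperature, and a proposed swap is accepted with probability min{𝑝𝑖′(𝑥)/𝑝𝑖(𝑥), 1} = min{exp((𝛽𝑖 − 𝛽𝑖′)𝑓(𝑥)), 1}. The partition-function correction discussed in Section 2.2.3 is ignored here.

```python
import math, random

def tempering_step(x, i, betas, f, grad_f, eta, swap_prob, rng):
    """One step of a toy simulated tempering chain on levels p_i ∝ e^{-beta_i f}."""
    if rng.random() < swap_prob:
        j = i + rng.choice([-1, 1])           # propose a neighboring level
        if 0 <= j < len(betas):
            accept = min(1.0, math.exp((betas[i] - betas[j]) * f(x)))
            if rng.random() < accept:
                i = j
        return x, i
    xi = rng.gauss(0.0, 1.0)                  # otherwise: Langevin step at level i
    return x - eta * betas[i] * grad_f(x) + math.sqrt(2 * eta) * xi, i

# Bimodal target: mixture of N(-3,1) and N(3,1); numerical gradient for brevity.
def f(x):
    return -math.log(0.5 * math.exp(-(x - 3)**2 / 2) + 0.5 * math.exp(-(x + 3)**2 / 2))

def grad_f(x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

rng = random.Random(0)
betas = [0.05, 0.2, 1.0]
x, i, signs = 0.0, 0, set()
for _ in range(30000):
    x, i = tempering_step(x, i, betas, f, grad_f, 0.1, 0.2, rng)
    if i == len(betas) - 1:
        signs.add(x > 0)
print(signs)  # expect both signs: the lowest-temperature level visits both modes
```

The high-temperature level (𝛽 = 0.05) is nearly flat, so the chain crosses between basins there and carries the result back down the ladder.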
Prior work on simulated tempering
Provable results for this heuristic are few and far between. [WSH09a; Zhe03] lower-bound the spectral gap for generic simulated tempering chains, using a Markov chain decomposition technique due to [MR02]. However, for Problem 2.1.1, which we are interested in, the spectral gap bound in [WSH09a] is exponentially small as a function
of the number of modes. Drawing inspiration from [MR02], we establish a Markov chain decomposition technique that overcomes this.

2 This can be defined as either a discrete or continuous Markov chain. For a discrete chain, we propose a swap with probability 𝜆 and follow the current chain with probability 1 − 𝜆. For a continuous chain, the time between swaps is an exponential distribution with decay 𝜆 (in other words, the times of the swaps form a Poisson process). Note that simulated tempering is traditionally defined for discrete Markov chains, but we will use the continuous version. See Definition 2.5.1 for the formal definition.
One issue that comes up in simulated tempering is estimating the partition func-
tions; various methods have been proposed for this [PP07; Lia05].
2.2.3 Main algorithm
Our algorithm is intuitively the following. Take a sequence of inverse temperatures
𝛽𝑖, starting at a small value and increasing geometrically towards 1. Run simulated
tempering Langevin on these temperatures, suitably discretized. Take the samples
that are at the 𝐿th temperature.
Note that there is one complication: the standard simulated tempering chain assumes that we can compute the ratio between temperatures, 𝑝𝑖′(𝑥)/𝑝𝑖(𝑥). However, we only know the probability density functions up to a normalizing factor (the partition function). To overcome this, we note that if we use the ratios 𝑟𝑖′𝑝𝑖′(𝑥)/(𝑟𝑖𝑝𝑖(𝑥)) instead, where ∑_{𝑖=1}^{𝐿} 𝑟𝑖 = 1, then the chain converges to the stationary distribution with 𝑝(𝑖, 𝑥) = 𝑟𝑖𝑝𝑖(𝑥). Thus, it suffices to estimate each partition function up to a constant factor.
We can do this inductively: running the simulated tempering chain on the first ℓ
levels, we can estimate the partition function 𝑍ℓ+1; then we can run the simulated
tempering chain on the first ℓ + 1 levels. This is what Algorithm 2 does when it calls Algorithm 1 as a subroutine.
A formal description of the algorithm follows.
2.3 Overview of the proof techniques
We summarize the main ingredients and crucial techniques in the proof. Full proofs
appear in the following sections.
Algorithm 1 Simulated tempering Langevin Monte Carlo

INPUT: Temperatures 𝛽1, . . . , 𝛽ℓ; partition function estimates Ẑ1, . . . , Ẑℓ; step size 𝜂, time 𝑇, rate 𝜆, variance of initial distribution 𝜎0.
OUTPUT: A random sample 𝑥 ∈ R𝑑 (approximately from the distribution 𝑝ℓ(𝑥) ∝ 𝑒^{−𝛽ℓ𝑓(𝑥)}).
Let (𝑖, 𝑥) = (1, 𝑥0) where 𝑥0 ∼ 𝑁(0, 𝜎0^2 𝐼).
Let 𝑛 = 0, 𝑇0 = 0.
while 𝑇𝑛 < 𝑇 do
  Determine the next transition time: draw 𝜉𝑛+1 from the exponential distribution 𝑝(𝑥) = 𝜆𝑒^{−𝜆𝑥}, 𝑥 ≥ 0.
  Let 𝜉𝑛+1 ← min{𝑇 − 𝑇𝑛, 𝜉𝑛+1}, 𝑇𝑛+1 = 𝑇𝑛 + 𝜉𝑛+1.
  Let 𝜂′ = 𝜉𝑛+1/⌈𝜉𝑛+1/𝜂⌉ (the largest step size ≤ 𝜂 that evenly divides 𝜉𝑛+1).
  Repeat ⌈𝜉𝑛+1/𝜂⌉ times: update 𝑥 according to 𝑥 ← 𝑥 − 𝜂′𝛽𝑖∇𝑓(𝑥) + √(2𝜂′) 𝜉, 𝜉 ∼ 𝑁(0, 𝐼).
  If 𝑇𝑛+1 < 𝑇 (i.e., the end time has not been reached), let 𝑖′ = 𝑖 ± 1 each with probability 1/2. If 𝑖′ is out of bounds, do nothing. If 𝑖′ is in bounds, make a type 2 transition, where the acceptance ratio is min{(𝑒^{−𝛽𝑖′𝑓(𝑥)}/Ẑ𝑖′)/(𝑒^{−𝛽𝑖𝑓(𝑥)}/Ẑ𝑖), 1}.
  𝑛 ← 𝑛 + 1.
end while
If the final state is (ℓ, 𝑥) for some 𝑥 ∈ R𝑑, return 𝑥. Otherwise, re-run the chain.
Algorithm 2 Main algorithm

INPUT: A function 𝑓 : R𝑑 → R satisfying Assumption (2.2), to which we have gradient access.
OUTPUT: A random sample 𝑥 ∈ R𝑑.
Let 0 ≤ 𝛽1 < · · · < 𝛽𝐿 = 1 be a sequence of inverse temperatures satisfying (2.192) and (2.193).
Let Ẑ1 = 1.
for ℓ = 1 → 𝐿 do
  Run the simulated tempering chain in Algorithm 1 with temperatures 𝛽1, . . . , 𝛽ℓ, partition function estimates Ẑ1, . . . , Ẑℓ, step size 𝜂, time 𝑇, and rate 𝜆 given by Lemma 2.9.2.
  If ℓ = 𝐿, return the sample.
  If ℓ < 𝐿, repeat to get 𝑛 = 𝑂(𝐿^2 ln(1/𝛿)) samples, and let Ẑℓ+1 = Ẑℓ · (1/𝑛) ∑_{𝑗=1}^{𝑛} 𝑒^{(−𝛽ℓ+1+𝛽ℓ)𝑓(𝑥𝑗)}.
end for
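In code, the main loop of Algorithm 1 looks roughly as follows. This is a simplified one-dimensional sketch for illustration; the parameter choices of Lemma 2.9.2 are not reproduced here, and the sanity-check parameters are made up.

```python
import math, random

def sim_temp_langevin(f, grad_f, betas, Zhat, eta, T, lam, sigma0, rng):
    """Simulated tempering Langevin (1-D): Poisson(lam) swap proposals, Langevin in between."""
    i, x = 0, rng.gauss(0.0, sigma0)
    t = 0.0
    while t < T:
        dt = min(rng.expovariate(lam), T - t)   # time until next proposed swap
        steps = max(1, math.ceil(dt / eta))
        eta_p = dt / steps                      # largest step size <= eta dividing dt evenly
        for _ in range(steps):
            x = x - eta_p * betas[i] * grad_f(x) + math.sqrt(2 * eta_p) * rng.gauss(0.0, 1.0)
        t += dt
        if t < T:                               # type 2 transition (Metropolis swap)
            j = i + rng.choice([-1, 1])
            if 0 <= j < len(betas):
                ratio = ((math.exp(-betas[j] * f(x)) / Zhat[j])
                         / (math.exp(-betas[i] * f(x)) / Zhat[i]))
                if rng.random() < min(1.0, ratio):
                    i = j
    return i, x  # keep x only if i == len(betas) - 1 (the target temperature)

# Sanity check on f(x) = x^2/2, using the exact partition functions Z_beta = sqrt(2*pi/beta).
rng = random.Random(1)
Zhat = [math.sqrt(2 * math.pi / 0.5), math.sqrt(2 * math.pi)]
runs = [sim_temp_langevin(lambda x: x * x / 2, lambda x: x, [0.5, 1.0], Zhat,
                          0.05, 5.0, 1.0, 1.0, rng) for _ in range(300)]
samples = [x for i, x in runs if i == 1]
print(len(samples))  # roughly half the runs end at the target temperature
```

With exact partition functions the stationary distribution is uniform over levels, which is why roughly half the runs are kept; Algorithm 2 exists precisely to supply good enough estimates Ẑ𝑖 when the exact values are unknown.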
Step 1: Define a continuous version of the simulated tempering Markov chain (Definition 2.5.1, Lemma 2.5.2), where transition times are real numbers determined by an exponential waiting-time distribution.
Step 2: Prove a new decomposition theorem (Theorem 2.6.5) for bounding the
spectral gap (or equivalently, the mixing time) of the simulated tempering chain we
define. This is the main technical ingredient, and also a result of independent interest.
While decomposition theorems have appeared in the Markov chain literature (e.g. [MR02]), typically one partitions the state space, and bounds the spectral gap using (1) the probability flow of the chain inside the individual sets, and (2) the probability flow between different sets.
In our case, we decompose the Markov chain itself; this includes a decomposition
of the stationary distribution into components. (More precisely, we show a decom-
position theorem on the generator of the tempering chain.) We would like to do this
because in our setting, the stationary distribution is exactly a mixture distribution
(Problem 2.1.1).
Our Markov chain decomposition theorem bounds the spectral gap (mixing time)
of a simulated tempering chain in terms of the spectral gap (mixing time) of two
chains:
1. “component” chains on the mixture components
2. a “projected” chain whose state space is the set of components, and which
captures the action of the chain between components as well as the distance
between the mixture components (measured in terms of their overlap)
This means that if the Markov chain on the individual components mixes rapidly,
and the “projected” chain mixes rapidly, then the simulated tempering chain mixes
rapidly as well. (Note [MR02, Theorem 1.2] does partition into mixture components, but it only considers the special case where the components are laid out in a chain.)
The mixing time of a continuous Markov chain is quantified by a Poincare in-
equality.
Theorem (Simplified version of Theorem 2.6.5). Consider the simulated tempering chain 𝑀 with rate 𝜆 = 1/𝐶, where the Markov chain at the 𝑖th level (temperature) is 𝑀𝑖 = (Ω, L𝑖) with stationary distribution 𝑝𝑖, for 1 ≤ 𝑖 ≤ 𝐿. Suppose we have a decomposition of the Markov chain at each level, 𝑝𝑖𝑀𝑖 = ∑_{𝑗=1}^{𝑚} 𝑤𝑖,𝑗𝑝𝑖,𝑗𝑀𝑖,𝑗, where ∑_{𝑗=1}^{𝑚} 𝑤𝑖,𝑗 = 1. If each 𝑀𝑖,𝑗 satisfies a Poincare inequality with constant 𝐶, and the projected chain M̄ satisfies a Poincare inequality with constant C̄, then 𝑀 satisfies a Poincare inequality with constant 𝑂(𝐶(1 + C̄)).
Here, the projected chain M̄ is the chain on [𝐿] × [𝑚] with probability flow in the same and adjacent levels given by

𝑇((𝑖, 𝑗), (𝑖, 𝑗′)) = 𝑤𝑖,𝑗′ 𝛿(𝑖,𝑗),(𝑖,𝑗′)  (2.7)
𝑇((𝑖, 𝑗), (𝑖 ± 1, 𝑗)) = 𝛿(𝑖,𝑗),(𝑖±1,𝑗)  (2.8)

where 𝛿(𝑖,𝑗),(𝑖′,𝑗′) := ∫_Ω min{𝑝𝑖,𝑗(𝑥), 𝑝𝑖′,𝑗′(𝑥)} 𝑑𝑥 is the overlap.
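The overlap quantities 𝛿 are concrete and easy to estimate. For example, for two unit-variance Gaussians on R with means ±𝜇, the overlap ∫min{𝑝1, 𝑝2} equals 2Φ(−𝜇), where Φ is the standard normal CDF. A quick Monte Carlo check (illustrative only, not part of the algorithm):

```python
import math, random

def overlap_mc(mu, n, rng):
    """Estimate ∫ min{p1, p2} for p1 = N(mu, 1), p2 = N(-mu, 1),
    via E_{x ~ p1}[min{1, p2(x)/p1(x)}], where p2(x)/p1(x) = exp(-2*mu*x)."""
    acc = 0.0
    for _ in range(n):
        x = rng.gauss(mu, 1.0)
        acc += min(1.0, math.exp(-2 * mu * x))
    return acc / n

rng = random.Random(0)
mu = 1.0
est = overlap_mc(mu, 200000, rng)
exact = math.erfc(mu / math.sqrt(2))  # equals 2*Phi(-mu)
print(est, exact)  # both ≈ 0.3173
```

As 𝜇 grows, this overlap decays like 𝑒^{−𝜇^2/2}, which is why the projected chain needs the flattened high-temperature levels to connect distant components.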
The decomposition theorem is the reason why we use a slightly different simulated
tempering chain, which is allowed to transition at arbitrary times, with some rate 𝜆.
Such a chain “composes” nicely with the decomposition of the Langevin chain, and
allows a better control of the Dirichlet form of the tempering chain, which governs
the mixing time.
Step 3: Finally, we need to apply the decomposition theorem to our setup,
namely a distribution which is a mixture of strongly log-concave distributions. The
“components” of the decomposition in our setup are simply the mixture components
𝑒−𝑓0(𝑥−𝜇𝑗). We rely crucially on the fact that Langevin diffusion on a mixture distribution decomposes into Langevin diffusion on the individual components.
We actually first analyze the hypothetical simulated tempering Langevin chain on 𝑝̃𝑖 ∝ ∑_{𝑗=1}^{𝑚} 𝑤𝑗 𝑒^{−𝛽𝑖𝑓0(𝑥−𝜇𝑗)} (Theorem 2.7.1), i.e., where the stationary distribution for each temperature is itself a mixture. Then in Lemma 2.7.5 we compare to the actual simulated tempering Langevin chain that we can run, where 𝑝𝑖 ∝ 𝑝^{𝛽𝑖}. To do this, we use the fact that 𝑝̃𝑖 is off from 𝑝𝑖 by a factor of at most 1/𝑤min. (This is the only place where a factor of 𝑤min comes in.)
To use our Markov chain decomposition theorem, we need to show two things:
1. The component chains mix rapidly: this follows from the classic fact that
Langevin diffusion mixes rapidly for log-concave distributions.
2. The projected chain mixes rapidly: The “projected” chain is defined as having
more probability flow between mixture components in the same or adjacent
temperatures which are close together in 𝜒2-divergence.
By choosing the temperatures close enough, we can ensure that the correspond-
ing mixture components in adjacent temperatures are close (in the sense of
having high overlap). By choosing the highest temperature large enough, we
can ensure that all the mixture components at the highest temperature are
close.
From this it follows that we can easily get from any component to any other (by
traveling up to the highest temperature and then back down). Thus the pro-
jected chain mixes rapidly from the method of canonical paths, Theorem A.1.3.
Note that the equal variance (for gaussians) or shape (for general log-concave dis-
tributions) condition is necessary here. For gaussians with different variance, the
Markov chain can fail to mix between components at the highest temperature. This
is because scaling the temperature changes the variance of all the components equally,
and preserves their ratio (which is not equal to 1).
Step 4: We analyze the error from discretization (Lemma 2.8.1), and choose
parameters so that it is small. We show that in Algorithm 2 we can inductively
estimate the partition functions. When we have all the estimates, we can run the
simulated tempering chain on all the temperatures to get the desired sample.
2.4 Theorem statements
We restate the main theorems more precisely. First define the assumptions.
Assumption 2.4.1. The function 𝑓 satisfies the following. There exists a function 𝑓̃ that satisfies the following properties.

1. 𝑓̃, ∇𝑓̃, and ∇2𝑓̃ are close to 𝑓:

‖𝑓 − 𝑓̃‖∞ ≤ ∆, ‖∇𝑓 − ∇𝑓̃‖∞ ≤ 𝜏, and ∇2𝑓(𝑥) ⪯ ∇2𝑓̃(𝑥) + 𝜏𝐼, ∀𝑥 ∈ R𝑑.  (2.9)
2. 𝑓̃ is the negative log-pdf of a mixture:

𝑓̃(𝑥) = − log(∑_{𝑖=1}^{𝑚} 𝑤𝑖 𝑒^{−𝑓0(𝑥−𝜇𝑖)})  (2.10)
where ∇𝑓0(0) = 0 and
(a) 𝑓0 is 𝜅-strongly convex: ∇2𝑓0(𝑥) ⪰ 𝜅𝐼 for 𝜅 > 0.
(b) 𝑓0 is 𝐾-smooth: ∇2𝑓0(𝑥) ⪯ 𝐾𝐼.
Our main theorem is the following.
Theorem 2.4.2 (Main theorem, Gaussian version). Suppose

𝑓(𝑥) = − ln(∑_{𝑗=1}^{𝑚} 𝑤𝑗 exp(−‖𝑥 − 𝜇𝑗‖^2/(2𝜎^2)))  (2.11)

on R𝑑 where ∑_{𝑗=1}^{𝑚} 𝑤𝑗 = 1, 𝑤min = min_{1≤𝑗≤𝑚} 𝑤𝑗 > 0, and 𝐷 = max_{1≤𝑗≤𝑚} ‖𝜇𝑗‖. Then Algorithm 2 with parameters satisfying 𝑡, 𝑇, 𝜂^{−1}, 𝛽1^{−1}, (𝛽𝑖 − 𝛽𝑖−1)^{−1} = poly(1/𝑤min, 𝐷, 𝑑, 1/𝜎^2, 1/𝜀) produces a sample from a distribution 𝑝′ with ‖𝑝 − 𝑝′‖1 ≤ 𝜀 in time poly(1/𝑤min, 𝐷, 𝑑, 1/𝜎^2, 1/𝜀).
The precise parameter choices are given in Lemma 2.9.2. Examining the parameters, the number of temperatures required is 𝐿 = Θ̃(𝑑), the amount of time to simulate the Markov chain is 𝑇 = Θ̃(𝐿^2/𝑤min^3), and the step size is 𝜂 = Θ̃(𝜀^2/(𝑑𝑇)), so the total number of steps is 𝑇/𝜂 = Θ̃(𝑇^2𝑑/𝜀^2) = Θ̃(𝑑^5/(𝜀^2𝑤min^6)). Note that to obtain the actual complexity, we need to additionally multiply by a factor of 𝐿^4 = Θ̃(𝑑^4): one factor of 𝐿 comes from needing to estimate the partition function at each temperature, a factor of 𝐿^2 comes from the fact that we need 𝐿^2 samples at each temperature to do this, and the final factor of 𝐿 comes from the fact that we reject the sample if it is not at the final temperature. (We have not made an effort to optimize this 𝐿^4 factor.)
Our more general theorem allows the mixture components to come from an arbitrary log-concave distribution 𝑝(𝑥) ∝ 𝑒−𝑓0(𝑥).
Theorem 2.4.3 (Main theorem). Suppose 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥) and 𝑓(𝑥) = − ln(∑_{𝑖=1}^{𝑚} 𝑤𝑖 𝑒^{−𝑓0(𝑥−𝜇𝑖)}) on R𝑑, where the function 𝑓0 satisfies Assumption 2.4.1(2) (𝑓0 is 𝜅-strongly convex, 𝐾-smooth, and has minimum at 0), ∑_{𝑖=1}^{𝑚} 𝑤𝑖 = 1, 𝑤min = min_{1≤𝑖≤𝑚} 𝑤𝑖 > 0, and 𝐷 = max_{1≤𝑖≤𝑚} ‖𝜇𝑖‖. Then Algorithm 2 with parameters satisfying 𝑡, 𝑇, 𝜂^{−1}, 𝛽1^{−1}, (𝛽𝑖 − 𝛽𝑖−1)^{−1} = poly(1/𝑤min, 𝐷, 𝑑, 1/𝜀, 1/𝜅, 𝐾) produces a sample from a distribution 𝑝′ with ‖𝑝 − 𝑝′‖1 ≤ 𝜀 in time poly(1/𝑤min, 𝐷, 𝑑, 1/𝜀, 1/𝜅, 𝐾).
The precise parameter choices are given in Lemma B.1.3.
Theorem 2.4.4 (Main theorem with perturbations). Keep the setup of Theorem 2.4.3. If instead 𝑓 satisfies Assumption 2.4.1 (𝑓 is ∆-close in 𝐿∞ norm to the negative log-pdf of a mixture of log-concave distributions), then the result of Theorem 2.4.3 holds with an additional factor of poly(𝑒^∆, 𝜏) in the running time.
2.5 Simulated tempering
First we define a continuous version of the simulated tempering Markov chain (Def-
inition 2.5.1). Unlike the usual definition of a simulated tempering chain in the
literature, the transition times can be arbitrary real numbers. Our definition falls out
naturally from writing down the generator L as a combination of the generators for
the individual chains and for the transitions between temperatures (Lemma 2.5.2).
Because L decomposes in this way, the Dirichlet form E will be easier to control in
Theorem 2.6.5.
Definition 2.5.1. Let 𝑀𝑖, 𝑖 ∈ [𝐿] be a sequence of continuous Markov processes with
state space Ω with stationary distributions 𝑝𝑖(𝑥) (with respect to a reference measure).
Let 𝑟𝑖, 1 ≤ 𝑖 ≤ 𝐿 satisfy
𝑟𝑖 > 0, ∑_{𝑖=1}^{𝐿} 𝑟𝑖 = 1.
Define the continuous simulated tempering Markov process 𝑀st with rate 𝜆
and relative probabilities 𝑟𝑖 as follows.
The states of 𝑀st are [𝐿]× Ω.
For the evolution, let (𝑇𝑛)𝑛≥0 be a Poisson point process on R≥0 with rate 𝜆, i.e.,
𝑇0 = 0 and
𝑇𝑛+1 − 𝑇𝑛|𝑇1, . . . , 𝑇𝑛 ∼ Exp(𝜆) (2.12)
with probability density 𝑝(𝑡) = 𝜆𝑒−𝜆𝑡. If the state at time 𝑇𝑛 is (𝑖, 𝑥), then the Markov
process evolves according to 𝑀𝑖 on the time interval [𝑇𝑛, 𝑇𝑛+1). The state 𝑋_{𝑇𝑛+1} at time 𝑇𝑛+1 is obtained from the left limit 𝑋_{𝑇𝑛+1}^− := lim_{𝑡→𝑇𝑛+1^−} 𝑋𝑡 by a “Type 2” transition: if 𝑋_{𝑇𝑛+1}^− = (𝑖, 𝑥), then transition to (𝑗 = 𝑖 ± 1, 𝑥) each with probability

(1/2) min{𝑟𝑗𝑝𝑗(𝑥)/(𝑟𝑖𝑝𝑖(𝑥)), 1}

and stay at (𝑖, 𝑥) otherwise. (If 𝑗 is out of bounds, then don’t move.)
Lemma 2.5.2. Let 𝑀𝑖, 𝑖 ∈ [𝐿] be a sequence of continuous Markov processes with
state space Ω, generators L𝑖 (with domains 𝒟(L𝑖)), and unique stationary distribu-
tions 𝑝𝑖. Then the continuous simulated tempering Markov process 𝑀st with rate 𝜆
and relative probabilities 𝑟𝑖 has generator L defined by the following equation, where
𝑔 = (𝑔1, . . . , 𝑔𝐿) ∈ ∏_{𝑖=1}^{𝐿} 𝒟(L𝑖):

(L𝑔)(𝑖, 𝑥) = (L𝑖𝑔𝑖)(𝑥) + (𝜆/2) ∑_{1≤𝑗≤𝐿, 𝑗=𝑖±1} min{𝑟𝑗𝑝𝑗(𝑥)/(𝑟𝑖𝑝𝑖(𝑥)), 1} (𝑔𝑗(𝑥) − 𝑔𝑖(𝑥)).  (2.13)
The corresponding Dirichlet form is

E(𝑔, 𝑔) = −⟨𝑔, L𝑔⟩  (2.14)
= ∑_{𝑖=1}^{𝐿} 𝑟𝑖 E𝑖(𝑔𝑖, 𝑔𝑖) + (𝜆/2) ∑_{1≤𝑖,𝑗≤𝐿, 𝑗=𝑖±1} ∫_Ω 𝑟𝑖𝑝𝑖(𝑥) min{𝑟𝑗𝑝𝑗(𝑥)/(𝑟𝑖𝑝𝑖(𝑥)), 1} (𝑔𝑖(𝑥)^2 − 𝑔𝑖(𝑥)𝑔𝑗(𝑥)) 𝑑𝑥  (2.15)
= ∑_{𝑖=1}^{𝐿} 𝑟𝑖 E𝑖(𝑔𝑖, 𝑔𝑖) + (𝜆/4) ∑_{1≤𝑖,𝑗≤𝐿, 𝑗=𝑖±1} ∫_Ω (𝑔𝑖 − 𝑔𝑗)^2 min{𝑟𝑖𝑝𝑖, 𝑟𝑗𝑝𝑗} 𝑑𝑥  (2.16)

where E𝑖(𝑔𝑖, 𝑔𝑖) = −⟨𝑔𝑖, L𝑖𝑔𝑖⟩_{𝑃𝑖}.
Proof. Continuous simulated tempering is a Markov process because the Poisson process is memoryless. We show that its generator equals L. Let 𝐹 be the operator which acts by

𝐹𝑔(𝑥, 𝑖) = 𝑔𝑖(𝑥) + (1/2) ∑_{1≤𝑗≤𝐿, 𝑗=𝑖±1} min{𝑟𝑗𝑝𝑗(𝑥)/(𝑟𝑖𝑝𝑖(𝑥)), 1} (𝑔𝑗(𝑥) − 𝑔𝑖(𝑥)).  (2.17)
Let 𝑁𝑡 = max{𝑛 : 𝑇𝑛 ≤ 𝑡}. Let 𝑃𝑗,𝑡 be such that (𝑃𝑗,𝑡𝑔)(𝑥) = E_{𝑀𝑗}[𝑔(𝑥𝑡)|𝑥0 = 𝑥], the expected value after running 𝑀𝑗 for time 𝑡, and let 𝑃𝑡 be the same operator for 𝑀. We have, letting 𝑃′_𝑠 = ∑_{𝑗=1}^{𝐿} 𝛿𝑗 × 𝑃𝑗,𝑠 (where 𝛿𝑗(𝑖) = 1_{𝑖=𝑗} is a function on [𝐿]),
𝑃𝑡𝑔 = P(𝑁𝑡 = 0) ∑_{𝑗=1}^{𝐿} 𝛿𝑗 × 𝑃𝑗,𝑡𝑔𝑗 + ∫_0^𝑡 𝑃′_𝑠 𝐹 𝑃′_{𝑡−𝑠} 𝑔 P(𝑡1 = 𝑑𝑠, 𝑁𝑡 = 1) + P(𝑁𝑡 ≥ 2)ℎ,  (2.18)

where ‖ℎ‖_{𝐿2(𝑃)} ≤ ‖𝑔‖_{𝐿2(𝑃)} (by contractivity of Markov processes). Here, 𝑃′_𝑠 𝐹 𝑃′_{𝑡−𝑠} comes from moving for time 𝑠 at one level, doing a level change, then moving for time 𝑡 − 𝑠 on the new level. By basic properties of the Poisson process, P(𝑁𝑡 = 0) = 1 − 𝜆𝑡 + 𝑂(𝑡^2), P(𝑡1 = 𝑠, 𝑁𝑡 = 1) = 𝜆 + 𝑂(𝑡) for 0 ≤ 𝑠 ≤ 𝑡, and P(𝑁𝑡 ≥ 2) = 𝑂(𝑡^2), so
𝑑/𝑑𝑡 (𝑃𝑡𝑔)|_{𝑡=0} = −𝜆𝑔 + ∑_{𝑗=1}^{𝐿} 𝛿𝑗 × L𝑗𝑔𝑗 + 𝜆𝐹𝑔 = L𝑔,  (2.19)

using that ∑_{𝑗=1}^{𝐿} 𝛿𝑗 × 𝑃𝑗,0 𝑔𝑗 = 𝑔.
2.6 Markov process decomposition theorems
For ease of reading, we first prove a simple density decomposition theorem, Theo-
rem 2.6.1. (This may be skipped, as it is a consequence of Theorem 2.6.3, up to
constants.) Then we will prove a general density decomposition theorem for Markov
processes, Theorem 2.6.3. Then we show how to specialize Theorem 2.6.3 to the case
of simulated tempering in Theorem 2.6.5, which is the density decomposition theorem
that we use to prove the main Theorem 2.4.2. We also give a version of the theorem
for a continuous index set, Theorem B.3.1.
We compare Theorems 2.6.1 and 2.6.3 to decomposition theorems in the literature, [MR02, Theorems 1.1, 1.2] and [WSH09a, Theorem 5.2]. Note that our theorems are stated for continuous-time Markov processes, while the others are stated for discrete time; however, either proof could be adapted to the other setting.
∙ In Theorem 2.6.1 we use the Poincare constants of the component Markov
processes, and the distance of their stationary distributions to each other, to
bound the Poincare constant of the original chain. (Theorem 2.6.1 gives a bound
in terms of the 𝜒2 divergences, but Remark 2.6.2 gives the bound in terms of
the “overlap” quantity which is used in the literature.)
This is a generalization of [MR02, Theorem 1.1], which deals with the special case where the state space is partitioned into overlapping sets, and [MR02, Theorem 1.2], which covers the special case where the component distributions are laid out in a chain. Our theorem can deal with any arrangement.
∙ In Theorem 2.6.3 we additionally use the “probability flow” between components
to get a more general bound. It involves partitioning the pairs of indices 𝐼 × 𝐼
into 𝑆 and 𝑆↔, where to get a good bound, one puts (𝑖, 𝑗) where 𝑝𝑖 and 𝑝𝑗 are
close into 𝑆, and (𝑖, 𝑗) where there is a lot of “probability flow” between 𝑝𝑖 and
𝑝𝑗 into 𝑆↔. Theorem 2.6.1 is the special case of Theorem 2.6.3 when 𝑆↔ = ∅. Note that [WSH09a, Theorem 5.2] is similar to the case where 𝑆 = ∅. However,
they depend only on the probability flow, while we depend on the “overlap” in
the probability flow.
2.6.1 Simple density decomposition theorem
Theorem 2.6.1 (Simple density decomposition theorem). Let 𝑀 = (Ω,L ) be a
(continuous-time) Markov process with stationary measure 𝑃 and Dirichlet form
E (𝑔, 𝑔) = −⟨𝑔,L 𝑔⟩𝑃 . Suppose the following hold.
1. There is a decomposition

⟨𝑓, L𝑔⟩_𝑃 = ∑_{𝑗=1}^{𝑚} 𝑤𝑗 ⟨𝑓, L𝑗𝑔⟩_{𝑃𝑗}  (2.20)
𝑃 = ∑_{𝑗=1}^{𝑚} 𝑤𝑗𝑃𝑗,  (2.21)

where L𝑗 is the generator for some Markov chain 𝑀𝑗 on Ω with stationary distribution 𝑃𝑗.
2. (Mixing for each 𝑀𝑗) The Dirichlet form E𝑗(𝑓, 𝑔) = −⟨𝑓, L𝑗𝑔⟩_{𝑃𝑗} satisfies the Poincare inequality

Var_{𝑃𝑗}(𝑔) ≤ 𝐶 E𝑗(𝑔, 𝑔).  (2.22)
3. (Mixing for projected chain) Define the projected process M̄ as the Markov process on [𝑚] generated by L̄, where L̄ acts on 𝑔 ∈ 𝐿2([𝑚]) by3

L̄𝑔(𝑗) = ∑_{1≤𝑘≤𝑚, 𝑘≠𝑗} [𝑔(𝑘) − 𝑔(𝑗)] 𝑇(𝑗, 𝑘)  (2.23)
where 𝑇(𝑗, 𝑘) = 𝑤𝑘/𝜒^2_max(𝑃𝑗||𝑃𝑘),  (2.24)

where 𝜒^2_max(𝑃||𝑄) := max{𝜒^2(𝑃||𝑄), 𝜒^2(𝑄||𝑃)}. (Define 𝑇(𝑗, 𝑘) = 0 if this is infinite.) Let P̄ be the stationary distribution of M̄; suppose M̄ satisfies the Poincare inequality

Var_{P̄}(𝑔) ≤ C̄ Ē(𝑔, 𝑔).  (2.25)

3 M̄ is defined so that the rate of diffusion from 𝑗 to 𝑘 is given by 𝑇(𝑗, 𝑘).
Then 𝑀 satisfies the Poincare inequality
Var𝑃 (𝑔) ≤ 𝐶
(1 +
𝐶
2
)E (𝑔, 𝑔). (2.26)
Remark 2.6.2. The theorem also holds with 𝑇(𝑗, 𝑘) = 𝑤𝑘𝛿𝑗,𝑘, where 𝛿𝑗,𝑘 is defined by

𝑃𝑗,𝑘(𝑑𝑥) = min{𝑑𝑃𝑘/𝑑𝑃𝑗, 1} 𝑃𝑗(𝑑𝑥) = min{𝑝𝑗, 𝑝𝑘} 𝑑𝑥  (2.27)
𝛿𝑗,𝑘 = 𝑃𝑗,𝑘(Ω) = ∫_Ω min{𝑝𝑗, 𝑝𝑘} 𝑑𝑥,  (2.28)

where the equalities on the right-hand sides hold when each 𝑃𝑖 has density function 𝑝𝑖. For this definition of 𝑇, the theorem holds with conclusion

Var_𝑃(𝑔) ≤ 𝐶 (1 + 2C̄) E(𝑔, 𝑔).  (2.29)
Proof. First, note that a stationary distribution P̄ of M̄ is given by p̄(𝑗) := P̄(𝑗) = 𝑤𝑗, because 𝑤𝑗𝑇(𝑗, 𝑘) = 𝑤𝑘𝑇(𝑘, 𝑗). (Note that the reason 𝑇 has a maximum of 𝜒^2 divergences in the denominator is to make this “detailed balance” condition hold.)

Given 𝑔 ∈ 𝐿2(Ω), define ḡ ∈ 𝐿2([𝑚]) by ḡ(𝑗) = E_{𝑃𝑗}𝑔. Then decomposing the variance into the variance within the 𝑃𝑗 and between the 𝑃𝑗, and using Assumptions 2 and 3, gives
Var_𝑃(𝑔) = ∑_{𝑗=1}^{𝑚} 𝑤𝑗 ∫ (𝑔(𝑥) − E_𝑃[𝑔])^2 𝑃𝑗(𝑑𝑥)  (2.30)
= ∑_{𝑗=1}^{𝑚} 𝑤𝑗 ∫ (𝑔(𝑥) − E_{𝑃𝑗}[𝑔])^2 𝑃𝑗(𝑑𝑥) + ∑_{𝑗=1}^{𝑚} 𝑤𝑗 (E_{𝑃𝑗}𝑔 − E_𝑃𝑔)^2  (2.31)
≤ ∑_{𝑗=1}^{𝑚} 𝑤𝑗 ∫ (𝑔 − E_{𝑃𝑗}𝑔)^2 𝑃𝑗(𝑑𝑥) + ∑_{𝑗=1}^{𝑚} p̄(𝑗) (ḡ(𝑗) − E_{P̄}ḡ)^2  (2.32)
≤ 𝐶 ∑_{𝑗=1}^{𝑚} 𝑤𝑗 E𝑗(𝑔, 𝑔) + Var_{P̄}(ḡ)  (2.33)
≤ 𝐶 E(𝑔, 𝑔) + C̄ Ē(ḡ, ḡ).  (2.34)
Note E (𝑔, 𝑔) =∑𝑚
𝑗=1𝑤𝑗E𝑃𝑗(𝑔, 𝑔) follows from Assumption 1. Now
E (𝑔, 𝑔) =1
2
𝑚∑𝑗=1
𝑚∑𝑘=1
(𝑔(𝑗)− 𝑔(𝑘))2𝑤𝑗𝑇 (𝑗, 𝑘) (2.35)
≤ 1
2
𝑚∑𝑗=1
𝑚∑𝑘=1
(𝑔(𝑗)− 𝑔(𝑘))2𝑤𝑗𝑤𝑘
𝜒2(𝑃𝑗||𝑃𝑘)(2.36)
≤ 1
2
𝑚∑𝑗=1
𝑚∑𝑘=1
Var𝑃𝑗(𝑔)𝑤𝑗𝑤𝑘 by Lemma D.1.1 (2.37)
≤ 1
2
𝑚∑𝑗=1
𝑤𝑗𝐶E𝑗(𝑔, 𝑔) =𝐶
2E (𝑔, 𝑔). (2.38)
Thus
(2.34) ≤ 𝐶E (𝑔, 𝑔) +𝐶𝐶
2E (𝑔, 𝑔) (2.39)
as needed.
For Remark 2.6.2, let $\widetilde{P}_{j,k}(dx) = \frac{P_{j,k}(dx)}{\delta_{j,k}} = \frac{P_{j,k}(dx)}{P_{j,k}(\Omega)}$; this is $P_{j,k}$ normalized to be a probability distribution. Note that we can instead bound (2.36) as follows:
$$(2.36) \le \sum_{j=1}^m\sum_{k=1}^m\left[\left(\mathbb{E}_{P_j}g - \mathbb{E}_{\widetilde{P}_{j,k}}g\right)^2 + \left(\mathbb{E}_{P_k}g - \mathbb{E}_{\widetilde{P}_{j,k}}g\right)^2\right]w_jw_k\,\delta_{j,k} \tag{2.40}$$
by $(a+b)^2 \le 2(a^2+b^2)$
$$\le \sum_{j=1}^m\sum_{k=1}^m\left[\operatorname{Var}_{P_j}(g)\,\chi^2(\widetilde{P}_{j,k}\|P_j) + \operatorname{Var}_{P_k}(g)\,\chi^2(\widetilde{P}_{j,k}\|P_k)\right]w_jw_k\,\delta_{j,k} \tag{2.41}$$
by Lemma D.1.1
$$\le \sum_{j=1}^m\sum_{k=1}^m\left(\operatorname{Var}_{P_j}(g) + \operatorname{Var}_{P_k}(g)\right)w_jw_k \tag{2.42}$$
by Lemma D.1.3
$$\le \sum_{j=1}^m w_j\,2C\,\mathcal{E}_j(g,g) = 2C\,\mathcal{E}(g,g), \tag{2.43}$$
which gives (2.29).
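The within/between variance decomposition behind (2.30)–(2.31) is just the law of total variance for a mixture. The following is a minimal numerical sanity check (not part of the thesis; the discrete mixture below is an assumed toy example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture P = sum_j w_j P_j of m discrete distributions on a common state space.
m, n_states = 3, 8
w = rng.dirichlet(np.ones(m))                 # component weights w_j
P = rng.dirichlet(np.ones(n_states), size=m)  # rows: densities p_j
g = rng.normal(size=n_states)                 # a test function g

p_mix = w @ P                                 # stationary density of the mixture
mean_mix = p_mix @ g                          # E_P g

# Component means gbar(j) = E_{P_j} g and within-component variances.
gbar = P @ g
var_within = np.array([P[j] @ (g - gbar[j]) ** 2 for j in range(m)])

# Law of total variance:
#   Var_P(g) = sum_j w_j Var_{P_j}(g) + sum_j w_j (gbar(j) - E_P g)^2,
# i.e., within-variance plus the variance of gbar under pbar(j) = w_j.
var_total = p_mix @ (g - mean_mix) ** 2
var_decomposed = w @ var_within + w @ (gbar - mean_mix) ** 2
assert abs(var_total - var_decomposed) < 1e-12
```

The theorem then bounds each of the two pieces separately: the first by the component Poincaré inequalities, the second by the projected chain's Poincaré inequality.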
2.6.2 General density decomposition theorem
Theorem 2.6.3 (General density decomposition theorem). Consider a Markov process $M = (\Omega, \mathcal{L})$ with stationary distribution $P$. Let $M_i = (\Omega_i, \mathcal{L}_i)$ be Markov processes for $i\in I$ ($|I|$ finite), with $M_i$ supported on $\Omega_i\subseteq\Omega$ (possibly overlapping) and with stationary distribution $P_i$. Let the Dirichlet forms be $\mathcal{E}(g,h) = -\langle g,\mathcal{L}h\rangle_P$ and $\mathcal{E}_i(g,h) = -\langle g,\mathcal{L}_ih\rangle_{P_i}$. (The latter only depends on $g|_{\Omega_i}$ and $h|_{\Omega_i}$.)

Suppose the following hold.

1. There is a decomposition
$$\langle f,\mathcal{L}g\rangle_P = \sum_{i\in I} w_i\,\langle f,\mathcal{L}_ig\rangle_{P_i} + \sum_{i,j\in I} w_i\,\langle f,(\mathcal{T}_{i,j}-\mathrm{Id})g\rangle_{P_i} \tag{2.44}$$
$$P = \sum_{i\in I} w_i P_i, \tag{2.45}$$
where $\mathcal{T}_{i,j}$ acts on a function $\Omega_j\to\mathbb{R}$ (or $\Omega\to\mathbb{R}$, by restriction) to give a function $\Omega_i\to\mathbb{R}$, by $\mathcal{T}_{i,j}g(x) = \int_{\Omega_j} g(y)\,T_{i,j}(x,dy)$, and $T_{i,j}$ satisfies the following:⁴

- For every $x\in\Omega_i$, $T_{i,j}(x,\cdot)$ is a measure on $\Omega_j$.
- $w_iP_i(dx)T_{i,j}(x,dy) = w_jP_j(dy)T_{j,i}(y,dx)$.

⁴We are breaking up the generator into a part that describes flow within the components, and a part that describes flow between the components. $Q_{i,j}$ describes the probability flow between $\Omega_i$ and $\Omega_j$. Note that fixing the $M_i = (\Omega_i,\mathcal{L}_i)$, the decomposition into the $\mathcal{T}_{i,j}$ may be non-unique, because the $\Omega_i$'s can overlap.

2. (Mixing for each $M_i$) $M_i$ satisfies the Poincaré inequality
$$\operatorname{Var}_{P_i}(g) \le C\,\mathcal{E}_i(g,g). \tag{2.46}$$

3. (Mixing for projected chain) Let $I\times I = S_{\sim}\sqcup S_{\leftrightarrow}$ be a partition of $I\times I$ such that $(i,j)\in S_\bullet \iff (j,i)\in S_\bullet$, for $\bullet\in\{\sim,\leftrightarrow\}$.⁵ Suppose that $\overline{T} : I\times I\to\mathbb{R}_{\ge0}$ satisfies⁶
$$\forall i\in I,\quad \sum_{j:(i,j)\in S_{\sim}} \overline{T}(i,j)\,\chi^2(P_j\|P_i) \le K_1 \tag{2.47}$$
$$\forall i\in I,\quad \sum_{j:(i,j)\in S_{\leftrightarrow}} \overline{T}(i,j)\,\chi^2(\widetilde{P}_{i,j}\|P_i) \le K_2 \tag{2.48}$$
$$\forall (i,j)\in S_{\leftrightarrow},\quad \overline{T}(i,j) \le K_3\,Q_{i,j}(\Omega_i,\Omega_j) \tag{2.49}$$
$$\forall i,j\in I,\quad w_i\overline{T}(i,j) = w_j\overline{T}(j,i), \tag{2.50}$$
where $Q_{i,j}(dx,dy) = P_i(dx)T_{i,j}(x,dy)$ and $\widetilde{P}_{i,j}(dx) = \frac{P_i(dx)T_{i,j}(x,\Omega_j)}{Q_{i,j}(\Omega_i,\Omega_j)}$.

Define the projected chain $\overline{M}$ as the Markov chain on $I$ generated by $\overline{\mathcal{L}}$, where $\overline{\mathcal{L}}$ acts on $\overline{g}\in L^2(I)$ by
$$\overline{\mathcal{L}}\,\overline{g}(i) = \sum_{j\in I}(\overline{g}(j)-\overline{g}(i))\,\overline{T}(i,j). \tag{2.51}$$
Let $\overline{P}$ be the stationary distribution of $\overline{M}$ and $\overline{\mathcal{E}}(\overline{g},\overline{g}) = -\langle\overline{g},\overline{\mathcal{L}}\,\overline{g}\rangle_{\overline{P}}$; suppose $\overline{M}$ satisfies the Poincaré inequality
$$\operatorname{Var}_{\overline{P}}(\overline{g}) \le \overline{C}\,\overline{\mathcal{E}}(\overline{g},\overline{g}). \tag{2.52}$$

Then $M$ satisfies the Poincaré inequality
$$\operatorname{Var}_P(g) \le \max\left\{C\left(1 + \left(\tfrac12K_1 + 3K_2\right)\overline{C}\right),\ 3K_3\overline{C}\right\}\mathcal{E}(g,g). \tag{2.53}$$

⁵The idea will be to choose $S_{\sim}$ to contain pairs $(i,j)$ such that the distributions $P_i, P_j$ are close, and to choose $S_{\leftrightarrow}$ to contain pairs such that there is a lot of probability flow between $P_i$ and $P_j$.

⁶$\overline{T}$ represents probability flow, and will be chosen to have large value on pairs $(i,j)$ such that $P_i, P_j$ are close (from $S_{\sim}$), and on pairs $(i,j)$ such that $P_i, P_j$ have a large probability flow between them, as determined by $\widetilde{P}_{i,j}$.
To use this theorem, we would choose $\overline{T}(i,j) \lesssim \frac{1}{\chi^2(P_j\|P_i)}$ for $(i,j)\in S_{\sim}$ and $\overline{T}(i,j) \lesssim \frac{1}{\chi^2(\widetilde{P}_{i,j}\|P_i)}$ for $(i,j)\in S_{\leftrightarrow}$; i.e., choose the projected chain to have large probability flow between distributions that are close, or that have a lot of flow between them. The measure $\widetilde{P}_{i,j}$ measures how much of $P_i$ is "flowing" into $P_j$.
Remark 2.6.4. As in Remark 2.6.2, we can replace (2.47) by
$$\forall i\in I,\quad \sum_{j:(i,j)\in S_{\sim}} \overline{T}(i,j)\,\delta_{i,j} \le K_1, \quad\text{where } \delta_{i,j} = \int_\Omega\min\left\{\frac{dP_j}{dP_i}, 1\right\}P_i(dx), \tag{2.54}$$
and obtain
$$\operatorname{Var}_P(g) \le \max\left\{C\left(1 + (2K_1+3K_2)\overline{C}\right),\ 3K_3\overline{C}\right\}\mathcal{E}(g,g). \tag{2.55}$$
Proof. First, we make some preliminary observations and definitions.

1. A stationary distribution $\overline{P}$ of $\overline{M}$ is given by $\overline{p}(i) := \overline{P}(i) = w_i$, because $w_i\overline{T}(i,j) = w_j\overline{T}(j,i)$ by (2.50).

2. For $i,j\in I$, define
$$Q_{i,j}(dx,dy) := P_i(dx)\,T_{i,j}(x,dy) \tag{2.56}$$
$$\widetilde{Q}_{i,j}(dx,dy) := \frac{Q_{i,j}(dx,dy)}{Q_{i,j}(\Omega_i,\Omega_j)} \tag{2.57}$$
$$\widetilde{P}_{i,j}(dx) := \int_{y\in\Omega_j}\widetilde{Q}_{i,j}(dx,dy) = \frac{P_i(dx)\,T_{i,j}(x,\Omega_j)}{Q_{i,j}(\Omega_i,\Omega_j)} \tag{2.58}$$
and note that
$$\frac{w_iP_i(dx)T_{i,j}(x,dy)}{w_iQ_{i,j}(\Omega_i,\Omega_j)} = \frac{w_jP_j(dy)T_{j,i}(y,dx)}{w_jQ_{j,i}(\Omega_j,\Omega_i)} \tag{2.59}$$
$$\implies \widetilde{Q}_{i,j}(dx,dy) = \widetilde{Q}_{j,i}(dy,dx). \tag{2.60}$$

3. Let
$$\mathcal{E}_{\circ}(g,g) = \sum_{i\in I} w_i\,\mathcal{E}_i(g,g) = -\sum_{i\in I} w_i\,\langle g,\mathcal{L}_ig\rangle_{P_i} \tag{2.61}$$
$$\mathcal{E}_{\leftrightarrow}(g,g) = -\sum_{i,j\in I} w_i\,\langle g,(\mathcal{T}_{i,j}-\mathrm{Id})g\rangle_{P_i}. \tag{2.62}$$
By Assumption 1, $\mathcal{E}(g,g) = \mathcal{E}_{\circ}(g,g) + \mathcal{E}_{\leftrightarrow}(g,g)$.

4. We can write $\mathcal{E}_{\leftrightarrow}$ in terms of squares as follows:
$$\mathcal{E}_{\leftrightarrow}(g,g) = -\sum_{i,j\in I}\int_{\Omega_i} g(x)\,w_i\int_{\Omega_j}(g(y)-g(x))\,T_{i,j}(x,dy)\,P_i(dx) \tag{2.63}$$
$$= \sum_{i,j\in I}\left[\frac12\int_{x\in\Omega_i}\int_{y\in\Omega_j} g(x)^2\,w_iP_i(dx)T_{i,j}(x,dy) - \int_{\Omega_i}\int_{\Omega_j} g(x)g(y)\,w_iP_i(dx)T_{i,j}(x,dy) + \frac12\int_{\Omega_i}\int_{\Omega_j} g(y)^2\,w_iP_i(dx)T_{i,j}(x,dy)\right] \tag{2.64–2.65}$$
$$= \frac12\sum_{i,j\in I}\int_{\Omega_i}\int_{\Omega_j}(g(x)-g(y))^2\,w_iP_i(dx)T_{i,j}(x,dy) \tag{2.66}$$
$$= \frac12\sum_{i,j\in I}\int_{\Omega_i}\int_{\Omega_j}(g(x)-g(y))^2\,w_iQ_{i,j}(dx,dy). \tag{2.67}$$
Given $g\in L^2(\Omega)$, define $\overline{g}\in L^2(I)$ by $\overline{g}(i) = \mathbb{E}_{P_i}g$. We decompose the variance into the variance within and between the parts, and then use the Poincaré inequality on each part:
$$\operatorname{Var}_P(g) = \sum_{i\in I} w_i\int_{\Omega_i}(g(x)-\mathbb{E}_P[g(x)])^2\,P_i(dx) \tag{2.68}$$
$$= \sum_{i\in I} w_i\left[\left(\int(g(x)-\mathbb{E}_{P_i}[g(x)])^2\,P_i(dx)\right) + (\mathbb{E}_{P_i}g - \mathbb{E}_Pg)^2\right] \tag{2.69}$$
$$\le C\sum_{i\in I} w_i\,\mathcal{E}_i(g,g) + \operatorname{Var}_{\overline{P}}(\overline{g}) \tag{2.70}$$
$$\le C\,\mathcal{E}_{\circ}(g,g) + \overline{C}\,\overline{\mathcal{E}}(\overline{g},\overline{g}). \tag{2.71}$$
Now we break up $\overline{\mathcal{E}}$ as follows:
$$\overline{\mathcal{E}}(\overline{g},\overline{g}) = \frac12\underbrace{\sum_{(i,j)\in S_{\sim}}(\overline{g}(i)-\overline{g}(j))^2\,w_i\overline{T}(i,j)}_{A} + \frac12\underbrace{\sum_{(i,j)\in S_{\leftrightarrow}}(\overline{g}(i)-\overline{g}(j))^2\,w_i\overline{T}(i,j)}_{B}. \tag{2.72}$$
First, as in Theorem 2.6.1, we bound $A$ by Lemma D.1.1:
$$A \le \sum_{(i,j)\in S_{\sim}}\operatorname{Var}_{P_i}(g)\,\chi^2(P_j\|P_i)\,w_i\overline{T}(i,j) \tag{2.73}$$
$$\le K_1\sum_{i\in I} w_i\operatorname{Var}_{P_i}(g) \tag{2.74}$$
$$\le K_1C\,\mathcal{E}_{\circ}(g,g). \tag{2.75}$$
For the second term,
$$B = \sum_{(i,j)\in S_{\leftrightarrow}}\left[\int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))\,P_i(dx)P_j(dy)\right]^2 w_i\overline{T}(i,j) \tag{2.76}$$
$$= \sum_{(i,j)\in S_{\leftrightarrow}}\left[\int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))\left(P_i(dx)P_j(dy)-\widetilde{Q}_{i,j}(dx,dy)\right) + \int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))\,\widetilde{Q}_{i,j}(dx,dy)\right]^2 w_i\overline{T}(i,j). \tag{2.77–2.78}$$
Note that
$$\int_{\Omega_i}\int_{\Omega_j} g(x)\left(P_i(dx)P_j(dy)-\widetilde{Q}_{i,j}(dx,dy)\right) = \int_{\Omega_i} g(x)\left(P_i(dx) - \int_{\Omega_j}\widetilde{Q}_{i,j}(dx,dy)\right) \tag{2.79}$$
$$= \int_{\Omega_i} g(x)\left(P_i(dx) - \widetilde{P}_{i,j}(dx)\right) \tag{2.80}$$
and similarly, because $\widetilde{Q}_{i,j}(dx,dy) = \widetilde{Q}_{j,i}(dy,dx)$ by (2.60),
$$\int_{\Omega_i}\int_{\Omega_j} g(y)\left(P_i(dx)P_j(dy)-\widetilde{Q}_{i,j}(dx,dy)\right) = \int_{\Omega_j} g(y)\left(P_j(dy)-\widetilde{P}_{j,i}(dy)\right). \tag{2.81}$$
Hence
$$B = \sum_{(i,j)\in S_{\leftrightarrow}}\left[\int_{x\in\Omega_i} g(x)\left(P_i(dx)-\widetilde{P}_{i,j}(dx)\right) - \int_{y\in\Omega_j} g(y)\left(P_j(dy)-\widetilde{P}_{j,i}(dy)\right) + \int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))\,\widetilde{Q}_{i,j}(dx,dy)\right]^2 w_i\overline{T}(i,j) \tag{2.82–2.83}$$
$$\le 3\sum_{(i,j)\in S_{\leftrightarrow}}\left[\left(\int_{x\in\Omega_i} g(x)\left(P_i(dx)-\widetilde{P}_{i,j}(dx)\right)\right)^2 + \left(\int_{y\in\Omega_j} g(y)\left(P_j(dy)-\widetilde{P}_{j,i}(dy)\right)\right)^2 + \left(\int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))\,\widetilde{Q}_{i,j}(dx,dy)\right)^2\right]w_i\overline{T}(i,j) \tag{2.84–2.86}$$
by Cauchy–Schwarz (in the form $(a+b+c)^2 \le 3(a^2+b^2+c^2)$)
$$\le 3\sum_{(i,j)\in S_{\leftrightarrow}}\left[\operatorname{Var}_{P_i}(g)\,\chi^2(\widetilde{P}_{i,j}\|P_i)\,w_i\overline{T}(i,j) + \operatorname{Var}_{P_j}(g)\,\chi^2(\widetilde{P}_{j,i}\|P_j)\,w_j\overline{T}(j,i) + \int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))^2\,w_i\widetilde{Q}_{i,j}(dx,dy)\,\overline{T}(i,j)\right] \tag{2.87–2.88}$$
by Lemma D.1.1 and (2.50) ($w_i\overline{T}(i,j) = w_j\overline{T}(j,i)$)
$$\le 3\left[2K_2\sum_{i\in I} w_i\operatorname{Var}_{P_i}(g) + K_3\sum_{(i,j)\in S_{\leftrightarrow}}\int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))^2\,w_iQ_{i,j}(dx,dy)\right] \tag{2.90–2.92}$$
by (2.48) and (2.49)
$$\le 6K_2C\,\mathcal{E}_{\circ}(g,g) + 6K_3\,\mathcal{E}_{\leftrightarrow}(g,g), \tag{2.93}$$
where the last line follows from the Poincaré inequality on each $M_i$, and (2.67).
Then
$$\operatorname{Var}_P(g) \le (2.71) \le C\,\mathcal{E}_{\circ}(g,g) + \overline{C}\left(\frac12(A+B)\right) \tag{2.94}$$
$$\le C\,\mathcal{E}_{\circ}(g,g) + \frac{\overline{C}}{2}\left(K_1C\,\mathcal{E}_{\circ}(g,g) + 6K_2C\,\mathcal{E}_{\circ}(g,g) + 6K_3\,\mathcal{E}_{\leftrightarrow}(g,g)\right) \tag{2.95}$$
$$\le \max\left\{C\left(1+\left(\tfrac12K_1+3K_2\right)\overline{C}\right),\ 3K_3\overline{C}\right\}\mathcal{E}(g,g). \tag{2.96}$$
2.6.3 Theorem for simulated tempering
We now apply the general density decomposition theorem (Theorem 2.6.3) to simulated tempering. The state space is $[L]\times\Omega$, and we consider a decomposition of $M$ into $M_{i,j} = (\{i\}\times\Omega, \mathcal{L}_{i,j})$ for $(i,j)\in I := \{(i,j) : 1\le i\le L,\ 1\le j\le m_i\}$, where $m_i$ is the number of components in the $i$th level. (For our application, we consider the case where all the $m_i$ are equal to a fixed $m$.) Here, $M_{i,j}$ is a Markov process on the $i$th level. Note that an index in $I$ is an ordered pair $(i,j)$, not to be confused with the $i,j$ in Theorem 2.6.3.

To apply the general density decomposition theorem, we first need to decompose the generator into generators of component parts $\mathcal{L}_{(i,j)}$ and probability flow between parts, $\mathcal{L}_{(i,j),(i',j')} = \mathcal{T}_{(i,j),(i',j')} - \mathrm{Id}$. In simulated tempering, the $\mathcal{L}_{(i,j)}$ generate Markov processes within the levels, and the $\mathcal{L}_{(i,j),(i',j')}$ cover the transitions between adjacent levels. Next, we have to partition the index set $I\times I$ into $S_{\sim}$, which will contain pairs such that the corresponding distributions are close, and $S_{\leftrightarrow}$, which will contain pairs with a lot of probability flow between them. Here $S_{\sim}$ will contain pairs within the same level (just the highest temperature, in our application), while $S_{\leftrightarrow}$ will contain pairs between adjacent levels.

We will see that the quantity $Q_{(i,j),(i'=i\pm1,j')}(\{i\}\times\Omega, \{i'\}\times\Omega)$ is simply the "overlap" between $P_{(i,j)}$ and $P_{(i',j')}$, and the measure $\widetilde{P}_{(i,j),(i',j')}$ is roughly $\min\{p_{(i,j)}, p_{(i',j')}\}$, appropriately normalized. $\overline{T}((i,j),(i',j'))$ is defined in terms of these quantities (see (2.100)), which means that the projected process has large probability flow between nodes $(i,j)$ and $(i',j')$, $i'\in\{i-1,i,i+1\}$, in the same and adjacent levels where the probability distributions are close.

One can apply Theorem 2.6.3 in different ways, as there are different ways to choose $\overline{T}$. In our case, we define it to include only connections at the highest temperature and, for the same component, between adjacent levels; the adjacency graph of $\overline{T}$ thus contains a complete graph at the highest temperature, and "chains" going down the levels. For a different simulated tempering problem, one could define it differently, for example, with a more complex tree structure.

For ease of notation, we will sometimes drop the parentheses in the subscripts, e.g., $p_{i,j} = p_{(i,j)}$.
Theorem 2.6.5 (Density decomposition theorem for simulated tempering). Consider the simulated tempering process $M$ built from Markov processes $M_i = (\Omega, \mathcal{L}_i)$, $1\le i\le L$. Let the stationary distribution of $M_i$ be $P_i$, the relative probabilities be $r_i$, and the rate be $\lambda$, and let $\mathcal{E}_i(g,h) = -\langle g,\mathcal{L}_ih\rangle_{P_i}$. Assume the probability measures have density functions with respect to some reference measure $dx$, represented by the corresponding lower-case letters: $P_i(dx) = p_i(x)\,dx$.

Represent a function $g$ on $[L]\times\Omega$ as $(g_1,\ldots,g_L)$. Let $P$ be the stationary distribution on $[L]\times\Omega$, $\mathcal{L}$ be the generator, and $\mathcal{E}(g,h) = -\langle g,\mathcal{L}h\rangle_P$ be the Dirichlet form.

Suppose the following hold.

1. There is a decomposition
$$\langle f,\mathcal{L}_ig\rangle_{P_i} = \sum_{j=1}^{m_i} w_{i,j}\,\langle f,\mathcal{L}_{(i,j)}g\rangle_{P_{(i,j)}} \quad\text{for } f,g:\Omega\to\mathbb{R} \tag{2.97}$$
$$P_i = \sum_{j=1}^{m_i} w_{i,j}\,P_{(i,j)}, \tag{2.98}$$
where $\mathcal{L}_{(i,j)}$ is the generator of some Markov chain $M_{i,j}$ on $\{i\}\times\Omega$ with stationary measure $P_{(i,j)}$.

2. (Mixing for each $M_{i,j}$) $M_{i,j}$ satisfies the Poincaré inequality
$$\operatorname{Var}_{P_{(i,j)}}(g) \le C\,\mathcal{E}_{i,j}(g,g), \tag{2.99}$$
where $\mathcal{E}_{i,j}(g,g) = -\langle g,\mathcal{L}_{i,j}g\rangle_{P_{(i,j)}}$.

3. (Mixing for projected chain) Define
$$\overline{T}((i,j),(i',j')) = \begin{cases}\dfrac{w_{1,j'}}{\chi^2_{\max}(P_{(1,j)}\|P_{(1,j')})}, & i=i'=1\\ K\,\delta_{(i,j),(i',j)}, & i'=i\pm1,\ j=j'\\ 0, & \text{else,}\end{cases} \tag{2.100}$$
where $\chi^2_{\max}(P\|Q) := \max\{\chi^2(P\|Q), \chi^2(Q\|P)\}$, $K>0$ is any constant, and
$$\delta_{(i,j),(i',j')} = \int_\Omega\min\left\{\frac{r_{i'}w_{i',j'}\,p_{(i',j')}(x)}{r_iw_{i,j}},\ p_{(i,j)}(x)\right\}dx. \tag{2.101}$$
Define the projected chain $\overline{M}$ as the Markov chain on $I$ generated by $\overline{\mathcal{L}} = \overline{\mathcal{T}}-\mathrm{Id}$, so that $\overline{\mathcal{L}}$ acts on $\overline{g}\in L^2(I)$ by
$$\overline{\mathcal{L}}\,\overline{g}(i,j) = \sum_{i'=1}^{L}\sum_{j'=1}^{m_{i'}}[\overline{g}(i',j')-\overline{g}(i,j)]\,\overline{T}((i,j),(i',j')). \tag{2.102}$$
Let $\overline{P}$ be the stationary distribution of $\overline{M}$; suppose $\overline{M}$ satisfies the Poincaré inequality
$$\operatorname{Var}_{\overline{P}}(\overline{g}) \le \overline{C}\,\overline{\mathcal{E}}(\overline{g},\overline{g}). \tag{2.103}$$

Then $M$ satisfies the Poincaré inequality
$$\operatorname{Var}_P(g) \le \max\left\{C\left(1+\left(\frac12+6K\right)\overline{C}\right),\ \frac{6K\overline{C}}{\lambda}\right\}\mathcal{E}(g,g). \tag{2.104}$$
Proof. We first relate $M$ to the chain $M'$ on $[L]\times\Omega$ defined as follows: $M'$ has transition probability from level $i$ to level $i'=i\pm1$ given by $\sum_{j=1}^m r_iw_{i,j}\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},\,1\right\}$ rather than $r_i\min\left\{\frac{r_{i'}p_{i'}(x)}{r_ip_i(x)},\,1\right\}$. We show that $\mathcal{E}'(g,g)\le\mathcal{E}(g,g)$ below; this basically follows from the fact that the probability flow between any two distinct points in $M'$ is at most the probability flow in $M$. More precisely (letting $p\mathcal{L}$ denote the functional defined by $(p\mathcal{L}f)(x) = p(x)(\mathcal{L}f)(x)$),
$$p\mathcal{L}'g = \sum_{i=1}^L r_ip_i\mathcal{L}_ig_i + \frac{\lambda}{2}\sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\sum_{j=1}^m r_iw_{i,j}\,p_{(i,j)}(x)\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},1\right\}(g_{i'}-g_i)(x) \tag{2.105}$$
$$\mathcal{E}'(g,g) = -\langle g,\mathcal{L}'g\rangle_P \tag{2.106}$$
$$= \sum_{i=1}^L r_i\,\mathcal{E}_i(g_i,g_i) + \frac{\lambda}{2}\sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\int_\Omega\sum_{j=1}^m r_iw_{i,j}\,p_{(i,j)}(x)\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},1\right\}\left(g_i(x)^2 - g_i(x)g_{i'}(x)\right)dx \tag{2.107–2.108}$$
$$= \sum_{i=1}^L r_i\,\mathcal{E}_i(g_i,g_i) + \frac{\lambda}{4}\sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\int_\Omega\sum_{j=1}^m r_iw_{i,j}\,p_{(i,j)}(x)\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},1\right\}(g_i(x)-g_{i'}(x))^2\,dx \tag{2.109–2.110}$$
$$= \sum_{i=1}^L r_i\,\mathcal{E}_i(g_i,g_i) + \frac{\lambda}{4}\sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\int_\Omega\sum_{j=1}^m\min\left\{r_{i'}w_{i',j}\,p_{(i',j)}(x),\ r_iw_{i,j}\,p_{(i,j)}(x)\right\}(g_i(x)-g_{i'}(x))^2\,dx \tag{2.111}$$
$$\le \sum_{i=1}^L r_i\,\mathcal{E}_i(g_i,g_i) + \frac{\lambda}{4}\sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\int_\Omega\min\{r_{i'}p_{i'}(x),\ r_ip_i(x)\}(g_i(x)-g_{i'}(x))^2\,dx = \mathcal{E}(g,g). \tag{2.112}$$

Thus it suffices to prove a Poincaré inequality for $M'$. We will apply Theorem 2.6.3 with
$$T_{(i,j),(i',j')}((i,x),(i',dy)) = \begin{cases}\frac{\lambda}{2}\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},1\right\}\delta_x(dy), & j=j',\ i'=i\pm1\\ 0, & \text{else.}\end{cases} \tag{2.113}$$

First we calculate $\widetilde{P}_{(i,j),(i',j)}$. We have
$$Q_{(i,j),(i',j)}(\{i\}\times\Omega, \{i'\}\times\Omega) = \frac{\lambda}{2}\int_\Omega\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}},\ p_{(i,j)}(x)\right\}dx = \frac{\lambda}{2}\,\delta_{(i,j),(i',j)}, \tag{2.114}$$
so
$$\widetilde{p}_{(i,j),(i',j)}(x) = \frac{p_{(i,j)}(x)\,T_{(i,j),(i',j)}(x,\{i'\}\times\Omega)}{Q_{(i,j),(i',j)}(\{i\}\times\Omega,\{i'\}\times\Omega)} = \frac{p_{(i,j)}(x)\,\frac{\lambda}{2}\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},1\right\}}{\frac{\lambda}{2}\,\delta_{(i,j),(i',j)}} = \frac{1}{\delta_{(i,j),(i',j)}}\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}},\ p_{(i,j)}(x)\right\}. \tag{2.115–2.116}$$
We check the three assumptions in Theorem 2.6.3.

1. From Assumption 1, (2.105), and (2.113),
$$p\mathcal{L}' = \sum_{i=1}^L\sum_{j=1}^m r_iw_{i,j}\,p_{(i,j)}\,\mathcal{L}_{i,j} + \sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\sum_{j=1}^m r_iw_{i,j}\,p_{(i,j)}\,(\mathcal{T}_{(i,j),(i',j)}-\mathrm{Id}). \tag{2.117}$$

2. This follows immediately from Assumption 2.

3. Let $S_{\sim}$ consist of all pairs $((1,j),(1,j'))$ and $S_{\leftrightarrow}$ consist of all pairs $((i,j),(i'=i\pm1,j))$. (The other pairs $((i,j),(i',j'))$ satisfy $\overline{T}((i,j),(i',j')) = 0$, so they do not matter.) We check equations (2.47)–(2.50).

(2.47) By (2.100),
$$\sum_{j'=1}^m \overline{T}((1,j),(1,j'))\,\chi^2(P_{(1,j')}\|P_{(1,j)}) = \sum_{j'=1}^m \frac{w_{1,j'}}{\chi^2_{\max}(P_{(1,j)}\|P_{(1,j')})}\,\chi^2(P_{(1,j')}\|P_{(1,j)}) \le 1, \tag{2.118–2.119}$$
so (2.47) is satisfied with $K_1 = 1$.

(2.48) We apply Lemma D.1.3 with $P = P_{(i,j)}$ and $Q$ the measure with density $\frac{r_{i'}w_{i',j}\,p_{(i',j)}}{r_iw_{i,j}}$. Noting that $\widetilde{P}_{(i,j),(i',j)} = \frac{1}{\delta_{(i,j),(i',j)}}\min\{Q, P_{(i,j)}\}$ by (2.116), we obtain
$$\chi^2\left(\widetilde{P}_{(i,j),(i',j)}\,\middle\|\,P_{(i,j)}\right) \le \frac{1}{\delta_{(i,j),(i',j)}}. \tag{2.120}$$
By (2.100) and (2.120),
$$\sum_{i'=i\pm1}\overline{T}((i,j),(i',j))\,\chi^2(\widetilde{P}_{(i,j),(i',j)}\|P_{(i,j)}) \le \sum_{i'=i\pm1}K\,\delta_{(i,j),(i',j)}\cdot\frac{1}{\delta_{(i,j),(i',j)}} \le 2K, \tag{2.121–2.122}$$
so (2.48) is satisfied with $K_2 = 2K$.

(2.49) By (2.100) and (2.114),
$$\overline{T}((i,j),(i',j)) = K\,\delta_{(i,j),(i',j)} = \frac{2K}{\lambda}\cdot\frac{\lambda}{2}\,\delta_{(i,j),(i',j)} = \frac{2K}{\lambda}\,Q_{(i,j),(i',j)}(\{i\}\times\Omega,\{i'\}\times\Omega), \tag{2.123}$$
so (2.49) is satisfied with $K_3 = \frac{2K}{\lambda}$.

(2.50) We have that
$$r_1w_{1,j}\,\overline{T}((1,j),(1,j')) = \frac{r_1w_{1,j}w_{1,j'}}{\chi^2_{\max}(P_{(1,j)}\|P_{(1,j')})} = r_1w_{1,j'}\,\overline{T}((1,j'),(1,j)) \tag{2.124}$$
and for $i'=i\pm1$,
$$r_iw_{i,j}\,\overline{T}((i,j),(i',j)) = K\,r_iw_{i,j}\,\delta_{(i,j),(i',j)} = K\,r_{i'}w_{i',j}\,\delta_{(i',j),(i,j)} = r_{i'}w_{i',j}\,\overline{T}((i',j),(i,j)). \tag{2.125–2.126}$$

Hence the conclusion of Theorem 2.6.3 holds with $K_1 = 1$, $K_2 = 2K$, and $K_3 = \frac{2K}{\lambda}$.
2.7 Simulated tempering for Gaussians with equal variance

2.7.1 Mixtures of Gaussians all the way down
Theorem 2.7.1. Let $M$ be the continuous simulated tempering chain for the distributions with density functions
$$p_i(x) \propto \sum_{j=1}^m w_j\,e^{-\frac{\beta_i\|x-\mu_j\|^2}{2\sigma^2}} \tag{2.127}$$
with rate $\Omega\left(\frac{1}{D^2}\right)$, relative probabilities $r_i$, and temperatures $0<\beta_1<\cdots<\beta_L=1$, where
$$D = \max\left\{\max_j\|\mu_j\|,\ \sigma\right\} \tag{2.128}$$
$$\beta_1 = \Theta\left(\frac{\sigma^2}{D^2}\right) \tag{2.129}$$
$$\frac{\beta_{i+1}}{\beta_i} \le 1+\frac1d \tag{2.130}$$
$$L = \Theta\left(d\ln\left(\frac D\sigma\right)+1\right) \tag{2.131}$$
$$r = \frac{\min_i r_i}{\max_i r_i}. \tag{2.132}$$
Then $M$ satisfies the Poincaré inequality
$$\operatorname{Var}(g) \le O\left(\frac{L^2D^2}{r^2}\right)\mathcal{E}(g,g) = O\left(\frac{\left(d\ln\left(\frac D\sigma\right)+1\right)^2D^2}{r^2}\right)\mathcal{E}(g,g). \tag{2.133}$$
Proof. Note that including $\sigma$ in the definition (2.128) of $D$ ensures $\beta_1 = O(1)$. We check all the conditions of Theorem 2.6.5; we let $K=1$.

1. Consider the decomposition where
$$p_{i,j}(x) \propto \exp\left(-\frac{\beta_i\|x-\mu_j\|^2}{2\sigma^2}\right), \tag{2.134}$$
$w_{i,j} = w_j$, and $M_{i,j}$ is the Langevin chain on $p_{i,j}$, so that $\mathcal{E}_{i,j}(g_i,g_i) = \int_{\mathbb{R}^d}\|\nabla g_i\|^2\,p_{i,j}\,dx$. We check (2.97)–(2.98):
$$\mathcal{E}_i(g_i,g_i) = \int_{\mathbb{R}^d}\|\nabla g_i\|^2\,p_i\,dx = \int_{\mathbb{R}^d}\|\nabla g_i\|^2\sum_{j=1}^m w_j\,p_{i,j}\,dx = \sum_{j=1}^m w_j\,\mathcal{E}_{i,j}(g_i,g_i). \tag{2.135}$$

2. By Theorem A.2.1 and the fact that $\beta_1 = \Omega\left(\frac{\sigma^2}{D^2}\right)$, $\mathcal{E}_{i,j}$ satisfies the Poincaré inequality
$$\operatorname{Var}_{p_{i,j}}(g_i) \le \frac{\sigma^2}{\beta_i}\,\mathcal{E}_{i,j}(g_i,g_i) = O(D^2)\,\mathcal{E}_{i,j}(g_i,g_i). \tag{2.136}$$
3. To prove a Poincaré inequality for the projected chain, we use the method of canonical paths, Theorem A.1.3. Consider the graph $G$ on $\bigcup_{i=1}^L\{i\}\times[m_i]$ that is the complete graph on the slice $i=1$, whose only other edges are the vertical edges $\{(i,j),(i\pm1,j)\}$. $\overline{T}$ is nonzero exactly on the edges of $G$. For vertices $x=(i,j)$ and $y=(i',j')$, define the canonical path $\gamma_{x,y}$ as follows.

(a) If $j=j'$, without loss of generality $i<i'$; define the path to be $(i,j),(i+1,j),\ldots,(i',j)$.

(b) Else, define the path to be $(i,j),(i-1,j),\ldots,(1,j),(1,j'),(2,j'),\ldots,(i',j')$.

We calculate the transition probabilities (2.100), which are given in terms of the $\chi^2$ distances $\chi^2_{\max}(P_{1,j}\|P_{1,j'})$ and the overlaps $\delta_{(i,j),(i',j)}$.

(a) Bounding $\chi^2(P_{1,j}\|P_{1,j'})$: by Lemma D.2.1 with $\Sigma_1=\Sigma_2=\frac{\sigma^2}{\beta_1}I_d$,
$$\chi^2(P_{1,j}\|P_{1,j'}) = \chi^2\left(N\left(\mu_j,\tfrac{\sigma^2}{\beta_1}I_d\right)\middle\|N\left(\mu_{j'},\tfrac{\sigma^2}{\beta_1}I_d\right)\right) = e^{\beta_1\|\mu_j-\mu_{j'}\|^2/\sigma^2}-1 \le \frac14 \tag{2.137–2.138}$$
when $\beta_1 \le c\,\frac{\sigma^2}{D^2}$ for a small enough constant $c$ (using $\|\mu_j-\mu_{j'}\|\le 2D$).

(b) Bounding $\delta_{(i,j),(i',j)}$: suppose that $\frac{\beta_{i+1}}{\beta_i} = 1+\varepsilon$ where $\varepsilon\le\frac1d$. Then applying Lemma D.2.1 to $\Sigma_1=\frac{\sigma^2}{\beta_i}I_d$ and $\Sigma_2=\frac{\sigma^2}{\beta_{i+1}}I_d$,
$$\chi^2(P_{i+1,j}\|P_{i,j}) = \chi^2\left(N\left(\mu_j,\tfrac{\sigma^2}{\beta_{i+1}}I_d\right)\middle\|N\left(\mu_j,\tfrac{\sigma^2}{\beta_i}I_d\right)\right) \tag{2.139}$$
$$= \left(\frac{\beta_{i+1}^2}{\beta_i}\right)^{\frac d2}(2\beta_{i+1}-\beta_i)^{-\frac d2}-1 \tag{2.140}$$
$$= \left(\frac{\beta_{i+1}}{\beta_i}\right)^{\frac d2}\left(2-\frac{\beta_i}{\beta_{i+1}}\right)^{-\frac d2}-1 \tag{2.141}$$
$$= O\left((1+d\varepsilon)\left(2-\frac{1}{1+\varepsilon}\right)^{-\frac d2}-1\right) = O(d\varepsilon), \tag{2.142}$$
so $\chi^2(P_{i+1,j}\|P_{i,j}) \le \frac14$ when $\varepsilon\le\frac{c_1}{d}$ for a small enough constant $c_1$. Similarly, $\chi^2(P_{i-1,j}\|P_{i,j}) \le \frac14$ for $\varepsilon\le\frac{c_1}{d}$.

Note that for probability distributions $P_1, P_2$ with density functions $p_1, p_2$,
$$\left(\int_\Omega(p_1-\min\{p_1,p_2\})\,dx\right)^2 \le \int_\Omega\frac{(p_1-\min\{p_1,p_2\})^2}{p_1}\,dx \le \chi^2(P_2\|P_1) \tag{2.143}$$
$$\implies \int_\Omega\min\{p_1,p_2\}\,dx \ge 1-\sqrt{\chi^2(P_2\|P_1)}. \tag{2.144}$$
Moreover, we have
$$\delta_{(i,j),(i\pm1,j)} = \int\min\left\{\frac{r_{i'}w_j\,p_{i',j}(x)}{r_iw_j},\ p_{i,j}(x)\right\}dx \ge r\int\min\{p_{i',j}(x),\ p_{i,j}(x)\}\,dx. \tag{2.145}$$
Hence $\delta_{(i,j),(i\pm1,j)} \ge \frac r2$.
Note that every canonical path has length $|\gamma_{x,y}| \le 2L-1$. Consider the two kinds of edges in $G$.

(a) $z=(1,j)$, $w=(1,k)$. We have
$$\frac{\sum_{\gamma_{x,y}\ni\{(1,j),(1,k)\}}|\gamma_{x,y}|\,\overline{p}(x)\overline{p}(y)}{\overline{p}((1,j))\,\overline{T}((1,j),(1,k))} \le \frac{(2L-1)\,\overline{P}([L]\times\{j\})\,\overline{P}([L]\times\{k\})}{\overline{p}((1,j))\,\overline{T}((1,j),(1,k))} \tag{2.146}$$
because the paths going through $zw$ are exactly those between $[L]\times\{j\}$ and $[L]\times\{k\}$. Now note
$$\frac{\overline{P}([L]\times\{j\})}{\overline{p}((1,j))} \le \frac Lr \tag{2.147}$$
$$\overline{P}([L]\times\{k\}) = w_k \tag{2.148}$$
$$\overline{T}((1,j),(1,k)) = \frac{w_k}{\chi^2_{\max}(P_{1,j}\|P_{1,k})} = \Omega(w_k) \tag{2.149}$$
by (2.138). Thus $(2.146) = O\left(\frac{L^2}{r}\right)$.

(b) $z=(i,j)$, $w=(i-1,j)$. We have
$$\frac{\sum_{\gamma_{x,y}\ni\{(i,j),(i-1,j)\}}|\gamma_{x,y}|\,\overline{p}(x)\overline{p}(y)}{\overline{p}((i,j))\,\overline{T}((i,j),(i-1,j))} \le \frac{(2L-1)\,\overline{P}(S)\,\overline{P}(S^c)}{\overline{p}((i,j))\,\overline{T}((i,j),(i-1,j))} \tag{2.150}$$
where $S = \{i,\ldots,L\}\times\{j\}$. This follows because cutting the edge $zw$ splits the graph into two connected components, one of which is $S$; the paths which go through $zw$ are exactly those between $x,y$ where one of $x,y$ is in $S$ and the other is not. Now note
$$\frac{\overline{P}(S)}{\overline{p}((i,j))} = \frac{\overline{P}(\{i,\ldots,L\}\times\{j\})}{\overline{p}((i,j))} \le \frac Lr \tag{2.151}$$
$$\overline{P}(S^c) \le 1 \tag{2.152}$$
$$\overline{T}((i,j),(i-1,j)) = \delta_{(i,j),(i-1,j)} = \Omega(r) \tag{2.153}$$
by (2.100) (with $K=1$) and the inequality $\delta_{(i,j),(i-1,j)} \ge \frac r2$. Hence $(2.150) = O\left(\frac{L^2}{r^2}\right)$.

By Theorem A.1.3, the projected chain satisfies a Poincaré inequality with constant $\overline{C} = O\left(\frac{L^2}{r^2}\right)$.

Thus by Theorem 2.6.5, the simulated tempering chain satisfies a Poincaré inequality with constant
$$O\left(\max\left\{D^2\left(1+\frac{L^2}{r^2}\right),\ \frac{L^2}{r^2\lambda}\right\}\right). \tag{2.154}$$
Taking $\lambda = \frac{1}{D^2}$ makes this $O\left(\frac{D^2L^2}{r^2}\right)$.
Remark 2.7.2. Note there is no dependence on either $w_{\min}$ or the number of components.

If $p(x) \propto \sum_{j=1}^m w_j\,e^{-\frac{\|x-\mu_j\|^2}{2\sigma^2}}$ and we have access to $\nabla\ln(p * N(0,\tau I))$ for any $\tau$, then we can sample from $p$ efficiently, no matter how many components there are. In fact, passing to the continuous limit, we can sample from any $p$ of the form $p = w * N(0,\sigma^2I_d)$ where $\|w\|_1 = 1$ and $\operatorname{Supp}(w) \subseteq B_D$.

In this way, Theorem 2.7.1 says that evolution of $p$ under the heat kernel is the most "natural" way to do simulated tempering. We don't have access to $p * N(0,\tau I)$, but we will show that $p^\beta$ approximates it well (within a factor of $\frac{1}{w_{\min}}$).

Entropy-SGD [Cha+16] attempts to estimate $\nabla\ln(p * N(0,\tau I))$ for use in a temperature-based algorithm; this remark provides some heuristic justification for why this is a natural choice.
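To make the object of Theorem 2.7.1 concrete, here is a minimal, illustrative discretization of a simulated tempering Langevin chain for a two-mode Gaussian mixture in one dimension. This is a hedged sketch, not the thesis's Algorithm 1: it assumes uniform relative probabilities $r_i$ (so no partition function estimates appear in the acceptance ratio), clamps level proposals at the boundary, and uses a plain unadjusted Langevin step.

```python
import numpy as np

rng = np.random.default_rng(1)

mus = np.array([-4.0, 4.0])          # well-separated modes
w = np.array([0.5, 0.5])
sigma = 1.0
betas = np.geomspace(0.05, 1.0, 5)   # temperatures beta_1 < ... < beta_L = 1
eta = 0.05                           # Langevin step size
swap_prob = 0.05                     # chance of a level-swap proposal per step

def unnorm(x, beta):
    # Unnormalized density p_beta(x) = sum_j w_j exp(-beta (x-mu_j)^2 / (2 sigma^2)).
    return (w * np.exp(-beta * (x - mus) ** 2 / (2 * sigma**2))).sum()

def grad_log_p(x, beta):
    un = w * np.exp(-beta * (x - mus) ** 2 / (2 * sigma**2))
    return (un / un.sum()) @ (-(beta / sigma**2) * (x - mus))

i, x = 0, 0.0
samples = []
for _ in range(40000):
    # Type 1 move: discretized Langevin step at the current temperature.
    x = x + eta * grad_log_p(x, betas[i]) + np.sqrt(2 * eta) * rng.normal()
    # Type 2 move: propose an adjacent level, accept with a Metropolis ratio.
    if rng.random() < swap_prob:
        j = int(np.clip(i + rng.choice([-1, 1]), 0, len(betas) - 1))
        if rng.random() < min(unnorm(x, betas[j]) / unnorm(x, betas[i]), 1.0):
            i = j
    if i == len(betas) - 1:          # record x only when at beta = 1 (the target)
        samples.append(x)

samples = np.array(samples)
# At beta = 1 alone the barrier between modes is ~8 nats; the hot levels are
# what let the recorded samples reach both modes.
```

A single-temperature Langevin chain started at one mode would essentially never cross to the other within this budget, which is the failure mode the tempering ladder is designed to avoid.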
2.7.2 Comparing to the actual chain

The following lemma shows that changing the temperature is approximately the same as changing the variance of the Gaussian. We state it more generally, for arbitrary mixtures of distributions of the form $e^{-f_i(x)}$.

Lemma 2.7.3 (Approximately scaling the temperature). Let $p_i(x) = e^{-f_i(x)}$ be probability distributions on $\Omega$ such that for all $\beta>0$, $\int_\Omega e^{-\beta f_i(x)}\,dx < \infty$. Let
$$p(x) = \sum_{i=1}^n w_ip_i(x) \tag{2.155}$$
$$f(x) = -\ln p(x), \tag{2.156}$$
where $w_1,\ldots,w_n>0$ and $\sum_{i=1}^n w_i = 1$. Let $w_{\min} = \min_{1\le i\le n} w_i$.

Define the distribution at inverse temperature $\beta$ to be $p^\beta(x)$, where
$$g^\beta(x) = e^{-\beta f(x)} \tag{2.157}$$
$$Z_\beta = \int_\Omega e^{-\beta f(x)}\,dx \tag{2.158}$$
$$p^\beta(x) = \frac{g^\beta(x)}{Z_\beta}. \tag{2.159}$$
Define the distribution $\widetilde{p}^\beta(x)$ by
$$\widetilde{g}^\beta(x) = \sum_{i=1}^n w_ie^{-\beta f_i(x)} \tag{2.160}$$
$$\widetilde{Z}_\beta = \int_\Omega\sum_{i=1}^n w_ie^{-\beta f_i(x)}\,dx \tag{2.161}$$
$$\widetilde{p}^\beta(x) = \frac{\widetilde{g}^\beta(x)}{\widetilde{Z}_\beta}. \tag{2.162}$$
Then for $0\le\beta\le1$ and all $x$,
$$g^\beta(x) \in \left[1, \frac{1}{w_{\min}}\right]\widetilde{g}^\beta(x) \tag{2.163}$$
$$p^\beta(x) \in \left[1, \frac{1}{w_{\min}}\right]\frac{\widetilde{Z}_\beta}{Z_\beta}\,\widetilde{p}^\beta(x) \subseteq \left[w_{\min}, \frac{1}{w_{\min}}\right]\widetilde{p}^\beta(x). \tag{2.164}$$
Proof. By the power mean inequality, for $0\le\beta\le1$,
$$g^\beta(x) = \left(\sum_{i=1}^n w_ie^{-f_i(x)}\right)^\beta \ge \sum_{i=1}^n w_ie^{-\beta f_i(x)} = \widetilde{g}^\beta(x). \tag{2.165–2.166}$$
On the other hand, given $x$, setting $j = \operatorname{argmin}_i f_i(x)$,
$$g^\beta(x) = \left(\sum_{i=1}^n w_ie^{-f_i(x)}\right)^\beta \le \left(e^{-f_j(x)}\right)^\beta \le \frac{1}{w_{\min}}\sum_{i=1}^n w_ie^{-\beta f_i(x)} = \frac{1}{w_{\min}}\,\widetilde{g}^\beta(x). \tag{2.167–2.169}$$
This gives (2.163). Integrating (2.163) over $\Omega$ implies $\frac{\widetilde{Z}_\beta}{Z_\beta}\in[w_{\min}, 1]$, which gives (2.164).
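The two power-mean bounds above can be checked numerically at random points; the following sketch (an assumed toy setup, not from the thesis) verifies the sandwich (2.163), $\widetilde{g}^\beta(x) \le g^\beta(x) \le \widetilde{g}^\beta(x)/w_{\min}$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Check (2.163): for p = sum_i w_i e^{-f_i} and 0 <= beta <= 1,
#   g_tilde^beta(x) <= (e^{-f(x)})^beta <= (1/w_min) g_tilde^beta(x).
n = 4
w = rng.dirichlet(np.ones(n))
w_min = w.min()
for _ in range(1000):
    f_vals = rng.uniform(-5, 5, size=n)        # values f_i(x) at an arbitrary x
    for beta in [0.0, 0.3, 0.7, 1.0]:
        g_beta = (w @ np.exp(-f_vals)) ** beta      # g^beta(x) = e^{-beta f(x)}
        g_tilde = w @ np.exp(-beta * f_vals)        # g_tilde^beta(x)
        assert g_tilde <= g_beta * (1 + 1e-12)      # power mean (lower side)
        assert g_beta <= g_tilde / w_min * (1 + 1e-12)  # argmin bound (upper side)
```

Both inequalities are tight at $\beta\in\{0,1\}$ and when one component dominates, which is why the $1/w_{\min}$ loss is unavoidable in general.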
Lemma 2.7.4. Let $P_1, P_2$ be probability measures on $\mathbb{R}^d$ with density functions $p_1\propto e^{-f_1}$, $p_2\propto e^{-f_2}$ satisfying $\|f_1-f_2\|_\infty \le \frac{\Delta}{2}$. Then
$$\frac{\mathcal{E}_{P_1}(g,g)}{\|g\|^2_{L^2(P_1)}} \ge e^{-2\Delta}\,\frac{\mathcal{E}_{P_2}(g,g)}{\|g\|^2_{L^2(P_2)}}. \tag{2.170}$$

Proof. The ratio between $p_1$ and $p_2$ is at most $e^\Delta$, so
$$\frac{\int_{\mathbb{R}^d}\|\nabla g(x)\|^2\,p_1(x)\,dx}{\int_{\mathbb{R}^d}|g(x)|^2\,p_1(x)\,dx} \ge \frac{e^{-\Delta}\int_{\mathbb{R}^d}\|\nabla g(x)\|^2\,p_2(x)\,dx}{e^{\Delta}\int_{\mathbb{R}^d}|g(x)|^2\,p_2(x)\,dx}. \tag{2.171}$$
Lemma 2.7.5. Let $M$ and $\widetilde{M}$ be two continuous simulated tempering Langevin chains with functions $f_i$, $\widetilde{f}_i$, respectively, for $i\in[L]$, with rate $\lambda$, and with relative probabilities $r_i$. Let their Dirichlet forms be $\mathcal{E}$ and $\widetilde{\mathcal{E}}$ and their stationary measures be $P$ and $\widetilde{P}$.

Suppose that $\left\|f_i(x)-\widetilde{f}_i(x)\right\|_\infty \le \frac{\Delta}{2}$. Then⁷
$$\frac{\mathcal{E}(g,g)}{\operatorname{Var}_P(g)} \ge e^{-3\Delta}\,\frac{\widetilde{\mathcal{E}}(g,g)}{\operatorname{Var}_{\widetilde{P}}(g)}. \tag{2.172}$$

⁷If adjacent temperatures are close enough, then $A$ and $B$ in the proof are close, so $[\min\{A,B\}, \max\{Ae^\Delta, Be^\Delta\}] \subseteq [C, C\cdot O(e^\Delta)]$ for some $C$, improving the factor to $\Omega(e^{-2\Delta})$. A more careful analysis would likely improve the final dependence on $w_{\min}$ in Theorem 2.4.2 from $\frac{1}{w_{\min}^6}$ to $\frac{1}{w_{\min}^4}$.

Proof. By Lemma 2.7.4, for each $i$,
$$\frac{\mathcal{E}_i(g_i,g_i)}{\operatorname{Var}_{P_i}(g_i)} \ge e^{-2\Delta}\,\frac{\widetilde{\mathcal{E}}_i(g_i,g_i)}{\operatorname{Var}_{\widetilde{P}_i}(g_i)} \tag{2.173}$$
$$\implies \frac{\sum_{i=1}^L r_i\,\mathcal{E}_i(g_i,g_i)}{\operatorname{Var}_P(g)} \ge e^{-2\Delta}\,\frac{\sum_{i=1}^L r_i\,\widetilde{\mathcal{E}}_i(g_i,g_i)}{\operatorname{Var}_{\widetilde{P}}(g)}. \tag{2.174}$$
By Lemma 2.7.3, we have $\frac{p_i}{\widetilde{p}_i}\in[A, Ae^\Delta]$ and $\frac{p_j}{\widetilde{p}_j}\in[B, Be^\Delta]$ for some $A, B \ge e^{-\Delta}$, so $\frac{\min\{r_ip_i, r_jp_j\}}{\min\{r_i\widetilde{p}_i, r_j\widetilde{p}_j\}} \in [\min\{A,B\}, \max\{Ae^\Delta, Be^\Delta\}] \subseteq [e^{-\Delta}, e^\Delta]$. Hence,
$$\frac{\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_ip_i, r_jp_j\}\,dx}{\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_i\widetilde{p}_i, r_j\widetilde{p}_j\}\,dx} \in [e^{-\Delta}, e^\Delta] \tag{2.175}$$
$$\frac{\operatorname{Var}_{P_i}(g_i)}{\operatorname{Var}_{\widetilde{P}_i}(g_i)} \in [A, Ae^\Delta] \tag{2.176}$$
$$\implies \frac{\frac{\lambda}{4}\sum_{i=1}^L\sum_{j=i\pm1}\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_ip_i, r_jp_j\}\,dx}{\frac{\lambda}{4}\sum_{i=1}^L\sum_{j=i\pm1}\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_i\widetilde{p}_i, r_j\widetilde{p}_j\}\,dx} \in [e^{-\Delta}, e^\Delta] \tag{2.177}$$
$$\frac{\operatorname{Var}_P(g)}{\operatorname{Var}_{\widetilde{P}}(g)} = \frac{\sum_{i=1}^L r_i\operatorname{Var}_{P_i}(g_i)}{\sum_{i=1}^L r_i\operatorname{Var}_{\widetilde{P}_i}(g_i)} \in [A, Ae^\Delta]. \tag{2.178}$$
Dividing (2.177) by (2.178) gives
$$\frac{\frac{\lambda}{4}\sum_{i=1}^L\sum_{j=i\pm1}\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_ip_i, r_jp_j\}\,dx}{\operatorname{Var}_P(g)} \ge e^{-3\Delta}\,\frac{\frac{\lambda}{4}\sum_{i=1}^L\sum_{j=i\pm1}\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_i\widetilde{p}_i, r_j\widetilde{p}_j\}\,dx}{\operatorname{Var}_{\widetilde{P}}(g)}. \tag{2.179–2.180}$$
Adding (2.174) and (2.180) gives the result.
Theorem 2.7.6. Suppose $\sum_{j=1}^m w_j = 1$, $w_{\min} = \min_{1\le j\le m} w_j > 0$, and $D = \max\{\max_{1\le j\le m}\|\mu_j\|, \sigma\}$. Let $M$ be the continuous simulated tempering chain for the distributions
$$p_i(x) \propto \left(\sum_{j=1}^m w_j\,e^{-\frac{\|x-\mu_j\|^2}{2\sigma^2}}\right)^{\beta_i} \tag{2.181}$$
with rate $O\left(\frac{1}{D^2}\right)$, relative probabilities $r_i$, and temperatures $0<\beta_1<\cdots<\beta_L=1$ satisfying the same conditions as in Theorem 2.7.1. Then $M$ satisfies the Poincaré inequality
$$\operatorname{Var}(g) \le O\left(\frac{L^2D^2}{r^2w_{\min}^3}\right)\mathcal{E}(g,g) = O\left(\frac{d^2\left(\ln\left(\frac D\sigma\right)\right)^2D^2}{r^2w_{\min}^3}\right)\mathcal{E}(g,g). \tag{2.182}$$

Proof. Let $\widetilde{p}_i$ be the probability distributions of Theorem 2.7.1 with the same parameters as $p_i$, and let $\widetilde{p}$ be the stationary distribution of that simulated tempering chain. By Theorem 2.7.1, $\operatorname{Var}_{\widetilde{P}}(g) = O\left(\frac{L^2D^2}{r^2}\right)\widetilde{\mathcal{E}}(g,g)$. By Lemma 2.7.3, $\frac{p_i}{\widetilde{p}_i}\in\left[1,\frac{1}{w_{\min}}\right]\frac{\widetilde{Z}_i}{Z_i}$. Now apply Lemma 2.7.5 with $e^\Delta = \frac{1}{w_{\min}}$.
2.8 Discretization
Throughout this section, let $f$ be as in Theorem 2.4.3: $f(x) = -\ln\left(\sum_{i=1}^m w_ie^{-f_0(x-\mu_i)}\right)$, where $f_0$ is $\kappa$-strongly convex, $K$-smooth, and has its minimum at $0$.

Lemma 2.8.1. Fix times $0 < T_1 < \cdots < T_n \le T$. Let $p^T, q^T : [L]\times\mathbb{R}^d\to\mathbb{R}$ be probability density functions defined as follows (and let $P^T, Q^T$ denote the corresponding measures).

1. $p^T$ is the density function of the continuous simulated tempering Markov process as in Definition 2.5.1, but with fixed transition times $T_1,\ldots,T_n$. The component chains are Langevin diffusions on $p_i(x)\propto\left(\sum_{j=1}^m w_je^{-f_0(x-\mu_j)}\right)^{\beta_i}$.

2. $q^T$ is the discretized version as in Algorithm 1, again with fixed transition times $T_1,\ldots,T_n$, and with step size $\eta\le\frac{\sigma^2}{2}$.

Then
$$\operatorname{KL}(P^T\|Q^T) \lesssim \eta^2D^6K^7\left(\frac{D^2K^2}{\kappa}+d\right)Tn + \eta^2D^3K^3\max_i\mathbb{E}_{x\sim P^0(\cdot,i)}\|x-x^*\|_2^2 + \eta D^2K^2dT,$$
where $x^*$ is the maximizer of $\sum_{j=1}^m w_je^{-f_0(x-\mu_j)}$, which satisfies $\|x^*\| = O(D)$ with $D = \max_j\|\mu_j\|$.

Before proving the above statement, we make a note on the location of $x^*$, to make sense of $\max_i\mathbb{E}_{x\sim P^0(\cdot,i)}\|x-x^*\|_2^2$. Namely, we show:
Lemma 2.8.2 (Location of minimum). Let $x^* = \operatorname{argmin}_{x\in\mathbb{R}^d} f(x)$. Then $\|x^*\| \le D\left(\sqrt{\frac K\kappa}+1\right)$.

Proof. Recall that $f_i(x) = f_0(x-\mu_i)$. We claim that $f(0) \le \frac12KD^2$. Indeed, by smoothness we have $f_i(0) \le \frac12K\|\mu_i\|^2 \le \frac12KD^2$ for each $i$, which (since $\sum_iw_i = 1$) implies $f(0) \le \frac12KD^2$.

Hence $\min_{x\in\mathbb{R}^d}f(x) \le \frac12KD^2$. On the other hand, for any $x$ with $\|x\| \ge D$, strong convexity gives
$$f(x) \ge \min_i f_i(x) \ge \frac{\kappa}{2}\min_i\|x-\mu_i\|^2 \ge \frac{\kappa}{2}\left(\|x\|-D\right)^2.$$
Hence, if $\|x\| > D\left(\sqrt{\frac K\kappa}+1\right)$, then $f(x) > \frac12KD^2 \ge \min_{x\in\mathbb{R}^d}f(x)$. This implies the statement of the lemma.
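As a numerical sanity check of the lemma (with the assumed toy choice $f_0(x) = x^2$, so $\kappa = K = 2$, and hypothetical means and weights), the global minimizer of $f$ found by grid search stays within the claimed radius:

```python
import numpy as np

# Toy instance of Lemma 2.8.2: f_0(x) = x^2 is 2-strongly convex and 2-smooth
# with minimum at 0, so kappa = K = 2 (an assumed example, not from the thesis).
mus = np.array([-3.0, 1.0, 2.5])
w = np.array([0.2, 0.5, 0.3])
D = np.abs(mus).max()
kappa = K = 2.0

# f(x) = -ln sum_j w_j exp(-f_0(x - mu_j)); find its minimizer on a fine grid.
xs = np.linspace(-20, 20, 400001)
f = -np.log(np.exp(-(xs[:, None] - mus) ** 2) @ w)
x_star = xs[np.argmin(f)]

# Lemma 2.8.2: ||x*|| <= D (sqrt(K/kappa) + 1) = 2D.
assert abs(x_star) <= D * (np.sqrt(K / kappa) + 1)
```

In fact the minimizer lands inside the convex hull of the means here; the lemma's bound is a crude but dimension-free envelope.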
We prove a few technical lemmas. First, we prove that the continuous chain is essentially contained in a ball of radius $O(D)$. More precisely, we show:

Lemma 2.8.3 (Reach of continuous chain). Let $P_T^\beta(X)$ be the Markov kernel corresponding to evolving the Langevin diffusion
$$dX_t = -\beta\nabla f(X_t)\,dt + \sqrt{2}\,dB_t,$$
where $f$ and $D$ are as defined in (2.9), for time $T$. Then
$$\mathbb{E}[\|X_T-x^*\|^2] \le \mathbb{E}[\|X_0-x^*\|^2] + \left(400\beta\,\frac{D^2K^2}{\kappa}+2d\right)T.$$

Proof. Let $Y_t = \|X_t-x^*\|^2$. By Itô's lemma, we have
$$dY_t = -2\left\langle X_t-x^*,\ \beta\,\frac{\sum_{i=1}^m w_ie^{-f_i(X_t)}\nabla f_i(X_t)}{\sum_{j=1}^m w_je^{-f_j(X_t)}}\right\rangle dt + 2d\,dt + \sqrt{8}\sum_{i=1}^d (X_t-x^*)_i\,d(B_i)_t. \tag{2.183}$$
We will show that
$$-\langle X_t-x^*,\nabla f_i(X_t)\rangle \le 100\,\frac{D^2K^2}{\kappa}.$$
Indeed, since $f_i(x) = f_0(x-\mu_i)$, strong convexity and smoothness of $f_0$ give
$$\langle X_t,\nabla f_i(X_t)\rangle \ge \frac{\kappa}{2}\|X_t\|^2 - \frac{D^2(2\kappa+K)^2}{2\kappa} - KD^2.$$
Also, by the Hessian bound $\kappa I \preceq \nabla^2f_0(x) \preceq KI$ and Lemma 2.8.2, we have
$$\langle x^*,\nabla f_i(X_t)\rangle \le \|x^*\|\,\|\nabla f_i(X_t)\| \le D\left(\sqrt{\tfrac K\kappa}+1\right)K\,\|X_t-\mu_i\| \le D\left(\sqrt{\tfrac K\kappa}+1\right)K\,(\|X_t\|+D).$$
Hence,
$$-\langle X_t-x^*,\nabla f_i(X_t)\rangle \le -\frac{\kappa}{2}\|X_t\|^2 + \frac{D^2(2\kappa+K)^2}{2\kappa} + KD^2 + D\left(\sqrt{\tfrac K\kappa}+1\right)K\,(\|X_t\|+D).$$
Maximizing the quadratic in $\|X_t\|$ on the right-hand side, we get
$$-\langle X_t-x^*,\nabla f_i(X_t)\rangle \le 100\,\frac{D^2K^2}{\kappa}.$$
Together with (2.183), we get
$$dY_t \le \left(400\beta\,\frac{D^2K^2}{\kappa}+2d\right)dt + \sqrt{8}\sum_{i=1}^d(X_t-x^*)_i\,d(B_i)_t.$$
Integrating, we get
$$Y_T \le Y_0 + \left(400\beta\,\frac{D^2K^2}{\kappa}+2d\right)T + \sqrt{8}\int_0^T\sum_{i=1}^d(X_t-x^*)_i\,d(B_i)_t.$$
Taking expectations and using the martingale property of the Itô integral, we get the claim of the lemma.
Next, we prove a technical bound on the drift of the discretized chain after $T/\eta$ discrete steps. The proofs follow calculations similar to those in [Dal16].

We will first need to bound the Hessian of $f$.

Lemma 2.8.4 (Hessian bound). For all $x\in\mathbb{R}^d$,
$$-2(DK)^2I \preceq \nabla^2f(x) \preceq KI.$$

Proof. For notational convenience, let $p(x) = \sum_{i=1}^m w_ie^{-f_i(x)}$; note that $f(x) = -\log p(x)$. We prove the upper bound first. The Hessian of $f$ satisfies
$$\nabla^2f = \frac{\sum_i w_ie^{-f_i}\,\nabla^2f_i}{p} - \frac{\frac12\sum_{i,j}w_iw_je^{-f_i}e^{-f_j}(\nabla f_i-\nabla f_j)^{\otimes2}}{p^2} \preceq \max_i\nabla^2f_i \preceq KI,$$
as we need. As for the lower bound, we have
$$\nabla^2f \succeq -\frac12\left(\max_{i,j}\|\nabla f_i-\nabla f_j\|^2\right)I.$$
But notice that since $f_i(x) = f_0(x-\mu_i)$, we have
$$\|\nabla f_i(x)-\nabla f_j(x)\| = \|\nabla f_0(x-\mu_i)-\nabla f_0(x-\mu_j)\| \le K\|\mu_i-\mu_j\| \le 2DK,$$
where the next-to-last inequality follows from the $K$-smoothness of $f_0$. This proves the statement of the lemma.
We introduce the following notation: denote by $P_T(i,x)$ the measure on $[L]\times\mathbb{R}^d$ corresponding to running the Langevin diffusion process for time $T$ on the second coordinate, starting at $(i,x)$, and keeping the first coordinate fixed. Define $\widehat{P}_T(i,x) : [L]\times\mathbb{R}^d\to\mathbb{R}$ as the analogous measure, except running the discretized Langevin chain for $\frac{T}{\eta}$ steps on the second coordinate, for $\frac{T}{\eta}$ an integer.

Lemma 2.8.5 (Bounding interval drift). In the setting of this section, let $i\in[L]$, $x\in\mathbb{R}^d$, and let $\eta\le\frac1K$. Then
$$\operatorname{KL}(P_T(i,x)\|\widehat{P}_T(i,x)) \le \frac{4D^6\eta^2K^7}{3}\left(\|x-x^*\|_2^2 + 8Td\right) + dTD^2\eta K^2.$$

Proof. Let $x_k$, $k\in[0, T/\eta-1]$, be a random variable distributed as $P_{\eta k}(i,x)$. By Lemma 2 in [Dal16] and Lemma 2.8.4, we have
$$\operatorname{KL}(P_T(i,x)\|\widehat{P}_T(i,x)) \le \frac{\eta^3D^2K^2}{3}\sum_{k=0}^{T/\eta-1}\mathbb{E}[\|\nabla f(x_k)\|_2^2] + dT\eta D^2K^2.$$
Similarly, the proof of Corollary 4 in [Dal16] implies that
$$\eta\sum_{k=0}^{T/\eta-1}\mathbb{E}[\|\nabla f(x_k)\|_2^2] \le 4D^4K^4\|x-x^*\|_2^2 + 8DKTd.$$
Plugging this in, we get the statement of the lemma.
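The discretization bias that Lemma 2.8.5 controls in KL can be seen in the simplest possible case. For $f(x) = x^2/2$, the unadjusted Langevin chain is an AR(1) process whose stationary variance is exactly $1/(1-\eta/2)$ rather than $1$, i.e., an $O(\eta)$ bias. The following sketch (a standard illustration, not from the thesis) confirms this:

```python
import numpy as np

rng = np.random.default_rng(3)

# ULA for f(x) = x^2/2:  x_{k+1} = x_k - eta f'(x_k) + sqrt(2 eta) xi_k,
# i.e., x_{k+1} = (1 - eta) x_k + sqrt(2 eta) xi_k, an AR(1) chain.
# Its stationary variance is 2 eta / (1 - (1 - eta)^2) = 1 / (1 - eta/2),
# so the chain targets N(0, 1/(1 - eta/2)) instead of N(0, 1).
eta = 0.1
x = np.zeros(200000)
for k in range(1, len(x)):
    x[k] = x[k - 1] - eta * x[k - 1] + np.sqrt(2 * eta) * rng.normal()

var_hat = x[1000:].var()           # drop a short burn-in
assert abs(var_hat - 1 / (1 - eta / 2)) < 0.06
```

Shrinking $\eta$ shrinks the bias linearly, which is exactly the trade-off (more steps versus less bias) that the step-size condition in Lemma 2.9.2 negotiates.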
To prove the main claim, we will use Lemma D.1.6, a decomposition theorem for the KL divergence of two mixtures of distributions, in terms of the KL divergence of the weights and of the components of the mixtures.

Proof of Lemma 2.8.1. Denote by $R(i,x)$ the measure on $[L]\times\mathbb{R}^d$ after one Type 2 transition in the simulated tempering process, starting at $(i,x)$.

We proceed by induction. Towards that, we can write
$$p^{T_{i+1}} = \frac12\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)\,P_{T_{i+1}-T_i}(j,x)\right) + \frac12\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)\,R(j,x)\right)$$
and similarly
$$q^{T_{i+1}} = \frac12\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)\,\widehat{P}_{T_{i+1}-T_i}(j,x)\right) + \frac12\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)\,R(j,x)\right).$$
(Note that the transition kernel $R$ is the same in the discretized and continuous versions.)

By convexity of KL divergence, we have
$$\operatorname{KL}(P^{T_{i+1}}\|Q^{T_{i+1}}) \le \frac12\operatorname{KL}\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)P_{T_{i+1}-T_i}(j,x)\,\middle\|\,\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)\widehat{P}_{T_{i+1}-T_i}(j,x)\right) + \frac12\operatorname{KL}\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)R(j,x)\,\middle\|\,\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)R(j,x)\right).$$

By Lemma D.1.6, we have that
$$\operatorname{KL}\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)R(j,x)\,\middle\|\,\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)R(j,x)\right) \le \operatorname{KL}(P^{T_i}\|Q^{T_i}).$$
Similarly, by Lemma 2.8.5 together with Lemma D.1.6, we have
$$\operatorname{KL}\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)P_{T_{i+1}-T_i}(j,x)\,\middle\|\,\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)\widehat{P}_{T_{i+1}-T_i}(j,x)\right) \le \operatorname{KL}(P^{T_i}\|Q^{T_i}) + \frac{4D^6K^6\eta^2}{3}\left(\max_j\mathbb{E}_{x\sim P^{T_i}(\cdot,j)}\|x-x^*\|_2^2 + 8(T_{i+1}-T_i)d\right) + d(T_{i+1}-T_i)\eta K^2.$$

By Lemmas 2.8.3 and 2.8.2, we have that for any $j\in[0,L-1]$,
$$\mathbb{E}_{x\sim P^{T_i}(\cdot,j)}\|x-x^*\|_2^2 \le \mathbb{E}_{x\sim P^{T_{i-1}}(\cdot,j)}\|x-x^*\|_2^2 + \left(400\,\frac{D^2K^2}{\kappa}+2d\right)(T_i-T_{i-1}).$$
Hence, inductively, we have
$$\mathbb{E}_{x\sim P^{T_i}(\cdot,j)}\|x-x^*\|_2^2 \le \mathbb{E}_{x\sim P^0(\cdot,j)}\|x-x^*\|_2^2 + \left(400\,\frac{D^2K^2}{\kappa}+2d\right)T_i.$$
Putting everything together, we have
$$\operatorname{KL}(P^{T_{i+1}}\|Q^{T_{i+1}}) \le \operatorname{KL}(P^{T_i}\|Q^{T_i}) + \frac{4\eta^2D^6K^7}{3}\left(\max_j\mathbb{E}_{x\sim P^0(\cdot,j)}\|x-x^*\|_2^2 + \left(400\,\frac{D^2K^2}{\kappa}+2d\right)T + 8Td\right) + dT\eta D^2K^2.$$
By induction, we hence have
$$\operatorname{KL}(P^T\|Q^T) \lesssim \eta^2D^6K^7\left(\frac{D^2K^2}{\kappa}+d\right)Tn + \eta^2D^3K^3\max_i\mathbb{E}_{x\sim P^0(\cdot,i)}\|x-x^*\|_2^2 + \eta D^2K^2dT,$$
as needed.
2.9 Proof of main theorem
Before putting everything together, we show how to estimate the partition functions. We will apply the following lemma to $g_1(x) = e^{-\beta_\ell f(x)}$ and $g_2(x) = e^{-\beta_{\ell+1}f(x)}$.

Lemma 2.9.1 (Estimating the partition function to within a constant factor). Suppose that $P_1$ and $P_2$ are probability measures on $\Omega$ with density functions (with respect to a reference measure) $p_1(x) = \frac{g_1(x)}{Z_1}$ and $p_2(x) = \frac{g_2(x)}{Z_2}$. Suppose $\widetilde{P}_1$ is a measure such that $d_{TV}(\widetilde{P}_1, P_1) < \frac{\varepsilon}{2C^2}$, and $\frac{g_2(x)}{g_1(x)}\in[0,C]$ for all $x\in\Omega$. Given $n$ samples $x_1,\ldots,x_n$ from $\widetilde{P}_1$, define the random variable
$$\widehat{r} = \frac1n\sum_{i=1}^n\frac{g_2(x_i)}{g_1(x_i)}. \tag{2.184}$$
Let
$$r = \mathbb{E}_{x\sim P_1}\frac{g_2(x)}{g_1(x)} = \frac{Z_2}{Z_1} \tag{2.185}$$
and suppose $r\ge\frac1C$. Then with probability $\ge 1-e^{-\frac{n\varepsilon^2}{2C^4}}$,
$$\left|\frac{\widehat{r}}{r}-1\right| \le \varepsilon. \tag{2.186}$$

Proof. We have that
$$\left|\mathbb{E}_{x\sim\widetilde{P}_1}\frac{g_2(x)}{g_1(x)} - \mathbb{E}_{x\sim P_1}\frac{g_2(x)}{g_1(x)}\right| \le C\,d_{TV}(\widetilde{P}_1, P_1) \le \frac{\varepsilon}{2C}. \tag{2.187}$$
The Chernoff bound gives
$$\mathbb{P}\left(\left|\widehat{r} - \mathbb{E}_{x\sim\widetilde{P}_1}\frac{g_2(x)}{g_1(x)}\right| \ge \frac{\varepsilon}{2C}\right) \le e^{-\frac{n\left(\frac{\varepsilon}{2C}\right)^2}{2\left(\frac{C}{2}\right)^2}} = e^{-\frac{n\varepsilon^2}{2C^4}}. \tag{2.188}$$
Combining (2.187) and (2.188) using the triangle inequality,
$$\mathbb{P}\left(|\widehat{r}-r| \ge \frac{\varepsilon}{C}\right) \le e^{-\frac{n\varepsilon^2}{2C^4}}. \tag{2.189}$$
Dividing by $r$ and using $r\ge\frac1C$ gives the result.
Lemma 2.9.2. Suppose that Algorithm 1 is run on $f(x) = -\ln\left(\sum_{j=1}^m w_j\exp\left(-\frac{\|x-\mu_j\|^2}{2\sigma^2}\right)\right)$ with temperatures $0<\beta_1<\cdots<\beta_\ell\le1$, $\ell\le L$, rate $\lambda$, and with partition function estimates $\widehat{Z}_1,\ldots,\widehat{Z}_\ell$ satisfying
$$\frac{\widehat{Z}_i}{Z_i}\Big/\frac{\widehat{Z}_1}{Z_1} \in \left[\left(1-\frac1L\right)^{i-1},\ \left(1+\frac1L\right)^{i-1}\right] \tag{2.190}$$
for all $1\le i\le\ell$. Suppose $\sum_{j=1}^m w_j = 1$, $w_{\min} = \min_{1\le j\le m} w_j > 0$, $D = \max\{\max_{1\le j\le m}\|\mu_j\|, \sigma\}$, and the parameters satisfy
$$\lambda = \Theta\left(\frac{1}{D^2}\right) \tag{2.191}$$
$$\beta_1 = \Theta\left(\frac{\sigma^2}{D^2}\right) \tag{2.192}$$
$$\frac{\beta_{i+1}}{\beta_i} \le 1+\frac{1}{d+\ln\left(\frac{1}{w_{\min}}\right)} \tag{2.193}$$
$$L = \Theta\left(\left(d+\ln\left(\frac{1}{w_{\min}}\right)\right)\ln\left(\frac{D}{\sigma}\right)+1\right) \tag{2.194}$$
$$T = \Omega\left(\frac{L^2D^2\ln\left(\frac{\ell}{\varepsilon w_{\min}}\right)}{w_{\min}^3}\right) \tag{2.195}$$
$$\eta = O\left(\frac{\sigma^3\varepsilon}{D^2}\min\left\{\frac{\sigma^4}{\left(\frac{D}{\sigma}+\sqrt{d}\right)T},\ \frac{1}{D^{12}},\ \frac{\sigma\varepsilon}{dT}\right\}\right). \tag{2.196}$$
Let $q^0$ be the distribution $\left(N\left(0,\frac{\sigma^2}{\beta_1}I_d\right), 1\right)$ on $[\ell]\times\mathbb{R}^d$. Then the distribution $q^T$ after running for time $T$ satisfies $\|p-q^T\|_1 \le \varepsilon$.

Setting $\varepsilon = O\left(\frac{1}{\ell L}\right)$ above and taking $n = \Omega\left(L^2\ln\left(\frac1\delta\right)\right)$ samples, with probability $1-\delta$ the estimate
$$\widehat{Z}_{\ell+1} = \widehat{r}\,\widehat{Z}_\ell, \qquad \widehat{r} := \frac1n\sum_{j=1}^n e^{(-\beta_{\ell+1}+\beta_\ell)f(x_j)} \tag{2.197}$$
also satisfies (2.190).
Proof. By the triangle inequality,
$$\|p - q^T\|_1 \le \|p - p^T\|_1 + \|p^T - q^T\|_1. \tag{2.198}$$
For the first term, by Cauchy-Schwarz (using capital letters for the probability measures),
$$\|p - p^T\|_1 \le \sqrt{\chi^2(P^T\,||\,P)} \le e^{-\frac{T}{2C}}\sqrt{\chi^2(P^0\,||\,P)} \tag{2.199}$$
where C = O(d²(ln(D/σ))²D²/w_min²) is an upper bound on the Poincaré constant as in Theorem 2.7.6. (The assumption on Ẑ_i means that r ≤ e.) Let p_i be the distribution of p at the i-th temperature, and p̃_i be as in Lemma 2.7.3.
To calculate χ²(P⁰ || P), first note by Lemma D.2.1 that the χ² distance between N(0, (σ²/β_1)I_d) and N(μ, (σ²/β_1)I_d) is ≤ e^{‖μ‖²β_1/σ²}. Then
$$\chi^2(P^0\,||\,P) \tag{2.200}$$
$$= O(\ell)\,\chi^2\left(N\left(0, \frac{\sigma^2}{\beta_1}I_d\right)\,\Big|\Big|\,P_1\right) \tag{2.201}$$
$$= O\left(\frac{\ell}{w_{\min}}\right)\left(1 + \chi^2\left(N\left(0, \frac{\sigma^2}{\beta_1}I_d\right)\,\Big|\Big|\,\widetilde{P}_1\right)\right) \tag{2.202}$$
by Lemma 2.7.3 and Lemma D.1.5,
$$= O\left(\frac{\ell}{w_{\min}}\right)\left(1 + \sum_{j=1}^m w_j\,\chi^2\left(N\left(0, \frac{\sigma^2}{\beta_1}I_d\right)\,\Big|\Big|\,N\left(\mu_j, \frac{\sigma^2}{\beta_1}I_d\right)\right)\right) \tag{2.203}$$
by Lemma D.1.4,
$$= O\left(\frac{e^{D^2\beta_1/\sigma^2}\,\ell}{w_{\min}}\right) = O\left(\frac{\ell}{w_{\min}}\right). \tag{2.204}$$
Together with (2.199) this gives ‖p − p^T‖_1 ≤ ε/3.
For the second term ‖p^T − q^T‖_1, we first condition on there not being too many transitions before time T. Let N_T = max{n : T_n ≤ T} be the number of transitions. Let C be as in Lemma D.4.1. Note that (Cn/(Tλ))^{−n} ≤ ε ⟺ e^{n(ln(Tλ/C) − ln n)} ≤ ε, and that this inequality holds when n ≥ eTλ/C + ln(1/ε). We have by Lemma D.4.1 that P(N_T ≥ eTλ/C + ln(1/ε)) ≤ ε/3. With our choice of T, ln(1/ε) = O(T).
If we condition on the event A of the T_i's being a particular sequence T_1, . . . , T_n with n < eTλ/C + ln(1/ε), Pinsker's inequality and Lemma 2.8.1 (with K = κ = 1/σ²) give us
$$\|p^T(\cdot\,|\,A) - q^T(\cdot\,|\,A)\|_1 \le \sqrt{2\,\mathrm{KL}(P^T(\cdot\,|\,A)\,||\,Q^T(\cdot\,|\,A))} \tag{2.205}$$
$$= O\left(\max\left\{\frac{\eta^2 D^6 T^2 \lambda\left(\frac{D^2}{\sigma^2}+d\right)}{\sigma^{14}},\ \frac{\eta^2 D^5}{\sigma^6},\ \frac{\eta D^2 d T}{\sigma^4}\right\}^{1/2}\right). \tag{2.206}$$
In order for this to be ≤ ε/3, we need (for some absolute constant C_1)
$$\eta \le \frac{C_1\sigma^3\varepsilon}{D^2}\min\left\{\frac{\sigma^4}{(D/\sigma+\sqrt{d})T},\ \frac{1}{D^{1/2}},\ \frac{\sigma\varepsilon}{dT}\right\}. \tag{2.207}$$
Putting everything together,
$$\|p^T - q^T\|_1 \le \mathbb{P}(N_T \ge cT\lambda) + \|p^T(\cdot\,|\,N_T < cT\lambda) - q^T(\cdot\,|\,N_T < cT\lambda)\|_1 \le \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \frac{2\varepsilon}{3}. \tag{2.208}$$
This gives ‖p − q^T‖_1 ≤ ε.
For the second part, setting ε = O(1/(ℓL)) gives ‖p̃_ℓ − q^T_ℓ‖_1 = O(1/L). We will apply Lemma 2.9.1. By Lemma D.3.1 the assumptions of Lemma 2.9.1 are satisfied with C = O(1), as we have
$$\frac{\beta_{i+1} - \beta_i}{\beta_i} = O\left(\frac{1}{\alpha D^2/\sigma^2 + d + \ln(1/w_{\min})}\right). \tag{2.209}$$
By Lemma 2.9.1, after collecting n = Ω(L² ln(1/δ)) samples, with probability ≥ 1 − δ, |(Ẑ_{ℓ+1}/Ẑ_ℓ)/(Z_{ℓ+1}/Z_ℓ) − 1| ≤ 1/L. Set Ẑ_{ℓ+1} = r̂Ẑ_ℓ. Then Ẑ_{ℓ+1}/Ẑ_ℓ ∈ [1 − 1/L, 1 + 1/L] · (Z_{ℓ+1}/Z_ℓ) and Ẑ_{ℓ+1}/Ẑ_1 ∈ [(1 − 1/L)^ℓ, (1 + 1/L)^ℓ] · (Z_{ℓ+1}/Z_1).
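The final containment rests on the elementary fact that a per-level multiplicative error of 1 ± 1/L, compounded over at most L − 1 levels, stays inside [1/e, e]. A quick numerical check of this fact (our own, not from the text):

```python
import math

# A per-level multiplicative error of (1 +- 1/L), compounded over at most
# L - 1 levels, keeps every estimate Z-hat_i / Z-hat_1 within a factor e
# of the true ratio Z_i / Z_1.
for L in [2, 5, 10, 100, 1000]:
    lo = (1 - 1 / L) ** (L - 1)
    hi = (1 + 1 / L) ** (L - 1)
    assert math.exp(-1) <= lo < 1 < hi <= math.e
    print(L, round(lo, 4), round(hi, 4))
```

This is why the per-temperature accuracy 1/L in Lemma 2.9.1 is exactly what the induction over at most L temperatures can tolerate.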
Proof of Theorem 2.4.2. Choose δ = ε/(2L), where L is the number of temperatures. Using Lemma 2.9.2 inductively, with probability 1 − ε/2 each estimate satisfies Ẑ_ℓ/Ẑ_1 ∈ [1/e, e]. Estimating the final distribution to within ε/2 accuracy gives the desired sample.
2.10 Conclusion
We initiated a study of sampling “beyond log-concavity.” In so doing, we developed
a new general technique to analyze simulated tempering, a classical algorithm used
in practice to combat multimodality but that has seen little theoretical analysis. The
technique is a new decomposition lemma for Markov chains based on decomposing
the Markov chain rather than just the state space. We have analyzed simulated
tempering with Langevin diffusion, but note that the technique can be applied to any
other Markov chain with a notion of temperature.
Our result is the first in its class (sampling multimodal, non-log-concave
distributions with gradient oracle access). Admittedly, distributions encountered in
practice are rarely mixtures of distributions with the same shape. However, we hope
that our techniques may be built on to provide guarantees for more practical proba-
bility distributions. An exciting research direction is to provide (average-case) guar-
antees for probability distributions encountered in practice, such as posteriors for
clustering, topic models, and Ising models. For example, the posterior distribution
for a mixture of Gaussians can have exponentially many terms, but may perhaps be
tractable in practice. Another interesting direction is to study other temperature
heuristics used in practice, such as particle filters [Sch12; DHW12; PJT15; GD17],
annealed importance sampling [Nea01], and parallel tempering [WSH09a]. We note
that our decomposition theorem can also be used to analyze simulated tempering in
the infinite switch limit [Mar+19].
Chapter 3
Online sampling from log-concave
distributions
This chapter is based on work in [LMV19].
3.1 Introduction
In this work, we study the following online sampling problem:
Problem 3.1.1. Consider a sequence of convex functions f_0, f_1, . . . , f_T : ℝ^d → ℝ
for some T ∈ ℕ, and let ε > 0. At each epoch t ∈ {1, . . . , T}, the function f_t is
given to us, so that we have oracle access to the gradients of the first t + 1 func-
tions f_0, f_1, . . . , f_t. The goal is to generate a sample from the distribution π_t(x) ∝
e^{−Σ_{k=0}^t f_k(x)} with some fixed total-variation (TV) error ε > 0 at each epoch t. The
samples at different time steps should be almost independent.
The motivation to study this problem comes from machine learning, Bayesian statis-
tics, optimization, and theoretical computer science, and various versions of this
problem have been considered in the literature; see [NR17; Dou+00; ADH10] and
the references therein.
In Bayesian statistics, the goal is to infer the probability distribution (the poste-
rior) of a certain parameter based on observations; however, rather than obtaining
all the observations at once, one constantly acquires new data, and must continuously
update the posterior distribution (rather than only after all data has been collected).
One practical application of online sampling is online logistic regression, where one
wishes to obtain samples from a changing Bayesian posterior distribution as data is
acquired over time. Another practical application of online sampling which has been
well-studied is latent Dirichlet allocation (LDA), which is applied to document clas-
sification ([BNJ03]). As new documents are published, it is desirable to update the
distribution of topics without excessive re-computation. 1
We give some settings where online sampling algorithms can be used:
∙ Online Bayesian logistic regression. Concretely, suppose 𝜃 ∼ 𝑝0 for a
given prior distribution, and that samples 𝑦𝑡 are drawn from the conditional
distribution 𝑝(·|𝜃, 𝑦1, . . . , 𝑦𝑡−1). We would like to find the posterior distribution
of 𝑝(𝜃|𝑦1, . . . , 𝑦𝑇 ). By Bayes’ rule and letting 𝑝𝑡 := 𝑝(𝜃|𝑦1, . . . , 𝑦𝑡), we have the
following recursion.
𝑝𝑡(𝜃) ∝ 𝑝𝑡−1(𝜃)𝑝(𝑦𝑡|𝜃, 𝑦1, . . . , 𝑦𝑡−1). (3.1)
The goal is to efficiently obtain one or more samples θ_t from the posterior distribution
𝑝𝑡(𝜃), for each 𝑡. We can think of the samples 𝑦𝑡 as arriving in a streaming or
online manner, and we want to keep updating our estimate for the probability
distribution. This fits the setting of Problem 3.1.1 by defining 𝑓0 to be such
that 𝑝0 ∝ 𝑒−𝑓0 and 𝑓𝑡 to be such that 𝑝(𝑦𝑡|𝜃, 𝑦1, . . . , 𝑦𝑡−1) ∝ 𝑒−𝑓𝑡 , whenever the
𝑓𝑡’s are convex.
1 The theoretical results in this work do not apply to LDA, since LDA requires sampling from non-log-concave distributions. However, one can still apply our algorithm to non-log-concave distributions such as those of LDA.
∙ Optimization. Online sampling is useful even if one is only interested in
optimization: one generic algorithm for online optimization is to sample a point
𝑥𝑡 from the exponential of the (suitably weighted) negative loss ([CL06], Lemma
10 in [NR17]). Indeed there are settings such as online logistic regression in
which the only known way to achieve optimal regret is through a Bayesian
sampling approach [Fos+18], with lower bounds known for the naive convex
optimization approach [HKL14].
∙ Reinforcement learning. In reinforcement learning problems [Rus+18; DFE18],
a class of online optimization problems, one seeks to choose a set of actions which
maximize a sum of “rewards” over multiple time periods. The expected value
of the reward depends on the value of a vector of unknown model parameters as
well as on the chosen action vector. While one seeks to choose an action at each
time period which gives a large reward, one also wishes to choose a wide range
of actions at different time periods in order to explore the set of possible actions,
allowing one to make a better choice of actions in future periods. Thompson
sampling [Rus+18; DFE18] solves this “exploration-exploitation dilemma” by
maximizing the expected reward at each period with respect to a sample from
the Bayesian posterior distribution for the model parameters. Every time one
chooses an action, more data is acquired from the outcome of the reward, so
that the Bayesian posterior distribution changes at each time period. To imple-
ment Thompson sampling efficiently in real time, one wishes to sample quickly
from this changing posterior distribution even as the number of data points
grows very large. For instance, if one implements Thompson sampling with a
logistic model, then one would need to sample from a changing Bayesian logistic
posterior distribution.
∙ Sampling from a log-concave distribution. Sampling from log-concave dis-
tributions is a classic problem in theoretical computer science with applications
to volume computation and integration [LV06], and an algorithm for Problem
3.1.1 can be used to come up with iterative (offline) sampling algorithms for
a log-concave distribution that has the form e^{−f(x)} = e^{−Σ_{t=0}^T f_t(x)}. This
“sum-form” often arises in machine learning applications with T ≫ d, and the cost of
evaluating the gradient of f is T times greater than the cost of evaluating the
gradient of a single f_t. Thus, one approach to sampling from e^{−f(x)} could be to
think of the f_t's as a sequence and sample incrementally as in Problem 3.1.1.
In all of these applications, because a sample is needed at every epoch 𝑡, it is desirable
to have a fast online sampling algorithm. In particular, the ultimate goal is to design
an algorithm for Problem 3.1.1 such that the number of gradient evaluations is con-
stant at each epoch 𝑡, so that the computational requirements at each epoch do not
increase over time. However, this is quite challenging because at epoch 𝑡, one has to
incorporate information from all 𝑡+ 1 functions 𝑓0, . . . , 𝑓𝑡, while only using a number
of gradient computations which is logarithmic in the total number of functions.
The main contribution of this work is an algorithm for Problem 3.1.1 that, under
mild assumptions on the functions, makes Õ_T(1) gradient evaluations per epoch (here
the subscript T in Õ_T means that we only show the dependence on the parameters t, T,
and exclude dependence on non-T, t parameters such as the dimension d, sampling
accuracy ε, and the regularity parameters C, D, L, which we define in Section 3.2.1). All
previous rigorous results (even with comparable assumptions) for this problem imply
a bound on the number of gradient or function evaluations which is at least linear in 𝑇 ;
see Table 3.1. We assume that the functions are smooth, they have a bounded second
moment, and their minimizer drifts in a bounded manner, but we do not assume that
the functions are strongly convex. These assumptions are motivated from real-world
considerations and, as a concrete application, we show that these assumptions hold
in the setting of online Bayesian logistic regression, when the data vectors satisfy
natural regularity properties, giving a sampling algorithm with Õ_T(1) updates. Our
result also implies the first algorithm to sample from a 𝑑-dimensional log-concave
distribution of the form e^{−Σ_{t=0}^T f_t} where the f_t's are not assumed to be strongly
convex and the total number of gradient evaluations is roughly T log(T) + poly(d), as
opposed to T · poly(d) implied by prior works; see Table 3.2.
A natural approach to online sampling is to design a Markov chain with the right
steady state distribution [NR17; DMM18; Dwi+18; Cha+18]. The main difficulty is
that running a step of a Markov chain that incorporates all previous functions takes
time Ω(𝑡) at epoch 𝑡; all previous algorithms with provable guarantees suffer from this.
To overcome this, one must use stochasticity – for example, sample a subset of the
previous functions. However, this fails because of the large variance of the gradient.
Our result relies on a stochastic gradient Langevin dynamics (SGLD) Markov chain
that has a carefully designed variance reduction step built in, with a fixed Õ_T(1)
batch size. Technically, lack of strong convexity is a significant barrier to analyzing
our Markov chain and, here, our main contribution is a martingale exit time argument
that shows that our Markov chain is constrained to a ball of radius roughly 1/√T
for time that is sufficient for it to reach within ε of π_t.
3.2 Our results
3.2.1 Assumptions
Denote by ℒ(Y) the distribution of a random variable Y. For any two probability mea-
sures μ, ν, denote the 2-Wasserstein distance by W_2(μ, ν) := inf_{(X,Y)∼Π(μ,ν)} √(E[‖X − Y‖²]),
where Π(μ, ν) denotes the set of all possible couplings of random vectors (X, Y) with
marginals X ∼ μ and Y ∼ ν. For every t ∈ {0, . . . , T}, define F_t := Σ_{k=0}^t f_k, and
let x*_t be a minimizer of F_t(x) on ℝ^d. For any x ∈ ℝ^d, let δ_x be the Dirac delta
distribution centered at x. We make the following assumptions:
Assumption 3.2.1 (Smoothness/Lipschitz gradient (with constants 𝐿0, 𝐿 > 0)).
For all 1 ≤ 𝑡 ≤ 𝑇 and 𝑥, 𝑦 ∈ R𝑑, ‖∇𝑓𝑡(𝑦)−∇𝑓𝑡(𝑥)‖ ≤ 𝐿 ‖𝑥− 𝑦‖. For 𝑡 = 0,
‖∇𝑓0(𝑦)−∇𝑓0(𝑥)‖ ≤ 𝐿0 ‖𝑥− 𝑦‖.
We allow 𝑓0 to satisfy our assumptions with a different parameter value, since in
Bayesian applications 𝑓0 models a “prior” which has different scaling than 𝑓1, 𝑓2, . . ..
Assumption 3.2.2 (Exponential concentration (with constants A, k > 0, c ≥ 0)). For all 0 ≤ t ≤ T, the concentration condition P_{X∼π_t}(‖X − x*_t‖ ≥ γ/√(t+c)) ≤ A e^{−kγ} holds.
Note that Assumption 3.2.2 implies a bound on the second moment, m_2 := (E_{x∼π_t}‖x − x*_t‖_2²)^{1/2} ≤ C/√(t+c) for C = (2 + 1/k) log(Ak²). For conciseness, we will write bounds in terms of this parameter C.²
Assumption 3.2.3 (Drift of MAP (with constants D ≥ 0, c ≥ 0)).³ For all 0 ≤ t, τ ≤ T such that τ ∈ [t, max{2t, 1}], ‖x*_t − x*_τ‖ ≤ D/√(t+c).
Assumption 3.2.2 says that the "data is informative enough": the current distribution π_t (posterior) concentrates near the mode x*_t as t increases. The 1/t decrease in the second moment is what one would expect based on central limit theorems such as the Bernstein-von Mises theorem. It is a much weaker condition than strong convexity. Indeed, if the f_t's are α-strongly convex, then π_t(x) ∝ e^{−Σ_{k=0}^t f_k(x)} has standard deviation ≤ √d/√(α(t+1)) (consider for instance the example of Gaussians with variance 1/α). In addition, many distributions satisfy Assumption 3.2.2 but are not strongly logconcave. For instance, posterior distributions used in Bayesian logistic regression satisfy Assumption 3.2.2 under natural conditions on the data, but are not strongly
2 Having a bounded second moment suffices to obtain (weaker) polynomial bounds (by replacing the use of the concentration inequality with Chebyshev's inequality). We use this slightly stronger condition because exponential concentration improves the dependence on ε, and is typically satisfied in practice.
3 The MAP (maximum a posteriori) is like the MLE except that it takes the prior into account.
logconcave unless the Bayesian prior is strongly logconcave (see section 3.2.4). More-
over, while the second moment in Assumption 3.2.2 decreases with the number of data
points, the strong convexity parameter remains constant even if the prior is strongly
logconcave. Hence, together Assumptions 3.2.1 and 3.2.2 are a weaker condition than
strong convexity and gradient Lipschitzness, the typical setting where the offline al-
gorithm is analyzed. In particular, the assumptions avoid the “ill-conditioned” case
when the distribution becomes more concentrated in one direction than another as
the number of functions 𝑡 increases.
Assumption 3.2.3 is typically satisfied in the setting where the f_t's are iid. For instance, in the case of Gaussian distributions, the maximum a posteriori (MAP) is the mean, and the assumption reduces to the fact that a random walk drifts on the order of √t, and hence the mean drifts by O_T(1/√t), after t time steps. We need this assumption because our algorithm uses cached gradients computed Θ_T(t) time steps ago, and in order for the past gradients to be close in value to the gradient at the current point, the points where the gradients were last calculated should be at distance O_T(1/√t) from the current point. In Section 3.2.4 we show that these assumptions hold for sequences of functions arising in online Bayesian logistic regression; unlike in previous work on related techniques [Nag+17; Cha+18], our assumptions are weak enough to hold for such applications, as they do not require f_0, . . . , f_T to be strongly convex.
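As a sanity check of the scalings in Assumptions 3.2.2 and 3.2.3, consider the toy case f_k(x) = (x − y_k)²/2 with iid Gaussian y_k (our own illustration, not an example from the text): the minimizer x*_t is the running mean of the y_k, π_t is N(x*_t, 1/(t+1)), and the drift ‖x*_t − x*_{2t}‖ scales as 1/√t. A small simulation of the drift:

```python
import random
import statistics

random.seed(1)

# f_k(x) = (x - y_k)^2 / 2 with y_k iid N(0, 1): the minimizer x*_t of
# F_t = sum_{k <= t} f_k is the running mean, so the MAP drift between
# epochs t and 2t has standard deviation 1/sqrt(2t), matching D/sqrt(t+c).
def drift(t, trials=2000):
    """Average |x*_t - x*_{2t}| over independent runs."""
    total = 0.0
    for _ in range(trials):
        y = [random.gauss(0, 1) for _ in range(2 * t)]
        total += abs(statistics.fmean(y[:t]) - statistics.fmean(y))
    return total / trials

d100, d400 = drift(100), drift(400)
print(d100, d400)
# Quadrupling t should roughly halve the drift (1/sqrt(t) scaling).
```

The same 1/√t behavior is what lets Algorithm 4 reuse gradients cached Θ_T(t) steps earlier.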
3.2.2 Result in the online setting
Theorem 3.2.4 (Online variance-reduced SGLD). Suppose that f_0, . . . , f_T : ℝ^d → ℝ are (weakly) convex⁴ and satisfy Assumptions 3.2.1-3.2.3 with c = L_0/L. Then there exist parameters η_0, b, and i_max which are polynomial in d, L, C, D, ε⁻¹ and poly-logarithmic in T, such that at epoch t, Algorithm 4 generates an ε-approximate independent sample X^t from π_t.⁵ Moreover, the total number of gradient evaluations required at each epoch t is polynomial in d, L, C, D, ε⁻¹ and polylogarithmic in T.
4 In fact, it suffices for their sum to be convex.
See Theorem 3.6.5 for a more precise statement with explicit dependencies. Note
that the algorithm needs to know the parameters, but upper bounds on them suffice.
Compared to previous work on the topic, this result is the first to obtain bounds
on the number of gradient evaluations which are polylogarithmic in T at each epoch
(see Table 3.1 where we compare the dependence on 𝑇 of previous results applied
to the online sampling problem). Previous results for the basic Langevin and SGLD
algorithms, as well as for the variance reduced SGLD methods SAGA-LD and CV-
LD [Cha+18] and the online Dikin walk6 [NR17] all imply a bound on the number of
gradient or function7 evaluations at each epoch which is at least linear in 𝑇 . 8 On
the other hand, while polynomial, our result’s dependence on the other parameters
𝑑, 𝐿, 𝐶,D, 𝜀−1 is larger than that of the online Dikin walk and of the Langevin and
SGLD algorithms. We suspect that the order of this polynomial can be improved
with a more careful analysis.
Finally, the results of [Cha+18] require strong convexity, while our result only
requires a much weaker bound on the concentration of the target distribution (As-
sumption 3.2.2). This allows us to obtain bounds for applications such as logistic
regression where the functions 𝑓1, . . . , 𝑓𝑡 may not be strongly convex.
5 See Definition 3.6.1 for the formal definition. Necessarily, ‖ℒ(X^t) − π_t‖_TV ≤ ε.
6 The online Dikin walk reduces to an online version of the Random Walk Metropolis algorithm in our unconstrained setting.
7 In our setting a gradient evaluation can be computed in at worst 2d function evaluations. In many applications (including logistic regression) computing the gradient takes the same number of operations as computing the function.
8 Note that the number of gradient evaluations for the basic Langevin and SGLD algorithms and the online Dikin walk depend multiplicatively on T (i.e., T × poly(d, L, other parameters)), while the number of gradient evaluations for the variance-reduced SGLD methods depend only additively on T (i.e., T + poly(d, L, other parameters)).
Algorithm                        Oracle calls per epoch   Other assumptions
Online Dikin walk [NR17, S5.1]   O_T(T)                   Strong convexity; bounded ratio of distributions
Langevin [DMM18; Dwi+18]         O_T(T)                   -
SGLD [DMM18]                     O_T(T)                   -
SAGA-LD [Cha+18]                 O_T(T)                   Strong convexity; Lipschitz Hessian
CV-ULD [Cha+18]                  O_T(T)                   Strong convexity
This work                        polylog(T)               Bounded second moment; bounded drift of minimizer

Table 3.1: Bounds on the number of gradient (or function) evaluations required by different algorithms to solve the online sampling problem. Lipschitz gradient (smoothness) is assumed for all algorithms. Note that the online Dikin walk was analyzed in [NR17] for a different setting where the target distribution is restricted to a convex polytope; in this table we give the result that one should obtain when the support is ℝ^d. It is therefore possible that the assumptions we give for the online Dikin walk can be weakened.
3.2.3 Result in the offline setting
In the offline setting, we have access to all T functions f_1, . . . , f_T from the beginning
(for notational simplicity, in the rest of the paper we index the f_t's from t = 1 for
the offline setting). Our goal is simply to generate a sample from the single target
distribution π_T(x) ∝ e^{−Σ_{t=1}^T f_t(x)} with TV error ε. Since we do not assume that the
f_t's are given in any particular order, we replace Assumption 3.2.2, which depends on
the order in which the functions are given, with an assumption (Assumption 3.2.5)
on the target function Σ_{t=1}^T f_t(x) which does not depend on the ordering of the f_t's.
Instead of working with the sequence of target distributions π_1, π_2, . . ., which depend
on the ordering of the f_t's, we introduce an inverse temperature parameter β > 0
and consider the distributions π^β_T(x) ∝ e^{−βΣ_{t=1}^T f_t(x)}. In place of Assumption 3.2.2,
we assume the following:
Assumption 3.2.5 (Exponential concentration (with constants A, k > 0)). For all 1/T ≤ β ≤ 1, we have for all s ≥ 0, P_{X∼π^β_T}(‖X − x*‖ ≥ s/√(βT)) ≤ A e^{−ks}.
Assumption 3.2.5 says that the distributions π^β_T become more concentrated as β increases from 1/T to 1. By sampling from a sequence of distributions π^β_T where we gradually increase β from 1/T to 1 at each epoch, our offline algorithm (Algorithm 5) is able to approach the target distribution π_T = π^1_T when starting from a cold start that is far from a sublevel set containing most of the mass of the probability measure of π_T, without requiring strong convexity. Moreover, since scaling by β does not change the location of the minimizer x* of βΣ_{t=1}^T f_t(x), we can drop Assumption 3.2.3.
Theorem 3.2.6 (Offline variance-reduced SGLD). Suppose that f_1, . . . , f_T satisfy Assumptions 3.2.1 and 3.2.5. Then there exist b, η, and i_max which are polynomial in d, L, C, ε⁻¹ and poly-logarithmic in T, such that Algorithm 5 generates a sample X_T such that ‖ℒ(X_T) − π_T‖_TV ≤ ε. Moreover, the total number of gradient evaluations is polylog(T) × poly(d, L, C, D, ε⁻¹) + Õ(T).
See Theorem 3.7.2 for precise dependencies. The theorem could also be stated with an f_0, but we have omitted it for simplicity.
As in the online setting, we do not assume strong convexity. Further, our additive
dependence on 𝑇 in Theorem 3.2.6 is tight up to polylogarithmic factors, since the
number of gradient evaluations needed to sample from a target distribution satisfying
Assumptions 3.2.1-3.2.3 is at least Ω(𝑇 ) because of information theoretic require-
ments.
Compared to previous work in this setting, our results are the first to obtain
an additive dependence on 𝑇 and polynomial dependence on the other parameters
without assuming strong convexity. While the results of [Cha+18] for SAGA-LD and
CV-LD have additive dependence on 𝑇 , their results require the functions 𝑓1, . . . , 𝑓𝑇
to be strongly convex. Since the basic Dikin walk and basic Langevin algorithms
compute all 𝑇 functions or all 𝑇 gradients every time the Markov chain takes a step,
and the number of steps in their Markov chain depends polynomially on the other
parameters such as 𝑑 and 𝐿, the number of gradient (or function) evaluations required
by these algorithms is multiplicative in 𝑇 . Even though the basic SGLD algorithm
Algorithm                        # of oracle calls              Other assumptions
Online Dikin walk [NR17, S5.1]   T × poly(d, L)                 Strong convexity
Langevin [DMM18; Dwi+18]         T × poly(d, L)                 Wasserstein warm start
SGLD [DMM18]                     T × poly(d, L)                 Wasserstein warm start
SAGA-LD [Cha+18]                 T + poly(d, m⁻¹, L, L_H)       Strong convexity
CV-ULD [Cha+18]                  T + poly(d, m⁻¹, L)            Strong convexity
This work                        T + poly(d, C, D, L)           Bounded second moment; bounded drift of minimizer

Table 3.2: Bounds on the number of gradient (or function) evaluations required by different algorithms to solve the offline sampling problem. Lipschitz gradient (smoothness) is assumed for all algorithms.
computes a mini-batch of the gradients at each step, roughly speaking the batch size
at each step of the chain should be at least Ω𝑇 (𝑇 ) for the stochastic gradient to have
the required variance, implying that basic SGLD also has multiplicative dependence
on 𝑇 .
3.2.4 Application to Bayesian logistic regression
Next, we show that Assumptions 3.2.1-3.2.3, and therefore Theorem 3.2.4, hold in the
setting of online Bayesian logistic regression, when the data satisfy certain regularity
properties.
Logistic regression is a fundamental and widely used model in Bayesian statis-
tics [AC93]. It has served as a model problem for methods in scalable Bayesian
inference [WT11; HCB16; CB17; CB18a], of which online sampling is one approach.
Additionally, sampling from the logistic regression posterior is the key step in the
optimal algorithm for online logistic regret minimization [Fos+18].
In Bayesian logistic regression, one models the data (u_t ∈ ℝ^d, y_t ∈ {−1, 1}) as follows: there is some unknown θ_0 ∈ ℝ^d such that given u_t (which is thought of as the independent variable), for all t ∈ {1, . . . , T} the dependent variable y_t follows a Bernoulli logistic distribution with "success" probability φ(u_t^⊤θ_0) (y_t = 1 with probability φ(u_t^⊤θ_0) and −1 otherwise), where φ(x) = 1/(1 + e^{−x}). The Bayesian logistic regression
sampling problem we consider is as follows:
Problem 3.2.1 (Bayesian logistic regression). Suppose the y_t's are generated from the u_t's as Bernoulli random variables with "success" probability φ(u_t^⊤θ_0). At every epoch t ∈ {1, . . . , T}, after observing {(u_k, y_k)}_{k=1}^t, return a sample from the posterior distribution⁹ π_t(θ) ∝ e^{−Σ_{k=0}^t f_k(θ)}, where f_0(θ) := (1/(2α))‖θ‖² and f_k(θ) := −log[φ(y_k u_k^⊤θ)].
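The gradient oracles that an online sampler needs for Problem 3.2.1 are simple closed forms. The sketch below is our own code, not the thesis's implementation; it uses the identity φ′ = φ(1 − φ), so that ∇f_k(θ) = −y_k(1 − φ(y_k u_k^⊤θ))u_k:

```python
import math

def phi(z):
    """Logistic sigmoid (numerically naive for large |z|)."""
    return 1.0 / (1.0 + math.exp(-z))

def grad_f0(theta, alpha):
    # f_0(theta) = ||theta||^2 / (2 * alpha)  (Gaussian prior)
    return [th / alpha for th in theta]

def grad_fk(theta, u, y):
    # f_k(theta) = -log phi(y * <u, theta>), so
    # grad f_k(theta) = -y * (1 - phi(y * <u, theta>)) * u
    z = y * sum(ui * ti for ui, ti in zip(u, theta))
    coef = -y * (1.0 - phi(z))
    return [coef * ui for ui in u]
```

Each oracle call costs O(d) operations, which is what makes the per-epoch budget in Theorem 3.2.2 meaningful.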
We show that under reasonable conditions on the data-generating distribution –
namely, that the inputs are bounded and that we see data in all directions – our
online sampling algorithm, Algorithm 4, succeeds on Bayesian logistic regression.10
Theorem 3.2.2 (Online Bayesian logistic regression). Suppose that ‖θ_0‖ ≤ B for some B > 0, and that the u_t ∼ P_u are iid, where P_u is a distribution that satisfies the following: for u ∼ P_u, (1) for some M > 0, ‖u‖_2 ≤ M with probability 1 (bounded), and (2) E_u[uu^⊤ 1_{|u^⊤θ_0|≤2}] ⪰ σI_d (the "restricted" covariance matrix is bounded away from 0).¹¹ Then for the functions f_0, . . . , f_T in Problem 3.2.1, and any ε > 0, there exist parameters L, log(A), k⁻¹, D = poly(M, σ⁻¹, α, B, d, 1/ε, log(T)) such that Assumptions 3.2.1, 3.2.2, and 3.2.3 hold for all t with probability at least 1 − ε. Therefore Algorithm 4 gives ε-approximate samples from π_t for 1 ≤ t ≤ T with poly(M, σ⁻¹, α, B, d, 1/ε, log(T)) gradient evaluations at each epoch.
Note that our result does not hold if the covariance matrix of the distribution of the
𝑢𝑡’s becomes much more ill-conditioned over time, as is the case in certain applications
of Thompson sampling [Rus+18]. In such applications we would have to add a pre-
conditioner to Algorithm 4 which changes at each epoch.
Our result in the offline case improves upon previous analyses of variance-reduced
SGLD for Bayesian logistic regression, where the number of gradient evaluations has
9 Here we choose a Gaussian prior, but this can be replaced by any e^{−f_0} where f_0 is strongly convex and smooth.
10 For simplicity, we state the result (Theorem 3.2.2) in the case where the input variables u are iid, but note that the result holds more generally (see Lemma C.1.1 for a more general statement of our result).
11 The constant 2 may be replaced by any other constant. For a tighter condition, see the statement of Theorem C.1.2.
multiplicative dependence on 𝑇 [Nag+17]. Our bounds in the offline case only have
additive dependence on 𝑇 .
In Section 3.8 we show that our algorithm achieves competitive accuracy compared
to a Markov chain that is specialized to logistic regression (Polya-Gamma).
3.3 Algorithm and proof techniques
3.3.1 Overview of online algorithm
Algorithm 3 SAGA-LD
Input: Gradient oracles for f_k : ℝ^d → ℝ, for 0 ≤ k ≤ t.
Input: Step size η > 0, batch size b ∈ ℕ, number of steps i_max, initial point X_0.
Input: Cached gradients G^k = ∇f_k(u_k) for some points u_k, and s = Σ_{k=1}^t G^k.
Output: X_{i_max}
1: for i from 0 to i_max − 1 do
2:   (Sample batch) Sample with replacement a (multi)set S of size b from {1, . . . , t}.
3:   (Calculate gradients) For each k ∈ S, let G^k_new = ∇f_k(X_i).
4:   (Variance-reduced gradient estimate) Let g_i = ∇f_0(X_i) + s + (t/b) Σ_{k∈S} (G^k_new − G^k).
5:   (Langevin step) Let X_{i+1} = X_i − ηg_i + √(2η) ξ_i, where ξ_i ∼ N(0, I).
6:   (Update sum) Update s ← s + Σ_{k∈set(S)} (G^k_new − G^k).
7:   (Update gradients) For each k ∈ S, update G^k ← G^k_new.
8: end for
9: Return X_{i_max}.
Given gradient access to the functions f_0, . . . , f_t, at every epoch t = 1, . . . , T, Algorithm 4 generates a point X^t approximately distributed according to π_t ∝ e^{−Σ_{k=0}^t f_k(x)}, by running SAGA-LD, given by Algorithm 3. Algorithm 3 makes the following update at each step of the SGLD Markov chain X_i, for a certain choice of stochastic gradient g_i with E[g_i] = Σ_{k=0}^t ∇f_k(X_i):
$$X_{i+1} = X_i - \eta_t g_i + \sqrt{2\eta_t}\,\xi_i, \qquad \xi_i \sim N(0, I_d). \tag{3.2}$$
Algorithm 4 Online SAGA-LD
Input: T ∈ ℕ and gradient oracles for functions f_t : ℝ^d → ℝ, for all t ∈ {0, . . . , T}, where only the gradient oracles ∇f_0, . . . , ∇f_t are available at epoch t.
Input: step size η_0, batch size b > 0, i_max > 0, constant offset c, acceptance radius C′, an initial point X^0 ∈ ℝ^d.
Output: At each epoch t, a sample X^t
1: Set s = 0. ◁ Initial gradient sum
2: for epoch t = 1 to T do
3:   Set t′ = 2^⌊log₂(t−1)⌋ if t > 1, and t′ = 0 if t = 1. ◁ The previous power of 2
4:   if ‖X^{t−1} − X^{t′}‖ ≤ C′/√(t+c) then X^t_0 ← X^{t−1} ◁ If the previous sample hasn't drifted too far, use the previous sample as warm start
5:   else X^t_0 ← X^{t′} ◁ If the previous sample has drifted too far, reset to the sample at time t′
6:   end if
7:   G^t ← ∇f_t(X^t_0)
8:   s ← s + G^t.
9:   For all gradients G^k = ∇f_k(u_k) which were last updated at time t/2, replace them by ∇f_k(X^t_0) and update s accordingly.
10:  Draw i_t uniformly from {1, . . . , i_max}.
11:  Run Algorithm 3 with step size η_0/(t+c), batch size b, number of steps i_t, initial point X^t_0, and precomputed gradients G^k with sum s. Keep track of when the gradients are updated.
12:  Return the output X^t = X^t_{i_t} of Algorithm 3.
13: end for
Key to this algorithm is the construction of the variance-reduced stochastic gradient g_i. It is constructed by taking the sum of the gradients at previous points in the Markov chain and then correcting it with a batch. Roughly, we show that with high probability the previous points at which each gradient in the batch was computed are within Õ_T(1/√t) of x*_t.
Our main theorem, Theorem 3.2.4, says that to obtain a fixed TV error ε for each sample, the number of steps i_max at each epoch and the batch size b only need to be poly-logarithmic in T.
The algorithm takes as input the parameter η_0 > 0, which determines the step size η_t of the Langevin dynamics Markov chain. Assumption 3.2.2 says that the variance of the target distribution decreases at the rate C²/(t+c). To ensure that the variance of each step of Langevin dynamics decreases at roughly the same rate as the variance of the target distribution π_t, we therefore set the step size to be η_t = η_0/(t+c). With this step size, the Markov chain can travel across a sub-level set containing most of the probability measure of π_t in roughly the same number i_max = Õ_T(1) of steps at each epoch t. We will take the acceptance radius to be C′ = 2.5(C_1 + D), where C_1 is given by (3.66), and show that with good probability this choice of C′ ensures ‖X^{t−1} − X^{t′}‖ ≤ 4(C_1 + D)/√(t+c) in Algorithm 4.
3.3.2 Overview of offline algorithm
Similarly to the online Algorithm 4, our offline Algorithm 5 also calls the variance-
reduced SGLD Algorithm 3 multiple times. In the offline setting, all the functions
𝑓1, . . . , 𝑓𝑇 are given from the start, so there is no need to run Algorithm 3 on subsets
of the functions. Instead, we run SAGA-LD on 𝛽𝑓₁, …, 𝛽𝑓_𝑇, where 𝛽 is the inverse temperature and is doubled at each epoch, from roughly 𝛽 = 1/𝑇 to 𝛽 = 1. There are logarithmically many epochs, and each epoch takes 𝑖max = Õ_𝑇(1) Markov chain steps. Note that we cannot just run SAGA-LD on 𝑓₁, …, 𝑓_𝑇. The temperature schedule is necessary because we only assume a cold start; in order for our variance-reduced SGLD to work, the initial starting point must be Õ_𝑇(1/√𝑇), rather than Õ_𝑇(1), away from the minimum. The temperature schedule helps us get there by roughly halving the distance to the minimum at each epoch; the step sizes are also halved at each epoch.
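The inverse-temperature schedule can be sketched in a few lines (our own illustration of the doubling described above):

```python
def inverse_temperature_schedule(T):
    """Inverse temperatures used by the offline algorithm: beta doubles
    each epoch from 1/T until it reaches 1, giving O(log2 T) epochs."""
    beta, schedule = 1.0 / T, []
    while beta < 1.0:
        schedule.append(beta)
        beta = min(2.0 * beta, 1.0)
    schedule.append(1.0)
    return schedule

betas = inverse_temperature_schedule(8)   # [0.125, 0.25, 0.5, 1.0]
```

For 𝑇 = 8 this produces log₂(𝑇) + 1 = 4 epochs, matching the logarithmic epoch count claimed above.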
3.4 Proof overview
3.4.1 Online problem
Algorithm 5 Offline variance-reduced SGLD
Input: 𝑇 ∈ N and gradient oracles for functions 𝑓_𝑡 : R^𝑑 → R, 1 ≤ 𝑡 ≤ 𝑇.
Input: step size 𝜂, batch size 𝑏 > 0, 𝑖max > 0, an initial point 𝑋_0 ∈ R^𝑑.
Output: A sample 𝑋.
1: 𝑋 ← 𝑋_0
2: Set 𝛽 = 1/𝑇. ◁ Start at a high temperature, 𝑇.
3: while 𝛽 < 1 do
4: Run Algorithm 3 with step size 𝜂/(𝛽𝑇), batch size 𝑏, number of steps 𝑖max, initial point 𝑋, and functions 𝛽𝑓_𝑡, 1 ≤ 𝑡 ≤ 𝑇.
5: Set 𝑋 ← 𝑋_𝛽, where 𝑋_𝛽 is the output of Algorithm 3.
6: 𝛽 ← min{2𝛽, 1}. ◁ Double the inverse temperature.
7: end while
8: Return 𝑋.

For the online problem, information-theoretic constraints require us to use the "information" from at least Ω(𝑡) gradients in order to sample with fixed TV error at the 𝑡th epoch. Thus, in order to use only Õ_𝑇(1) gradients at each epoch, we must
reuse gradient information from past epochs. We accomplish this by reusing gradients
computed at points in the Markov chain, including points at past epochs. This saves
a crucial factor of 𝑇 over naive SGLD, but only if we can show that these past points
in the Markov chain track the mode of the distribution, and that our Markov chain
also stays close to the mode (Lemma 3.6.1).
The distribution is concentrated to 𝑂_𝑇(1/√𝑡) at the 𝑡th epoch (Assumption 3.2.2), and we need the Markov chain to stay within Õ_𝑇(1/√𝑡) of the mode. The bulk of the
proof (Lemma 3.6.2) is to show that with large probability the Markov chain stays
within this ball. Once we establish that the Markov chain stays close, we combine
our bounds with existing results on SGLD from [DMM18] to show that we only need Õ_𝑇(1) steps per epoch (Lemma 3.6.4). Finally, an induction with careful choice of constants finishes the proof (Theorem 3.6.5). Details of each of these steps follow.
Bounding the variance of the stochastic gradient (see Lemma 3.6.1). We
reduce the variance of our stochastic gradient by using the gradient evaluated a past
point 𝑢𝑘 and estimating the difference in the gradients between our current point
𝑋 𝑡𝑖 and the past point 𝑢𝑘. Using the 𝐿-Lipschitz property (Assumption 3.2.1) of
the gradients, we show that the variance of this stochastic gradient is bounded by
(𝑡²/𝑏)𝐿² max_𝑘 ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖². To obtain this bound, observe that the individual components {∇𝑓_𝑘(𝑋^𝑡_𝑖) − ∇𝑓_𝑘(𝑢_𝑘)}_{𝑘∈𝑆} of the stochastic gradient 𝑔^𝑡_𝑖 have variance at most 𝑡²𝐿² max_𝑘 ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖² by the Lipschitz property. Averaging with a batch saves a factor of 𝑏.
For the number of gradient evaluations to stay nearly constant at each step, in-
creasing the batch size is not a viable option to decrease the variance of our stochastic
gradient. Rather, if we can show that ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖ decreases as ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖ = Õ_𝑇(1/√𝑡),
the variance of our stochastic gradient will decrease at each epoch at the desired rate.
Bounding the escape time from a ball where the stochastic gradient has
low variance (see Lemma 3.6.2). Our main challenge is to bound the distance
‖𝑋𝑖 − 𝑢𝑘‖. Because we do not assume that the target distribution is strongly con-
vex, we cannot use proof techniques of past papers analyzing variance-reduced SGLD
methods. [Cha+18; Nag+17] used strong convexity to show that with high prob-
ability, the Markov chain does not travel too far from its initial point, implying a
bound on the variance of their stochastic gradients. Unfortunately, many important
applications, including logistic regression, lack strong convexity.
To deal with the lack of strong convexity, we instead use a martingale exit time
argument to show that the Markov chain remains inside a ball of radius 𝑟 = Õ_𝑇(1/√𝑡)
with high probability for a large enough time 𝑖max for the Markov chain to reach a
point within TV distance 𝜀 of the target distribution. Towards this end, we would like
to bound the distance from the current state of the Markov chain to the mode, ‖𝑋^𝑡_𝑖 − 𝑥⋆_𝑡‖, by Õ_𝑇(1/√𝑡), and to bound ‖𝑥⋆_𝑡 − 𝑢_𝑘‖ by Õ_𝑇(1/√𝑡). Together, this allows us to bound the distance ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖ = Õ_𝑇(1/√𝑡). We can then use this bound together with Lemma 3.6.1 to bound the variance of the stochastic gradient by roughly Õ_𝑇(1)·𝑡.
Bounding ‖𝑥⋆_𝑡 − 𝑢_𝑘‖. Since 𝑢_𝑘 is a point of the Markov chain, possibly at a previous epoch 𝜏 ≤ 𝑡, roughly speaking we can bound this distance inductively by using bounds obtained at the previous epoch 𝜏 (Theorem 3.6.5 and Lemma 3.6.4). Noting that 𝑢_𝑘 = 𝑋^𝜏_𝑖 for some 𝑖 ≤ 𝑖max, we use the bound ‖𝑢_𝑘 − 𝑥⋆_𝜏‖ = 𝑂_𝑇(1/√𝜏) = 𝑂_𝑇(1/√𝑡) obtained at the previous epoch 𝜏, together with Assumption 3.2.3, which says that ‖𝑥⋆_𝑡 − 𝑥⋆_𝜏‖ = 𝑂_𝑇(1/√𝑡), to bound ‖𝑥⋆_𝑡 − 𝑢_𝑘‖.
Bounding ‖𝑋^𝑡_𝑖 − 𝑥⋆_𝑡‖. To bound the distance 𝜌_𝑖 := ‖𝑋^𝑡_𝑖 − 𝑥⋆_𝑡‖ to the mode, we would like to bound the increase 𝜌_{𝑖+1} − 𝜌_𝑖 at each step 𝑖 in the Markov chain. We use a martingale exit time argument on ‖𝑋^𝑡_𝑖 − 𝑥⋆_𝑡‖², the squared distance from the current
state of the Markov chain to the mode. The advantage in using the squared distance
is that the expected increase in the squared distance due to the Gaussian noise term √(2𝜂_𝑡)𝜉_𝑖 in the Markov chain update rule (equation (3.2)) is the same regardless of the current position of the Markov chain, allowing us to obtain tighter bounds on the increase.
To bound the component of the increase in ‖𝑋^𝑡_𝑖 − 𝑥⋆_𝑡‖² that is due to the gradient term −𝜂_𝑡𝑔_𝑖, we use weak convexity. By weak convexity, the (negative) gradient
never points away from the mode, meaning that, roughly speaking, the mean of the
stochastic gradient term in the Langevin Markov chain update does not increase the
squared distance to the mode. Any increase in the distance from the mode is due to
the Gaussian noise term √(2𝜂_𝑡)𝜉_𝑖 or to the error term 𝑔_𝑖 − ∇𝐹_𝑡(𝑋^𝑡_𝑖) in the stochastic
gradient, both of which have mean zero and are independent of previous steps in
the Markov chain. We then apply Azuma’s martingale concentration inequalities to
bound the exit time from the ball. This shows that the Markov chain remains at distance roughly Õ_𝑇(1/√𝑡) from the mode.
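The following toy simulation (ours, with a simple quadratic potential, not code from the thesis) illustrates the mechanism: for a convex 𝐹 the gradient term never pushes the chain away from the mode, so the squared distance grows only through the Gaussian noise, whose expected per-step contribution does not depend on the current position:

```python
import numpy as np

# Langevin chain for F(x) = 0.5*||x||^2 with mode x* = 0. The drift term
# -eta*X always points toward the mode; the sqrt(2*eta)*xi noise term adds
# roughly 2*eta*d to the squared distance in expectation, independent of X.
rng = np.random.default_rng(1)
d, eta, n_steps = 4, 0.01, 20000
X = np.zeros(d)
sq_dists = []
for _ in range(n_steps):
    X = X - eta * X + np.sqrt(2 * eta) * rng.standard_normal(d)
    sq_dists.append(X @ X)
# At stationarity X is approximately N(0, I), so E||X||^2 is close to d.
mean_sq = np.mean(sq_dists[n_steps // 2:])
assert 0.5 * d < mean_sq < 2 * d
```

The chain equilibrates at squared distance of order 𝑑 rather than drifting off, which is the qualitative behavior the martingale argument quantifies.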
Bounding the TV error (Lemma 3.6.4). We now show that if 𝑢_𝑘 is close to 𝑥⋆_𝜏, then 𝑋^𝑡 will be a good sample from 𝜋_𝑡. More precisely, we show that if at epoch 𝑡 the Markov chain starts at 𝑋^𝑡_0 such that ‖𝑋^𝑡_0 − 𝑥⋆_𝜏‖ ≤ R/√(𝑡+𝑐) (R to be chosen later), then

‖ℒ(𝑋^𝑡_{𝑖max}) − 𝜋_𝑡‖_TV ≤ 𝑂(𝜀/log₂(𝑇)).
To do this, we will use two bounds: a bound on the Wasserstein distance between the initial point 𝑋^𝑡_0 and the target density 𝜋_𝑡, and a bound on the variance of the stochastic gradient. We then plug the bounds into Corollary 18 of [DMM18] (reproduced as Lemma 3.6.3).
Firstly, to bound the initial Wasserstein distance, note by the triangle inequality that 𝑊₂(𝛿_{𝑋^𝑡_0}, 𝜋_𝑡) = 𝑂(‖𝑋^𝑡_0 − 𝑥⋆_𝜏‖ + ‖𝑥⋆_𝜏 − 𝑥⋆_𝑡‖ + 𝑊₂(𝛿_{𝑥⋆_𝑡}, 𝜋_𝑡)). The first term can be bounded by the fact that the algorithm "resets" 𝑋^𝑡_0 if it has drifted too far from its position at step 𝜏. The second term is bounded by D/√(𝜏+𝑐) (by the drift assumption, Assumption 3.2.3), and the third term by 𝐶/√(𝑡+𝑐) (by a bound on the second moment, from Assumption 3.2.2). Thus 𝑊₂²(𝛿_{𝑋^𝑡_0}, 𝜋_𝑡) = Õ_𝑇(1/𝑡).
Secondly, we can apply the variance bound (Lemma 3.6.1) to the Markov chain.
By the bound on the escape time from the ball (Lemma 3.6.2), with high probability
the chain stays within Õ_𝑇(1/√𝑡) of the mode. Lemma 3.6.1 then tells us that the variance satisfies 𝜎²_𝑡 = E[‖𝑔^𝑡_𝑖 − ∇𝐹_𝑡(𝑋^𝑡_𝑖)‖²] ≤ (𝑡²/𝑏)𝐿² max_𝑘 ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖² = Õ_𝑇(1)·𝑡.
The result from [DMM18] then says that we can get a fixed KL-error 𝜀 with

𝑖max = 𝑂_{𝜀,𝑇}(𝑊₂²(𝛿_{𝑋^𝑡_0}, 𝜋_𝑡) · 𝜎²_𝑡 · poly(1/𝜀)) = Õ_{𝜀,𝑇}((1/𝑡) · 𝑡 · poly(1/𝜀)) = Õ_{𝜀,𝑇}(poly(1/𝜀))

steps per epoch. Finally, Pinsker's inequality bounds the TV-error by the KL-error.
These bounds allow us to prove by induction (through a union bound) that with
high probability, ‖𝑋^𝑡 − 𝑥⋆_𝑡‖ is small whenever 𝑡 is a power of 2 (which we need for restarts when the samples drift too far away) and that 𝑋^𝑠_𝑖 never drifts too far from the current mode 𝑥⋆_𝑠, for any 𝑖, 𝑠, and hence get a TV-error bound at each epoch.
Bounding the number of gradient evaluations at each epoch (Theorem 3.6.5). Working out the constants, we see that it suffices to have 𝑖max = poly(𝑑, 𝐿, 𝐶, D, 𝜀⁻¹, log(𝑇)) to obtain TV-error 𝜀 at each epoch. A constant batch size suffices, so the total number of gradient evaluations is 𝑂(𝑖max𝑏) = poly(𝑑, 𝐿, 𝐶, D, 𝜀⁻¹, log(𝑇)).
3.4.2 Offline problem
For the offline problem, the desired result – sampling from 𝜋_𝑇 with TV error 𝜀 using Õ(𝑇) + poly(𝑑, 𝐿, 𝐶, 𝜀⁻¹) log₂(𝑇) gradient evaluations – is known either when we assume strong convexity, or when we have a warm start. We show how to achieve the same additive bound without either assumption.
Without strong convexity, we do not have access to a Lyapunov function which
guarantees that the distance between the Markov chain and the mode 𝑥⋆ of the target
distribution contracts at each step, even from a cold start. To get around this problem,
we sample from a sequence of log₂(𝑇) distributions 𝜋^𝛽_𝑇 ∝ 𝑒^{−𝛽 ∑_{𝑡=1}^𝑇 𝑓_𝑡(𝑥)}, where the inverse "temperature" 𝛽 doubles at each epoch from 1/𝑇 to 1, causing the distribution 𝜋^𝛽_𝑇 to have a decreasing second moment and to become more "concentrated" about
the mode 𝑥⋆ at each epoch. This temperature schedule allows our algorithm to
gradually approach the target distribution, even though our algorithm is initialized
from a cold start 𝑥0 which may be far from a sub-level set containing most of the
target probability measure. The same martingale exit time argument as in the proof
for the online problem shows that at the end of each epoch, the Markov chain is at a
distance from 𝑥⋆ comparable to the (square root of the) second moment of the current
distribution 𝜋^𝛽_𝑇. This provides a "warm start" for the next distribution 𝜋^{2𝛽}_𝑇, and in this way our Markov chain approaches the target distribution 𝜋^1_𝑇 in log₂(𝑇) epochs.
The total number of gradient evaluations is therefore roughly 𝑇 log₂(𝑇) + 𝑏 · 𝑖max per run, since we only compute the full gradient at the beginning of each of the log₂(𝑇) epochs, and then use only a batch of size 𝑏 for the gradient steps at each of the 𝑖max steps of the Markov chain. As in the online case, 𝑏 and 𝑖max are polylogarithmic in 𝑇 and polynomial in the various parameters 𝑑, 𝐿, 𝐶, 𝜀⁻¹, implying that the total number of gradient evaluations is Õ(𝑇) + poly(𝑑, 𝐶, D, 𝜀⁻¹, 𝐿) log₂(𝑇) in the offline setting, where our goal is only to sample from 𝜋^1_𝑇.
The proof of Theorem 3.2.6 is similar to the proof of Theorem 3.2.4, except for some differences in how the stochastic gradients are computed and how one defines the functions "𝐹_𝑡". We define 𝐹_𝑡 := 𝛽_𝑡 ∑_{𝑘=1}^𝑇 𝑓_𝑘, where

𝛽_𝑡 = 2^{𝑡−1}/𝑇 for 1 ≤ 𝑡 ≤ ⌈log₂(𝑇)⌉, and 𝛽_𝑡 = 1 for 𝑡 = ⌈log₂(𝑇)⌉ + 1.

We then show that for this choice of 𝐹_𝑡, the offline assumptions, proof, and algorithm are similar to those of the online case.
3.5 Related work
Online convex optimization. Our motivation for studying the online sampling
problem comes partly from the successes of online (convex) optimization. (For a
survey, see [Haz16].) In online convex optimization, one chooses a point 𝑥_𝑡 ∈ 𝐾 at each step and suffers a loss 𝑓_𝑡(𝑥_𝑡), where 𝐾 is a compact convex set and 𝑓_𝑡 : 𝐾 → R is a convex function [Zin03]. The aim is to minimize the regret compared to the best point in hindsight, where Regret_𝑇 = ∑_{𝑡=1}^𝑇 𝑓_𝑡(𝑥_𝑡) − min_{𝑥*} ∑_{𝑡=1}^𝑇 𝑓_𝑡(𝑥*). The same
algorithms for offline convex optimization (gradient descent, Newton’s method) can
be adapted essentially without change to the online setting, giving square-root regret
in the smooth setting [Zin03] and logarithmic regret in the strongly-convex setting
[HAK07].
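As a point of comparison, online gradient descent achieving 𝑂(√𝑇) regret can be sketched as follows (an illustrative toy with made-up quadratic losses, not code from the thesis):

```python
import numpy as np

# Online gradient descent: play x_t, observe loss f_t(x) = 0.5*(x - c_t)^2,
# then step against the observed gradient with step size ~ 1/sqrt(t).
rng = np.random.default_rng(2)
T = 2000
targets = rng.uniform(-1, 1, size=T)
x, losses = 0.0, []
for t, c in enumerate(targets, start=1):
    losses.append(0.5 * (x - c) ** 2)
    x -= (1.0 / np.sqrt(t)) * (x - c)          # gradient step
# Regret against the best fixed point in hindsight (grid search over [-1, 1]):
best = min(sum(0.5 * (c - u) ** 2 for c in targets) for u in np.linspace(-1, 1, 201))
regret = sum(losses) - best
assert regret < 10 * np.sqrt(T)                # sublinear regret
```

The same one-pass structure is what the online sampling problem tries to mimic, with a sample per epoch in place of a played point.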
Online sampling. To the best of our knowledge, all previous algorithms with prov-
able guarantees in our setting require computation time that grows polynomially with
𝑡. This is because any Markov chain which takes all the previous data into account
needs Ω𝑇 (𝑡) gradient evaluations per step. On the other hand, there are many stream-
ing algorithms that are used in practice which lack provable guarantees, or which rely
on properties of the data (such as compressibility).
The most relevant theoretical work in our direction is [NR17]. The authors con-
sider a changing log-concave distribution on a convex body, and show that under
certain conditions, they can use the previous sample as a warm start, and hence only
take a constant number of steps of their Markov chain (the Dikin walk) at each stage.
They use a zeroth-order, rather than a first-order (gradient) method.
[NR17] consider the online sampling problem in the more general setting where
the distribution is restricted to a convex body. However, they do not achieve the
optimal results in our setting, as we explain below. Firstly, they do not separately consider the case when 𝐹_𝑡(𝑥) = ∑_{𝑘=0}^𝑡 𝑓_𝑘(𝑥) has a sum structure. Any method which treats 𝐹_𝑡(𝑥) = ∑_{𝑘=0}^𝑡 𝑓_𝑘(𝑥) as a black box (and hence does not utilize the sum structure) and takes at least one step per epoch will require Ω(𝑡) gradient evaluations at epoch 𝑡. Secondly, they do not consider how concentration properties of the distribution
translate into more efficient sampling. When the 𝑓𝑡 are linear, their algorithm needs
𝑂𝑇 (1) steps per epoch and 𝑂𝑇 (𝑡) gradient evaluations per epoch. However, in the
general convex setting where the 𝑓𝑡’s are smooth, the algorithm needs 𝑂𝑇 (𝑡) steps per
epoch, and 𝑂𝑇 (𝑡2) gradient evaluations per epoch. An increased number of steps here
may be inevitable because the distribution could concentrate unequally in different
directions; it could have an ill-conditioned covariance matrix, with condition number growing with 𝑡.
We believe that with a concentration result such as Assumption 3.2.2 (for the mode
inside the convex body), their techniques can be used to show that only 𝑂𝑇 (1) steps
and 𝑂𝑇 (𝑡) gradient evaluations are necessary per epoch.
There are many other online sampling methods, and other approaches used to es-
timate changing probability distributions, used in practice. The Laplace approxima-
tion, perhaps the simplest, approximates the posterior distribution with a Gaussian
[BDT16]; however, most distributions cannot be well-approximated by Gaussians.
Stochastic gradient Langevin dynamics [WT11] can be used in an online setting;
however, it suffers from large variance which we address in this work. The particle
filter [DHW12; GD17] is a general algorithm to track a changing distribution. An-
other popular approach (besides sampling) to estimating a probability distribution is
variational inference, which has also been considered in an online setting ([WPB11], [Bro+13]).
Variance reduction techniques. Variance reduction techniques for SGLD were
initially proposed in [Dub+16], for sampling from a fixed distribution 𝜋 ∝ 𝑒^{−∑_{𝑡=0}^𝑇 𝑓_𝑡}.
[Dub+16] propose two variance-reduced SGLD techniques, CV-ULD and SAGA-LD.
CV-ULD re-computes the full gradient ∇𝐹 at an “anchor” point every 𝑟 steps and
updates the gradient at intermediate steps by subsampling the difference in the gradi-
ents between the current point and the anchor point. SAGA-LD, on the other hand,
keeps track of when each gradient ∇𝑓𝑡 was computed, and updates individual gradi-
ents with respect to when they were last computed. [Cha+18] show that CV-ULD can sample in the offline problem in roughly 𝑇 + (𝐿/𝑚)⁶ 𝑑²/𝜀 gradient evaluations, and that SAGA-LD can sample in 𝑇 + 𝑇(𝐿/𝑚)^{3/2} (√𝑑/𝜀)(1 + 𝐿_𝐻) gradient evaluations, where 𝐿_𝐻 is the Lipschitz constant of the Hessian of − log(𝜋).¹²
3.6 Proof of online theorem (Theorem 3.2.4)
First we formally define what we mean by “almost independent”.
Definition 3.6.1. We say that 𝑋^1, …, 𝑋^𝑇 are 𝜀-approximate independent samples from probability distributions 𝜋_1, …, 𝜋_𝑇 if, for independent random variables 𝑌^𝑡 ∼ 𝜋_𝑡, there exists a coupling between (𝑋^1, …, 𝑋^𝑇) and (𝑌^1, …, 𝑌^𝑇) such that for each 𝑡 ∈ [1, 𝑇], 𝑋^𝑡 = 𝑌^𝑡 with probability at least 1 − 𝜀.
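For intuition, the coupling in this definition is exactly what a maximal coupling provides: for two distributions at TV distance 𝜀, one can sample a pair that agrees with probability 1 − 𝜀. Here is a small discrete demonstration (our own illustration, using made-up distributions):

```python
import numpy as np

def maximal_coupling(p, q, rng):
    """Sample (X, Y) with X ~ p, Y ~ q, maximizing P(X == Y).
    P(X != Y) equals the total variation distance between p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    overlap = np.minimum(p, q)
    w = overlap.sum()                          # = 1 - TV(p, q)
    if rng.random() < w:
        x = rng.choice(len(p), p=overlap / w)
        return x, x                            # coupled: X == Y
    rp = (p - overlap) / (1 - w)               # residual mass of p
    rq = (q - overlap) / (1 - w)               # residual mass of q
    return rng.choice(len(p), p=rp), rng.choice(len(q), p=rq)

rng = np.random.default_rng(3)
p, q = [0.5, 0.5, 0.0], [0.4, 0.4, 0.2]
tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))          # = 0.2
agree = np.mean([x == y for x, y in
                 (maximal_coupling(p, q, rng) for _ in range(5000))])
assert abs(agree - (1 - tv)) < 0.05
```

In the thesis's setting, the coupling comes from the TV bound (3.48) at each epoch rather than from an explicit construction.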
¹²Note that the bounds of [Cha+18] are given for sampling within a specified Wasserstein error, not TV error. The bounds we give here are the number of gradient evaluations one would need to sample with Wasserstein error 𝜀, which roughly corresponds to TV error 𝜀; if there are 𝑇 strongly convex functions, roughly speaking, one requires Wasserstein error 𝑂(𝜀/√𝑇) to sample with TV error 𝜀.
3.6.1 Bounding the variance of the stochastic gradient
We first show that the variance reduction in Algorithm 4 reduces the variance from the order of 𝑡² to 𝑡² ‖𝑥 − 𝑥′‖², where 𝑥′ is a past point. This will be on the order of 𝑡 if we can ensure ‖𝑥 − 𝑥′‖ = 𝑂_𝑇(1/√𝑡). Later, we will bound the probability of the bad event that ‖𝑥 − 𝑥′‖ becomes too large.
Lemma 3.6.1. Fix 𝑥 and {𝑢_𝑘}_{1≤𝑘≤𝑡}, and let 𝑆 be a multiset of size 𝑏 chosen with replacement from {1, …, 𝑡}. Let

𝑔^𝑡 = ∇𝑓_0(𝑥) + [∑_{𝑘=1}^𝑡 ∇𝑓_𝑘(𝑢_𝑘)] + (𝑡/𝑏) ∑_{𝑘∈𝑆} [∇𝑓_𝑘(𝑥) − ∇𝑓_𝑘(𝑢_𝑘)]. (3.3)

Then

‖𝑔^𝑡 − ∑_{𝑘=0}^𝑡 ∇𝑓_𝑘(𝑥)‖² ≤ 4𝑡²𝐿² max_𝑘 ‖𝑥 − 𝑢_𝑘‖², (3.4)

E‖𝑔^𝑡 − ∑_{𝑘=0}^𝑡 ∇𝑓_𝑘(𝑥)‖² ≤ (𝑡²/𝑏)𝐿² ((1/𝑡) ∑_{𝑘=1}^𝑡 ‖𝑥 − 𝑢_𝑘‖²) ≤ (𝑡²/𝑏)𝐿² max_𝑘 ‖𝑥 − 𝑢_𝑘‖². (3.5)
Proof. For the first part,

‖𝑔^𝑡 − ∑_{𝑘=0}^𝑡 ∇𝑓_𝑘(𝑥)‖² = ‖∑_{𝑘=1}^𝑡 [∇𝑓_𝑘(𝑢_𝑘) − ∇𝑓_𝑘(𝑥)] + (𝑡/𝑏) ∑_{𝑘∈𝑆} [∇𝑓_𝑘(𝑥) − ∇𝑓_𝑘(𝑢_𝑘)]‖² (3.6)
≤ (𝐿 ∑_{𝑘=1}^𝑡 ‖𝑢_𝑘 − 𝑥‖ + (𝑡/𝑏)𝐿 ∑_{𝑘∈𝑆} ‖𝑢_𝑘 − 𝑥‖)² (3.7)
≤ 4𝑡²𝐿² max_𝑘 ‖𝑢_𝑘 − 𝑥‖². (3.8)

For the second part, let 𝑉 be the random variable given by

𝑉 = (𝑡/𝑏) [(∇𝑓_𝑘(𝑢_𝑘) − ∇𝑓_𝑘(𝑥)) − E_{𝑘∈[𝑡]} [∇𝑓_𝑘(𝑢_𝑘) − ∇𝑓_𝑘(𝑥)]] (3.9)

where 𝑘 ∈ [𝑡] is chosen uniformly at random. Let 𝑉_1, …, 𝑉_𝑏 be independent draws of 𝑉. Because the 𝑉_𝑗 are independent,

E‖𝑔^𝑡 − ∑_{𝑘=0}^𝑡 ∇𝑓_𝑘(𝑥)‖² = E‖∑_{𝑗=1}^𝑏 𝑉_𝑗‖² = tr(E[(∑_{𝑗=1}^𝑏 𝑉_𝑗)(∑_{𝑗=1}^𝑏 𝑉_𝑗)^⊤]) (3.10)
= tr(E ∑_{𝑗=1}^𝑏 𝑉_𝑗𝑉_𝑗^⊤) = ∑_{𝑗=1}^𝑏 E[tr(𝑉_𝑗𝑉_𝑗^⊤)] = 𝑏 E[‖𝑉‖²]. (3.11)

We calculate

E[‖𝑉‖²] = (𝑡²/𝑏²) Var_{𝑘∈[𝑡]}(∇𝑓_𝑘(𝑢_𝑘) − ∇𝑓_𝑘(𝑥)) (3.12)
≤ (𝑡²/𝑏²) E_{𝑘∈[𝑡]}[‖∇𝑓_𝑘(𝑢_𝑘) − ∇𝑓_𝑘(𝑥)‖²] (3.13)
≤ (𝑡²/𝑏²) 𝐿² max_𝑘 ‖𝑥 − 𝑢_𝑘‖². (3.14)

Combining (3.11) and (3.14) gives the result.
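A quick Monte Carlo sanity check of the bound (3.5) is easy to run. The setup below (quadratic 𝑓_𝑘, so gradients are 1-Lipschitz and 𝐿 = 1, with randomly perturbed past points) is our own, chosen only to make the bound testable:

```python
import numpy as np

rng = np.random.default_rng(4)
t, b, d, L = 8, 4, 3, 1.0
centers = rng.standard_normal((t + 1, d))
grad = lambda k, x: x - centers[k]             # gradient of 0.5*||x - c_k||^2
x = rng.standard_normal(d)
u = x + 0.1 * rng.standard_normal((t + 1, d))  # past points near x
full = sum(grad(k, x) for k in range(t + 1))   # true gradient of F_t

errs = []
for _ in range(4000):
    S = rng.integers(1, t + 1, size=b)         # batch with replacement
    g = (grad(0, x) + sum(grad(k, u[k]) for k in range(1, t + 1))
         + (t / b) * sum(grad(k, x) - grad(k, u[k]) for k in S))
    errs.append(np.sum((g - full) ** 2))

bound = (t ** 2 / b) * L ** 2 * max(np.sum((x - u[k]) ** 2) for k in range(1, t + 1))
assert np.mean(errs) <= bound * 1.05           # (3.5) holds up to MC noise
```

The empirical mean squared error sits below the bound, with slack coming from the step (3.13), which replaces a variance by a second moment.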
3.6.2 Bounding the escape time from a ball
Lemma 3.6.2. Suppose that the following hold:

1. 𝐹 : R^𝑑 → R is convex, differentiable, and 𝐿-smooth, with a minimizer 𝑥⋆ ∈ R^𝑑.

2. 𝜁_𝑖 is a random variable depending only on 𝑋_0, …, 𝑋_𝑖 such that E[𝜁_𝑖 | 𝑋_0, …, 𝑋_𝑖] = 0, and whenever ‖𝑋_𝑗 − 𝑥⋆‖ ≤ 𝑟 for all 𝑗 ≤ 𝑖, ‖𝜁_𝑖‖ ≤ 𝑆.

Let 𝑋_0 be such that ‖𝑋_0 − 𝑥⋆‖ ≤ 𝑟 and define 𝑋_𝑖 recursively by

𝑋_{𝑖+1} = 𝑋_𝑖 − 𝜂𝑔_𝑖 + √𝜂 𝜉_𝑖 (3.15)
where 𝑔_𝑖 = ∇𝐹(𝑋_𝑖) + 𝜁_𝑖 (3.16)
𝜉_𝑖 ∼ 𝑁(0, 𝐼_𝑑) (3.17)

and define the event 𝐺 := {‖𝑋_𝑗 − 𝑥⋆‖ ≤ 𝑟 ∀ 1 ≤ 𝑗 ≤ 𝑖max}. Then for 𝑟² > ‖𝑋_0 − 𝑥⋆‖² + 𝑖max[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑] and 𝐶_𝜉 ≥ √(2𝑑),

P(𝐺^𝑐) ≤ 𝑖max [exp(−(𝑟² − ‖𝑋_0 − 𝑥⋆‖² − 𝑖max[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑])² / (2(2𝜂𝑆𝑟 + 2√𝜂 𝐶_𝜉(𝑟 + 𝜂𝑆 + 𝜂𝐿𝑟) + 𝜂𝐶_𝜉²)²)) + exp(−(𝐶_𝜉² − 𝑑)/8)]. (3.18)
Proof. Note that if ‖𝑥 − 𝑥⋆‖ ≤ 𝑟, then because 𝐹 is 𝐿-smooth, ‖∇𝐹(𝑥)‖ ≤ 𝐿‖𝑥 − 𝑥⋆‖ ≤ 𝐿𝑟. If ‖𝑋_𝑖 − 𝑥⋆‖ ≤ 𝑟, then

‖𝑋_{𝑖+1} − 𝑥⋆‖² − ‖𝑋_𝑖 − 𝑥⋆‖² (3.19)
= ‖𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖 + √𝜂 𝜉_𝑖‖² − ‖𝑋_𝑖 − 𝑥⋆‖² (3.20)
= −2𝜂⟨𝑔_𝑖, 𝑋_𝑖 − 𝑥⋆⟩ + 𝜂²‖𝑔_𝑖‖² + 2√𝜂⟨𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉_𝑖⟩ + 𝜂‖𝜉_𝑖‖² (3.21)
= −2𝜂⟨∇𝐹(𝑋_𝑖), 𝑋_𝑖 − 𝑥⋆⟩ [≤ 0 by convexity] − 2𝜂⟨𝜁_𝑖, 𝑋_𝑖 − 𝑥⋆⟩ + 𝜂²‖𝑔_𝑖‖² + 2√𝜂⟨𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉_𝑖⟩ + 𝜂‖𝜉_𝑖‖² (3.22)
≤ −2𝜂⟨𝜁_𝑖, 𝑋_𝑖 − 𝑥⋆⟩ + 2𝜂²(‖∇𝐹(𝑋_𝑖)‖² + ‖𝜁_𝑖‖²) + 2√𝜂⟨𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉_𝑖⟩ + 𝜂‖𝜉_𝑖‖² (3.23)
≤ −2𝜂⟨𝜁_𝑖, 𝑋_𝑖 − 𝑥⋆⟩ + 2𝜂²(𝐿²𝑟² + 𝑆²) + 2√𝜂⟨𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉_𝑖⟩ + 𝜂‖𝜉_𝑖‖² (3.24)
= 2𝜂²(𝐿²𝑟² + 𝑆²) + 𝜂𝑑 + [−2𝜂⟨𝜁_𝑖, 𝑋_𝑖 − 𝑥⋆⟩ + 2√𝜂⟨𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉_𝑖⟩ + 𝜂(‖𝜉_𝑖‖² − 𝑑)]  (*) (3.25)
Note that (*) has expectation 0 conditioned on 𝑋_0, …, 𝑋_𝑖. To use Azuma's inequality, we need our random variables to be bounded. Also, recall that we assumed ‖𝑋_𝑖 − 𝑥⋆‖ is bounded above by 𝑟. Thus, we define a toy Markov chain coupled to 𝑋_𝑖 as follows. Let 𝑋′_0 = 𝑋_0 and

𝑋′_{𝑖+1} = 𝑋′_𝑖 if ‖𝑋′_𝑖 − 𝑥⋆‖ ≥ 𝑟, and 𝑋′_{𝑖+1} = 𝑋′_𝑖 − 𝜂𝑔_𝑖 + √𝜂 𝜉′_𝑖 otherwise, (3.26)
where 𝑔_𝑖 = ∇𝐹(𝑋′_𝑖) + 𝜁_𝑖 (3.27)
𝜉′_𝑖 = min(𝐶_𝜉, ‖𝜉_𝑖‖) 𝜉_𝑖/‖𝜉_𝑖‖ (3.28)
𝜉_𝑖 ∼ 𝑁(0, 𝐼_𝑑). (3.29)

Then 𝑌′_𝑖 := ‖𝑋′_𝑖 − 𝑥⋆‖² − 𝑖[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑] is a supermartingale with differences upper-bounded by

𝑌′_{𝑖+1} − 𝑌′_𝑖 ≤ 0 if ‖𝑋′_𝑖 − 𝑥⋆‖ ≥ 𝑟, and otherwise
𝑌′_{𝑖+1} − 𝑌′_𝑖 ≤ −2𝜂⟨𝜁_𝑖, 𝑋′_𝑖 − 𝑥⋆⟩ + 2√𝜂⟨𝑋′_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉′_𝑖⟩ + 𝜂(‖𝜉′_𝑖‖² − 𝑑) (3.30)
≤ 2𝜂𝑆𝑟 + 2√𝜂(𝑟 + 𝜂(𝑆 + 𝐿𝑟))𝐶_𝜉 + 𝜂(𝐶_𝜉² − 𝑑) (3.31)
≤ 2𝜂𝑆𝑟 + 2√𝜂 𝐶_𝜉(𝑟 + 𝜂𝑆 + 𝜂𝐿𝑟) + 𝜂𝐶_𝜉². (3.32)
By Azuma's inequality, for 𝜆 > 0 and for 𝑟² > ‖𝑋_0 − 𝑥⋆‖² + 𝑖[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑],

P(‖𝑋′_𝑖 − 𝑥⋆‖² − ‖𝑋_0 − 𝑥⋆‖² − 𝑖[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑] > 𝜆) (3.33)
≤ exp(−𝜆² / (2(2𝜂𝑆𝑟 + 2√𝜂 𝐶_𝜉(𝑟 + 𝜂𝑆 + 𝜂𝐿𝑟) + 𝜂𝐶_𝜉²)²)) (3.34)

⟹ P(‖𝑋′_𝑖 − 𝑥⋆‖ > 𝑟) (3.35)
≤ exp(−(𝑟² − ‖𝑋_0 − 𝑥⋆‖² − 𝑖[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑])² / (2(2𝜂𝑆𝑟 + 2√𝜂 𝐶_𝜉(𝑟 + 𝜂𝑆 + 𝜂𝐿𝑟) + 𝜂𝐶_𝜉²)²)). (3.36)

If ‖𝑋_𝑖 − 𝑥⋆‖ ≥ 𝑟 for some 𝑖 ≤ 𝑖max, then either ‖𝑋′_𝑖 − 𝑥⋆‖ ≥ 𝑟 for some 𝑖 ≤ 𝑖max, or 𝑋_𝑖 otherwise becomes different from 𝑋′_𝑖, which happens only when ‖𝜉_𝑖‖ ≥ 𝐶_𝜉 for some 𝑖 ≤ 𝑖max. Thus, letting 𝐼 denote the first time the chain leaves the ball, by the Hanson–Wright inequality, since 𝐶_𝜉 ≥ √(2𝑑),

P(𝐼 ≤ 𝑖max) (3.37)
≤ ∑_{𝑖=1}^{𝑖max} P(‖𝑋′_𝑖 − 𝑥⋆‖² ≥ 𝑟²) + ∑_{𝑖=1}^{𝑖max} P(‖𝜉_𝑖‖ ≥ 𝐶_𝜉) (3.38)
≤ 𝑖max [exp(−(𝑟² − ‖𝑋_0 − 𝑥⋆‖² − 𝑖max[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑])² / (2(2𝜂𝑆𝑟 + 2√𝜂 𝐶_𝜉(𝑟 + 𝜂𝑆 + 𝜂𝐿𝑟) + 𝜂𝐶_𝜉²)²)) + exp(−(𝐶_𝜉² − 𝑑)/8)]. (3.39)
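The clipped-noise coupling in the proof is easy to visualize in code. The toy simulation below (ours, with a quadratic potential) rescales each Gaussian increment to norm at most 𝐶_𝜉, as in (3.28), so that the supermartingale differences stay bounded:

```python
import numpy as np

rng = np.random.default_rng(5)
d, eta, C_xi, r, n = 4, 0.005, 6.0, 5.0, 500   # C_xi >= sqrt(2*d) ~ 2.83
X = np.zeros(d)
exited = False
for _ in range(n):
    xi = rng.standard_normal(d)
    norm = np.linalg.norm(xi)
    xi_clipped = min(C_xi, norm) * xi / norm   # clipped noise, as in (3.28)
    X = X - eta * X + np.sqrt(eta) * xi_clipped  # gradient of 0.5*||x||^2
    if np.linalg.norm(X) > r:                  # chain left the ball of radius r
        exited = True
        break
assert np.linalg.norm(xi_clipped) <= C_xi + 1e-9
assert not exited                              # stays in the ball w.h.p.
```

With these (arbitrary but representative) parameters, the chain equilibrates at distance roughly √(𝑑/2) from the mode and never approaches the exit radius, mirroring the high-probability containment the lemma proves.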
3.6.3 Bounding the TV error
Lemma 3.6.4 will allow us to carry out the induction step for the proof of the main
theorem.
We will use the following result of [DMM18]. Note that this result works more
generally with non-smooth functions, but we will only consider smooth functions.
Their algorithm, Stochastic Proximal Gradient Langevin Dynamics, reduces to SGLD
in the smooth case. We will apply this Lemma with our variance-reduced stochastic
gradients in Algorithm 3.
Lemma 3.6.3 ([DMM18], Corollary 18). Suppose that 𝑓 : R^𝑑 → R is convex and 𝐿-smooth. Let ℱ_𝑖 be a filtration with 𝜉_𝑖 and 𝑔(𝑥_𝑖) defined on ℱ_𝑖, and satisfying E[𝑔(𝑥_𝑖) | ℱ_{𝑖−1}] = ∇𝑓(𝑥_𝑖) and sup_𝑥 Var[𝑔(𝑥) | ℱ_{𝑖−1}] ≤ 𝜎² < ∞. Consider SGLD for 𝑓(𝑥) run with step size 𝜂 and stochastic gradient 𝑔(𝑥), with initial distribution 𝜇_0; that is,

𝑥_{𝑖+1} = 𝑥_𝑖 − 𝜂𝑔(𝑥_𝑖) + √𝜂 𝜉_𝑖, 𝜉_𝑖 ∼ 𝑁(0, 𝐼). (3.40)

Let 𝜇_𝑛 denote the distribution of 𝑥_𝑛 and let 𝜋 be the distribution such that 𝜋 ∝ 𝑒^{−𝑓}. Suppose

𝜂 ≤ min{𝜀/(2(𝐿𝑑 + 𝜎²)), 1/𝐿} (3.41)
𝑛 ≥ ⌈𝑊₂²(𝜇_0, 𝜋)/(𝜂𝜀)⌉. (3.42)

Let 𝜇 = (1/𝑛) ∑_{𝑘=1}^𝑛 𝜇_𝑘 be the "averaged" distribution. Then KL(𝜇 | 𝜋) ≤ 𝜀.
Remark 3.6.2. The result in [DMM18] is stated when 𝑔(𝑥) is independent of the
history ℱ𝑖, but the proof works when the stochastic gradient is allowed to depend on
history, as in SAGA. For SAGA, ℱ𝑖 contains all the information up to time step 𝑖,
including which gradients were replaced at each time step.
Note that [DMM18] is derived by analogy to online convex optimization. The optimization guarantees are only given at the point equal to the average of the 𝑥_𝑡 (by Jensen's inequality). For the sampling problem, this corresponds to selecting a point from the averaged distribution 𝜇.
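Operationally, drawing from the averaged distribution just means running the chain for 𝑛 steps and returning the state at a uniformly random step. A minimal sketch (ours, with a hypothetical gradient oracle):

```python
import numpy as np

def sgld_averaged_sample(grad_f, x0, eta, n, rng):
    """Run the SGLD recursion (3.40) for n steps and return the state at a
    uniformly random step -- a draw from the averaged distribution
    (1/n) * sum_k mu_k for which Lemma 3.6.3 gives the KL guarantee."""
    xs, x = [], np.asarray(x0, float)
    for _ in range(n):
        x = x - eta * grad_f(x) + np.sqrt(eta) * rng.standard_normal(x.shape)
        xs.append(x)
    return xs[rng.integers(n)]

rng = np.random.default_rng(6)
# Example with the exact gradient of f(x) = 0.5*||x||^2 (target ~ N(0, I)):
sample = sgld_averaged_sample(lambda x: x, np.zeros(2), eta=0.05, n=400, rng=rng)
```

Any stochastic gradient satisfying the conditional unbiasedness and variance conditions of the lemma, such as the SAGA estimator of Algorithm 3, can replace the exact gradient here.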
Define the good events

𝐺_𝑡 = {∀𝑠 ≤ 𝑡, ∀0 ≤ 𝑖 ≤ 𝑖_𝑠: ‖𝑋^𝑠_𝑖 − 𝑥⋆_𝑠‖ ≤ R/√(𝑠 + 𝐿₀/𝐿)} (3.43)
𝐻_𝑡 = {∀𝑠 ≤ 𝑡 s.t. 𝑠 is a power of 2 or 𝑠 = 0: ‖𝑋^𝑠 − 𝑥⋆_𝑠‖ ≤ 𝐶₁/√(𝑠 + 𝐿₀/𝐿)}. (3.44)

𝐺_𝑡 is the event that the Markov chain never drifts too far from the current mode (which we want in order to bound the stochastic gradient of SAGA), and 𝐻_𝑡 is the event that the samples at powers of 2 are close to the respective modes (which we want because we will use them as reset points). Roughly, 𝐺^𝑐_𝑡 will involve union-bounding over bad events whose probabilities we will set to be 𝑂(𝜀/𝑇), and 𝐻^𝑐_𝑡 will involve union-bounding over bad events whose probabilities we will set to be 𝑂(𝜀/log₂(𝑇)).
Lemma 3.6.4 (Induction step). Suppose that Assumptions 3.2.1, 3.2.2, and 3.2.3 hold with 𝑐 = 𝐿₀/𝐿 and 𝐿₀ ≥ 𝐿. Let 𝑋^𝜏_𝑖 be obtained by running Algorithm 4 with 𝐶′ = 2.5(𝐶₁ + D), 𝐶₁ ≥ 𝐶, and R ≥ 2(𝐶₁ + D). Suppose 𝜂_𝑡 = 𝜂₀/(𝑡 + 𝐿₀/𝐿) and 𝜀₂ > 0 is such that

𝜂₀ ≤ 𝜀₂²/(𝐿𝑑 + 9𝐿²(R + D)²/𝑏), (3.45)
𝑖max ≥ 20(𝐶₁ + D)²/(𝜂₀𝜀₂²). (3.46)

Suppose 𝜀₁ > 0 is such that for any 𝜏 ≥ 1,

P(𝐺_𝜏 | 𝐺_{𝜏−1} ∩ 𝐻_{𝜏−1}) ≥ 1 − 𝜀₁. (3.47)

Suppose 𝑡 is a power of 2. Then the following hold.

1. For 𝑡 < 𝜏 ≤ 2𝑡, P(𝐺_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡) ≥ 1 − (𝜏 − 𝑡)𝜀₁.

2. Fix 𝑋^𝑠_𝑖 for 𝑠 ≤ 𝑡, 0 ≤ 𝑖 ≤ 𝑖max such that 𝐺_𝑡 ∩ 𝐻_𝑡 holds (i.e., condition on the filtration ℱ_𝑡 on which the algorithm is defined). Then

‖ℒ(𝑋^𝜏) − 𝜋_𝜏‖_TV ≤ (𝜏 − 𝑡)𝜀₁ + 𝜀₂. (3.48)

3. For 𝜏 = 2𝑡, we have

P(𝐺_𝜏 ∩ 𝐻_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡) ≥ 1 − (𝑡𝜀₁ + 𝜀₂ + 𝐴𝑒^{−𝑘𝐶₁}). (3.49)

These also hold in the case 𝑡 = 0 and 𝜏 = 1, when 𝐿₀ ≥ 𝐿.
Proof. Let 𝐹_𝑡(𝑥) = ∑_{𝑘=0}^𝑡 𝑓_𝑘(𝑥).

First, note that 𝐻_{𝜏−1} = ⋯ = 𝐻_𝑡, because 𝐻_𝑠 is defined as an intersection of events with indices ≤ 𝑠 that are powers of 2. (See (3.44).) Moreover, 𝐺_𝜏 is a subset of 𝐺_{𝜏−1} for each 𝜏, by (3.43).

The first part holds by induction on 𝜏 and the assumption on 𝜀₁. We need to show P(𝐺^𝑐_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡) ≤ (𝜏 − 𝑡)𝜀₁ by induction. Assuming it is true for 𝜏, we have by the union bound that

P(𝐺^𝑐_{𝜏+1} | 𝐺_𝑡 ∩ 𝐻_𝑡) ≤ P(𝐺^𝑐_{𝜏+1} ∩ 𝐺_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡) + P(𝐺^𝑐_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡) (3.50)
≤ P(𝐺^𝑐_{𝜏+1} | 𝐺_𝜏 ∩ 𝐺_𝑡 ∩ 𝐻_𝑡) + P(𝐺^𝑐_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡). (3.51)

Now the event 𝐺_𝜏 ∩ 𝐺_𝑡 ∩ 𝐻_𝑡 is the same as the event 𝐺_𝜏 ∩ 𝐻_𝜏, by the previous paragraph. Thus this is ≤ 𝜀₁ + (𝜏 − 𝑡)𝜀₁, completing the induction step.
For the second part, note that for 𝑡 < 𝜏 ≤ 2𝑡,

‖𝑋^𝜏_0 − 𝑥⋆_𝜏‖ ≤ ‖𝑋^𝜏_0 − 𝑋^𝑡‖ + ‖𝑋^𝑡 − 𝑥⋆_𝑡‖ + ‖𝑥⋆_𝑡 − 𝑥⋆_𝜏‖ (3.52)
≤ 2.5(𝐶₁ + D)/√(𝜏 + 𝐿₀/𝐿) + 𝐶₁/√(𝑡 + 𝐿₀/𝐿) + D/√(𝑡 + 𝐿₀/𝐿) (3.53)
≤ 4(𝐶₁ + D)/√(𝜏 + 𝐿₀/𝐿), (3.54)

where in the 2nd inequality we used that

1. Algorithm 4 ensures that ‖𝑋^𝜏_0 − 𝑋^𝑡‖ ≤ 𝐶′/√(𝜏 + 𝐿₀/𝐿) = 2.5(𝐶₁ + D)/√(𝜏 + 𝐿₀/𝐿) (the algorithm resets 𝑋^𝜏_0 to 𝑋^𝑡 if ‖𝑋^𝜏_0 − 𝑋^𝑡‖ is greater than 𝐶′/√(𝜏 + 𝐿₀/𝐿), making the term 0; this is the place where the resetting is used),

2. the definition of 𝐻_𝑡, and

3. the drift assumption, Assumption 3.2.3.

In the 3rd inequality we used that √𝑡 ≥ √(𝜏/2) ≥ √𝜏/1.5.
Therefore

𝑊₂²(𝛿_{𝑋^𝜏_0}, 𝜋_𝜏) ≤ 2‖𝑋^𝜏_0 − 𝑥⋆_𝜏‖² + 2𝑊₂²(𝛿_{𝑥⋆_𝜏}, 𝜋_𝜏) ≤ 32(𝐶₁ + D)²/(𝜏 + 𝐿₀/𝐿) + 2𝐶²/(𝜏 + 𝐿₀/𝐿) ≤ 40(𝐶₁ + D)²/(𝜏 + 𝐿₀/𝐿), (3.55)

where the second moment bound comes from Assumption 3.2.2 and 𝐶 ≤ 𝐶₁.
Define a toy Markov chain 𝑋̃ coupled to 𝑋^𝜏_𝑖 as follows. Let 𝑋̃^𝑠_𝑗 = 𝑋^𝑠_𝑗 for 𝑠 < 𝜏, 𝑋̃^𝜏_0 = 𝑋^𝜏_0, and

𝑋̃^𝜏_{𝑖+1} = 𝑋̃^𝜏_𝑖 − 𝜂𝑔^𝜏_𝑖 + √𝜂 𝜉_𝑖 when ‖𝑋̃^𝜏_𝑗 − 𝑥⋆_𝜏‖ ≤ R/√(𝜏 + 𝐿₀/𝐿) for all 0 ≤ 𝑗 ≤ 𝑖, and 𝑋̃^𝜏_{𝑖+1} = 𝑋̃^𝜏_𝑖 − 𝜂∇𝐹_𝜏(𝑋̃^𝜏_𝑖) otherwise, (3.56)

where 𝑔^𝜏_𝑖 is the stochastic gradient for 𝑋̃^𝜏_𝑖 in Algorithm 3 and 𝜉_𝑖 ∼ 𝑁(0, 𝐼_𝑑). By Lemma 3.6.1, the variance of 𝑔^𝜏_𝑖 is at most (𝜏²𝐿²/𝑏) max_{((𝑡+1)/2, 0) ≤ (𝑠,𝑗) ≤ (𝜏,𝑖)} ‖𝑋̃^𝜏_𝑖 − 𝑋̃^𝑠_𝑗‖². (The ordering on ordered pairs is lexicographic. Note 𝑠 > 𝑡/2 because Algorithm 4 refreshes all gradients that were updated at time 𝑡/2.) If the first case of (3.56) always holds, we bound (using the condition that 𝐺_𝑡 holds)

‖𝑋̃^𝜏_𝑖 − 𝑋̃^𝑠_𝑗‖ ≤ ‖𝑋̃^𝜏_𝑖 − 𝑥⋆_𝜏‖ + ‖𝑥⋆_𝜏 − 𝑥⋆_𝑠‖ + ‖𝑥⋆_𝑠 − 𝑋̃^𝑠_𝑗‖ (3.57)
≤ R/√(𝜏 + 𝐿₀/𝐿) + D/√(𝑠 + 𝐿₀/𝐿) + R/√(𝑠 + 𝐿₀/𝐿) (3.58)
≤ (3R + 2D)/√(𝜏 + 𝐿₀/𝐿) < 3(R + D)/√(𝜏 + 𝐿₀/𝐿) (3.59)

⟹ (𝜏²𝐿²/𝑏) max_{((𝑡+1)/2, 0) ≤ (𝑠,𝑗) ≤ (𝜏,𝑖)} ‖𝑋̃^𝜏_𝑖 − 𝑋̃^𝑠_𝑗‖² ≤ 9𝜏𝐿²(R + D)²/𝑏. (3.60)
We can apply Lemma 3.6.3 with 𝜀 = 2𝜀₂², 𝐿 ← 𝐿(𝜏 + 𝐿₀/𝐿), 𝜎² ≤ 9𝜏𝐿²(R + D)²/𝑏, and 𝑊₂²(𝜇_0, 𝜋) ≤ 40(𝐶₁ + D)²/(𝜏 + 𝐿₀/𝐿). Note that 𝜂_𝜏 ≤ 𝜀₂²/((𝜏 + 𝐿₀/𝐿)(𝐿𝑑 + 9𝐿²(R + D)²/𝑏)) ≤ 𝜀₂²/((𝜏𝐿 + 𝐿₀)𝑑 + 9𝐿²𝜏(R + D)²/𝑏) does satisfy (3.41), as 𝐹_𝜏 = ∑_{𝑘=0}^𝜏 𝑓_𝑘 is (𝜏𝐿 + 𝐿₀)-smooth by Assumption 3.2.1. Let 𝑖 ∈ [𝑖max] be uniformly random on [𝑖max], and let 𝑋̃^𝜏 = 𝑋̃^𝜏_𝑖; note that the distribution of 𝑋̃^𝜏 is the mixture distribution of 𝑋̃^𝜏_1, …, 𝑋̃^𝜏_{𝑖max}. Under the conditions on 𝜂, 𝑖max, by Pinsker's inequality and Lemma 3.6.3,

‖ℒ(𝑋̃^𝜏) − 𝜋_𝜏‖_TV ≤ √((1/2) KL(ℒ(𝑋̃^𝜏) | 𝜋_𝜏)) ≤ 𝜀₂. (3.61)

Note that under 𝐺_𝜏, 𝑋̃^𝑠_𝑖 = 𝑋^𝑠_𝑖 for all 𝑖 ≤ 𝑖max and 𝑠 ≤ 𝜏, so

‖ℒ(𝑋^𝜏) − 𝜋_𝜏‖_TV ≤ P(𝐺^𝑐_𝜏 | ℱ_𝑡) + ‖ℒ(𝑋̃^𝜏_𝑖) − 𝜋_𝜏‖_TV ≤ (𝜏 − 𝑡)𝜀₁ + 𝜀₂. (3.62)

This shows part 2.
For part 3, note that by Assumption 3.2.2,

P_{𝑋∼𝜋_{2𝑡}}(‖𝑋 − 𝑥⋆_{2𝑡}‖ ≥ 𝐶₁/√(2𝑡 + 𝐿₀/𝐿)) ≤ 𝐴𝑒^{−𝑘𝐶₁}. (3.63)

Combining (3.62) and (3.63) for 𝜏 = 2𝑡 gives (3.49).

Finally, note that the proof goes through when 𝑡 = 0, 𝜏 = 1.
3.6.4 Setting the constants; Proof of main theorem
Theorem 3.6.5 (Theorem 3.2.4 with parameters). Suppose that Assumptions 3.2.1, 3.2.2, and 3.2.3 hold, with 𝑘 ≤ 1, 𝑐 = 𝐿₀/𝐿, 𝐿₀ ≥ 𝐿, and ‖𝑋^0 − 𝑥⋆_0‖ ≤ 𝐶/√(𝐿₀/𝐿). Suppose Algorithm 4 is run with parameters 𝜂₀, 𝑖max given by

𝜀₁ = 𝜀/(3𝑇) (3.64)
𝜀₂ = 𝜀/(3⌈log₂(𝑇) + 1⌉) (3.65)
𝐶₁ = (2 + 1/𝑘) log(𝐴/(𝜀₂𝑘²)) (3.66)
R = 100 max{√((𝑑/𝐿) log(max{𝐿, 𝑑/𝐿, 𝐶₁ + D, 1/𝜀₁})), 𝐶₁ + D} (3.67)
𝜂₀ = 𝜀₂²/(2𝐿²R²) (3.68)
𝑖max = ⌈20(𝐶₁ + D)²/(𝜂₀𝜀₂²)⌉ = ⌈40𝐿²R²(𝐶₁ + D)²/𝜀₂⁴⌉ (3.69)

with any constant batch size 𝑏 ≥ 9. Then it outputs a sample 𝑋^𝑡 at each epoch, so that the 𝑋^𝑡 are 𝜀-approximate independent samples of 𝜋_𝑡 (1 ≤ 𝑡 ≤ 𝑇), using 𝑂(𝑖max𝑏) = poly(𝑑, 𝐿, log(𝐴), 1/𝑘, D, 1/𝜀) gradient evaluations at each epoch.

Note that the dependence of 𝑖max on 𝜀 is 𝑖max = Õ_𝜀(1/𝜀⁴).
Proof. We will choose parameters and prove by induction that for 𝑡 = 2^𝑎,

P(𝐺_𝑡 ∩ 𝐻_𝑡) ≥ 1 − 𝑡𝜀₁ − 2(𝑎 + 1)𝜀₂. (3.70)

We will also show that (3.70) implies that if 𝑡 = 2^𝑎 + 𝑏′ for 0 < 𝑏′ ≤ 2^𝑎, then

P(𝐺_𝑡 ∩ 𝐻_{2^𝑎}) ≥ 1 − 𝑡𝜀₁ − 2(𝑎 + 1)𝜀₂ (3.71)
‖ℒ(𝑋^𝑡) − 𝜋_𝑡‖_TV ≤ 𝑡𝜀₁ + (2𝑎 + 3)𝜀₂. (3.72)

With the values of 𝜀₁ and 𝜀₂, (3.72) gives the theorem.¹³

Let 𝜂₀, R be constants to be chosen, and for any 𝑡 ∈ N, let

𝜂_𝑡 = 𝜂₀/(𝑡 + 𝐿₀/𝐿) (3.73)
𝑟_𝑡 = R/√(𝑡 + 𝐿₀/𝐿) (3.74)
𝑆_𝑡 = 6√𝑡 𝐿(R + D) (3.75)
𝜎²_𝑡 = 9𝑡𝐿²(R + D)²/𝑏. (3.76)

By (3.60) and Lemma 3.6.1, when 𝐺_{𝑡−1} ∩ 𝐻_{𝑡−1} holds, the stochastic gradient 𝑔^𝑡_𝑖

¹³In fact, we will show a slightly stronger result: namely, that the distribution of 𝑋^𝑡 conditioned on the filtration ℱ_1 ⊆ ⋯ ⊆ ℱ_{𝑡−1}, where the filtration ℱ_𝜏 includes both the random batch 𝑆 as well as the points in the Markov chain up to time 𝜏, satisfies ‖(ℒ(𝑋^𝑡) | ℱ_{𝑡−1}) − 𝜋_𝑡‖_TV ≤ 𝑡𝜀₁ + (2𝑎 + 3)𝜀₂. This implies that the samples 𝑋^1, 𝑋^2, …, 𝑋^𝑡 are 𝜀-approximately independent with 𝜀 = 𝑡𝜀₁ + (2𝑎 + 3)𝜀₂.
in (3.56) satisfies ‖𝑔^𝑡_𝑖‖ ≤ 𝑆_𝑡 and Var(𝑔^𝑡_𝑖) ≤ 𝜎²_𝑡. We claim that it suffices to choose parameters so that the following hold for each 𝑡 and some 𝐶_𝜉 ≥ √(2𝑑):

𝜀₁ ≥ 𝑖max [exp(−(𝑟²_𝑡 − 16(𝐶₁+D)²/(𝑡+𝐿₀/𝐿) − 𝑖max[2𝜂²_𝑡(𝑆²_𝑡 + 𝐿²(𝑡+𝐿₀/𝐿)²𝑟²_𝑡) + 𝜂_𝑡𝑑])² / (2(2𝜂_𝑡𝑆_𝑡𝑟_𝑡 + 2√𝜂_𝑡 𝐶_𝜉(𝑟_𝑡 + 𝜂_𝑡𝑆_𝑡 + 𝜂_𝑡𝐿(𝑡+𝐿₀/𝐿)𝑟_𝑡) + 𝜂_𝑡𝐶²_𝜉)²)) + exp(−(𝐶²_𝜉 − 𝑑)/8)] (3.77)–(3.78)
𝜂₀ ≤ 𝜀₂²/(𝐿𝑑 + 9𝐿²(R + D)²/𝑏) (3.79)
𝑖max ≥ 20(𝐶₁ + D)²/(𝜂₀𝜀₂²) (3.80)
𝐴𝑒^{−𝑘𝐶₁} ≤ 𝜀₂ (3.81)
𝐶₁ ≥ 𝐶 := (2 + 1/𝑘) log(𝐴/𝑘²). (3.82)

Indeed, suppose these inequalities hold. Lemma 3.6.2 and (3.77) show that 𝜀₁ satisfies the assumption in Lemma 3.6.4. Equations (3.79), (3.80), and (3.82) ensure that 𝜂₀, 𝑖max, and 𝐶 satisfy the conditions in Lemma 3.6.4.

Base case of induction. By assumption ‖𝑋^0 − 𝑥⋆_0‖ ≤ 𝐶₁/√(𝐿₀/𝐿), so 𝐻_0 holds, and the 𝑡 = 1 case of Lemma 3.6.4 shows P(𝐺_1) ≥ 1 − 𝜀₁ and P(𝐺_1 ∩ 𝐻_1) ≥ 1 − (𝜀₁ + 𝜀₂ + 𝐴𝑒^{−𝑘𝐶₁}) ≥ 1 − (𝜀₁ + 2𝜀₂), using (3.81) for the last inequality.

(3.70) implies (3.71), (3.72). This follows from parts 1 and 2 of Lemma 3.6.4.

Induction step. We work with the complements. Let 𝐴_𝑡 = 𝐺_𝑡 ∩ 𝐻_𝑡. By a union bound,

P(𝐴^𝑐_{2𝑡}) ≤ P(𝐴^𝑐_{2𝑡} ∩ 𝐴_𝑡) + P(𝐴^𝑐_𝑡) ≤ P(𝐴^𝑐_{2𝑡} | 𝐴_𝑡) + P(𝐴^𝑐_𝑡). (3.83)
The first term is bounded using Part 3 of Lemma 3.6.4 (and (3.81)): P(𝐴^𝑐_{2𝑡} | 𝐴_𝑡) ≤ 𝑡𝜀₁ + 𝜀₂ + 𝜀₂. The second term is bounded by the induction hypothesis, which says P(𝐴^𝑐_𝑡) ≤ 𝑡𝜀₁ + 2(𝑎 + 1)𝜀₂. Combining these gives P(𝐴^𝑐_{2𝑡}) ≤ 2𝑡𝜀₁ + 2(𝑎 + 2)𝜀₂, completing the induction step.
Showing inequalities. Setting 𝐶₁, 𝜂₀, and 𝑖max as in (3.66), (3.68), and (3.69) (with R to be determined), we get that (3.79), (3.80), and (3.81) are satisfied, as R ≥ √(𝑑/𝐿) and 𝑏 ≥ 9 imply 𝜀₂²/(2𝐿²(R + D)²) ≤ 𝜀₂²/(𝐿𝑑 + 9𝐿²(R + D)²/𝑏). Moreover, setting 𝐶_𝜉 = √(2𝑑 + 8 log(2𝑖max/𝜀₁)) makes 𝑖max exp(−(𝐶²_𝜉 − 𝑑)/8) ≤ 𝜀₁/2. It suffices to show that our choice of R makes

𝜀₁/(2𝑖max) ≥ exp(−(𝑟²_𝑡 − 16(𝐶₁+D)²/(𝑡+𝐿₀/𝐿) − 𝑖max[2𝜂²_𝑡(𝑆²_𝑡 + 𝐿²(𝑡+𝐿₀/𝐿)²𝑟²_𝑡) + 𝜂_𝑡𝑑])² / (2(2𝜂_𝑡𝑆_𝑡𝑟_𝑡 + 2√𝜂_𝑡 𝐶_𝜉(𝑟_𝑡 + 𝜂_𝑡𝑆_𝑡 + 𝜂_𝑡𝐿(𝑡+𝐿₀/𝐿)𝑟_𝑡) + 𝜂_𝑡𝐶²_𝜉)²)) (3.84)
= exp(−(𝑟²_𝑡 − 16(𝐶₁+D)²/(𝑡+𝐿₀/𝐿) − 𝑖max[(2𝜂₀²/(𝑡+𝐿₀/𝐿)²)(16𝑡𝐿²R² + (𝑡+𝐿₀/𝐿)𝐿²R²)])² / (2(8𝜂₀𝐿𝑡R²/(𝑡+𝐿₀/𝐿)² + (2√𝜂₀/√(𝑡+𝐿₀/𝐿)) 𝐶_𝜉 (R/√(𝑡+𝐿₀/𝐿) + 4𝜂₀𝐿R√𝑡/(𝑡+𝐿₀/𝐿) + 𝜂₀𝐿R/√(𝑡+𝐿₀/𝐿)) + 𝜂₀𝐶²_𝜉/(𝑡+𝐿₀/𝐿))²)) (3.85)

⇐ √(2 log(2𝑖max/𝜀₁)) ≤ (𝑟²_𝑡 − (1/(𝑡+𝐿₀/𝐿))(16(𝐶₁+D)² + 40𝑖max𝜂₀²𝐿²R²)) / ((1/(𝑡+𝐿₀/𝐿))(8𝜂₀𝐿R² + 2√𝜂₀ 𝐶_𝜉(R + 5𝜂₀𝐿R) + 𝜂₀𝐶²_𝜉)) (3.86)

⟺ R²/(𝑡+𝐿₀/𝐿) = 𝑟²_𝑡 ≥ (1/(𝑡+𝐿₀/𝐿)) [(8𝜂₀𝐿R² + 2√𝜂₀ 𝐶_𝜉(R + 5𝜂₀𝐿R) + 𝜂₀𝐶²_𝜉) √(2 log(2𝑖max/𝜀₁)) + 16(𝐶₁+D)² + 40𝑖max𝜂₀²𝐿²R²]. (3.87)–(3.88)

Using 𝜂₀ = 𝜀₂²/(2𝐿²(R+D)²) and 𝜂₀𝑖max ≤ 40(𝐶₁+D)²/𝜀₂², it suffices to have

R² ≥ (4𝜀₂²/𝐿 + √2 𝜀₂²𝐶_𝜉/𝐿 + 5𝜀₂³𝐶_𝜉/(𝐿²R²) + 𝜀₂²𝐶²_𝜉/(2𝐿²R²)) √(2 log(2𝑖max/𝜀₁)) + 16(𝐶₁+D)² + 800(𝐶₁+D)². (3.89)
Using 𝜀₂ ≤ 1 ≤ 𝐶_𝜉 and 𝐶_𝜉 ≤ 4√(𝑑 log(2𝑖max/𝜀₁)), the RHS is

≤ (8𝜀₂²𝐶_𝜉/𝐿 + (8𝜀₂²𝐶²_𝜉/(𝐿²R²)) √(log(2𝑖max/𝜀₁))) √(2 log(2𝑖max/𝜀₁)) + 816(𝐶₁+D)² (3.90)
≤ (8𝜀₂²𝑑^{1/2}/𝐿 + 8𝜀₂²𝑑/(𝐿²R²)) · 8 log(2𝑖max/𝜀₁) + 816(𝐶₁+D)². (3.91)

Now note

𝑖max ≤ 10𝐿²R²(𝐶₁+D)²/𝜀₂⁴ (3.92)
2𝑖max/𝜀₁ ≤ 20𝐿²R²(𝐶₁+D)²/(𝜀₂⁴𝜀₁) (3.93)
≤ 200,000𝐿² max{(𝑑/𝐿) log(max{𝐿, 𝑑/𝐿, 𝐶₁+D, 1/𝜀₁}), (𝐶₁+D)²} (𝐶₁+D)²/(𝜀₂⁴𝜀₁) (3.94)
≤ 200,000𝐿² max{(𝑑/𝐿) max{𝐿, 𝑑/𝐿, 𝐶₁+D, 1/𝜀₁}, (𝐶₁+D)²} (𝐶₁+D)²/(𝜀₂⁴𝜀₁) (3.95)
log(2𝑖max/𝜀₁) ≤ log(200,000) + 11 log(max{𝐿, 𝑑/𝐿, 𝐶₁+D, 1/𝜀₁}). (3.96)
We want to show (3.91) ≤ R2; it suffices to show
8𝜀12√𝑑
𝐿8 log
Ç2𝑖max
𝜀1
å≤ R2
4(3.97)
8𝜀12𝑑
𝐿2R28
ñlog
Ç2𝑖max
𝜀1
åô 32
≤ R2
4(3.98)
816 (𝐶1 + D)2 ≤ R2
2. (3.99)
These inequalities hold because
R2 ≥ 10000𝑑
𝐿log
Çmax
®𝐿,
𝑑
𝐿,𝐶1 + D,
1
𝜀1
´å(3.100)
≥ 256𝜀2√𝑑
𝐿
Çlog(200, 000) + 11 log
Çmax
®𝐿,
𝑑
𝐿,𝐶1 + D,
1
𝜀1
´åå(3.101)
105
≥ 256𝜀2√𝑑
𝐿log
Ç2𝑖max
𝜀1
å(3.102)
R4 ≥ 108 𝑑2
𝐿2
Çlog
Çmax
®𝐿,
𝑑
𝐿,𝐶1 + D,
1
𝜀1
´åå2
≥ 256𝜀22𝑑
𝐿2
ñlog
Ç2𝑖max
𝜀1
åô 32
(3.103)
R2 ≥ 104 (𝐶1 + D)2 . (3.104)
3.7 Proof of offline theorem (Theorem 3.2.6)
The proof of Theorem 3.2.6 is similar to the proof of Theorem 3.2.4, except for some
key differences as to how the stochastic gradients are computed and how one defines
the functions “𝐹𝑡”.
We define $F_\beta := \beta F = \beta\sum_{k=1}^T f_k$, where the $\beta$'s will range over the sequence
\[
\beta_t = \begin{cases} 2^t/T, & 0 \le t < \lceil\log_2(T)\rceil\\ 1, & t = \lceil\log_2(T)\rceil. \end{cases} \tag{3.105}
\]
For this choice of $F_\beta$, the offline assumptions, proof, and algorithm are similar to those of the online case.
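As a quick illustration (our own sketch, not part of the thesis), the inverse-temperature schedule (3.105) can be computed as:

```python
import math

def beta_schedule(T):
    """Inverse temperatures beta_t = 2^t / T for 0 <= t < ceil(log2(T)), then beta = 1."""
    n = math.ceil(math.log2(T))
    betas = [2 ** t / T for t in range(n)]
    betas.append(1.0)  # the final epoch runs at the target distribution pi_T
    return betas

print(beta_schedule(1000)[:3], beta_schedule(1000)[-1])  # [0.001, 0.002, 0.004] 1.0
```

Note that the schedule has $\lceil\log_2(T)\rceil + 1$ entries, i.e., logarithmically many epochs.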
Differences in assumptions. We have that $F_\beta$ is $\beta TL$-smooth, which (except for Lemma 3.6.1) is the only way in which Assumption 3.2.1 is used in the proof of Theorem 3.2.4.
Moreover, Assumption 3.2.5 for the offline case implies that $\pi_T^\beta \propto e^{-F_\beta}$ satisfies Assumption 3.2.2 with constants $C$ and $k$ for every $t$. Since the minimizer $x_\beta^\star$ of $F_\beta$ does not change with $t$, $x_\beta^\star$ satisfies Assumption 3.2.3 with constant $\mathcal{D} = 0$.
Differences in algorithm. The step size used in Algorithm 5 at inverse temperature $\beta$ is $\eta_\beta = \frac{\eta_0}{\beta T}$, analogous to the step size used in Algorithm 4. Thus, we note that Algorithm 5 is similar to Algorithm 4 except for a few key differences:
1. The way in which the stochastic gradient $g_i^\beta$ is computed is different. Specifically, in the offline algorithm our stochastic gradient is computed as
\[
g_i^\beta = s + \frac{\beta T}{b}\sum_{k\in S}\left(G_k^{\text{new}} - G_k\right), \tag{3.106}
\]
where $S$ is a multiset of size $b$ chosen with replacement from $\{1,\dots,T\}$ (rather than from $\{1,\dots,t\}$).
2. There are logarithmically many epochs.
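For concreteness, here is a minimal sketch (ours; the array layout and the helper `grad_fn` are assumptions, not the thesis's notation) of the variance-reduced gradient estimate (3.106), with the gradient table refreshed on the sampled minibatch:

```python
import numpy as np

def saga_gradient(x, stored, beta, b, grad_fn, rng):
    """Estimate of grad F_beta(x) = beta * sum_k grad f_k(x), cf. (3.106).

    stored: T x d table of past gradients G_k (refreshed in place).
    In practice the sum s is maintained incrementally rather than recomputed.
    """
    T = stored.shape[0]
    s = beta * stored.sum(axis=0)         # s = beta * sum_k G_k
    S = rng.integers(0, T, size=b)        # multiset of size b, with replacement
    g = s.copy()
    for k in S:
        g_new = grad_fn(k, x)             # G_k^new = grad f_k(x)
        g += (beta * T / b) * (g_new - stored[k])
        stored[k] = g_new                 # refresh the table entry
    return g

# sanity check: if the table already holds grad f_k(x), the estimate is exact
rng = np.random.default_rng(0)
c = rng.normal(size=(5, 2))               # f_k(x) = ||x - c_k||^2 / 2
x = np.ones(2)
stored = x - c                            # exact gradients at x
g = saga_gradient(x, stored, beta=0.5, b=3, grad_fn=lambda k, z: z - c[k], rng=rng)
```

When the stored gradients are stale, the correction term $(G_k^{\text{new}} - G_k)$ is nonzero, but its expectation over $S$ keeps the estimate unbiased.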
We now give the proof in some detail.
Letting $X_i^\beta$ be the iterates at inverse temperature $\beta$, define
\[
G_\beta = \left\{\forall i,\ \left\|X_i^\beta - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta T}}\right\}. \tag{3.107}
\]
Lemma 3.7.1 (Analogue of Lemma 3.6.4). Assume that Assumptions 3.2.1 and 3.2.5 hold. Let $C = \left(2 + \frac{1}{k}\right)\log\left(\frac{A}{k^2}\right)$, $C_1 \ge C$, and suppose
\begin{align}
\eta_0 &\le \frac{\varepsilon_2^2}{Ld + 4L^2\mathcal{R}^2/b} \tag{3.108}\\
i_{\max} &\ge \frac{5C_1^2}{\eta_0\varepsilon_2^2}. \tag{3.109}
\end{align}
Suppose $\varepsilon_1 > 0$ is such that
\[
\mathbb{P}\left(\forall\, 0\le i\le i_{\max},\ \left\|X_i^\beta - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta T}}\ \middle|\ \left\|X_0^\beta - x^\star\right\| \le \frac{C_1}{\sqrt{\beta T}}\right) \ge 1 - \varepsilon_1. \tag{3.110}
\]
Suppose $\left\|X_0^\beta - x^\star\right\| \le \frac{2C_1}{\sqrt{\beta T}}$. Then

1. $\left\|\mathcal{L}(X^\beta) - \pi_T^\beta\right\|_{TV} \le \varepsilon_1 + \varepsilon_2$.

2. For $i \in [i_{\max}]$ chosen at random,
\[
\mathbb{P}\left(\left\|X_i^\beta - x^\star\right\| \le \frac{C_1}{\sqrt{\beta T}}\right) \ge 1 - (\varepsilon_1 + \varepsilon_2 + Ae^{-kC_1}). \tag{3.111}
\]
Proof. First we calculate the distance of the starting point from the stationary distribution:
\[
W_2^2(\delta_{X_0^\beta}, \pi_T^\beta) \le 2\left\|X_0^\beta - x^\star\right\|^2 + 2W_2^2(\delta_{x^\star}, \pi_T^\beta) \le \frac{8C_1^2}{\beta T} + \frac{2C^2}{\beta T} \le \frac{10C_1^2}{\beta T}. \tag{3.112}
\]
Define a toy Markov chain $\widetilde{X}_i^\beta$ coupled to $X_i^\beta$ as follows. Let $\widetilde{X}_0^\beta = X_0^\beta$ and
\[
\widetilde{X}_{i+1}^\beta = \begin{cases}
\widetilde{X}_i^\beta - \eta g_i^\beta + \sqrt{\eta}\,\xi_i, & \text{when } \left\|\widetilde{X}_j^\beta - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta T}} \text{ for all } 0\le j\le i\\
\widetilde{X}_i^\beta - \eta\beta\nabla F(\widetilde{X}_i^\beta), & \text{otherwise.}
\end{cases} \tag{3.113}
\]
By Lemma 3.6.1, the variance of $g_i^\beta$ is at most $\frac{\beta^2T^2L^2}{b}\max_{0\le j\le i}\left\|\widetilde{X}_i^\beta - \widetilde{X}_j^\beta\right\|^2$. If $\left\|\widetilde{X}_i^\beta - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta T}}$ for all $0\le i\le i_{\max}$, then $\left\|\widetilde{X}_i^\beta - \widetilde{X}_j^\beta\right\| \le \frac{2\mathcal{R}}{\sqrt{\beta T}}$ for all $0\le i,j\le i_{\max}$.
Then we can apply Lemma 3.6.3 with $\varepsilon = 2\varepsilon_2^2$, $L \leftarrow L\beta T$, $\sigma^2 \le \frac{(\beta T)^2L^2}{b}\cdot\frac{4\mathcal{R}^2}{\beta T} = \frac{4\beta TL^2\mathcal{R}^2}{b}$, and $W_2^2(\mu_0, \pi) \le \frac{10C_1^2}{\beta T}$. By Pinsker's inequality, for random $i \in [i_{\max}]$,
\[
\left\|\mathcal{L}(\widetilde{X}_i^\beta) - \pi_T^\beta\right\|_{TV} \le \sqrt{\tfrac{1}{2}\operatorname{KL}\left(\mathcal{L}(\widetilde{X}_i^\beta)\ \middle\|\ \pi_T^\beta\right)} \le \varepsilon_2. \tag{3.114}
\]
Under $G_\beta$, $X_i^\beta = \widetilde{X}_i^\beta$ for all $i \le i_{\max}$, so
\[
\left\|\mathcal{L}(X_i^\beta) - \pi_T^\beta\right\|_{TV} \le \mathbb{P}(G_\beta^c) + \left\|\mathcal{L}(\widetilde{X}_i^\beta) - \pi_T^\beta\right\|_{TV} \le \varepsilon_1 + \varepsilon_2. \tag{3.115}
\]
This shows part 1.
For part 2, note that by Assumption 3.2.2,
\[
\mathbb{P}_{X\sim\pi_T^\beta}\left[\|X - x^\star\| \ge \frac{C_1}{\sqrt{\beta T}}\right] \le Ae^{-kC_1}. \tag{3.116}
\]
Combining (3.115) and (3.116) gives part 2.
Theorem 3.7.2 (Theorem 3.2.6 with parameters). Suppose that Assumptions 3.2.1 and 3.2.5 hold, with $k \le 1$ and $\|X_0 - x^\star\| \le C$. Suppose Algorithm 5 is run with parameters $\eta_0$, $i_{\max}$ given by
\begin{align}
\varepsilon_1 &= \frac{\varepsilon}{3\left\lceil\log_2(T) + 1\right\rceil} \tag{3.117}\\
C_1 &= \left(2 + \frac{1}{k}\right)\log\left(\frac{A}{\varepsilon_1 k^2}\right) \tag{3.118}\\
\mathcal{R} &= 100\max\left\{\sqrt{\frac{d}{L}\log\left(\max\left\{L,\frac{d}{L},C_1,\frac{1}{\varepsilon_1}\right\}\right)},\ C_1\right\} \tag{3.119}\\
\eta_0 &= \frac{\varepsilon_1^2}{2L^2\mathcal{R}^2} \tag{3.120}\\
i_{\max} &= \left\lceil\frac{5C_1^2}{\eta_0\varepsilon_1^2}\right\rceil = \left\lceil\frac{10L^2\mathcal{R}^2C_1^2}{\varepsilon_1^4}\right\rceil \tag{3.121}
\end{align}
with any constant batch size $b \ge 4$. Then it outputs $X^1$ whose distribution $\widetilde{P}$ satisfies $\|\widetilde{P} - \pi_T\|_{TV} \le \varepsilon$, using $\widetilde{O}(T) + \operatorname{poly}\log(T)\operatorname{poly}(d, L, C, \varepsilon^{-1})$ gradient evaluations.
Proof. The proof is similar to the proof of Theorem 3.6.5, and we omit the details. We show by induction that
\[
\mathbb{P}\left(\left\|X_i^{\beta_s} - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta_s T}}\right) \ge 1 - 2s\varepsilon_1. \tag{3.122}
\]
The base case follows from $C \le C_1 \le \mathcal{R}$. The induction step follows from noting first that
\[
\left\|X_i^{\beta_s} - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta_s T}} \implies \left\|X_0^{\beta_{s+1}} - x^\star\right\| \le \frac{2\mathcal{R}}{\sqrt{\beta_{s+1} T}}, \tag{3.123}
\]
and then noting that the conditions imply (for $\eta_\beta = \frac{\eta_0}{\beta T}$, $r_\beta = \frac{\mathcal{R}}{\sqrt{\beta T}}$, $S_\beta = 4\sqrt{\beta T}L\mathcal{R}$, $\sigma_\beta^2 = \frac{4\beta TL^2\mathcal{R}^2}{b}$, and $C_\xi = \sqrt{2d + 8\log\left(\frac{2i_{\max}}{\varepsilon_1}\right)}$) that
\begin{align}
\varepsilon_1 &\ge i_{\max}\exp\left(-\frac{\left(r_\beta^2 - \frac{4C_1^2}{\beta T} - i_{\max}\left[2\eta_\beta^2\left(S_\beta^2 + L^2(\beta T)^2 r_\beta^2\right) + \eta_\beta d\right]\right)^2}{2\left(2\eta_\beta S_\beta r_\beta + 2\sqrt{\eta_\beta}\,C_\xi\left(r_\beta + \eta_\beta S_\beta + \eta_\beta L(\beta T)r_\beta\right) + \eta_\beta C_\xi^2\right)^2}\right) \tag{3.124}\\
&\quad + \exp\left(-\frac{C_\xi^2 - d}{8}\right). \tag{3.125}
\end{align}
Then using Lemma 3.6.2, we get that (3.110) is satisfied with $\varepsilon_1$, and the induction step follows from item 2 of Lemma 3.7.1.
Finally, once we have $\left\|X_0^1 - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{T}}$, the conclusion about $X^1$ follows from item 1 of Lemma 3.7.1.
3.8 Simulations
We test our algorithm against other sampling algorithms on a synthetic dataset for
logistic regression. The dataset consists of 𝑇 = 1000 data points in dimension 𝑑 = 20.
We compare the marginal accuracies of the algorithms.
The data is generated as follows. First, $\theta \sim N(0, I_d)$ and $b \sim N(0, 1)$ are randomly generated. For each $1 \le t \le T$, a feature vector $x_t \in \mathbb{R}^d$ and output $y_t \in \{0, 1\}$ are generated by
\begin{align}
x_{t,i} &\sim \operatorname{Bernoulli}\left(\frac{s}{d}\right), \quad 1 \le i \le d \tag{3.126}\\
y_t &\sim \operatorname{Bernoulli}\left(\sigma(\theta^\top x_t + b)\right), \tag{3.127}
\end{align}
where the sparsity is $s = 5$ in our simulations, and $\sigma(x) = \frac{1}{1+e^{-x}}$ is the logistic function. We chose $x_t \in \{0,1\}^d$ because in applications, features are often indicators.
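The generative process (3.126)–(3.127) is a few lines of code (a sketch using the stated parameters; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, s = 1000, 20, 5                                 # data points, dimension, sparsity

theta = rng.normal(size=d)                            # theta ~ N(0, I_d)
b = rng.normal()                                      # b ~ N(0, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                   # logistic function

X = (rng.random((T, d)) < s / d).astype(float)        # x_{t,i} ~ Bernoulli(s/d)   (3.126)
y = (rng.random(T) < sigmoid(X @ theta + b)).astype(int)  # y_t, cf. (3.127)
```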
The algorithms are tested in an online setting as follows. At epoch $t$, each algorithm has access to $(x_s, y_s)$ for $s \le t$, and attempts to generate a sample from the posterior distribution
\[
p_t(\theta) \propto e^{-\frac{\|\theta\|^2}{2}}\,e^{-\frac{b^2}{2}}\prod_{s=1}^t \sigma(\theta^\top x_s + b);
\]
the time per epoch is limited to 0.1 seconds.
We estimate the quality of the samples at 𝑡 = 𝑇 = 1000, by saving the state of the
algorithm at 𝑡 = 𝑇 − 1, and re-running it 1000 times to collect 1000 samples. We
replicate this entire simulation 8 times, and the marginal accuracies of the runs are
given in Figure 3.1.
The marginal accuracy (MA) is a heuristic to compare accuracy of samplers (see e.g. [DMS17], [FOW11], and [C+17]). The marginal accuracy between the measure $\mu$ of a sample and the target $\pi$ is
\[
\mathrm{MA}(\mu, \pi) := 1 - \frac{1}{2d}\sum_{i=1}^d\|\mu_i - \pi_i\|_{TV},
\]
where $\mu_i$ and $\pi_i$ are the marginal distributions of $\mu$ and $\pi$ for the coordinate $x_i$. Since MALA is known to
sample from the correct stationary distribution for the class of distributions analyzed
in this paper, we let 𝜋 be the estimate of the true distribution obtained from 1000
samples generated from running MALA for a long time (1000 steps). We estimate
the TV distance by the TV distance between the histograms when the bin widths are
0.25 times the sample standard deviation for the corresponding coordinate of 𝜋.
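The estimator just described can be sketched as follows (our code; it assumes two arrays of samples, the second being the reference run that plays the role of $\pi$ and sets the bin widths):

```python
import numpy as np

def marginal_accuracy(samples_mu, samples_pi, bin_factor=0.25):
    """MA(mu, pi) = 1 - (1/(2d)) * sum_i TV(mu_i, pi_i), with each coordinate's
    TV distance estimated from histograms whose bin width is
    bin_factor * (sample std of pi's coordinate)."""
    d = samples_pi.shape[1]
    tv = 0.0
    for i in range(d):
        w = bin_factor * samples_pi[:, i].std()
        lo = min(samples_mu[:, i].min(), samples_pi[:, i].min())
        hi = max(samples_mu[:, i].max(), samples_pi[:, i].max())
        edges = np.arange(lo, hi + 2 * w, w)          # cover both sample ranges
        p, _ = np.histogram(samples_mu[:, i], bins=edges)
        q, _ = np.histogram(samples_pi[:, i], bins=edges)
        tv += 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()
    return 1.0 - tv / (2 * d)
```

Identical sample sets give MA = 1, while samples with disjoint supports give MA = 0.5 under this normalization.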
We compare our online SAGA-LD algorithm with SGLD, online Laplace approx-
imation, Polya-Gamma, and MALA. The Laplace method approximates the target
distribution with a multivariate Gaussian distribution. Here, one first finds the mode
of the target distribution using a deterministic optimization technique and then com-
putes the Hessian ∇2𝐹𝑡 of the log-posterior at the mode. The inverse of this Hessian
is the covariance matrix of the Gaussian. In the online version of the algorithm
we use, given in [CL11], to speed up optimization, only a quadratic approximation
(with diagonal Hessian) to the log-posterior is maintained. The Polya-Gamma chain
[DFE18] is a Markov chain specialized to sample from the posterior for logistic re-
111
Algorithm Mean marginal accuracy
SGLD 0.442Online Laplace 0.571
MALA 0.901Polya-Gamma 0.921
SAGA-LD 0.921
Figure 3.1: Marginal accuracies of 5 different sampling algorithms on online logisticregression, with 𝑇 = 1000 data points, dimension 𝑑 = 20, and time 0.1 seconds,averaged over 8 runs. SGLD and online Laplace perform much worse and are notpictured.
gression. Note that in contrast, our algorithm works more generally for any smooth
probability distribution over R𝑑.
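As a concrete illustration of the Laplace approximation described above (a generic offline, full-Hessian sketch of ours, not the online diagonal variant of [CL11]): find the mode by optimization, then use the inverse Hessian of the negative log-posterior at the mode as the Gaussian covariance.

```python
import numpy as np

def laplace_approximation(grad, hess, x0, steps=500, lr=0.1):
    """Gaussian approximation N(mode, H^{-1}), where H is the Hessian of the
    negative log-posterior at the mode (found here by plain gradient descent)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)              # descend the negative log-posterior
    cov = np.linalg.inv(hess(x))          # covariance = inverse Hessian at mode
    return x, cov

# toy check: for a Gaussian posterior N(mu, I) the approximation is exact
mu = np.array([1.0, -2.0])
mode, cov = laplace_approximation(grad=lambda x: x - mu,
                                  hess=lambda x: np.eye(2),
                                  x0=np.zeros(2))
```

For non-Gaussian posteriors the approximation is only as good as the quadratic fit of the log-posterior near the mode, which is why it underperforms on the logistic posterior above.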
The parameters are as follows. The step size at epoch $t$ is $\frac{0.1}{1+0.5t}$ for MALA, $\frac{0.01}{1+0.5t}$ for SGLD, and $\frac{0.05}{1+0.5t}$ for SAGA-LD. A smaller step size must be used with SGLD because of the increased variance. For MALA, a larger step size can be used because the Metropolis-Hastings acceptance step ensures the stationary distribution is correct. The batch size for SGLD and SAGA-LD is 64.
Our results show that SAGA-LD is competitive with the best sampler for logistic
regression, namely, the Polya-Gamma Markov chain.
3.9 Discussion and future work
Comparison to using a regularizer. Recall that one issue in proving Theo-
rem 3.2.4 is that we don’t assume the 𝑓𝑡 are strongly convex. One way to get around
this is to add a strongly convex regularizer, and use existing results for Langevin in
the strongly convex case; however, because we are not leveraging the concentration
that already exists (Assumption 3.2.2), the polynomial dependence is worse.
In the online case, one would have to add $\varepsilon_t\|x - \hat{x}_t\|^2$ to the objective, where $\hat{x}_t$ is an estimate of the mode $x_t^\star$. Assuming we have such an estimate, using results on Langevin in the strongly convex case, to get $\varepsilon$ TV-error we would require $\widetilde{O}\left(\frac{1}{\varepsilon^6}\right)$ steps per epoch, rather than $\widetilde{O}\left(\frac{1}{\varepsilon^4}\right)$ as in the current proof (see Theorem 3.6.5). (Specifically, use [DMM18, Corollary 22] with strong convexity $m = \varepsilon_t$ to get that $\widetilde{O}\left(\frac{1}{\varepsilon^3}\right)$ iterations are required to get KL-error $\varepsilon$, and apply Pinsker's inequality.)
Preconditioning. We would like to obtain similar bounds under more general
assumptions where the covariance matrix could change at each epoch and be ill-
conditioned. This type of distribution arises in reinforcement learning applications
such as Thompson sampling [DFE18], where the data is determined by the user’s
actions. If the user favors actions in certain “optimal” directions, the distribution will
have a much smaller covariance in those directions than in other directions, causing
the covariance matrix of the target distribution to become more ill-conditioned over
time.
Improved bounds for strongly convex functions. Suppose that we dropped the requirement of independence. Note that if we use SAGA-LD with the last sample from the previous epoch, we have a warm start for the previous distribution, and would be able to achieve TV error that decreases with $t$, using $\widetilde{O}_T(1)$ time per epoch. It seems possible to reduce the TV error to $O\left(\frac{\varepsilon}{t^{1/6}}\right)$ this way, and possibly to $O\left(\frac{\varepsilon}{t^{1/4}}\right)$ with stronger drift assumptions. These guarantees may also extend to subexponential distributions.
Distributions over discrete spaces. There has been work on stochastic methods
in the setting of discrete variables [DCW18] that could potentially be used to develop
analogous theory in the discrete case.
Bibliography
[AC93] James H Albert and Siddhartha Chib. “Bayesian analysis of binary and
polychotomous response data”. In: Journal of the American statistical
Association 88.422 (1993), pp. 669–679.
[ADH10] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. “Particle
markov chain Monte Carlo methods”. In: Journal of the Royal Statistical
Society: Series B (Statistical Methodology) 72.3 (2010), pp. 269–342.
[AG12] Shipra Agrawal and Navin Goyal. “Analysis of thompson sampling for
the multi-armed bandit problem”. In: Conference on Learning Theory.
2012, pp. 39–1.
[AG13a] Shipra Agrawal and Navin Goyal. “Further optimal regret bounds for
thompson sampling”. In: Artificial Intelligence and Statistics. 2013, pp. 99–
107.
[AG13b] Shipra Agrawal and Navin Goyal. “Thompson sampling for contextual
bandits with linear payoffs”. In: International Conference on Machine
Learning. 2013, pp. 127–135.
[AG16] Animashree Anandkumar and Rong Ge. “Efficient approaches for escap-
ing higher order saddle points in non-convex optimization”. In: Confer-
ence on learning theory. 2016, pp. 81–102.
[AHK12] Sanjeev Arora, Elad Hazan, and Satyen Kale. “The multiplicative weights
update method: a meta-algorithm and applications”. In: Theory of Com-
puting 8.1 (2012), pp. 121–164.
[Bak+08] Dominique Bakry, Franck Barthe, Patrick Cattiaux, and Arnaud Guillin.
“A simple proof of the Poincare inequality for a large class of probability
measures including the log-concave case”. In: Electron. Commun. Probab
13 (2008), pp. 60–66.
[Bal19] Jonathan Balkind. Who killed Granny? personal communication. 2019.
[BDT16] Rina Foygel Barber, Mathias Drton, and Kean Ming Tan. “Laplace ap-
proximation in high-dimensional Bayesian regression”. In: Statistical Anal-
ysis for High-Dimensional Data. Springer, 2016, pp. 15–36.
[BE85] Dominique Bakry and Michel Emery. “Diffusions hypercontractives”. In:
Seminaire de Probabilites XIX 1983/84. Springer, 1985, pp. 177–206.
[Bel+15] Alexandre Belloni, Tengyuan Liang, Hariharan Narayanan, and Alexan-
der Rakhlin. “Escaping the local minima via simulated annealing: Opti-
mization of approximately convex functions”. In: Conference on Learning
Theory. 2015, pp. 240–265.
[BEL18] Sebastien Bubeck, Ronen Eldan, and Joseph Lehec. “Sampling from a
log-concave distribution with Projected Langevin Monte Carlo”. In: Dis-
crete & Computational Geometry 59.4 (2018), pp. 757–783.
[BGK05] Anton Bovier, Veronique Gayrard, and Markus Klein. “Metastability
in reversible diffusion processes II: Precise asymptotics for small eigen-
values”. In: Journal of the European Mathematical Society 7.1 (2005),
pp. 69–99.
[BGL13] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geome-
try of Markov diffusion operators. Vol. 348. Springer Science & Business
Media, 2013.
[Bha78] RN Bhattacharya. “Criteria for recurrence and existence of invariant
measures for multidimensional diffusions”. In: The Annals of Probabil-
ity (1978), pp. 541–553.
[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. “Latent dirich-
let allocation”. In: Journal of machine Learning research 3.Jan (2003),
pp. 993–1022.
[Bov+02] Anton Bovier, Michael Eckhoff, Veronique Gayrard, and Markus Klein.
“Metastability and Low Lying Spectra in Reversible Markov Chains”. In:
Communications in mathematical physics 228.2 (2002), pp. 219–255.
[Bov+04] Anton Bovier, Michael Eckhoff, Veronique Gayrard, and Markus Klein.
“Metastability in reversible diffusion processes I: Sharp asymptotics for
capacities and exit times”. In: Journal of the European Mathematical
Society 6.4 (2004), pp. 399–424.
[Bro+11] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Hand-
book of markov chain monte carlo. CRC press, 2011.
[Bro+13] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and
Michael I Jordan. “Streaming variational bayes”. In: Advances in Neural
Information Processing Systems. 2013, pp. 1727–1735.
[C+17] Nicolas Chopin, James Ridgway, et al. “Leave Pima Indians alone: bi-
nary regression as a benchmark for Bayesian computation”. In: Statistical
Science 32.1 (2017), pp. 64–87.
[CB17] Trevor Campbell and Tamara Broderick. “Automated Scalable Bayesian
Inference via Hilbert Coresets”. In: arXiv preprint arXiv:1710.05053 (2017).
[CB18a] Trevor Campbell and Tamara Broderick. “Bayesian coreset construction
via greedy iterative geodesic ascent”. In: arXiv preprint arXiv:1802.01737
(2018).
[CB18b] Xiang Cheng and Peter L Bartlett. “Convergence of Langevin MCMC
in KL-divergence”. In: Algorithmic Learning Theory, PMLR. 83. 2018,
pp. 186–211.
[Cha+16] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun.
“Entropy-SGD: Biasing gradient descent into wide valleys”. In: arXiv
preprint arXiv:1611.01838 (2016).
[Cha+18] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and
Michael Jordan. “On the Theory of Variance Reduction for Stochastic
Gradient Monte Carlo”. In: Proceedings of the 35th International Con-
ference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause.
Vol. 80. Proceedings of Machine Learning Research. Stockholmsmassan,
Stockholm Sweden: PMLR, Oct. 2018, pp. 764–773. url: http://proceedings.
mlr.press/v80/chatterji18a.html.
[Che+17] Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan.
“Underdamped Langevin MCMC: A non-asymptotic analysis”. In: arXiv
preprint arXiv:1707.03663 (2017).
[Che+18] Xiang Cheng, Niladri S Chatterji, Yasin Abbasi-Yadkori, Peter L Bartlett,
and Michael I Jordan. “Sharp convergence rates for langevin dynamics
in the nonconvex setting”. In: arXiv preprint arXiv:1805.01648 (2018).
[CL06] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games.
Cambridge university press, 2006.
[CL11] O. Chapelle and L. Li. “An empirical evaluation of Thompson sampling”.
In: Advances in Neural Information Processing Systems. 2011.
[Cov11] Thomas M Cover. “Universal portfolios”. In: The Kelly Capital Growth
Investment Criterion: Theory and Practice. World Scientific, 2011, pp. 181–
209.
[CSI12] Ming-Hui Chen, Qi-Man Shao, and Joseph G Ibrahim. Monte Carlo
methods in Bayesian computation. Springer Science & Business Media,
2012.
[D+91] Persi Diaconis, Daniel Stroock, et al. “Geometric bounds for eigenvalues
of Markov chains”. In: The Annals of Applied Probability 1.1 (1991),
pp. 36–61.
[Dal16] Arnak S Dalalyan. “Theoretical guarantees for approximate sampling
from smooth and log-concave densities”. In: Journal of the Royal Statis-
tical Society: Series B (Statistical Methodology) (2016).
[Dal17] Arnak Dalalyan. “Further and stronger analogy between sampling and
optimization: Langevin Monte Carlo and gradient descent”. In: Proceed-
ings of the 2017 Conference on Learning Theory. Ed. by Satyen Kale
and Ohad Shamir. Vol. 65. Proceedings of Machine Learning Research.
Amsterdam, Netherlands: PMLR, July 2017, pp. 678–689. url: http:
//proceedings.mlr.press/v65/dalalyan17a.html.
[DCW18] Chris De Sa, Vincent Chen, and Wing Wong. “Minibatch Gibbs Sam-
pling on Large Graphical Models”. In: Proceedings of the 35th Inter-
national Conference on Machine Learning. Ed. by Jennifer Dy and An-
dreas Krause. Vol. 80. Proceedings of Machine Learning Research. Stock-
holmsmassan, Stockholm Sweden: PMLR, Oct. 2018, pp. 1165–1173.
url: http://proceedings.mlr.press/v80/desa18a.html.
[DFE18] Bianca Dumitrascu, Karen Feng, and Barbara E Engelhardt. “PG-TS:
Improved Thompson Sampling for Logistic Contextual Bandits”. In: Ad-
vances in neural information processing systems. 2018.
[DFK91] Martin Dyer, Alan Frieze, and Ravi Kannan. “A random polynomial-time
algorithm for approximating the volume of convex bodies”. In: Journal
of the ACM (JACM) 38.1 (1991), pp. 1–17.
[DHW12] Pierre Del Moral, Peng Hu, and Liming Wu. “On the concentration prop-
erties of interacting particle processes”. In: Foundations and Trends in Machine Learning 3.3–4 (2012), pp. 225–389.
[Dia09] Persi Diaconis. “The markov chain monte carlo revolution”. In: Bulletin
of the American Mathematical Society 46.2 (2009), pp. 179–205.
[Dia11] Persi Diaconis. “The mathematics of mixing things up”. In: Journal of
Statistical Physics 144.3 (2011), p. 445.
[DK17] Arnak S Dalalyan and Avetik G Karagulyan. “User-friendly guaran-
tees for the Langevin Monte Carlo with inaccurate gradient”. In: arXiv
preprint arXiv:1710.00095 (2017).
[DM16] Alain Durmus and Eric Moulines. “High-dimensional Bayesian inference
via the Unadjusted Langevin Algorithm”. In: (2016).
[DMM18] Alain Durmus, Szymon Majewski, and B lazej Miasojedow. “Analysis
of Langevin Monte Carlo via convex optimization”. In: arXiv preprint
arXiv:1802.09188 (2018).
[DMS17] Alain Durmus, Eric Moulines, and Eero Saksman. “On the convergence of
Hamiltonian Monte Carlo”. In: arXiv preprint arXiv:1705.00166 (2017).
[Dou+00] Arnaud Doucet, Nando De Freitas, Kevin Murphy, and Stuart Russell.
“Rao-Blackwellised particle filtering for dynamic Bayesian networks”. In:
Proceedings of the Sixteenth conference on Uncertainty in artificial intel-
ligence. Morgan Kaufmann Publishers Inc. 2000, pp. 176–183.
[Dub+16] Kumar Avinava Dubey, Sashank J Reddi, Sinead A Williamson, Barn-
abas Poczos, Alexander J Smola, and Eric P Xing. “Variance reduction
in stochastic gradient Langevin dynamics”. In: Advances in neural infor-
mation processing systems. 2016, pp. 1154–1162.
[Dwi+18] Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. “Log-
concave sampling: Metropolis-Hastings algorithms are fast!” In: Proceed-
ings of the 2018 Conference on Learning Theory, PMLR 75. 2018.
[Fos+18] Dylan J Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik
Sridharan. “Logistic Regression: The Importance of Being Improper”. In:
Proceedings of Machine Learning Research vol 75 (2018), pp. 1–42.
[FOW11] Christel Faes, John T Ormerod, and Matt P Wand. “Variational Bayesian
inference for parametric and nonparametric regression with missing data”.
In: Journal of the American Statistical Association 106.495 (2011), pp. 959–
971.
[FRT14] Cameron E Freer, Daniel M Roy, and Joshua B Tenenbaum. Towards
common-sense reasoning via conditional simulation: legacies of Turing
in Artificial Intelligence. 2014.
[GD17] Francois Giraud and Pierre Del Moral. “Nonasymptotic analysis of adap-
tive and annealed Feynman–Kac particle models”. In: Bernoulli 23.1
(2017), pp. 670–709.
[Ge+15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. “Escaping from saddle
points—online stochastic gradient for tensor decomposition”. In: Confer-
ence on Learning Theory. 2015, pp. 797–842.
[GLR18] Rong Ge, Holden Lee, and Andrej Risteski. “Beyond Log-concavity:
Provable Guarantees for Sampling Multi-modal Distributions using Sim-
ulated Tempering Langevin Monte Carlo”. In: Advances in Neural In-
formation Processing Systems 31. Curran Associates, Inc., 2018. url:
http://tiny.cc/glr17.
[Gru+09] Natalie Grunewald, Felix Otto, Cedric Villani, and Maria G Westdicken-
berg. “A two-scale approach to logarithmic Sobolev inequalities and the
hydrodynamic limit”. In: Annales de l’IHP Probabilites et statistiques.
Vol. 45. 2. 2009, pp. 302–351.
[HAK07] Elad Hazan, Amit Agarwal, and Satyen Kale. “Logarithmic regret al-
gorithms for online convex optimization”. In: Machine Learning 69.2-3
(2007), pp. 169–192.
[Har04] Gilles Harge. “A convex/log-concave correlation inequality for Gaussian
measure and an application to abstract Wiener spaces”. In: Probability
theory and related fields 130.3 (2004), pp. 415–440.
[Haz16] Elad Hazan. “Introduction to online convex optimization”. In: Founda-
tions and Trends R in Optimization 2.3-4 (2016), pp. 157–325.
[HCB16] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. “Coresets
for scalable bayesian logistic regression”. In: Advances in Neural Infor-
mation Processing Systems. 2016, pp. 4080–4088.
[HH13] John Hammersley and D. C. Handscomb. Monte carlo methods. Springer
Science & Business Media, 2013.
[HKL14] Elad Hazan, Tomer Koren, and Kfir Y Levy. “Logistic regression: Tight
bounds for stochastic and online optimization”. In: Conference on Learn-
ing Theory. 2014, pp. 197–209.
[Jin+17] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I
Jordan. “How to escape saddle points efficiently”. In: Proceedings of the
34th International Conference on Machine Learning-Volume 70. JMLR.
org. 2017, pp. 1724–1732.
[Jin+18] Chi Jin, Lydia T Liu, Rong Ge, and Michael I Jordan. “On the local min-
ima of the empirical risk”. In: Advances in Neural Information Processing
Systems. 2018, pp. 4901–4910.
[JS93] Mark Jerrum and Alistair Sinclair. “Polynomial-time approximation al-
gorithms for the Ising model”. In: SIAM Journal on computing 22.5
(1993), pp. 1087–1116.
[JSV04] Mark Jerrum, Alistair Sinclair, and Eric Vigoda. “A polynomial-time
approximation algorithm for the permanent of a matrix with nonnegative
entries”. In: Journal of the ACM (JACM) 51.4 (2004), pp. 671–697.
[KM15] Vladimir Koltchinskii and Shahar Mendelson. “Bounding the smallest
singular value of a random matrix without concentration”. In: Interna-
tional Mathematics Research Notices 2015.23 (2015), pp. 12991–13008.
[KW13] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”.
In: arXiv preprint arXiv:1312.6114 (2013).
[Lel09] Tony Lelievre. “A general two-scale criteria for logarithmic Sobolev in-
equalities”. In: Journal of Functional Analysis 256.7 (2009), pp. 2211–
2221.
[Lia05] Faming Liang. “Determination of normalizing constants for simulated
tempering”. In: Physica A: Statistical Mechanics and its Applications
356.2-4 (2005), pp. 468–480.
[Liu08] Jun S Liu. Monte Carlo strategies in scientific computing. Springer Sci-
ence & Business Media, 2008.
[LM00] Beatrice Laurent and Pascal Massart. “Adaptive estimation of a quadratic
functional by model selection”. In: Annals of Statistics (2000), pp. 1302–
1338.
[LMV19] Holden Lee, Oren Mangoubi, and Nisheeth K Vishnoi. “Online Sampling
from Log-Concave Distributions”. In: arXiv preprint arXiv:1902.08179
(2019).
[LR16] Yuanzhi Li and Andrej Risteski. “Algorithms and matching lower bounds
for approximately-convex optimization”. In: Advances in Neural Infor-
mation Processing Systems. 2016, pp. 4745–4753.
[LS93] Laszlo Lovasz and Miklos Simonovits. “Random walks in a convex body
and an improved volume algorithm”. In: Random structures & algorithms
4.4 (1993), pp. 359–412.
[LV06] Laszlo Lovasz and Santosh Vempala. “Fast Algorithms for Logconcave
Functions: Sampling, Rounding, Integration and Optimization”. In: Pro-
ceedings of the 47th Annual IEEE Symposium on Foundations of Com-
puter Science. FOCS ’06. Washington, DC, USA: IEEE Computer Soci-
ety, 2006, pp. 57–68. isbn: 0-7695-2720-5. doi: 10.1109/FOCS.2006.28.
url: http://dx.doi.org/10.1109/FOCS.2006.28.
[Mar+19] Anton Martinsson, Jianfeng Lu, Benedict Leimkuhler, and Eric Vanden-
Eijnden. “The simulated tempering method in the infinite switch limit
with adaptive weight learning”. In: Journal of Statistical Mechanics: The-
ory and Experiment 2019.1 (2019), p. 013207.
[Men14] Shahar Mendelson. “Learning without concentration”. In: Conference on
Learning Theory. 2014, pp. 25–39.
[MHB16] Stephan Mandt, Matthew Hoffman, and David Blei. “A Variational Anal-
ysis of Stochastic Gradient Algorithms”. In: Proceedings of The 33rd In-
ternational Conference on Machine Learning. Ed. by Maria Florina Bal-
can and Kilian Q. Weinberger. Vol. 48. Proceedings of Machine Learning
Research. New York, New York, USA: PMLR, 20–22 Jun 2016, pp. 354–
363. url: http://proceedings.mlr.press/v48/mandt16.html.
[Mou+19] Wenlong Mou, Nhat Ho, Martin J. Wainwright, Peter Bartlett, and
Michael I. Jordan. “Polynomial-time Algorithm for Power Posterior Sampling in Bayesian Mixture Models”. In: Preprint (2019).
[MP92] Enzo Marinari and Giorgio Parisi. “Simulated tempering: a new Monte
Carlo scheme”. In: EPL (Europhysics Letters) 19.6 (1992), p. 451.
[MR02] Neal Madras and Dana Randall. “Markov chain decomposition for con-
vergence rate analysis”. In: Annals of Applied Probability (2002), pp. 581–
606.
[MS17] Oren Mangoubi and Aaron Smith. “Rapid Mixing of Hamiltonian Monte
Carlo on Strongly Log-Concave Distributions”. In: arXiv preprint arXiv:1708.07114
(2017).
[MV17] Oren Mangoubi and Nisheeth K Vishnoi. “Convex Optimization with
Nonconvex Oracles”. In: arXiv preprint arXiv:1711.02621 (2017).
[Nag+17] Tigran Nagapetyan, Andrew B Duncan, Leonard Hasenclever, Sebastian
J Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. “The true cost of
stochastic gradient Langevin dynamics”. In: arXiv preprint arXiv:1706.02692
(2017).
[Nea00] Radford M Neal. “Markov chain sampling methods for Dirichlet process
mixture models”. In: Journal of computational and graphical statistics
9.2 (2000), pp. 249–265.
[Nea01] Radford M Neal. “Annealed importance sampling”. In: Statistics and
computing 11.2 (2001), pp. 125–139.
[Nea96] Radford M Neal. “Sampling from multimodal distributions using tem-
pered transitions”. In: Statistics and computing 6.4 (1996), pp. 353–366.
[Nic12] Richard Nickl. “Statistical Theory”. In: Statistical Laboratory, Depart-
ment of Pure Mathematics and Mathematical Statistics, University of
Cambridge (2012).
[NR17] Hariharan Narayanan and Alexander Rakhlin. “Efficient sampling from
time-varying log-concave distributions”. In: The Journal of Machine Learn-
ing Research 18.1 (2017), pp. 4017–4045.
[PJT15] Daniel Paulin, Ajay Jasra, and Alexandre Thiery. “Error Bounds for Se-
quential Monte Carlo Samplers for Multimodal Distributions”. In: arXiv
preprint arXiv:1509.08775 (2015).
[PP07] Sanghyun Park and Vijay S Pande. “Choosing weights for simulated
tempering”. In: Physical Review E 76.1 (2007), p. 016703.
[RMW14] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. “Stochas-
tic Backpropagation and Approximate Inference in Deep Generative Mod-
els”. In: International Conference on Machine Learning. 2014, pp. 1278–
1286.
[RRT17] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. “Non-convex
learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic
analysis”. In: Conference on Learning Theory. 2017, pp. 1674–1703.
[Rus+18] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband,
Zheng Wen, et al. “A tutorial on Thompson sampling”. In: Foundations and Trends in Machine Learning 11.1 (2018), pp. 1–96.
[Sch12] Nikolaus Schweizer. “Non-asymptotic error bounds for sequential MCMC
and stability of Feynman-Kac propagators”. In: arXiv preprint arXiv:1204.2382
(2012).
[SR11] David Sontag and Dan Roy. “Complexity of inference in latent dirichlet
allocation”. In: Advances in neural information processing systems. 2011,
pp. 1008–1016.
[SS09] Ashok N Srivastava and Mehran Sahami. Text mining: Classification,
clustering, and applications. Chapman and Hall/CRC, 2009.
[TD14] Christopher Tosh and Sanjoy Dasgupta. “Lower bounds for the Gibbs
sampler over mixtures of Gaussians”. In: International Conference on
Machine Learning. 2014, pp. 1467–1475.
[TRR18] Nicholas G Tawn, Gareth O Roberts, and Jeffrey S Rosenthal. “Weight-
Preserving Simulated Tempering”. In: arXiv preprint arXiv:1808.04782
(2018).
[Vem05] Santosh Vempala. “Geometric random walks: a survey”. In: Combinato-
rial and computational geometry 52.573-612 (2005), p. 2.
[VW19] Santosh S Vempala and Andre Wibisono. “Rapid Convergence of the Un-
adjusted Langevin Algorithm: Log-Sobolev Suffices”. In: arXiv preprint
arXiv:1903.08568 (2019).
[W+08] Martin J Wainwright, Michael I Jordan, et al. “Graphical models, exponential families, and variational inference”. In: Foundations and Trends in Machine Learning 1.1–2 (2008), pp. 1–305.
[WPB11] Chong Wang, John Paisley, and David Blei. “Online variational inference
for the hierarchical Dirichlet process”. In: Proceedings of the Fourteenth
International Conference on Artificial Intelligence and Statistics. 2011,
pp. 752–760.
[WSH09a] Dawn B Woodard, Scott C Schmidler, and Mark Huber. “Conditions for
rapid mixing of parallel and simulated tempering on multimodal distri-
butions”. In: The Annals of Applied Probability (2009), pp. 617–640.
[WSH09b] Dawn B Woodard, Scott C Schmidler, and Mark Huber. “Sufficient con-
ditions for torpid mixing of parallel and simulated tempering”. In: Elec-
tronic Journal of Probability 14 (2009), pp. 780–804.
[WT11] Max Welling and Yee W Teh. “Bayesian learning via stochastic gradient
Langevin dynamics”. In: Proceedings of the 28th International Confer-
ence on Machine Learning (ICML-11). 2011, pp. 681–688.
[YZM17] Nanyang Ye, Zhanxing Zhu, and Rafal Mantiuk. “Langevin Dynamics
with Continuous Tempering for Training Deep Neural Networks”. In: Ad-
vances in Neural Information Processing Systems 30. Ed. by I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and
R. Garnett. Curran Associates, Inc., 2017, pp. 618–626. url: http://
papers.nips.cc/paper/6664-langevin-dynamics-with-continuous-
tempering-for-training-deep-neural-networks.pdf.
[Zhe03] Zhongrong Zheng. “On swapping and simulated tempering algorithms”.
In: Stochastic Processes and their Applications 104.1 (2003), pp. 131–154.
[Zin03] Martin Zinkevich. “Online convex programming and generalized infinites-
imal gradient ascent”. In: Proceedings of the 20th International Confer-
ence on Machine Learning (ICML-03). 2003, pp. 928–936.
[ZLC17] Yuchen Zhang, Percy Liang, and Moses Charikar. “A Hitting Time Anal-
ysis of Stochastic Gradient Langevin Dynamics”. In: Conference on Learn-
ing Theory. 2017, pp. 1980–2022.
Appendix A
Background on Markov chains and
processes
A.1 Markov chains and processes
Throughout, we will use upper-case 𝑃 for probability measures, and lower-case 𝑝 for
the corresponding density function (although we will occasionally abuse notation and
let 𝑝 stand in for the measure, as well).
A discrete-time (time-invariant) Markov chain on Ω is a probability law on a
sequence of random variables (𝑋𝑡)𝑡∈N0 taking values in Ω, such that the next state
X_{t+1} only depends on the previous state X_t, in a fixed way. More formally, letting ℱ_t = σ((X_s)_{0≤s≤t}), there is a transition kernel T on Ω (i.e., T(x, ·) is a probability measure and T(·, A) is a measurable function for any measurable A) such that
P(𝑋𝑡+1 ∈ ·|ℱ𝑡) = 𝑇 (𝑋𝑡, ·). (A.1)
A stationary measure is 𝑃 such that if 𝑋0 ∼ 𝑃 , then 𝑋𝑡 ∼ 𝑃 for all 𝑡. The idea of
Markov chain Monte Carlo is to design a Markov chain whose stationary distribution
is 𝑃 with good mixing; that is, if 𝜋𝑡 is the probability distribution at time 𝑡, then
π_t → P rapidly as t → ∞.
The Markov chains we consider will be discretized versions of continuous-time
Markov processes, so we will mainly work with Markov processes (postponing dis-
cretization analysis until the end).
A continuous-time Markov process is defined not by a single transition kernel T but by a family of kernels (P_t)_{t≥0}, and a more natural object to work with is its generator.
Definition A.1.1. A (continuous-time, time-invariant) Markov process is given
by 𝑀 = (Ω, (𝑃𝑡)𝑡≥0), where each 𝑃𝑡 is a transition kernel. It defines the random
process (𝑋𝑡)𝑡≥0 by
P(X_{s+t} ∈ A | ℱ_s) = P(X_{s+t} ∈ A | X_s) = P_t(X_s, A) = ∫_A P_t(X_s, dy)
where ℱ_s = σ((X_r)_{0≤r≤s}). Define the action of 𝒫_t on functions by

(𝒫_t g)(x) = E_{y∼P_t(x,·)} g(y) = ∫_Ω g(y) P_t(x, dy). (A.2)
A stationary measure is 𝑃 such that if 𝑋0 ∼ 𝑃 , then 𝑋𝑡 ∼ 𝑃 for all 𝑡. A
Markov process with stationary measure 𝑃 is reversible if P𝑡 is self-adjoint with
respect to 𝐿2(𝑃 ), i.e., as measures 𝑃 (𝑑𝑥)𝑃𝑡(𝑥, 𝑑𝑦) = 𝑃 (𝑑𝑦)𝑃𝑡(𝑦, 𝑑𝑥).
Define the generator ℒ by

ℒg = lim_{t↓0} (𝒫_t g − g)/t, (A.3)

and let 𝒟(ℒ) denote the space of g ∈ L²(P) for which ℒg ∈ L²(P) is well-defined. If P is the unique stationary measure, define the Dirichlet form and the variance by

ℰ_M(g, h) = −⟨g, ℒh⟩_P (A.4)
Var_P(g) = ‖g − ∫_Ω g P(dx)‖²_{L²(P)} (A.5)
Note that in order for (P_t)_{t≥0} to be a valid Markov process, it must be the case that 𝒫_t 𝒫_u g = 𝒫_{t+u} g, i.e., (𝒫_t)_{t≥0} forms a Markov semigroup. All the Markov processes we consider will be reversible.
We will use the shorthand E (𝑔) := E (𝑔, 𝑔).
Definition A.1.2. A continuous Markov process (given by Definition A.1.1) satisfies
a Poincaré inequality with constant C if for all g ∈ 𝒟(ℒ),

ℰ_M(g, g) ≥ (1/C) Var_P(g). (A.6)
We will implicitly assume g ∈ 𝒟(ℒ) every time we write ℰ_M(g, g). The largest ρ such that ℰ_M(g, g) ≥ ρ Var_P(g) for all g is the spectral gap Gap(M) of the Markov process.
A Poincaré inequality implies rapid mixing: if C is minimal such that M satisfies a Poincaré inequality with constant C, so that Gap(M) = 1/C, it can be shown that¹

‖𝒫_t g − E_P g‖²_{L²(P)} ≤ e^{−t·Gap(M)} ‖g − E_P g‖²_{L²(P)} = e^{−t/C} ‖g − E_P g‖²_{L²(P)}. (A.7)
We can turn this into a statement about probability distributions, as follows. If the probability distribution at time t is π_t, then setting g = dπ_0/dP (the Radon–Nikodym
¹Note the subtle point that the Poincaré inequality as we defined it only makes sense for g ∈ 𝒟(ℒ), whereas (A.7) makes sense for g ∈ L²(P). For Langevin diffusion, it suffices to show the Poincaré inequality for g ∈ 𝒟(ℒ) to obtain (A.7) for all g ∈ L²(P); see [BGL13]. This will, however, not be an issue for us because we will start with a measure π_0 with smooth density.
derivative) and assuming ‖dπ_0/dP‖_{L²(P)} < ∞, we have

⟨𝒫_t f, dπ_0/dP⟩_{L²(P)} = ∫_Ω 𝒫_t f(x) π_0(dx) = ∫_Ω ∫_Ω f(y) P_t(x, dy) π_0(dx) (A.8)
= ∫_Ω f(y) π_t(dy) = ⟨f, dπ_t/dP⟩_{L²(P)}. (A.9)

If the Markov process is reversible, then ⟨𝒫_t f, dπ_0/dP⟩_{L²(P)} = ⟨f, 𝒫_t(dπ_0/dP)⟩_{L²(P)}. Hence for all f, ⟨f, 𝒫_t(dπ_0/dP)⟩_{L²(P)} = ⟨f, dπ_t/dP⟩_{L²(P)}, so 𝒫_t(dπ_0/dP) = dπ_t/dP. The χ² divergence is defined by χ²(Q||P) = ‖dQ/dP − 1‖²_{L²(P)}, so putting g = dπ_0/dP in (A.7) gives

χ²(π_t||P) ≤ e^{−t/C} χ²(π_0||P). (A.10)
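As a concrete illustration, the decay (A.10) can be checked numerically on a small toy chain (the chain below is a hypothetical example, not from the thesis), using π_t = π_0 e^{tL} and C = 1/Gap(M):

```python
import numpy as np

# Lazy random walk on a 3-point path; L = T - I generates the
# continuous-time chain, with stationary distribution p.
T = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
L = T - np.eye(3)
p = np.array([0.25, 0.5, 0.25])
assert np.allclose(p @ T, p)                      # stationarity

# Reversibility lets us symmetrize: S = D L D^{-1} with D = diag(sqrt(p))
# is symmetric, so e^{tL} = D^{-1} e^{tS} D and Gap(M) is the smallest
# nonzero eigenvalue of -S.
D, Dinv = np.diag(np.sqrt(p)), np.diag(1 / np.sqrt(p))
w, V = np.linalg.eigh(D @ L @ Dinv)
gap = np.sort(-w)[1]

chi2 = lambda q: np.sum((q / p - 1) ** 2 * p)     # chi^2(q || p)
pi0 = np.array([1.0, 0.0, 0.0])                   # start at the left state
for t in [0.5, 1.0, 2.0, 4.0]:
    pi_t = pi0 @ (Dinv @ (V * np.exp(t * w)) @ V.T @ D)   # pi_t = pi_0 e^{tL}
    assert chi2(pi_t) <= np.exp(-t * gap) * chi2(pi0) + 1e-12
print("gap =", gap)
```

For this chain the gap is 1/2, and the bound holds with room to spare (the true decay rate of χ² is in fact twice the gap).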
The following gives one way to prove a Poincaré inequality.
Theorem A.1.3 (Comparison theorem using canonical paths, [D+91]). Suppose Ω is finite. Let T : Ω × Ω → ℝ be a function with T(x, y) ≥ 0 for y ≠ x and ∑_{y∈Ω} T(x, y) = 1. (Think of T as a matrix in ℝ^{Ω×Ω} that operates on functions g : Ω → ℝ, i.e., g ∈ ℝ^Ω.) Let L = T − I, so that L(x, y) = T(x, y) for y ≠ x and L(x, x) = −∑_{y≠x} T(x, y). Consider the Markov process M generated by L (L acts as Lg(j) = ∑_{k≠j} [g(k) − g(j)]T(j, k)); let its Dirichlet form be ℰ(g, g) = −⟨g, Lg⟩ and its stationary distribution be p.

Suppose each pair x, y ∈ Ω, x ≠ y, is associated with a path γ_{x,y}. Define the congestion to be

ρ(γ) = max_{z,w∈Ω, z≠w} [ ( ∑_{γ_{x,y} ∋ (z,w)} |γ_{x,y}| p(x)p(y) ) / (p(z)T(z, w)) ]

where |γ| denotes the length of path γ. Then

Var_p(g) ≤ ρ(γ) ℰ(g, g).
Proof. Note that the statement in [D+91] is for discrete-time Markov chains; we show that our continuous-time result is a simple consequence.

Let ε > 0 be small enough such that T_ε = I + εL = I + ε(T − I) = (1 − ε)I + εT has all entries ≥ 0.² Note that the stationary distribution of M is the same as the stationary distribution of the discrete-time Markov chain generated by T_ε, namely, the (appropriately scaled) eigenvector of T corresponding to the eigenvalue 1. The Dirichlet form for T_ε is −⟨g, (T_ε − I)g⟩ = −ε⟨g, Lg⟩ = εℰ(g, g).

Applying [D+91, Proposition 1′] to T_ε (note T_ε(z, w) = εT(z, w) for z ≠ w) gives

Var_p(g) ≤ max_{z,w∈Ω, z≠w} [ ( ∑_{γ_{x,y} ∋ (z,w)} |γ_{x,y}| p(x)p(y) ) / (p(z)εT(z, w)) ] (εℰ(g, g)) = ρ(γ)ℰ(g, g). (A.11)
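To make Theorem A.1.3 concrete, the following toy computation (a hypothetical example, not from the thesis) evaluates the congestion ρ(γ) for a lazy walk on a 3-point path and checks the resulting Poincaré inequality numerically:

```python
import numpy as np

# Toy chain: lazy random walk on the path {0, 1, 2}, with canonical paths
# running along the line graph.
T = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
p = np.array([0.25, 0.5, 0.25])           # stationary distribution
assert np.allclose(p @ T, p)

# One path gamma_{x,y} per ordered pair x != y, following the line graph.
paths = {(x, y): [(k, k + 1) if x < y else (k, k - 1)
                  for k in range(x, y, 1 if x < y else -1)]
         for x in range(3) for y in range(3) if x != y}

# Congestion rho(gamma): max over directed edges (z, w) of
#   sum_{gamma_{x,y} through (z,w)} |gamma_{x,y}| p(x) p(y) / (p(z) T(z,w)).
rho = max(sum(len(g) * p[x] * p[y] for (x, y), g in paths.items() if (z, w) in g)
          / (p[z] * T[z, w])
          for z in range(3) for w in range(3) if z != w and T[z, w] > 0)

# Check Var_p(g) <= rho * E(g, g) for the generator L = T - I.
L = T - np.eye(3)
dirichlet = lambda g: -p @ (g * (L @ g))  # E(g, g) = -<g, L g>_p
g = np.random.default_rng(0).standard_normal(3)
var = p @ (g - p @ g) ** 2
assert var <= rho * dirichlet(g) + 1e-12
print("congestion rho =", rho)
```

Here ρ(γ) = 2, which for this chain happens to equal 1/Gap(M), so the canonical-path bound is tight.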
A.2 Langevin diffusion
Langevin Monte Carlo is an algorithm for sampling from a measure P with density function p(x) ∝ e^{−f(x)}, given access to ∇f, the gradient of the negative log-pdf. We will always assume that ∫_{ℝ^d} e^{−f(x)} dx < ∞ and f ∈ C²(ℝ^d).
The continuous version, overdamped Langevin diffusion (often simply called Langevin diffusion), is a stochastic process described by the stochastic differential equation

dX_t = −∇f(X_t) dt + √2 dW_t (A.12)

where W_t is the Wiener process (Brownian motion). For us, the crucial fact is that Langevin diffusion converges to the stationary distribution given by p(x) ∝ e^{−f(x)}. The Dirichlet form is

ℰ_M(g, g) = ‖∇g‖²_{L²(P)}. (A.13)

²Alternatively, note that nothing in their proof actually depends on T having all entries ≥ 0, so taking ε = 1 is fine.
Since this depends in a natural way on P, we will also write it as ℰ_P(g, g). A Poincaré inequality for Langevin diffusion thus takes the form

ℰ_P(g, g) = ∫_Ω ‖∇g‖² P(dx) ≥ (1/C) Var_P(g). (A.14)

Showing mixing for Langevin diffusion reduces to showing such an inequality. For strongly log-concave measures, this is a classical result.
Theorem A.2.1 ([BGL13]). Let f be ρ-strongly convex and differentiable. Then for g ∈ 𝒟(ℰ_P), the measure P with density function p(x) ∝ e^{−f(x)} satisfies the Poincaré inequality

ℰ_P(g, g) ≥ ρ Var_P(g).

In particular, this holds for f(x) = ‖x − μ‖²/2 with ρ = 1, giving a Poincaré inequality for the Gaussian distribution.
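As a minimal illustration of how (A.12) is used algorithmically (a sketch with a standard Gaussian as an assumed toy target; this is not code from the thesis), the Euler–Maruyama discretization of Langevin diffusion — the unadjusted Langevin algorithm — samples approximately from p(x) ∝ e^{−f(x)}:

```python
import numpy as np

# Unadjusted Langevin algorithm (ULA): the Euler-Maruyama discretization of
# (A.12), x_{k+1} = x_k - eta * grad_f(x_k) + sqrt(2 * eta) * xi_k.
# Toy target (an assumption for this sketch): f(x) = x^2 / 2, i.e. N(0, 1).
rng = np.random.default_rng(0)
grad_f = lambda x: x

eta, n_steps = 0.01, 2_000
x = np.zeros(10_000)                 # 10,000 chains run in parallel
for _ in range(n_steps):
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal(x.size)

# After n_steps * eta = 20 time units the chains are close to stationarity;
# the empirical moments match N(0, 1) up to O(eta) discretization bias.
print(x.mean(), x.var())
```

The O(η) bias is visible if η is made large; the discretization analysis in the thesis quantifies exactly this kind of error.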
Appendix B
Appendix for Chapter 2
B.1 General log-concave densities
In this section we generalize the main theorem from gaussian to log-concave densities.
B.1.1 Simulated tempering for log-concave densities
First we rework Section 2.7 for log-concave densities.
Theorem B.1.1 (cf. Theorem 2.7.1). Suppose f_0 satisfies Assumption 2.4.1(2) (f_0 is κ-strongly convex, K-smooth, and has minimum at 0).

Let M be the continuous simulated tempering chain for the distributions

p_i ∝ ∑_{j=1}^{m} w_j e^{−β_i f_0(x−μ_j)} (B.1)

with rate Ω(r/D²), relative probabilities r_i, and temperatures 0 < β_1 < · · · < β_L = 1, where

D = max{ max_j ‖μ_j‖, κ^{1/2}/(d^{1/2}K) } (B.2)
β_1 = Θ( κ/(dK²D²) ) (B.3)
β_{i+1}/β_i ≤ 1 + κ/(Kd(ln(K/κ) + 1)) (B.4)
L = Θ( (Kd(ln(K/κ) + 1)²/κ) ln(dKD/κ) ) (B.5)
r = min_i r_i / max_i r_i. (B.6)

Then M satisfies the Poincaré inequality

Var(g) ≤ O(L²D²/r²) ℰ(g, g) = O( K²d²D²(ln(K/κ) + 1)⁴ ln(dKD/κ)² / (κ²r²) ) ℰ(g, g). (B.7)
Proof. Note that forcing D ≥ κ^{1/2}/(d^{1/2}K) (through the maximum in (B.2)) ensures β_1 = O(1).

The proof follows that of Theorem 2.7.1, except that we need to use Lemmas D.2.7 and D.2.6 to bound the χ²-divergences. Steps 1 and 2 are the same: we consider the decomposition where p_{i,j} ∝ e^{−β_i f_0(x−μ_j)} and note ℰ_{i,j} satisfies the Poincaré inequality

Var_{p_{i,j}}(g_i) ≤ (1/(κβ_i)) ℰ_{i,j}(g_i, g_i) = O(D²) ℰ_{i,j}(g_i, g_i). (B.8)

By Lemma D.2.7,

χ²(p_{i−1,j}||p_{i,j}) ≤ e^{ (1/2)|1−β_{i−1}/β_i|·Kd/κ − K|1−β_{i−1}/β_i|(√(ln(K/κ)) + 5)² } (B.9)
· ( (1 − (K/κ)|1 − β_{i−1}/β_i|)(1 + |1 − β_{i−1}/β_i|) )^{−d/2} − 1 = O(1). (B.10)

By Lemma D.2.6,

χ²(p_{1,j′}||p_{1,j}) ≤ e^{ (1/2)β_1κ(2D)² + √(β_1)K(2D)√(d/κ)(√(ln(K/κ)) + 5) } (B.11)
· ( e^{K(2D)√(d/κ)} + √(β_1)K(2D)√(4π/κ) e^{ 2√(β_1)K(2D)√d/√κ + β_1K²(2D)²/(2κ) } ) − 1 = O(1). (B.12)

The rest of the proof is the same.
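For intuition, here is a minimal numerical sketch of simulated tempering Langevin in one dimension. It is a toy illustration, not the thesis's Algorithm 1: it uses only two temperature levels, equal relative probabilities, no partition-function reweighting, and ad hoc parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, betas, eta = 4.0, [0.05, 1.0], 0.01   # modes at +-mu; two temperatures

def density(x, beta):                     # p_beta(x), unnormalized
    return (0.5 * np.exp(-(x - mu) ** 2 / 2)
            + 0.5 * np.exp(-(x + mu) ** 2 / 2)) ** beta

def grad_logp(x, beta):                   # beta * (log mixture density)'
    a, b = np.exp(-(x + mu) ** 2 / 2), np.exp(-(x - mu) ** 2 / 2)
    return -beta * ((x + mu) * a + (x - mu) * b) / (a + b)

x, i, xs = mu, 1, []
for step in range(200_000):
    # Langevin step at the current temperature level.
    x = x + eta * grad_logp(x, betas[i]) + np.sqrt(2 * eta) * rng.standard_normal()
    if step % 10 == 0:                    # Metropolis swap between levels
        j = 1 - i
        if rng.random() < min(1.0, density(x, betas[j]) / density(x, betas[i])):
            i = j
    if i == 1:
        xs.append(x)                      # keep samples at beta = 1 (crude:
xs = np.array(xs)                         # no per-level burn-in)
print("fraction of beta=1 samples left of 0:", (xs < 0).mean())
```

Unlike plain Langevin, which would remain in the mode it started in, the excursions to the flat high-temperature level let the chain cross between the two modes, so both appear among the β = 1 samples.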
Theorem B.1.2 (cf. Theorem 2.7.6). Suppose f_0 satisfies Assumption 2.4.1(2) (f_0 is κ-strongly convex, K-smooth, and has minimum at 0).

Suppose ∑_{j=1}^{m} w_j = 1, w_min = min_{1≤j≤m} w_j > 0, and D = max_{1≤j≤m} ‖μ_j‖. Let M be the continuous simulated tempering chain for the distributions

p_i ∝ ( ∑_{j=1}^{m} w_j e^{−f_0(x−μ_j)} )^{β_i} (B.13)

with rate O(r/D²), relative probabilities r_i, and temperatures 0 < β_1 < · · · < β_L = 1 satisfying the same conditions as in Theorem B.1.1. Then M satisfies the Poincaré inequality

Var(g) ≤ O( L²D²/(r²w³_min) ) ℰ(g, g) = O( K²d²D²(ln(K/κ) + 1)⁴ ln(dKD/κ)² / (κ²r²w³_min) ) ℰ(g, g). (B.14)
Proof. Let p̃_i be the probability distributions in Theorem 2.7.1 with the same parameters as p_i, and let p̃ be the stationary distribution of that simulated tempering chain. By Theorem B.1.1, Var_{p̃}(g) = O(L²D²/r²) ℰ_{p̃}(g, g). By Lemma 2.7.3, p_i/p̃_i ∈ [1, 1/w_min] · (Z̃_i/Z_i). Now use Lemma 2.7.5 with e^Δ = 1/w_min.
B.1.2 Proof of main theorem for log-concave densities
Next we rework Section 2.9 for log-concave densities, and prove the main theorem for
log-concave densities, Theorem 2.4.3.
Lemma B.1.3 (cf. Lemma 2.9.2). Suppose f_0 satisfies Assumption 2.4.1(2) (f_0 is κ-strongly convex, K-smooth, and has minimum at 0).

Suppose that Algorithm 1 is run on f(x) = −ln( ∑_{j=1}^{m} w_j e^{−f_0(x−μ_j)} ) with temperatures 0 < β_1 < · · · < β_ℓ ≤ 1, ℓ ≤ L, with partition function estimates Ẑ_1, . . . , Ẑ_ℓ satisfying

(Ẑ_i/Z_i) / (Ẑ_1/Z_1) ∈ [ (1 − 1/L)^{i−1}, (1 + 1/L)^{i−1} ] (B.15)

for all 1 ≤ i ≤ ℓ. Suppose ∑_{j=1}^{m} w_j = 1, w_min = min_{1≤j≤m} w_j > 0, D = max{ max_j ‖μ_j‖, κ^{1/2}/(d^{1/2}K) }, K ≥ 1, and the parameters satisfy

λ = Θ(1/D²) (B.16)
β_1 = O( κ/(dK²D²) ) (B.17)
β_{i+1}/β_i ≤ 1 + κ/(Kd(ln(K/κ) + 1)) (B.18)
L = Θ( (Kd(ln(K/κ) + 1)²/κ) ln(dKD/κ) ) (B.19)
T = (L²D²/w³_min) · d ln(ℓ/(εw_min)) ln(K/κ) (B.20)
η = O( min{ ε/(D²K^{7/2}(DK/κ^{1/2} + d^{1/2})T), ε/(D^{5/2}K^{3/2}((K/κ)^{1/2} + 1)), ε/(D²K²dT) } ). (B.21)

Let q⁰ be the distribution (1, N(0, (1/(κβ_1))I_d)) on [ℓ] × ℝ^d. The distribution q^T after running time T satisfies ‖p − q^T‖_1 ≤ ε.

Setting ε = O(1/(ℓL)) above and taking n = Ω(L² ln(1/δ)) samples, with probability 1 − δ the estimate

Ẑ_{ℓ+1} = r̂Ẑ_ℓ, r̂ := (1/n) ∑_{j=1}^{n} e^{(−β_{ℓ+1}+β_ℓ)f(x_j)} (B.22)

also satisfies (B.15).
Proof. Begin as in the proof of Lemma 2.9.2. Let p_{β,i} ∝ e^{−βf_0(x−μ_i)} be a probability density function.

Write ‖p − q^T‖_1 ≤ ‖p − p^T‖_1 + ‖p^T − q^T‖_1. Bound the first term by ‖p − p^T‖_1 ≤ √(χ²(P^T||P)) ≤ e^{−T/(2C)} √(χ²(P⁰||P)), where C is the upper bound on the Poincaré constant in Theorem B.1.2. As in (2.203), we get

χ²(p⁰||p) = O(ℓ/w_min) ( 1 + ∑_{j=1}^{m} w_j χ²( N(0, (1/(κβ_1))I_d) || p_{β_1,j} ) ). (B.23)

By Lemma D.2.8 with strong convexity constants κβ_1 and Kβ_1, this is

O( (ℓ/w_min)(K/κ)^{d/2} e^{Kβ_1D²} ) = O( (ℓ/w_min)(K/κ)^{d/2} ) (B.24)

when β_1 = O(1/(KD²)). Thus for T = Ω( C d ln(ℓ/(εw_min)) ln(K/κ) ), ‖p − p^T‖_1 ≤ ε/3.

Again conditioning on the event A that N_T = max{n : T_n ≤ T} = O(Tλ), we get by Lemma 2.8.1 that

‖p^T(·|A) − q^T(·|A)‖_1 = O( η²D⁶K⁷(D²K²/κ + d)T_n + η²D⁵(K/κ + 1) + ηD²K²dT ). (B.25)

Choosing η as in the statement, we get ‖p − q^T‖_1 ≤ ε as before. Finally, apply Lemma 2.9.1, checking the assumptions are satisfied using Lemma D.3.2. The assumptions of Lemma D.3.2 hold, as

(β_{i+1} − β_i)/β_i = O( 1 / ( αKD² + (d/κ)(1 + ln(K/κ)) + (1/κ) ln(1/w_min) ) ). (B.26)
Proof of Theorem 2.4.3. This follows from Lemma B.1.3 in exactly the same way that
the main theorem for gaussians (Theorem 2.4.2) follows from Lemma 2.9.2.
B.2 Perturbation tolerance
The proof of Theorem 2.4.4 will follow immediately from Lemma B.2.2, which is a
straightforward analogue of Lemma 2.9.2.
B.2.1 Simulated tempering for distribution with perturba-
tion
First, we consider the mixing time of the continuous tempering chain, analogously to
Theorem B.1.2:
Theorem B.2.1 (cf. Theorem B.1.2). Suppose f_0 satisfies Assumption 2.4.1. Let M be the continuous simulated tempering chain with rate O(r/D²), relative probabilities r_i, and temperatures 0 < β_1 < · · · < β_L = 1 satisfying the same conditions as in Lemma B.2.2. Then M satisfies the Poincaré inequality

Var(g) ≤ O( e^{3Δ}L²D²/(r²w³_min) ) ℰ(g, g) = O( e^{3Δ}K²d²D²(ln(K/κ) + 1)⁴ ln(dKD/κ)² / (κ²r²w³_min) ) ℰ(g, g). (B.27)
Proof. The proof is almost the same as that of Theorem B.1.2. Let p̃_i be the probability distributions in Theorem 2.7.1 with the same parameters as p_i, and let p̃ be the stationary distribution of that simulated tempering chain. By Theorem B.1.1, Var_{p̃}(g) = O(L²D²/r²) ℰ_{p̃}(g, g). By Lemma 2.7.3, p_i/p̃_i ∈ [1, 1/w_min] · (Z̃_i/Z_i). Now use Lemma 2.7.5 with e^Δ substituted by e^Δ/w_min.
B.2.2 Proof of main theorem with perturbations
Lemma B.2.2 (cf. Lemma B.1.3). Suppose that Algorithm 1 is run on f(x) = −ln( ∑_{j=1}^{m} w_j e^{−f_0(x−μ_j)} ) with temperatures 0 < β_1 < · · · < β_ℓ ≤ 1, ℓ ≤ L, with partition function estimates Ẑ_1, . . . , Ẑ_ℓ satisfying

(Ẑ_i/Z_i) / (Ẑ_1/Z_1) ∈ [ (1 − 1/L)^{i−1}, (1 + 1/L)^{i−1} ] (B.28)

for all 1 ≤ i ≤ ℓ. Suppose ∑_{j=1}^{m} w_j = 1, w_min = min_{1≤j≤m} w_j > 0, D = max{ max_j ‖μ_j‖, κ^{1/2}/(d^{1/2}K) }, K ≥ 1, and the parameters satisfy

λ = Θ(1/D²) (B.29)
β_1 = O( min{ Δ, κ/(dK²D²) } ) (B.30)
β_{i+1}/β_i ≤ min{ Δ, 1 + κ/(Kd(ln(K/κ) + 1)) } (B.31)
L = Θ( (Kd(ln(K/κ) + 1)²/κ) ln(dKD/κ) ) (B.32)
T = ( e^{3Δ}L²D²/w³_min ) · d ln(ℓ/(εw_min)) ln(K/κ) (B.33)
η = O( min{ ε/(D²(K+τ)^{7/2}(D(K+τ)/κ^{1/2} + d^{1/2})T), ε/(D^{5/2}(K+τ)^{3/2}(((K+τ)/κ)^{1/2} + 1)), ε/(D²(K+τ)²dT) } ). (B.34)

Let q⁰ be the distribution (1, N(0, (1/(κβ_1))I_d)) on [ℓ] × ℝ^d. The distribution q^T after running time T satisfies ‖p − q^T‖_1 ≤ ε.

Setting ε = O(1/(ℓL)) above and taking n = Ω(L² ln(1/δ)) samples, with probability 1 − δ the estimate

Ẑ_{ℓ+1} = r̂Ẑ_ℓ, r̂ := (1/n) ∑_{j=1}^{n} e^{(−β_{ℓ+1}+β_ℓ)f(x_j)} (B.35)

also satisfies (B.28).
We prove this theorem by showing that each ingredient of the proof tolerates perturbations of f.
Discretization
We now verify all the discretization lemmas continue to hold with perturbations.
The proof of Lemma 2.8.3, combined with the fact that ‖∇f − ∇f̃‖_∞ ≤ Δ, gives

Lemma B.2.3 (Perturbed reach of continuous chain). Let P^β_T(X) be the Markov kernel corresponding to evolving the Langevin diffusion

dX_t = −β∇f̃(X_t) dt + dB_t

with f̃ and D as defined in Section 2.2, for time T. Then

E[‖X_t − x*‖²] ≲ E[‖X_0 − x*‖²] + ( 400βD²K²τ²/κ + d )T.

Proof. The proof proceeds exactly as in Lemma 2.8.3.
Furthermore, since ‖∇²f(x) − ∇²f̃(x)‖_2 ≤ τ for all x ∈ ℝ^d, from Lemma 2.8.4 we get

Lemma B.2.4 (Perturbed Hessian bound). ‖∇²f̃(x)‖_2 ≤ 4(DK)² + τ for all x ∈ ℝ^d.
As a consequence, the analogue of Lemma 2.8.5 gives:
Lemma B.2.5 (Bounding interval drift). In the setting of Lemma 2.8.5, let x ∈ ℝ^d, i ∈ [L], and let η ≤ α/(σ + τ)². Then

KL(P_T(x, i) || P̂_T(x, i)) ≤ (4D⁶η⁷(K + τ)⁷/3)( ‖x − x*‖₂² + 8Td ) + dTD²η(K + τ)².
Putting these together, we get the analogue of Lemma 2.8.1:
Lemma B.2.6. Fix times 0 < T_1 < · · · < T_n ≤ T. Let p^T, q^T : [L] × ℝ^d → ℝ be defined as follows.

1. p^T is the continuous simulated tempering Markov process as in Definition 2.5.1, but with fixed transition times T_1, . . . , T_n. The component chains are Langevin diffusions on p_i ∝ ( ∑_{j=1}^{m} w_j e^{−f_0(x−μ_j)} )^{β_i}.

2. q^T is the discretized version as in Algorithm 1, again with fixed transition times T_1, . . . , T_n, and with step size η ≤ σ²/2.

Then

KL(P^T || Q^T) ≲ η²D⁶(K + τ)⁶( D²(K + τ)²/κ + d )T_n + η²D³(K + τ)³ max_i E_{x∼P⁰(·,i)} ‖x − x*‖₂² + ηD²(K + τ)²dT,

where x* is the maximizer of ∑_{j=1}^{m} w_j e^{−f_0(x−μ_j)} and satisfies ‖x*‖ = O(D), where D = max_j ‖μ_j‖.
Putting it all together
Finally, we prove Lemma B.2.2.
Proof of Lemma B.2.2. The proof is analogous to the one of Lemma 2.9.2 in com-
bination with the Lemmas from the previous subsections, so we just point out the
differences.
We bound 𝜒2(‹𝑃 ||𝑄0) as follows: by the proof of Lemma 2.9.2, we have 𝜒2(𝑃 ||𝑄0) =
𝑂(
ℓ𝑤min
𝐾𝑑2
). By the definition of 𝜒2, this means
∫𝑞0(𝑥)2
𝑝(𝑥)𝑑𝑥 ≤ 𝑂
Çℓ
𝑤min
𝐾𝑑2
å
143
This in turn implies that
𝜒2(‹𝑃 ||𝑄0) ≤∫
(𝑞0(𝑥))2
𝑝(𝑥)𝑑𝑥 ≤ 𝑂
Çℓ
𝑤min
𝐾𝑑2 𝑒Δ
åThen, analogously as in Lemma 2.9.2, we get
∥∥∥𝑝𝑇 (·|𝐴)− 𝑞𝑇 (·|𝐴)∥∥∥1
= 𝑂
Ñ𝜂2𝐷6(𝐾 + 𝜏)7
Ç𝐷2 (𝐾 + 𝜏)2
𝜅+ 𝑑
å𝑇𝜂 (B.36)
+ 𝜂2𝐷5
Ç𝐾
𝜅+ 1
å+ 𝜂𝐷2(𝐾 + 𝜏)2𝑑𝑇
é. (B.37)
Choosing 𝜂 as in the statement of the lemma,∥∥∥𝑝− 𝑞𝑇
∥∥∥1≤ 𝜀 follows. The rest of the
lemma is identical to Lemma 2.9.2.
B.3 Continuous version of decomposition theorem
We consider the case where I is continuous, and the Markov process is the Langevin process. We will take I = Ω^{(1)} ⊆ ℝ^{d_1} and each Ω_i to be a fixed Ω^{(2)} ⊆ ℝ^{d_2}, so the space is Ω^{(1)} × Ω^{(2)} ⊆ ℝ^{d_1} × ℝ^{d_2} = ℝ^{d_1+d_2}.

The proof of the following is very similar to the proof of Theorem 2.6.3, and to Lemma 5 in [Mou+19]. The main difference is that they bound using an L^∞ norm over Ω^{(1)} × Ω^{(2)}, while we bound using an L^∞ norm over just Ω^{(1)}; thus this bound is stronger, and can prevent having to consider restrictions. We also note that there is an analogue of the theorem for log-Sobolev inequalities [Lel09; Gru+09].
Theorem B.3.1 (Poincaré inequality from marginal and conditional distributions). Consider a probability measure π with C¹ density on Ω = Ω^{(1)} × Ω^{(2)}, where Ω^{(1)} ⊆ ℝ^{d_1} and Ω^{(2)} ⊆ ℝ^{d_2} are closed sets. For X = (X_1, X_2) ∼ P with probability density function p (i.e., P(dx) = p(x) dx and P(dx_2|x_1) = p(x_2|x_1) dx_2), suppose that

∙ the marginal distribution of X_1 satisfies a Poincaré inequality with constant C_1;

∙ for any x_1 ∈ Ω^{(1)}, the conditional distribution X_2|X_1 = x_1 satisfies a Poincaré inequality with constant C_2.

Then π satisfies a Poincaré inequality with constant

C̃ = max{ C_2( 1 + 2C_1 ‖ ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ‖_{L^∞(Ω^{(1)})} ), 2C_1 }. (B.38)

Note that an alternate way to write the integral is

∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 = ∫_{Ω^{(2)}} ∇_{x_1}p(x_2|x_1) · ∇_{x_1}(ln p(x_2|x_1)) dx_2. (B.39)
By adapting this theorem it would be possible, with additional work, to show that Langevin on the space [β_0, 1] × Ω, where the first coordinate is the temperature (that is, including the temperature as a coordinate in the Langevin algorithm), will mix. In the proof we will draw analogies between the discrete and continuous cases.
Proof. To make the analogy, let

ℰ_↔(g, g) = ℰ^{(1)}(g, g) := ∫_Ω ‖∇_{x_1}g(x)‖² P(dx) (B.40)
ℰ_↕(g, g) = ℰ^{(2)}(g, g) := ∫_Ω ‖∇_{x_2}g(x)‖² P(dx) (B.41)

and note ℰ = ℰ_↔ + ℰ_↕ = ℰ^{(1)} + ℰ^{(2)}.

Let P̄ be the x_1-marginal of P, i.e., P̄(A) = P(A × Ω^{(2)}). Given g ∈ L²(Ω^{(1)} × Ω^{(2)}), define ḡ ∈ L²(Ω^{(1)}) by ḡ(x_1) = E_{x_2∼P(·|x_1)}[g(x)]. Analogously to (2.68),

Var_P(g) = ∫_{Ω^{(1)}} ( ∫_{Ω^{(2)}} (g(x) − E_P g)² P(dx_2|x_1) ) P̄(dx_1) (B.42)
= ∫_{Ω^{(1)}} ( ∫_{Ω^{(2)}} (g(x) − E_{x_2∼P(·|x_1)}g(x))² P(dx_2|x_1) + ( E_{x_2∼P(·|x_1)}[g(x)] − E_P[g(x)] )² ) P̄(dx_1) (B.43)
≤ ∫_{Ω^{(1)}} C_2 ℰ_{P(·|x_1)}(g, g) P̄(dx_1) + Var_{P̄}(ḡ) (B.44)
≤ C_2 ℰ^{(2)}(g, g) + C_1 ℰ_{P̄}(ḡ, ḡ). (B.45)

The second term is analogous to the B term in (2.72). (There is no A term.) Note ℰ_{P̄}(ḡ, ḡ) = ∫_{Ω^{(1)}} ‖∇_{x_1}ḡ(x_1)‖² P̄(dx_1), and we can expand ∇_{x_1}ḡ(x_1) by differentiating under the integral sign:

∇_{x_1}ḡ(x_1) = ∇_{x_1}( ∫_{Ω^{(2)}} g(x) P(dx_2|x_1) ) (B.46)
= ∫_{Ω^{(2)}} ∇_{x_1}g(x) P(dx_2|x_1) + ∫_{Ω^{(2)}} g(x) ∇_{x_1}p(x_2|x_1) dx_2. (B.47)

Hence by Cauchy–Schwarz (compare with (2.84)),

ℰ_{P̄}(ḡ, ḡ) ≤ 2[ ∫_Ω ‖∇_{x_1}g(x)‖² P(dx_2|x_1)P̄(dx_1) + ∫_{Ω^{(1)}} ‖ ∫_{Ω^{(2)}} g(x)∇_{x_1}p(x_2|x_1) dx_2 ‖² P̄(dx_1) ]. (B.48)

The first term is 2ℰ^{(1)}(g, g). The second term is bounded by Lemma D.1.2, the continuous analogue of Lemma D.1.1, with g taken to be g(x_1, ·) and p_{x_1}(x_2) = p(x_2|x_1):

∫_{Ω^{(1)}} ‖ ∫_{Ω^{(2)}} g(x)∇_{x_1}p(x_2|x_1) dx_2 ‖² P̄(dx_1) (B.49)
≤ ∫_{Ω^{(1)}} Var_{P(·|x_1)}[g(x)] ( ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ) P̄(dx_1) (B.50)
≤ ∫_{Ω^{(1)}} C_2 ℰ_{P(·|x_1)}(g, g) P̄(dx_1) · ‖ ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ‖_{L^∞(Ω^{(1)})} (B.51)
(by the L¹–L^∞ inequality) (B.52)
= C_2 ℰ^{(2)}(g, g) ‖ ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ‖_{L^∞(Ω^{(1)})}. (B.53)

Hence, recalling (B.45) and (B.48),

ℰ_{P̄}(ḡ, ḡ) ≤ 2ℰ^{(1)}(g, g) + 2C_2 ℰ^{(2)}(g, g) ‖ ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ‖_{L^∞(Ω^{(1)})} (B.54)
Var_P(g) ≤ C_2 ℰ^{(2)}(g, g) + C_1 ℰ_{P̄}(ḡ, ḡ) (B.55)
≤ max{ C_2( 1 + 2C_1 ‖ ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ‖_{L^∞(Ω^{(1)})} ), 2C_1 } ℰ(g, g). (B.56)
B.4 Examples
It might be surprising that sampling from a mixture of gaussians requires a complicated Markov chain such as simulated tempering. However, many simple strategies seem to fail.
Langevin with few restarts One natural strategy to try is simply to run Langevin
a polynomial number of times from randomly chosen locations. While the time to
“escape” a mode and enter a different one could be exponential, we may hope that
each of the different runs “explores” the individual modes, and we somehow stitch
the runs together. The difficulty with this is that when the means of the gaussians
are not well-separated, it’s difficult to quantify how far each of the individual runs
will reach and thus how to combine the various runs.
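This failure mode of plain Langevin is easy to see numerically. In the hypothetical example below (an illustration, not from the thesis), Langevin started in one mode of a well-separated two-Gaussian mixture essentially never visits the other:

```python
import numpy as np

# Langevin on the mixture 0.5 N(-mu, 1) + 0.5 N(mu, 1) in one dimension.
rng = np.random.default_rng(0)
mu, eta = 6.0, 0.01

def grad_f(x):                       # gradient of f = -log(mixture density)
    a, b = np.exp(-(x + mu) ** 2 / 2), np.exp(-(x - mu) ** 2 / 2)
    return ((x + mu) * a + (x - mu) * b) / (a + b)

x, xs = -mu, []                      # start in the left mode
for _ in range(100_000):
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal()
    xs.append(x)
xs = np.array(xs)
# The chain should spend ~half its time right of 0, but essentially never
# crosses: the escape time is exponential in the barrier height ~ mu^2/2.
print("fraction of time in the right mode:", (xs > 0).mean())
```

A run from a random start would explore whichever mode it lands in, which is exactly why stitching independent runs together is delicate when the modes are not well separated.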
Recovering the means of the gaussians Another natural strategy would be
to try to recover the means of the gaussians in the mixture by performing gradient
descent on the log-pdf with a polynomial number of random restarts. The hope
would be that maybe the local minima of the log-pdf correspond to the means of the
gaussians, and with enough restarts, we should be able to find them.
Unfortunately, without substantial modifications this strategy also seems not to work: for instance, in dimension 𝑑, consider a mixture of 𝑑 + 1 gaussians, 𝑑 of them
with means on the corners of a 𝑑-dimensional simplex with a side-length substantially
smaller than the diameter 𝐷 we are considering, and one in the center of the simplex.
In order to discover the mean of the gaussian in the center, we would have to have a
starting point extremely close to the center of the simplex, which in high dimensions
seems difficult.
Additionally, this doesn’t address the issue of robustness to perturbations at all. Though there are algorithms to optimize “approximately” convex functions [Bel+15; LR16], they can typically handle only very small perturbations.
Gaussians with different covariance Our result requires all the gaussians to have the same variance. This is necessary: even if the variances of the gaussians differ only by a factor of 2, there are examples where a simulated tempering chain takes exponential time to converge [WSH09b]. Intuitively, this is illustrated in Figure B.1.
The figure on the left shows the distribution at low temperature: in this case the two modes are separate, and both have significant mass. The figure on the right shows the distribution at high temperature. Note that although in this case the two modes are connected, the volume of the mode with smaller variance is much smaller
(exponentially small in 𝑑). Therefore in high dimensions, even though the modes
can be connected at high temperature, the probability mass associated with a small
variance mode is too small to allow fast mixing.
In the next section, we show that even if we do not restrict to this particular simulated tempering chain, no algorithm can efficiently and robustly sample from a mixture of two Gaussians with different covariances.
Figure B.1: Mixture of two gaussians with different covariances at different temperatures
B.5 Lower bound when Gaussians have different
variance
In this section, we give a lower bound showing that in high dimensions, if the Gaus-
sians can have different covariance matrices, results similar to our Theorem 2.4.2
cannot hold. In particular, we construct a negative log density function f̃ that is close to the negative log density of a mixture of two Gaussians (with different variances), and show that any algorithm must query the function at exponentially many locations in order to sample from the distribution. More precisely, we prove the following theorem:
Theorem B.5.1. There exists a function f̃ that is close to the negative log density f of a mixture of two Gaussians: ‖f − f̃‖_∞ ≤ log 2, and for all x, ‖∇f(x) − ∇f̃(x)‖ ≤ O(d) and ‖∇²f(x) − ∇²f̃(x)‖ ≤ O(d). Let p be the distribution whose density function is proportional to exp(−f̃). There exist constants c > 0, C > 0 such that when d ≥ C, any algorithm with at most 2^{cd} queries to f̃ and ∇f̃ cannot generate a distribution that is within TV-distance 0.3 of p.
In order to prove this theorem, we will first specify the mixture of two Gaussians. Consider a uniform mixture of the two Gaussian distributions N(0, 2I) and N(u, I) (u ∈ ℝ^d) in ℝ^d.

Definition B.5.2. Let f_1 = ‖x‖²/4 + (d/2) log(2√2 π) and f_2 = ‖x − u‖²/2 + (d/2) log(2π). The negative log density f used in the lower bound is

f = −log( (1/2)(e^{−f_1} + e^{−f_2}) ).
In order to prove the lower bound, we will show that there is a function f̃ close to f such that f̃ behaves exactly like the single Gaussian N(0, 2I) on almost all points. Intuitively, any algorithm with only queries to f̃ will not be able to distinguish it from a single Gaussian, and therefore will not be able to find the second component N(u, I). More precisely, we have

Lemma B.5.3. When ‖u‖² ≥ 4d log 2, for any point x outside of the ball with center 2u and radius 1.5‖u‖, we have e^{−f_1(x)} ≥ e^{−f_2(x)}.
Proof. The lemma follows from a simple calculation. In order for e^{−f_1(x)} ≥ e^{−f_2(x)}, since e^x is monotone, we need

−‖x − u‖²/2 ≤ −‖x‖²/4 − (d/4) log 2.

This is a quadratic inequality in x; rearranging the terms, we get

‖x − 2u‖² ≥ d log 2 + 2‖u‖².

Since d log 2 ≤ 0.25‖u‖², this is always satisfied whenever ‖x − 2u‖ ≥ 1.5‖u‖, and hence e^{−f_1(x)} ≥ e^{−f_2(x)}.
The lemma shows that outside of this ball, the contribution from the first Gaussian dominates. Intuitively, we try to make f̃ = f_1 outside of this ball, and f̃ = f inside the ball. To make the function continuous, we shift between the two functions gradually. More precisely, we define f̃ as follows:

Definition B.5.4. The function

f̃(x) = g(x)f_1(x) + (1 − g(x))f(x). (B.57)

Here the function g(x) (see Definition B.5.6) satisfies

g(x) = 1 if ‖x − 2u‖ ≥ 1.6‖u‖; g(x) = 0 if ‖x − 2u‖ ≤ 1.5‖u‖; and g(x) ∈ [0, 1] otherwise.

Also, g(x) is twice differentiable with all first and second order derivatives bounded.
With a carefully constructed g(x), it is possible to prove that f̃ is pointwise close to f in function value, gradient, and Hessian, as stated in the lemma below. Since these are routine calculations, we leave the construction of g(x) and the verification of this lemma to the end of this section.

Lemma B.5.5. For the functions f and f̃ defined in Definitions B.5.2 and B.5.4, if ‖u‖² ≥ 4d log 2, there exists a large enough constant C such that

‖f − f̃‖_∞ ≤ log 2,
‖∇f(x) − ∇f̃(x)‖ ≤ C‖u‖ for all x,
‖∇²f(x) − ∇²f̃(x)‖ ≤ C‖u‖² for all x.
Now we are ready to prove the main theorem:
Proof of Theorem B.5.1. We will show that if we pick u to be a uniformly random vector with ‖u‖² = 8d log 2, there exists a constant c > 0 such that for any algorithm, with probability at least 1 − exp(−cd), in the first exp(cd) iterations of the algorithm there will be no query x ≠ 0 such that cos θ(x, u) ≥ 3/5.

First, by standard concentration inequalities, we know that for any fixed vector x ≠ 0 and a uniformly random u,

Pr[cos θ(x, u) ≥ 3/5] ≤ exp(−c′d)

for some constant c′ > 0 (c′ = 0.01 suffices).

Now, for any algorithm, consider running the algorithm with oracle access to f_1 and to f̃ respectively (if the algorithm is randomized, we also couple the random choices of the algorithm in the two runs). Suppose that when the oracle is f_1 the queries are x_1, x_2, . . . , x_t, and when the oracle is f̃ the queries are x̃_1, . . . , x̃_t.

Let c = c′/2. When t ≤ exp(cd), by a union bound we know that with probability at least 1 − exp(−cd), we have cos θ(x_i, u) < 3/5 for all i ≤ t. On the other hand, every point y in the ball with center 2u and radius 1.6‖u‖ has cos θ(y, u) ≥ 3/5. So we know ‖x_i − 2u‖ > 1.6‖u‖, and hence f_1(x_i) = f̃(x_i) for all i ≤ t (the derivatives are also the same). Therefore, the algorithm gets the same responses whether it has access to f_1 or f̃. This implies x̃_i = x_i for all i ≤ t.

Now, to see why this implies that the output distribution of the last point is far from p, note that when d is large enough, p has mass at least 0.4 in the ball ‖x − 2u‖ ≤ 1.6‖u‖ (because essentially all the mass of the second Gaussian is inside this ball), while the algorithm has probability less than 0.1 of outputting any point in this region. Therefore the TV distance is at least 0.3, and this finishes the proof.
B.5.1 Construction of 𝑔 and closeness of two functions
Now we finish the details of the proof by construction a function 𝑔.
Definition B.5.6. Let ℎ(𝑥) be the following function:
ℎ(𝑥) =
1 𝑥 ≥ 1
0 𝑥 ≤ 0
𝑥2(1− 𝑥)2 + (1− (1− 𝑥)2)2 𝑥 ∈ [0, 1]
We then define 𝑔(𝑥) to be 𝑔(𝑥) := ℎ(10(‖𝑥−2𝑢‖‖𝑢‖ − 1.5
)).
For this function we can prove:

Lemma B.5.7. The function g defined above satisfies

g(x) = 1 for ‖x − 2u‖ ≥ 1.6‖u‖; g(x) = 0 for ‖x − 2u‖ ≤ 1.5‖u‖; and g(x) ∈ [0, 1] otherwise.

Also, g(x) is twice differentiable. There exists a large enough constant C_g > 0 such that for all x,

‖∇g(x)‖ ≤ C_g‖u‖ and ‖∇²g(x)‖ ≤ C_g(‖u‖² + 1).
Proof. First we prove the properties of h(x). Let h_0(x) = x²(1 − x)² + (1 − (1 − x)²)²; it is easy to check that h_0(0) = h′_0(0) = h″_0(0) = 0, h_0(1) = 1, and h′_0(1) = h″_0(1) = 0. Therefore the entire function h(x) is twice differentiable.

Also, we know h′_0(x) = 2x(4x² − 9x + 5), which is nonnegative for x ∈ [0, 1], so h(x) is monotone on [0, 1]. The second derivative is h″_0(x) = 24x² − 36x + 10. Just using the naive bound (the sum of the absolute values of the individual terms), we get |h′_0(x)| ≤ 36 and |h″_0(x)| ≤ 60 for all x ∈ [0, 1]. (We could of course compute better bounds, but this is not important for the proof.)
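The closed-form derivatives and the bounds used above are easy to double-check numerically (a quick sanity check, not part of the thesis):

```python
import numpy as np

# Check the derivative computations for h0 in the proof of Lemma B.5.7.
x = np.linspace(0.0, 1.0, 10_001)
h0 = x ** 2 * (1 - x) ** 2 + (1 - (1 - x) ** 2) ** 2

# Claimed closed forms for h0' and h0''.
h0p = 2 * x * (4 * x ** 2 - 9 * x + 5)
h0pp = 24 * x ** 2 - 36 * x + 10

# Compare against numerical differentiation on the grid.
assert np.allclose(np.gradient(h0, x), h0p, atol=1e-2)
assert np.allclose(np.gradient(h0p, x), h0pp, atol=1e-2)

# h0' >= 0 on [0, 1] (monotonicity), and the bounds |h0'| <= 36, |h0''| <= 60.
assert (h0p >= -1e-12).all()
assert np.abs(h0p).max() <= 36 and np.abs(h0pp).max() <= 60
print("h0 derivative checks pass")
```

The same grid check confirms h_0(0) = 0, h_0(1) = 1, and h′_0(0) = h′_0(1) = 0, so h interpolates smoothly between the two constant pieces.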
Now consider the function g. We know that when ‖x − 2u‖ ∈ [1.5, 1.6]·‖u‖,

∇g(x) = h′( 10( ‖x − 2u‖/‖u‖ − 1.5 ) ) · 10(x − 2u).

Therefore ‖∇g(x)‖ ≤ 36 × 10 × ‖x − 2u‖ ≤ C_g‖u‖ (when C_g is a large enough constant).

For the second order derivative, we know

∇²g(x) = 100 h″( 10( ‖x − 2u‖/‖u‖ − 1.5 ) ) (x − 2u)(x − 2u)^⊤ + 10 h′( 10( ‖x − 2u‖/‖u‖ − 1.5 ) ) I.

Again, by the bounds on h′ and h″, we know there exist large enough constants so that ‖∇²g(x)‖ ≤ C_g(‖u‖² + 1).
Finally, we can prove Lemma B.5.5.

Proof of Lemma B.5.5. We first show that the function values are close. When ‖x − 2u‖ ≤ 1.5‖u‖, by definition f̃(x) = f(x). When ‖x − 2u‖ ≥ 1.5‖u‖, by the properties of g we know f̃(x) is between f(x) and f_1(x). Now by Lemma B.5.3, in this range e^{−f_1(x)} ≥ e^{−f_2(x)}, so f_1(x) − log 2 ≤ f(x) ≤ f_1(x). As a result we know |f(x) − f̃(x)| ≤ log 2.

Next we consider the gradient. Again, when ‖x − 2u‖ ≤ 1.5‖u‖ the two functions (and all their derivatives) are the same. When ‖x − 2u‖ ∈ [1.5, 1.6]·‖u‖, we have

∇f̃(x) = g(x)∇f_1(x) + (1 − g(x))∇f(x) + (f_1(x) − f(x))∇g(x).

By Lemma B.5.7 we have upper bounds for g(x) and ‖∇g(x)‖; also, both ‖∇f_1(x)‖ and ‖∇f(x)‖ can easily be bounded by O(1)‖u‖ in this range, so ‖∇f(x) − ∇f̃(x)‖ ≤ C‖u‖ for a large enough constant C.
When ‖𝑥− 2𝑢‖ ≥ 1.6‖𝑢‖, we know ∇𝑓(𝑥) = ∇𝑓1(𝑥). Calculation shows
∇𝑓1(𝑥)−∇𝑓(𝑥) =𝑒−𝑓2(𝑥)
𝑒−𝑓1(𝑥) + 𝑒−𝑓2(𝑥)(∇𝑓1(𝑥)−∇𝑓2(𝑥)).
When ‖𝑥‖ ≤ 50‖𝑢‖, we have ‖∇𝑓1(𝑥) − ∇𝑓2(𝑥)‖ ≤ 2‖𝑥‖ + 2‖𝑢‖ ≤ 𝑂(1)‖𝑢‖.
When ‖𝑥‖ ≥ 50‖𝑢‖, it is easy to check that 𝑒−𝑓2(𝑥)
𝑒−𝑓1(𝑥)+𝑒−𝑓2(𝑥)≤ exp−‖𝑥‖2/5 and
‖∇𝑓1(𝑥)−∇𝑓2(𝑥)‖ ≤ 2‖𝑥‖, therefore in this case the difference in gradient bounded
by exp(−𝑡2/5)2𝑡 which is always small.
Finally we can check the Hessian. Once again, when \(\|x-2u\| \le 1.5\|u\|\) the two functions are the same. When \(\|x-2u\| \in [1.5, 1.6]\|u\|\), we have
\[\begin{aligned} \nabla^2 \tilde f(x) = {}& g(x)\nabla^2 f_1(x) + (1-g(x))\nabla^2 f(x) \\ & + (\nabla f_1(x) - \nabla f(x))(\nabla g(x))^\top + (\nabla g(x))(\nabla f_1(x) - \nabla f(x))^\top \\ & + (f_1 - f)\nabla^2 g(x). \end{aligned}\]
In this case we get bounds for \(g(x), \nabla g(x), \nabla^2 g(x)\) from Lemma B.5.7; \(\|\nabla f_1(x)\|, \|\nabla f(x)\|\) can still be bounded by \(O(1)\|u\|\), and \(\|\nabla^2 g(x)\|, \|\nabla^2 f_1(x)\|\) can be bounded by \(O(\|u\|^2)\) and \(O(1)\) respectively. Therefore we know \(\|\nabla^2 \tilde f(x) - \nabla^2 f(x)\| \le C\|u\|^2\) for a large enough constant \(C\).
When \(\|x-2u\| \ge 1.6\|u\|\), we have \(\tilde f(x) = f_1(x)\), and
\[\begin{aligned} \nabla^2 f_1(x) - \nabla^2 f(x) = {}& \frac{e^{-f_2(x)}}{e^{-f_1(x)} + e^{-f_2(x)}}\,(\nabla^2 f_1(x) - \nabla^2 f_2(x)) \\ & + \frac{e^{-f_1(x)-f_2(x)}}{(e^{-f_1(x)} + e^{-f_2(x)})^2}\,(\nabla f_1(x) - \nabla f_2(x))(\nabla f_1(x) - \nabla f_2(x))^\top. \end{aligned}\]
Here the first term is always bounded by a constant (because \(e^{-f_2(x)}\) is smaller, and \(\nabla^2 f_1(x) - \nabla^2 f_2(x) = I/2\)). For the second term, by arguments similar to before, when \(\|x\| \le 50\|u\|\) it is bounded by \(O(1)\|u\|^2\). When \(\|x\| \ge 50\|u\|\) we can check that \(\frac{e^{-f_1(x)-f_2(x)}}{(e^{-f_1(x)} + e^{-f_2(x)})^2} \le \exp(-\|x\|^2/5)\) and \(\|(\nabla f_1(x) - \nabla f_2(x))(\nabla f_1(x) - \nabla f_2(x))^\top\| \le 4\|x\|^2\); therefore the second term is bounded by \(\exp(-t^2/5)\cdot 4t^2\) with \(t = \|x\|\), which is no larger than a constant. Combining all the cases, we know there exists a large enough constant \(C\) such that \(\|\nabla^2 \tilde f(x) - \nabla^2 f(x)\| \le C\|u\|^2\) for all \(x\).
Appendix C
Appendix for Chapter 3
C.1 Proof for logistic regression application
C.1.1 Theorem for general posterior sampling, and application to logistic regression
We show that under some general conditions—roughly, that we see data in all directions—
the posterior distribution concentrates. We specialize to logistic regression and show
that the posterior for logistic regression concentrates under reasonable assumptions.
The proof shares elements with the proof of the Bernstein-von Mises theorem (see
e.g. [Nic12]), which says that under some weak smoothness and integrability assump-
tions, the posterior distribution after seeing iid data (asymptotically) approaches a
normal distribution. However, we only need to prove a weaker result—not that the
posterior distribution is close to normal, but just 𝛼𝑇 -strongly log concave in a neigh-
borhood of the MLE, for some 𝛼 > 0; hence, we get good, nonasymptotic bounds.
This is true under more general assumptions; in particular, the data do not have to be iid, as long as we observe data "in all directions."
Theorem C.1.1 (Validity of the assumptions for posterior sampling). Suppose that \(\|\theta_0\| \le B\) and \(x_t \sim P_x(\cdot|x_{1:t-1}, \theta_0)\). Let \(f_t\), \(t \ge 1\), be such that \(P_x(x_t|x_{1:t-1}, \theta) \propto e^{-f_t(\theta)}\), and let \(\pi_t(\theta)\) be the posterior distribution, \(\pi_t(\theta) \propto e^{-\sum_{k=0}^t f_k(\theta)}\). Suppose there are \(M, L, r, \sigma_{\min}, T_{\min} > 0\) and \(\alpha, \beta \ge 0\) such that the following conditions hold:
1. For each 𝑡, 1 ≤ 𝑡 ≤ 𝑇 , 𝑓𝑡(𝜃) is twice continuously differentiable and convex.
2. (Gradients have bounded variation) For each 𝑡, given 𝑥1:𝑡−1,
‖∇𝑓𝑡(𝜃)− E[∇𝑓𝑡(𝜃)|𝑥1:𝑡−1]‖ ≤𝑀. (C.1)
3. (Smoothness) Each 𝑓𝑡 is 𝐿-smooth, for 1 ≤ 𝑡 ≤ 𝑇 .
4. (Strong convexity in neighborhood) Let
\[ I_T(\theta) := \frac{1}{T}\sum_{t=1}^T \nabla^2 f_t(\theta). \tag{C.2} \]
Then for \(T \ge T_{\min}\), with probability \(\ge 1 - \frac{\varepsilon}{2}\),
\[ \forall \theta \in B(\theta_0, r),\quad I_T(\theta) \succeq \sigma_{\min} I_d. \tag{C.3} \]
5. 𝑓0(𝜃) is 𝛼-strongly convex and 𝛽-smooth, and has minimum at 𝜃 = 0.
Let \(\theta^\star_T\) be the minimizer of \(\sum_{t=0}^T f_t(\theta)\), i.e., the MAP estimate for \(\theta\) after observing \(x_{1:T}\). Letting
\[ C = \max\left\{1,\; M\sqrt{2d\log\left(\frac{2d}{\varepsilon}\right)},\; \frac{4d}{\sigma_{\min}}\right\} \]
and \(c = \frac{\alpha}{\sigma_{\min}}\), if \(T \ge T_{\min}\) is such that \(\frac{C\sqrt{T} + \beta B}{\sigma_{\min} T + \alpha} + \frac{C}{\sqrt{T + c}} < r\), then with probability \(1 - \varepsilon\),
the following hold:
1. \(\|\theta^\star_T - \theta_0\| \le \frac{C\sqrt{T} + \beta B}{\sigma_{\min} T + \alpha}\).
2. For \(C' \ge 0\),
\[ \mathbb{P}_{\theta\sim\pi_T}\left(\|\theta - \theta^\star_T\| \ge \frac{C'}{\sqrt{T+c}}\right) \le \frac{K_1}{\sigma_{\min} C\sqrt{T+c}}\left(\frac{(LT+\beta)e}{d}\right)^{\frac d2} e^{\frac12\sigma_{\min}C^2 - \frac{\sigma_{\min}CC'}{2}} \]
for some constant \(K_1\).
The strong convexity condition is analogous to a small-ball inequality [KM15; Men14]
for the sample Fisher information matrix in a neighborhood of the true parameter
value. In the iid case we have concentration (which is necessary for a central limit
theorem to hold, as in the Bernstein-von Mises Theorem); in the non-iid case we do
not necessarily have concentration, but the small-ball inequality can still hold.
We show that under reasonable conditions on the data-generating distribution, logistic regression satisfies the above conditions. Let \(\varphi(x) = \frac{1}{1+e^{-x}}\) be the logistic function. Note that \(\varphi(-x) = 1 - \varphi(x)\).
Applying Theorem C.1.1 to the setting of logistic regression, we will obtain the
following.
Lemma C.1.2. In the setting of Problem 3.2.1 (logistic regression), suppose that \(\|\theta_0\| \le B\) and the \(u_t \sim P_u\) are iid, where \(P_u\) is a distribution that satisfies the following: for \(u \sim P_u\),
1. (Bounded) \(\|u\|_2 \le M\) with probability 1.
2. (Minimal eigenvalue of Fisher information matrix)
\[ I(\theta_0) := \int_{\mathbb{R}^d} \varphi(u^\top\theta_0)\varphi(-u^\top\theta_0)\, uu^\top \, dP_u \succeq \sigma I_d \tag{C.4} \]
for \(\sigma > 0\).
Let
\[ C = \max\left\{1,\; 2M\sqrt{2d\log\left(\frac{2d}{\varepsilon}\right)},\; \frac{4ed}{\sigma}\right\}. \tag{C.5} \]
Then for \(t > \max\left\{\frac{M^4\log(\frac{2d}{\varepsilon})}{8\sigma^2},\; 4M^2\left(\frac{2eC}{\sigma}+1\right)^2,\; \frac{4eMB\alpha}{\sigma}\right\}\), we have
1. \(\nabla f_k(\theta)\) is \(\frac{M^2}{4}\)-Lipschitz for all \(k \in \mathbb{N}\).
2. For any \(C' \ge 0\) and \(c = \frac{2e\alpha}{\sigma}\),
\[ \mathbb{P}_{\theta\sim\pi_t}\left(\|\theta - \theta^\star_t\| \ge \frac{C'}{\sqrt{T+c}}\right) \le \frac{K_1}{\sigma C\sqrt{T+c}}\left(\frac{\left(\frac{M^2}{4}T+\alpha\right)e}{d}\right)^{\frac d2} e^{\frac{\sigma C^2}{4e} - \frac{\sigma CC'}{4e}} \tag{C.6} \]
for some constant \(K_1\).
3. With probability \(1-\varepsilon\), \(\|\theta^\star_t - \theta_0\| \le \frac{C\sqrt{t} + \alpha B}{\sigma t/(2e) + \alpha}\).
Remark C.1.3. We explain the condition \(I(\theta_0) = \int_{\mathbb{R}^d} \varphi(u^\top\theta_0)\varphi(-u^\top\theta_0)uu^\top\,dP_u \succeq \sigma I_d\). Note that \(\varphi(x)\varphi(-x)\) is bounded away from 0 in a neighborhood of \(x = 0\), and then decays to 0 exponentially in \(x\). Thus, \(I(\theta_0)\) is essentially the second moment, where we ignore vectors that are too large in the direction of \(\pm\theta_0\).
More precisely, we have the following implication:
\[ \mathbb{E}_u\left[uu^\top \mathbf{1}_{|u^\top\theta_0| \le C_1}\right] \succeq \sigma I_d \implies \int_{\mathbb{R}^d} \varphi(u^\top\theta_0)\varphi(-u^\top\theta_0)uu^\top\,dP_u \succeq \varphi(C_1)(1-\varphi(C_1))\,\sigma I_d, \tag{C.7} \]
because \(\varphi(x)\varphi(-x) \ge \varphi(C_1)(1-\varphi(C_1))\) whenever \(|x| \le C_1\). Theorem 3.2.2 is stated with \(C_1 = 2\).
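As a quick numerical sanity check of this implication (a stdlib-only sketch; the helper names are ours, not the thesis's): on \(|x| \le C_1\) the weight \(\varphi(x)\varphi(-x)\) is symmetric and decreasing in \(|x|\), so its minimum is the endpoint value \(\varphi(C_1)(1-\varphi(C_1))\).

```python
import math

def phi(x):
    # logistic function phi(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def logistic_weight(x):
    # phi(x) * phi(-x) = phi(x) * (1 - phi(x))
    return phi(x) * (1.0 - phi(x))

def min_weight_on_interval(c1, steps=2001):
    # grid minimum of phi(x)phi(-x) over |x| <= c1; the minimum sits at x = +-c1
    return min(logistic_weight(-c1 + 2.0 * c1 * i / (steps - 1))
               for i in range(steps))
```

With \(C_1 = 2\) the floor is \(\varphi(2)(1-\varphi(2)) \approx 0.105\), the constant lost when passing from the truncated second moment to \(I(\theta_0)\).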
C.1.2 Proof of Theorem C.1.1
Proof of Theorem C.1.1. Let \(\mathcal{E}\) be the event that (C.3) holds.
Step 1: We bound \(\|\theta^\star_T - \theta_0\|\) with high probability.
We show that with high probability \(\sum_{t=0}^T \nabla f_t(\theta_0)\) is close to 0. Since \(\sum_{t=0}^T \nabla f_t(\theta^\star_T) = 0\), the gradients at \(\theta_0\) and \(\theta^\star_T\) are then close, and by strong convexity we conclude that \(\theta_0\) and \(\theta^\star_T\) are close.
First note that \(\mathbb{E}[f_t(\theta)|x_{1:t-1}] = \int_{\mathbb{R}^d} -\log P_x(x_t|x_{1:t-1}, \theta)\, dP_x(\cdot|x_{1:t-1}, \theta_0)\) is a KL divergence plus the entropy of \(P_x(\cdot|x_{1:t-1}, \theta_0)\) (i.e., a cross-entropy), and hence is minimized at \(\theta = \theta_0\).
Hence \(\frac1T\sum_{t=1}^T \mathbb{E}[\nabla f_t(\theta_0)|x_{1:t-1}] = 0\). Thus by Lemma D.4.2 applied to
\[ \sum_{t=1}^T \nabla f_t(\theta_0) = \sum_{t=1}^T \left[\nabla f_t(\theta_0) - \mathbb{E}[\nabla f_t(\theta_0)|x_{1:t-1}]\right], \tag{C.8} \]
we have by Chernoff's inequality that
\[ \mathbb{P}\left(\left\|\sum_{t=1}^T \nabla f_t(\theta_0)\right\| \ge C\sqrt{T}\right) \le 2d\,e^{-\frac{C^2}{2M^2 d}} \le \frac{\varepsilon}{2} \tag{C.9} \]
when \(\frac{C^2}{2M^2 d} \ge \log\left(\frac{4d}{\varepsilon}\right)\), which happens when \(C \ge M\sqrt{2d\log\left(\frac{4d}{\varepsilon}\right)}\).
Let \(\mathcal{A}\) be the event that \(\left\|\frac1T\sum_{t=1}^T \nabla f_t(\theta_0)\right\| < \frac{C}{\sqrt T}\). Then under \(\mathcal{A}\), for any unit vector \(w\),
\[ \frac1T\sum_{t=0}^T \nabla f_t(\theta_0)^\top w \ge -\frac{C}{\sqrt T} - \frac1T\beta\|\theta_0\| \ge -\frac{C}{\sqrt T} - \frac{\beta B}{T}. \tag{C.10} \]
Let \(w = \frac{\theta^\star_T - \theta_0}{\|\theta^\star_T - \theta_0\|}\). Under the event \(\mathcal{E}\), for \(s > 0\),
\[ \frac1T\sum_{t=0}^T \nabla f_t(\theta_0 + sw)^\top w \ge -\frac{C}{\sqrt T} - \frac{\beta B}{T} + \left(\sigma_{\min} + \frac{\alpha}{T}\right)\min\{s, r\}. \tag{C.11} \]
Hence, if \(\min\{s, r\} > \frac{C\sqrt T + \beta B}{\sigma_{\min}T + \alpha}\), then \(\sum_{t=0}^T \nabla f_t(\theta_0 + sw)^\top w > 0\). Considering \(s = \|\theta^\star_T - \theta_0\|\), for which this gradient is 0, this means that
\[ \|\theta^\star_T - \theta_0\| \le \frac{C\sqrt T + \beta B}{\sigma_{\min}T + \alpha}. \tag{C.12} \]
Step 2: For \(c = \frac{\alpha}{\sigma_{\min}}\), we bound \(\mathbb{P}_{\theta\sim\pi_T}\left(\|\theta - \theta^\star_T\| \ge \frac{C'}{\sqrt{T+c}}\right)\).
Under \(\mathcal{E}\), \(\frac1T\sum_{t=1}^T f_t(\theta)\) is \(\sigma_{\min}\)-strongly convex for \(\theta \in B\left(\theta^\star_T, \frac{C}{\sqrt{T+c}}\right) \subset B(\theta_0, r)\), and \(f_0(\theta)\) is \(\alpha\)-strongly convex.
Let \(r' = r - \frac{C\sqrt T + \beta B}{\sigma_{\min}T + \alpha}\). Under \(\mathcal{A}\), \(B(\theta^\star_T, r') \subset B(\theta_0, r)\). Thus under \(\mathcal{E}\cap\mathcal{A}\), letting
\(w(\theta) := \frac{\theta - \theta^\star_T}{\|\theta - \theta^\star_T\|}\),
\[ \forall \theta \in B(\theta^\star_T, r') \subset B(\theta_0, r),\quad \sum_{t=0}^T \nabla f_t(\theta)^\top w(\theta) \ge (T\sigma_{\min} + \alpha)\|\theta - \theta^\star_T\|. \tag{C.13} \]
Suppose \(T\) is such that \(\frac{C}{\sqrt{T+c}} < r'\), i.e., \(\frac{C\sqrt T + \beta B}{\sigma_{\min}T + \alpha} + \frac{C}{\sqrt{T+c}} < r\). By shifting, we may assume that \(\sum_{t=0}^T f_t(\theta^\star_T) = 0\). Because \(f_t(\theta)\) is \(L\)-smooth for \(1 \le t \le T\) and \(\beta\)-smooth for \(t = 0\),
\[ \sum_{t=0}^T f_t(\theta) \le \frac{LT+\beta}{2}\|\theta - \theta^\star_T\|^2. \tag{C.14} \]
Then for all \(\theta \in B\left(\theta^\star_T, \frac{C}{\sqrt{T+c}}\right)^c\),
\[\begin{aligned} \sum_{t=0}^T f_t(\theta) &\ge \sum_{t=0}^T f_t\left(\theta^\star_T + \tfrac{C}{\sqrt{T+c}}w(\theta)\right) + \sum_{t=0}^T\left[f_t(\theta) - f_t\left(\theta^\star_T + \tfrac{C}{\sqrt{T+c}}w(\theta)\right)\right] && \text{(C.15)} \\ &\ge \frac12(T\sigma_{\min}+\alpha)\frac{C^2}{T+c} + (T\sigma_{\min}+\alpha)\frac{C}{\sqrt{T+c}}\left(\|\theta-\theta^\star_T\| - \frac{C}{\sqrt{T+c}}\right) && \text{(C.16)} \\ &\ge \frac12\sigma_{\min}C^2 + \sigma_{\min}C\sqrt{T+c}\left(\|\theta-\theta^\star_T\| - \frac{C}{\sqrt{T+c}}\right). && \text{(C.17)} \end{aligned}\]
Thus for any \(C' \ge 0\),
\[ \int_{\mathbb{R}^d} e^{-\sum_{t=0}^T f_t(\theta)}\,d\theta \ge \int_{\mathbb{R}^d} e^{-\frac{LT+\beta}{2}\|\theta-\theta^\star_T\|^2}\,d\theta = \left(\frac{2\pi}{LT+\beta}\right)^{\frac d2} \tag{C.18} \]
\[\begin{aligned} \int_{B\left(\theta^\star_T, \frac{C'}{\sqrt{T+c}}\right)^c} e^{-\sum_{t=0}^T f_t(\theta)}\,d\theta &\le \int_{B\left(\theta^\star_T, \frac{C'}{\sqrt{T+c}}\right)^c} e^{-\frac12\sigma_{\min}C^2}\,e^{-\sigma_{\min}C\sqrt{T+c}\left(\|\theta-\theta^\star_T\| - \frac{C}{\sqrt{T+c}}\right)}\,d\theta && \text{(C.19)} \\ &= \int_{\frac{C'}{\sqrt{T+c}}}^\infty \mathrm{Vol}_{d-1}(\mathbb{S}^{d-1})\,\gamma^{d-1}\,e^{\frac12\sigma_{\min}C^2}\,e^{-\sigma_{\min}C\sqrt{T+c}\,\gamma}\,d\gamma && \text{(C.20)} \\ &= \int_{\frac{C'}{\sqrt{T+c}}}^\infty \mathrm{Vol}_{d-1}(\mathbb{S}^{d-1})\,e^{\frac12\sigma_{\min}C^2}\,e^{-\left(\sigma_{\min}C\sqrt{T+c}\,\gamma - (d-1)\log\gamma\right)}\,d\gamma. && \text{(C.21)} \end{aligned}\]
Now, when \(C \ge \max\left\{\frac{2(d-1)}{\sigma_{\min}}, 1\right\}\), we have (using \(\log\gamma \le \gamma\) and \(\sqrt{T+c} \ge 1\))
\[\begin{aligned} \sigma_{\min}C\sqrt{T+c}\,\gamma - (d-1)\log\gamma &\ge \sigma_{\min}C\sqrt{T+c}\,\gamma - (d-1)\gamma && \text{(C.22)} \\ &\ge \sigma_{\min}C\sqrt{T+c}\,\gamma - \frac{\sigma_{\min}C\sqrt{T+c}\,\gamma}{2} && \text{(C.23)} \\ &= \frac{\sigma_{\min}C\sqrt{T+c}\,\gamma}{2}. && \text{(C.24)} \end{aligned}\]
Then by Stirling's formula, for some \(K_1\),
\[\begin{aligned} \text{(C.21)} &\le \mathrm{Vol}_{d-1}(\mathbb{S}^{d-1})\,e^{\frac12\sigma_{\min}C^2}\int_{\frac{C'}{\sqrt{T+c}}}^\infty e^{-\frac{\sigma_{\min}C\sqrt{T+c}\,\gamma}{2}}\,d\gamma && \text{(C.25)} \\ &\le \frac{2\pi^{\frac d2}}{\Gamma\left(\frac d2\right)}\,e^{\frac12\sigma_{\min}C^2}\,\frac{2}{\sigma_{\min}C\sqrt{T+c}}\,e^{-\frac{\sigma_{\min}CC'}{2}} && \text{(C.26)} \\ &\le \frac{K_1}{\sigma_{\min}C\sqrt{T+c}}\left(\frac{2\pi e}{d}\right)^{\frac d2} e^{\frac12\sigma_{\min}C^2 - \frac{\sigma_{\min}CC'}{2}}. && \text{(C.27)} \end{aligned}\]
We bound \(\mathbb{P}_{\theta\sim\pi_T}\left(\|\theta-\theta^\star_T\| \ge \frac{C'}{\sqrt{T+c}}\right)\). By (C.18) and (C.21),
\[\begin{aligned} \mathbb{P}_{\theta\sim\pi_T}\left(\|\theta-\theta^\star_T\| \ge \frac{C'}{\sqrt{T+c}}\right) &= \frac{\int_{B\left(\theta^\star_T, \frac{C'}{\sqrt{T+c}}\right)^c} e^{-\sum_{t=0}^T f_t(\theta)}\,d\theta}{\int_{\mathbb{R}^d} e^{-\sum_{t=0}^T f_t(\theta)}\,d\theta} && \text{(C.28)} \\ &\le \frac{K_1}{\sigma_{\min}C\sqrt{T+c}}\left(\frac{LT+\beta}{2\pi}\right)^{\frac d2}\left(\frac{2\pi e}{d}\right)^{\frac d2} e^{\frac12\sigma_{\min}C^2 - \frac{\sigma_{\min}CC'}{2}} && \text{(C.29)} \\ &= \frac{K_1}{\sigma_{\min}C\sqrt{T+c}}\left(\frac{(LT+\beta)e}{d}\right)^{\frac d2} e^{\frac12\sigma_{\min}C^2 - \frac{\sigma_{\min}CC'}{2}} && \text{(C.30)} \end{aligned}\]
as needed. The requirements on \(C\) are \(C \ge \max\left\{1,\; M\sqrt{2d\log\left(\frac{4d}{\varepsilon}\right)},\; \frac{2d}{\sigma_{\min}}\right\}\), so the theorem follows.
C.1.3 Online logistic regression: Proof of Lemma C.1.2 and Theorem 3.2.2
To prove Lemma C.1.2, we will apply Theorem C.1.1. To do this, we need to verify the conditions in Theorem C.1.1.
Lemma C.1.4. Under the assumptions of Lemma C.1.2,
1. (Gradients have bounded variation) For all \(t\), \(\|\nabla f_t(\theta)\| \le M\) and \(\|\nabla f_t(\theta) - \mathbb{E}\nabla f_t(\theta)\| \le 2M\).
2. (Smoothness) For all \(t\), \(f_t\) is \(\frac{M^2}{4}\)-smooth.
3. (Strong convexity in neighborhood) For \(T \ge \frac{M^4\log(\frac d\varepsilon)}{8\sigma^2}\),
\[ \mathbb{P}\left(\forall\theta\in B\left(\theta_0, \frac1M\right),\; \sum_{t=1}^T \nabla^2 f_t(\theta) \succeq \frac{\sigma}{2e}T I_d\right) \ge 1-\varepsilon. \tag{C.31} \]
Proof. First, we calculate the Hessian of the negative log-likelihood. If \(f_t(\theta) = -\log\varphi(yu^\top\theta)\), then
\[ \nabla f_t(\theta) = -y\,\frac{\varphi(yu^\top\theta)\varphi(-yu^\top\theta)}{\varphi(yu^\top\theta)}\,u = -y\,\varphi(-yu^\top\theta)\,u \tag{C.32} \]
\[ \nabla^2 f_t(\theta) = \varphi(-yu^\top\theta)\varphi(yu^\top\theta)\,uu^\top. \tag{C.33} \]
Note that ‖∇𝑓𝑡(𝜃)‖ ≤ ‖𝑢‖ ≤𝑀 , so the first point follows.
To obtain the expected values, note that \(y = 1\) with probability \(\varphi(u^\top\theta_0)\), and \(y = -1\) with probability \(1 - \varphi(u^\top\theta_0)\), so that
\[\begin{aligned} \mathbb{E}[\nabla^2 f_t(\theta)] &= \mathbb{E}_{(u,y)}\left[\varphi(-yu^\top\theta)\varphi(yu^\top\theta)\,uu^\top\right] && \text{(C.34)} \\ &= \mathbb{E}_u\left[\varphi(u^\top\theta_0)\varphi(-u^\top\theta)\varphi(u^\top\theta)\,uu^\top + (1-\varphi(u^\top\theta_0))\varphi(-u^\top\theta)\varphi(u^\top\theta)\,uu^\top\right] && \text{(C.35)} \\ &= \mathbb{E}_u\left[\varphi(u^\top\theta)(1-\varphi(u^\top\theta))\,uu^\top\right]. && \text{(C.36)} \end{aligned}\]
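The closed forms (C.32)-(C.33) are easy to verify numerically in one dimension; the following stdlib-only sketch (function names are ours) compares them against finite differences of \(f_t(\theta) = -\log\varphi(yu\theta)\).

```python
import math

def phi(x):
    return 1.0 / (1.0 + math.exp(-x))

def f(theta, u, y):
    # one-observation negative log-likelihood: -log phi(y * u * theta)
    return -math.log(phi(y * u * theta))

def grad_f(theta, u, y):
    # (C.32): -y * phi(-y u theta) * u
    return -y * phi(-y * u * theta) * u

def hess_f(theta, u, y):
    # (C.33): phi(-y u theta) * phi(y u theta) * u^2; note it does not depend
    # on the sign of y, which is what makes (C.34)-(C.36) collapse
    return phi(-y * u * theta) * phi(y * u * theta) * u * u

def num_grad(theta, u, y, h=1e-6):
    return (f(theta + h, u, y) - f(theta - h, u, y)) / (2.0 * h)

def num_hess(theta, u, y, h=1e-4):
    return (f(theta + h, u, y) - 2.0 * f(theta, u, y) + f(theta - h, u, y)) / (h * h)
```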
By assumption, at \(\theta = \theta_0\) this satisfies \(\mathbb{E}_u[\varphi(u^\top\theta_0)(1-\varphi(u^\top\theta_0))uu^\top] \succeq \sigma I_d\).
Next, we show that \(\sum_{t=1}^T \nabla^2 f_t(\theta_0)\) is lower-bounded with high probability. Note that \(\|\nabla^2 f_t(\theta_0)\| = \left\|\varphi(-yu^\top\theta_0)\varphi(yu^\top\theta_0)\,uu^\top\right\| \le \frac14 M^2\). (So the second point follows.) By the Matrix Chernoff bound,
\[ \mathbb{P}\left(\sum_{t=1}^T \nabla^2 f_t(\theta_0) \not\succeq \frac\sigma2 T I_d\right) \le d\,e^{-\frac{2\cdot 4^2}{M^4}T\left(\frac\sigma2\right)^2} = d\,e^{-\frac{8\sigma^2 T}{M^4}} \le \varepsilon \tag{C.37} \]
when \(T \ge \frac{M^4\log(\frac d\varepsilon)}{8\sigma^2}\).
Finally, we show that if the minimum eigenvalue of this matrix is bounded away from 0 at \(\theta_0\), then it is also bounded away from 0 in a neighborhood. To see this, note
\[ \frac{\varphi(x+c)(1-\varphi(x+c))}{\varphi(x)(1-\varphi(x))} = \frac{e^{x+c}}{(1+e^{x+c})^2}\cdot\frac{(1+e^x)^2}{e^x} \ge \frac{e^c}{e^{2c}} = e^{-c}. \tag{C.38} \]
Therefore, if \(\sum_{t=1}^T \nabla^2 f_t(\theta_0) \succeq \sigma' I_d\), then for \(\|\theta-\theta_0\|_2 \le \frac1M\) we have \(|u^\top\theta - u^\top\theta_0| \le 1\), so by (C.38),
\[ \sum_{t=1}^T \nabla^2 f_t(\theta) = \sum_{t=1}^T \varphi(u_t^\top\theta)(1-\varphi(u_t^\top\theta))\,u_tu_t^\top \tag{C.39} \]
\[ \succeq \sum_{t=1}^T e^{-1}\varphi(u_t^\top\theta_0)(1-\varphi(u_t^\top\theta_0))\,u_tu_t^\top \succeq \frac{\sigma'}{e} I_d. \tag{C.40} \]
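The ratio bound (C.38) can also be checked numerically; the sketch below (stdlib only, names ours) verifies \(\varphi(x+c)(1-\varphi(x+c)) \ge e^{-|c|}\varphi(x)(1-\varphi(x))\) on a grid, which is the form actually used for shifts of either sign.

```python
import math

def logistic_weight(x):
    # phi(x) * (1 - phi(x))
    p = 1.0 / (1.0 + math.exp(-x))
    return p * (1.0 - p)

def ratio_bound_holds(x, c, tol=1e-12):
    # checks logistic_weight(x + c) / logistic_weight(x) >= e^{-|c|}
    return logistic_weight(x + c) / logistic_weight(x) >= math.exp(-abs(c)) - tol
```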
Therefore,
\[ \mathbb{P}\left(\exists\theta\in B\left(\theta_0,\frac1M\right) :\ \sum_{t=1}^T \nabla^2 f_t(\theta) \not\succeq \frac{\sigma}{2e}T I_d\right) \le \mathbb{P}\left(\sum_{t=1}^T \nabla^2 f_t(\theta_0) \not\succeq \frac\sigma2 T I_d\right) \le \varepsilon. \tag{C.41} \]
Proof of Lemma C.1.2. Part 1 was already shown in Lemma C.1.4.
Lemma C.1.4 shows that the conditions of Theorem C.1.1 are satisfied with \(M \mapsto 2M\), \(L = \frac{M^2}{4}\), \(r = \frac1M\), \(\sigma_{\min} = \frac{\sigma}{2e}\), \(T_{\min} = \frac{M^4\log(\frac{2d}{\varepsilon})}{8\sigma^2}\). Also, \(\alpha = \beta\). We further need to check that the condition on \(t\) implies that \(\frac{C\sqrt t + \beta B}{\sigma_{\min}t + \alpha} + \frac{C}{\sqrt t} < \frac1M\). We have, noting \(\sigma_{\min} \le L\) (the strong convexity is at most the smoothness),
\[ \frac{C\sqrt t + \beta B}{\sigma_{\min}t + \alpha} + \frac{C}{\sqrt t} \le \left(\frac{C}{\sigma_{\min}} + 1\right)\frac{1}{\sqrt{t + \frac\alpha L}} + \frac{\beta B}{\sigma_{\min}\left(t + \frac{\alpha}{\sigma_{\min}}\right)}, \tag{C.42} \]
so it suffices to have each term be \(< \frac{1}{2M}\), and this holds when \(t > 4M^2\left(\frac{C}{\sigma_{\min}} + 1\right)^2 = 4M^2\left(\frac{2eC}{\sigma} + 1\right)^2\) and \(t > \frac{2MB\beta}{\sigma_{\min}} = \frac{4eMB\alpha}{\sigma}\).
Parts 2 and 3 then follow immediately.
Proof of Theorem 3.2.2. Redefine \(\sigma\) such that \(I(\theta_0) \succeq \sigma I_d\) holds. (By Remark C.1.3, this \(\sigma\) is a constant factor times the \(\sigma\) in Theorem 3.2.2.) Theorem 3.2.2 follows from Theorem 3.2.4 once we show that Assumptions 3.2.1, 3.2.2, and 3.2.3 are satisfied. Assumption 3.2.1 is satisfied with \(L_0 = \alpha\) and \(L = \frac{M^2}{4}\). The rest will follow from Lemma C.1.2, except that we need bounds to cover the case \(t \le T_{\min} := \max\left\{\frac{M^4\log(\frac{2d}{\varepsilon})}{8\sigma^2},\; \frac{16e^2M^2C^2}{\sigma^2},\; \frac{4eMB\alpha}{\sigma}\right\}\) as well.
Showing that Assumption 3.2.2 holds. Note \(L \ge \sigma\), so \(\frac{C'}{\sqrt{T + \frac\alpha L}} \ge \frac{C'}{\sqrt{T + \frac{2e\alpha}{\sigma}}}\). For \(t > T_{\min}\), item 2 of Lemma C.1.2 shows Assumption 3.2.2 is satisfied with \(c = \frac\alpha L\) (where \(L = \frac{M^2}{4}\)), \(A_1 = \frac{K_1}{\sigma C}\left(\frac{\left(\frac{M^2}{4}T+\alpha\right)e}{d}\right)^{\frac d2} e^{\frac{\sigma C^2}{4e}}\), and \(k_1 = \frac{\sigma C}{4e}\).
For \(t \le T_{\min}\), we use Lemma D.2.5, which says that if \(p(x) \propto e^{-f(x)}\) on \(\mathbb{R}^d\) and \(f\) is \(\kappa\)-strongly convex and \(K\)-smooth, and \(x^\star = \operatorname{argmin}_x f(x)\), then
\[ \mathbb{P}_{x\sim p}\left(\|x-x^\star\|^2 \ge \frac1\kappa\left(\sqrt d + \sqrt{2t + d\log\left(\frac K\kappa\right)}\right)^2\right) \le e^{-t}. \tag{C.43} \]
In our case, \(\sum_{s=0}^t f_s(x)\) is \(\alpha\)-strongly convex and \((\alpha + T_{\min}L)\)-smooth, so with \(\kappa = \alpha\) and \(K = \alpha + T_{\min}L\),
\[\begin{aligned} \mathbb{P}_{x\sim p}(\|x-x^\star\| \ge \gamma) &\le \exp\left(-\frac{(\gamma\sqrt\kappa - \sqrt d)^2 - d\log\left(\frac K\kappa\right)}{2}\right) && \text{(C.44)} \\ &= e^{\frac d2\left(-1+\log\left(\frac K\kappa\right)\right)}\,e^{\gamma\sqrt{\kappa d} - \frac{\gamma^2\kappa}{2}} && \text{(C.45)} \\ &\le e^{\frac d2\left(-1+\log\left(\frac K\kappa\right)\right) - \left(\gamma - 2\sqrt{\frac d\kappa}\right)\sqrt{\kappa d}}. && \text{(C.46)} \end{aligned}\]
Thus for \(t \le T_{\min}\),
\[ \mathbb{P}_{\theta\sim\pi_t}(\|\theta - \theta^\star_t\| \ge \gamma) \le A_2 e^{-k_2\gamma} \tag{C.47} \]
with
\[ A_2 = e^{\frac d2\left(-1+\log\left(\frac K\kappa\right)\right)} = e^{\frac d2\left(-1+\log\left(\frac{T_{\min}L+\alpha}{\alpha}\right)\right)} \tag{C.48} \]
\[ k_2 = \frac{\sqrt{\kappa d}}{\sqrt{T_{\min} + \frac\alpha L}} = \frac{\sqrt{\alpha d}}{\sqrt{T_{\min} + \frac\alpha L}}. \tag{C.49} \]
Take \(A = \max\{A_1, A_2\}\) and \(k = \min\{k_1, k_2\}\), and note that \(\log(A)\) and \(k^{-1}\) are polynomial in all parameters and \(\log(T)\).
Showing that Assumption 3.2.3 holds. For \(t > T_{\min}\), item 3 of Lemma C.1.2 shows that with probability at least \(1-\varepsilon\) (using \(L \ge \sigma\)),
\[ \|\theta^\star_t - \theta_0\| \le \frac{C\sqrt t + \alpha B}{\sigma t/(2e) + \alpha} \le \left(\frac{C}{\sigma/(2e)} + \frac{\alpha B}{(\sigma/(2e))\sqrt{t + \frac{2e\alpha}{\sigma}}}\right)\frac{1}{\sqrt{t + \frac\alpha L}}. \tag{C.50} \]
Now consider \(t \le T_{\min}\). Since \(F_t\) is strongly convex, the minimizer \(\theta^\star_t\) of \(F_t\) is the unique point where \(\nabla F_t(\theta^\star_t) = 0\). Moreover, \(\left\|\sum_{k=1}^t \nabla f_k(\theta)\right\| \le T_{\min}M\) for \(t \le T_{\min}\). Therefore, since \(f_0\) is \(\alpha\)-strongly convex, we have that \(\|\nabla F_t(\theta)\| = \left\|\nabla f_0(\theta) + \sum_{k=1}^t \nabla f_k(\theta)\right\| > 0\) for all \(\|\theta\| > T_{\min}M\alpha^{-1}\). Therefore, we must have \(\|\theta^\star_t\| \le T_{\min}M\alpha^{-1}\) for all \(t \le T_{\min}\), and hence
\[ \|\theta^\star_t - \theta_0\| \le T_{\min}M\alpha^{-1} + B \quad \forall t \le T_{\min}. \tag{C.51} \]
Set \(\mathcal{D} = 2\max\left\{(T_{\min}M\alpha^{-1} + B)\sqrt{T_{\min} + \frac\alpha L},\; \frac{C}{\sigma/(2e)} + \frac{\sqrt\alpha B}{\sqrt{\sigma/(2e)}}\right\}\). Then Equations (C.50) and (C.51) and the triangle inequality imply that if \(t < \tau\), then \(\|\theta^\star_t - \theta^\star_\tau\| \le \frac{\mathcal{D}}{\sqrt{t + \frac\alpha L}}\). To get Assumption 3.2.3 to hold with probability at least \(1-\varepsilon\) for all \(t, \tau < T\), substitute \(\varepsilon \mapsto \frac\varepsilon T\). \(\mathcal{D}\) is polynomial in all parameters and \(\log(T)\).
Appendix D
Calculations on probability
distributions
D.1 Chi-squared and KL inequalities
Lemma D.1.1. Let \(P, Q\) be probability measures on \(\Omega\) such that \(Q \ll P\), \(\chi^2(Q||P) < \infty\), and let \(g : \Omega \to \mathbb{R}\) satisfy \(g \in L^2(P)\). Then
\[ \left(\int_\Omega g(x)\,P(dx) - \int_\Omega g(x)\,Q(dx)\right)^2 \le \operatorname{Var}_P(g)\,\chi^2(Q||P). \tag{D.1} \]
Proof. Noting that \(\int_\Omega P(dx) - Q(dx) = 0\) and using Cauchy-Schwarz,
\[\begin{aligned} \left(\int_\Omega g(x)\,P(dx) - \int_\Omega g(x)\,Q(dx)\right)^2 &= \left(\int_\Omega \left(g(x) - \mathbb{E}_P[g(x)]\right)(P(dx) - Q(dx))\right)^2 && \text{(D.2)} \\ &\le \left(\int_\Omega \left(g(x) - \mathbb{E}_P[g(x)]\right)^2 P(dx)\right)\left(\int_\Omega \left(1 - \frac{dQ}{dP}\right)^2 P(dx)\right) && \text{(D.3)} \\ &= \operatorname{Var}_P(g)\,\chi^2(Q||P). && \text{(D.4)} \end{aligned}\]
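On a finite state space the inequality is a three-line computation; a stdlib-only sanity-check sketch (function names are ours):

```python
def chi_sq(q, p):
    # chi^2(Q || P) = sum_x (q(x)/p(x))^2 p(x) - 1 on a finite state space
    return sum((qi / pi) ** 2 * pi for qi, pi in zip(q, p)) - 1.0

def var_p(g, p):
    # Var_P(g)
    m = sum(gi * pi for gi, pi in zip(g, p))
    return sum((gi - m) ** 2 * pi for gi, pi in zip(g, p))

def mean_gap_sq(g, p, q):
    # (E_P g - E_Q g)^2, the left-hand side of (D.1)
    ep = sum(gi * pi for gi, pi in zip(g, p))
    eq = sum(gi * qi for gi, qi in zip(g, q))
    return (ep - eq) ** 2
```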
The continuous analogue of Lemma D.1.1 is the following.
Lemma D.1.2. Let \(\Omega = \Omega^{(1)} \times \Omega^{(2)}\) with \(\Omega^{(1)} \subseteq \mathbb{R}^{d_1}\). Suppose \(P_{x_1}\) is a probability measure on \(\Omega^{(2)}\) for each \(x_1 \in \Omega^{(1)}\), with density function \(p_{x_1}\) (with respect to some reference measure \(dx\)); suppose \(g : \Omega^{(2)} \to \mathbb{R}\) satisfies \(g \in L^2(P_{x_1})\), and \(\int_{\Omega^{(2)}} \frac{\|\nabla_{x_1}p_{x_1}(x_2)\|^2}{p_{x_1}(x_2)}\,dx_2 < \infty\). Then
\[ \left\|\int_{\Omega^{(2)}} g(x_2)\,\nabla_{x_1}(p_{x_1}(x_2))\,dx_2\right\|^2 \le \operatorname{Var}_{P_{x_1}}(g)\left(\int_{\Omega^{(2)}} \frac{\|\nabla_{x_1}p_{x_1}(x_2)\|^2}{p_{x_1}(x_2)}\,dx_2\right). \tag{D.5} \]
Proof. Because each \(P_{x_1}\) is a probability measure,
\[ \int_{\Omega^{(2)}} \nabla_{x_1}p_{x_1}(x_2)\,dx_2 = \nabla_{x_1}\int_{\Omega^{(2)}} p_{x_1}(x_2)\,dx_2 = \nabla_{x_1}(1) = 0. \tag{D.6} \]
Hence
\[\begin{aligned} \left\|\int_{\Omega^{(2)}} g(x_2)\,\nabla_{x_1}p_{x_1}(x_2)\,dx_2\right\|^2 &= \left\|\int_{\Omega^{(2)}} \left[g(x_2) - \mathbb{E}_{P_{x_1}}[g(x_2)]\right]\nabla_{x_1}p_{x_1}(x_2)\,dx_2\right\|^2 && \text{(D.8)} \\ &\le \left(\int_{\Omega^{(2)}} \left[g(x_2) - \mathbb{E}_{P_{x_1}}[g(x_2)]\right]^2 p_{x_1}(x_2)\,dx_2\right)\left(\int_{\Omega^{(2)}} \frac{\|\nabla_{x_1}p_{x_1}(x_2)\|^2}{p_{x_1}(x_2)}\,dx_2\right) && \text{(D.9)} \\ &= \operatorname{Var}_{P_{x_1}}(g)\left(\int_{\Omega^{(2)}} \frac{\|\nabla_{x_1}p_{x_1}(x_2)\|^2}{p_{x_1}(x_2)}\,dx_2\right). && \text{(D.10)} \end{aligned}\]
Lemma D.1.3. Let \(P\) be a probability measure and \(Q\) a nonnegative measure on \(\Omega\). Define the measure \(R\) by \(R = \min\left\{\frac{dQ}{dP}, 1\right\}P\). (If \(p, q\) are the density functions of \(P, Q\), then the density function of \(R\) is simply \(r(x) = \min\{p(x), q(x)\}\).) Let \(\delta\) be the overlap \(\delta = R(\Omega)\), and \(\widetilde R = \frac R\delta = \frac{R}{R(\Omega)}\) the normalized overlap measure. Then
\[ \chi^2(\widetilde R||P) \le \frac1\delta. \tag{D.12} \]
Proof. We make a change of variable to \(u = \frac{dQ}{dP}\). Let \(F(u) = P\left(\left\{x : \frac{dQ}{dP}(x) \le u\right\}\right)\). Then
\[\begin{aligned} \chi^2(\widetilde R||P) &= \int_\Omega \left(\frac{\min\left\{1, \frac{dQ}{dP}\right\}}{\delta} - 1\right)^2 P(dx) && \text{(D.13)} \\ &= \int \left(\frac{\min\{1, u\}}{\delta} - 1\right)^2 dF(u) \quad \text{(Stieltjes integral)} && \text{(D.14)} \\ &\le \frac{1}{\delta^2}\int \left(\min\{1, u\}\right)^2 dF(u) && \text{(D.15)} \\ &\le \frac{1}{\delta^2}\int \min\{1, u\}\,dF(u). && \text{(D.16)} \end{aligned}\]
Now note that \(\int \min\{1, u\}\,dF(u) = \int_\Omega \min\left\{1, \frac{dQ}{dP}\right\}P(dx) = \delta\). Hence (D.16) is at most \(\frac{1}{\delta^2}\delta = \frac1\delta\).
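A discrete sanity check of this bound (stdlib-only sketch, names ours):

```python
def overlap_chi_sq(p, q):
    # r = min(p, q) pointwise, delta = total mass of r, r_tilde = r / delta;
    # returns (chi^2(r_tilde || p), 1/delta) on a finite state space
    r = [min(pi, qi) for pi, qi in zip(p, q)]
    delta = sum(r)
    chi = sum((ri / delta / pi) ** 2 * pi for ri, pi in zip(r, p)) - 1.0
    return chi, 1.0 / delta
```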
Lemma D.1.4. If \(P, P_i\) are probability measures on \(\Omega\) such that \(P = \sum_{i=1}^n w_iP_i\) (where the \(w_i > 0\) sum to 1), and \(Q \ll P, P_i\), then
\[ \chi^2(Q||P) \le \sum_{i=1}^n w_i\,\chi^2(Q||P_i). \tag{D.17} \]
This inequality follows from convexity of \(f\)-divergences; for completeness we include a proof.
Proof. By Cauchy-Schwarz,
\[\begin{aligned} \chi^2(Q||P) &= \left(\int_\Omega \left(\frac{dQ}{dP}\right)^2 P(dx)\right) - 1 && \text{(D.18)} \\ &= \int_\Omega \left(\sum_{i=1}^n w_i\frac{dP_i}{dP}\frac{dQ}{dP_i}\right)^2 P(dx) - 1 && \text{(D.19)} \\ &\le \int_\Omega \left(\sum_{i=1}^n w_i\frac{dP_i}{dP}\right)\left(\sum_{i=1}^n w_i\left(\frac{dQ}{dP_i}\right)^2\frac{dP_i}{dP}\right) P(dx) - 1 && \text{(D.20)} \\ &= \sum_{i=1}^n w_i\left(\int_\Omega \left(\frac{dQ}{dP_i}\right)^2 P_i(dx) - 1\right) = \sum_{i=1}^n w_i\,\chi^2(Q||P_i), && \text{(D.21)} \end{aligned}\]
using \(\sum_{i=1}^n w_i\frac{dP_i}{dP} = 1\) in the last step.
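The mixture inequality (D.17) is easy to confirm on a finite state space (stdlib-only sketch, names ours):

```python
def chi_sq(q, p):
    # chi^2(Q || P) on a finite state space
    return sum((qi / pi) ** 2 * pi for qi, pi in zip(q, p)) - 1.0

def mixture(dists, weights):
    # P = sum_i w_i P_i, computed pointwise
    return [sum(w * d[x] for w, d in zip(weights, dists))
            for x in range(len(dists[0]))]
```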
Lemma D.1.5. Suppose \(P, \widetilde P\) are probability distributions on \(\Omega\) such that \(\frac{d\widetilde P}{dP} \le K\). Then for any probability measure \(Q \ll P\),
\[ \chi^2(Q||P) \le K\chi^2(Q||\widetilde P) + K - 1. \tag{D.22} \]
Proof.
\[\begin{aligned} \chi^2(Q||P) &= \int_\Omega \left(\frac{dQ}{dP}\right)^2 P(dx) - 1 = \int_\Omega \left(\frac{dQ}{d\widetilde P}\right)^2\left(\frac{d\widetilde P}{dP}\right)^2 P(dx) - 1 && \text{(D.23)} \\ &\le K\left(\int_\Omega \left(\frac{dQ}{d\widetilde P}\right)^2 \widetilde P(dx)\right) - 1 = K(\chi^2(Q||\widetilde P) + 1) - 1. && \text{(D.24)} \end{aligned}\]
Lemma D.1.6. Let \(W\) and \(W'\) be probability measures over \(I\), with densities \(w(i)\), \(w'(i)\) with respect to a reference measure \(di\), such that \(\operatorname{KL}(W||W') < \infty\). For each \(i \in I\), suppose \(P_i, Q_i\) are probability measures over \(\Omega\). Then
\[ \operatorname{KL}\left(\int_I w(i)P_i\,di \,\Big\|\, \int_I w'(i)Q_i\,di\right) \le \operatorname{KL}(W||W') + \int_I w(i)\operatorname{KL}(P_i||Q_i)\,di. \]
Proof. Overloading notation, we will use \(\operatorname{KL}(a||b)\) for two measures \(a, b\) even if they are not necessarily probability distributions, with the obvious definition. Using the convexity of KL divergence,
\[\begin{aligned} \operatorname{KL}\left(\int_I w(i)P_i\,di \,\Big\|\, \int_I w'(i)Q_i\,di\right) &= \operatorname{KL}\left(\int_I w(i)P_i\,di \,\Big\|\, \int_I w(i)\,Q_i\frac{w'(i)}{w(i)}\,di\right) \\ &\le \int_I w(i)\operatorname{KL}\left(P_i \,\Big\|\, Q_i\frac{w'(i)}{w(i)}\right) di \\ &= \int_I w(i)\log\left(\frac{w(i)}{w'(i)}\right) di + \int_I w(i)\operatorname{KL}(P_i||Q_i)\,di \\ &= \operatorname{KL}(W||W') + \int_I w(i)\operatorname{KL}(P_i||Q_i)\,di. \end{aligned}\]
D.2 Chi-squared divergence calculations for log-
concave distributions
We calculate the chi-squared divergence between log-concave distributions at different
temperatures, and at different locations. In the gaussian case there is a closed formula
(Lemma D.2.1). The general case is more involved (Lemmas D.2.6 and D.2.7), and
the bound is in terms of the strong convexity and smoothness constants.
Lemma D.2.1. For a matrix \(\Sigma\), let \(|\Sigma|\) denote its determinant. The \(\chi^2\) divergence between \(N(\mu_1, \Sigma_1)\) and \(N(\mu_2, \Sigma_2)\) is
\[\begin{aligned} &\chi^2(N(\mu_2,\Sigma_2)||N(\mu_1,\Sigma_1)) \\ &= \frac{|\Sigma_1|^{\frac12}}{|\Sigma_2|}\left|2\Sigma_2^{-1} - \Sigma_1^{-1}\right|^{-\frac12} \\ &\quad\cdot \exp\left(\frac12(2\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1)^\top(2\Sigma_2^{-1} - \Sigma_1^{-1})^{-1}(2\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1) + \frac12\mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_2^\top\Sigma_2^{-1}\mu_2\right) - 1. \end{aligned}\]
In particular, in the cases of equal mean or equal variance,
\[ \chi^2(N(\mu,\Sigma_2)||N(\mu,\Sigma_1)) = \frac{|\Sigma_1|^{\frac12}}{|\Sigma_2|}\left|2\Sigma_2^{-1} - \Sigma_1^{-1}\right|^{-\frac12} - 1 \tag{D.28} \]
\[ \chi^2(N(\mu_2,\Sigma)||N(\mu_1,\Sigma)) = \exp\left[(\mu_2-\mu_1)^\top\Sigma^{-1}(\mu_2-\mu_1)\right] - 1. \tag{D.29} \]
Proof.
\[\begin{aligned} &\chi^2(N(\mu_2,\Sigma_2)||N(\mu_1,\Sigma_1)) + 1 && \text{(D.30)} \\ &= \frac{1}{(2\pi)^{\frac d2}}\frac{|\Sigma_1|^{\frac12}}{|\Sigma_2|}\int_{\mathbb{R}^d} \exp\left[-\frac12\left(2(x-\mu_2)^\top\Sigma_2^{-1}(x-\mu_2) - (x-\mu_1)^\top\Sigma_1^{-1}(x-\mu_1)\right)\right] dx && \text{(D.31)} \\ &= \frac{1}{(2\pi)^{\frac d2}}\frac{|\Sigma_1|^{\frac12}}{|\Sigma_2|}\int_{\mathbb{R}^d} \exp\left[-\frac12\left(x^\top(2\Sigma_2^{-1}-\Sigma_1^{-1})x + 2x^\top\Sigma_1^{-1}\mu_1 - 4x^\top\Sigma_2^{-1}\mu_2 - \mu_1^\top\Sigma_1^{-1}\mu_1 + 2\mu_2^\top\Sigma_2^{-1}\mu_2\right)\right] dx && \text{(D.32--33)} \\ &= \frac{1}{(2\pi)^{\frac d2}}\frac{|\Sigma_1|^{\frac12}}{|\Sigma_2|}\,e^{c}\int_{\mathbb{R}^d} \exp\left[-\frac12\,x'^\top(2\Sigma_2^{-1}-\Sigma_1^{-1})x'\right] dx' && \text{(D.34)} \end{aligned}\]
where
\[ x' := x - (2\Sigma_2^{-1}-\Sigma_1^{-1})^{-1}(2\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1) \tag{D.35} \]
\[ c := \frac12(2\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1)^\top(2\Sigma_2^{-1}-\Sigma_1^{-1})^{-1}(2\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1) + \frac12\mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_2^\top\Sigma_2^{-1}\mu_2. \tag{D.36} \]
Integrating gives the result. For the equal variance case,
\[ c = \frac12(2\mu_2-\mu_1)^\top\Sigma^{-1}(2\mu_2-\mu_1) + \frac12\mu_1^\top\Sigma^{-1}\mu_1 - \mu_2^\top\Sigma^{-1}\mu_2 = (\mu_2-\mu_1)^\top\Sigma^{-1}(\mu_2-\mu_1). \tag{D.37} \]
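The equal-variance case (D.29) can be confirmed numerically in one dimension; a stdlib-only sketch (names ours) comparing a midpoint-rule integral of \(\int q^2/p\,dx - 1\) against the closed form:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def chi_sq_numeric(mu1, mu2, sigma, lo=-10.0, hi=10.0, n=120000):
    # midpoint-rule approximation of  int q(x)^2 / p(x) dx - 1,
    # with q = N(mu2, sigma^2) and p = N(mu1, sigma^2)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        q = gauss_pdf(x, mu2, sigma)
        p = gauss_pdf(x, mu1, sigma)
        total += (q * q / p) * h
    return total - 1.0
```

For \(\mu_1 = 0\), \(\mu_2 = 0.5\), \(\sigma = 1\), the closed form gives \(e^{0.25} - 1 \approx 0.284\).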
The following theorem is essential in generalizing from gaussian to log-concave densities.
Theorem D.2.2 (Harge, [Har04]). Suppose the \(d\)-dimensional gaussian \(N(0,\Sigma)\) has density \(\gamma\). Let \(p = h\cdot\gamma\) be a probability density, where \(h\) is log-concave. Let \(g : \mathbb{R}^d \to \mathbb{R}\) be convex. Then
\[ \int_{\mathbb{R}^d} g\left(x - \mathbb{E}_p x\right)p(x)\,dx \le \int_{\mathbb{R}^d} g(x)\gamma(x)\,dx. \tag{D.38} \]
Lemma D.2.3 (\(\chi^2\)-tail bound). Let \(\gamma = N\left(0, \frac1\kappa I_d\right)\). Then
\[ \forall y \ge \sqrt{\frac d\kappa},\quad \mathbb{P}_{x\sim\gamma}(\|x\| \ge y) \le e^{-\frac\kappa2\left(y - \sqrt{\frac d\kappa}\right)^2}. \tag{D.39} \]
Proof. By the \(\chi^2_d\) tail bound in [LM00], for all \(t \ge 0\),
\[ \mathbb{P}_{x\sim\gamma}\left(\|x\|^2 \ge \frac1\kappa(\sqrt d + \sqrt{2t})^2\right) \le \mathbb{P}_{x\sim\gamma}\left(\|x\|^2 \ge \frac1\kappa(d + 2(\sqrt{dt} + t))\right) \le e^{-t} \tag{D.40} \]
\[ \implies \forall y \ge \sqrt{\frac d\kappa},\quad \mathbb{P}_{x\sim\gamma}(\|x\| \ge y) \le e^{-\left(\frac{\sqrt\kappa y - \sqrt d}{\sqrt2}\right)^2} = e^{-\frac\kappa2\left(y - \sqrt{\frac d\kappa}\right)^2}. \tag{D.41} \]
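A quick Monte Carlo sketch (stdlib only; names and parameters ours) of the norm-tail bound (D.39):

```python
import math
import random

def empirical_norm_tail(d, kappa, y, trials=20000, seed=1):
    # empirical P(||x|| >= y) for x ~ N(0, (1/kappa) I_d)
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(kappa)
    hits = 0
    for _ in range(trials):
        r2 = sum(rng.gauss(0.0, sd) ** 2 for _ in range(d))
        if math.sqrt(r2) >= y:
            hits += 1
    return hits / trials

def norm_tail_bound(d, kappa, y):
    # e^{-(kappa/2)(y - sqrt(d/kappa))^2}, valid for y >= sqrt(d/kappa)
    return math.exp(-0.5 * kappa * (y - math.sqrt(d / kappa)) ** 2)
```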
Lemma D.2.4. Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a \(\kappa\)-strongly convex and \(K\)-smooth function and let \(P\) be a probability measure with density function \(p(x) \propto e^{-f(x)}\). Let \(x^* = \operatorname{argmin}_x f(x)\) and \(\bar x = \mathbb{E}_P\,x\). Then
\[ \|x^* - \bar x\| \le \sqrt{\frac d\kappa}\left(\sqrt{\ln\left(\frac K\kappa\right)} + 5\right). \tag{D.42} \]
Proof. We establish concentration around both the mode \(x^*\) and the mean \(\bar x\). This will imply that the mode and the mean are close. Without loss of generality, assume \(x^* = 0\) and \(f(0) = 0\).
For the mode, note that by Lemma D.2.3, for all \(r \ge \sqrt{\frac d\kappa}\),
\[ \int_{\|x\|\ge r} e^{-f(x)}\,dx \le \int_{\|x\|\ge r} e^{-\frac12\kappa\|x\|^2}\,dx \le \left(\frac{2\pi}{\kappa}\right)^{\frac d2} e^{-\frac\kappa2\left(r - \sqrt{\frac d\kappa}\right)^2} \tag{D.43} \]
\[ \int_{\|x\|<r} e^{-f(x)}\,dx \ge \int_{\|x\|<r} e^{-\frac12 K\|x\|^2}\,dx \ge \left(\frac{2\pi}{K}\right)^{\frac d2}\left(1 - e^{-\frac K2\left(r - \sqrt{\frac d\kappa}\right)^2}\right). \tag{D.44} \]
Let \(r = \sqrt{\frac d\kappa}\left(\sqrt{\ln\left(\frac K\kappa\right)} + 3\right)\). Then
\[ \int_{\|x\|\ge r} e^{-f(x)}\,dx \le \left(\frac{2\pi}{\kappa}\right)^{\frac d2} e^{-\frac d2\left(\ln\left(\frac K\kappa\right)+2\right)} \le \left(\frac{2\pi}{K}\right)^{\frac d2} e^{-d} \tag{D.45} \]
\[ \int_{\|x\|<r} e^{-f(x)}\,dx \ge \left(\frac{2\pi}{K}\right)^{\frac d2}\left(1 - e^{-\frac K2\left(r - \sqrt{\frac d\kappa}\right)^2}\right) \ge \left(\frac{2\pi}{K}\right)^{\frac d2}\left(1 - e^{-\frac{Kd}{2\kappa}\left(2+\ln\left(\frac K\kappa\right)\right)}\right) \ge \left(\frac{2\pi}{K}\right)^{\frac d2}(1 - e^{-d}). \tag{D.46--47} \]
Thus
\[ \mathbb{P}_{x\sim P}(\|x\| \ge r) = \frac{\int_{\|x\|\ge r} e^{-f(x)}\,dx}{\int_{\|x\|\ge r} e^{-f(x)}\,dx + \int_{\|x\|<r} e^{-f(x)}\,dx} \le e^{-d} \le \frac12. \tag{D.48} \]
Now we show concentration around the mean. By adding a constant to \(f\), we may assume that \(p(x) = e^{-f(x)}\). Note that because \(f\) is \(\kappa\)-strongly convex, \(p\) is the product of \(\gamma(x)\) with a log-concave function, where \(\gamma(x)\) is the density of \(N(0, \frac1\kappa I_d)\). By Harge's Theorem D.2.2,
\[ \int_{\mathbb{R}^d} \|x - \bar x\|^2\,p(x)\,dx \le \int_{\mathbb{R}^d} \|x\|^2\,\gamma(x)\,dx = \frac d\kappa. \tag{D.49} \]
By Markov's inequality,
\[ \mathbb{P}_{x\sim P}\left(\|x - \bar x\| \ge \sqrt{\frac{2d}{\kappa}}\right) = \mathbb{P}_{x\sim P}\left(\|x - \bar x\|^2 \ge \frac{2d}{\kappa}\right) \le \frac12. \tag{D.50} \]
Let \(B_r(x)\) denote the ball of radius \(r\) around \(x\). By (D.48) and (D.50), the balls \(B_{\sqrt{\frac d\kappa}\left(\sqrt{\ln(\frac K\kappa)}+3\right)}(x^*)\) and \(B_{\sqrt{\frac{2d}{\kappa}}}(\bar x)\) intersect. Thus \(\|\bar x - x^*\| \le \sqrt{\frac d\kappa}\left(\sqrt{\ln\left(\frac K\kappa\right)} + 5\right)\).
Lemma D.2.5 (Concentration around mode for log-concave distributions). Suppose \(f : \mathbb{R}^d \to \mathbb{R}\) is \(\kappa\)-strongly convex and \(K\)-smooth. Let \(P\) be the probability measure with density function \(p(x) \propto e^{-f(x)}\), and let \(x^* = \operatorname{argmin}_x f(x)\). Then
\[ \mathbb{P}_{x\sim P}\left(\|x - x^*\|^2 \ge \frac1\kappa\left(\sqrt d + \sqrt{2t + d\ln\left(\frac K\kappa\right)}\right)^2\right) \le e^{-t}. \tag{D.51} \]
Proof. By (D.43) and (D.44),
\[ \mathbb{P}_{x\sim P}(\|x - x^*\| \ge r) \le \left(\frac K\kappa\right)^{\frac d2} e^{-\frac\kappa2\left(r - \sqrt{\frac d\kappa}\right)^2}. \tag{D.52} \]
Substituting in \(r = \frac{1}{\sqrt\kappa}\left(\sqrt d + \sqrt{2t + d\ln\left(\frac K\kappa\right)}\right)\) gives the lemma.
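For a Gaussian, where \(f = \frac\kappa2 x^2\) (so \(K = \kappa\) and, in one dimension, \(d = 1\)), the bound can be checked exactly via the complementary error function (stdlib-only sketch, names ours):

```python
import math

def mode_tail_exact(t):
    # exact P(|x - x*| >= (sqrt(d) + sqrt(2t)) / sqrt(kappa)) for the
    # one-dimensional density p(x) ~ e^{-kappa x^2 / 2} (K = kappa, d = 1);
    # after standardizing, kappa cancels and the threshold is 1 + sqrt(2t)
    z = 1.0 + math.sqrt(2.0 * t)
    return math.erfc(z / math.sqrt(2.0))
```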
Lemma D.2.6 (\(\chi^2\)-divergence between translates). Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a \(\kappa\)-strongly convex and \(K\)-smooth function, and let \(p(x) \propto e^{-f(x)}\) be a probability density. Let \(\|\mu\| = D\). Then
\[ \chi^2(p(x)||p(x-\mu)) \le e^{\frac12\kappa D^2 + KD\sqrt{\frac d\kappa}\left(\sqrt{\ln(\frac K\kappa)}+5\right)}\left(e^{KD\sqrt{\frac d\kappa}} + KD\sqrt{\frac{4\pi}{\kappa}}\,e^{\frac{2KD\sqrt d}{\sqrt\kappa} + \frac{K^2D^2}{2\kappa}}\right) - 1. \tag{D.53} \]
Proof. Without loss of generality, suppose \(f\) attains its minimum at 0, or equivalently, \(\nabla f(0) = 0\). We bound
\[ \chi^2(p(x)||p(x-\mu)) + 1 = \int_{\mathbb{R}^d} \frac{e^{-2f(x)}}{e^{-f(x-\mu)}}\,dx = \int_{\mathbb{R}^d} e^{-f(x)}e^{f(x-\mu)-f(x)}\,dx \le \int_{\mathbb{R}^d} e^{-f(x)}e^{KD\|x\| + \frac12\kappa D^2}\,dx. \tag{D.54--55} \]
Note that because \(f\) is \(\kappa\)-strongly convex, \(p\) is the product of \(\gamma(x)\) with a log-concave function, where \(\gamma(x)\) is the density of \(N(0, \frac1\kappa I_d)\). Let \(\bar x = \mathbb{E}_{x\sim p}\,x\) be the average value of \(x\) under \(p\). Applying Harge's Theorem D.2.2 with \(g(x) = e^{KD\|x+\bar x\|}\) and \(p(x) = e^{-f(x)}\) gives
\[ \int_{\mathbb{R}^d} e^{-f(x)}e^{KD\|x\|}\,dx \le \int_{\mathbb{R}^d} \gamma(x)e^{KD\|x+\bar x\|}\,dx. \tag{D.56} \]
Furthermore,
\[ \int_{\mathbb{R}^d} \gamma(x)e^{KD\|x+\bar x\|}\,dx \le e^{KD\|\bar x\|}\left(e^{KD\sqrt{\frac d\kappa}} + \int_{\sqrt{\frac d\kappa}}^\infty \mathbb{P}_{x\sim\gamma}(\|x\| \ge y)\,KDe^{KDy}\,dy\right), \tag{D.57} \]
where we used the identity \(\int f(x)\,p(x)\,dx \le f(y_0) + \int_{y_0}^\infty \mathbb{P}_{x\sim p}(x \ge y)f'(y)\,dy\), valid when \(f\) is an increasing function (applied to \(\|x\|\), with \(y_0 = \sqrt{d/\kappa}\)). By Lemma D.2.3,
\[ \forall y \ge \sqrt{\frac d\kappa},\quad \mathbb{P}_{x\sim\gamma}(\|x\| \ge y) \le e^{-\frac\kappa2\left(y - \sqrt{\frac d\kappa}\right)^2} \tag{D.58} \]
\[\begin{aligned} \implies \int_{\sqrt{d/\kappa}}^\infty \mathbb{P}_{x\sim\gamma}(\|x\|\ge y)\,KDe^{KDy}\,dy &\le KD\int_{\sqrt{d/\kappa}}^\infty e^{-\frac\kappa2\left(y-\sqrt{\frac d\kappa}\right)^2 + KDy}\,dy && \text{(D.59)} \\ &= KD\int_{\sqrt{d/\kappa}}^\infty e^{-\frac\kappa2\left[\left(y-\sqrt{\frac d\kappa}-\frac{KD}{\kappa}\right)^2 - \frac{2KD\sqrt d}{\kappa^{3/2}} - \frac{K^2D^2}{\kappa^2}\right]}\,dy && \text{(D.60)} \\ &\le KD\,e^{\frac{2KD\sqrt d}{\sqrt\kappa} + \frac{K^2D^2}{2\kappa}}\int_{\mathbb{R}} e^{-\frac\kappa2\left(y-\sqrt{\frac d\kappa}-\frac{KD}{\kappa}\right)^2}\,dy && \text{(D.61)} \\ &\le KD\sqrt{\frac{4\pi}{\kappa}}\,e^{\frac{2KD\sqrt d}{\sqrt\kappa} + \frac{K^2D^2}{2\kappa}}. && \text{(D.62)} \end{aligned}\]
Putting together (D.55), (D.57), and (D.62), and using Lemma D.2.4 to bound \(\|\bar x\|\),
\[ \chi^2(p(x)||p(x-\mu)) \le e^{\frac12\kappa D^2 + KD\sqrt{\frac d\kappa}\left(\sqrt{\ln(\frac K\kappa)}+5\right)}\left(e^{KD\sqrt{\frac d\kappa}} + KD\sqrt{\frac{4\pi}{\kappa}}\,e^{\frac{2KD\sqrt d}{\sqrt\kappa} + \frac{K^2D^2}{2\kappa}}\right) - 1. \tag{D.63} \]
Lemma D.2.7 (\(\chi^2\)-divergence between temperatures). Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a \(\kappa\)-strongly convex and \(K\)-smooth function, and let \(P, P_\beta\) be probability measures with density functions \(p(x) \propto e^{-f(x)}\), \(p_\beta(x) \propto e^{-\beta f(x)}\). Suppose \(\beta_1, \beta_2 > 0\) and \(\left|1 - \frac{\beta_1}{\beta_2}\right| < \frac\kappa K\). Then
\[ \chi^2(P_{\beta_2}||P_{\beta_1}) \le e^{\frac12\left|1-\frac{\beta_1}{\beta_2}\right|\frac{Kd}{\kappa - K\left|1-\frac{\beta_1}{\beta_2}\right|}\left(\sqrt{\ln(\frac K\kappa)}+5\right)^2}\left(\left(1 - \frac K\kappa\left|1-\frac{\beta_1}{\beta_2}\right|\right)\left(1 + \left|1-\frac{\beta_1}{\beta_2}\right|\right)\right)^{-\frac d2} - 1. \tag{D.64} \]
Proof. Without loss of generality, suppose \(f\) attains its minimum at 0 (equivalently, \(\nabla f(0) = 0\)), and \(f(0) = 0\). We bound
\[ \chi^2(P_{\beta_2}||P_{\beta_1}) + 1 = \frac{\int_{\mathbb{R}^d} e^{-\beta_1 f(x)}\,dx\,\int_{\mathbb{R}^d} e^{(\beta_1-2\beta_2)f(x)}\,dx}{\left(\int_{\mathbb{R}^d} e^{-\beta_2 f(x)}\,dx\right)^2}. \tag{D.65} \]
Let \(\bar x = \mathbb{E}_{x\sim P_{\beta_2}}x\) be the average value of \(x\) under \(p_{\beta_2}\). Note that because \(f\) is \(\kappa\)-strongly convex, \(e^{-\beta_2 f(x)}\) is the product of \(\gamma(x)\) with a log-concave function, where \(\gamma(x)\) is the density of \(N(0, \frac{1}{\beta_2\kappa}I_d)\). Applying Harge's Theorem D.2.2 to \(g_1(x) = e^{(\beta_2-\beta_1)f(x+\bar x)}\) and \(g_2(x) = e^{(\beta_1-\beta_2)f(x+\bar x)}\), we get
\[ \text{(D.65)} \le \int_{\mathbb{R}^d} e^{(\beta_2-\beta_1)f(x+\bar x)}\gamma(x)\,dx \cdot \int_{\mathbb{R}^d} e^{(\beta_1-\beta_2)f(x+\bar x)}\gamma(x)\,dx. \tag{D.66} \]
Because \(f\) is \(\kappa\)-strongly convex and \(K\)-smooth, and \(f(0) = 0\) is the minimum of \(f\),
\[ \text{(D.66)} \le \frac{\int_{\mathbb{R}^d} e^{|\beta_2-\beta_1|\frac K2\|x+\bar x\|^2}e^{-\frac{\beta_2\kappa}{2}\|x\|^2}\,dx\,\int_{\mathbb{R}^d} e^{-|\beta_2-\beta_1|\frac\kappa2\|x+\bar x\|^2}e^{-\frac{\beta_2\kappa}{2}\|x\|^2}\,dx}{\left(\int_{\mathbb{R}^d} e^{-\frac{\beta_2\kappa}{2}\|x\|^2}\,dx\right)^2}. \tag{D.67} \]
Using the identity
\[ a\|x+\bar x\|^2 + b\|x\|^2 = (a+b)\left\|x + \frac{a}{a+b}\bar x\right\|^2 + \frac{ab}{a+b}\|\bar x\|^2, \tag{D.68--69} \]
we get, using Lemma D.2.4 (\(\cdots\) denotes quantities not involving \(x\) that we will not need),
\[\begin{aligned} \text{(D.67)} &= \frac{1}{\left(\int_{\mathbb{R}^d} e^{-\frac{\beta_2\kappa}{2}\|x\|^2}\,dx\right)^2}\, e^{\frac{K\kappa|\beta_2-\beta_1|\beta_2}{2(\kappa\beta_2 - K|\beta_2-\beta_1|)}\|\bar x\|^2}\int_{\mathbb{R}^d} e^{\left(\frac K2|\beta_2-\beta_1| - \frac\kappa2\beta_2\right)\|x+\cdots\|^2}\,dx && \text{(D.70)} \\ &\qquad\cdot\, e^{-\frac{\kappa|\beta_1-\beta_2|\beta_2}{2\kappa(\beta_2 - |\beta_2-\beta_1|)}\|\bar x\|^2}\int_{\mathbb{R}^d} e^{\left(-\frac\kappa2\beta_2 - \frac{|\beta_2-\beta_1|\kappa}{2}\right)\|x+\cdots\|^2}\,dx && \text{(D.71)} \\ &\le e^{\frac{|\beta_2-\beta_1|}{2}\frac{K\kappa\beta_2}{\kappa\beta_2 - K|\beta_2-\beta_1|}\|\bar x\|^2}\left(\frac{2\pi}{\kappa\beta_2 - K|\beta_2-\beta_1|}\right)^{\frac d2}\left(\frac{2\pi}{\kappa(\beta_2 + |\beta_2-\beta_1|)}\right)^{\frac d2}\left(\frac{2\pi}{\kappa\beta_2}\right)^{-d} && \text{(D.72)} \\ &\le e^{\frac12\left|1-\frac{\beta_1}{\beta_2}\right|\frac{Kd}{\kappa - K\left|1-\frac{\beta_1}{\beta_2}\right|}\left(\sqrt{\ln(\frac K\kappa)}+5\right)^2}\left(\left(1 - \frac K\kappa\left|1-\frac{\beta_1}{\beta_2}\right|\right)\left(1 + \left|1-\frac{\beta_1}{\beta_2}\right|\right)\right)^{-\frac d2}. && \text{(D.73)} \end{aligned}\]
Lemma D.2.8 (\(\chi^2\) divergence between gaussian and log-concave distribution). Suppose that the probability measure \(P\) has density function \(p(x) \propto e^{-f(x-\mu)}\), where \(f\) is \(\kappa\)-strongly convex, \(K\)-smooth, and attains its minimum at 0. Let \(D = \|\mu\|\). Then
\[ \chi^2\left(N\left(0, \frac1K I_d\right)\Big\|P\right) \le \left(\frac K\kappa\right)^{\frac d2} e^{KD^2}. \tag{D.74} \]
Proof. We calculate
\[ p(x) = \frac{e^{-f(x-\mu)}}{\int_{\mathbb{R}^d} e^{-f(u-\mu)}\,du} \ge \frac{e^{-\frac K2\|x-\mu\|^2}}{\int_{\mathbb{R}^d} e^{-\frac\kappa2\|u-\mu\|^2}\,du} = \left(\frac{\kappa}{2\pi}\right)^{\frac d2} e^{-\frac K2\|x-\mu\|^2}. \tag{D.75} \]
Then
\[\begin{aligned} \chi^2\left(N\left(0, \frac1K I_d\right)\Big\|P\right) &= \int_{\mathbb{R}^d} \frac{\left(\frac{K}{2\pi}\right)^d e^{-K\|x\|^2}}{p(x)}\,dx - 1 && \text{(D.76)} \\ &\le \left(\frac{K}{2\pi}\right)^d\left(\frac{2\pi}{\kappa}\right)^{\frac d2}\int_{\mathbb{R}^d} e^{-K\left(\|x\|^2 - \frac12\|x-\mu\|^2\right)}\,dx && \text{(D.77)} \\ &= \left(\frac K\kappa\right)^{\frac d2}\left(\frac{K}{2\pi}\right)^{\frac d2}\int_{\mathbb{R}^d} e^{-\frac K2\|x+\mu\|^2 + K\|\mu\|^2}\,dx && \text{(D.78)} \\ &\le \left(\frac K\kappa\right)^{\frac d2} e^{KD^2}. && \text{(D.79)} \end{aligned}\]
D.3 A probability ratio calculation
Lemma D.3.1. Suppose that \(f(x) = -\ln\left[\sum_{i=1}^n w_ie^{-\frac{\|x-\mu_i\|^2}{2}}\right]\), \(p(x) \propto e^{-f(x)}\), and for \(\alpha \ge 0\) let \(p_\alpha(x) \propto e^{-\alpha f(x)}\), \(Z_\alpha = \int_{\mathbb{R}^d} e^{-\alpha f(x)}\,dx\). Suppose that \(\|\mu_i\| \le D\) for all \(i\). If \(\alpha < \beta\), then
\[ \left[\int_A \min\{p_\alpha(x), p_\beta(x)\}\,dx\right]\Big/p_\beta(A) \ge \min_x \frac{p_\alpha(x)}{p_\beta(x)} \ge \frac{Z_\beta}{Z_\alpha} \tag{D.80} \]
\[ \frac{Z_\beta}{Z_\alpha} \in \left[\frac12 e^{-2(\beta-\alpha)\left(D + \frac{1}{\sqrt\alpha}\left(\sqrt d + 2\sqrt{\ln\left(\frac{2}{w_{\min}}\right)}\right)\right)^2},\; 1\right]. \tag{D.81} \]
Choosing \(\beta - \alpha = O\left(\frac{1}{D^2 + \frac d\alpha + \frac1\alpha\ln\left(\frac{1}{w_{\min}}\right)}\right)\), this quantity is \(\Omega(1)\).
This is a special case of the following more general lemma.
Lemma D.3.2. Suppose that \(f(x) = -\ln\left[\sum_{i=1}^n w_ie^{-f_i(x)}\right]\), where \(f_i(x) = f_0(x-\mu_i)\) and \(f_0\) is \(\kappa\)-strongly convex and \(K\)-smooth. Let \(P\), \(P_\alpha\) (for \(\alpha > 0\)) be probability measures with densities \(p(x) \propto e^{-f(x)}\) and \(p_\alpha(x) \propto e^{-\alpha f(x)}\). Let \(Z_\alpha = \int_{\mathbb{R}^d} e^{-\alpha f(x)}\,dx\). Suppose that \(\|\mu_i\| \le D\) for all \(i\).
Let \(C = D + \frac{1}{\sqrt{\alpha\kappa}}\left(\sqrt d + \sqrt{d\ln\left(\frac K\kappa\right) + 2\ln\left(\frac{2}{w_{\min}}\right)}\right)\). If \(\alpha < \beta\), then
\[ \left[\int_A \min\{p_\alpha(x), p_\beta(x)\}\,dx\right]\Big/p_\beta(A) \ge \min_x \frac{p_\alpha(x)}{p_\beta(x)} \ge \frac{Z_\beta}{Z_\alpha} \tag{D.82} \]
\[ \frac{Z_\beta}{Z_\alpha} \in \left[\frac12 e^{-\frac12(\beta-\alpha)KC^2},\; 1\right]. \tag{D.83} \]
If \(\beta - \alpha = O\left(\frac{1}{K\left(D^2 + \frac{d}{\alpha\kappa}\left(1+\ln\left(\frac K\kappa\right)\right) + \frac{1}{\alpha\kappa}\ln\left(\frac{1}{w_{\min}}\right)\right)}\right)\), then this quantity is \(\Omega(1)\).
Proof. Let \(\widetilde P_\alpha\) be the probability measure with density function \(\widetilde p_\alpha(x) \propto \sum_{i=1}^n w_ie^{-\alpha f_0(x-\mu_i)}\), and let \(\widetilde P_{\alpha,i}\) denote its \(i\)th component, with density \(\propto e^{-\alpha f_0(x-\mu_i)}\). By Lemma 2.7.3 and Lemma D.2.5, since \(\alpha f_0\) is \(\alpha\kappa\)-strongly convex,
\[\begin{aligned} \mathbb{P}_{x\sim P_\alpha}(\|x\| \ge C) &\le \frac{1}{w_{\min}}\mathbb{P}_{x\sim\widetilde P_\alpha}(\|x\| \ge C) && \text{(D.84)} \\ &\le \frac{1}{w_{\min}}\sum_{i=1}^n w_i\,\mathbb{P}_{x\sim\widetilde P_{\alpha,i}}(\|x\| \ge C) && \text{(D.85)} \\ &\le \frac{1}{w_{\min}}\sum_{i=1}^n w_i\,\mathbb{P}_{x\sim\widetilde P_{\alpha,i}}\left(\|x-\mu_i\|^2 \ge (C-D)^2\right) && \text{(D.86)} \\ &= \frac{1}{w_{\min}}\sum_{i=1}^n w_i\,\mathbb{P}_{x\sim\widetilde P_{\alpha,i}}\left(\|x-\mu_i\|^2 \ge \frac{1}{\alpha\kappa}\left(\sqrt d + \sqrt{d\ln\left(\frac K\kappa\right) + 2\ln\left(\frac{2}{w_{\min}}\right)}\right)^2\right) && \text{(D.87)} \\ &\le \frac{1}{w_{\min}}\cdot\frac{w_{\min}}{2} = \frac12. && \text{(D.88)} \end{aligned}\]
Thus, using \(f(x) \ge 0\),
\[\begin{aligned} \left[\int_A \min\{p_\alpha(x), p_\beta(x)\}\,dx\right]\Big/p_\beta(A) &\ge \int_A \min\left\{\frac{p_\alpha(x)}{p_\beta(x)}, 1\right\}p_\beta(x)\,dx\Big/p_\beta(A) && \text{(D.89)} \\ &\ge \int_A \min\left\{\frac{Z_\beta}{Z_\alpha}e^{(\beta-\alpha)f(x)}, 1\right\}p_\beta(x)\,dx\Big/p_\beta(A) && \text{(D.90)} \\ &\ge \frac{Z_\beta}{Z_\alpha}. && \text{(D.91)} \end{aligned}\]
Moreover,
\[ \frac{Z_\beta}{Z_\alpha} = \frac{\int e^{-\beta f(x)}\,dx}{\int e^{-\alpha f(x)}\,dx} = \int_{\mathbb{R}^d} e^{(-\beta+\alpha)f(x)}p_\alpha(x)\,dx \ge \int_{\|x\|\le C} e^{(-\beta+\alpha)f(x)}p_\alpha(x)\,dx \ge \frac12 e^{-(\beta-\alpha)\max_{\|x\|\le C}f(x)} \ge \frac12 e^{-\frac12(\beta-\alpha)KC^2}. \tag{D.92--96} \]
D.4 Other facts
Lemma D.4.1. Let \((N_T)_{T\ge0}\) be a Poisson process with rate \(\lambda\). Then there is a constant \(C\) such that
\[ \mathbb{P}(N_T \ge n) \le \left(\frac{Cn}{T\lambda}\right)^{-n}. \tag{D.97} \]
Proof. Assume \(n > T\lambda\). We have by Stirling's formula
\[\begin{aligned} \mathbb{P}(N_T \ge n) &= e^{-\lambda T}\sum_{m=n}^\infty \frac{(\lambda T)^m}{m!} && \text{(D.98)} \\ &\le e^{-\lambda T}\,\frac{1}{n!}\,\frac{1}{1 - \frac{\lambda T}{n}}\,(\lambda T)^n && \text{(D.99)} \\ &= e^{-\lambda T}\,O\left(n^{-\frac12}\left(\frac{e\lambda T}{n}\right)^n\right) && \text{(D.100)} \\ &\le \left(\frac{Cn}{\lambda T}\right)^{-n} && \text{(D.101)} \end{aligned}\]
for some \(C\), since \(e^{-\lambda T} \ge e^{-n}\).
Lemma D.4.2. Suppose that \(X_t\) is a sequence of random variables in \(\mathbb{R}^d\) such that for each \(t\), \(\|X_t - \mathbb{E}[X_t|X_{1:t-1}]\|_\infty \le M\) (with probability 1). Let \(S_T = \sum_{t=1}^T \mathbb{E}[X_t|X_{1:t-1}]\) (a random variable depending on \(X_{1:T}\)). Then
\[ \mathbb{P}\left(\left\|\sum_{t=1}^T X_t - S_T\right\|_2 \ge c\right) \le 2d\,e^{-\frac{c^2}{2TM^2d}}. \tag{D.102} \]
Proof. By Azuma's inequality, for each \(1 \le j \le d\),
\[ \mathbb{P}\left(\left|\sum_{t=1}^T (X_t)_j - (S_T)_j\right| \ge c\right) \le 2e^{-\frac{c^2}{2TM^2}}. \tag{D.103} \]
By a union bound,
\[ \mathbb{P}\left(\left\|\sum_{t=1}^T X_t - S_T\right\|_2 \ge c\right) \le \sum_{j=1}^d \mathbb{P}\left(\left|\sum_{t=1}^T (X_t)_j - (S_T)_j\right| \ge \frac{c}{\sqrt d}\right) \le 2d\,e^{-\frac{c^2}{2TM^2d}}. \tag{D.104} \]
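A Monte Carlo sketch (stdlib only; names, parameters, and the \(\pm M\) coordinate distribution are ours) checking the vector bound (D.102):

```python
import math
import random

def empirical_tail(T, d, M, c, trials=5000, seed=0):
    # X_t has independent coordinates equal to +-M with probability 1/2, so
    # E[X_t | X_{1:t-1}] = 0 and the martingale-difference bound holds with M
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        coords = [0.0] * d
        for _ in range(T):
            for j in range(d):
                coords[j] += M if rng.random() < 0.5 else -M
        if math.sqrt(sum(v * v for v in coords)) >= c:
            hits += 1
    return hits / trials

def azuma_union_bound(T, d, M, c):
    # 2 d exp(-c^2 / (2 T M^2 d)), the right-hand side of (D.102)
    return 2.0 * d * math.exp(-c * c / (2.0 * T * M * M * d))
```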
Lemma D.4.3. Suppose that \(\pi\) is a distribution with \(\mathbb{P}_{\theta\sim\pi}(\|\theta-\theta_0\| \ge \gamma) \le Ae^{-k\gamma}\) for some \(\theta_0\). Then
\[ \mathbb{E}_{\theta\sim\pi}\left[\|\theta-\theta_0\|^2\right] \le \left(2 + \frac1k\right)\log\left(\frac{A}{k^2}\right). \]
Proof. Without loss of generality, \(\theta_0 = 0\). Then
\[\begin{aligned} \mathbb{E}_{\theta\sim\pi}[\|\theta\|^2] &= \int_0^\infty 2\gamma\,\mathbb{P}_{\theta\sim\pi}(\|\theta\| \ge \gamma)\,d\gamma && \text{(D.105)} \\ &\le \gamma_0 + \int_{\gamma_0}^\infty 2\gamma\,\mathbb{P}_{\theta\sim\pi}(\|\theta\| \ge \gamma)\,d\gamma && \text{(D.106)} \\ &\le \gamma_0 + \int_{\gamma_0}^\infty 2\gamma Ae^{-k\gamma}\,d\gamma \quad \text{by assumption} && \text{(D.107)} \\ &= \gamma_0 + A\left(\left.-\frac{2\gamma}{k}e^{-k\gamma}\right|_{\gamma_0}^\infty - \int_{\gamma_0}^\infty -\frac2k e^{-k\gamma}\,d\gamma\right) \quad \text{integration by parts} && \text{(D.108)} \\ &= \gamma_0 + A\left(\frac{2\gamma_0}{k}e^{-k\gamma_0} + \frac{2}{k^2}e^{-k\gamma_0}\right). && \text{(D.109)} \end{aligned}\]
Set \(\gamma_0 = \frac{\log\left(\frac{A}{k^2}\right)}{k}\). Then this is \(\le \left(2 + \frac1k\right)\log\left(\frac{A}{k^2}\right)\), as desired.
Nomenclature
𝒟(L ) Domain of L
E Dirichlet form
ℒ(𝑋) Distribution of random variable 𝑋
L Generator of Markov process
\(\widetilde O_T\) Subscript \(T\) means that only the dependence on \(T\) is shown
Ω State space
Π(𝜇, 𝜈) Set of all possible couplings of random vectors \((X, Y)\) with marginals \(X \sim \mu\) and \(Y \sim \nu\)
𝑃𝑡 Family of kernels defining Markov process
P𝑡 P𝑡𝑔(𝑥) = E𝑦∼𝑃𝑡(𝑥,·)𝑔(𝑦) =∫Ω 𝑔(𝑦)𝑃𝑡(𝑥, 𝑑𝑦)
Var𝑃 Variance with respect to 𝑃
𝛽 Inverse temperature
𝑑 Dimension
𝜂 Step size
𝜑 Logistic function \(\varphi(x) = \frac{1}{1+e^{-x}}\)
𝜉 Gaussian noise 𝜉 ∼ 𝑁(0, 𝐼)
𝐶 Poincare constant for projected Markov process (Chapter 2)
𝐷 Bound on centers ‖𝜇𝑖‖ ≤ 𝐷 (Chapter 2)
∆ Perturbation of \(f\): \(\|\tilde f - f\|_\infty \le \Delta\) (Chapter 2)
E \(\mathcal{E}(g, g) = \sum_{i\in I}\mathcal{E}_i(g, g)\) (Chapter 2)
E↔ \(\mathcal{E}^\leftrightarrow(g, g) = -\sum_{i,j\in I} w_i\langle g, (T_{i,j} - \operatorname{Id})g\rangle_{P_i}\) (Chapter 2)
𝐾 Smoothness of 𝑓0 (Chapter 2)
𝐿 Number of temperatures (Chapter 2)
L Generator of projected Markov process (Chapter 2)
𝑀 Projected Markov process (Chapter 2)
𝑀st Simulated tempering chain (Chapter 2)
𝑃 Stationary measure of projected Markov process (Chapter 2)
\(\widetilde P_{i,j}\) \(\widetilde P_{i,j}(dx) = \int_{y\in\Omega_j}\widetilde Q_{i,j}(dx, dy)\) (Chapter 2)
𝑄𝑖,𝑗 Stationary distribution of \(P_i\), times transition kernel: \(P_i(dx)T_{i,j}(x, dy)\) (Chapter 2)
𝑄𝑗,𝑘 Minimum of probability measures \(P_j, P_k\): \(\min\left\{\frac{dP_k}{dP_j}, 1\right\}P_j\) (Chapter 2)
\(\widetilde Q_{i,j}\) Normalized version of \(Q_{i,j}\) (Chapter 2)
𝑆 Subset of 𝐼 × 𝐼 chosen to contain pairs (𝑖, 𝑗) such that 𝑃𝑖, 𝑃𝑗 are close. (Chap-
ter 2)
𝑆↔ Subset of 𝐼×𝐼 chosen to contain pairs (𝑖, 𝑗) such that there is a lot of probability
flow between 𝑃𝑖 and 𝑃𝑗. (Chapter 2)
𝑇 Transition probabilities of projected Markov process (Chapter 2)
𝑇𝑖,𝑗 Transition between different components, in the general density decomposition
theorem (Chapter 2)
𝑍𝑖 Partition function of \(i\)th temperature, \(\int_{\mathbb{R}^d} e^{-\beta_i f(x)}\,dx\) (Chapter 2)
\(\widehat Z_i\) Estimate of partition function \(Z_i\) (Chapter 2)
𝛿𝑗,𝑘 Overlap between measures \(P_j, P_k\): \(\int_\Omega \min\left\{\frac{dP_k}{dP_j}, 1\right\}P_j(dx)\) (Chapter 2)
𝑓0 Base function (Chapter 2)
𝑓𝑖 Translate of base function 𝑓𝑖(𝑥) = 𝑓0(𝑥− 𝜇𝑖) (Chapter 2)
𝜅 Strong convexity of 𝑓0 (Chapter 2)
𝜆 Rate of simulated tempering (Chapter 2)
𝜇𝑖 Center of 𝑖th component (Chapter 2)
𝑟𝑖 Relative probabilities (Chapter 2)
𝜏 Bound on gradient and Hessian perturbations: \(\|\nabla\tilde f - \nabla f\|_\infty \le \tau\), \(\|\nabla^2\tilde f - \nabla^2 f\|_\infty \le \tau\) (Chapter 2)
𝑤min Minimum weight 𝑤min = min𝑖𝑤𝑖 (Chapter 2)
𝑤𝑖 Weight of 𝑖th component (Chapter 2)
𝐴 Exponential concentration in Assumptions 3.2.2 and 3.2.5: \(\mathbb{P}_{X\sim\pi_t}\left(\|X - x^\star_t\| \ge \frac{\gamma}{\sqrt{t+c}}\right) \le Ae^{-k\gamma}\) (Chapter 3)
B Bound on ‖𝜃0‖ (Chapter 3)
𝐶 Bound on second moment from Assumptions 3.2.2 and 3.2.5: \(m_2^{\frac12} := \left(\mathbb{E}_{x\sim\pi_t}\|x - x^\star_t\|_2^2\right)^{\frac12} \le \frac{C}{\sqrt{t+c}}\) for \(C = \left(2+\frac1k\right)\log\left(\frac{A}{k^2}\right)\) (Chapter 3)
𝐶 ′ Acceptance radius \(C' = 2.5(C_1 + \mathcal{D})\) (Chapter 3)
D Drift parameter in Assumption 3.2.3 (Chapter 3)
𝐹𝑡 \(F_t = \sum_{k=0}^t f_k\) (Chapter 3)
𝐺𝛽 \(G_\beta = \left\{\forall i,\ \|X^\beta_i - x^\star\| \le \frac{R}{\sqrt{\beta T}}\right\}\) (Chapter 3)
𝐺𝑡 \(G_t = \left\{\forall s \le t,\ \forall 0 \le i \le i_s,\ \|X^s_i - x^\star_s\| \le \frac{R}{\sqrt{s + L_0/L}}\right\}\) (Chapter 3)
𝐻𝑡 \(H_t = \left\{\forall s \le t \text{ s.t. } s \text{ is a power of 2 or } s = 0,\ \|X^s - x^\star_s\| \le \frac{C_1}{\sqrt{s + L_0/L}}\right\}\) (Chapter 3)
𝐿 Smoothness (Lipschitz constant of gradient) in Assumption 3.2.1 (Chapter 3)
𝐿0 Smoothness (Lipschitz constant of gradient) of 𝑓0 in Assumption 3.2.1 (Chapter
3)
𝑀 Bound on input vectors to logistic regression (Chapter 3)
𝑃𝑢 Distribution of input vectors (Chapter 3)
𝑋 𝑡 sample returned at epoch 𝑡, 𝑋 𝑡 = 𝑋 𝑡𝑖𝑡 (Chapter 3)
𝑋 𝑡𝑖 𝑖th iterate at epoch 𝑡 (Chapter 3)
𝑏 batch size (Chapter 3)
𝑐 Offset in Assumptions 3.2.2 and 3.2.3 (Chapter 3)
𝜂0 initial step size (Chapter 3)
𝑔𝑖 Stochastic gradient at step 𝑖 (Chapter 3)
𝑖max number of steps (Chapter 3)
𝑘 Exponential concentration in Assumptions 3.2.2 and 3.2.5: \(\mathbb{P}_{X\sim\pi_t}\left(\|X - x^\star_t\| \ge \frac{\gamma}{\sqrt{t+c}}\right) \le Ae^{-k\gamma}\) (Chapter 3)
𝜋𝛽𝑇 Distribution at inverse temperature \(\beta\): \(\pi^\beta_T(x) \propto e^{-\beta\sum_{t=1}^T f_t(x)}\) (Chapter 3)
𝜋𝑡 Distribution at epoch \(t\): \(\pi_t \propto e^{-\sum_{k=0}^t f_k}\) (Chapter 3)
𝜃0 True value of parameter (Chapter 3)
𝑥⋆𝑡 minimizer of 𝐹𝑡 (Chapter 3)