MCMC algorithms for sampling from
multimodal and changing distributions
Holden Lee
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Mathematics
Adviser: Sanjeev Arora
June 2019
© Copyright by Holden Lee, 2019.
All Rights Reserved
Abstract
The problem of sampling from a probability distribution is a fundamental problem in
Bayesian statistics and machine learning, with applications throughout the sciences.
One common algorithmic framework for solving this problem is Markov Chain Monte
Carlo (MCMC). However, a large gap exists between simple settings where MCMC
has been proven to work, and complex settings arising in practice. In this thesis, I
make progress towards closing this gap, focusing on two hurdles in particular.
In Chapter 2, I consider the problem of sampling from multimodal distributions.
Many distributions arising in practice, from simple mixture models to deep genera-
tive models, are multimodal, so any Markov chain which makes local moves will get
stuck in one mode. Although a variety of temperature heuristics are used to address
this problem, their theoretical guarantees are not well-understood even for simple
multimodal distributions. I analyze an algorithm combining Langevin diffusion with
simulated tempering, a heuristic which speeds up mixing by transitioning between
different temperatures of the distribution. I develop a general method to prove mix-
ing time using “soft decompositions” of Markov processes, and use it to prove rapid
mixing for (polynomial) mixtures of log-concave distributions.
In Chapter 3, I address the problem of sampling from the distributions 𝑝𝑡(𝑥) ∝ 𝑒^{−∑_{𝑘=0}^{𝑡} 𝑓𝑘(𝑥)}
for each epoch 1 ≤ 𝑡 ≤ 𝑇 in an online manner, given a sequence of (convex)
functions 𝑓0, . . . , 𝑓𝑇 . This problem arises in large-scale Bayesian inference (for
instance, online logistic regression) where instead of obtaining all the observations at
once, one constantly acquires new data, and must continuously update the distribu-
tion. All previous results for this problem imply a bound on the number of gradient
evaluations at each epoch 𝑡 that grows at least linearly in 𝑡. For this problem, I
show that a certain variance-reduced SGLD (stochastic gradient Langevin dynamics)
algorithm solves the online sampling problem to fixed TV-error 𝜀 using an almost
constant number of gradient evaluations per epoch.
Acknowledgements
I would like to thank my adviser Sanjeev Arora for supporting me throughout my
Ph.D., pointing me to relevant problems in the field, and helping with presentations;
Allan Sly for being a reader; and Weinan E and Ramon van Handel for serving on
my committee. Thanks also to Zeev Dvir for support during the first two years.
I would like to thank all my co-authors: Sanjeev Arora, Rong Ge, Elad Hazan,
Tengyu Ma, Oren Mangoubi, Andrej Risteski, Karan Singh, Nisheeth Vishnoi, Cyril
Zhang, and Yi Zhang. I’ve had many engaging discussions with them, with other
members of the research group including Orestis Plevrakis, Nikunj Saunshi, and Kiran
Vodrahalli, and with the “Machine Learning Rant Group.” I would especially like to
thank Andrej and Cyril for their dedication to our research projects, and for always
being enthusiastic about sharing ideas. Andrej was always ready to give advice as
well as feedback on drafts and presentations. Cyril made sure all our collaborators
were on the same page, and even when our proofs were falling apart, kept calm and
carried on.
Thanks to Jill LeClair and Mitra Kelly for administrative support throughout my
Ph.D.
I thank 2D for being a warm, loving community and for keeping me well-fed for the
past four years, and Arch & Arrow and Graduate Improv for providing a much-needed
creative outlet and community.
I would like to dedicate this thesis to my father, T.Y. Lee, for getting me started
on my mathematical journey, and teaching me that good character is more important
than achievement. Finally I thank my mother, Ching-An Lee, for her continual love
and care.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
1 Introduction: MCMC algorithms 1
1.1 The problem of sampling . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Applications of sampling . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Bayesian modeling . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Theoretical computer science . . . . . . . . . . . . . . . . . . 9
1.3 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 New results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Sampling from multimodal distributions . . . . . . . . . . . . 11
1.4.2 Sampling from changing distributions . . . . . . . . . . . . . . 12
2 Sampling from multimodal distributions using simulated tempering 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Our results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Overview of algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Langevin dynamics . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Simulated tempering . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 Main algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Overview of the proof techniques . . . . . . . . . . . . . . . . . . . . 22
2.4 Theorem statements . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Simulated tempering . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Markov process decomposition theorems . . . . . . . . . . . . . . . . 31
2.6.1 Simple density decomposition theorem . . . . . . . . . . . . . 33
2.6.2 General density decomposition theorem . . . . . . . . . . . . . 36
2.6.3 Theorem for simulated tempering . . . . . . . . . . . . . . . . 42
2.7 Simulated tempering for gaussians with equal variance . . . . . . . . 48
2.7.1 Mixtures of gaussians all the way down . . . . . . . . . . . . . 48
2.7.2 Comparing to the actual chain . . . . . . . . . . . . . . . . . . 53
2.8 Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.9 Proof of main theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3 Online sampling from log-concave distributions 69
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2 Our results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.2 Result in the online setting . . . . . . . . . . . . . . . . . . . . 75
3.2.3 Result in the offline setting . . . . . . . . . . . . . . . . . . . . 77
3.2.4 Application to Bayesian logistic regression . . . . . . . . . . . 79
3.3 Algorithm and proof techniques . . . . . . . . . . . . . . . . . . . . . 81
3.3.1 Overview of online algorithm . . . . . . . . . . . . . . . . . . 81
3.3.2 Overview of offline algorithm . . . . . . . . . . . . . . . . . . 83
3.4 Proof overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4.1 Online problem . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4.2 Offline problem . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.5 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.6 Proof of online theorem (Theorem 3.2.4) . . . . . . . . . . . . . . . . 91
3.6.1 Bounding the variance of the stochastic gradient . . . . . . . . 92
3.6.2 Bounding the escape time from a ball . . . . . . . . . . . . . . 93
3.6.3 Bounding the TV error . . . . . . . . . . . . . . . . . . . . . . 96
3.6.4 Setting the constants; Proof of main theorem . . . . . . . . . . 101
3.7 Proof of offline theorem (Theorem 3.2.6) . . . . . . . . . . . . . . . . 106
3.8 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.9 Discussion and future work . . . . . . . . . . . . . . . . . . . . . . . . 112
A Background on Markov chains and processes 129
A.1 Markov chains and processes . . . . . . . . . . . . . . . . . . . . . . . 129
A.2 Langevin diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
B Appendix for Chapter 2 135
B.1 General log-concave densities . . . . . . . . . . . . . . . . . . . . . . 135
B.1.1 Simulated tempering for log-concave densities . . . . . . . . . 135
B.1.2 Proof of main theorem for log-concave densities . . . . . . . . 137
B.2 Perturbation tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . 140
B.2.1 Simulated tempering for distribution with perturbation . . . . 140
B.2.2 Proof of main theorem with perturbations . . . . . . . . . . . 140
B.3 Continuous version of decomposition theorem . . . . . . . . . . . . . 144
B.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
B.5 Lower bound when Gaussians have different variance . . . . . . . . . 149
B.5.1 Construction of 𝑔 and closeness of two functions . . . . . . . . 153
C Appendix for Chapter 3 157
C.1 Proof for logistic regression application . . . . . . . . . . . . . . . . . 157
C.1.1 Theorem for general posterior sampling, and application to logistic regression . . . . . . 157
C.1.2 Proof of Theorem C.1.1 . . . . . . . . . . . . . . . . . . . . . 160
C.1.3 Online logistic regression: Proof of Lemma C.1.2 and Theorem 3.2.2 . . . . . . 164
D Calculations on probability distributions 169
D.1 Chi-squared and KL inequalities . . . . . . . . . . . . . . . . . . . . . 169
D.2 Chi-squared divergence calculations for log-concave distributions . . . 173
D.3 A probability ratio calculation . . . . . . . . . . . . . . . . . . . . . . 181
D.4 Other facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Chapter 1
Introduction: MCMC algorithms
1.1 The problem of sampling
In this thesis, we consider the question of sampling from a probability distribution
𝑃 whose density function is specified up to a partition function (normalizing con-
stant) [LV06],
𝑝(𝑥) = 𝑒^{−𝑓(𝑥)} / ∫_Ω 𝑒^{−𝑓(𝑥)} 𝑑𝑥. (1.1)
The domain Ω could be a continuous domain, such as R𝑑 or a subset of R𝑑, or a discrete
domain such as the boolean cube {0, 1}^𝑑. We are interested in designing efficient
algorithms for the high-dimensional setting, i.e., algorithms that scale polynomially
in the dimension 𝑑.
Problem 1.1.1 (Sampling). Let Ω = R𝑑. Given query access to 𝑓(𝑥) and ∇𝑓(𝑥),
sample from a distribution 𝑃̃ that is 𝜀-close (in TV-distance or another distance) to
the distribution 𝑃 with density function (1.1).
For general functions 𝑓 , the problem is intractable (#P-hard), so we will need
more assumptions about the structure of 𝑓 to provide provable guarantees.
As we describe in Section 1.2, sampling is a fundamental problem in statistics,
machine learning, and theoretical computer science. It also has applications to simu-
lation of physical systems. The main approach is Markov Chain Monte Carlo [HH13;
Liu08; Bro+11; CSI12], which we describe in Section 1.3.
However, guarantees for MCMC do not cover many practical problems of interest.
We describe our progress in theoretical guarantees for MCMC methods in Section 1.4.
1.2 Applications of sampling
1.2.1 Bayesian modeling
In Bayesian statistics and machine learning, one starts by assuming that observations
are generated by a fixed probabilistic model with unknown parameters 𝜃. Below, we
describe several key tasks in this framework.
The problem of learning the parameters is to find the posterior distribution
of the parameters 𝜃, given the observations. One assumes a prior distribution on
the parameters, 𝑝(𝜃), and fixes a probabilistic model of how the observed random
variables 𝑌 are generated from 𝜃 (and perhaps some other information 𝑋), 𝑝(·|𝑥, 𝜃).
By Bayes’s Rule, the posterior distribution of the parameters 𝜃 is
𝑝(𝜃|𝑦, 𝑥) = 𝑝(𝑦|𝑥, 𝜃)𝑝(𝜃) / 𝑝(𝑦) ∝ 𝑝(𝑦|𝑥, 𝜃)𝑝(𝜃). (1.2)
The problem of inferring the latent variables arises in latent variable models.
A latent variable model is a model for observations 𝑌 that is simple when it is
conditioned on the parameter 𝜃 and some latent (hidden) variable 𝐻 that is not
observed, given by 𝑝𝜃(·|ℎ). Here, 𝐻 is a random variable whose distribution depends
in a known way on 𝜃. The goal is to “infer” the probability distribution on the hidden
variable 𝐻 from the observations 𝑌 . Again by Bayes’s Rule, if 𝜃 is known,
𝑝𝜃(ℎ|𝑥) = 𝑝𝜃(ℎ)𝑝𝜃(𝑥|ℎ) / 𝑝𝜃(𝑥) ∝ 𝑝𝜃(ℎ)𝑝𝜃(𝑥|ℎ). (1.3)
Although the numerator is easy to evaluate, the denominator 𝑝𝜃(𝑥) =∫𝑝𝜃(ℎ)𝑝𝜃(𝑥|ℎ) 𝑑ℎ
can be NP-hard to approximate even for simple models like topic models [SR11].
A probability distribution may be difficult to work with even in a fully observed
model that does not involve latent variables. Even if 𝜃 is known, if 𝑝𝜃(·) is only given
up to a normalization constant (similar to the situation in both tasks above), to
obtain a sample from this distribution, we must implicitly solve a counting problem,
which can also be intractable in general.
In each of the tasks, one desires to understand a certain probability distribution.
However, in general this distribution has no succinct description that allows us to
extract useful information for downstream tasks. We may want to calculate marginals,
or calculate E𝜃∼𝑃 𝑔(𝜃) for some function 𝑔; this includes the mean, variance, and
proportion within a given set. There are two main approaches to understanding the
probability distribution; these comprise the main approaches to Bayesian modeling.
One method is Markov Chain Monte Carlo (MCMC). The idea of MCMC is to design
a Markov chain which has the desired distribution as the stationary distribution. By
running the Markov chain, one obtains samples from the probability distribution;
these samples can then be used to estimate desired quantities like E𝜃∼𝑃 𝑔(𝜃). Another
approach is variational inference [W+08], which seeks to approximate the distribution
with a distribution from a family of distributions that is easier to optimize over, such
as product distributions. We will focus on MCMC in this work.
We now give examples of probabilistic models where MCMC can be used.
Logistic regression: Logistic regression is a fundamental and widely used model
in Bayesian statistics [AC93]. In Bayesian logistic regression, one models the data
(𝑢𝑡 ∈ R𝑑, 𝑦𝑡 ∈ {−1, 1}) as follows: there is some unknown 𝜃0 ∈ R𝑑 such that given
𝑢𝑡 (the independent variable), for all 𝑡 ∈ {1, . . . , 𝑇} the dependent variable 𝑦𝑡 follows
a Bernoulli distribution with “success” probability 𝜑(𝑢𝑡^⊤𝜃) (𝑦𝑡 = 1 with probability
𝜑(𝑢𝑡^⊤𝜃) and −1 otherwise), where 𝜑(𝑥) = 1/(1 + 𝑒^{−𝑥}) is the logistic function:
𝜃 ∼ 𝑃
𝑦𝑡 ∼ Bernoulli(𝜑(𝑢𝑡^⊤𝜃)) ∀ 1 ≤ 𝑡 ≤ 𝑇.
Given the prior distribution 𝑝(𝜃), the posterior distribution is given by
𝑝(𝜃|(𝑢𝑡, 𝑦𝑡)_{𝑡=1}^{𝑇}) ∝ 𝑝(𝜃) 𝑝((𝑦𝑡)_{𝑡=1}^{𝑇}|(𝑢𝑡)_{𝑡=1}^{𝑇}, 𝜃) (1.4)
= 𝑝(𝜃) ∏_{𝑡=1}^{𝑇} 𝜑(𝑦𝑡𝑢𝑡^⊤𝜃) = 𝑝(𝜃) ∏_{𝑡=1}^{𝑇} 1/(1 + 𝑒^{−𝑦𝑡𝑢𝑡^⊤𝜃}). (1.5)
Hence this problem fits in the framework of Problem 1.1.1 with 𝑓(𝜃) = − ln 𝑝(𝜃) + ∑_{𝑡=1}^{𝑇} ln(1 + 𝑒^{−𝑦𝑡𝑢𝑡^⊤𝜃}).
Here, 𝑢𝑡 is a “feature vector,” and the parameter 𝜃 specifies the effects that the
different features have on predicting 𝑦𝑡. More generally, other functions can be used
for the link function 𝜑. Once 𝑝(𝜃|(𝑢𝑡, 𝑦𝑡)_{𝑡=1}^{𝑇}) is estimated, one can then estimate the
response probability for a new data point: 𝑝(𝑦|𝑢, (𝑢𝑡, 𝑦𝑡)_{𝑡=1}^{𝑇}) = ∫ 𝜑(𝑦𝑢^⊤𝜃) 𝑝(𝜃|(𝑢𝑡, 𝑦𝑡)_{𝑡=1}^{𝑇}) 𝑑𝜃.
While logistic regression is a simple model, Bayesian inference is still a practical chal-
lenge for large or streaming datasets [HCB16]. We will apply our online sampling
algorithm to this problem in Chapter 3.
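To make the gradient-query model concrete, the potential 𝑓(𝜃) and ∇𝑓(𝜃) for the logistic posterior can be evaluated directly. A minimal sketch in Python (the standard Gaussian prior 𝑝(𝜃) ∝ exp(−‖𝜃‖²/2), the function names, and the toy data are illustrative assumptions, not from the text):

```python
import numpy as np

def logistic_potential(theta, U, y):
    """f(theta) = -ln p(theta) + sum_t ln(1 + exp(-y_t u_t^T theta)),
    with a standard Gaussian prior p(theta) ∝ exp(-||theta||^2 / 2)."""
    margins = y * (U @ theta)                      # y_t * u_t^T theta
    # ln(1 + e^{-m}) computed stably as logaddexp(0, -m)
    return 0.5 * theta @ theta + np.sum(np.logaddexp(0.0, -margins))

def logistic_potential_grad(theta, U, y):
    """Gradient: theta - sum_t y_t * phi(-y_t u_t^T theta) * u_t."""
    margins = y * (U @ theta)
    return theta - U.T @ (y / (1.0 + np.exp(margins)))

# Toy data: 5 observations with 2-d features and labels in {-1, +1}.
rng = np.random.default_rng(0)
U = rng.standard_normal((5, 2))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
print(logistic_potential(np.zeros(2), U, y))       # = 5 ln 2 ≈ 3.466
```

The log-sum-exp form `np.logaddexp(0, -m)` avoids overflow when the margins are large and negative.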
Latent Dirichlet allocation: Latent Dirichlet allocation (LDA) [BNJ03] is per-
haps the most common probabilistic model for collections of documents. It falls into
the general framework of topic modeling [SS09], a latent variable model where documents
are generated as follows: there is a collection of topics (which are unobserved),
and each topic is associated with a probability distribution over words. A document
is a mixture of topics, and to generate a word from the document, a topic in the mix-
ture is drawn, and a word is drawn from the word distribution of the topic. In LDA,
the relative frequencies of words and topics is a draw from a Dirichlet distribution.
Formally, let 𝐾 be the number of topics, 𝑉 be the number of words in the vocab-
ulary. For hyperparameters 𝛼 and 𝛽, LDA is given by
word frequencies for topics 𝜙𝑘 ∼ Dirichlet_𝑉(𝛽) ∀ 1 ≤ 𝑘 ≤ 𝐾
topic frequencies for documents 𝜃𝑑 ∼ Dirichlet_𝐾(𝛼) ∀ 1 ≤ 𝑑 ≤ 𝑀
topics for words in document 𝑧_{𝑑,𝑖} ∼ Multinomial_𝐾(𝜃𝑑) ∀ 1 ≤ 𝑑 ≤ 𝑀, 1 ≤ 𝑖 ≤ 𝑁𝑑
words in document 𝑤_{𝑑,𝑖} ∼ Multinomial_𝑉(𝜙_{𝑧_{𝑑,𝑖}}) ∀ 1 ≤ 𝑑 ≤ 𝑀, 1 ≤ 𝑖 ≤ 𝑁𝑑.
Note that LDA is an unsupervised learning problem; an algorithm for LDA allows us
to learn the parameters and infer the topics from the documents without a labeled
dataset. It is useful for document classification and information retrieval, and also
challenging in the setting of large, streaming data.
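For concreteness, the LDA generative process above can be sampled directly; a sketch under small illustrative sizes (the function name and hyperparameter values are assumptions):

```python
import numpy as np

def lda_generate(M, K, V, alpha, beta, doc_lengths, rng):
    """Sample a small corpus from the LDA generative process above."""
    phi = rng.dirichlet(np.full(V, beta), size=K)     # word dist. per topic
    theta = rng.dirichlet(np.full(K, alpha), size=M)  # topic dist. per doc
    docs = []
    for d in range(M):
        z = rng.choice(K, size=doc_lengths[d], p=theta[d])   # topic per word
        docs.append(np.array([rng.choice(V, p=phi[k]) for k in z]))
    return phi, theta, docs

rng = np.random.default_rng(0)
phi, theta, docs = lda_generate(M=3, K=2, V=10, alpha=0.5, beta=0.1,
                                doc_lengths=[8, 5, 6], rng=rng)
```

Inference runs this process in reverse: only `docs` is observed, and 𝜙, 𝜃, and the topic assignments must be recovered.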
Dirichlet process mixture model: A Dirichlet process mixture model [Nea00]
is the most commonly used probabilistic model for mixtures of distributions. It is a
latent variable model: observations are generated from one distribution in a mixture,
but which mixture is not revealed. It is defined as follows.
Let 𝐹 (𝜃) be a family of probability distributions indexed by 𝜃 ∈ Θ. Let 𝐺0 be the
prior distribution on 𝜃. The Dirichlet process mixture model on 𝐹 with 𝑘 components,
parameter 𝛼, and prior 𝐺0 is as follows:
mixture components 𝜃1, . . . , 𝜃𝑘 ∼ 𝐺0
mixture proportions 𝑝 ∼ Dirichlet𝑘(𝛼)
class assignments 𝑐𝑖 ∼ Multinomial(𝑝) ∀1 ≤ 𝑖 ≤ 𝑁
observations 𝑦𝑖 ∼ 𝐹 (𝜃𝑐𝑖) ∀1 ≤ 𝑖 ≤ 𝑁.
For example, in the case of Gaussian mixtures, 𝜃 = (𝜇, Σ), 𝐹(𝜇, Σ) = 𝑁(𝜇, Σ), and
one choice for 𝐺0 is 𝜇 ∼ 𝑁(0, 𝜎_0^2 𝐼) and Σ^{−1} ∼ Gamma(1, 𝜎_0^{−2}). In both LDA and the
Dirichlet mixture model, the difficulty comes from the fact that the latent variables
(topics/classes) are not observed, and hence to find what settings of 𝜃 are likely, one
must integrate or sum over all possible topic or class assignments.
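The finite-𝑘 generative process above, instantiated for Gaussian mixtures with Σ = 𝐼, can be sketched as follows (the function name and parameter choices are illustrative):

```python
import numpy as np

def dirichlet_mixture_generate(N, k, alpha, sigma0, d, rng):
    """Sample from the finite Dirichlet mixture of Gaussians above, with
    F(mu) = N(mu, I) and prior G0 = N(0, sigma0^2 I) on the component means."""
    mus = rng.normal(0.0, sigma0, size=(k, d))     # mixture components
    p = rng.dirichlet(np.full(k, alpha))           # mixture proportions
    c = rng.choice(k, size=N, p=p)                 # class assignments
    y = mus[c] + rng.standard_normal((N, d))       # observations
    return mus, p, c, y

rng = np.random.default_rng(0)
mus, p, c, y = dirichlet_mixture_generate(N=50, k=3, alpha=1.0,
                                          sigma0=5.0, d=2, rng=rng)
```

The inference problem observes only `y`; the assignments `c` are the latent variables that must be summed over.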
The posterior distribution is naturally multimodal, even in very simple cases. For
example, fixing the proportions 𝑝 = (1/𝑘, . . . , 1/𝑘) and fixing the variances to be Σ = 𝐼_𝑑,
𝜎_0^2 = 1, the posterior after seeing 𝑦1, . . . , 𝑦𝑁 is
∝ ∏_{𝑛=1}^{𝑁} ( ∑_{𝑗=1}^{𝑘} (1/𝑘) exp(−‖𝑦𝑛 − 𝜇𝑗‖²/2) ),
which when expanded, is seen to be a mixture of exponentially many gaussians. The
naive Markov chain can be torpidly mixing [TD14]. We make progress towards sam-
pling from such multimodal distributions in Chapter 2.
Markov random fields: A Markov random field is a probabilistic model for a
collection of variables whose dependency structure is described as a graph. The state
space is Ω = 𝑆𝑉 , where 𝑆 is the possible states for each variable, and 𝑉 is the set of
vertices of the graph 𝐺. The probability distribution is given by
𝑝(𝑥) ∝ exp( ∑_{𝑐∈Cl(𝐺)} 𝜑𝑐(𝑥𝑐) ) (1.6)
where Cl(𝐺) is the set of cliques of 𝐺. The Ising model is the special case where
the possible states are 𝑆 = {−1, 1}, all interactions are pairwise, and depend on just
whether the two variables are the same:
𝑝(𝑥) ∝ exp( ∑_{(𝑖,𝑗)∈𝐸(𝐺)} 𝐽_{𝑖,𝑗} 𝑥𝑖𝑥𝑗 ). (1.7)
Note that when the interaction strengths 𝐽𝑖,𝑗 (or 𝜑𝑐) are known, there is no hidden
variable; yet obtaining basic information about the distribution (such as calculating
the marginals, or obtaining a sample) can be computationally difficult. These tasks
are roughly equivalent to the problem of computing the partition function, which is
#P-hard in general.
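To illustrate, the unnormalized Ising log-density (1.7) and one step of Glauber (heat-bath) dynamics, a standard Markov chain for this model, can be sketched as follows (the names and edge-list representation are illustrative choices):

```python
import numpy as np

def ising_energy(x, J, edges):
    """The exponent sum_{(i,j) in E} J_ij x_i x_j of the Ising density (1.7)."""
    return sum(J[i, j] * x[i] * x[j] for i, j in edges)

def glauber_step(x, J, edges, rng):
    """One Glauber (heat-bath) update: resample a uniformly random spin x_i
    from its conditional distribution given the rest; stationary for (1.7)."""
    i = rng.integers(len(x))
    # local field at i: sum of J_ab * x_other over edges touching i
    h = sum(J[a, b] * x[b] if a == i else (J[a, b] * x[a] if b == i else 0.0)
            for a, b in edges)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * h))  # P(x_i = +1 | rest)
    x = x.copy()
    x[i] = 1 if rng.random() < p_plus else -1
    return x

# Two spins with a single ferromagnetic edge of strength J_01 = 1.
J = np.zeros((2, 2)); J[0, 1] = 1.0
edges = [(0, 1)]
x = np.array([1, -1])
```

Each update touches one spin, so the chain makes only local moves; this is exactly the kind of chain that can mix torpidly at low temperature.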
The Ising model was introduced as a statistical physics model for magnets; how-
ever, it has since then found many other applications. For example, the Ising model
can be used to define Bayesian neural networks, such as the deep Boltzmann machine
(DBM). In this setting, the nodes are divided into layers and 𝐺 has edges between
adjacent layers; the coefficients are unknown and are to be learned from data.
Finally, we note that one interesting line of research is to automate, as much as
possible, Bayesian inference in the same way that automatic differentiation has au-
tomated optimization. The field of probabilistic programming [FRT14] aims to
combine algorithm and programming language design to make this possible. However,
in terms of theoretical guarantees for convergence, the building blocks of Bayesian in-
ference (like MCMC) are less well-understood than the building block of optimization
algorithms (gradient descent and its variants).
1.2.2 Optimization
Machine learning approaches can roughly be classified into a “probabilistic modeling”
(Bayesian) approach and “optimization” approach, with plenty of overlap; this par-
allels the Bayesian and frequentist approaches to statistics. The Bayesian approach
produces probability distributions as answers—for the parameters of the model and
for the predictions—and the hope is that it accurately captures the uncertainty and
the spread in the problem; it often relies on designing the probabilistic model in such
a way such that the parameters have explanatory power. The optimization approach,
on the other hand, will choose one setting of parameters, and one function giving
the predictions—the one that minimizes the loss function. Feedforward neural net-
works are a hallmark success of this approach. Training these models is often more
efficient and straightforward, but broadly speaking, the results are thought to be less
interpretable than those obtained using probabilistic models.
However, there are many connections between optimization and sampling. Many
optimization algorithms rely implicitly on some kind of sampling algorithm, or include
techniques that are inspired by sampling algorithms.
The primary connection between optimization and sampling is the following: if
we sample from the probability distribution 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥), we are more likely to
obtain samples 𝑥 where 𝑓(𝑥) is small. We can take this to the extreme: In the limit
as the inverse temperature 𝛽 → ∞, the distribution 𝑝𝛽(𝑥) ∝ 𝑒^{−𝛽𝑓(𝑥)} is peaked at
exactly the global minima. In this sense, optimization is “just” a sampling problem.
In reality, however, we often cannot take 𝛽 → ∞, so there is a tradeoff between the
ability to sample and the quality of the solution.
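This tradeoff can be seen numerically: on a discretized double-well potential, the mass of 𝑝𝛽 near the global minimum grows with 𝛽 (a toy illustration; the potential, grid, and function name are arbitrary choices, not from the text):

```python
import numpy as np

def mass_near_min(f, beta, grid, radius):
    """Fraction of the mass of p_beta(x) ∝ exp(-beta f(x)) lying within
    `radius` of the global minimizer, on a discretized 1-d grid."""
    v = f(grid)
    w = np.exp(-beta * (v - v.min()))        # shift by min for stability
    w /= w.sum()
    x_star = grid[np.argmin(v)]
    return w[np.abs(grid - x_star) <= radius].sum()

f = lambda x: (x**2 - 1.0)**2 + 0.3 * x      # double well; global min near -1
grid = np.linspace(-2.5, 2.5, 4001)
m1 = mass_near_min(f, 1.0, grid, 0.3)
m10 = mass_near_min(f, 10.0, grid, 0.3)      # m10 > m1: mass concentrates
```

At 𝛽 = 1 both wells carry comparable mass; at 𝛽 = 10 nearly all the mass sits in the global well, but such a peaked distribution is also harder to sample.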
Sampling algorithms have been used in nonconvex optimization to escape local
minima. The most well-studied such algorithm is Langevin Monte Carlo, which is
essentially gradient descent plus noise. Many researchers have heuristically explained
the success (or limitations) of stochastic gradient descent (SGD) using Langevin dy-
namics [WT11; MHB16]; Langevin dynamics has also been rigorously analyzed for
non-convex optimization [ZLC17]. Langevin dynamics has also inspired temperature-
based methods such as entropy-SGD for optimizing neural networks [Cha+16; YZM17].
Other algorithms are not explicitly based on LMC, but also add noise in judicious
ways to help escape saddle points or local minima [Ge+15; AG16; Jin+17; Jin+18].
Thus, a better understanding of sampling algorithms, and stronger guarantees for
them, will translate to a better understanding of and stronger guarantees for opti-
mization algorithms.
Secondly, sampling algorithms arise as a subroutine in online decision problems.
One generic algorithm for online optimization, called multiplicative weights, is to sam-
ple a point 𝑥𝑡 from the exponential of the (suitably weighted) negative loss ([CL06;
AHK12], [NR17, Lemma 10]). Indeed, there are settings such as online logistic re-
gression in which the only known way to achieve optimal regret is through a Bayesian
sampling approach [Fos+18], with lower bounds known for the naive convex opti-
mization approach [HKL14]. The celebrated Cover’s algorithm [Cov11] is an optimal
algorithm for portfolio management based on posterior sampling. Another applica-
tion is for Thompson sampling for bandit problems [CL11; AG12; AG13b; AG13a;
Rus+18; DFE18].
1.2.3 Theoretical computer science
It is of great interest in theoretical computer science to develop efficient algorithms
that approximately count objects such as independent sets or matchings in graphs,
or the number of solutions to constraint satisfaction problems. One way to develop
algorithms for counting is to first develop algorithms for sampling, because the prob-
lems of approximate counting and sampling are equivalent for self-reducible problems.
If we can define a rapidly mixing Markov chain on the set of solutions to a self-reducible
counting problem, then the counting problem has a fully polynomial-time randomized
approximation scheme (FPRAS).
This strategy has been used to obtain a FPRAS for the partition function of
the ferromagnetic Ising model [JS93], for counting the number of perfect matchings
in a graph, and for approximating the permanent of a matrix with nonnegative entries [JSV04].
In the continuous setting, Markov chains have been used for volume computation
of convex sets [DFK91].
1.3 Markov Chain Monte Carlo
A Markov chain 𝑀 on a state space Ω is a random process 𝑋0, 𝑋1, . . . where the
probability law for the next state depends on just the previous state, in a time-
invariant fashion: There is a fixed transition kernel 𝑇 , where 𝑇 (𝑥, ·) is a measure on
Ω for each 𝑥 ∈ Ω, such that
P(𝑋𝑡+1 ∈ 𝐴|𝑋0, . . . , 𝑋𝑡) = P(𝑋𝑡+1 ∈ 𝐴|𝑋𝑡) = 𝑇 (𝑋𝑡, 𝐴). (1.8)
A stationary measure 𝑃 for the Markov chain is a measure such that if 𝑋0 ∼ 𝑃, then 𝑋1 ∼ 𝑃
and hence 𝑋𝑡 ∼ 𝑃 for all 𝑡. A Markov chain is ergodic if the stationary measure 𝑃
is unique and for any starting measure, the probability measure 𝜋𝑡 of 𝑋𝑡 converges
to 𝑃 in TV-distance. Define the 𝜀-mixing time of 𝑀 to be the smallest 𝑡 such that
for all starting measures 𝜋0,
𝑑𝑇𝑉 (𝜋𝑡, 𝑃 ) ≤ 𝜀, (1.9)
and informally say that 𝑀 has rapid mixing if 𝑡 is small (polynomial in relevant
parameters) and torpid mixing if 𝑡 is large.
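These definitions can be checked on a small discrete chain; a sketch (the lazy random walk on a 3-cycle and the function names are illustrative choices):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between distributions on a finite state space."""
    return 0.5 * np.abs(p - q).sum()

def mixing_time(T, pi0, pi, eps=0.25, t_max=10_000):
    """Smallest t such that d_TV(pi0 T^t, pi) <= eps (T row-stochastic)."""
    p = pi0.copy()
    for t in range(t_max + 1):
        if tv_distance(p, pi) <= eps:
            return t
        p = p @ T                      # one step of the chain
    return None

# Lazy random walk on a 3-cycle; the stationary distribution is uniform.
T = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
pi = np.full(3, 1 / 3)
t_mix = mixing_time(T, np.array([1.0, 0.0, 0.0]), pi, eps=0.01)
# for this chain d_TV shrinks by a factor of 4 each step, so t_mix == 4
```

Here convergence is geometric with rate given by the second-largest eigenvalue of 𝑇 (here 1/4), illustrating the spectral methods mentioned below.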
The theory of Markov chains is a beautiful and well-developed area of mathemat-
ics; however, there is a gap between the simple settings where rapid mixing has been
established (often with sharp rates), and the settings in which MCMC is often used.
Techniques for bounding the mixing time of Markov chains include spectral meth-
ods, functional inequalities inspired from geometry (such as Cheeger’s inequality and
canonical paths), coupling, and Harris recurrence; see [Dia09; Dia11] for a survey. We
give more background on Markov chains in Appendix A.
MCMC has been used in diverse fields, including physics (magnetization, simu-
lation of gases and fluids), chemistry (simulation of molecules), biology (genetics),
engineering (feedback control systems), medicine, economics, social science, food sci-
ence (because there is no such thing as too much bread), and forensics (such as
figuring out who killed Granny [Bal19]).
Key challenges in MCMC include overcoming torpid mixing for multimodal distri-
butions, designing efficient algorithms for big datasets and for streaming data (where
one pass over the data can be expensive), and assessing the quality of samples ob-
tained.
1.4 New results
Despite the extensive literature on Markov Chain Monte Carlo (MCMC) methods,
two shortcomings of current methods limit their applicability to real-world settings:
probability distributions in practice are multimodal, and applications often require
that the distribution be updated in an online fashion in response to streaming data.
1.4.1 Sampling from multimodal distributions
Many distributions that arise in practice, from simple mixture models to deep gener-
ative models, are multimodal, and hence any Markov chain which makes local moves
will get stuck in one mode. Although a variety of temperature heuristics are used to
address this problem in practice, their theoretical guarantees are not well-understood
even for simple multimodal distributions.
A natural Markov chain called Langevin diffusion mixes in polynomial time for
log-concave distributions [BE85], but can take exponential time otherwise (including
for multimodal distributions) [Bov+04]. In Chapter 2, I address this problem by
combining Langevin diffusion with simulated tempering. The resulting Markov
chain mixes more rapidly by transitioning between different temperatures of the dis-
tribution. To show rapid mixing, I develop a general method to prove mixing
times using “soft decompositions” of Markov chains, and use this theory to show that
our algorithm [GLR18] provably samples from polynomial mixtures of log-concave
distributions in polynomial time.
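A schematic of the combined chain (not the exact algorithm of Chapter 2: here the temperature-level weights are taken equal, which amounts to assuming the partition-function ratios are known, and the step size and level schedule are arbitrary):

```python
import numpy as np

def tempered_langevin(grad_f, f, betas, x0, eta=0.02, n_steps=5000, rng=None):
    """Sketch of simulated tempering over discretized Langevin dynamics:
    alternate a Langevin step at the current inverse temperature betas[i]
    with a Metropolis move between adjacent temperature levels."""
    rng = rng or np.random.default_rng(0)
    x, i, L = x0, 0, len(betas)
    samples = []
    for _ in range(n_steps):
        # Langevin step targeting e^{-betas[i] * f(x)}
        x = x - eta * betas[i] * grad_f(x) \
            + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
        # propose a move to an adjacent temperature level
        j = i + rng.choice([-1, 1])
        if 0 <= j < L and rng.random() < np.exp(min(0.0, -(betas[j] - betas[i]) * f(x))):
            i = j
        if i == L - 1:                 # record samples at the target level
            samples.append(x.copy())
    return np.array(samples)

# Target: standard Gaussian, f(x) = ||x||^2 / 2, with levels 0.25 < 0.5 < 1.
f = lambda x: 0.5 * float(x @ x)
grad_f = lambda x: x
samples = tempered_langevin(grad_f, f, [0.25, 0.5, 1.0], np.zeros(1))
```

The hotter levels flatten the distribution so the chain can cross between modes; samples are kept only when the chain is at the target temperature.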
1.4.2 Sampling from changing distributions
Likewise motivated by sampling problems which arise in practice, in Chapter 3, I
address the problem of obtaining independent samples from the distributions 𝑝𝑡(𝑥) ∝
𝑒^{−∑_{𝑘=0}^{𝑡} 𝑓𝑘(𝑥)} for each epoch 1 ≤ 𝑡 ≤ 𝑇 in an online manner, given a sequence of
(convex) functions 𝑓0, . . . , 𝑓𝑇 . This problem arises in large-scale Bayesian inference
where instead of obtaining all the observations at once, one constantly acquires new
data, and must continuously update the distribution. All previous results for this
problem imply a bound on the number of gradient evaluations at each epoch 𝑡 that
grows at least linearly in 𝑡. This is an inherent limitation of any Markov chain which
incorporates all data at each time step.
To overcome this problem, I show that a certain variance-reduced SGLD (stochas-
tic gradient Langevin dynamics) algorithm [LMV19] solves the online sampling prob-
lem to fixed TV-error 𝜀 using an almost constant number of gradient evaluations
per epoch. We make mild assumptions that are motivated by applications such as
online logistic regression. As described in Section 1.2.2, our work on online sampling
has applications to online optimization.
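A schematic of the variance-reduced gradient estimate involved (not the exact algorithm of [LMV19]; the anchor choice, step size, and batch size below are illustrative assumptions):

```python
import numpy as np

def vr_sgld_epoch(grads, theta, theta_anchor, g_anchor, eta, n_inner, batch, rng):
    """One epoch of variance-reduced SGLD (a schematic): the full gradient
    sum_k grad_k(theta) is estimated by
        g_anchor + (t/|B|) * sum_{k in B} (grad_k(theta) - grad_k(theta_anchor)),
    where g_anchor = sum_k grad_k(theta_anchor) was computed once at the anchor,
    so each inner step costs O(batch) gradient evaluations, not O(t)."""
    t = len(grads)
    for _ in range(n_inner):
        B = rng.integers(t, size=batch)          # random minibatch of indices
        g = g_anchor + (t / batch) * sum(
            grads[k](theta) - grads[k](theta_anchor) for k in B)
        theta = theta - eta * g + np.sqrt(2 * eta) * rng.standard_normal(theta.shape)
    return theta

# Toy instance: f_k(theta) = (theta - a_k)^2 / 2, so p_t = N(mean(a), 1/t).
rng = np.random.default_rng(0)
a = rng.standard_normal(10)
grads = [(lambda th, ak=ak: th - ak) for ak in a]
anchor = np.array([a.mean()])
g_anchor = sum(gk(anchor) for gk in grads)
theta = vr_sgld_epoch(grads, anchor.copy(), anchor, g_anchor,
                      eta=0.01, n_inner=200, batch=2, rng=rng)
```

The estimator is unbiased for the full gradient, and its variance is small whenever 𝜃 stays close to the anchor point, which is what keeps the per-epoch gradient cost nearly constant.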
Chapter 2
Sampling from multimodal
distributions using simulated
tempering
This chapter is based on work in [GLR18].
2.1 Introduction
Dealing with multimodal distributions is a core challenge in Markov Chain Monte
Carlo. The naive Markov chain often does not mix rapidly, and we obtain samples
from only one part of the support of the distribution.
Practitioners have dealt with this problem through a variety of heuristics. A
popular family of approaches involve changing the temperature of the distribution.
However, there has been little theoretical analysis of such methods. We give provable
guarantees for a temperature-based method called simulated tempering when it is
combined with Langevin diffusion.
Previous theoretical results on sampling have focused on log-concave distributions,
i.e., distributions of the form 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥) for a convex function 𝑓(𝑥). This is analogous
to convex optimization where the objective function 𝑓(𝑥) is convex. Recently,
there has been renewed interest in analyzing a popular Markov Chain for sampling
from such distributions, when given gradient access to 𝑓—a natural setup for the pos-
terior sampling task described above. In particular, a Markov chain called Langevin
Monte Carlo (see Section 2.2.1), popular with Bayesian practitioners, has been proven
to work, with various rates depending on the precise properties of 𝑓 [Dal16; DM16;
Dal17; CB18b; DMM18].
Yet, just as many interesting optimization problems are nonconvex, many interest-
ing sampling problems are not log-concave. A log-concave distribution is necessarily
uni-modal: its density function has only one local maximum, which is necessarily a
global maximum. This fails to capture many interesting scenarios. Many simple pos-
terior distributions are neither log-concave nor uni-modal, for instance, the posterior
distribution of the means for a mixture of gaussians, given a sample of points from the
mixture of gaussians. In a more practical direction, complicated posterior distribu-
tions associated with deep generative models [RMW14] and variational auto-encoders
[KW13] are believed to be multimodal as well.
In this work we initiate an exploration of provable methods for sampling “beyond
log-concavity,” in parallel to optimization “beyond convexity”. As worst-case results
are precluded by hardness results, we must make assumptions on the distributions of
interest. As a first step, we consider a mixture of strongly log-concave distributions
of the same shape. This class of distributions captures the prototypical multimodal
distribution, a mixture of Gaussians with the same covariance matrix. Our result
is also robust in the sense that even if the actual distribution has density that is
only close to a mixture that we can handle, our algorithm can still sample from
the distribution in polynomial time. Note that the requirement that all Gaussians
have the same covariance matrix is in some sense necessary: in Appendix B.5 we show that even if the covariances of two components differ by a constant factor, no algorithm (with query access to 𝑓 and ∇𝑓) can achieve the same robustness guarantee in polynomial time.
2.1.1 Problem statement
We formalize the problem of interest as follows.
Problem 2.1.1. Let 𝑓 : R𝑑 → R be a function. Given query access to ∇𝑓(𝑥) and 𝑓(𝑥)
at any point 𝑥 ∈ R𝑑, sample from the probability distribution with density function
𝑝(𝑥) ∝ 𝑒−𝑓(𝑥).
In particular, consider the case where 𝑒−𝑓(𝑥) is the density function of a mixture
of strongly log-concave distributions that are translates of each other. That is, there is
a base function 𝑓0 : R𝑑 → R, centers 𝜇1, 𝜇2, . . . , 𝜇𝑚 ∈ R𝑑, and weights 𝑤1, 𝑤2, . . . , 𝑤𝑚 (∑_{𝑖=1}^{𝑚} 𝑤𝑖 = 1) such that

𝑓(𝑥) = − log(∑_{𝑖=1}^{𝑚} 𝑤𝑖 𝑒^{−𝑓0(𝑥−𝜇𝑖)}).  (2.1)

For notational convenience, we will define 𝑓𝑖(𝑥) = 𝑓0(𝑥 − 𝜇𝑖).
The function 𝑓0 specifies a basic “shape” around the modes, and the means 𝜇𝑖
indicate the locations of the modes.
Without loss of generality we assume the mode of the distribution 𝑒−𝑓0(𝑥) is at 0
(∇𝑓0(0) = 0). We also assume 𝑓0 is twice differentiable, and that for any 𝑥 the Hessian
is sandwiched as 𝜅𝐼 ⪯ ∇2𝑓0(𝑥) ⪯ 𝐾𝐼. Such functions are called 𝜅-strongly-convex,
𝐾-smooth functions. The corresponding distributions 𝑒−𝑓0(𝑥) are strongly log-concave
distributions.1

1 On a first read, we recommend concentrating on the case 𝑓0(𝑥) = ‖𝑥‖^2/(2𝜎^2). This corresponds to the case where all the components are spherical Gaussians with mean 𝜇𝑖 and covariance matrix 𝜎^2𝐼.
2.1.2 Our results
We show that there is an efficient algorithm that can sample from this distribution
given just access to 𝑓(𝑥) and ∇𝑓(𝑥).
Theorem 2.1.2 (Main). Given 𝑓(𝑥) as defined in Equation (2.1), where the base
function 𝑓0 satisfies for any 𝑥, 𝜅𝐼 ⪯ ∇2𝑓0(𝑥) ⪯ 𝐾𝐼, and ‖𝜇𝑖‖ ≤ 𝐷 for all 𝑖 ∈ [𝑚],
there is an algorithm (given as Algorithm 2 with appropriate setting of parameters)
with running time poly(1/𝑤min, 𝐷, 𝑑, 1/𝜀, 1/𝜅, 𝐾), which given query access to ∇𝑓 and 𝑓,
outputs a sample from a distribution within TV-distance 𝜀 of 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥).
Note that importantly the algorithm does not have direct access to the mixture
parameters 𝜇𝑖, 𝑤𝑖, 𝑖 ∈ [𝑚] (otherwise the problem would be trivial). Sampling from this
mixture is thus non-trivial: algorithms that are based on making local steps (such
as the ball-walk [LS93; Vem05] and Langevin Monte Carlo) cannot move between
different components of the gaussian mixture when the gaussians are well-separated.
In the algorithm we use simulated tempering (see Section 2.2.2), which is a technique
that adjusts the “temperature” of the distribution in order to move between different
components.
Of course, requiring the distribution to be exactly a mixture of log-concave distri-
butions is a very strong assumption. Our results can be generalized to all functions
that are “close” to a mixture of log-concave distributions.
More precisely, assume the function 𝑓 satisfies the following properties:

∃𝑓̃ : R𝑑 → R where ‖𝑓 − 𝑓̃‖∞ ≤ ∆, ‖∇𝑓 − ∇𝑓̃‖∞ ≤ 𝜏, and ‖∇2𝑓(𝑥) − ∇2𝑓̃(𝑥)‖2 ≤ 𝜏 ∀𝑥 ∈ R𝑑,  (2.2)

and 𝑓̃(𝑥) = − log(∑_{𝑖=1}^{𝑚} 𝑤𝑖 𝑒^{−𝑓0(𝑥−𝜇𝑖)})  (2.3)

where ∇𝑓0(0) = 0, and ∀𝑥, 𝜅𝐼 ⪯ ∇2𝑓0(𝑥) ⪯ 𝐾𝐼.  (2.4)
That is, 𝑒−𝑓 is within an 𝑒^∆ multiplicative factor of an (unknown) mixture of log-concave densities. Our theorem can be generalized to this case.
Theorem 2.1.3 (general case). For function 𝑓(𝑥) that satisfies Equations (2.2),
(2.3), and (2.4), there is an algorithm (given as Algorithm 2 with appropriate setting
of parameters) that runs in time poly(1/𝑤min, 𝐷, 𝑑, 1/𝜀, 𝑒^∆, 𝜏, 1/𝜅, 𝐾), which given query
access to ∇𝑓 and 𝑓 , outputs a sample 𝑥 from a distribution that has TV-distance at
most 𝜀 from 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥).
Both main theorems may seem simple. In particular, one might conjecture that it is easy to use local search algorithms to find all the modes. However, in Section B.4 we give a few examples to show that such simple heuristics do not work (e.g., random initialization is not enough to find all the modes).
The assumption that all the mixture components share the same 𝑓0 (hence when
applied to Gaussians, all Gaussians have the same covariance) is also necessary. In Section B.5, we give an example where for a mixture of two Gaussians, even if the covariances only differ by a constant factor, any algorithm that achieves similar guarantees to Theorem 2.1.3 must take exponential time. The limiting factor is approximately finding all the mixture components. We note that when the approximate
locations of the mixture are known, there are heuristic ways to temper them differ-
ently; see [TRR18].
2.2 Overview of algorithm
Our algorithm combines Langevin diffusion, a chain for sampling from distributions of the form 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥) given only gradient access to 𝑓, with simulated tempering, a heuristic used for tackling multimodality. We briefly define both techniques and recall what is known about them. For technical prerequisites on Markov chains, the reader can refer to Appendix A.
The basic idea to keep in mind is the following: A Markov chain with local
moves such as Langevin diffusion gets stuck in a local mode. Creating a “meta-
Markov chain” which changes the temperature (the simulated tempering chain) can
exponentially speed up mixing.
2.2.1 Langevin dynamics
Langevin Monte Carlo is an algorithm for sampling from 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥) given access
to the gradient of the log-pdf, ∇𝑓 .
The continuous version, overdamped Langevin diffusion (often simply called Langevin
diffusion), is a stochastic process described by the stochastic differential equation
(henceforth SDE)
𝑑𝑋𝑡 = −∇𝑓(𝑋𝑡) 𝑑𝑡 + √2 𝑑𝑊𝑡  (2.5)
where 𝑊𝑡 is the Wiener process (Brownian motion). For us, the crucial fact is that
Langevin dynamics converges to the stationary distribution given by 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥).
We will always assume that ∫_{R𝑑} 𝑒−𝑓(𝑥) 𝑑𝑥 < ∞ and 𝑓 ∈ 𝐶2(R𝑑).
Substituting 𝛽𝑓 for 𝑓 in (2.5) gives the Langevin diffusion process for inverse
temperature 𝛽, which has stationary distribution ∝ 𝑒−𝛽𝑓(𝑥). Equivalently we can
consider the temperature as changing the magnitude of the noise:
𝑑𝑋𝑡 = −∇𝑓(𝑋𝑡) 𝑑𝑡 + √(2𝛽^{−1}) 𝑑𝑊𝑡.
Of course algorithmically we cannot run a continuous-time process, so we run a
discretized version of the above process: namely, we run a Markov chain where the
random variable at time 𝑡 is described as

𝑋𝑡+1 = 𝑋𝑡 − 𝜂∇𝑓(𝑋𝑡) + √(2𝜂) 𝜉𝑡, 𝜉𝑡 ∼ 𝑁(0, 𝐼)  (2.6)

where 𝜂 is the step size. (The reason for the √𝜂 scaling is that running Brownian motion for time 𝜂 scales the variance by 𝜂, and hence the standard deviation by √𝜂.) This is analogous to how gradient descent is a discretization of gradient flow.
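For concreteness, the discretized update (2.6) is a few lines of code. The following is a minimal sketch, not the algorithm analyzed in this chapter; the target (a standard Gaussian) and all parameter values are chosen purely for illustration.

```python
import numpy as np

def langevin_mc(grad_f, x0, eta, n_steps, rng):
    """Discretized (unadjusted) Langevin: x <- x - eta*grad_f(x) + sqrt(2*eta)*xi."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        xi = rng.standard_normal(x.shape)
        x = x - eta * grad_f(x) + np.sqrt(2 * eta) * xi
    return x

# Sample from a 1D standard Gaussian: f(x) = x^2/2, so grad_f(x) = x.
rng = np.random.default_rng(0)
samples = np.array([langevin_mc(lambda x: x, [5.0], 0.01, 2000, rng)[0]
                    for _ in range(200)])
print(samples.mean(), samples.std())  # roughly 0 and 1
```

Because the chain only makes local moves, running the same code on a well-separated bimodal 𝑓 would leave it stuck near one mode; this is exactly the failure that simulated tempering addresses below.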
Prior work on Langevin dynamics
For Langevin dynamics, convergence to the stationary distribution is a classic result
[Bha78]. Fast mixing for log-concave distributions is also a classic result: [BE85;
Bak+08] show that log-concave distributions satisfy a Poincare and log-Sobolev inequality, which characterize the rate of convergence: if 𝑓 is 𝛼-strongly convex, then the mixing time is on the order of 1/𝛼. Of course, algorithmically, one can only run
a “discretized” version of the Langevin dynamics. Analyses of the discretization
are more recent: [Dal16; DM16; Dal17; DK17; CB18b; DMM18] give running time bounds for sampling from a log-concave distribution over R𝑑, and [BEL18] give an algorithm to sample from a log-concave distribution restricted to a convex set by incorporating a projection. We note that these analyses, like ours, are for the simplest kind of
Langevin dynamics, the overdamped case; better rates are known for underdamped
dynamics ([Che+17]), if a Metropolis-Hastings rejection step is used ([Dwi+18]), and
for Hamiltonian Monte Carlo which takes into account momentum ([MS17]).
[RRT17; Che+18; VW19] carefully analyze the effect of discretization for arbi-
trary non-log-concave distributions with certain regularity properties, but the mixing
time is exponential in general; furthermore, it has long been known that transition-
ing between different modes can take exponentially long, a phenomenon known as
meta-stability [Bov+02; Bov+04; BGK05]. The Holley-Stroock Theorem (see e.g.
[BGL13]) shows that guarantees for mixing extend to distributions 𝑒−𝑓(𝑥) where 𝑓(𝑥)
is a “nice” function that is close to a convex function in 𝐿∞ distance; however, this
does not address more global deviations from convexity. [MV17] consider a more
general model with multiplicative noise.
2.2.2 Simulated tempering
For distributions that are far from being log-concave and have many deep modes,
additional techniques are necessary. One proposed heuristic, out of many, is simu-
lated tempering, which swaps between Markov chains that are different temperature
variants of the original chain. The intuition is that the Markov chains at higher tem-
perature can move between modes more easily, and hence, the higher-temperature
chain acts as a “bridge” to move between modes.
Indeed, Langevin dynamics corresponding to a higher temperature distribution (with 𝛽𝑓 rather than 𝑓, where 𝛽 < 1) mixes faster. (Here, we use terminology from statistical physics, letting 𝜏 denote the temperature and 𝛽 = 1/𝜏 denote the inverse temperature.) A high temperature flattens out the distribution. However, we can’t
simply run Langevin at a higher temperature because the stationary distribution is
wrong; the simulated tempering chain combines Markov chains at different tempera-
tures in a way that preserves the stationary distribution.
We can define simulated tempering with respect to any sequence of Markov chains
𝑀𝑖 on the same space Ω. Think of 𝑀𝑖 as the Markov chain corresponding to temper-
ature 𝑖, with stationary distribution 𝑒−𝛽𝑖𝑓 .
Then we define the simulated tempering Markov chain as follows.
∙ The state space is [𝐿] × Ω: 𝐿 copies of the state space (in our case R𝑑), one
copy for each temperature.
∙ The evolution is defined as follows.
1. If the current point is (𝑖, 𝑥), then evolve according to the 𝑖th chain 𝑀𝑖.
2. Propose swaps with some rate 𝜆. When a swap is proposed, attempt to move to a neighboring chain, 𝑖′ = 𝑖 ± 1. With probability min{𝑝𝑖′(𝑥)/𝑝𝑖(𝑥), 1}, the transition is successful. Otherwise, stay at the same point. This is a Metropolis-Hastings step; its purpose is to preserve the stationary distribution.2
The crucial fact to note is that the stationary distribution is a “mixture” of the
distributions corresponding to the different temperatures. Namely:
Proposition 2.2.1. [MP92; Nea96] If the 𝑀𝑘, 1 ≤ 𝑘 ≤ 𝐿 are reversible Markov
chains with stationary distributions 𝑝𝑘, then the simulated tempering chain 𝑀 is a
reversible Markov chain with stationary distribution
𝑝(𝑖, 𝑥) = (1/𝐿) 𝑝𝑖(𝑥).
The typical setting of simulated tempering is as follows. The Markov chains come
from a smooth family of Markov chains with parameter 𝛽 ≥ 0, and 𝑀𝑖 is the Markov
chain with parameter 𝛽𝑖, where 0 ≤ 𝛽1 ≤ . . . ≤ 𝛽𝐿 = 1. We are interested in sampling
from the distribution when 𝛽 is large (𝜏 is small). However, the chain suffers from
torpid mixing in this case, because the distribution is more peaked. The simulated
tempering chain uses smaller 𝛽 (larger 𝜏) to help with mixing. For us, the stationary
distribution at inverse temperature 𝛽 is ∝ 𝑒−𝛽𝑓(𝑥).
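As an illustration of the swap rule, here is a toy discrete-time version (a simplification for illustration, not the continuous-time chain defined later): following the discrete-time convention of footnote 2, a swap is proposed with some probability at each step, inner moves are discretized Langevin steps at the current inverse temperature, and a proposed swap is accepted with probability min{𝑝𝑖′(𝑥)/𝑝𝑖(𝑥), 1} = min{exp((𝛽𝑖 − 𝛽𝑖′)𝑓(𝑥)), 1}. The partition-function correction discussed in Section 2.2.3 is ignored here.

```python
import math, random

def tempering_step(x, i, betas, f, grad_f, eta, swap_prob, rng):
    """One step of a toy simulated tempering chain on levels p_i ∝ e^{-beta_i f}."""
    if rng.random() < swap_prob:
        j = i + rng.choice([-1, 1])           # propose a neighboring level
        if 0 <= j < len(betas):
            accept = min(1.0, math.exp((betas[i] - betas[j]) * f(x)))
            if rng.random() < accept:
                i = j
        return x, i
    xi = rng.gauss(0.0, 1.0)                  # otherwise: Langevin step at level i
    return x - eta * betas[i] * grad_f(x) + math.sqrt(2 * eta) * xi, i

# Bimodal target: mixture of N(-3,1) and N(3,1); numerical gradient for brevity.
def f(x):
    return -math.log(0.5 * math.exp(-(x - 3)**2 / 2) + 0.5 * math.exp(-(x + 3)**2 / 2))

def grad_f(x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

rng = random.Random(0)
betas = [0.05, 0.2, 1.0]
x, i, signs = 0.0, 0, set()
for _ in range(30000):
    x, i = tempering_step(x, i, betas, f, grad_f, 0.1, 0.2, rng)
    if i == len(betas) - 1:
        signs.add(x > 0)
print(signs)  # expect both signs: the lowest-temperature level visits both modes
```

The high-temperature level (𝛽 = 0.05) is nearly flat, so the chain crosses between basins there and carries the result back down the ladder.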
Prior work on simulated tempering
Provable results for this heuristic are few and far between. [WSH09a; Zhe03] lower-bound the spectral gap for generic simulated tempering chains, using a Markov chain decomposition technique due to [MR02]. However, for Problem 2.1.1, which we are interested in, the spectral gap bound in [WSH09a] is exponentially small as a function
of the number of modes. Drawing inspiration from [MR02], we establish a Markov chain decomposition technique that overcomes this.

2 This can be defined as either a discrete or continuous Markov chain. For a discrete chain, we propose a swap with probability 𝜆 and follow the current chain with probability 1 − 𝜆. For a continuous chain, the time between swaps is an exponential distribution with decay 𝜆 (in other words, the times of the swaps form a Poisson process). Note that simulated tempering is traditionally defined for discrete Markov chains, but we will use the continuous version. See Definition 2.5.1 for the formal definition.
One issue that comes up in simulated tempering is estimating the partition func-
tions; various methods have been proposed for this [PP07; Lia05].
2.2.3 Main algorithm
Our algorithm is intuitively the following. Take a sequence of inverse temperatures
𝛽𝑖, starting at a small value and increasing geometrically towards 1. Run simulated
tempering Langevin on these temperatures, suitably discretized. Take the samples
that are at the 𝐿th temperature.
Note that there is one complication: the standard simulated tempering chain assumes that we can compute the ratio between temperatures, 𝑝𝑖′(𝑥)/𝑝𝑖(𝑥). However, we only know the probability density functions up to a normalizing factor (the partition function). To overcome this, we note that if we use the ratios 𝑟𝑖′𝑝𝑖′(𝑥)/(𝑟𝑖𝑝𝑖(𝑥)) instead, where ∑_{𝑖=1}^{𝐿} 𝑟𝑖 = 1, then the chain converges to the stationary distribution with 𝑝(𝑖, 𝑥) = 𝑟𝑖𝑝𝑖(𝑥). Thus, it suffices to estimate each partition function up to a constant factor.
We can do this inductively: running the simulated tempering chain on the first ℓ
levels, we can estimate the partition function 𝑍ℓ+1; then we can run the simulated
tempering chain on the first ℓ + 1 levels. This is what Algorithm 2 does when it calls Algorithm 1 as a subroutine.
A formal description of the algorithm follows.
2.3 Overview of the proof techniques
We summarize the main ingredients and crucial techniques in the proof. Full proofs
appear in the following sections.
Algorithm 1 Simulated tempering Langevin Monte Carlo

INPUT: Temperatures 𝛽1, . . . , 𝛽ℓ; partition function estimates Ẑ1, . . . , Ẑℓ; step size 𝜂, time 𝑇, rate 𝜆, variance of initial distribution 𝜎0.
OUTPUT: A random sample 𝑥 ∈ R𝑑 (approximately from the distribution 𝑝ℓ(𝑥) ∝ 𝑒^{−𝛽ℓ𝑓(𝑥)}).
Let (𝑖, 𝑥) = (1, 𝑥0) where 𝑥0 ∼ 𝑁(0, 𝜎0^2 𝐼).
Let 𝑛 = 0, 𝑇0 = 0.
while 𝑇𝑛 < 𝑇 do
  Determine the next transition time: draw 𝜉𝑛+1 from the exponential distribution 𝑝(𝑥) = 𝜆𝑒^{−𝜆𝑥}, 𝑥 ≥ 0.
  Let 𝜉𝑛+1 ← min{𝑇 − 𝑇𝑛, 𝜉𝑛+1}, 𝑇𝑛+1 = 𝑇𝑛 + 𝜉𝑛+1.
  Let 𝜂′ = 𝜉𝑛+1/⌈𝜉𝑛+1/𝜂⌉ (the largest step size ≤ 𝜂 that evenly divides 𝜉𝑛+1).
  Repeat ⌈𝜉𝑛+1/𝜂⌉ times: update 𝑥 according to 𝑥 ← 𝑥 − 𝜂′𝛽𝑖∇𝑓(𝑥) + √(2𝜂′) 𝜉, 𝜉 ∼ 𝑁(0, 𝐼).
  If 𝑇𝑛+1 < 𝑇 (i.e., the end time has not been reached), let 𝑖′ = 𝑖 ± 1 each with probability 1/2. If 𝑖′ is out of bounds, do nothing. If 𝑖′ is in bounds, make a type 2 transition, where the acceptance ratio is min{(𝑒^{−𝛽𝑖′𝑓(𝑥)}/Ẑ𝑖′)/(𝑒^{−𝛽𝑖𝑓(𝑥)}/Ẑ𝑖), 1}.
  𝑛 ← 𝑛 + 1.
end while
If the final state is (ℓ, 𝑥) for some 𝑥 ∈ R𝑑, return 𝑥. Otherwise, re-run the chain.
Algorithm 2 Main algorithm

INPUT: A function 𝑓 : R𝑑 → R satisfying Assumption (2.2), to which we have gradient access.
OUTPUT: A random sample 𝑥 ∈ R𝑑.
Let 0 ≤ 𝛽1 < · · · < 𝛽𝐿 = 1 be a sequence of inverse temperatures satisfying (2.192) and (2.193).
Let Ẑ1 = 1.
for ℓ = 1 → 𝐿 do
  Run the simulated tempering chain in Algorithm 1 with temperatures 𝛽1, . . . , 𝛽ℓ, partition function estimates Ẑ1, . . . , Ẑℓ, step size 𝜂, time 𝑇, and rate 𝜆 given by Lemma 2.9.2.
  If ℓ = 𝐿, return the sample.
  If ℓ < 𝐿, repeat to get 𝑛 = 𝑂(𝐿^2 ln(1/𝛿)) samples, and let Ẑℓ+1 = Ẑℓ · (1/𝑛) ∑_{𝑗=1}^{𝑛} 𝑒^{(−𝛽ℓ+1+𝛽ℓ)𝑓(𝑥𝑗)}.
end for
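In code, the main loop of Algorithm 1 looks roughly as follows. This is a simplified one-dimensional sketch for illustration; the parameter choices of Lemma 2.9.2 are not reproduced here, and the sanity-check parameters are made up.

```python
import math, random

def sim_temp_langevin(f, grad_f, betas, Zhat, eta, T, lam, sigma0, rng):
    """Simulated tempering Langevin (1-D): Poisson(lam) swap proposals, Langevin in between."""
    i, x = 0, rng.gauss(0.0, sigma0)
    t = 0.0
    while t < T:
        dt = min(rng.expovariate(lam), T - t)   # time until next proposed swap
        steps = max(1, math.ceil(dt / eta))
        eta_p = dt / steps                      # largest step size <= eta dividing dt evenly
        for _ in range(steps):
            x = x - eta_p * betas[i] * grad_f(x) + math.sqrt(2 * eta_p) * rng.gauss(0.0, 1.0)
        t += dt
        if t < T:                               # type 2 transition (Metropolis swap)
            j = i + rng.choice([-1, 1])
            if 0 <= j < len(betas):
                ratio = ((math.exp(-betas[j] * f(x)) / Zhat[j])
                         / (math.exp(-betas[i] * f(x)) / Zhat[i]))
                if rng.random() < min(1.0, ratio):
                    i = j
    return i, x  # keep x only if i == len(betas) - 1 (the target temperature)

# Sanity check on f(x) = x^2/2, using the exact partition functions Z_beta = sqrt(2*pi/beta).
rng = random.Random(1)
Zhat = [math.sqrt(2 * math.pi / 0.5), math.sqrt(2 * math.pi)]
runs = [sim_temp_langevin(lambda x: x * x / 2, lambda x: x, [0.5, 1.0], Zhat,
                          0.05, 5.0, 1.0, 1.0, rng) for _ in range(300)]
samples = [x for i, x in runs if i == 1]
print(len(samples))  # roughly half the runs end at the target temperature
```

With exact partition functions the stationary distribution is uniform over levels, which is why roughly half the runs are kept; Algorithm 2 exists precisely to supply good enough estimates Ẑ𝑖 when the exact values are unknown.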
Step 1: Define a continuous version of the simulated tempering Markov chain (Definition 2.5.1, Lemma 2.5.2), where transition times are real numbers determined by an exponential waiting-time distribution.
Step 2: Prove a new decomposition theorem (Theorem 2.6.5) for bounding the
spectral gap (or equivalently, the mixing time) of the simulated tempering chain we
define. This is the main technical ingredient, and also a result of independent interest.
While decomposition theorems have appeared in the Markov chain literature (e.g. [MR02]), typically one partitions the state space, and bounds the spectral gap using (1) the probability flow of the chain inside the individual sets, and (2) the probability flow between different sets.
In our case, we decompose the Markov chain itself; this includes a decomposition
of the stationary distribution into components. (More precisely, we show a decom-
position theorem on the generator of the tempering chain.) We would like to do this
because in our setting, the stationary distribution is exactly a mixture distribution
(Problem 2.1.1).
Our Markov chain decomposition theorem bounds the spectral gap (mixing time)
of a simulated tempering chain in terms of the spectral gap (mixing time) of two
chains:
1. “component” chains on the mixture components
2. a “projected” chain whose state space is the set of components, and which
captures the action of the chain between components as well as the distance
between the mixture components (measured in terms of their overlap)
This means that if the Markov chain on the individual components mixes rapidly,
and the “projected” chain mixes rapidly, then the simulated tempering chain mixes
rapidly as well. (Note [MR02, Theorem 1.2] does partition into mixture components, but it only considers the special case where the components are laid out in a chain.)
The mixing time of a continuous Markov chain is quantified by a Poincare in-
equality.
Theorem (Simplified version of Theorem 2.6.5). Consider the simulated tempering chain 𝑀 with rate 𝜆 = 1/𝐶, where the Markov chain at the 𝑖th level (temperature) is 𝑀𝑖 = (Ω, L𝑖) with stationary distribution 𝑝𝑖, for 1 ≤ 𝑖 ≤ 𝐿. Suppose we have a decomposition of the Markov chain at each level, 𝑝𝑖𝑀𝑖 = ∑_{𝑗=1}^{𝑚} 𝑤𝑖,𝑗𝑝𝑖,𝑗𝑀𝑖,𝑗, where ∑_{𝑗=1}^{𝑚} 𝑤𝑖,𝑗 = 1. If each 𝑀𝑖,𝑗 satisfies a Poincare inequality with constant 𝐶, and the projected chain M̄ satisfies a Poincare inequality with constant C̄, then 𝑀 satisfies a Poincare inequality with constant 𝑂(𝐶(1 + C̄)).
Here, the projected chain M̄ is the chain on [𝐿] × [𝑚] with probability flow in the same and adjacent levels given by

𝑇((𝑖, 𝑗), (𝑖, 𝑗′)) = 𝑤𝑖,𝑗′ 𝛿(𝑖,𝑗),(𝑖,𝑗′)  (2.7)
𝑇((𝑖, 𝑗), (𝑖 ± 1, 𝑗)) = 𝛿(𝑖,𝑗),(𝑖±1,𝑗)  (2.8)

where 𝛿(𝑖,𝑗),(𝑖′,𝑗′) := ∫_Ω min{𝑝𝑖,𝑗(𝑥), 𝑝𝑖′,𝑗′(𝑥)} 𝑑𝑥 is the overlap.
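The overlap quantities 𝛿 are concrete and easy to estimate. For example, for two unit-variance Gaussians on R with means ±𝜇, the overlap ∫min{𝑝1, 𝑝2} equals 2Φ(−𝜇), where Φ is the standard normal CDF. A quick Monte Carlo check (illustrative only, not part of the algorithm):

```python
import math, random

def overlap_mc(mu, n, rng):
    """Estimate ∫ min{p1, p2} for p1 = N(mu, 1), p2 = N(-mu, 1),
    via E_{x ~ p1}[min{1, p2(x)/p1(x)}], where p2(x)/p1(x) = exp(-2*mu*x)."""
    acc = 0.0
    for _ in range(n):
        x = rng.gauss(mu, 1.0)
        acc += min(1.0, math.exp(-2 * mu * x))
    return acc / n

rng = random.Random(0)
mu = 1.0
est = overlap_mc(mu, 200000, rng)
exact = math.erfc(mu / math.sqrt(2))  # equals 2*Phi(-mu)
print(est, exact)  # both ≈ 0.3173
```

As 𝜇 grows, this overlap decays like 𝑒^{−𝜇^2/2}, which is why the projected chain needs the flattened high-temperature levels to connect distant components.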
The decomposition theorem is the reason why we use a slightly different simulated
tempering chain, which is allowed to transition at arbitrary times, with some rate 𝜆.
Such a chain “composes” nicely with the decomposition of the Langevin chain, and
allows a better control of the Dirichlet form of the tempering chain, which governs
the mixing time.
Step 3: Finally, we need to apply the decomposition theorem to our setup,
namely a distribution which is a mixture of strongly log-concave distributions. The
“components” of the decomposition in our setup are simply the mixture components
𝑒−𝑓0(𝑥−𝜇𝑗). We rely crucially on the fact that Langevin diffusion on a mixture distribution decomposes into Langevin diffusion on the individual components.
We actually first analyze the hypothetical simulated tempering Langevin chain on 𝑝̃𝑖 ∝ ∑_{𝑗=1}^{𝑚} 𝑤𝑗 𝑒^{−𝛽𝑖𝑓0(𝑥−𝜇𝑗)} (Theorem 2.7.1), i.e., where the stationary distribution for each temperature is itself a mixture. Then in Lemma 2.7.5 we compare to the actual simulated tempering Langevin chain that we can run, where 𝑝𝑖 ∝ 𝑝^{𝛽𝑖}. To do this, we use the fact that 𝑝̃𝑖 is off from 𝑝𝑖 by a factor of at most 1/𝑤min. (This is the only place where a factor of 𝑤min comes in.)
To use our Markov chain decomposition theorem, we need to show two things:
1. The component chains mix rapidly: this follows from the classic fact that
Langevin diffusion mixes rapidly for log-concave distributions.
2. The projected chain mixes rapidly: The “projected” chain is defined as having
more probability flow between mixture components in the same or adjacent
temperatures which are close together in 𝜒2-divergence.
By choosing the temperatures close enough, we can ensure that the correspond-
ing mixture components in adjacent temperatures are close (in the sense of
having high overlap). By choosing the highest temperature large enough, we
can ensure that all the mixture components at the highest temperature are
close.
From this it follows that we can easily get from any component to any other (by
traveling up to the highest temperature and then back down). Thus the pro-
jected chain mixes rapidly from the method of canonical paths, Theorem A.1.3.
Note that the equal variance (for gaussians) or shape (for general log-concave dis-
tributions) condition is necessary here. For gaussians with different variance, the
Markov chain can fail to mix between components at the highest temperature. This
is because scaling the temperature changes the variance of all the components equally,
and preserves their ratio (which is not equal to 1).
Step 4: We analyze the error from discretization (Lemma 2.8.1), and choose
parameters so that it is small. We show that in Algorithm 2 we can inductively
estimate the partition functions. When we have all the estimates, we can run the
simulated tempering chain on all the temperatures to get the desired sample.
2.4 Theorem statements
We restate the main theorems more precisely. First define the assumptions.
Assumption 2.4.1. The function 𝑓 satisfies the following. There exists a function 𝑓̃ that satisfies the following properties.

1. 𝑓̃, ∇𝑓̃, and ∇2𝑓̃ are close to 𝑓:

‖𝑓 − 𝑓̃‖∞ ≤ ∆, ‖∇𝑓 − ∇𝑓̃‖∞ ≤ 𝜏, and ∇2𝑓(𝑥) ⪯ ∇2𝑓̃(𝑥) + 𝜏𝐼, ∀𝑥 ∈ R𝑑.  (2.9)
2. 𝑓̃ is the negative log-pdf of a mixture:

𝑓̃(𝑥) = − log(∑_{𝑖=1}^{𝑚} 𝑤𝑖 𝑒^{−𝑓0(𝑥−𝜇𝑖)})  (2.10)
where ∇𝑓0(0) = 0 and
(a) 𝑓0 is 𝜅-strongly convex: ∇2𝑓0(𝑥) ⪰ 𝜅𝐼 for 𝜅 > 0.
(b) 𝑓0 is 𝐾-smooth: ∇2𝑓0(𝑥) ⪯ 𝐾𝐼.
Our main theorem is the following.
Theorem 2.4.2 (Main theorem, Gaussian version). Suppose

𝑓(𝑥) = − ln(∑_{𝑗=1}^{𝑚} 𝑤𝑗 exp(−‖𝑥 − 𝜇𝑗‖^2/(2𝜎^2)))  (2.11)

on R𝑑 where ∑_{𝑗=1}^{𝑚} 𝑤𝑗 = 1, 𝑤min = min_{1≤𝑗≤𝑚} 𝑤𝑗 > 0, and 𝐷 = max_{1≤𝑗≤𝑚} ‖𝜇𝑗‖. Then Algorithm 2 with parameters satisfying 𝑡, 𝑇, 𝜂^{−1}, 𝛽1^{−1}, (𝛽𝑖 − 𝛽𝑖−1)^{−1} = poly(1/𝑤min, 𝐷, 𝑑, 1/𝜎^2, 1/𝜀) produces a sample from a distribution 𝑝′ with ‖𝑝 − 𝑝′‖1 ≤ 𝜀 in time poly(1/𝑤min, 𝐷, 𝑑, 1/𝜎^2, 1/𝜀).
The precise parameter choices are given in Lemma 2.9.2. Examining the parameters, the number of temperatures required is 𝐿 = Θ̃(𝑑), the amount of time to simulate the Markov chain is 𝑇 = Θ̃(𝐿^2/𝑤min^3), and the step size is 𝜂 = Θ̃(𝜀^2/(𝑑𝑇)), so the total number of steps is 𝑇/𝜂 = Θ̃(𝑇^2𝑑/𝜀^2) = Θ̃(𝑑^5/(𝜀^2𝑤min^6)). Note that to obtain the actual complexity, we need to additionally multiply by a factor of 𝐿^4 = Θ̃(𝑑^4): one factor of 𝐿 comes from needing to estimate the partition function at each temperature, a factor of 𝐿^2 comes from the fact that we need 𝐿^2 samples at each temperature to do this, and the final factor of 𝐿 comes from the fact that we reject the sample if it is not at the final temperature. (We have not made an effort to optimize this 𝐿^4 factor.)
Our more general theorem allows the mixture components to come from an arbitrary log-concave distribution 𝑝(𝑥) ∝ 𝑒−𝑓0(𝑥).
Theorem 2.4.3 (Main theorem). Suppose 𝑝(𝑥) ∝ 𝑒−𝑓(𝑥) and 𝑓(𝑥) = − ln(∑_{𝑖=1}^{𝑚} 𝑤𝑖 𝑒^{−𝑓0(𝑥−𝜇𝑖)}) on R𝑑, where the function 𝑓0 satisfies Assumption 2.4.1(2) (𝑓0 is 𝜅-strongly convex, 𝐾-smooth, and has minimum at 0), ∑_{𝑖=1}^{𝑚} 𝑤𝑖 = 1, 𝑤min = min_{1≤𝑖≤𝑚} 𝑤𝑖 > 0, and 𝐷 = max_{1≤𝑖≤𝑚} ‖𝜇𝑖‖. Then Algorithm 2 with parameters satisfying 𝑡, 𝑇, 𝜂^{−1}, 𝛽1^{−1}, (𝛽𝑖 − 𝛽𝑖−1)^{−1} = poly(1/𝑤min, 𝐷, 𝑑, 1/𝜀, 1/𝜅, 𝐾) produces a sample from a distribution 𝑝′ with ‖𝑝 − 𝑝′‖1 ≤ 𝜀 in time poly(1/𝑤min, 𝐷, 𝑑, 1/𝜀, 1/𝜅, 𝐾).
The precise parameter choices are given in Lemma B.1.3.
Theorem 2.4.4 (Main theorem with perturbations). Keep the setup of Theorem 2.4.3. If instead 𝑓 satisfies Assumption 2.4.1 (𝑓 is ∆-close in 𝐿∞ norm to the negative log-pdf of a mixture of log-concave distributions), then the result of Theorem 2.4.3 holds with an additional factor of poly(𝑒^∆, 𝜏) in the running time.
2.5 Simulated tempering
First we define a continuous version of the simulated tempering Markov chain (Def-
inition 2.5.1). Unlike the usual definition of a simulated tempering chain in the
literature, the transition times can be arbitrary real numbers. Our definition falls out
naturally from writing down the generator L as a combination of the generators for
the individual chains and for the transitions between temperatures (Lemma 2.5.2).
Because L decomposes in this way, the Dirichlet form E will be easier to control in
Theorem 2.6.5.
Definition 2.5.1. Let 𝑀𝑖, 𝑖 ∈ [𝐿] be a sequence of continuous Markov processes with
state space Ω with stationary distributions 𝑝𝑖(𝑥) (with respect to a reference measure).
Let 𝑟𝑖, 1 ≤ 𝑖 ≤ 𝐿 satisfy
𝑟𝑖 > 0, ∑_{𝑖=1}^{𝐿} 𝑟𝑖 = 1.
Define the continuous simulated tempering Markov process 𝑀st with rate 𝜆
and relative probabilities 𝑟𝑖 as follows.
The states of 𝑀st are [𝐿]× Ω.
For the evolution, let (𝑇𝑛)𝑛≥0 be a Poisson point process on R≥0 with rate 𝜆, i.e.,
𝑇0 = 0 and
𝑇𝑛+1 − 𝑇𝑛|𝑇1, . . . , 𝑇𝑛 ∼ Exp(𝜆) (2.12)
with probability density 𝑝(𝑡) = 𝜆𝑒−𝜆𝑡. If the state at time 𝑇𝑛 is (𝑖, 𝑥), then the Markov
process evolves according to 𝑀𝑖 on the time interval [𝑇𝑛, 𝑇𝑛+1). The state 𝑋_{𝑇𝑛+1} at time 𝑇𝑛+1 is obtained from the left limit 𝑋_{𝑇𝑛+1}^− := lim_{𝑡→𝑇𝑛+1^−} 𝑋𝑡 by a “Type 2” transition: if 𝑋_{𝑇𝑛+1}^− = (𝑖, 𝑥), then transition to (𝑗 = 𝑖 ± 1, 𝑥) each with probability

(1/2) min{𝑟𝑗𝑝𝑗(𝑥)/(𝑟𝑖𝑝𝑖(𝑥)), 1}

and stay at (𝑖, 𝑥) otherwise. (If 𝑗 is out of bounds, then don’t move.)
Lemma 2.5.2. Let 𝑀𝑖, 𝑖 ∈ [𝐿] be a sequence of continuous Markov processes with
state space Ω, generators L𝑖 (with domains 𝒟(L𝑖)), and unique stationary distribu-
tions 𝑝𝑖. Then the continuous simulated tempering Markov process 𝑀st with rate 𝜆
and relative probabilities 𝑟𝑖 has generator L defined by the following equation, where
𝑔 = (𝑔1, . . . , 𝑔𝐿) ∈ ∏_{𝑖=1}^{𝐿} 𝒟(L𝑖):

(L𝑔)(𝑖, 𝑥) = (L𝑖𝑔𝑖)(𝑥) + (𝜆/2) ∑_{1≤𝑗≤𝐿, 𝑗=𝑖±1} min{𝑟𝑗𝑝𝑗(𝑥)/(𝑟𝑖𝑝𝑖(𝑥)), 1} (𝑔𝑗(𝑥) − 𝑔𝑖(𝑥)).  (2.13)
The corresponding Dirichlet form is

E(𝑔, 𝑔) = −⟨𝑔, L𝑔⟩  (2.14)
= ∑_{𝑖=1}^{𝐿} 𝑟𝑖 E𝑖(𝑔𝑖, 𝑔𝑖) + (𝜆/2) ∑_{1≤𝑖,𝑗≤𝐿, 𝑗=𝑖±1} ∫_Ω 𝑟𝑖𝑝𝑖(𝑥) min{𝑟𝑗𝑝𝑗(𝑥)/(𝑟𝑖𝑝𝑖(𝑥)), 1} (𝑔𝑖(𝑥)^2 − 𝑔𝑖(𝑥)𝑔𝑗(𝑥)) 𝑑𝑥  (2.15)
= ∑_{𝑖=1}^{𝐿} 𝑟𝑖 E𝑖(𝑔𝑖, 𝑔𝑖) + (𝜆/4) ∑_{1≤𝑖,𝑗≤𝐿, 𝑗=𝑖±1} ∫_Ω (𝑔𝑖 − 𝑔𝑗)^2 min{𝑟𝑖𝑝𝑖, 𝑟𝑗𝑝𝑗} 𝑑𝑥  (2.16)

where E𝑖(𝑔𝑖, 𝑔𝑖) = −⟨𝑔𝑖, L𝑖𝑔𝑖⟩_{𝑃𝑖}.
Proof. Continuous simulated tempering is a Markov process because the Poisson process is memoryless. We show that its generator equals L. Let 𝐹 be the operator which acts by

𝐹𝑔(𝑥, 𝑖) = 𝑔𝑖(𝑥) + (1/2) ∑_{1≤𝑗≤𝐿, 𝑗=𝑖±1} min{𝑟𝑗𝑝𝑗(𝑥)/(𝑟𝑖𝑝𝑖(𝑥)), 1} (𝑔𝑗(𝑥) − 𝑔𝑖(𝑥)).  (2.17)
Let 𝑁𝑡 = max{𝑛 : 𝑇𝑛 ≤ 𝑡}. Let 𝑃𝑗,𝑡 be such that (𝑃𝑗,𝑡𝑔)(𝑥) = E_{𝑀𝑗}[𝑔(𝑥𝑡)|𝑥0 = 𝑥], the expected value after running 𝑀𝑗 for time 𝑡, and let 𝑃𝑡 be the same operator for 𝑀. We have, letting 𝑃′_𝑠 = ∑_{𝑗=1}^{𝐿} 𝛿𝑗 × 𝑃𝑗,𝑠 (where 𝛿𝑗(𝑖) = 1_{𝑖=𝑗} is a function on [𝐿]),
𝑃𝑡𝑔 = P(𝑁𝑡 = 0) ∑_{𝑗=1}^{𝐿} 𝛿𝑗 × 𝑃𝑗,𝑡𝑔𝑗 + ∫_0^𝑡 𝑃′_𝑠 𝐹 𝑃′_{𝑡−𝑠} 𝑔 P(𝑡1 = 𝑑𝑠, 𝑁𝑡 = 1) + P(𝑁𝑡 ≥ 2)ℎ,  (2.18)

where ‖ℎ‖_{𝐿2(𝑃)} ≤ ‖𝑔‖_{𝐿2(𝑃)} (by contractivity of Markov processes). Here, 𝑃′_𝑠 𝐹 𝑃′_{𝑡−𝑠} comes from moving for time 𝑠 at one level, doing a level change, then moving for time 𝑡 − 𝑠 on the new level. By basic properties of the Poisson process, P(𝑁𝑡 = 0) = 1 − 𝜆𝑡 + 𝑂(𝑡^2), P(𝑡1 = 𝑠, 𝑁𝑡 = 1) = 𝜆 + 𝑂(𝑡) for 0 ≤ 𝑠 ≤ 𝑡, and P(𝑁𝑡 ≥ 2) = 𝑂(𝑡^2), so
𝑑/𝑑𝑡 (𝑃𝑡𝑔)|_{𝑡=0} = −𝜆𝑔 + ∑_{𝑗=1}^{𝐿} 𝛿𝑗 × L𝑗𝑔𝑗 + 𝜆𝐹𝑔 = L𝑔,  (2.19)

using that ∑_{𝑗=1}^{𝐿} 𝛿𝑗 × 𝑃𝑗,0 𝑔𝑗 = 𝑔.
2.6 Markov process decomposition theorems
For ease of reading, we first prove a simple density decomposition theorem, Theo-
rem 2.6.1. (This may be skipped, as it is a consequence of Theorem 2.6.3, up to
constants.) Then we will prove a general density decomposition theorem for Markov
processes, Theorem 2.6.3. Then we show how to specialize Theorem 2.6.3 to the case
of simulated tempering in Theorem 2.6.5, which is the density decomposition theorem
that we use to prove the main Theorem 2.4.2. We also give a version of the theorem
for a continuous index set, Theorem B.3.1.
We compare Theorems 2.6.1 and 2.6.3 to decomposition theorems in the literature, [MR02, Theorems 1.1, 1.2] and [WSH09a, Theorem 5.2]. Note that our theorems are stated for continuous-time Markov processes, while the others are stated for discrete time; however, either proof could be adapted to the other setting.
∙ In Theorem 2.6.1 we use the Poincare constants of the component Markov
processes, and the distance of their stationary distributions to each other, to
bound the Poincare constant of the original chain. (Theorem 2.6.1 gives a bound
in terms of the 𝜒2 divergences, but Remark 2.6.2 gives the bound in terms of
the “overlap” quantity which is used in the literature.)
This is a generalization of [MR02, Theorem 1.1], which deals with the special case where the state space is partitioned into overlapping sets, and [MR02, Theorem 1.2], which covers the special case where the component distributions are laid out in a chain. Our theorem can deal with any arrangement.
∙ In Theorem 2.6.3 we additionally use the “probability flow” between components
to get a more general bound. It involves partitioning the pairs of indices 𝐼 × 𝐼
into 𝑆 and 𝑆↔, where to get a good bound, one puts (𝑖, 𝑗) where 𝑝𝑖 and 𝑝𝑗 are
close into 𝑆, and (𝑖, 𝑗) where there is a lot of “probability flow” between 𝑝𝑖 and
𝑝𝑗 into 𝑆↔. Theorem 2.6.1 is the special case of Theorem 2.6.3 when 𝑆↔ = ∅. Note that [WSH09a, Theorem 5.2] is similar to the case where 𝑆 = ∅. However,
they depend only on the probability flow, while we depend on the “overlap” in
the probability flow.
2.6.1 Simple density decomposition theorem
Theorem 2.6.1 (Simple density decomposition theorem). Let 𝑀 = (Ω,L ) be a
(continuous-time) Markov process with stationary measure 𝑃 and Dirichlet form
E (𝑔, 𝑔) = −⟨𝑔,L 𝑔⟩𝑃 . Suppose the following hold.
1. There is a decomposition

⟨𝑓, L𝑔⟩_𝑃 = ∑_{𝑗=1}^{𝑚} 𝑤𝑗 ⟨𝑓, L𝑗𝑔⟩_{𝑃𝑗}  (2.20)
𝑃 = ∑_{𝑗=1}^{𝑚} 𝑤𝑗𝑃𝑗,  (2.21)

where L𝑗 is the generator for some Markov chain 𝑀𝑗 on Ω with stationary distribution 𝑃𝑗.
2. (Mixing for each 𝑀𝑗) The Dirichlet form E𝑗(𝑓, 𝑔) = −⟨𝑓, L𝑗𝑔⟩_{𝑃𝑗} satisfies the Poincare inequality

Var_{𝑃𝑗}(𝑔) ≤ 𝐶 E𝑗(𝑔, 𝑔).  (2.22)
3. (Mixing for projected chain) Define the projected process M̄ as the Markov process on [𝑚] generated by L̄, where L̄ acts on 𝑔 ∈ 𝐿2([𝑚]) by3

L̄𝑔(𝑗) = ∑_{1≤𝑘≤𝑚, 𝑘≠𝑗} [𝑔(𝑘) − 𝑔(𝑗)] 𝑇(𝑗, 𝑘)  (2.23)
where 𝑇(𝑗, 𝑘) = 𝑤𝑘/𝜒^2_max(𝑃𝑗||𝑃𝑘),  (2.24)

where 𝜒^2_max(𝑃||𝑄) := max{𝜒^2(𝑃||𝑄), 𝜒^2(𝑄||𝑃)}. (Define 𝑇(𝑗, 𝑘) = 0 if this is infinite.) Let P̄ be the stationary distribution of M̄; suppose M̄ satisfies the Poincare inequality

Var_{P̄}(𝑔) ≤ C̄ Ē(𝑔, 𝑔).  (2.25)

3 M̄ is defined so that the rate of diffusion from 𝑗 to 𝑘 is given by 𝑇(𝑗, 𝑘).
Then 𝑀 satisfies the Poincare inequality
Var𝑃 (𝑔) ≤ 𝐶
(1 +
𝐶
2
)E (𝑔, 𝑔). (2.26)
Remark 2.6.2. The theorem also holds with 𝑇(𝑗, 𝑘) = 𝑤𝑘𝛿𝑗,𝑘, where 𝛿𝑗,𝑘 is defined by

𝑃𝑗,𝑘(𝑑𝑥) = min{𝑑𝑃𝑘/𝑑𝑃𝑗, 1} 𝑃𝑗(𝑑𝑥) = min{𝑝𝑗, 𝑝𝑘} 𝑑𝑥  (2.27)
𝛿𝑗,𝑘 = 𝑃𝑗,𝑘(Ω) = ∫_Ω min{𝑝𝑗, 𝑝𝑘} 𝑑𝑥,  (2.28)

where the equalities on the right-hand sides hold when each 𝑃𝑖 has density function 𝑝𝑖. For this definition of 𝑇, the theorem holds with conclusion

Var_𝑃(𝑔) ≤ 𝐶 (1 + 2C̄) E(𝑔, 𝑔).  (2.29)
Proof. First, note that a stationary distribution P̄ of M̄ is given by p̄(𝑗) := P̄(𝑗) = 𝑤𝑗, because 𝑤𝑗𝑇(𝑗, 𝑘) = 𝑤𝑘𝑇(𝑘, 𝑗). (Note that the reason 𝑇 has a maximum of 𝜒^2 divergences in the denominator is to make this “detailed balance” condition hold.)

Given 𝑔 ∈ 𝐿2(Ω), define ḡ ∈ 𝐿2([𝑚]) by ḡ(𝑗) = E_{𝑃𝑗}𝑔. Then decomposing the variance into the variance within the 𝑃𝑗 and between the 𝑃𝑗, and using Assumptions 2 and 3, gives
Var_𝑃(𝑔) = ∑_{𝑗=1}^{𝑚} 𝑤𝑗 ∫ (𝑔(𝑥) − E_𝑃[𝑔])^2 𝑃𝑗(𝑑𝑥)  (2.30)
= ∑_{𝑗=1}^{𝑚} 𝑤𝑗 ∫ (𝑔(𝑥) − E_{𝑃𝑗}[𝑔])^2 𝑃𝑗(𝑑𝑥) + ∑_{𝑗=1}^{𝑚} 𝑤𝑗 (E_{𝑃𝑗}𝑔 − E_𝑃𝑔)^2  (2.31)
≤ ∑_{𝑗=1}^{𝑚} 𝑤𝑗 ∫ (𝑔 − E_{𝑃𝑗}𝑔)^2 𝑃𝑗(𝑑𝑥) + ∑_{𝑗=1}^{𝑚} p̄(𝑗) (ḡ(𝑗) − E_{P̄}ḡ)^2  (2.32)
≤ 𝐶 ∑_{𝑗=1}^{𝑚} 𝑤𝑗 E𝑗(𝑔, 𝑔) + Var_{P̄}(ḡ)  (2.33)
≤ 𝐶 E(𝑔, 𝑔) + C̄ Ē(ḡ, ḡ).  (2.34)
Note E (𝑔, 𝑔) =∑𝑚
𝑗=1𝑤𝑗E𝑃𝑗(𝑔, 𝑔) follows from Assumption 1. Now
E (𝑔, 𝑔) =1
2
𝑚∑𝑗=1
𝑚∑𝑘=1
(𝑔(𝑗)− 𝑔(𝑘))2𝑤𝑗𝑇 (𝑗, 𝑘) (2.35)
≤ 1
2
𝑚∑𝑗=1
𝑚∑𝑘=1
(𝑔(𝑗)− 𝑔(𝑘))2𝑤𝑗𝑤𝑘
𝜒2(𝑃𝑗||𝑃𝑘)(2.36)
≤ 1
2
𝑚∑𝑗=1
𝑚∑𝑘=1
Var𝑃𝑗(𝑔)𝑤𝑗𝑤𝑘 by Lemma D.1.1 (2.37)
≤ 1
2
𝑚∑𝑗=1
𝑤𝑗𝐶E𝑗(𝑔, 𝑔) =𝐶
2E (𝑔, 𝑔). (2.38)
Thus
(2.34) ≤ 𝐶E (𝑔, 𝑔) +𝐶𝐶
2E (𝑔, 𝑔) (2.39)
as needed.
For Remark 2.6.2, let $\widetilde{P}_{j,k}(dx) = \frac{P_{j,k}(dx)}{\delta_{j,k}} = \frac{P_{j,k}(dx)}{P_{j,k}(\Omega)}$; this is $P_{j,k}$ normalized to be a probability distribution. Note that we can instead bound (2.36) as follows:
$$(2.36) \le \sum_{j=1}^m\sum_{k=1}^m\left[\left(\mathbb{E}_{P_j}g - \mathbb{E}_{\widetilde{P}_{j,k}}g\right)^2 + \left(\mathbb{E}_{P_k}g - \mathbb{E}_{\widetilde{P}_{j,k}}g\right)^2\right]w_jw_k\,\delta_{j,k} \tag{2.40}$$
by $(a+b)^2 \le 2(a^2+b^2)$
$$\le \sum_{j=1}^m\sum_{k=1}^m\left[\operatorname{Var}_{P_j}(g)\,\chi^2(\widetilde{P}_{j,k}\|P_j) + \operatorname{Var}_{P_k}(g)\,\chi^2(\widetilde{P}_{j,k}\|P_k)\right]w_jw_k\,\delta_{j,k} \tag{2.41}$$
by Lemma D.1.1
$$\le \sum_{j=1}^m\sum_{k=1}^m\left(\operatorname{Var}_{P_j}(g) + \operatorname{Var}_{P_k}(g)\right)w_jw_k \tag{2.42}$$
by Lemma D.1.3
$$\le \sum_{j=1}^m w_j\,2C\,\mathcal{E}_j(g,g) = 2C\,\mathcal{E}(g,g), \tag{2.43}$$
which gives (2.29).
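The within/between variance decomposition behind (2.30)–(2.31) is just the law of total variance for a mixture. The following is a minimal numerical sanity check (not part of the thesis; the discrete mixture below is an assumed toy example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture P = sum_j w_j P_j of m discrete distributions on a common state space.
m, n_states = 3, 8
w = rng.dirichlet(np.ones(m))                 # component weights w_j
P = rng.dirichlet(np.ones(n_states), size=m)  # rows: densities p_j
g = rng.normal(size=n_states)                 # a test function g

p_mix = w @ P                                 # stationary density of the mixture
mean_mix = p_mix @ g                          # E_P g

# Component means gbar(j) = E_{P_j} g and within-component variances.
gbar = P @ g
var_within = np.array([P[j] @ (g - gbar[j]) ** 2 for j in range(m)])

# Law of total variance:
#   Var_P(g) = sum_j w_j Var_{P_j}(g) + sum_j w_j (gbar(j) - E_P g)^2,
# i.e., within-variance plus the variance of gbar under pbar(j) = w_j.
var_total = p_mix @ (g - mean_mix) ** 2
var_decomposed = w @ var_within + w @ (gbar - mean_mix) ** 2
assert abs(var_total - var_decomposed) < 1e-12
```

The theorem then bounds each of the two pieces separately: the first by the component Poincaré inequalities, the second by the projected chain's Poincaré inequality.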
2.6.2 General density decomposition theorem
Theorem 2.6.3 (General density decomposition theorem). Consider a Markov process $M = (\Omega, \mathcal{L})$ with stationary distribution $P$. Let $M_i = (\Omega_i, \mathcal{L}_i)$ be Markov processes for $i\in I$ ($|I|$ finite), with $M_i$ supported on $\Omega_i\subseteq\Omega$ (possibly overlapping) and with stationary distribution $P_i$. Let the Dirichlet forms be $\mathcal{E}(g,h) = -\langle g,\mathcal{L}h\rangle_P$ and $\mathcal{E}_i(g,h) = -\langle g,\mathcal{L}_ih\rangle_{P_i}$. (The latter only depends on $g|_{\Omega_i}$ and $h|_{\Omega_i}$.)

Suppose the following hold.

1. There is a decomposition
$$\langle f,\mathcal{L}g\rangle_P = \sum_{i\in I} w_i\,\langle f,\mathcal{L}_ig\rangle_{P_i} + \sum_{i,j\in I} w_i\,\langle f,(\mathcal{T}_{i,j}-\mathrm{Id})g\rangle_{P_i} \tag{2.44}$$
$$P = \sum_{i\in I} w_i P_i, \tag{2.45}$$
where $\mathcal{T}_{i,j}$ acts on a function $\Omega_j\to\mathbb{R}$ (or $\Omega\to\mathbb{R}$, by restriction) to give a function $\Omega_i\to\mathbb{R}$, by $\mathcal{T}_{i,j}g(x) = \int_{\Omega_j} g(y)\,T_{i,j}(x,dy)$, and $T_{i,j}$ satisfies the following:⁴

- For every $x\in\Omega_i$, $T_{i,j}(x,\cdot)$ is a measure on $\Omega_j$.
- $w_iP_i(dx)T_{i,j}(x,dy) = w_jP_j(dy)T_{j,i}(y,dx)$.

⁴We are breaking up the generator into a part that describes flow within the components, and a part that describes flow between the components. $Q_{i,j}$ describes the probability flow between $\Omega_i$ and $\Omega_j$. Note that fixing the $M_i = (\Omega_i,\mathcal{L}_i)$, the decomposition into the $\mathcal{T}_{i,j}$ may be non-unique, because the $\Omega_i$'s can overlap.

2. (Mixing for each $M_i$) $M_i$ satisfies the Poincaré inequality
$$\operatorname{Var}_{P_i}(g) \le C\,\mathcal{E}_i(g,g). \tag{2.46}$$

3. (Mixing for projected chain) Let $I\times I = S_{\sim}\sqcup S_{\leftrightarrow}$ be a partition of $I\times I$ such that $(i,j)\in S_\bullet \iff (j,i)\in S_\bullet$, for $\bullet\in\{\sim,\leftrightarrow\}$.⁵ Suppose that $\overline{T} : I\times I\to\mathbb{R}_{\ge0}$ satisfies⁶
$$\forall i\in I,\quad \sum_{j:(i,j)\in S_{\sim}} \overline{T}(i,j)\,\chi^2(P_j\|P_i) \le K_1 \tag{2.47}$$
$$\forall i\in I,\quad \sum_{j:(i,j)\in S_{\leftrightarrow}} \overline{T}(i,j)\,\chi^2(\widetilde{P}_{i,j}\|P_i) \le K_2 \tag{2.48}$$
$$\forall (i,j)\in S_{\leftrightarrow},\quad \overline{T}(i,j) \le K_3\,Q_{i,j}(\Omega_i,\Omega_j) \tag{2.49}$$
$$\forall i,j\in I,\quad w_i\overline{T}(i,j) = w_j\overline{T}(j,i), \tag{2.50}$$
where $Q_{i,j}(dx,dy) = P_i(dx)T_{i,j}(x,dy)$ and $\widetilde{P}_{i,j}(dx) = \frac{P_i(dx)T_{i,j}(x,\Omega_j)}{Q_{i,j}(\Omega_i,\Omega_j)}$.

Define the projected chain $\overline{M}$ as the Markov chain on $I$ generated by $\overline{\mathcal{L}}$, where $\overline{\mathcal{L}}$ acts on $\overline{g}\in L^2(I)$ by
$$\overline{\mathcal{L}}\,\overline{g}(i) = \sum_{j\in I}(\overline{g}(j)-\overline{g}(i))\,\overline{T}(i,j). \tag{2.51}$$
Let $\overline{P}$ be the stationary distribution of $\overline{M}$ and $\overline{\mathcal{E}}(\overline{g},\overline{g}) = -\langle\overline{g},\overline{\mathcal{L}}\,\overline{g}\rangle_{\overline{P}}$; suppose $\overline{M}$ satisfies the Poincaré inequality
$$\operatorname{Var}_{\overline{P}}(\overline{g}) \le \overline{C}\,\overline{\mathcal{E}}(\overline{g},\overline{g}). \tag{2.52}$$

Then $M$ satisfies the Poincaré inequality
$$\operatorname{Var}_P(g) \le \max\left\{C\left(1 + \left(\tfrac12K_1 + 3K_2\right)\overline{C}\right),\ 3K_3\overline{C}\right\}\mathcal{E}(g,g). \tag{2.53}$$

⁵The idea will be to choose $S_{\sim}$ to contain pairs $(i,j)$ such that the distributions $P_i, P_j$ are close, and to choose $S_{\leftrightarrow}$ to contain pairs such that there is a lot of probability flow between $P_i$ and $P_j$.

⁶$\overline{T}$ represents probability flow, and will be chosen to have large value on pairs $(i,j)$ such that $P_i, P_j$ are close (from $S_{\sim}$), and on pairs $(i,j)$ such that $P_i, P_j$ have a large probability flow between them, as determined by $\widetilde{P}_{i,j}$.
To use this theorem, we would choose $\overline{T}(i,j) \lesssim \frac{1}{\chi^2(P_j\|P_i)}$ for $(i,j)\in S_{\sim}$ and $\overline{T}(i,j) \lesssim \frac{1}{\chi^2(\widetilde{P}_{i,j}\|P_i)}$ for $(i,j)\in S_{\leftrightarrow}$; i.e., choose the projected chain to have large probability flow between distributions that are close, or that have a lot of flow between them. The measure $\widetilde{P}_{i,j}$ measures how much of $P_i$ is "flowing" into $P_j$.
Remark 2.6.4. As in Remark 2.6.2, we can replace (2.47) by
$$\forall i\in I,\quad \sum_{j:(i,j)\in S_{\sim}} \overline{T}(i,j)\,\delta_{i,j} \le K_1, \quad\text{where } \delta_{i,j} = \int_\Omega\min\left\{\frac{dP_j}{dP_i}, 1\right\}P_i(dx), \tag{2.54}$$
and obtain
$$\operatorname{Var}_P(g) \le \max\left\{C\left(1 + (2K_1+3K_2)\overline{C}\right),\ 3K_3\overline{C}\right\}\mathcal{E}(g,g). \tag{2.55}$$
Proof. First, we make some preliminary observations and definitions.

1. A stationary distribution $\overline{P}$ of $\overline{M}$ is given by $\overline{p}(i) := \overline{P}(i) = w_i$, because $w_i\overline{T}(i,j) = w_j\overline{T}(j,i)$ by (2.50).

2. For $i,j\in I$, define
$$Q_{i,j}(dx,dy) := P_i(dx)\,T_{i,j}(x,dy) \tag{2.56}$$
$$\widetilde{Q}_{i,j}(dx,dy) := \frac{Q_{i,j}(dx,dy)}{Q_{i,j}(\Omega_i,\Omega_j)} \tag{2.57}$$
$$\widetilde{P}_{i,j}(dx) := \int_{y\in\Omega_j}\widetilde{Q}_{i,j}(dx,dy) = \frac{P_i(dx)\,T_{i,j}(x,\Omega_j)}{Q_{i,j}(\Omega_i,\Omega_j)} \tag{2.58}$$
and note that
$$\frac{w_iP_i(dx)T_{i,j}(x,dy)}{w_iQ_{i,j}(\Omega_i,\Omega_j)} = \frac{w_jP_j(dy)T_{j,i}(y,dx)}{w_jQ_{j,i}(\Omega_j,\Omega_i)} \tag{2.59}$$
$$\implies \widetilde{Q}_{i,j}(dx,dy) = \widetilde{Q}_{j,i}(dy,dx). \tag{2.60}$$

3. Let
$$\mathcal{E}_{\circ}(g,g) = \sum_{i\in I} w_i\,\mathcal{E}_i(g,g) = -\sum_{i\in I} w_i\,\langle g,\mathcal{L}_ig\rangle_{P_i} \tag{2.61}$$
$$\mathcal{E}_{\leftrightarrow}(g,g) = -\sum_{i,j\in I} w_i\,\langle g,(\mathcal{T}_{i,j}-\mathrm{Id})g\rangle_{P_i}. \tag{2.62}$$
By Assumption 1, $\mathcal{E}(g,g) = \mathcal{E}_{\circ}(g,g) + \mathcal{E}_{\leftrightarrow}(g,g)$.

4. We can write $\mathcal{E}_{\leftrightarrow}$ in terms of squares as follows:
$$\mathcal{E}_{\leftrightarrow}(g,g) = -\sum_{i,j\in I}\int_{\Omega_i} g(x)\,w_i\int_{\Omega_j}(g(y)-g(x))\,T_{i,j}(x,dy)\,P_i(dx) \tag{2.63}$$
$$= \sum_{i,j\in I}\left[\frac12\int_{x\in\Omega_i}\int_{y\in\Omega_j} g(x)^2\,w_iP_i(dx)T_{i,j}(x,dy) - \int_{\Omega_i}\int_{\Omega_j} g(x)g(y)\,w_iP_i(dx)T_{i,j}(x,dy) + \frac12\int_{\Omega_i}\int_{\Omega_j} g(y)^2\,w_iP_i(dx)T_{i,j}(x,dy)\right] \tag{2.64–2.65}$$
$$= \frac12\sum_{i,j\in I}\int_{\Omega_i}\int_{\Omega_j}(g(x)-g(y))^2\,w_iP_i(dx)T_{i,j}(x,dy) \tag{2.66}$$
$$= \frac12\sum_{i,j\in I}\int_{\Omega_i}\int_{\Omega_j}(g(x)-g(y))^2\,w_iQ_{i,j}(dx,dy). \tag{2.67}$$
Given $g\in L^2(\Omega)$, define $\overline{g}\in L^2(I)$ by $\overline{g}(i) = \mathbb{E}_{P_i}g$. We decompose the variance into the variance within and between the parts, and then use the Poincaré inequality on each part:
$$\operatorname{Var}_P(g) = \sum_{i\in I} w_i\int_{\Omega_i}(g(x)-\mathbb{E}_P[g(x)])^2\,P_i(dx) \tag{2.68}$$
$$= \sum_{i\in I} w_i\left[\left(\int(g(x)-\mathbb{E}_{P_i}[g(x)])^2\,P_i(dx)\right) + (\mathbb{E}_{P_i}g - \mathbb{E}_Pg)^2\right] \tag{2.69}$$
$$\le C\sum_{i\in I} w_i\,\mathcal{E}_i(g,g) + \operatorname{Var}_{\overline{P}}(\overline{g}) \tag{2.70}$$
$$\le C\,\mathcal{E}_{\circ}(g,g) + \overline{C}\,\overline{\mathcal{E}}(\overline{g},\overline{g}). \tag{2.71}$$
Now we break up $\overline{\mathcal{E}}$ as follows:
$$\overline{\mathcal{E}}(\overline{g},\overline{g}) = \frac12\underbrace{\sum_{(i,j)\in S_{\sim}}(\overline{g}(i)-\overline{g}(j))^2\,w_i\overline{T}(i,j)}_{A} + \frac12\underbrace{\sum_{(i,j)\in S_{\leftrightarrow}}(\overline{g}(i)-\overline{g}(j))^2\,w_i\overline{T}(i,j)}_{B}. \tag{2.72}$$
First, as in Theorem 2.6.1, we bound $A$ by Lemma D.1.1:
$$A \le \sum_{(i,j)\in S_{\sim}}\operatorname{Var}_{P_i}(g)\,\chi^2(P_j\|P_i)\,w_i\overline{T}(i,j) \tag{2.73}$$
$$\le K_1\sum_{i\in I} w_i\operatorname{Var}_{P_i}(g) \tag{2.74}$$
$$\le K_1C\,\mathcal{E}_{\circ}(g,g). \tag{2.75}$$
For the second term,
$$B = \sum_{(i,j)\in S_{\leftrightarrow}}\left[\int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))\,P_i(dx)P_j(dy)\right]^2 w_i\overline{T}(i,j) \tag{2.76}$$
$$= \sum_{(i,j)\in S_{\leftrightarrow}}\left[\int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))\left(P_i(dx)P_j(dy)-\widetilde{Q}_{i,j}(dx,dy)\right) + \int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))\,\widetilde{Q}_{i,j}(dx,dy)\right]^2 w_i\overline{T}(i,j). \tag{2.77–2.78}$$
Note that
$$\int_{\Omega_i}\int_{\Omega_j} g(x)\left(P_i(dx)P_j(dy)-\widetilde{Q}_{i,j}(dx,dy)\right) = \int_{\Omega_i} g(x)\left(P_i(dx) - \int_{\Omega_j}\widetilde{Q}_{i,j}(dx,dy)\right) \tag{2.79}$$
$$= \int_{\Omega_i} g(x)\left(P_i(dx) - \widetilde{P}_{i,j}(dx)\right) \tag{2.80}$$
and similarly, because $\widetilde{Q}_{i,j}(dx,dy) = \widetilde{Q}_{j,i}(dy,dx)$ by (2.60),
$$\int_{\Omega_i}\int_{\Omega_j} g(y)\left(P_i(dx)P_j(dy)-\widetilde{Q}_{i,j}(dx,dy)\right) = \int_{\Omega_j} g(y)\left(P_j(dy)-\widetilde{P}_{j,i}(dy)\right). \tag{2.81}$$
Hence
$$B = \sum_{(i,j)\in S_{\leftrightarrow}}\left[\int_{x\in\Omega_i} g(x)\left(P_i(dx)-\widetilde{P}_{i,j}(dx)\right) - \int_{y\in\Omega_j} g(y)\left(P_j(dy)-\widetilde{P}_{j,i}(dy)\right) + \int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))\,\widetilde{Q}_{i,j}(dx,dy)\right]^2 w_i\overline{T}(i,j) \tag{2.82–2.83}$$
$$\le 3\sum_{(i,j)\in S_{\leftrightarrow}}\left[\left(\int_{x\in\Omega_i} g(x)\left(P_i(dx)-\widetilde{P}_{i,j}(dx)\right)\right)^2 + \left(\int_{y\in\Omega_j} g(y)\left(P_j(dy)-\widetilde{P}_{j,i}(dy)\right)\right)^2 + \left(\int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))\,\widetilde{Q}_{i,j}(dx,dy)\right)^2\right]w_i\overline{T}(i,j) \tag{2.84–2.86}$$
by Cauchy–Schwarz (in the form $(a+b+c)^2 \le 3(a^2+b^2+c^2)$)
$$\le 3\sum_{(i,j)\in S_{\leftrightarrow}}\left[\operatorname{Var}_{P_i}(g)\,\chi^2(\widetilde{P}_{i,j}\|P_i)\,w_i\overline{T}(i,j) + \operatorname{Var}_{P_j}(g)\,\chi^2(\widetilde{P}_{j,i}\|P_j)\,w_j\overline{T}(j,i) + \int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))^2\,w_i\widetilde{Q}_{i,j}(dx,dy)\,\overline{T}(i,j)\right] \tag{2.87–2.88}$$
by Lemma D.1.1 and (2.50) ($w_i\overline{T}(i,j) = w_j\overline{T}(j,i)$)
$$\le 3\left[2K_2\sum_{i\in I} w_i\operatorname{Var}_{P_i}(g) + K_3\sum_{(i,j)\in S_{\leftrightarrow}}\int_{x\in\Omega_i}\int_{y\in\Omega_j}(g(x)-g(y))^2\,w_iQ_{i,j}(dx,dy)\right] \tag{2.90–2.92}$$
by (2.48) and (2.49)
$$\le 6K_2C\,\mathcal{E}_{\circ}(g,g) + 6K_3\,\mathcal{E}_{\leftrightarrow}(g,g), \tag{2.93}$$
where the last line follows from the Poincaré inequality on each $M_i$, and (2.67).
Then
$$\operatorname{Var}_P(g) \le (2.71) \le C\,\mathcal{E}_{\circ}(g,g) + \overline{C}\left(\frac12(A+B)\right) \tag{2.94}$$
$$\le C\,\mathcal{E}_{\circ}(g,g) + \frac{\overline{C}}{2}\left(K_1C\,\mathcal{E}_{\circ}(g,g) + 6K_2C\,\mathcal{E}_{\circ}(g,g) + 6K_3\,\mathcal{E}_{\leftrightarrow}(g,g)\right) \tag{2.95}$$
$$\le \max\left\{C\left(1+\left(\tfrac12K_1+3K_2\right)\overline{C}\right),\ 3K_3\overline{C}\right\}\mathcal{E}(g,g). \tag{2.96}$$
2.6.3 Theorem for simulated tempering
We now apply the general density decomposition theorem (Theorem 2.6.3) to simulated tempering. The state space is $[L]\times\Omega$, and we consider a decomposition of $M$ into $M_{i,j} = (\{i\}\times\Omega, \mathcal{L}_{i,j})$ for $(i,j)\in I := \{(i,j) : 1\le i\le L,\ 1\le j\le m_i\}$, where $m_i$ is the number of components in the $i$th level. (For our application, we consider the case where all the $m_i$ are equal to a fixed $m$.) Here, $M_{i,j}$ is a Markov process on the $i$th level. Note that an index in $I$ is an ordered pair $(i,j)$, not to be confused with the $i,j$ in Theorem 2.6.3.

To apply the general density decomposition theorem, we first need to decompose the generator into generators of component parts $\mathcal{L}_{(i,j)}$ and probability flow between parts, $\mathcal{L}_{(i,j),(i',j')} = \mathcal{T}_{(i,j),(i',j')} - \mathrm{Id}$. In simulated tempering, the $\mathcal{L}_{(i,j)}$ generate Markov processes within the levels, and the $\mathcal{L}_{(i,j),(i',j')}$ cover the transitions between adjacent levels. Next, we have to partition the index set $I\times I$ into $S_{\sim}$, which will contain pairs such that the corresponding distributions are close, and $S_{\leftrightarrow}$, which will contain pairs with a lot of probability flow between them. Here $S_{\sim}$ will contain pairs within the same level (just the highest temperature, in our application), while $S_{\leftrightarrow}$ will contain pairs between adjacent levels.

We will see that the quantity $Q_{(i,j),(i'=i\pm1,j')}(\{i\}\times\Omega, \{i'\}\times\Omega)$ is simply the "overlap" between $P_{(i,j)}$ and $P_{(i',j')}$, and the measure $\widetilde{P}_{(i,j),(i',j')}$ is roughly $\min\{p_{(i,j)}, p_{(i',j')}\}$, appropriately normalized. $\overline{T}((i,j),(i',j'))$ is defined in terms of these quantities (see (2.100)), which means that the projected process has large probability flow between nodes $(i,j)$ and $(i',j')$, $i'\in\{i-1,i,i+1\}$, in the same and adjacent levels where the probability distributions are close.

One can apply Theorem 2.6.3 in different ways, as there are different ways to choose $\overline{T}$. In our case, we define it to include only connections at the highest temperature and, for the same component, between adjacent levels; the adjacency graph of $\overline{T}$ thus contains a complete graph at the highest temperature, and "chains" going down the levels. For a different simulated tempering problem, one could define it differently, for example, with a more complex tree structure.

For ease of notation, we will sometimes drop the parentheses in the subscripts, e.g., $p_{i,j} = p_{(i,j)}$.
Theorem 2.6.5 (Density decomposition theorem for simulated tempering). Consider the simulated tempering process $M$ built from Markov processes $M_i = (\Omega, \mathcal{L}_i)$, $1\le i\le L$. Let the stationary distribution of $M_i$ be $P_i$, the relative probabilities be $r_i$, and the rate be $\lambda$, and let $\mathcal{E}_i(g,h) = -\langle g,\mathcal{L}_ih\rangle_{P_i}$. Assume the probability measures have density functions with respect to some reference measure $dx$, represented by the corresponding lower-case letters: $P_i(dx) = p_i(x)\,dx$.

Represent a function $g$ on $[L]\times\Omega$ as $(g_1,\ldots,g_L)$. Let $P$ be the stationary distribution on $[L]\times\Omega$, $\mathcal{L}$ be the generator, and $\mathcal{E}(g,h) = -\langle g,\mathcal{L}h\rangle_P$ be the Dirichlet form.

Suppose the following hold.

1. There is a decomposition
$$\langle f,\mathcal{L}_ig\rangle_{P_i} = \sum_{j=1}^{m_i} w_{i,j}\,\langle f,\mathcal{L}_{(i,j)}g\rangle_{P_{(i,j)}} \quad\text{for } f,g:\Omega\to\mathbb{R} \tag{2.97}$$
$$P_i = \sum_{j=1}^{m_i} w_{i,j}\,P_{(i,j)}, \tag{2.98}$$
where $\mathcal{L}_{(i,j)}$ is the generator of some Markov chain $M_{i,j}$ on $\{i\}\times\Omega$ with stationary measure $P_{(i,j)}$.

2. (Mixing for each $M_{i,j}$) $M_{i,j}$ satisfies the Poincaré inequality
$$\operatorname{Var}_{P_{(i,j)}}(g) \le C\,\mathcal{E}_{i,j}(g,g), \tag{2.99}$$
where $\mathcal{E}_{i,j}(g,g) = -\langle g,\mathcal{L}_{i,j}g\rangle_{P_{(i,j)}}$.

3. (Mixing for projected chain) Define
$$\overline{T}((i,j),(i',j')) = \begin{cases}\dfrac{w_{1,j'}}{\chi^2_{\max}(P_{(1,j)}\|P_{(1,j')})}, & i=i'=1\\ K\,\delta_{(i,j),(i',j)}, & i'=i\pm1,\ j=j'\\ 0, & \text{else,}\end{cases} \tag{2.100}$$
where $\chi^2_{\max}(P\|Q) := \max\{\chi^2(P\|Q), \chi^2(Q\|P)\}$, $K>0$ is any constant, and
$$\delta_{(i,j),(i',j')} = \int_\Omega\min\left\{\frac{r_{i'}w_{i',j'}\,p_{(i',j')}(x)}{r_iw_{i,j}},\ p_{(i,j)}(x)\right\}dx. \tag{2.101}$$
Define the projected chain $\overline{M}$ as the Markov chain on $I$ generated by $\overline{\mathcal{L}} = \overline{\mathcal{T}}-\mathrm{Id}$, so that $\overline{\mathcal{L}}$ acts on $\overline{g}\in L^2(I)$ by
$$\overline{\mathcal{L}}\,\overline{g}(i,j) = \sum_{i'=1}^{L}\sum_{j'=1}^{m_{i'}}[\overline{g}(i',j')-\overline{g}(i,j)]\,\overline{T}((i,j),(i',j')). \tag{2.102}$$
Let $\overline{P}$ be the stationary distribution of $\overline{M}$; suppose $\overline{M}$ satisfies the Poincaré inequality
$$\operatorname{Var}_{\overline{P}}(\overline{g}) \le \overline{C}\,\overline{\mathcal{E}}(\overline{g},\overline{g}). \tag{2.103}$$

Then $M$ satisfies the Poincaré inequality
$$\operatorname{Var}_P(g) \le \max\left\{C\left(1+\left(\frac12+6K\right)\overline{C}\right),\ \frac{6K\overline{C}}{\lambda}\right\}\mathcal{E}(g,g). \tag{2.104}$$
Proof. We first relate $M$ to the chain $M'$ on $[L]\times\Omega$ defined as follows: $M'$ has transition probability from level $i$ to level $i'=i\pm1$ given by $\sum_{j=1}^m r_iw_{i,j}\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},\,1\right\}$ rather than $r_i\min\left\{\frac{r_{i'}p_{i'}(x)}{r_ip_i(x)},\,1\right\}$. We show that $\mathcal{E}'(g,g)\le\mathcal{E}(g,g)$ below; this basically follows from the fact that the probability flow between any two distinct points in $M'$ is at most the probability flow in $M$. More precisely (letting $p\mathcal{L}$ denote the functional defined by $(p\mathcal{L}f)(x) = p(x)(\mathcal{L}f)(x)$),
$$p\mathcal{L}'g = \sum_{i=1}^L r_ip_i\mathcal{L}_ig_i + \frac{\lambda}{2}\sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\sum_{j=1}^m r_iw_{i,j}\,p_{(i,j)}(x)\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},1\right\}(g_{i'}-g_i)(x) \tag{2.105}$$
$$\mathcal{E}'(g,g) = -\langle g,\mathcal{L}'g\rangle_P \tag{2.106}$$
$$= \sum_{i=1}^L r_i\,\mathcal{E}_i(g_i,g_i) + \frac{\lambda}{2}\sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\int_\Omega\sum_{j=1}^m r_iw_{i,j}\,p_{(i,j)}(x)\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},1\right\}\left(g_i(x)^2 - g_i(x)g_{i'}(x)\right)dx \tag{2.107–2.108}$$
$$= \sum_{i=1}^L r_i\,\mathcal{E}_i(g_i,g_i) + \frac{\lambda}{4}\sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\int_\Omega\sum_{j=1}^m r_iw_{i,j}\,p_{(i,j)}(x)\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},1\right\}(g_i(x)-g_{i'}(x))^2\,dx \tag{2.109–2.110}$$
$$= \sum_{i=1}^L r_i\,\mathcal{E}_i(g_i,g_i) + \frac{\lambda}{4}\sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\int_\Omega\sum_{j=1}^m\min\left\{r_{i'}w_{i',j}\,p_{(i',j)}(x),\ r_iw_{i,j}\,p_{(i,j)}(x)\right\}(g_i(x)-g_{i'}(x))^2\,dx \tag{2.111}$$
$$\le \sum_{i=1}^L r_i\,\mathcal{E}_i(g_i,g_i) + \frac{\lambda}{4}\sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\int_\Omega\min\{r_{i'}p_{i'}(x),\ r_ip_i(x)\}(g_i(x)-g_{i'}(x))^2\,dx = \mathcal{E}(g,g). \tag{2.112}$$

Thus it suffices to prove a Poincaré inequality for $M'$. We will apply Theorem 2.6.3 with
$$T_{(i,j),(i',j')}((i,x),(i',dy)) = \begin{cases}\frac{\lambda}{2}\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},1\right\}\delta_x(dy), & j=j',\ i'=i\pm1\\ 0, & \text{else.}\end{cases} \tag{2.113}$$

First we calculate $\widetilde{P}_{(i,j),(i',j)}$. We have
$$Q_{(i,j),(i',j)}(\{i\}\times\Omega, \{i'\}\times\Omega) = \frac{\lambda}{2}\int_\Omega\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}},\ p_{(i,j)}(x)\right\}dx = \frac{\lambda}{2}\,\delta_{(i,j),(i',j)}, \tag{2.114}$$
so
$$\widetilde{p}_{(i,j),(i',j)}(x) = \frac{p_{(i,j)}(x)\,T_{(i,j),(i',j)}(x,\{i'\}\times\Omega)}{Q_{(i,j),(i',j)}(\{i\}\times\Omega,\{i'\}\times\Omega)} = \frac{p_{(i,j)}(x)\,\frac{\lambda}{2}\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}\,p_{(i,j)}(x)},1\right\}}{\frac{\lambda}{2}\,\delta_{(i,j),(i',j)}} = \frac{1}{\delta_{(i,j),(i',j)}}\min\left\{\frac{r_{i'}w_{i',j}\,p_{(i',j)}(x)}{r_iw_{i,j}},\ p_{(i,j)}(x)\right\}. \tag{2.115–2.116}$$
We check the three assumptions in Theorem 2.6.3.

1. From Assumption 1, (2.105), and (2.113),
$$p\mathcal{L}' = \sum_{i=1}^L\sum_{j=1}^m r_iw_{i,j}\,p_{(i,j)}\,\mathcal{L}_{i,j} + \sum_{\substack{1\le i,i'\le L\\ i'=i\pm1}}\sum_{j=1}^m r_iw_{i,j}\,p_{(i,j)}\,(\mathcal{T}_{(i,j),(i',j)}-\mathrm{Id}). \tag{2.117}$$

2. This follows immediately from Assumption 2.

3. Let $S_{\sim}$ consist of all pairs $((1,j),(1,j'))$ and $S_{\leftrightarrow}$ consist of all pairs $((i,j),(i'=i\pm1,j))$. (The other pairs $((i,j),(i',j'))$ satisfy $\overline{T}((i,j),(i',j')) = 0$, so they do not matter.) We check equations (2.47)–(2.50).

(2.47) By (2.100),
$$\sum_{j'=1}^m \overline{T}((1,j),(1,j'))\,\chi^2(P_{(1,j')}\|P_{(1,j)}) = \sum_{j'=1}^m \frac{w_{1,j'}}{\chi^2_{\max}(P_{(1,j)}\|P_{(1,j')})}\,\chi^2(P_{(1,j')}\|P_{(1,j)}) \le 1, \tag{2.118–2.119}$$
so (2.47) is satisfied with $K_1 = 1$.

(2.48) We apply Lemma D.1.3 with $P = P_{(i,j)}$ and $Q$ the measure with density $\frac{r_{i'}w_{i',j}\,p_{(i',j)}}{r_iw_{i,j}}$. Noting that $\widetilde{P}_{(i,j),(i',j)} = \frac{1}{\delta_{(i,j),(i',j)}}\min\{Q, P_{(i,j)}\}$ by (2.116), we obtain
$$\chi^2\left(\widetilde{P}_{(i,j),(i',j)}\,\middle\|\,P_{(i,j)}\right) \le \frac{1}{\delta_{(i,j),(i',j)}}. \tag{2.120}$$
By (2.100) and (2.120),
$$\sum_{i'=i\pm1}\overline{T}((i,j),(i',j))\,\chi^2(\widetilde{P}_{(i,j),(i',j)}\|P_{(i,j)}) \le \sum_{i'=i\pm1}K\,\delta_{(i,j),(i',j)}\cdot\frac{1}{\delta_{(i,j),(i',j)}} \le 2K, \tag{2.121–2.122}$$
so (2.48) is satisfied with $K_2 = 2K$.

(2.49) By (2.100) and (2.114),
$$\overline{T}((i,j),(i',j)) = K\,\delta_{(i,j),(i',j)} = \frac{2K}{\lambda}\cdot\frac{\lambda}{2}\,\delta_{(i,j),(i',j)} = \frac{2K}{\lambda}\,Q_{(i,j),(i',j)}(\{i\}\times\Omega,\{i'\}\times\Omega), \tag{2.123}$$
so (2.49) is satisfied with $K_3 = \frac{2K}{\lambda}$.

(2.50) We have that
$$r_1w_{1,j}\,\overline{T}((1,j),(1,j')) = \frac{r_1w_{1,j}w_{1,j'}}{\chi^2_{\max}(P_{(1,j)}\|P_{(1,j')})} = r_1w_{1,j'}\,\overline{T}((1,j'),(1,j)) \tag{2.124}$$
and for $i'=i\pm1$,
$$r_iw_{i,j}\,\overline{T}((i,j),(i',j)) = K\,r_iw_{i,j}\,\delta_{(i,j),(i',j)} = K\,r_{i'}w_{i',j}\,\delta_{(i',j),(i,j)} = r_{i'}w_{i',j}\,\overline{T}((i',j),(i,j)). \tag{2.125–2.126}$$

Hence the conclusion of Theorem 2.6.3 holds with $K_1 = 1$, $K_2 = 2K$, and $K_3 = \frac{2K}{\lambda}$.
2.7 Simulated tempering for Gaussians with equal variance

2.7.1 Mixtures of Gaussians all the way down
Theorem 2.7.1. Let $M$ be the continuous simulated tempering chain for the distributions with density functions
$$p_i(x) \propto \sum_{j=1}^m w_j\,e^{-\frac{\beta_i\|x-\mu_j\|^2}{2\sigma^2}} \tag{2.127}$$
with rate $\Omega\left(\frac{1}{D^2}\right)$, relative probabilities $r_i$, and temperatures $0<\beta_1<\cdots<\beta_L=1$, where
$$D = \max\left\{\max_j\|\mu_j\|,\ \sigma\right\} \tag{2.128}$$
$$\beta_1 = \Theta\left(\frac{\sigma^2}{D^2}\right) \tag{2.129}$$
$$\frac{\beta_{i+1}}{\beta_i} \le 1+\frac1d \tag{2.130}$$
$$L = \Theta\left(d\ln\left(\frac D\sigma\right)+1\right) \tag{2.131}$$
$$r = \frac{\min_i r_i}{\max_i r_i}. \tag{2.132}$$
Then $M$ satisfies the Poincaré inequality
$$\operatorname{Var}(g) \le O\left(\frac{L^2D^2}{r^2}\right)\mathcal{E}(g,g) = O\left(\frac{\left(d\ln\left(\frac D\sigma\right)+1\right)^2D^2}{r^2}\right)\mathcal{E}(g,g). \tag{2.133}$$
Proof. Note that including $\sigma$ in the definition (2.128) of $D$ ensures $\beta_1 = O(1)$. We check all the conditions of Theorem 2.6.5; we let $K=1$.

1. Consider the decomposition where
$$p_{i,j}(x) \propto \exp\left(-\frac{\beta_i\|x-\mu_j\|^2}{2\sigma^2}\right), \tag{2.134}$$
$w_{i,j} = w_j$, and $M_{i,j}$ is the Langevin chain on $p_{i,j}$, so that $\mathcal{E}_{i,j}(g_i,g_i) = \int_{\mathbb{R}^d}\|\nabla g_i\|^2\,p_{i,j}\,dx$. We check (2.97)–(2.98):
$$\mathcal{E}_i(g_i,g_i) = \int_{\mathbb{R}^d}\|\nabla g_i\|^2\,p_i\,dx = \int_{\mathbb{R}^d}\|\nabla g_i\|^2\sum_{j=1}^m w_j\,p_{i,j}\,dx = \sum_{j=1}^m w_j\,\mathcal{E}_{i,j}(g_i,g_i). \tag{2.135}$$

2. By Theorem A.2.1 and the fact that $\beta_1 = \Omega\left(\frac{\sigma^2}{D^2}\right)$, $\mathcal{E}_{i,j}$ satisfies the Poincaré inequality
$$\operatorname{Var}_{p_{i,j}}(g_i) \le \frac{\sigma^2}{\beta_i}\,\mathcal{E}_{i,j}(g_i,g_i) = O(D^2)\,\mathcal{E}_{i,j}(g_i,g_i). \tag{2.136}$$
3. To prove a Poincaré inequality for the projected chain, we use the method of canonical paths, Theorem A.1.3. Consider the graph $G$ on $\bigcup_{i=1}^L\{i\}\times[m_i]$ that is the complete graph on the slice $i=1$, whose only other edges are the vertical edges $\{(i,j),(i\pm1,j)\}$. $\overline{T}$ is nonzero exactly on the edges of $G$. For vertices $x=(i,j)$ and $y=(i',j')$, define the canonical path $\gamma_{x,y}$ as follows.

(a) If $j=j'$, without loss of generality $i<i'$; define the path to be $(i,j),(i+1,j),\ldots,(i',j)$.

(b) Else, define the path to be $(i,j),(i-1,j),\ldots,(1,j),(1,j'),(2,j'),\ldots,(i',j')$.

We calculate the transition probabilities (2.100), which are given in terms of the $\chi^2$ distances $\chi^2_{\max}(P_{1,j}\|P_{1,j'})$ and the overlaps $\delta_{(i,j),(i',j)}$.

(a) Bounding $\chi^2(P_{1,j}\|P_{1,j'})$: by Lemma D.2.1 with $\Sigma_1=\Sigma_2=\frac{\sigma^2}{\beta_1}I_d$,
$$\chi^2(P_{1,j}\|P_{1,j'}) = \chi^2\left(N\left(\mu_j,\tfrac{\sigma^2}{\beta_1}I_d\right)\middle\|N\left(\mu_{j'},\tfrac{\sigma^2}{\beta_1}I_d\right)\right) = e^{\beta_1\|\mu_j-\mu_{j'}\|^2/\sigma^2}-1 \le \frac14 \tag{2.137–2.138}$$
when $\beta_1 \le c\,\frac{\sigma^2}{D^2}$ for a small enough constant $c$ (using $\|\mu_j-\mu_{j'}\|\le 2D$).

(b) Bounding $\delta_{(i,j),(i',j)}$: suppose that $\frac{\beta_{i+1}}{\beta_i} = 1+\varepsilon$ where $\varepsilon\le\frac1d$. Then applying Lemma D.2.1 to $\Sigma_1=\frac{\sigma^2}{\beta_i}I_d$ and $\Sigma_2=\frac{\sigma^2}{\beta_{i+1}}I_d$,
$$\chi^2(P_{i+1,j}\|P_{i,j}) = \chi^2\left(N\left(\mu_j,\tfrac{\sigma^2}{\beta_{i+1}}I_d\right)\middle\|N\left(\mu_j,\tfrac{\sigma^2}{\beta_i}I_d\right)\right) \tag{2.139}$$
$$= \left(\frac{\beta_{i+1}^2}{\beta_i}\right)^{\frac d2}(2\beta_{i+1}-\beta_i)^{-\frac d2}-1 \tag{2.140}$$
$$= \left(\frac{\beta_{i+1}}{\beta_i}\right)^{\frac d2}\left(2-\frac{\beta_i}{\beta_{i+1}}\right)^{-\frac d2}-1 \tag{2.141}$$
$$= O\left((1+d\varepsilon)\left(2-\frac{1}{1+\varepsilon}\right)^{-\frac d2}-1\right) = O(d\varepsilon), \tag{2.142}$$
so $\chi^2(P_{i+1,j}\|P_{i,j}) \le \frac14$ when $\varepsilon\le\frac{c_1}{d}$ for a small enough constant $c_1$. Similarly, $\chi^2(P_{i-1,j}\|P_{i,j}) \le \frac14$ for $\varepsilon\le\frac{c_1}{d}$.

Note that for probability distributions $P_1, P_2$ with density functions $p_1, p_2$,
$$\left(\int_\Omega(p_1-\min\{p_1,p_2\})\,dx\right)^2 \le \int_\Omega\frac{(p_1-\min\{p_1,p_2\})^2}{p_1}\,dx \le \chi^2(P_2\|P_1) \tag{2.143}$$
$$\implies \int_\Omega\min\{p_1,p_2\}\,dx \ge 1-\sqrt{\chi^2(P_2\|P_1)}. \tag{2.144}$$
Moreover, we have
$$\delta_{(i,j),(i\pm1,j)} = \int\min\left\{\frac{r_{i'}w_j\,p_{i',j}(x)}{r_iw_j},\ p_{i,j}(x)\right\}dx \ge r\int\min\{p_{i',j}(x),\ p_{i,j}(x)\}\,dx. \tag{2.145}$$
Hence $\delta_{(i,j),(i\pm1,j)} \ge \frac r2$.
Note that every canonical path has length $|\gamma_{x,y}| \le 2L-1$. Consider the two kinds of edges in $G$.

(a) $z=(1,j)$, $w=(1,k)$. We have
$$\frac{\sum_{\gamma_{x,y}\ni\{(1,j),(1,k)\}}|\gamma_{x,y}|\,\overline{p}(x)\overline{p}(y)}{\overline{p}((1,j))\,\overline{T}((1,j),(1,k))} \le \frac{(2L-1)\,\overline{P}([L]\times\{j\})\,\overline{P}([L]\times\{k\})}{\overline{p}((1,j))\,\overline{T}((1,j),(1,k))} \tag{2.146}$$
because the paths going through $zw$ are exactly those between $[L]\times\{j\}$ and $[L]\times\{k\}$. Now note
$$\frac{\overline{P}([L]\times\{j\})}{\overline{p}((1,j))} \le \frac Lr \tag{2.147}$$
$$\overline{P}([L]\times\{k\}) = w_k \tag{2.148}$$
$$\overline{T}((1,j),(1,k)) = \frac{w_k}{\chi^2_{\max}(P_{1,j}\|P_{1,k})} = \Omega(w_k) \tag{2.149}$$
by (2.138). Thus $(2.146) = O\left(\frac{L^2}{r}\right)$.

(b) $z=(i,j)$, $w=(i-1,j)$. We have
$$\frac{\sum_{\gamma_{x,y}\ni\{(i,j),(i-1,j)\}}|\gamma_{x,y}|\,\overline{p}(x)\overline{p}(y)}{\overline{p}((i,j))\,\overline{T}((i,j),(i-1,j))} \le \frac{(2L-1)\,\overline{P}(S)\,\overline{P}(S^c)}{\overline{p}((i,j))\,\overline{T}((i,j),(i-1,j))} \tag{2.150}$$
where $S = \{i,\ldots,L\}\times\{j\}$. This follows because cutting the edge $zw$ splits the graph into two connected components, one of which is $S$; the paths which go through $zw$ are exactly those between $x,y$ where one of $x,y$ is in $S$ and the other is not. Now note
$$\frac{\overline{P}(S)}{\overline{p}((i,j))} = \frac{\overline{P}(\{i,\ldots,L\}\times\{j\})}{\overline{p}((i,j))} \le \frac Lr \tag{2.151}$$
$$\overline{P}(S^c) \le 1 \tag{2.152}$$
$$\overline{T}((i,j),(i-1,j)) = \delta_{(i,j),(i-1,j)} = \Omega(r) \tag{2.153}$$
by (2.100) (with $K=1$) and the inequality $\delta_{(i,j),(i-1,j)} \ge \frac r2$. Hence $(2.150) = O\left(\frac{L^2}{r^2}\right)$.

By Theorem A.1.3, the projected chain satisfies a Poincaré inequality with constant $\overline{C} = O\left(\frac{L^2}{r^2}\right)$.

Thus by Theorem 2.6.5, the simulated tempering chain satisfies a Poincaré inequality with constant
$$O\left(\max\left\{D^2\left(1+\frac{L^2}{r^2}\right),\ \frac{L^2}{r^2\lambda}\right\}\right). \tag{2.154}$$
Taking $\lambda = \frac{1}{D^2}$ makes this $O\left(\frac{D^2L^2}{r^2}\right)$.
Remark 2.7.2. Note there is no dependence on either $w_{\min}$ or the number of components.

If $p(x) \propto \sum_{j=1}^m w_j\,e^{-\frac{\|x-\mu_j\|^2}{2\sigma^2}}$ and we have access to $\nabla\ln(p * N(0,\tau I))$ for any $\tau$, then we can sample from $p$ efficiently, no matter how many components there are. In fact, passing to the continuous limit, we can sample from any $p$ of the form $p = w * N(0,\sigma^2I_d)$ where $\|w\|_1 = 1$ and $\operatorname{Supp}(w) \subseteq B_D$.

In this way, Theorem 2.7.1 says that evolution of $p$ under the heat kernel is the most "natural" way to do simulated tempering. We don't have access to $p * N(0,\tau I)$, but we will show that $p^\beta$ approximates it well (within a factor of $\frac{1}{w_{\min}}$).

Entropy-SGD [Cha+16] attempts to estimate $\nabla\ln(p * N(0,\tau I))$ for use in a temperature-based algorithm; this remark provides some heuristic justification for why this is a natural choice.
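To make the object of Theorem 2.7.1 concrete, here is a minimal, illustrative discretization of a simulated tempering Langevin chain for a two-mode Gaussian mixture in one dimension. This is a hedged sketch, not the thesis's Algorithm 1: it assumes uniform relative probabilities $r_i$ (so no partition function estimates appear in the acceptance ratio), clamps level proposals at the boundary, and uses a plain unadjusted Langevin step.

```python
import numpy as np

rng = np.random.default_rng(1)

mus = np.array([-4.0, 4.0])          # well-separated modes
w = np.array([0.5, 0.5])
sigma = 1.0
betas = np.geomspace(0.05, 1.0, 5)   # temperatures beta_1 < ... < beta_L = 1
eta = 0.05                           # Langevin step size
swap_prob = 0.05                     # chance of a level-swap proposal per step

def unnorm(x, beta):
    # Unnormalized density p_beta(x) = sum_j w_j exp(-beta (x-mu_j)^2 / (2 sigma^2)).
    return (w * np.exp(-beta * (x - mus) ** 2 / (2 * sigma**2))).sum()

def grad_log_p(x, beta):
    un = w * np.exp(-beta * (x - mus) ** 2 / (2 * sigma**2))
    return (un / un.sum()) @ (-(beta / sigma**2) * (x - mus))

i, x = 0, 0.0
samples = []
for _ in range(40000):
    # Type 1 move: discretized Langevin step at the current temperature.
    x = x + eta * grad_log_p(x, betas[i]) + np.sqrt(2 * eta) * rng.normal()
    # Type 2 move: propose an adjacent level, accept with a Metropolis ratio.
    if rng.random() < swap_prob:
        j = int(np.clip(i + rng.choice([-1, 1]), 0, len(betas) - 1))
        if rng.random() < min(unnorm(x, betas[j]) / unnorm(x, betas[i]), 1.0):
            i = j
    if i == len(betas) - 1:          # record x only when at beta = 1 (the target)
        samples.append(x)

samples = np.array(samples)
# At beta = 1 alone the barrier between modes is ~8 nats; the hot levels are
# what let the recorded samples reach both modes.
```

A single-temperature Langevin chain started at one mode would essentially never cross to the other within this budget, which is the failure mode the tempering ladder is designed to avoid.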
2.7.2 Comparing to the actual chain

The following lemma shows that changing the temperature is approximately the same as changing the variance of the Gaussian. We state it more generally, for arbitrary mixtures of distributions of the form $e^{-f_i(x)}$.

Lemma 2.7.3 (Approximately scaling the temperature). Let $p_i(x) = e^{-f_i(x)}$ be probability distributions on $\Omega$ such that for all $\beta>0$, $\int_\Omega e^{-\beta f_i(x)}\,dx < \infty$. Let
$$p(x) = \sum_{i=1}^n w_ip_i(x) \tag{2.155}$$
$$f(x) = -\ln p(x), \tag{2.156}$$
where $w_1,\ldots,w_n>0$ and $\sum_{i=1}^n w_i = 1$. Let $w_{\min} = \min_{1\le i\le n} w_i$.

Define the distribution at inverse temperature $\beta$ to be $p^\beta(x)$, where
$$g^\beta(x) = e^{-\beta f(x)} \tag{2.157}$$
$$Z_\beta = \int_\Omega e^{-\beta f(x)}\,dx \tag{2.158}$$
$$p^\beta(x) = \frac{g^\beta(x)}{Z_\beta}. \tag{2.159}$$
Define the distribution $\widetilde{p}^\beta(x)$ by
$$\widetilde{g}^\beta(x) = \sum_{i=1}^n w_ie^{-\beta f_i(x)} \tag{2.160}$$
$$\widetilde{Z}_\beta = \int_\Omega\sum_{i=1}^n w_ie^{-\beta f_i(x)}\,dx \tag{2.161}$$
$$\widetilde{p}^\beta(x) = \frac{\widetilde{g}^\beta(x)}{\widetilde{Z}_\beta}. \tag{2.162}$$
Then for $0\le\beta\le1$ and all $x$,
$$g^\beta(x) \in \left[1, \frac{1}{w_{\min}}\right]\widetilde{g}^\beta(x) \tag{2.163}$$
$$p^\beta(x) \in \left[1, \frac{1}{w_{\min}}\right]\frac{\widetilde{Z}_\beta}{Z_\beta}\,\widetilde{p}^\beta(x) \subseteq \left[w_{\min}, \frac{1}{w_{\min}}\right]\widetilde{p}^\beta(x). \tag{2.164}$$
Proof. By the power mean inequality, for $0\le\beta\le1$,
$$g^\beta(x) = \left(\sum_{i=1}^n w_ie^{-f_i(x)}\right)^\beta \ge \sum_{i=1}^n w_ie^{-\beta f_i(x)} = \widetilde{g}^\beta(x). \tag{2.165–2.166}$$
On the other hand, given $x$, setting $j = \operatorname{argmin}_i f_i(x)$,
$$g^\beta(x) = \left(\sum_{i=1}^n w_ie^{-f_i(x)}\right)^\beta \le \left(e^{-f_j(x)}\right)^\beta \le \frac{1}{w_{\min}}\sum_{i=1}^n w_ie^{-\beta f_i(x)} = \frac{1}{w_{\min}}\,\widetilde{g}^\beta(x). \tag{2.167–2.169}$$
This gives (2.163). Integrating (2.163) over $\Omega$ implies $\frac{\widetilde{Z}_\beta}{Z_\beta}\in[w_{\min}, 1]$, which gives (2.164).
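The two power-mean bounds above can be checked numerically at random points; the following sketch (an assumed toy setup, not from the thesis) verifies the sandwich (2.163), $\widetilde{g}^\beta(x) \le g^\beta(x) \le \widetilde{g}^\beta(x)/w_{\min}$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Check (2.163): for p = sum_i w_i e^{-f_i} and 0 <= beta <= 1,
#   g_tilde^beta(x) <= (e^{-f(x)})^beta <= (1/w_min) g_tilde^beta(x).
n = 4
w = rng.dirichlet(np.ones(n))
w_min = w.min()
for _ in range(1000):
    f_vals = rng.uniform(-5, 5, size=n)        # values f_i(x) at an arbitrary x
    for beta in [0.0, 0.3, 0.7, 1.0]:
        g_beta = (w @ np.exp(-f_vals)) ** beta      # g^beta(x) = e^{-beta f(x)}
        g_tilde = w @ np.exp(-beta * f_vals)        # g_tilde^beta(x)
        assert g_tilde <= g_beta * (1 + 1e-12)      # power mean (lower side)
        assert g_beta <= g_tilde / w_min * (1 + 1e-12)  # argmin bound (upper side)
```

Both inequalities are tight at $\beta\in\{0,1\}$ and when one component dominates, which is why the $1/w_{\min}$ loss is unavoidable in general.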
Lemma 2.7.4. Let $P_1, P_2$ be probability measures on $\mathbb{R}^d$ with density functions $p_1\propto e^{-f_1}$, $p_2\propto e^{-f_2}$ satisfying $\|f_1-f_2\|_\infty \le \frac{\Delta}{2}$. Then
$$\frac{\mathcal{E}_{P_1}(g,g)}{\|g\|^2_{L^2(P_1)}} \ge e^{-2\Delta}\,\frac{\mathcal{E}_{P_2}(g,g)}{\|g\|^2_{L^2(P_2)}}. \tag{2.170}$$

Proof. The ratio between $p_1$ and $p_2$ is at most $e^\Delta$, so
$$\frac{\int_{\mathbb{R}^d}\|\nabla g(x)\|^2\,p_1(x)\,dx}{\int_{\mathbb{R}^d}|g(x)|^2\,p_1(x)\,dx} \ge \frac{e^{-\Delta}\int_{\mathbb{R}^d}\|\nabla g(x)\|^2\,p_2(x)\,dx}{e^{\Delta}\int_{\mathbb{R}^d}|g(x)|^2\,p_2(x)\,dx}. \tag{2.171}$$
Lemma 2.7.5. Let $M$ and $\widetilde{M}$ be two continuous simulated tempering Langevin chains with functions $f_i$, $\widetilde{f}_i$, respectively, for $i\in[L]$, with rate $\lambda$, and with relative probabilities $r_i$. Let their Dirichlet forms be $\mathcal{E}$ and $\widetilde{\mathcal{E}}$ and their stationary measures be $P$ and $\widetilde{P}$.

Suppose that $\left\|f_i(x)-\widetilde{f}_i(x)\right\|_\infty \le \frac{\Delta}{2}$. Then⁷
$$\frac{\mathcal{E}(g,g)}{\operatorname{Var}_P(g)} \ge e^{-3\Delta}\,\frac{\widetilde{\mathcal{E}}(g,g)}{\operatorname{Var}_{\widetilde{P}}(g)}. \tag{2.172}$$

⁷If adjacent temperatures are close enough, then $A$ and $B$ in the proof are close, so $[\min\{A,B\}, \max\{Ae^\Delta, Be^\Delta\}] \subseteq [C, C\cdot O(e^\Delta)]$ for some $C$, improving the factor to $\Omega(e^{-2\Delta})$. A more careful analysis would likely improve the final dependence on $w_{\min}$ in Theorem 2.4.2 from $\frac{1}{w_{\min}^6}$ to $\frac{1}{w_{\min}^4}$.

Proof. By Lemma 2.7.4, for each $i$,
$$\frac{\mathcal{E}_i(g_i,g_i)}{\operatorname{Var}_{P_i}(g_i)} \ge e^{-2\Delta}\,\frac{\widetilde{\mathcal{E}}_i(g_i,g_i)}{\operatorname{Var}_{\widetilde{P}_i}(g_i)} \tag{2.173}$$
$$\implies \frac{\sum_{i=1}^L r_i\,\mathcal{E}_i(g_i,g_i)}{\operatorname{Var}_P(g)} \ge e^{-2\Delta}\,\frac{\sum_{i=1}^L r_i\,\widetilde{\mathcal{E}}_i(g_i,g_i)}{\operatorname{Var}_{\widetilde{P}}(g)}. \tag{2.174}$$
By Lemma 2.7.3, we have $\frac{p_i}{\widetilde{p}_i}\in[A, Ae^\Delta]$ and $\frac{p_j}{\widetilde{p}_j}\in[B, Be^\Delta]$ for some $A, B \ge e^{-\Delta}$, so $\frac{\min\{r_ip_i, r_jp_j\}}{\min\{r_i\widetilde{p}_i, r_j\widetilde{p}_j\}} \in [\min\{A,B\}, \max\{Ae^\Delta, Be^\Delta\}] \subseteq [e^{-\Delta}, e^\Delta]$. Hence,
$$\frac{\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_ip_i, r_jp_j\}\,dx}{\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_i\widetilde{p}_i, r_j\widetilde{p}_j\}\,dx} \in [e^{-\Delta}, e^\Delta] \tag{2.175}$$
$$\frac{\operatorname{Var}_{P_i}(g_i)}{\operatorname{Var}_{\widetilde{P}_i}(g_i)} \in [A, Ae^\Delta] \tag{2.176}$$
$$\implies \frac{\frac{\lambda}{4}\sum_{i=1}^L\sum_{j=i\pm1}\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_ip_i, r_jp_j\}\,dx}{\frac{\lambda}{4}\sum_{i=1}^L\sum_{j=i\pm1}\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_i\widetilde{p}_i, r_j\widetilde{p}_j\}\,dx} \in [e^{-\Delta}, e^\Delta] \tag{2.177}$$
$$\frac{\operatorname{Var}_P(g)}{\operatorname{Var}_{\widetilde{P}}(g)} = \frac{\sum_{i=1}^L r_i\operatorname{Var}_{P_i}(g_i)}{\sum_{i=1}^L r_i\operatorname{Var}_{\widetilde{P}_i}(g_i)} \in [A, Ae^\Delta]. \tag{2.178}$$
Dividing (2.177) by (2.178) gives
$$\frac{\frac{\lambda}{4}\sum_{i=1}^L\sum_{j=i\pm1}\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_ip_i, r_jp_j\}\,dx}{\operatorname{Var}_P(g)} \ge e^{-3\Delta}\,\frac{\frac{\lambda}{4}\sum_{i=1}^L\sum_{j=i\pm1}\int_{\mathbb{R}^d}(g_i-g_j)^2\min\{r_i\widetilde{p}_i, r_j\widetilde{p}_j\}\,dx}{\operatorname{Var}_{\widetilde{P}}(g)}. \tag{2.179–2.180}$$
Adding (2.174) and (2.180) gives the result.
Theorem 2.7.6. Suppose $\sum_{j=1}^m w_j = 1$, $w_{\min} = \min_{1\le j\le m} w_j > 0$, and $D = \max\{\max_{1\le j\le m}\|\mu_j\|, \sigma\}$. Let $M$ be the continuous simulated tempering chain for the distributions
$$p_i(x) \propto \left(\sum_{j=1}^m w_j\,e^{-\frac{\|x-\mu_j\|^2}{2\sigma^2}}\right)^{\beta_i} \tag{2.181}$$
with rate $O\left(\frac{1}{D^2}\right)$, relative probabilities $r_i$, and temperatures $0<\beta_1<\cdots<\beta_L=1$ satisfying the same conditions as in Theorem 2.7.1. Then $M$ satisfies the Poincaré inequality
$$\operatorname{Var}(g) \le O\left(\frac{L^2D^2}{r^2w_{\min}^3}\right)\mathcal{E}(g,g) = O\left(\frac{d^2\left(\ln\left(\frac D\sigma\right)\right)^2D^2}{r^2w_{\min}^3}\right)\mathcal{E}(g,g). \tag{2.182}$$

Proof. Let $\widetilde{p}_i$ be the probability distributions of Theorem 2.7.1 with the same parameters as $p_i$, and let $\widetilde{p}$ be the stationary distribution of that simulated tempering chain. By Theorem 2.7.1, $\operatorname{Var}_{\widetilde{P}}(g) = O\left(\frac{L^2D^2}{r^2}\right)\widetilde{\mathcal{E}}(g,g)$. By Lemma 2.7.3, $\frac{p_i}{\widetilde{p}_i}\in\left[1,\frac{1}{w_{\min}}\right]\frac{\widetilde{Z}_i}{Z_i}$. Now apply Lemma 2.7.5 with $e^\Delta = \frac{1}{w_{\min}}$.
2.8 Discretization
Throughout this section, let $f$ be as in Theorem 2.4.3: $f(x) = -\ln\left(\sum_{i=1}^m w_ie^{-f_0(x-\mu_i)}\right)$, where $f_0$ is $\kappa$-strongly convex, $K$-smooth, and has its minimum at $0$.

Lemma 2.8.1. Fix times $0 < T_1 < \cdots < T_n \le T$. Let $p^T, q^T : [L]\times\mathbb{R}^d\to\mathbb{R}$ be probability density functions defined as follows (and let $P^T, Q^T$ denote the corresponding measures).

1. $p^T$ is the density function of the continuous simulated tempering Markov process as in Definition 2.5.1, but with fixed transition times $T_1,\ldots,T_n$. The component chains are Langevin diffusions on $p_i(x)\propto\left(\sum_{j=1}^m w_je^{-f_0(x-\mu_j)}\right)^{\beta_i}$.

2. $q^T$ is the discretized version as in Algorithm 1, again with fixed transition times $T_1,\ldots,T_n$, and with step size $\eta\le\frac{\sigma^2}{2}$.

Then
$$\operatorname{KL}(P^T\|Q^T) \lesssim \eta^2D^6K^7\left(\frac{D^2K^2}{\kappa}+d\right)Tn + \eta^2D^3K^3\max_i\mathbb{E}_{x\sim P^0(\cdot,i)}\|x-x^*\|_2^2 + \eta D^2K^2dT,$$
where $x^*$ is the maximizer of $\sum_{j=1}^m w_je^{-f_0(x-\mu_j)}$, which satisfies $\|x^*\| = O(D)$ with $D = \max_j\|\mu_j\|$.

Before proving the above statement, we make a note on the location of $x^*$, to make sense of $\max_i\mathbb{E}_{x\sim P^0(\cdot,i)}\|x-x^*\|_2^2$. Namely, we show:
Lemma 2.8.2 (Location of minimum). Let $x^* = \operatorname{argmin}_{x\in\mathbb{R}^d} f(x)$. Then $\|x^*\| \le D\left(\sqrt{\frac K\kappa}+1\right)$.

Proof. Recall that $f_i(x) = f_0(x-\mu_i)$. We claim that $f(0) \le \frac12KD^2$. Indeed, by smoothness we have $f_i(0) \le \frac12K\|\mu_i\|^2 \le \frac12KD^2$ for each $i$, which (since $\sum_iw_i = 1$) implies $f(0) \le \frac12KD^2$.

Hence $\min_{x\in\mathbb{R}^d}f(x) \le \frac12KD^2$. On the other hand, for any $x$ with $\|x\| \ge D$, strong convexity gives
$$f(x) \ge \min_i f_i(x) \ge \frac{\kappa}{2}\min_i\|x-\mu_i\|^2 \ge \frac{\kappa}{2}\left(\|x\|-D\right)^2.$$
Hence, if $\|x\| > D\left(\sqrt{\frac K\kappa}+1\right)$, then $f(x) > \frac12KD^2 \ge \min_{x\in\mathbb{R}^d}f(x)$. This implies the statement of the lemma.
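As a numerical sanity check of the lemma (with the assumed toy choice $f_0(x) = x^2$, so $\kappa = K = 2$, and hypothetical means and weights), the global minimizer of $f$ found by grid search stays within the claimed radius:

```python
import numpy as np

# Toy instance of Lemma 2.8.2: f_0(x) = x^2 is 2-strongly convex and 2-smooth
# with minimum at 0, so kappa = K = 2 (an assumed example, not from the thesis).
mus = np.array([-3.0, 1.0, 2.5])
w = np.array([0.2, 0.5, 0.3])
D = np.abs(mus).max()
kappa = K = 2.0

# f(x) = -ln sum_j w_j exp(-f_0(x - mu_j)); find its minimizer on a fine grid.
xs = np.linspace(-20, 20, 400001)
f = -np.log(np.exp(-(xs[:, None] - mus) ** 2) @ w)
x_star = xs[np.argmin(f)]

# Lemma 2.8.2: ||x*|| <= D (sqrt(K/kappa) + 1) = 2D.
assert abs(x_star) <= D * (np.sqrt(K / kappa) + 1)
```

In fact the minimizer lands inside the convex hull of the means here; the lemma's bound is a crude but dimension-free envelope.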
We prove a few technical lemmas. First, we prove that the continuous chain is essentially contained in a ball of radius $O(D)$. More precisely, we show:

Lemma 2.8.3 (Reach of continuous chain). Let $P_T^\beta(X)$ be the Markov kernel corresponding to evolving the Langevin diffusion
$$dX_t = -\beta\nabla f(X_t)\,dt + \sqrt{2}\,dB_t,$$
where $f$ and $D$ are as defined in (2.9), for time $T$. Then
$$\mathbb{E}[\|X_T-x^*\|^2] \le \mathbb{E}[\|X_0-x^*\|^2] + \left(400\beta\,\frac{D^2K^2}{\kappa}+2d\right)T.$$

Proof. Let $Y_t = \|X_t-x^*\|^2$. By Itô's lemma, we have
$$dY_t = -2\left\langle X_t-x^*,\ \beta\,\frac{\sum_{i=1}^m w_ie^{-f_i(X_t)}\nabla f_i(X_t)}{\sum_{j=1}^m w_je^{-f_j(X_t)}}\right\rangle dt + 2d\,dt + \sqrt{8}\sum_{i=1}^d (X_t-x^*)_i\,d(B_i)_t. \tag{2.183}$$
We will show that
$$-\langle X_t-x^*,\nabla f_i(X_t)\rangle \le 100\,\frac{D^2K^2}{\kappa}.$$
Indeed, since $f_i(x) = f_0(x-\mu_i)$, strong convexity and smoothness of $f_0$ give
$$\langle X_t,\nabla f_i(X_t)\rangle \ge \frac{\kappa}{2}\|X_t\|^2 - \frac{D^2(2\kappa+K)^2}{2\kappa} - KD^2.$$
Also, by the Hessian bound $\kappa I \preceq \nabla^2f_0(x) \preceq KI$ and Lemma 2.8.2, we have
$$\langle x^*,\nabla f_i(X_t)\rangle \le \|x^*\|\,\|\nabla f_i(X_t)\| \le D\left(\sqrt{\tfrac K\kappa}+1\right)K\,\|X_t-\mu_i\| \le D\left(\sqrt{\tfrac K\kappa}+1\right)K\,(\|X_t\|+D).$$
Hence,
$$-\langle X_t-x^*,\nabla f_i(X_t)\rangle \le -\frac{\kappa}{2}\|X_t\|^2 + \frac{D^2(2\kappa+K)^2}{2\kappa} + KD^2 + D\left(\sqrt{\tfrac K\kappa}+1\right)K\,(\|X_t\|+D).$$
Maximizing the quadratic in $\|X_t\|$ on the right-hand side, we get
$$-\langle X_t-x^*,\nabla f_i(X_t)\rangle \le 100\,\frac{D^2K^2}{\kappa}.$$
Together with (2.183), we get
$$dY_t \le \left(400\beta\,\frac{D^2K^2}{\kappa}+2d\right)dt + \sqrt{8}\sum_{i=1}^d(X_t-x^*)_i\,d(B_i)_t.$$
Integrating, we get
$$Y_T \le Y_0 + \left(400\beta\,\frac{D^2K^2}{\kappa}+2d\right)T + \sqrt{8}\int_0^T\sum_{i=1}^d(X_t-x^*)_i\,d(B_i)_t.$$
Taking expectations and using the martingale property of the Itô integral, we get the claim of the lemma.
Next, we prove a technical bound on the drift of the discretized chain after $T/\eta$ discrete steps. The proofs follow calculations similar to those in [Dal16].

We will first need to bound the Hessian of $f$.

Lemma 2.8.4 (Hessian bound). For all $x\in\mathbb{R}^d$,
$$-2(DK)^2I \preceq \nabla^2f(x) \preceq KI.$$

Proof. For notational convenience, let $p(x) = \sum_{i=1}^m w_ie^{-f_i(x)}$; note that $f(x) = -\log p(x)$. We prove the upper bound first. The Hessian of $f$ satisfies
$$\nabla^2f = \frac{\sum_i w_ie^{-f_i}\,\nabla^2f_i}{p} - \frac{\frac12\sum_{i,j}w_iw_je^{-f_i}e^{-f_j}(\nabla f_i-\nabla f_j)^{\otimes2}}{p^2} \preceq \max_i\nabla^2f_i \preceq KI,$$
as we need. As for the lower bound, we have
$$\nabla^2f \succeq -\frac12\left(\max_{i,j}\|\nabla f_i-\nabla f_j\|^2\right)I.$$
But notice that since $f_i(x) = f_0(x-\mu_i)$, we have
$$\|\nabla f_i(x)-\nabla f_j(x)\| = \|\nabla f_0(x-\mu_i)-\nabla f_0(x-\mu_j)\| \le K\|\mu_i-\mu_j\| \le 2DK,$$
where the next-to-last inequality follows from the $K$-smoothness of $f_0$. This proves the statement of the lemma.
We introduce the following notation: denote by $P_T(i,x)$ the measure on $[L]\times\mathbb{R}^d$ corresponding to running the Langevin diffusion process for time $T$ on the second coordinate, starting at $(i,x)$, and keeping the first coordinate fixed. Define $\widehat{P}_T(i,x) : [L]\times\mathbb{R}^d\to\mathbb{R}$ as the analogous measure, except running the discretized Langevin chain for $\frac{T}{\eta}$ steps on the second coordinate, for $\frac{T}{\eta}$ an integer.

Lemma 2.8.5 (Bounding interval drift). In the setting of this section, let $i\in[L]$, $x\in\mathbb{R}^d$, and let $\eta\le\frac1K$. Then
$$\operatorname{KL}(P_T(i,x)\|\widehat{P}_T(i,x)) \le \frac{4D^6\eta^2K^7}{3}\left(\|x-x^*\|_2^2 + 8Td\right) + dTD^2\eta K^2.$$

Proof. Let $x_k$, $k\in[0, T/\eta-1]$, be a random variable distributed as $P_{\eta k}(i,x)$. By Lemma 2 in [Dal16] and Lemma 2.8.4, we have
$$\operatorname{KL}(P_T(i,x)\|\widehat{P}_T(i,x)) \le \frac{\eta^3D^2K^2}{3}\sum_{k=0}^{T/\eta-1}\mathbb{E}[\|\nabla f(x_k)\|_2^2] + dT\eta D^2K^2.$$
Similarly, the proof of Corollary 4 in [Dal16] implies that
$$\eta\sum_{k=0}^{T/\eta-1}\mathbb{E}[\|\nabla f(x_k)\|_2^2] \le 4D^4K^4\|x-x^*\|_2^2 + 8DKTd.$$
Plugging this in, we get the statement of the lemma.
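The discretization bias that Lemma 2.8.5 controls in KL can be seen in the simplest possible case. For $f(x) = x^2/2$, the unadjusted Langevin chain is an AR(1) process whose stationary variance is exactly $1/(1-\eta/2)$ rather than $1$, i.e., an $O(\eta)$ bias. The following sketch (a standard illustration, not from the thesis) confirms this:

```python
import numpy as np

rng = np.random.default_rng(3)

# ULA for f(x) = x^2/2:  x_{k+1} = x_k - eta f'(x_k) + sqrt(2 eta) xi_k,
# i.e., x_{k+1} = (1 - eta) x_k + sqrt(2 eta) xi_k, an AR(1) chain.
# Its stationary variance is 2 eta / (1 - (1 - eta)^2) = 1 / (1 - eta/2),
# so the chain targets N(0, 1/(1 - eta/2)) instead of N(0, 1).
eta = 0.1
x = np.zeros(200000)
for k in range(1, len(x)):
    x[k] = x[k - 1] - eta * x[k - 1] + np.sqrt(2 * eta) * rng.normal()

var_hat = x[1000:].var()           # drop a short burn-in
assert abs(var_hat - 1 / (1 - eta / 2)) < 0.06
```

Shrinking $\eta$ shrinks the bias linearly, which is exactly the trade-off (more steps versus less bias) that the step-size condition in Lemma 2.9.2 negotiates.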
To prove the main claim, we will use Lemma D.1.6, a decomposition theorem for the KL divergence of two mixtures of distributions, in terms of the KL divergence of the weights and of the components of the mixtures.

Proof of Lemma 2.8.1. Denote by $R(i,x)$ the measure on $[L]\times\mathbb{R}^d$ after one Type 2 transition in the simulated tempering process, starting at $(i,x)$.

We proceed by induction. Towards that, we can write
$$p^{T_{i+1}} = \frac12\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)\,P_{T_{i+1}-T_i}(j,x)\right) + \frac12\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)\,R(j,x)\right)$$
and similarly
$$q^{T_{i+1}} = \frac12\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)\,\widehat{P}_{T_{i+1}-T_i}(j,x)\right) + \frac12\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)\,R(j,x)\right).$$
(Note that the transition kernel $R$ is the same in the discretized and continuous versions.)

By convexity of KL divergence, we have
$$\operatorname{KL}(P^{T_{i+1}}\|Q^{T_{i+1}}) \le \frac12\operatorname{KL}\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)P_{T_{i+1}-T_i}(j,x)\,\middle\|\,\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)\widehat{P}_{T_{i+1}-T_i}(j,x)\right) + \frac12\operatorname{KL}\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)R(j,x)\,\middle\|\,\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)R(j,x)\right).$$

By Lemma D.1.6, we have that
$$\operatorname{KL}\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)R(j,x)\,\middle\|\,\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)R(j,x)\right) \le \operatorname{KL}(P^{T_i}\|Q^{T_i}).$$
Similarly, by Lemma 2.8.5 together with Lemma D.1.6, we have
$$\operatorname{KL}\left(\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}p^{T_i}(j,x)P_{T_{i+1}-T_i}(j,x)\,\middle\|\,\int_{x\in\mathbb{R}^d}\sum_{j=0}^{L-1}q^{T_i}(j,x)\widehat{P}_{T_{i+1}-T_i}(j,x)\right) \le \operatorname{KL}(P^{T_i}\|Q^{T_i}) + \frac{4D^6K^6\eta^2}{3}\left(\max_j\mathbb{E}_{x\sim P^{T_i}(\cdot,j)}\|x-x^*\|_2^2 + 8(T_{i+1}-T_i)d\right) + d(T_{i+1}-T_i)\eta K^2.$$

By Lemmas 2.8.3 and 2.8.2, we have that for any $j\in[0,L-1]$,
$$\mathbb{E}_{x\sim P^{T_i}(\cdot,j)}\|x-x^*\|_2^2 \le \mathbb{E}_{x\sim P^{T_{i-1}}(\cdot,j)}\|x-x^*\|_2^2 + \left(400\,\frac{D^2K^2}{\kappa}+2d\right)(T_i-T_{i-1}).$$
Hence, inductively, we have
$$\mathbb{E}_{x\sim P^{T_i}(\cdot,j)}\|x-x^*\|_2^2 \le \mathbb{E}_{x\sim P^0(\cdot,j)}\|x-x^*\|_2^2 + \left(400\,\frac{D^2K^2}{\kappa}+2d\right)T_i.$$
Putting everything together, we have
$$\operatorname{KL}(P^{T_{i+1}}\|Q^{T_{i+1}}) \le \operatorname{KL}(P^{T_i}\|Q^{T_i}) + \frac{4\eta^2D^6K^7}{3}\left(\max_j\mathbb{E}_{x\sim P^0(\cdot,j)}\|x-x^*\|_2^2 + \left(400\,\frac{D^2K^2}{\kappa}+2d\right)T + 8Td\right) + dT\eta D^2K^2.$$
By induction, we hence have
$$\operatorname{KL}(P^T\|Q^T) \lesssim \eta^2D^6K^7\left(\frac{D^2K^2}{\kappa}+d\right)Tn + \eta^2D^3K^3\max_i\mathbb{E}_{x\sim P^0(\cdot,i)}\|x-x^*\|_2^2 + \eta D^2K^2dT,$$
as needed.
2.9 Proof of main theorem
Before putting everything together, we show how to estimate the partition functions. We will apply the following lemma to $g_1(x) = e^{-\beta_\ell f(x)}$ and $g_2(x) = e^{-\beta_{\ell+1}f(x)}$.

Lemma 2.9.1 (Estimating the partition function to within a constant factor). Suppose that $P_1$ and $P_2$ are probability measures on $\Omega$ with density functions (with respect to a reference measure) $p_1(x) = \frac{g_1(x)}{Z_1}$ and $p_2(x) = \frac{g_2(x)}{Z_2}$. Suppose $\widetilde{P}_1$ is a measure such that $d_{TV}(\widetilde{P}_1, P_1) < \frac{\varepsilon}{2C^2}$, and $\frac{g_2(x)}{g_1(x)}\in[0,C]$ for all $x\in\Omega$. Given $n$ samples $x_1,\ldots,x_n$ from $\widetilde{P}_1$, define the random variable
$$\widehat{r} = \frac1n\sum_{i=1}^n\frac{g_2(x_i)}{g_1(x_i)}. \tag{2.184}$$
Let
$$r = \mathbb{E}_{x\sim P_1}\frac{g_2(x)}{g_1(x)} = \frac{Z_2}{Z_1} \tag{2.185}$$
and suppose $r\ge\frac1C$. Then with probability $\ge 1-e^{-\frac{n\varepsilon^2}{2C^4}}$,
$$\left|\frac{\widehat{r}}{r}-1\right| \le \varepsilon. \tag{2.186}$$

Proof. We have that
$$\left|\mathbb{E}_{x\sim\widetilde{P}_1}\frac{g_2(x)}{g_1(x)} - \mathbb{E}_{x\sim P_1}\frac{g_2(x)}{g_1(x)}\right| \le C\,d_{TV}(\widetilde{P}_1, P_1) \le \frac{\varepsilon}{2C}. \tag{2.187}$$
The Chernoff bound gives
$$\mathbb{P}\left(\left|\widehat{r} - \mathbb{E}_{x\sim\widetilde{P}_1}\frac{g_2(x)}{g_1(x)}\right| \ge \frac{\varepsilon}{2C}\right) \le e^{-\frac{n\left(\frac{\varepsilon}{2C}\right)^2}{2\left(\frac{C}{2}\right)^2}} = e^{-\frac{n\varepsilon^2}{2C^4}}. \tag{2.188}$$
Combining (2.187) and (2.188) using the triangle inequality,
$$\mathbb{P}\left(|\widehat{r}-r| \ge \frac{\varepsilon}{C}\right) \le e^{-\frac{n\varepsilon^2}{2C^4}}. \tag{2.189}$$
Dividing by $r$ and using $r\ge\frac1C$ gives the result.
Lemma 2.9.2. Suppose that Algorithm 1 is run on $f(x) = -\ln\left(\sum_{j=1}^m w_j\exp\left(-\frac{\|x-\mu_j\|^2}{2\sigma^2}\right)\right)$ with temperatures $0<\beta_1<\cdots<\beta_\ell\le1$, $\ell\le L$, rate $\lambda$, and with partition function estimates $\widehat{Z}_1,\ldots,\widehat{Z}_\ell$ satisfying
$$\frac{\widehat{Z}_i}{Z_i}\Big/\frac{\widehat{Z}_1}{Z_1} \in \left[\left(1-\frac1L\right)^{i-1},\ \left(1+\frac1L\right)^{i-1}\right] \tag{2.190}$$
for all $1\le i\le\ell$. Suppose $\sum_{j=1}^m w_j = 1$, $w_{\min} = \min_{1\le j\le m} w_j > 0$, $D = \max\{\max_{1\le j\le m}\|\mu_j\|, \sigma\}$, and the parameters satisfy
$$\lambda = \Theta\left(\frac{1}{D^2}\right) \tag{2.191}$$
$$\beta_1 = \Theta\left(\frac{\sigma^2}{D^2}\right) \tag{2.192}$$
$$\frac{\beta_{i+1}}{\beta_i} \le 1+\frac{1}{d+\ln\left(\frac{1}{w_{\min}}\right)} \tag{2.193}$$
$$L = \Theta\left(\left(d+\ln\left(\frac{1}{w_{\min}}\right)\right)\ln\left(\frac{D}{\sigma}\right)+1\right) \tag{2.194}$$
$$T = \Omega\left(\frac{L^2D^2\ln\left(\frac{\ell}{\varepsilon w_{\min}}\right)}{w_{\min}^3}\right) \tag{2.195}$$
$$\eta = O\left(\frac{\sigma^3\varepsilon}{D^2}\min\left\{\frac{\sigma^4}{\left(\frac{D}{\sigma}+\sqrt{d}\right)T},\ \frac{1}{D^{12}},\ \frac{\sigma\varepsilon}{dT}\right\}\right). \tag{2.196}$$
Let $q^0$ be the distribution $\left(N\left(0,\frac{\sigma^2}{\beta_1}I_d\right), 1\right)$ on $[\ell]\times\mathbb{R}^d$. Then the distribution $q^T$ after running for time $T$ satisfies $\|p-q^T\|_1 \le \varepsilon$.

Setting $\varepsilon = O\left(\frac{1}{\ell L}\right)$ above and taking $n = \Omega\left(L^2\ln\left(\frac1\delta\right)\right)$ samples, with probability $1-\delta$ the estimate
$$\widehat{Z}_{\ell+1} = \widehat{r}\,\widehat{Z}_\ell, \qquad \widehat{r} := \frac1n\sum_{j=1}^n e^{(-\beta_{\ell+1}+\beta_\ell)f(x_j)} \tag{2.197}$$
also satisfies (2.190).
Proof. By the triangle inequality,
$$\|p - q^T\|_1 \le \|p - p^T\|_1 + \|p^T - q^T\|_1. \tag{2.198}$$
For the first term, by Cauchy-Schwarz (using capital letters for the probability measures),
$$\|p - p^T\|_1 \le \sqrt{\chi^2(P^T\,||\,P)} \le e^{-\frac{T}{2C}}\sqrt{\chi^2(P^0\,||\,P)} \tag{2.199}$$
where C = O(d²(ln(D/σ))²D²/w_min²) is an upper bound on the Poincaré constant as in Theorem 2.7.6. (The assumption on Ẑ_i means that r ≤ e.) Let p_i be the distribution of p at the i-th temperature, and p̃_i be as in Lemma 2.7.3.
To calculate χ²(P⁰ || P), first note by Lemma D.2.1 that the χ² distance between N(0, (σ²/β_1)I_d) and N(μ, (σ²/β_1)I_d) is ≤ e^{‖μ‖²β_1/σ²}. Then
$$\chi^2(P^0\,||\,P) \tag{2.200}$$
$$= O(\ell)\,\chi^2\left(N\left(0, \frac{\sigma^2}{\beta_1}I_d\right)\,\Big|\Big|\,P_1\right) \tag{2.201}$$
$$= O\left(\frac{\ell}{w_{\min}}\right)\left(1 + \chi^2\left(N\left(0, \frac{\sigma^2}{\beta_1}I_d\right)\,\Big|\Big|\,\widetilde{P}_1\right)\right) \tag{2.202}$$
by Lemma 2.7.3 and Lemma D.1.5,
$$= O\left(\frac{\ell}{w_{\min}}\right)\left(1 + \sum_{j=1}^m w_j\,\chi^2\left(N\left(0, \frac{\sigma^2}{\beta_1}I_d\right)\,\Big|\Big|\,N\left(\mu_j, \frac{\sigma^2}{\beta_1}I_d\right)\right)\right) \tag{2.203}$$
by Lemma D.1.4,
$$= O\left(\frac{e^{D^2\beta_1/\sigma^2}\,\ell}{w_{\min}}\right) = O\left(\frac{\ell}{w_{\min}}\right). \tag{2.204}$$
Together with (2.199) this gives ‖p − p^T‖_1 ≤ ε/3.
For the second term ‖p^T − q^T‖_1, we first condition on there not being too many transitions before time T. Let N_T = max{n : T_n ≤ T} be the number of transitions. Let C be as in Lemma D.4.1. Note that (Cn/(Tλ))^{−n} ≤ ε ⟺ e^{n(ln(Tλ/C) − ln n)} ≤ ε, and that this inequality holds when n ≥ eTλ/C + ln(1/ε). We have by Lemma D.4.1 that P(N_T ≥ eTλ/C + ln(1/ε)) ≤ ε/3. With our choice of T, ln(1/ε) = O(T).
If we condition on the event A of the T_i's being a particular sequence T_1, . . . , T_n with n < eTλ/C + ln(1/ε), Pinsker's inequality and Lemma 2.8.1 (with K = κ = 1/σ²) give us
$$\|p^T(\cdot\,|\,A) - q^T(\cdot\,|\,A)\|_1 \le \sqrt{2\,\mathrm{KL}(P^T(\cdot\,|\,A)\,||\,Q^T(\cdot\,|\,A))} \tag{2.205}$$
$$= O\left(\max\left\{\frac{\eta^2 D^6 T^2 \lambda\left(\frac{D^2}{\sigma^2}+d\right)}{\sigma^{14}},\ \frac{\eta^2 D^5}{\sigma^6},\ \frac{\eta D^2 d T}{\sigma^4}\right\}^{1/2}\right). \tag{2.206}$$
In order for this to be ≤ ε/3, we need (for some absolute constant C_1)
$$\eta \le \frac{C_1\sigma^3\varepsilon}{D^2}\min\left\{\frac{\sigma^4}{(D/\sigma+\sqrt{d})T},\ \frac{1}{D^{1/2}},\ \frac{\sigma\varepsilon}{dT}\right\}. \tag{2.207}$$
Putting everything together,
$$\|p^T - q^T\|_1 \le \mathbb{P}(N_T \ge cT\lambda) + \|p^T(\cdot\,|\,N_T < cT\lambda) - q^T(\cdot\,|\,N_T < cT\lambda)\|_1 \le \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \frac{2\varepsilon}{3}. \tag{2.208}$$
This gives ‖p − q^T‖_1 ≤ ε.
For the second part, setting ε = O(1/(ℓL)) gives ‖p̃_ℓ − q^T_ℓ‖_1 = O(1/L). We will apply Lemma 2.9.1. By Lemma D.3.1 the assumptions of Lemma 2.9.1 are satisfied with C = O(1), as we have
$$\frac{\beta_{i+1} - \beta_i}{\beta_i} = O\left(\frac{1}{\alpha D^2/\sigma^2 + d + \ln(1/w_{\min})}\right). \tag{2.209}$$
By Lemma 2.9.1, after collecting n = Ω(L² ln(1/δ)) samples, with probability ≥ 1 − δ, |(Ẑ_{ℓ+1}/Ẑ_ℓ)/(Z_{ℓ+1}/Z_ℓ) − 1| ≤ 1/L. Set Ẑ_{ℓ+1} = r̂Ẑ_ℓ. Then Ẑ_{ℓ+1}/Ẑ_ℓ ∈ [1 − 1/L, 1 + 1/L] · (Z_{ℓ+1}/Z_ℓ) and Ẑ_{ℓ+1}/Ẑ_1 ∈ [(1 − 1/L)^ℓ, (1 + 1/L)^ℓ] · (Z_{ℓ+1}/Z_1).
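The final containment rests on the elementary fact that a per-level multiplicative error of 1 ± 1/L, compounded over at most L − 1 levels, stays inside [1/e, e]. A quick numerical check of this fact (our own, not from the text):

```python
import math

# A per-level multiplicative error of (1 +- 1/L), compounded over at most
# L - 1 levels, keeps every estimate Z-hat_i / Z-hat_1 within a factor e
# of the true ratio Z_i / Z_1.
for L in [2, 5, 10, 100, 1000]:
    lo = (1 - 1 / L) ** (L - 1)
    hi = (1 + 1 / L) ** (L - 1)
    assert math.exp(-1) <= lo < 1 < hi <= math.e
    print(L, round(lo, 4), round(hi, 4))
```

This is why the per-temperature accuracy 1/L in Lemma 2.9.1 is exactly what the induction over at most L temperatures can tolerate.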
Proof of Theorem 2.4.2. Choose δ = ε/(2L), where L is the number of temperatures. Using Lemma 2.9.2 inductively, with probability 1 − ε/2 each estimate satisfies Ẑ_ℓ/Ẑ_1 ∈ [1/e, e]. Estimating the final distribution to within ε/2 accuracy gives the desired sample.
2.10 Conclusion
We initiated a study of sampling “beyond log-concavity.” In so doing, we developed
a new general technique to analyze simulated tempering, a classical algorithm used
in practice to combat multimodality but that has seen little theoretical analysis. The
technique is a new decomposition lemma for Markov chains based on decomposing
the Markov chain rather than just the state space. We have analyzed simulated
tempering with Langevin diffusion, but note that the technique can be applied to any
other Markov chain with a notion of temperature.
Our result is the first in its class (sampling multimodal, non-log-concave
distributions with gradient oracle access). Admittedly, distributions encountered in
practice are rarely mixtures of distributions with the same shape. However, we hope
that our techniques may be built on to provide guarantees for more practical proba-
bility distributions. An exciting research direction is to provide (average-case) guar-
antees for probability distributions encountered in practice, such as posteriors for
clustering, topic models, and Ising models. For example, the posterior distribution
for a mixture of Gaussians can have exponentially many terms, but may perhaps be
tractable in practice. Another interesting direction is to study other temperature
heuristics used in practice, such as particle filters [Sch12; DHW12; PJT15; GD17],
annealed importance sampling [Nea01], and parallel tempering [WSH09a]. We note
that our decomposition theorem can also be used to analyze simulated tempering in
the infinite switch limit [Mar+19].
Chapter 3
Online sampling from log-concave
distributions
This chapter is based on work in [LMV19].
3.1 Introduction
In this work, we study the following online sampling problem:
Problem 3.1.1. Consider a sequence of convex functions f_0, f_1, . . . , f_T : ℝ^d → ℝ
for some T ∈ ℕ, and let ε > 0. At each epoch t ∈ {1, . . . , T}, the function f_t is
given to us, so that we have oracle access to the gradients of the first t + 1 func-
tions f_0, f_1, . . . , f_t. The goal is to generate a sample from the distribution π_t(x) ∝
e^{−Σ_{k=0}^t f_k(x)} with some fixed total-variation (TV) error ε > 0 at each epoch t. The
samples at different time steps should be almost independent.
The motivation to study this problem comes from machine learning, Bayesian statis-
tics, optimization, and theoretical computer science, and various versions of this
problem have been considered in the literature; see [NR17; Dou+00; ADH10] and
the references therein.
In Bayesian statistics, the goal is to infer the probability distribution (the poste-
rior) of a certain parameter based on observations; however, rather than obtaining
all the observations at once, one constantly acquires new data, and must continuously
update the posterior distribution (rather than only after all data has been collected).
One practical application of online sampling is online logistic regression, where one
wishes to obtain samples from a changing Bayesian posterior distribution as data is
acquired over time. Another practical application of online sampling which has been
well-studied is latent Dirichlet allocation (LDA), which is applied to document clas-
sification ([BNJ03]). As new documents are published, it is desirable to update the
distribution of topics without excessive re-computation. 1
We give some settings where online sampling algorithms can be used:
∙ Online Bayesian logistic regression. Concretely, suppose 𝜃 ∼ 𝑝0 for a
given prior distribution, and that samples 𝑦𝑡 are drawn from the conditional
distribution 𝑝(·|𝜃, 𝑦1, . . . , 𝑦𝑡−1). We would like to find the posterior distribution
of 𝑝(𝜃|𝑦1, . . . , 𝑦𝑇 ). By Bayes’ rule and letting 𝑝𝑡 := 𝑝(𝜃|𝑦1, . . . , 𝑦𝑡), we have the
following recursion.
𝑝𝑡(𝜃) ∝ 𝑝𝑡−1(𝜃)𝑝(𝑦𝑡|𝜃, 𝑦1, . . . , 𝑦𝑡−1). (3.1)
The goal is to efficiently obtain one or more samples θ_t from the posterior distribution
𝑝𝑡(𝜃), for each 𝑡. We can think of the samples 𝑦𝑡 as arriving in a streaming or
online manner, and we want to keep updating our estimate for the probability
distribution. This fits the setting of Problem 3.1.1 by defining 𝑓0 to be such
that 𝑝0 ∝ 𝑒−𝑓0 and 𝑓𝑡 to be such that 𝑝(𝑦𝑡|𝜃, 𝑦1, . . . , 𝑦𝑡−1) ∝ 𝑒−𝑓𝑡 , whenever the
𝑓𝑡’s are convex.
1 The theoretical results in this work do not apply to LDA, since LDA requires sampling from non-log-concave distributions. However, one can still apply our algorithm to non-log-concave distributions such as those of LDA.
∙ Optimization. Online sampling is useful even if one is only interested in
optimization: one generic algorithm for online optimization is to sample a point
𝑥𝑡 from the exponential of the (suitably weighted) negative loss ([CL06], Lemma
10 in [NR17]). Indeed there are settings such as online logistic regression in
which the only known way to achieve optimal regret is through a Bayesian
sampling approach [Fos+18], with lower bounds known for the naive convex
optimization approach [HKL14].
∙ Reinforcement learning. In reinforcement learning problems [Rus+18; DFE18],
a class of online optimization problems, one seeks to choose a set of actions which
maximize a sum of “rewards” over multiple time periods. The expected value
of the reward depends on the value of a vector of unknown model parameters as
well as on the chosen action vector. While one seeks to choose an action at each
time period which gives a large reward, one also wishes to choose a wide range
of actions at different time periods in order to explore the set of possible actions,
allowing one to make a better choice of actions in future periods. Thompson
sampling [Rus+18; DFE18] solves this “exploration-exploitation dilemma” by
maximizing the expected reward at each period with respect to a sample from
the Bayesian posterior distribution for the model parameters. Every time one
chooses an action, more data is acquired from the outcome of the reward, so
that the Bayesian posterior distribution changes at each time period. To imple-
ment Thompson sampling efficiently in real time, one wishes to sample quickly
from this changing posterior distribution even as the number of data points
grows very large. For instance, if one implements Thompson sampling with a
logistic model, then one would need to sample from a changing Bayesian logistic
posterior distribution.
∙ Sampling from a log-concave distribution. Sampling from log-concave dis-
tributions is a classic problem in theoretical computer science with applications
to volume computation and integration [LV06], and an algorithm for Problem
3.1.1 can be used to come up with iterative (offline) sampling algorithms for
a log-concave distribution that has the form e^{−f(x)} = e^{−Σ_{t=0}^T f_t(x)}. This
“sum-form” often arises in machine learning applications with T ≫ d, and the cost of
evaluating the gradient of f is T times greater than the cost of evaluating the
gradient of a single f_t. Thus, one approach to sampling from e^{−f(x)} could be to
think of the f_t's as a sequence and sample incrementally as in Problem 3.1.1.
In all of these applications, because a sample is needed at every epoch 𝑡, it is desirable
to have a fast online sampling algorithm. In particular, the ultimate goal is to design
an algorithm for Problem 3.1.1 such that the number of gradient evaluations is con-
stant at each epoch 𝑡, so that the computational requirements at each epoch do not
increase over time. However, this is quite challenging because at epoch 𝑡, one has to
incorporate information from all 𝑡+ 1 functions 𝑓0, . . . , 𝑓𝑡, while only using a number
of gradient computations which is logarithmic in the total number of functions.
The main contribution of this work is an algorithm for Problem 3.1.1 that, under
mild assumptions on the functions, makes Õ_T(1) gradient evaluations per epoch (here
the subscript T in Õ_T means that we only show the dependence on the parameters t, T,
and exclude dependence on non-T, t parameters such as the dimension d, sampling
accuracy ε, and the regularity parameters C, D, L, which we define in Section 3.2.1). All
previous rigorous results (even with comparable assumptions) for this problem imply
a bound on the number of gradient or function evaluations which is at least linear in 𝑇 ;
see Table 3.1. We assume that the functions are smooth, they have a bounded second
moment, and their minimizer drifts in a bounded manner, but we do not assume that
the functions are strongly convex. These assumptions are motivated from real-world
considerations and, as a concrete application, we show that these assumptions hold
in the setting of online Bayesian logistic regression, when the data vectors satisfy
natural regularity properties, giving a sampling algorithm with Õ_T(1) updates. Our
result also implies the first algorithm to sample from a 𝑑-dimensional log-concave
distribution of the form e^{−Σ_{t=0}^T f_t} where the f_t's are not assumed to be strongly
convex and the total number of gradient evaluations is roughly T log(T) + poly(d), as
opposed to T · poly(d) implied by prior works; see Table 3.2.
A natural approach to online sampling is to design a Markov chain with the right
steady state distribution [NR17; DMM18; Dwi+18; Cha+18]. The main difficulty is
that running a step of a Markov chain that incorporates all previous functions takes
time Ω(𝑡) at epoch 𝑡; all previous algorithms with provable guarantees suffer from this.
To overcome this, one must use stochasticity – for example, sample a subset of the
previous functions. However, this fails because of the large variance of the gradient.
Our result relies on a stochastic gradient Langevin dynamics (SGLD) Markov chain
that has a carefully designed variance reduction step built in, with a fixed Õ_T(1)
batch size. Technically, lack of strong convexity is a significant barrier to analyzing
our Markov chain and, here, our main contribution is a martingale exit time argument
that shows that our Markov chain is constrained to a ball of radius roughly 1/√T
for time that is sufficient for it to reach within ε of π_t.
3.2 Our results
3.2.1 Assumptions
Denote by ℒ(Y) the distribution of a random variable Y. For any two probability mea-
sures μ, ν, denote the 2-Wasserstein distance by W_2(μ, ν) := inf_{(X,Y)∼Π(μ,ν)} √(E[‖X − Y‖²]),
where Π(μ, ν) denotes the set of all possible couplings of random vectors (X, Y) with
marginals X ∼ μ and Y ∼ ν. For every t ∈ {0, . . . , T}, define F_t := Σ_{k=0}^t f_k, and
let x*_t be a minimizer of F_t(x) on ℝ^d. For any x ∈ ℝ^d, let δ_x be the Dirac delta
distribution centered at x. We make the following assumptions:
Assumption 3.2.1 (Smoothness/Lipschitz gradient (with constants 𝐿0, 𝐿 > 0)).
For all 1 ≤ 𝑡 ≤ 𝑇 and 𝑥, 𝑦 ∈ R𝑑, ‖∇𝑓𝑡(𝑦)−∇𝑓𝑡(𝑥)‖ ≤ 𝐿 ‖𝑥− 𝑦‖. For 𝑡 = 0,
‖∇𝑓0(𝑦)−∇𝑓0(𝑥)‖ ≤ 𝐿0 ‖𝑥− 𝑦‖.
We allow 𝑓0 to satisfy our assumptions with a different parameter value, since in
Bayesian applications 𝑓0 models a “prior” which has different scaling than 𝑓1, 𝑓2, . . ..
Assumption 3.2.2 (Exponential concentration (with constants A, k > 0, c ≥ 0)). For all 0 ≤ t ≤ T, the concentration condition P_{X∼π_t}(‖X − x*_t‖ ≥ γ/√(t+c)) ≤ A e^{−kγ} holds.
Note that Assumption 3.2.2 implies a bound on the second moment, m_2 := (E_{x∼π_t}‖x − x*_t‖_2²)^{1/2} ≤ C/√(t+c) for C = (2 + 1/k) log(Ak²). For conciseness, we will write bounds in terms of this parameter C.²
Assumption 3.2.3 (Drift of MAP (with constants D ≥ 0, c ≥ 0)).³ For all 0 ≤ t, τ ≤ T such that τ ∈ [t, max{2t, 1}], ‖x*_t − x*_τ‖ ≤ D/√(t+c).
Assumption 3.2.2 says that the "data is informative enough": the current distribution π_t (posterior) concentrates near the mode x*_t as t increases. The 1/t decrease in the second moment is what one would expect based on central limit theorems such as the Bernstein-von Mises theorem. It is a much weaker condition than strong convexity. Indeed, if the f_t's are α-strongly convex, then π_t(x) ∝ e^{−Σ_{k=0}^t f_k(x)} has standard deviation ≤ √d/√(α(t+1)) (consider for instance the example of Gaussians with variance 1/α). In addition, many distributions satisfy Assumption 3.2.2 but are not strongly logconcave. For instance, posterior distributions used in Bayesian logistic regression satisfy Assumption 3.2.2 under natural conditions on the data, but are not strongly
2 Having a bounded second moment suffices to obtain (weaker) polynomial bounds (by replacing the use of the concentration inequality with Chebyshev's inequality). We use this slightly stronger condition because exponential concentration improves the dependence on ε, and is typically satisfied in practice.
3 The MAP (maximum a posteriori) is like the MLE except that it takes the prior into account.
logconcave unless the Bayesian prior is strongly logconcave (see section 3.2.4). More-
over, while the second moment in Assumption 3.2.2 decreases with the number of data
points, the strong convexity parameter remains constant even if the prior is strongly
logconcave. Hence, together Assumptions 3.2.1 and 3.2.2 are a weaker condition than
strong convexity and gradient Lipschitzness, the typical setting where the offline al-
gorithm is analyzed. In particular, the assumptions avoid the “ill-conditioned” case
when the distribution becomes more concentrated in one direction than another as
the number of functions 𝑡 increases.
Assumption 3.2.3 is typically satisfied in the setting where the f_t's are iid. For instance, in the case of Gaussian distributions, the maximum a posteriori (MAP) is the mean, and the assumption reduces to the fact that a random walk drifts on the order of √t, and hence the mean drifts by O_T(1/√t), after t time steps. We need this assumption because our algorithm uses cached gradients computed Θ_T(t) time steps ago, and in order for the past gradients to be close in value to the gradient at the current point, the points where the gradients were last calculated should be at distance O_T(1/√t) from the current point. In Section 3.2.4 we show that these assumptions hold for sequences of functions arising in online Bayesian logistic regression; unlike in previous work on related techniques [Nag+17; Cha+18], our assumptions are weak enough to hold for such applications, as they do not require f_0, . . . , f_T to be strongly convex.
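As a sanity check of the scalings in Assumptions 3.2.2 and 3.2.3, consider the toy case f_k(x) = (x − y_k)²/2 with iid Gaussian y_k (our own illustration, not an example from the text): the minimizer x*_t is the running mean of the y_k, π_t is N(x*_t, 1/(t+1)), and the drift ‖x*_t − x*_{2t}‖ scales as 1/√t. A small simulation of the drift:

```python
import random
import statistics

random.seed(1)

# f_k(x) = (x - y_k)^2 / 2 with y_k iid N(0, 1): the minimizer x*_t of
# F_t = sum_{k <= t} f_k is the running mean, so the MAP drift between
# epochs t and 2t has standard deviation 1/sqrt(2t), matching D/sqrt(t+c).
def drift(t, trials=2000):
    """Average |x*_t - x*_{2t}| over independent runs."""
    total = 0.0
    for _ in range(trials):
        y = [random.gauss(0, 1) for _ in range(2 * t)]
        total += abs(statistics.fmean(y[:t]) - statistics.fmean(y))
    return total / trials

d100, d400 = drift(100), drift(400)
print(d100, d400)
# Quadrupling t should roughly halve the drift (1/sqrt(t) scaling).
```

The same 1/√t behavior is what lets Algorithm 4 reuse gradients cached Θ_T(t) steps earlier.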
3.2.2 Result in the online setting
Theorem 3.2.4 (Online variance-reduced SGLD). Suppose that f_0, . . . , f_T : ℝ^d → ℝ are (weakly) convex⁴ and satisfy Assumptions 3.2.1-3.2.3 with c = L_0/L. Then there exist parameters η_0, b, and i_max which are polynomial in d, L, C, D, ε⁻¹ and poly-logarithmic in T, such that at epoch t, Algorithm 4 generates an ε-approximate independent sample X^t from π_t.⁵ Moreover, the total number of gradient evaluations required at each epoch t is polynomial in d, L, C, D, ε⁻¹ and polylogarithmic in T.
4 In fact, it suffices for their sum to be convex.
See Theorem 3.6.5 for a more precise statement with explicit dependencies. Note
that the algorithm needs to know the parameters, but upper bounds on them suffice.
Compared to previous work on the topic, this result is the first to obtain bounds
on the number of gradient evaluations which are polylogarithmic in T at each epoch
(see Table 3.1 where we compare the dependence on 𝑇 of previous results applied
to the online sampling problem). Previous results for the basic Langevin and SGLD
algorithms, as well as for the variance reduced SGLD methods SAGA-LD and CV-
LD [Cha+18] and the online Dikin walk6 [NR17] all imply a bound on the number of
gradient or function7 evaluations at each epoch which is at least linear in 𝑇 . 8 On
the other hand, while polynomial, our result’s dependence on the other parameters
𝑑, 𝐿, 𝐶,D, 𝜀−1 is larger than that of the online Dikin walk and of the Langevin and
SGLD algorithms. We suspect that the order of this polynomial can be improved
with a more careful analysis.
Finally, the results of [Cha+18] require strong convexity, while our result only
requires a much weaker bound on the concentration of the target distribution (As-
sumption 3.2.2). This allows us to obtain bounds for applications such as logistic
regression where the functions 𝑓1, . . . , 𝑓𝑡 may not be strongly convex.
5 See Definition 3.6.1 for the formal definition. Necessarily, ‖ℒ(X^t) − π_t‖_TV ≤ ε.
6 The online Dikin walk reduces to an online version of the Random Walk Metropolis algorithm in our unconstrained setting.
7 In our setting a gradient evaluation can be computed in at worst 2d function evaluations. In many applications (including logistic regression) computing the gradient takes the same number of operations as computing the function.
8 Note that the number of gradient evaluations for the basic Langevin and SGLD algorithms and the online Dikin walk depend multiplicatively on T (i.e., T × poly(d, L, other parameters)), while the number of gradient evaluations for the variance-reduced SGLD methods depend only additively on T (i.e., T + poly(d, L, other parameters)).
Algorithm                        Oracle calls per epoch   Other assumptions
Online Dikin walk [NR17, S5.1]   O_T(T)                   Strong convexity; bounded ratio of distributions
Langevin [DMM18; Dwi+18]         O_T(T)                   -
SGLD [DMM18]                     O_T(T)                   -
SAGA-LD [Cha+18]                 O_T(T)                   Strong convexity; Lipschitz Hessian
CV-ULD [Cha+18]                  O_T(T)                   Strong convexity
This work                        polylog(T)               Bounded second moment; bounded drift of minimizer

Table 3.1: Bounds on the number of gradient (or function) evaluations required by different algorithms to solve the online sampling problem. Lipschitz gradient (smoothness) is assumed for all algorithms. Note that the online Dikin walk was analyzed in [NR17] for a different setting where the target distribution is restricted to a convex polytope; in this table we give the result that one should obtain when the support is ℝ^d. It is therefore possible that the assumptions we give for the online Dikin walk can be weakened.
3.2.3 Result in the offline setting
In the offline setting, we have access to all T functions f_1, . . . , f_T from the beginning
(for notational simplicity, in the rest of the paper we index the f_t's from t = 1 for
the offline setting). Our goal is simply to generate a sample from the single target
distribution π_T(x) ∝ e^{−Σ_{t=1}^T f_t(x)} with TV error ε. Since we do not assume that the
f_t's are given in any particular order, we replace Assumption 3.2.2, which depends on
the order in which the functions are given, with an assumption (Assumption 3.2.5)
on the target function Σ_{t=1}^T f_t(x) which does not depend on the ordering of the f_t's.
Instead of working with the sequence of target distributions π_1, π_2, . . ., which depend
on the ordering of the f_t's, we introduce an inverse temperature parameter β > 0
and consider the distributions π^β_T(x) ∝ e^{−βΣ_{t=1}^T f_t(x)}. In place of Assumption 3.2.2,
we assume the following:
Assumption 3.2.5 (Exponential concentration (with constants A, k > 0)). For all 1/T ≤ β ≤ 1, we have for all s ≥ 0, P_{X∼π^β_T}(‖X − x*‖ ≥ s/√(βT)) ≤ A e^{−ks}.
Assumption 3.2.5 says that the distributions π^β_T become more concentrated as β increases from 1/T to 1. By sampling from a sequence of distributions π^β_T where we gradually increase β from 1/T to 1 at each epoch, our offline algorithm (Algorithm 5) is able to approach the target distribution π_T = π^1_T when starting from a cold start that is far from a sublevel set containing most of the mass of the probability measure of π_T, without requiring strong convexity. Moreover, since scaling by β does not change the location of the minimizer x* of βΣ_{t=1}^T f_t(x), we can drop Assumption 3.2.3.
Theorem 3.2.6 (Offline variance-reduced SGLD). Suppose that f_1, . . . , f_T satisfy Assumptions 3.2.1 and 3.2.5. Then there exist b, η, and i_max which are polynomial in d, L, C, ε⁻¹ and poly-logarithmic in T, such that Algorithm 5 generates a sample X_T such that ‖ℒ(X_T) − π_T‖_TV ≤ ε. Moreover, the total number of gradient evaluations is polylog(T) × poly(d, L, C, D, ε⁻¹) + Õ(T).
See Theorem 3.7.2 for precise dependencies. The theorem could also be stated with an f_0, but we have omitted it for simplicity.
As in the online setting, we do not assume strong convexity. Further, our additive
dependence on 𝑇 in Theorem 3.2.6 is tight up to polylogarithmic factors, since the
number of gradient evaluations needed to sample from a target distribution satisfying
Assumptions 3.2.1-3.2.3 is at least Ω(𝑇 ) because of information theoretic require-
ments.
Compared to previous work in this setting, our results are the first to obtain
an additive dependence on 𝑇 and polynomial dependence on the other parameters
without assuming strong convexity. While the results of [Cha+18] for SAGA-LD and
CV-LD have additive dependence on 𝑇 , their results require the functions 𝑓1, . . . , 𝑓𝑇
to be strongly convex. Since the basic Dikin walk and basic Langevin algorithms
compute all 𝑇 functions or all 𝑇 gradients every time the Markov chain takes a step,
and the number of steps in their Markov chain depends polynomially on the other
parameters such as 𝑑 and 𝐿, the number of gradient (or function) evaluations required
by these algorithms is multiplicative in 𝑇 . Even though the basic SGLD algorithm
Algorithm                        # of oracle calls              Other assumptions
Online Dikin walk [NR17, S5.1]   T × poly(d, L)                 Strong convexity
Langevin [DMM18; Dwi+18]         T × poly(d, L)                 Wasserstein warm start
SGLD [DMM18]                     T × poly(d, L)                 Wasserstein warm start
SAGA-LD [Cha+18]                 T + poly(d, m⁻¹, L, L_H)       Strong convexity
CV-ULD [Cha+18]                  T + poly(d, m⁻¹, L)            Strong convexity
This work                        T + poly(d, C, D, L)           Bounded second moment; bounded drift of minimizer

Table 3.2: Bounds on the number of gradient (or function) evaluations required by different algorithms to solve the offline sampling problem. Lipschitz gradient (smoothness) is assumed for all algorithms.
computes a mini-batch of the gradients at each step, roughly speaking the batch size
at each step of the chain should be at least Ω𝑇 (𝑇 ) for the stochastic gradient to have
the required variance, implying that basic SGLD also has multiplicative dependence
on 𝑇 .
3.2.4 Application to Bayesian logistic regression
Next, we show that Assumptions 3.2.1-3.2.3, and therefore Theorem 3.2.4, hold in the
setting of online Bayesian logistic regression, when the data satisfy certain regularity
properties.
Logistic regression is a fundamental and widely used model in Bayesian statis-
tics [AC93]. It has served as a model problem for methods in scalable Bayesian
inference [WT11; HCB16; CB17; CB18a], of which online sampling is one approach.
Additionally, sampling from the logistic regression posterior is the key step in the
optimal algorithm for online logistic regret minimization [Fos+18].
In Bayesian logistic regression, one models the data (u_t ∈ ℝ^d, y_t ∈ {−1, 1}) as follows: there is some unknown θ_0 ∈ ℝ^d such that given u_t (which is thought of as the independent variable), for all t ∈ {1, . . . , T} the dependent variable y_t follows a Bernoulli logistic distribution with "success" probability φ(u_t^⊤θ_0) (y_t = 1 with probability φ(u_t^⊤θ_0) and −1 otherwise), where φ(x) = 1/(1 + e^{−x}). The Bayesian logistic regression
sampling problem we consider is as follows:
Problem 3.2.1 (Bayesian logistic regression). Suppose the y_t's are generated from the u_t's as Bernoulli random variables with "success" probability φ(u_t^⊤θ_0). At every epoch t ∈ {1, . . . , T}, after observing {(u_k, y_k)}_{k=1}^t, return a sample from the posterior distribution⁹ π_t(θ) ∝ e^{−Σ_{k=0}^t f_k(θ)}, where f_0(θ) := (1/(2α))‖θ‖² and f_k(θ) := −log[φ(y_k u_k^⊤θ)].
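The gradient oracles that an online sampler needs for Problem 3.2.1 are simple closed forms. The sketch below is our own code, not the thesis's implementation; it uses the identity φ′ = φ(1 − φ), so that ∇f_k(θ) = −y_k(1 − φ(y_k u_k^⊤θ))u_k:

```python
import math

def phi(z):
    """Logistic sigmoid (numerically naive for large |z|)."""
    return 1.0 / (1.0 + math.exp(-z))

def grad_f0(theta, alpha):
    # f_0(theta) = ||theta||^2 / (2 * alpha)  (Gaussian prior)
    return [th / alpha for th in theta]

def grad_fk(theta, u, y):
    # f_k(theta) = -log phi(y * <u, theta>), so
    # grad f_k(theta) = -y * (1 - phi(y * <u, theta>)) * u
    z = y * sum(ui * ti for ui, ti in zip(u, theta))
    coef = -y * (1.0 - phi(z))
    return [coef * ui for ui in u]
```

Each oracle call costs O(d) operations, which is what makes the per-epoch budget in Theorem 3.2.2 meaningful.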
We show that under reasonable conditions on the data-generating distribution –
namely, that the inputs are bounded and that we see data in all directions – our
online sampling algorithm, Algorithm 4, succeeds on Bayesian logistic regression.10
Theorem 3.2.2 (Online Bayesian logistic regression). Suppose that ‖θ_0‖ ≤ B for some B > 0, and that the u_t ∼ P_u are iid, where P_u is a distribution that satisfies the following: for u ∼ P_u, (1) for some M > 0, ‖u‖_2 ≤ M with probability 1 (bounded), and (2) E_u[uu^⊤ 1_{|u^⊤θ_0|≤2}] ⪰ σI_d (the "restricted" covariance matrix is bounded away from 0).¹¹ Then for the functions f_0, . . . , f_T in Problem 3.2.1, and any ε > 0, there exist parameters L, log(A), k⁻¹, D = poly(M, σ⁻¹, α, B, d, 1/ε, log(T)) such that Assumptions 3.2.1, 3.2.2, and 3.2.3 hold for all t with probability at least 1 − ε. Therefore Algorithm 4 gives ε-approximate samples from π_t for 1 ≤ t ≤ T with poly(M, σ⁻¹, α, B, d, 1/ε, log(T)) gradient evaluations at each epoch.
Note that our result does not hold if the covariance matrix of the distribution of the
𝑢𝑡’s becomes much more ill-conditioned over time, as is the case in certain applications
of Thompson sampling [Rus+18]. In such applications we would have to add a pre-
conditioner to Algorithm 4 which changes at each epoch.
Our result in the offline case improves upon previous analyses of variance-reduced
SGLD for Bayesian logistic regression, where the number of gradient evaluations has
9 Here we choose a Gaussian prior, but this can be replaced by any e^{−f_0} where f_0 is strongly convex and smooth.
10 For simplicity, we state the result (Theorem 3.2.2) in the case where the input variables u are iid, but note that the result holds more generally (see Lemma C.1.1 for a more general statement of our result).
11 The constant 2 may be replaced by any other constant. For a tighter condition, see the statement of Theorem C.1.2.
multiplicative dependence on 𝑇 [Nag+17]. Our bounds in the offline case only have
additive dependence on 𝑇 .
In Section 3.8 we show that our algorithm achieves competitive accuracy compared
to a Markov chain that is specialized to logistic regression (Polya-Gamma).
3.3 Algorithm and proof techniques
3.3.1 Overview of online algorithm
Algorithm 3 SAGA-LD
Input: Gradient oracles for f_k : ℝ^d → ℝ, for 0 ≤ k ≤ t.
Input: Step size η > 0, batch size b ∈ ℕ, number of steps i_max, initial point X_0.
Input: Cached gradients G^k = ∇f_k(u_k) for some points u_k, and s = Σ_{k=1}^t G^k.
Output: X_{i_max}
1: for i from 0 to i_max − 1 do
2:   (Sample batch) Sample with replacement a (multi)set S of size b from {1, . . . , t}.
3:   (Calculate gradients) For each k ∈ S, let G^k_new = ∇f_k(X_i).
4:   (Variance-reduced gradient estimate) Let g_i = ∇f_0(X_i) + s + (t/b) Σ_{k∈S} (G^k_new − G^k).
5:   (Langevin step) Let X_{i+1} = X_i − ηg_i + √(2η) ξ_i, where ξ_i ∼ N(0, I).
6:   (Update sum) Update s ← s + Σ_{k∈set(S)} (G^k_new − G^k).
7:   (Update gradients) For each k ∈ S, update G^k ← G^k_new.
8: end for
9: Return X_{i_max}.
Given gradient access to the functions f_0, . . . , f_t, at every epoch t = 1, . . . , T, Algorithm 4 generates a point X^t approximately distributed according to π_t ∝ e^{−Σ_{k=0}^t f_k(x)}, by running SAGA-LD, given by Algorithm 3. Algorithm 3 makes the following update at each step of the SGLD Markov chain X_i, for a certain choice of stochastic gradient g_i with E[g_i] = Σ_{k=0}^t ∇f_k(X_i):
$$X_{i+1} = X_i - \eta_t g_i + \sqrt{2\eta_t}\,\xi_i, \qquad \xi_i \sim N(0, I_d). \tag{3.2}$$
Algorithm 4 Online SAGA-LD
Input: T ∈ ℕ and gradient oracles for functions f_t : ℝ^d → ℝ, for all t ∈ {0, . . . , T}, where only the gradient oracles ∇f_0, . . . , ∇f_t are available at epoch t.
Input: step size η_0, batch size b > 0, i_max > 0, constant offset c, acceptance radius C′, an initial point X^0 ∈ ℝ^d.
Output: At each epoch t, a sample X^t
1: Set s = 0. ◁ Initial gradient sum
2: for epoch t = 1 to T do
3:   Set t′ = 2^⌊log₂(t−1)⌋ if t > 1, and t′ = 0 if t = 1. ◁ The previous power of 2
4:   if ‖X^{t−1} − X^{t′}‖ ≤ C′/√(t+c) then X^t_0 ← X^{t−1} ◁ If the previous sample hasn't drifted too far, use the previous sample as warm start
5:   else X^t_0 ← X^{t′} ◁ If the previous sample has drifted too far, reset to the sample at time t′
6:   end if
7:   G^t ← ∇f_t(X^t_0)
8:   s ← s + G^t.
9:   For all gradients G^k = ∇f_k(u_k) which were last updated at time t/2, replace them by ∇f_k(X^t_0) and update s accordingly.
10:  Draw i_t uniformly from {1, . . . , i_max}.
11:  Run Algorithm 3 with step size η_0/(t+c), batch size b, number of steps i_t, initial point X^t_0, and precomputed gradients G^k with sum s. Keep track of when the gradients are updated.
12:  Return the output X^t = X^t_{i_t} of Algorithm 3.
13: end for
Key to this algorithm is the construction of the variance-reduced stochastic gradient g_i. It is constructed by taking the sum of the gradients at previous points in the Markov chain and then correcting it with a batch. Roughly, we show that with high probability the previous points at which each gradient in the batch was computed are within Õ_T(1/√t) of x*_t.
Our main theorem, Theorem 3.2.4, says that to obtain a fixed TV error ε for each sample, the number of steps i_max at each epoch and the batch size b only need to be poly-logarithmic in T.
The algorithm takes as input the parameter η_0 > 0, which determines the step size η_t of the Langevin dynamics Markov chain. Assumption 3.2.2 says that the variance of the target distribution decreases at the rate C²/(t+c). To ensure that the variance of each step of Langevin dynamics decreases at roughly the same rate as the variance of the target distribution π_t, we therefore set the step size to be η_t = η_0/(t+c). With this step size, the Markov chain can travel across a sub-level set containing most of the probability measure of π_t in roughly the same number i_max = Õ_T(1) of steps at each epoch t. We will take the acceptance radius to be C′ = 2.5(C_1 + D), where C_1 is given by (3.66), and show that with good probability this choice of C′ ensures ‖X^{t−1} − X^{t′}‖ ≤ 4(C_1 + D)/√(t+c) in Algorithm 4.
3.3.2 Overview of offline algorithm
Similarly to the online Algorithm 4, our offline Algorithm 5 also calls the variance-
reduced SGLD Algorithm 3 multiple times. In the offline setting, all the functions
𝑓1, . . . , 𝑓𝑇 are given from the start, so there is no need to run Algorithm 3 on subsets
of the functions. Instead, we run SAGA-LD on 𝛽𝑓₁, …, 𝛽𝑓_𝑇, where 𝛽 is the inverse temperature and is doubled at each epoch, from roughly 𝛽 = 1/𝑇 to 𝛽 = 1. There are logarithmically many epochs, and each epoch takes 𝑖max = Õ_𝑇(1) Markov chain steps. Note that we cannot just run SAGA-LD on 𝑓₁, …, 𝑓_𝑇. The temperature schedule is necessary because we only assume a cold start; in order for our variance-reduced SGLD to work, the initial starting point must be Õ_𝑇(1/√𝑇), rather than Õ_𝑇(1), away from the minimum. The temperature schedule helps us get there by roughly halving the distance to the minimum at each epoch; the step sizes are also halved at each epoch.
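The inverse-temperature schedule can be sketched in a few lines (our own illustration of the doubling described above):

```python
def inverse_temperature_schedule(T):
    """Inverse temperatures used by the offline algorithm: beta doubles
    each epoch from 1/T until it reaches 1, giving O(log2 T) epochs."""
    beta, schedule = 1.0 / T, []
    while beta < 1.0:
        schedule.append(beta)
        beta = min(2.0 * beta, 1.0)
    schedule.append(1.0)
    return schedule

betas = inverse_temperature_schedule(8)   # [0.125, 0.25, 0.5, 1.0]
```

For 𝑇 = 8 this produces log₂(𝑇) + 1 = 4 epochs, matching the logarithmic epoch count claimed above.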
3.4 Proof overview
3.4.1 Online problem
Algorithm 5 Offline variance-reduced SGLD
Input: 𝑇 ∈ N and gradient oracles for functions 𝑓_𝑡 : R^𝑑 → R, 1 ≤ 𝑡 ≤ 𝑇.
Input: step size 𝜂, batch size 𝑏 > 0, 𝑖max > 0, an initial point 𝑋_0 ∈ R^𝑑.
Output: A sample 𝑋.
1: 𝑋 ← 𝑋_0
2: Set 𝛽 = 1/𝑇. ◁ Start at a high temperature, 𝑇.
3: while 𝛽 < 1 do
4: Run Algorithm 3 with step size 𝜂/(𝛽𝑇), batch size 𝑏, number of steps 𝑖max, initial point 𝑋, and functions 𝛽𝑓_𝑡, 1 ≤ 𝑡 ≤ 𝑇.
5: Set 𝑋 ← 𝑋_𝛽, where 𝑋_𝛽 is the output of Algorithm 3.
6: 𝛽 ← min{2𝛽, 1}. ◁ Double the inverse temperature.
7: end while
8: Return 𝑋.

For the online problem, information-theoretic constraints require us to use the "information" from at least Ω(𝑡) gradients in order to sample with fixed TV error at the 𝑡th epoch. Thus, in order to use only Õ_𝑇(1) gradients at each epoch, we must
reuse gradient information from past epochs. We accomplish this by reusing gradients
computed at points in the Markov chain, including points at past epochs. This saves
a crucial factor of 𝑇 over naive SGLD, but only if we can show that these past points
in the Markov chain track the mode of the distribution, and that our Markov chain
also stays close to the mode (Lemma 3.6.1).
The distribution is concentrated to 𝑂_𝑇(1/√𝑡) at the 𝑡th epoch (Assumption 3.2.2), and we need the Markov chain to stay within Õ_𝑇(1/√𝑡) of the mode. The bulk of the
proof (Lemma 3.6.2) is to show that with large probability the Markov chain stays
within this ball. Once we establish that the Markov chain stays close, we combine
our bounds with existing results on SGLD from [DMM18] to show that we only need Õ_𝑇(1) steps per epoch (Lemma 3.6.4). Finally, an induction with careful choice of constants finishes the proof (Theorem 3.6.5). Details of each of these steps follow.
Bounding the variance of the stochastic gradient (see Lemma 3.6.1). We
reduce the variance of our stochastic gradient by using the gradient evaluated a past
point 𝑢𝑘 and estimating the difference in the gradients between our current point
𝑋 𝑡𝑖 and the past point 𝑢𝑘. Using the 𝐿-Lipschitz property (Assumption 3.2.1) of
the gradients, we show that the variance of this stochastic gradient is bounded by
(𝑡²/𝑏)𝐿² max_𝑘 ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖². To obtain this bound, observe that the individual components {∇𝑓_𝑘(𝑋^𝑡_𝑖) − ∇𝑓_𝑘(𝑢_𝑘)}_{𝑘∈𝑆} of the stochastic gradient 𝑔^𝑡_𝑖 have variance at most 𝑡²𝐿² max_𝑘 ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖² by the Lipschitz property. Averaging with a batch saves a factor of 𝑏.
For the number of gradient evaluations to stay nearly constant at each step, in-
creasing the batch size is not a viable option to decrease the variance of our stochastic
gradient. Rather, if we can show that ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖ decreases as ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖ = Õ_𝑇(1/√𝑡),
the variance of our stochastic gradient will decrease at each epoch at the desired rate.
Bounding the escape time from a ball where the stochastic gradient has
low variance (see Lemma 3.6.2). Our main challenge is to bound the distance
‖𝑋𝑖 − 𝑢𝑘‖. Because we do not assume that the target distribution is strongly con-
vex, we cannot use proof techniques of past papers analyzing variance-reduced SGLD
methods. [Cha+18; Nag+17] used strong convexity to show that with high prob-
ability, the Markov chain does not travel too far from its initial point, implying a
bound on the variance of their stochastic gradients. Unfortunately, many important
applications, including logistic regression, lack strong convexity.
To deal with the lack of strong convexity, we instead use a martingale exit time
argument to show that the Markov chain remains inside a ball of radius 𝑟 = Õ_𝑇(1/√𝑡)
with high probability for a large enough time 𝑖max for the Markov chain to reach a
point within TV distance 𝜀 of the target distribution. Towards this end, we would like
to bound the distance from the current state of the Markov chain to the mode, ‖𝑋^𝑡_𝑖 − 𝑥⋆_𝑡‖, by Õ_𝑇(1/√𝑡), and to bound ‖𝑥⋆_𝑡 − 𝑢_𝑘‖ by Õ_𝑇(1/√𝑡). Together, this allows us to bound the distance ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖ = Õ_𝑇(1/√𝑡). We can then use this bound together with Lemma 3.6.1 to bound the variance of the stochastic gradient by roughly Õ_𝑇(1)·𝑡.
Bounding ‖𝑥⋆_𝑡 − 𝑢_𝑘‖. Since 𝑢_𝑘 is a point of the Markov chain, possibly at a previous epoch 𝜏 ≤ 𝑡, roughly speaking we can bound this distance inductively by using bounds obtained at the previous epoch 𝜏 (Theorem 3.6.5 and Lemma 3.6.4). Noting that 𝑢_𝑘 = 𝑋^𝜏_𝑖 for some 𝑖 ≤ 𝑖max, we use the bound ‖𝑢_𝑘 − 𝑥⋆_𝜏‖ = 𝑂_𝑇(1/√𝜏) = 𝑂_𝑇(1/√𝑡) obtained at the previous epoch 𝜏, together with Assumption 3.2.3, which says that ‖𝑥⋆_𝑡 − 𝑥⋆_𝜏‖ = 𝑂_𝑇(1/√𝑡), to bound ‖𝑥⋆_𝑡 − 𝑢_𝑘‖.
Bounding ‖𝑋^𝑡_𝑖 − 𝑥⋆_𝑡‖. To bound the distance 𝜌_𝑖 := ‖𝑋^𝑡_𝑖 − 𝑥⋆_𝑡‖ to the mode, we would like to bound the increase 𝜌_{𝑖+1} − 𝜌_𝑖 at each step 𝑖 in the Markov chain. We use a martingale exit time argument on ‖𝑋^𝑡_𝑖 − 𝑥⋆_𝑡‖², the squared distance from the current
state of the Markov chain to the mode. The advantage in using the squared distance
is that the expected increase in the squared distance due to the Gaussian noise term √(2𝜂_𝑡)𝜉_𝑖 in the Markov chain update rule (equation (3.2)) is the same regardless of the current position of the Markov chain, allowing us to obtain tighter bounds on the increase.
To bound the component of the increase in ‖𝑋^𝑡_𝑖 − 𝑥⋆_𝑡‖² that is due to the gradient term −𝜂_𝑡𝑔_𝑖, we use weak convexity. By weak convexity, the (negative) gradient
never points away from the mode, meaning that, roughly speaking, the mean of the
stochastic gradient term in the Langevin Markov chain update does not increase the
squared distance to the mode. Any increase in the distance from the mode is due to
the Gaussian noise term √(2𝜂_𝑡)𝜉_𝑖 or to the error term 𝑔_𝑖 − ∇𝐹_𝑡(𝑋^𝑡_𝑖) in the stochastic
gradient, both of which have mean zero and are independent of previous steps in
the Markov chain. We then apply Azuma’s martingale concentration inequalities to
bound the exit time from the ball. This shows that the Markov chain remains at distance roughly Õ_𝑇(1/√𝑡) from the mode.
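The following toy simulation (ours, with a simple quadratic potential, not code from the thesis) illustrates the mechanism: for a convex 𝐹 the gradient term never pushes the chain away from the mode, so the squared distance grows only through the Gaussian noise, whose expected per-step contribution does not depend on the current position:

```python
import numpy as np

# Langevin chain for F(x) = 0.5*||x||^2 with mode x* = 0. The drift term
# -eta*X always points toward the mode; the sqrt(2*eta)*xi noise term adds
# roughly 2*eta*d to the squared distance in expectation, independent of X.
rng = np.random.default_rng(1)
d, eta, n_steps = 4, 0.01, 20000
X = np.zeros(d)
sq_dists = []
for _ in range(n_steps):
    X = X - eta * X + np.sqrt(2 * eta) * rng.standard_normal(d)
    sq_dists.append(X @ X)
# At stationarity X is approximately N(0, I), so E||X||^2 is close to d.
mean_sq = np.mean(sq_dists[n_steps // 2:])
assert 0.5 * d < mean_sq < 2 * d
```

The chain equilibrates at squared distance of order 𝑑 rather than drifting off, which is the qualitative behavior the martingale argument quantifies.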
Bounding the TV error (Lemma 3.6.4). We now show that if 𝑢_𝑘 is close to 𝑥⋆_𝜏, then 𝑋^𝑡 will be a good sample from 𝜋_𝑡. More precisely, we show that if at epoch 𝑡 the Markov chain starts at 𝑋^𝑡_0 such that ‖𝑋^𝑡_0 − 𝑥⋆_𝜏‖ ≤ R/√(𝑡+𝑐) (R to be chosen later), then

‖ℒ(𝑋^𝑡_{𝑖max}) − 𝜋_𝑡‖_TV ≤ 𝑂(𝜀/log₂(𝑇)).
To do this, we will use two bounds: a bound on the Wasserstein distance between the initial point 𝑋^𝑡_0 and the target density 𝜋_𝑡, and a bound on the variance of the stochastic gradient. We then plug the bounds into Corollary 18 of [DMM18] (reproduced as Lemma 3.6.3).
Firstly, to bound the initial Wasserstein distance, note by the triangle inequality that 𝑊₂(𝛿_{𝑋^𝑡_0}, 𝜋_𝑡) = 𝑂(‖𝑋^𝑡_0 − 𝑥⋆_𝜏‖ + ‖𝑥⋆_𝜏 − 𝑥⋆_𝑡‖ + 𝑊₂(𝛿_{𝑥⋆_𝑡}, 𝜋_𝑡)). The first term can be bounded by the fact that the algorithm "resets" 𝑋^𝑡_0 if it has drifted too far from its position at step 𝜏. The second term is bounded by D/√(𝜏+𝑐) (by the drift assumption, Assumption 3.2.3), and the third term by 𝐶/√(𝑡+𝑐) (by a bound on the second moment, from Assumption 3.2.2). Thus 𝑊₂²(𝛿_{𝑋^𝑡_0}, 𝜋_𝑡) = Õ_𝑇(1/𝑡).
Secondly, we can apply the variance bound (Lemma 3.6.1) to the Markov chain.
By the bound on the escape time from the ball (Lemma 3.6.2), with high probability
the chain stays within Õ_𝑇(1/√𝑡) of the mode. Lemma 3.6.1 then tells us that the variance satisfies 𝜎²_𝑡 = E[‖𝑔^𝑡_𝑖 − ∇𝐹_𝑡(𝑋^𝑡_𝑖)‖²] ≤ (𝑡²/𝑏)𝐿² max_𝑘 ‖𝑋^𝑡_𝑖 − 𝑢_𝑘‖² = Õ_𝑇(1)·𝑡.
The result from [DMM18] then says that we can get a fixed KL-error 𝜀 with

𝑖max = 𝑂_{𝜀,𝑇}(𝑊₂²(𝛿_{𝑋^𝑡_0}, 𝜋_𝑡) · 𝜎²_𝑡 · poly(1/𝜀)) = Õ_{𝜀,𝑇}((1/𝑡) · 𝑡 · poly(1/𝜀)) = Õ_{𝜀,𝑇}(poly(1/𝜀))

steps per epoch. Finally, Pinsker's inequality bounds the TV-error by the KL-error.
These bounds allow us to prove by induction (through a union bound) that with
high probability, ‖𝑋^𝑡 − 𝑥⋆_𝑡‖ is small whenever 𝑡 is a power of 2 (which we need for restarts when the samples drift too far away) and that 𝑋^𝑠_𝑖 never drifts too far from the current mode 𝑥⋆_𝑠, for any 𝑖, 𝑠, and hence get a TV-error bound at each epoch.
Bounding the number of gradient evaluations at each epoch (Theorem 3.6.5). Working out the constants, we see that it suffices to have 𝑖max = poly(𝑑, 𝐿, 𝐶, D, 𝜀⁻¹, log(𝑇)) to obtain TV-error 𝜀 at each epoch. A constant batch size suffices, so the total number of gradient evaluations is 𝑂(𝑖max𝑏) = poly(𝑑, 𝐿, 𝐶, D, 𝜀⁻¹, log(𝑇)).
3.4.2 Offline problem
For the offline problem, the desired result – sampling from 𝜋_𝑇 with TV error 𝜀 using Õ(𝑇) + poly(𝑑, 𝐿, 𝐶, 𝜀⁻¹) log₂(𝑇) gradient evaluations – is known either when we assume strong convexity, or when we have a warm start. We show how to achieve the same additive bound without either assumption.
Without strong convexity, we do not have access to a Lyapunov function which
guarantees that the distance between the Markov chain and the mode 𝑥⋆ of the target
distribution contracts at each step, even from a cold start. To get around this problem,
we sample from a sequence of log₂(𝑇) distributions 𝜋^𝛽_𝑇 ∝ 𝑒^{−𝛽 ∑_{𝑡=1}^𝑇 𝑓_𝑡(𝑥)}, where the inverse "temperature" 𝛽 doubles at each epoch from 1/𝑇 to 1, causing the distribution 𝜋^𝛽_𝑇 to have a decreasing second moment and to become more "concentrated" about
the mode 𝑥⋆ at each epoch. This temperature schedule allows our algorithm to
gradually approach the target distribution, even though our algorithm is initialized
from a cold start 𝑥0 which may be far from a sub-level set containing most of the
target probability measure. The same martingale exit time argument as in the proof
for the online problem shows that at the end of each epoch, the Markov chain is at a
distance from 𝑥⋆ comparable to the (square root of the) second moment of the current
distribution 𝜋^𝛽_𝑇. This provides a "warm start" for the next distribution 𝜋^{2𝛽}_𝑇, and in this way our Markov chain approaches the target distribution 𝜋^1_𝑇 in log₂(𝑇) epochs.
The total number of gradient evaluations is therefore roughly 𝑇 log₂(𝑇) + 𝑏 · 𝑖max per run, since we only compute the full gradient at the beginning of each of the log₂(𝑇) epochs, and then use only a batch of size 𝑏 for the gradient steps at each of the 𝑖max steps of the Markov chain. As in the online case, 𝑏 and 𝑖max are polylogarithmic in 𝑇 and polynomial in the various parameters 𝑑, 𝐿, 𝐶, 𝜀⁻¹, implying that the total number of gradient evaluations is Õ(𝑇) + poly(𝑑, 𝐶, D, 𝜀⁻¹, 𝐿) log₂(𝑇) in the offline setting, where our goal is only to sample from 𝜋^1_𝑇.
The proof of Theorem 3.2.6 is similar to the proof of Theorem 3.2.4, except for some differences in how the stochastic gradients are computed and how one defines the functions "𝐹_𝑡". We define 𝐹_𝑡 := 𝛽_𝑡 ∑_{𝑘=1}^𝑇 𝑓_𝑘, where

𝛽_𝑡 = 2^{𝑡−1}/𝑇 for 1 ≤ 𝑡 ≤ ⌈log₂(𝑇)⌉, and 𝛽_𝑡 = 1 for 𝑡 = ⌈log₂(𝑇)⌉ + 1.

We then show that for this choice of 𝐹_𝑡, the offline assumptions, proof, and algorithm are similar to those of the online case.
3.5 Related work
Online convex optimization. Our motivation for studying the online sampling
problem comes partly from the successes of online (convex) optimization. (For a
survey, see [Haz16].) In online convex optimization, one chooses a point 𝑥_𝑡 ∈ 𝐾 at each step and suffers a loss 𝑓_𝑡(𝑥_𝑡), where 𝐾 is a compact convex set and 𝑓_𝑡 : 𝐾 → R is a convex function [Zin03]. The aim is to minimize the regret compared to the best point in hindsight, where Regret_𝑇 = ∑_{𝑡=1}^𝑇 𝑓_𝑡(𝑥_𝑡) − min_{𝑥*} ∑_{𝑡=1}^𝑇 𝑓_𝑡(𝑥*). The same
algorithms for offline convex optimization (gradient descent, Newton’s method) can
be adapted essentially without change to the online setting, giving square-root regret
in the smooth setting [Zin03] and logarithmic regret in the strongly-convex setting
[HAK07].
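As a point of comparison, online gradient descent achieving 𝑂(√𝑇) regret can be sketched as follows (an illustrative toy with made-up quadratic losses, not code from the thesis):

```python
import numpy as np

# Online gradient descent: play x_t, observe loss f_t(x) = 0.5*(x - c_t)^2,
# then step against the observed gradient with step size ~ 1/sqrt(t).
rng = np.random.default_rng(2)
T = 2000
targets = rng.uniform(-1, 1, size=T)
x, losses = 0.0, []
for t, c in enumerate(targets, start=1):
    losses.append(0.5 * (x - c) ** 2)
    x -= (1.0 / np.sqrt(t)) * (x - c)          # gradient step
# Regret against the best fixed point in hindsight (grid search over [-1, 1]):
best = min(sum(0.5 * (c - u) ** 2 for c in targets) for u in np.linspace(-1, 1, 201))
regret = sum(losses) - best
assert regret < 10 * np.sqrt(T)                # sublinear regret
```

The same one-pass structure is what the online sampling problem tries to mimic, with a sample per epoch in place of a played point.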
Online sampling. To the best of our knowledge, all previous algorithms with prov-
able guarantees in our setting require computation time that grows polynomially with
𝑡. This is because any Markov chain which takes all the previous data into account
needs Ω𝑇 (𝑡) gradient evaluations per step. On the other hand, there are many stream-
ing algorithms that are used in practice which lack provable guarantees, or which rely
on properties of the data (such as compressibility).
The most relevant theoretical work in our direction is [NR17]. The authors con-
sider a changing log-concave distribution on a convex body, and show that under
certain conditions, they can use the previous sample as a warm start, and hence only
take a constant number of steps of their Markov chain (the Dikin walk) at each stage.
They use a zeroth-order, rather than a first-order (gradient) method.
[NR17] consider the online sampling problem in the more general setting where
the distribution is restricted to a convex body. However, they do not achieve the
optimal results in our setting, as we explain below. Firstly, they do not separately consider the case when 𝐹_𝑡(𝑥) = ∑_{𝑘=0}^𝑡 𝑓_𝑘(𝑥) has a sum structure. Any method which treats 𝐹_𝑡(𝑥) = ∑_{𝑘=0}^𝑡 𝑓_𝑘(𝑥) as a black box (and hence does not utilize the sum structure) and takes at least one step per epoch will require Ω(𝑡) gradient evaluations at epoch 𝑡. Secondly, they do not consider how concentration properties of the distribution
translate into more efficient sampling. When the 𝑓𝑡 are linear, their algorithm needs
𝑂𝑇 (1) steps per epoch and 𝑂𝑇 (𝑡) gradient evaluations per epoch. However, in the
general convex setting where the 𝑓𝑡’s are smooth, the algorithm needs 𝑂𝑇 (𝑡) steps per
epoch, and 𝑂𝑇 (𝑡2) gradient evaluations per epoch. An increased number of steps here
may be inevitable because the distribution could concentrate unequally in different
directions; it could have an ill-conditioned covariance matrix, with condition number growing with 𝑡.
We believe that with a concentration result such as Assumption 3.2.2 (for the mode
inside the convex body), their techniques can be used to show that only 𝑂𝑇 (1) steps
and 𝑂𝑇 (𝑡) gradient evaluations are necessary per epoch.
There are many other online sampling methods, and other approaches used to es-
timate changing probability distributions, used in practice. The Laplace approxima-
tion, perhaps the simplest, approximates the posterior distribution with a Gaussian
[BDT16]; however, most distributions cannot be well-approximated by Gaussians.
Stochastic gradient Langevin dynamics [WT11] can be used in an online setting;
however, it suffers from large variance which we address in this work. The particle
filter [DHW12; GD17] is a general algorithm to track a changing distribution. An-
other popular approach (besides sampling) to estimating a probability distribution is
variational inference, which has also been considered in an online setting ([WPB11], [Bro+13]).
Variance reduction techniques. Variance reduction techniques for SGLD were
initially proposed in [Dub+16], for sampling from a fixed distribution 𝜋 ∝ 𝑒^{−∑_{𝑡=0}^𝑇 𝑓_𝑡}.
[Dub+16] propose two variance-reduced SGLD techniques, CV-ULD and SAGA-LD.
CV-ULD re-computes the full gradient ∇𝐹 at an “anchor” point every 𝑟 steps and
updates the gradient at intermediate steps by subsampling the difference in the gradi-
ents between the current point and the anchor point. SAGA-LD, on the other hand,
keeps track of when each gradient ∇𝑓𝑡 was computed, and updates individual gradi-
ents with respect to when they were last computed. [Cha+18] show that CV-ULD can sample in the offline problem in roughly 𝑇 + (𝐿/𝑚)⁶ 𝑑²/𝜀 gradient evaluations, and that SAGA-LD can sample in 𝑇 + 𝑇(𝐿/𝑚)^{3/2} (√𝑑/𝜀)(1 + 𝐿_𝐻) gradient evaluations, where 𝐿_𝐻 is the Lipschitz constant of the Hessian of − log(𝜋).¹²
3.6 Proof of online theorem (Theorem 3.2.4)
First we formally define what we mean by “almost independent”.
Definition 3.6.1. We say that 𝑋^1, …, 𝑋^𝑇 are 𝜀-approximate independent samples from probability distributions 𝜋_1, …, 𝜋_𝑇 if, for independent random variables 𝑌^𝑡 ∼ 𝜋_𝑡, there exists a coupling between (𝑋^1, …, 𝑋^𝑇) and (𝑌^1, …, 𝑌^𝑇) such that for each 𝑡 ∈ [1, 𝑇], 𝑋^𝑡 = 𝑌^𝑡 with probability at least 1 − 𝜀.
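For intuition, the coupling in this definition is exactly what a maximal coupling provides: for two distributions at TV distance 𝜀, one can sample a pair that agrees with probability 1 − 𝜀. Here is a small discrete demonstration (our own illustration, using made-up distributions):

```python
import numpy as np

def maximal_coupling(p, q, rng):
    """Sample (X, Y) with X ~ p, Y ~ q, maximizing P(X == Y).
    P(X != Y) equals the total variation distance between p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    overlap = np.minimum(p, q)
    w = overlap.sum()                          # = 1 - TV(p, q)
    if rng.random() < w:
        x = rng.choice(len(p), p=overlap / w)
        return x, x                            # coupled: X == Y
    rp = (p - overlap) / (1 - w)               # residual mass of p
    rq = (q - overlap) / (1 - w)               # residual mass of q
    return rng.choice(len(p), p=rp), rng.choice(len(q), p=rq)

rng = np.random.default_rng(3)
p, q = [0.5, 0.5, 0.0], [0.4, 0.4, 0.2]
tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))          # = 0.2
agree = np.mean([x == y for x, y in
                 (maximal_coupling(p, q, rng) for _ in range(5000))])
assert abs(agree - (1 - tv)) < 0.05
```

In the thesis's setting, the coupling comes from the TV bound (3.48) at each epoch rather than from an explicit construction.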
¹²Note that the bounds of [Cha+18] are given for sampling within a specified Wasserstein error, not TV error. The bounds we give here are the number of gradient evaluations one would need to sample with Wasserstein error 𝜀, which roughly corresponds to TV error 𝜀; if there are 𝑇 strongly convex functions, roughly speaking, one requires Wasserstein error 𝑂(𝜀/√𝑇) to sample with TV error 𝜀.
3.6.1 Bounding the variance of the stochastic gradient
We first show that the variance reduction in Algorithm 4 reduces the variance from the order of 𝑡² to 𝑡² ‖𝑥 − 𝑥′‖², where 𝑥′ is a past point. This will be on the order of 𝑡 if we can ensure ‖𝑥 − 𝑥′‖ = 𝑂_𝑇(1/√𝑡). Later, we will bound the probability of the bad event that ‖𝑥 − 𝑥′‖ becomes too large.
Lemma 3.6.1. Fix 𝑥 and {𝑢_𝑘}_{1≤𝑘≤𝑡}, and let 𝑆 be a multiset of size 𝑏 chosen with replacement from {1, …, 𝑡}. Let

𝑔^𝑡 = ∇𝑓_0(𝑥) + [∑_{𝑘=1}^𝑡 ∇𝑓_𝑘(𝑢_𝑘)] + (𝑡/𝑏) ∑_{𝑘∈𝑆} [∇𝑓_𝑘(𝑥) − ∇𝑓_𝑘(𝑢_𝑘)]. (3.3)

Then

‖𝑔^𝑡 − ∑_{𝑘=0}^𝑡 ∇𝑓_𝑘(𝑥)‖² ≤ 4𝑡²𝐿² max_𝑘 ‖𝑥 − 𝑢_𝑘‖², (3.4)

E‖𝑔^𝑡 − ∑_{𝑘=0}^𝑡 ∇𝑓_𝑘(𝑥)‖² ≤ (𝑡²/𝑏)𝐿² ((1/𝑡) ∑_{𝑘=1}^𝑡 ‖𝑥 − 𝑢_𝑘‖²) ≤ (𝑡²/𝑏)𝐿² max_𝑘 ‖𝑥 − 𝑢_𝑘‖². (3.5)
Proof. For the first part,

‖𝑔^𝑡 − ∑_{𝑘=0}^𝑡 ∇𝑓_𝑘(𝑥)‖² = ‖∑_{𝑘=1}^𝑡 [∇𝑓_𝑘(𝑢_𝑘) − ∇𝑓_𝑘(𝑥)] + (𝑡/𝑏) ∑_{𝑘∈𝑆} [∇𝑓_𝑘(𝑥) − ∇𝑓_𝑘(𝑢_𝑘)]‖² (3.6)
≤ (𝐿 ∑_{𝑘=1}^𝑡 ‖𝑢_𝑘 − 𝑥‖ + (𝑡/𝑏)𝐿 ∑_{𝑘∈𝑆} ‖𝑢_𝑘 − 𝑥‖)² (3.7)
≤ 4𝑡²𝐿² max_𝑘 ‖𝑢_𝑘 − 𝑥‖². (3.8)

For the second part, let 𝑉 be the random variable given by

𝑉 = (𝑡/𝑏) [(∇𝑓_𝑘(𝑢_𝑘) − ∇𝑓_𝑘(𝑥)) − E_{𝑘∈[𝑡]} [∇𝑓_𝑘(𝑢_𝑘) − ∇𝑓_𝑘(𝑥)]] (3.9)

where 𝑘 ∈ [𝑡] is chosen uniformly at random. Let 𝑉_1, …, 𝑉_𝑏 be independent draws of 𝑉. Because the 𝑉_𝑗 are independent,

E‖𝑔^𝑡 − ∑_{𝑘=0}^𝑡 ∇𝑓_𝑘(𝑥)‖² = E‖∑_{𝑗=1}^𝑏 𝑉_𝑗‖² = tr(E[(∑_{𝑗=1}^𝑏 𝑉_𝑗)(∑_{𝑗=1}^𝑏 𝑉_𝑗)^⊤]) (3.10)
= tr(E ∑_{𝑗=1}^𝑏 𝑉_𝑗𝑉_𝑗^⊤) = ∑_{𝑗=1}^𝑏 E[tr(𝑉_𝑗𝑉_𝑗^⊤)] = 𝑏 E[‖𝑉‖²]. (3.11)

We calculate

E[‖𝑉‖²] = (𝑡²/𝑏²) Var_{𝑘∈[𝑡]}(∇𝑓_𝑘(𝑢_𝑘) − ∇𝑓_𝑘(𝑥)) (3.12)
≤ (𝑡²/𝑏²) E_{𝑘∈[𝑡]}[‖∇𝑓_𝑘(𝑢_𝑘) − ∇𝑓_𝑘(𝑥)‖²] (3.13)
≤ (𝑡²/𝑏²) 𝐿² max_𝑘 ‖𝑥 − 𝑢_𝑘‖². (3.14)

Combining (3.11) and (3.14) gives the result.
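A quick Monte Carlo sanity check of the bound (3.5) is easy to run. The setup below (quadratic 𝑓_𝑘, so gradients are 1-Lipschitz and 𝐿 = 1, with randomly perturbed past points) is our own, chosen only to make the bound testable:

```python
import numpy as np

rng = np.random.default_rng(4)
t, b, d, L = 8, 4, 3, 1.0
centers = rng.standard_normal((t + 1, d))
grad = lambda k, x: x - centers[k]             # gradient of 0.5*||x - c_k||^2
x = rng.standard_normal(d)
u = x + 0.1 * rng.standard_normal((t + 1, d))  # past points near x
full = sum(grad(k, x) for k in range(t + 1))   # true gradient of F_t

errs = []
for _ in range(4000):
    S = rng.integers(1, t + 1, size=b)         # batch with replacement
    g = (grad(0, x) + sum(grad(k, u[k]) for k in range(1, t + 1))
         + (t / b) * sum(grad(k, x) - grad(k, u[k]) for k in S))
    errs.append(np.sum((g - full) ** 2))

bound = (t ** 2 / b) * L ** 2 * max(np.sum((x - u[k]) ** 2) for k in range(1, t + 1))
assert np.mean(errs) <= bound * 1.05           # (3.5) holds up to MC noise
```

The empirical mean squared error sits below the bound, with slack coming from the step (3.13), which replaces a variance by a second moment.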
3.6.2 Bounding the escape time from a ball
Lemma 3.6.2. Suppose that the following hold:

1. 𝐹 : R^𝑑 → R is convex, differentiable, and 𝐿-smooth, with a minimizer 𝑥⋆ ∈ R^𝑑.

2. 𝜁_𝑖 is a random variable depending only on 𝑋_0, …, 𝑋_𝑖 such that E[𝜁_𝑖 | 𝑋_0, …, 𝑋_𝑖] = 0, and whenever ‖𝑋_𝑗 − 𝑥⋆‖ ≤ 𝑟 for all 𝑗 ≤ 𝑖, ‖𝜁_𝑖‖ ≤ 𝑆.

Let 𝑋_0 be such that ‖𝑋_0 − 𝑥⋆‖ ≤ 𝑟 and define 𝑋_𝑖 recursively by

𝑋_{𝑖+1} = 𝑋_𝑖 − 𝜂𝑔_𝑖 + √𝜂 𝜉_𝑖 (3.15)
where 𝑔_𝑖 = ∇𝐹(𝑋_𝑖) + 𝜁_𝑖 (3.16)
𝜉_𝑖 ∼ 𝑁(0, 𝐼_𝑑) (3.17)

and define the event 𝐺 := {‖𝑋_𝑗 − 𝑥⋆‖ ≤ 𝑟 ∀ 1 ≤ 𝑗 ≤ 𝑖max}. Then for 𝑟² > ‖𝑋_0 − 𝑥⋆‖² + 𝑖max[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑] and 𝐶_𝜉 ≥ √(2𝑑),

P(𝐺^𝑐) ≤ 𝑖max [exp(−(𝑟² − ‖𝑋_0 − 𝑥⋆‖² − 𝑖max[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑])² / (2(2𝜂𝑆𝑟 + 2√𝜂 𝐶_𝜉(𝑟 + 𝜂𝑆 + 𝜂𝐿𝑟) + 𝜂𝐶_𝜉²)²)) + exp(−(𝐶_𝜉² − 𝑑)/8)]. (3.18)
Proof. Note that if ‖𝑥 − 𝑥⋆‖ ≤ 𝑟, then because 𝐹 is 𝐿-smooth, ‖∇𝐹(𝑥)‖ ≤ 𝐿‖𝑥 − 𝑥⋆‖ ≤ 𝐿𝑟. If ‖𝑋_𝑖 − 𝑥⋆‖ ≤ 𝑟, then

‖𝑋_{𝑖+1} − 𝑥⋆‖² − ‖𝑋_𝑖 − 𝑥⋆‖² (3.19)
= ‖𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖 + √𝜂 𝜉_𝑖‖² − ‖𝑋_𝑖 − 𝑥⋆‖² (3.20)
= −2𝜂⟨𝑔_𝑖, 𝑋_𝑖 − 𝑥⋆⟩ + 𝜂²‖𝑔_𝑖‖² + 2√𝜂⟨𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉_𝑖⟩ + 𝜂‖𝜉_𝑖‖² (3.21)
= −2𝜂⟨∇𝐹(𝑋_𝑖), 𝑋_𝑖 − 𝑥⋆⟩ [≤ 0 by convexity] − 2𝜂⟨𝜁_𝑖, 𝑋_𝑖 − 𝑥⋆⟩ + 𝜂²‖𝑔_𝑖‖² + 2√𝜂⟨𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉_𝑖⟩ + 𝜂‖𝜉_𝑖‖² (3.22)
≤ −2𝜂⟨𝜁_𝑖, 𝑋_𝑖 − 𝑥⋆⟩ + 2𝜂²(‖∇𝐹(𝑋_𝑖)‖² + ‖𝜁_𝑖‖²) + 2√𝜂⟨𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉_𝑖⟩ + 𝜂‖𝜉_𝑖‖² (3.23)
≤ −2𝜂⟨𝜁_𝑖, 𝑋_𝑖 − 𝑥⋆⟩ + 2𝜂²(𝐿²𝑟² + 𝑆²) + 2√𝜂⟨𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉_𝑖⟩ + 𝜂‖𝜉_𝑖‖² (3.24)
= 2𝜂²(𝐿²𝑟² + 𝑆²) + 𝜂𝑑 + [−2𝜂⟨𝜁_𝑖, 𝑋_𝑖 − 𝑥⋆⟩ + 2√𝜂⟨𝑋_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉_𝑖⟩ + 𝜂(‖𝜉_𝑖‖² − 𝑑)]  (*) (3.25)
Note that (*) has expectation 0 conditioned on 𝑋_0, …, 𝑋_𝑖. To use Azuma's inequality, we need our random variables to be bounded. Also, recall that we assumed ‖𝑋_𝑖 − 𝑥⋆‖ is bounded above by 𝑟. Thus, we define a toy Markov chain coupled to 𝑋_𝑖 as follows. Let 𝑋′_0 = 𝑋_0 and

𝑋′_{𝑖+1} = 𝑋′_𝑖 if ‖𝑋′_𝑖 − 𝑥⋆‖ ≥ 𝑟, and 𝑋′_{𝑖+1} = 𝑋′_𝑖 − 𝜂𝑔_𝑖 + √𝜂 𝜉′_𝑖 otherwise, (3.26)
where 𝑔_𝑖 = ∇𝐹(𝑋′_𝑖) + 𝜁_𝑖 (3.27)
𝜉′_𝑖 = min(𝐶_𝜉, ‖𝜉_𝑖‖) 𝜉_𝑖/‖𝜉_𝑖‖ (3.28)
𝜉_𝑖 ∼ 𝑁(0, 𝐼_𝑑). (3.29)

Then 𝑌′_𝑖 := ‖𝑋′_𝑖 − 𝑥⋆‖² − 𝑖[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑] is a supermartingale with differences upper-bounded by

𝑌′_{𝑖+1} − 𝑌′_𝑖 ≤ 0 if ‖𝑋′_𝑖 − 𝑥⋆‖ ≥ 𝑟, and otherwise
𝑌′_{𝑖+1} − 𝑌′_𝑖 ≤ −2𝜂⟨𝜁_𝑖, 𝑋′_𝑖 − 𝑥⋆⟩ + 2√𝜂⟨𝑋′_𝑖 − 𝑥⋆ − 𝜂𝑔_𝑖, 𝜉′_𝑖⟩ + 𝜂(‖𝜉′_𝑖‖² − 𝑑) (3.30)
≤ 2𝜂𝑆𝑟 + 2√𝜂(𝑟 + 𝜂(𝑆 + 𝐿𝑟))𝐶_𝜉 + 𝜂(𝐶_𝜉² − 𝑑) (3.31)
≤ 2𝜂𝑆𝑟 + 2√𝜂 𝐶_𝜉(𝑟 + 𝜂𝑆 + 𝜂𝐿𝑟) + 𝜂𝐶_𝜉². (3.32)
By Azuma's inequality, for 𝜆 > 0 and for 𝑟² > ‖𝑋_0 − 𝑥⋆‖² + 𝑖[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑],

P(‖𝑋′_𝑖 − 𝑥⋆‖² − ‖𝑋_0 − 𝑥⋆‖² − 𝑖[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑] > 𝜆) (3.33)
≤ exp(−𝜆² / (2(2𝜂𝑆𝑟 + 2√𝜂 𝐶_𝜉(𝑟 + 𝜂𝑆 + 𝜂𝐿𝑟) + 𝜂𝐶_𝜉²)²)) (3.34)

⟹ P(‖𝑋′_𝑖 − 𝑥⋆‖ > 𝑟) (3.35)
≤ exp(−(𝑟² − ‖𝑋_0 − 𝑥⋆‖² − 𝑖[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑])² / (2(2𝜂𝑆𝑟 + 2√𝜂 𝐶_𝜉(𝑟 + 𝜂𝑆 + 𝜂𝐿𝑟) + 𝜂𝐶_𝜉²)²)). (3.36)

If ‖𝑋_𝑖 − 𝑥⋆‖ ≥ 𝑟 for some 𝑖 ≤ 𝑖max, then either ‖𝑋′_𝑖 − 𝑥⋆‖ ≥ 𝑟 for some 𝑖 ≤ 𝑖max, or 𝑋_𝑖 otherwise becomes different from 𝑋′_𝑖, which happens only when ‖𝜉_𝑖‖ ≥ 𝐶_𝜉 for some 𝑖 ≤ 𝑖max. Thus, letting 𝐼 denote the first time the chain leaves the ball, by the Hanson–Wright inequality, since 𝐶_𝜉 ≥ √(2𝑑),

P(𝐼 ≤ 𝑖max) (3.37)
≤ ∑_{𝑖=1}^{𝑖max} P(‖𝑋′_𝑖 − 𝑥⋆‖² ≥ 𝑟²) + ∑_{𝑖=1}^{𝑖max} P(‖𝜉_𝑖‖ ≥ 𝐶_𝜉) (3.38)
≤ 𝑖max [exp(−(𝑟² − ‖𝑋_0 − 𝑥⋆‖² − 𝑖max[2𝜂²(𝑆² + 𝐿²𝑟²) + 𝜂𝑑])² / (2(2𝜂𝑆𝑟 + 2√𝜂 𝐶_𝜉(𝑟 + 𝜂𝑆 + 𝜂𝐿𝑟) + 𝜂𝐶_𝜉²)²)) + exp(−(𝐶_𝜉² − 𝑑)/8)]. (3.39)
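The clipped-noise coupling in the proof is easy to visualize in code. The toy simulation below (ours, with a quadratic potential) rescales each Gaussian increment to norm at most 𝐶_𝜉, as in (3.28), so that the supermartingale differences stay bounded:

```python
import numpy as np

rng = np.random.default_rng(5)
d, eta, C_xi, r, n = 4, 0.005, 6.0, 5.0, 500   # C_xi >= sqrt(2*d) ~ 2.83
X = np.zeros(d)
exited = False
for _ in range(n):
    xi = rng.standard_normal(d)
    norm = np.linalg.norm(xi)
    xi_clipped = min(C_xi, norm) * xi / norm   # clipped noise, as in (3.28)
    X = X - eta * X + np.sqrt(eta) * xi_clipped  # gradient of 0.5*||x||^2
    if np.linalg.norm(X) > r:                  # chain left the ball of radius r
        exited = True
        break
assert np.linalg.norm(xi_clipped) <= C_xi + 1e-9
assert not exited                              # stays in the ball w.h.p.
```

With these (arbitrary but representative) parameters, the chain equilibrates at distance roughly √(𝑑/2) from the mode and never approaches the exit radius, mirroring the high-probability containment the lemma proves.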
3.6.3 Bounding the TV error
Lemma 3.6.4 will allow us to carry out the induction step for the proof of the main
theorem.
We will use the following result of [DMM18]. Note that this result works more
generally with non-smooth functions, but we will only consider smooth functions.
Their algorithm, Stochastic Proximal Gradient Langevin Dynamics, reduces to SGLD
in the smooth case. We will apply this Lemma with our variance-reduced stochastic
gradients in Algorithm 3.
Lemma 3.6.3 ([DMM18], Corollary 18). Suppose that 𝑓 : R^𝑑 → R is convex and 𝐿-smooth. Let ℱ_𝑖 be a filtration with 𝜉_𝑖 and 𝑔(𝑥_𝑖) defined on ℱ_𝑖, and satisfying E[𝑔(𝑥_𝑖) | ℱ_{𝑖−1}] = ∇𝑓(𝑥_𝑖) and sup_𝑥 Var[𝑔(𝑥) | ℱ_{𝑖−1}] ≤ 𝜎² < ∞. Consider SGLD for 𝑓(𝑥) run with step size 𝜂 and stochastic gradient 𝑔(𝑥), with initial distribution 𝜇_0; that is,

𝑥_{𝑖+1} = 𝑥_𝑖 − 𝜂𝑔(𝑥_𝑖) + √𝜂 𝜉_𝑖, 𝜉_𝑖 ∼ 𝑁(0, 𝐼). (3.40)

Let 𝜇_𝑛 denote the distribution of 𝑥_𝑛 and let 𝜋 be the distribution such that 𝜋 ∝ 𝑒^{−𝑓}. Suppose

𝜂 ≤ min{𝜀/(2(𝐿𝑑 + 𝜎²)), 1/𝐿} (3.41)
𝑛 ≥ ⌈𝑊₂²(𝜇_0, 𝜋)/(𝜂𝜀)⌉. (3.42)

Let 𝜇 = (1/𝑛) ∑_{𝑘=1}^𝑛 𝜇_𝑘 be the "averaged" distribution. Then KL(𝜇 | 𝜋) ≤ 𝜀.
Remark 3.6.2. The result in [DMM18] is stated when 𝑔(𝑥) is independent of the
history ℱ𝑖, but the proof works when the stochastic gradient is allowed to depend on
history, as in SAGA. For SAGA, ℱ𝑖 contains all the information up to time step 𝑖,
including which gradients were replaced at each time step.
Note that [DMM18] is derived by analogy to online convex optimization. The optimization guarantees are only given at the point equal to the average of the 𝑥_𝑡 (by Jensen's inequality). For the sampling problem, this corresponds to selecting a point from the averaged distribution 𝜇.
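Operationally, drawing from the averaged distribution just means running the chain for 𝑛 steps and returning the state at a uniformly random step. A minimal sketch (ours, with a hypothetical gradient oracle):

```python
import numpy as np

def sgld_averaged_sample(grad_f, x0, eta, n, rng):
    """Run the SGLD recursion (3.40) for n steps and return the state at a
    uniformly random step -- a draw from the averaged distribution
    (1/n) * sum_k mu_k for which Lemma 3.6.3 gives the KL guarantee."""
    xs, x = [], np.asarray(x0, float)
    for _ in range(n):
        x = x - eta * grad_f(x) + np.sqrt(eta) * rng.standard_normal(x.shape)
        xs.append(x)
    return xs[rng.integers(n)]

rng = np.random.default_rng(6)
# Example with the exact gradient of f(x) = 0.5*||x||^2 (target ~ N(0, I)):
sample = sgld_averaged_sample(lambda x: x, np.zeros(2), eta=0.05, n=400, rng=rng)
```

Any stochastic gradient satisfying the conditional unbiasedness and variance conditions of the lemma, such as the SAGA estimator of Algorithm 3, can replace the exact gradient here.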
Define the good events

𝐺_𝑡 = {∀𝑠 ≤ 𝑡, ∀0 ≤ 𝑖 ≤ 𝑖_𝑠: ‖𝑋^𝑠_𝑖 − 𝑥⋆_𝑠‖ ≤ R/√(𝑠 + 𝐿₀/𝐿)} (3.43)
𝐻_𝑡 = {∀𝑠 ≤ 𝑡 s.t. 𝑠 is a power of 2 or 𝑠 = 0: ‖𝑋^𝑠 − 𝑥⋆_𝑠‖ ≤ 𝐶₁/√(𝑠 + 𝐿₀/𝐿)}. (3.44)

𝐺_𝑡 is the event that the Markov chain never drifts too far from the current mode (which we want in order to bound the stochastic gradient of SAGA), and 𝐻_𝑡 is the event that the samples at powers of 2 are close to the respective modes (which we want because we will use them as reset points). Roughly, 𝐺^𝑐_𝑡 will involve union-bounding over bad events whose probabilities we will set to be 𝑂(𝜀/𝑇), and 𝐻^𝑐_𝑡 will involve union-bounding over bad events whose probabilities we will set to be 𝑂(𝜀/log₂(𝑇)).
Lemma 3.6.4 (Induction step). Suppose that Assumptions 3.2.1, 3.2.2, and 3.2.3 hold with 𝑐 = 𝐿₀/𝐿 and 𝐿₀ ≥ 𝐿. Let 𝑋^𝜏_𝑖 be obtained by running Algorithm 4 with 𝐶′ = 2.5(𝐶₁ + D), 𝐶₁ ≥ 𝐶, and R ≥ 2(𝐶₁ + D). Suppose 𝜂_𝑡 = 𝜂₀/(𝑡 + 𝐿₀/𝐿) and 𝜀₂ > 0 is such that

𝜂₀ ≤ 𝜀₂²/(𝐿𝑑 + 9𝐿²(R + D)²/𝑏), (3.45)
𝑖max ≥ 20(𝐶₁ + D)²/(𝜂₀𝜀₂²). (3.46)

Suppose 𝜀₁ > 0 is such that for any 𝜏 ≥ 1,

P(𝐺_𝜏 | 𝐺_{𝜏−1} ∩ 𝐻_{𝜏−1}) ≥ 1 − 𝜀₁. (3.47)

Suppose 𝑡 is a power of 2. Then the following hold.

1. For 𝑡 < 𝜏 ≤ 2𝑡, P(𝐺_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡) ≥ 1 − (𝜏 − 𝑡)𝜀₁.

2. Fix 𝑋^𝑠_𝑖 for 𝑠 ≤ 𝑡, 0 ≤ 𝑖 ≤ 𝑖max such that 𝐺_𝑡 ∩ 𝐻_𝑡 holds (i.e., condition on the filtration ℱ_𝑡 on which the algorithm is defined). Then

‖ℒ(𝑋^𝜏) − 𝜋_𝜏‖_TV ≤ (𝜏 − 𝑡)𝜀₁ + 𝜀₂. (3.48)

3. For 𝜏 = 2𝑡, we have

P(𝐺_𝜏 ∩ 𝐻_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡) ≥ 1 − (𝑡𝜀₁ + 𝜀₂ + 𝐴𝑒^{−𝑘𝐶₁}). (3.49)

These also hold in the case 𝑡 = 0 and 𝜏 = 1, when 𝐿₀ ≥ 𝐿.
Proof. Let 𝐹_𝑡(𝑥) = ∑_{𝑘=0}^𝑡 𝑓_𝑘(𝑥).

First, note that 𝐻_{𝜏−1} = ⋯ = 𝐻_𝑡, because 𝐻_𝑠 is defined as an intersection of events with indices ≤ 𝑠 that are powers of 2. (See (3.44).) Moreover, 𝐺_𝜏 is a subset of 𝐺_{𝜏−1} for each 𝜏, by (3.43).

The first part holds by induction on 𝜏 and the assumption on 𝜀₁. We need to show P(𝐺^𝑐_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡) ≤ (𝜏 − 𝑡)𝜀₁ by induction. Assuming it is true for 𝜏, we have by the union bound that

P(𝐺^𝑐_{𝜏+1} | 𝐺_𝑡 ∩ 𝐻_𝑡) ≤ P(𝐺^𝑐_{𝜏+1} ∩ 𝐺_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡) + P(𝐺^𝑐_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡) (3.50)
≤ P(𝐺^𝑐_{𝜏+1} | 𝐺_𝜏 ∩ 𝐺_𝑡 ∩ 𝐻_𝑡) + P(𝐺^𝑐_𝜏 | 𝐺_𝑡 ∩ 𝐻_𝑡). (3.51)

Now the event 𝐺_𝜏 ∩ 𝐺_𝑡 ∩ 𝐻_𝑡 is the same as the event 𝐺_𝜏 ∩ 𝐻_𝜏, by the previous paragraph. Thus this is ≤ 𝜀₁ + (𝜏 − 𝑡)𝜀₁, completing the induction step.
For the second part, note that for 𝑡 < 𝜏 ≤ 2𝑡,

‖𝑋^𝜏_0 − 𝑥⋆_𝜏‖ ≤ ‖𝑋^𝜏_0 − 𝑋^𝑡‖ + ‖𝑋^𝑡 − 𝑥⋆_𝑡‖ + ‖𝑥⋆_𝑡 − 𝑥⋆_𝜏‖ (3.52)
≤ 2.5(𝐶₁ + D)/√(𝜏 + 𝐿₀/𝐿) + 𝐶₁/√(𝑡 + 𝐿₀/𝐿) + D/√(𝑡 + 𝐿₀/𝐿) (3.53)
≤ 4(𝐶₁ + D)/√(𝜏 + 𝐿₀/𝐿), (3.54)

where in the 2nd inequality we used that

1. Algorithm 4 ensures that ‖𝑋^𝜏_0 − 𝑋^𝑡‖ ≤ 𝐶′/√(𝜏 + 𝐿₀/𝐿) = 2.5(𝐶₁ + D)/√(𝜏 + 𝐿₀/𝐿) (the algorithm resets 𝑋^𝜏_0 to 𝑋^𝑡 if ‖𝑋^𝜏_0 − 𝑋^𝑡‖ is greater than 𝐶′/√(𝜏 + 𝐿₀/𝐿), making the term 0; this is the place where the resetting is used),

2. the definition of 𝐻_𝑡, and

3. the drift assumption, Assumption 3.2.3.

In the 3rd inequality we used that √𝑡 ≥ √(𝜏/2) ≥ √𝜏/1.5.
Therefore

𝑊₂²(𝛿_{𝑋^𝜏_0}, 𝜋_𝜏) ≤ 2‖𝑋^𝜏_0 − 𝑥⋆_𝜏‖² + 2𝑊₂²(𝛿_{𝑥⋆_𝜏}, 𝜋_𝜏) ≤ 32(𝐶₁ + D)²/(𝜏 + 𝐿₀/𝐿) + 2𝐶²/(𝜏 + 𝐿₀/𝐿) ≤ 40(𝐶₁ + D)²/(𝜏 + 𝐿₀/𝐿), (3.55)

where the second moment bound comes from Assumption 3.2.2 and 𝐶 ≤ 𝐶₁.
Define a toy Markov chain 𝑋̃ coupled to 𝑋^𝜏_𝑖 as follows. Let 𝑋̃^𝑠_𝑗 = 𝑋^𝑠_𝑗 for 𝑠 < 𝜏, 𝑋̃^𝜏_0 = 𝑋^𝜏_0, and

𝑋̃^𝜏_{𝑖+1} = 𝑋̃^𝜏_𝑖 − 𝜂𝑔^𝜏_𝑖 + √𝜂 𝜉_𝑖 when ‖𝑋̃^𝜏_𝑗 − 𝑥⋆_𝜏‖ ≤ R/√(𝜏 + 𝐿₀/𝐿) for all 0 ≤ 𝑗 ≤ 𝑖, and 𝑋̃^𝜏_{𝑖+1} = 𝑋̃^𝜏_𝑖 − 𝜂∇𝐹_𝜏(𝑋̃^𝜏_𝑖) otherwise, (3.56)

where 𝑔^𝜏_𝑖 is the stochastic gradient for 𝑋̃^𝜏_𝑖 in Algorithm 3 and 𝜉_𝑖 ∼ 𝑁(0, 𝐼_𝑑). By Lemma 3.6.1, the variance of 𝑔^𝜏_𝑖 is at most (𝜏²𝐿²/𝑏) max_{((𝑡+1)/2, 0) ≤ (𝑠,𝑗) ≤ (𝜏,𝑖)} ‖𝑋̃^𝜏_𝑖 − 𝑋̃^𝑠_𝑗‖². (The ordering on ordered pairs is lexicographic. Note 𝑠 > 𝑡/2 because Algorithm 4 refreshes all gradients that were updated at time 𝑡/2.) If the first case of (3.56) always holds, we bound (using the condition that 𝐺_𝑡 holds)

‖𝑋̃^𝜏_𝑖 − 𝑋̃^𝑠_𝑗‖ ≤ ‖𝑋̃^𝜏_𝑖 − 𝑥⋆_𝜏‖ + ‖𝑥⋆_𝜏 − 𝑥⋆_𝑠‖ + ‖𝑥⋆_𝑠 − 𝑋̃^𝑠_𝑗‖ (3.57)
≤ R/√(𝜏 + 𝐿₀/𝐿) + D/√(𝑠 + 𝐿₀/𝐿) + R/√(𝑠 + 𝐿₀/𝐿) (3.58)
≤ (3R + 2D)/√(𝜏 + 𝐿₀/𝐿) < 3(R + D)/√(𝜏 + 𝐿₀/𝐿) (3.59)

⟹ (𝜏²𝐿²/𝑏) max_{((𝑡+1)/2, 0) ≤ (𝑠,𝑗) ≤ (𝜏,𝑖)} ‖𝑋̃^𝜏_𝑖 − 𝑋̃^𝑠_𝑗‖² ≤ 9𝜏𝐿²(R + D)²/𝑏. (3.60)
We can apply Lemma 3.6.3 with 𝜀 = 2𝜀₂², 𝐿 ← 𝐿(𝜏 + 𝐿₀/𝐿), 𝜎² ≤ 9𝜏𝐿²(R + D)²/𝑏, and 𝑊₂²(𝜇_0, 𝜋) ≤ 40(𝐶₁ + D)²/(𝜏 + 𝐿₀/𝐿). Note that 𝜂_𝜏 ≤ 𝜀₂²/((𝜏 + 𝐿₀/𝐿)(𝐿𝑑 + 9𝐿²(R + D)²/𝑏)) ≤ 𝜀₂²/((𝜏𝐿 + 𝐿₀)𝑑 + 9𝐿²𝜏(R + D)²/𝑏) does satisfy (3.41), as 𝐹_𝜏 = ∑_{𝑘=0}^𝜏 𝑓_𝑘 is (𝜏𝐿 + 𝐿₀)-smooth by Assumption 3.2.1. Let 𝑖 ∈ [𝑖max] be uniformly random on [𝑖max], and let 𝑋̃^𝜏 = 𝑋̃^𝜏_𝑖; note that the distribution of 𝑋̃^𝜏 is the mixture distribution of 𝑋̃^𝜏_1, …, 𝑋̃^𝜏_{𝑖max}. Under the conditions on 𝜂, 𝑖max, by Pinsker's inequality and Lemma 3.6.3,

‖ℒ(𝑋̃^𝜏) − 𝜋_𝜏‖_TV ≤ √((1/2) KL(ℒ(𝑋̃^𝜏) | 𝜋_𝜏)) ≤ 𝜀₂. (3.61)

Note that under 𝐺_𝜏, 𝑋̃^𝑠_𝑖 = 𝑋^𝑠_𝑖 for all 𝑖 ≤ 𝑖max and 𝑠 ≤ 𝜏, so

‖ℒ(𝑋^𝜏) − 𝜋_𝜏‖_TV ≤ P(𝐺^𝑐_𝜏 | ℱ_𝑡) + ‖ℒ(𝑋̃^𝜏_𝑖) − 𝜋_𝜏‖_TV ≤ (𝜏 − 𝑡)𝜀₁ + 𝜀₂. (3.62)

This shows part 2.
For part 3, note that by Assumption 3.2.2,

P_{𝑋∼𝜋_{2𝑡}}(‖𝑋 − 𝑥⋆_{2𝑡}‖ ≥ 𝐶₁/√(2𝑡 + 𝐿₀/𝐿)) ≤ 𝐴𝑒^{−𝑘𝐶₁}. (3.63)

Combining (3.62) and (3.63) for 𝜏 = 2𝑡 gives (3.49).

Finally, note that the proof goes through when 𝑡 = 0, 𝜏 = 1.
3.6.4 Setting the constants; Proof of main theorem
Theorem 3.6.5 (Theorem 3.2.4 with parameters). Suppose that Assumptions 3.2.1, 3.2.2, and 3.2.3 hold, with 𝑘 ≤ 1, 𝑐 = 𝐿₀/𝐿, 𝐿₀ ≥ 𝐿, and ‖𝑋^0 − 𝑥⋆_0‖ ≤ 𝐶/√(𝐿₀/𝐿). Suppose Algorithm 4 is run with parameters 𝜂₀, 𝑖max given by

𝜀₁ = 𝜀/(3𝑇) (3.64)
𝜀₂ = 𝜀/(3⌈log₂(𝑇) + 1⌉) (3.65)
𝐶₁ = (2 + 1/𝑘) log(𝐴/(𝜀₂𝑘²)) (3.66)
R = 100 max{√((𝑑/𝐿) log(max{𝐿, 𝑑/𝐿, 𝐶₁ + D, 1/𝜀₁})), 𝐶₁ + D} (3.67)
𝜂₀ = 𝜀₂²/(2𝐿²R²) (3.68)
𝑖max = ⌈20(𝐶₁ + D)²/(𝜂₀𝜀₂²)⌉ = ⌈40𝐿²R²(𝐶₁ + D)²/𝜀₂⁴⌉ (3.69)

with any constant batch size 𝑏 ≥ 9. Then it outputs a sample 𝑋^𝑡 at each epoch, so that the 𝑋^𝑡 are 𝜀-approximate independent samples of 𝜋_𝑡 (1 ≤ 𝑡 ≤ 𝑇), using 𝑂(𝑖max𝑏) = poly(𝑑, 𝐿, log(𝐴), 1/𝑘, D, 1/𝜀) gradient evaluations at each epoch.

Note that the dependence of 𝑖max on 𝜀 is 𝑖max = Õ_𝜀(1/𝜀⁴).
Proof. We will choose parameters and prove by induction that for 𝑡 = 2^𝑎,

P(𝐺_𝑡 ∩ 𝐻_𝑡) ≥ 1 − 𝑡𝜀₁ − 2(𝑎 + 1)𝜀₂. (3.70)

We will also show that (3.70) implies that if 𝑡 = 2^𝑎 + 𝑏′ for 0 < 𝑏′ ≤ 2^𝑎, then

P(𝐺_𝑡 ∩ 𝐻_{2^𝑎}) ≥ 1 − 𝑡𝜀₁ − 2(𝑎 + 1)𝜀₂ (3.71)
‖ℒ(𝑋^𝑡) − 𝜋_𝑡‖_TV ≤ 𝑡𝜀₁ + (2𝑎 + 3)𝜀₂. (3.72)

With the values of 𝜀₁ and 𝜀₂, (3.72) gives the theorem.¹³

Let 𝜂₀, R be constants to be chosen, and for any 𝑡 ∈ N, let

𝜂_𝑡 = 𝜂₀/(𝑡 + 𝐿₀/𝐿) (3.73)
𝑟_𝑡 = R/√(𝑡 + 𝐿₀/𝐿) (3.74)
𝑆_𝑡 = 6√𝑡 𝐿(R + D) (3.75)
𝜎²_𝑡 = 9𝑡𝐿²(R + D)²/𝑏. (3.76)

By (3.60) and Lemma 3.6.1, when 𝐺_{𝑡−1} ∩ 𝐻_{𝑡−1} holds, the stochastic gradient 𝑔^𝑡_𝑖

¹³In fact, we will show a slightly stronger result: namely, that the distribution of 𝑋^𝑡 conditioned on the filtration ℱ_1 ⊆ ⋯ ⊆ ℱ_{𝑡−1}, where the filtration ℱ_𝜏 includes both the random batch 𝑆 as well as the points in the Markov chain up to time 𝜏, satisfies ‖(ℒ(𝑋^𝑡) | ℱ_{𝑡−1}) − 𝜋_𝑡‖_TV ≤ 𝑡𝜀₁ + (2𝑎 + 3)𝜀₂. This implies that the samples 𝑋^1, 𝑋^2, …, 𝑋^𝑡 are 𝜀-approximately independent with 𝜀 = 𝑡𝜀₁ + (2𝑎 + 3)𝜀₂.
in (3.56) satisfies ‖𝑔^𝑡_𝑖‖ ≤ 𝑆_𝑡 and Var(𝑔^𝑡_𝑖) ≤ 𝜎²_𝑡. We claim that it suffices to choose parameters so that the following hold for each 𝑡 and some 𝐶_𝜉 ≥ √(2𝑑):

𝜀₁ ≥ 𝑖max [exp(−(𝑟²_𝑡 − 16(𝐶₁+D)²/(𝑡+𝐿₀/𝐿) − 𝑖max[2𝜂²_𝑡(𝑆²_𝑡 + 𝐿²(𝑡+𝐿₀/𝐿)²𝑟²_𝑡) + 𝜂_𝑡𝑑])² / (2(2𝜂_𝑡𝑆_𝑡𝑟_𝑡 + 2√𝜂_𝑡 𝐶_𝜉(𝑟_𝑡 + 𝜂_𝑡𝑆_𝑡 + 𝜂_𝑡𝐿(𝑡+𝐿₀/𝐿)𝑟_𝑡) + 𝜂_𝑡𝐶²_𝜉)²)) + exp(−(𝐶²_𝜉 − 𝑑)/8)] (3.77)–(3.78)
𝜂₀ ≤ 𝜀₂²/(𝐿𝑑 + 9𝐿²(R + D)²/𝑏) (3.79)
𝑖max ≥ 20(𝐶₁ + D)²/(𝜂₀𝜀₂²) (3.80)
𝐴𝑒^{−𝑘𝐶₁} ≤ 𝜀₂ (3.81)
𝐶₁ ≥ 𝐶 := (2 + 1/𝑘) log(𝐴/𝑘²). (3.82)

Indeed, suppose these inequalities hold. Lemma 3.6.2 and (3.77) show that 𝜀₁ satisfies the assumption in Lemma 3.6.4. Equations (3.79), (3.80), and (3.82) ensure that 𝜂₀, 𝑖max, and 𝐶 satisfy the conditions in Lemma 3.6.4.

Base case of induction. By assumption ‖𝑋^0 − 𝑥⋆_0‖ ≤ 𝐶₁/√(𝐿₀/𝐿), so 𝐻_0 holds, and the 𝑡 = 1 case of Lemma 3.6.4 shows P(𝐺_1) ≥ 1 − 𝜀₁ and P(𝐺_1 ∩ 𝐻_1) ≥ 1 − (𝜀₁ + 𝜀₂ + 𝐴𝑒^{−𝑘𝐶₁}) ≥ 1 − (𝜀₁ + 2𝜀₂), using (3.81) for the last inequality.

(3.70) implies (3.71), (3.72). This follows from parts 1 and 2 of Lemma 3.6.4.

Induction step. We work with the complements. Let 𝐴_𝑡 = 𝐺_𝑡 ∩ 𝐻_𝑡. By a union bound,

P(𝐴^𝑐_{2𝑡}) ≤ P(𝐴^𝑐_{2𝑡} ∩ 𝐴_𝑡) + P(𝐴^𝑐_𝑡) ≤ P(𝐴^𝑐_{2𝑡} | 𝐴_𝑡) + P(𝐴^𝑐_𝑡). (3.83)
The first term is bounded using Part 3 of Lemma 3.6.4 (and (3.81)): P(𝐴^𝑐_{2𝑡} | 𝐴_𝑡) ≤ 𝑡𝜀₁ + 𝜀₂ + 𝜀₂. The second term is bounded by the induction hypothesis, which says P(𝐴^𝑐_𝑡) ≤ 𝑡𝜀₁ + 2(𝑎 + 1)𝜀₂. Combining these gives P(𝐴^𝑐_{2𝑡}) ≤ 2𝑡𝜀₁ + 2(𝑎 + 2)𝜀₂, completing the induction step.
Showing inequalities. Setting 𝐶₁, 𝜂₀, and 𝑖max as in (3.66), (3.68), and (3.69) (with R to be determined), we get that (3.79), (3.80), and (3.81) are satisfied, as R ≥ √(𝑑/𝐿) and 𝑏 ≥ 9 imply 𝜀₂²/(2𝐿²(R + D)²) ≤ 𝜀₂²/(𝐿𝑑 + 9𝐿²(R + D)²/𝑏). Moreover, setting 𝐶_𝜉 = √(2𝑑 + 8 log(2𝑖max/𝜀₁)) makes 𝑖max exp(−(𝐶²_𝜉 − 𝑑)/8) ≤ 𝜀₁/2. It suffices to show that our choice of R makes

𝜀₁/(2𝑖max) ≥ exp(−(𝑟²_𝑡 − 16(𝐶₁+D)²/(𝑡+𝐿₀/𝐿) − 𝑖max[2𝜂²_𝑡(𝑆²_𝑡 + 𝐿²(𝑡+𝐿₀/𝐿)²𝑟²_𝑡) + 𝜂_𝑡𝑑])² / (2(2𝜂_𝑡𝑆_𝑡𝑟_𝑡 + 2√𝜂_𝑡 𝐶_𝜉(𝑟_𝑡 + 𝜂_𝑡𝑆_𝑡 + 𝜂_𝑡𝐿(𝑡+𝐿₀/𝐿)𝑟_𝑡) + 𝜂_𝑡𝐶²_𝜉)²)) (3.84)
= exp(−(𝑟²_𝑡 − 16(𝐶₁+D)²/(𝑡+𝐿₀/𝐿) − 𝑖max[(2𝜂₀²/(𝑡+𝐿₀/𝐿)²)(16𝑡𝐿²R² + (𝑡+𝐿₀/𝐿)𝐿²R²)])² / (2(8𝜂₀𝐿𝑡R²/(𝑡+𝐿₀/𝐿)² + (2√𝜂₀/√(𝑡+𝐿₀/𝐿)) 𝐶_𝜉 (R/√(𝑡+𝐿₀/𝐿) + 4𝜂₀𝐿R√𝑡/(𝑡+𝐿₀/𝐿) + 𝜂₀𝐿R/√(𝑡+𝐿₀/𝐿)) + 𝜂₀𝐶²_𝜉/(𝑡+𝐿₀/𝐿))²)) (3.85)

⇐ √(2 log(2𝑖max/𝜀₁)) ≤ (𝑟²_𝑡 − (1/(𝑡+𝐿₀/𝐿))(16(𝐶₁+D)² + 40𝑖max𝜂₀²𝐿²R²)) / ((1/(𝑡+𝐿₀/𝐿))(8𝜂₀𝐿R² + 2√𝜂₀ 𝐶_𝜉(R + 5𝜂₀𝐿R) + 𝜂₀𝐶²_𝜉)) (3.86)

⟺ R²/(𝑡+𝐿₀/𝐿) = 𝑟²_𝑡 ≥ (1/(𝑡+𝐿₀/𝐿)) [(8𝜂₀𝐿R² + 2√𝜂₀ 𝐶_𝜉(R + 5𝜂₀𝐿R) + 𝜂₀𝐶²_𝜉) √(2 log(2𝑖max/𝜀₁)) + 16(𝐶₁+D)² + 40𝑖max𝜂₀²𝐿²R²]. (3.87)–(3.88)

Using 𝜂₀ = 𝜀₂²/(2𝐿²(R+D)²) and 𝜂₀𝑖max ≤ 40(𝐶₁+D)²/𝜀₂², it suffices to have

R² ≥ (4𝜀₂²/𝐿 + √2 𝜀₂²𝐶_𝜉/𝐿 + 5𝜀₂³𝐶_𝜉/(𝐿²R²) + 𝜀₂²𝐶²_𝜉/(2𝐿²R²)) √(2 log(2𝑖max/𝜀₁)) + 16(𝐶₁+D)² + 800(𝐶₁+D)². (3.89)
Using 𝜀₂ ≤ 1 ≤ 𝐶_𝜉 and 𝐶_𝜉 ≤ 4√(𝑑 log(2𝑖max/𝜀₁)), the RHS is

≤ (8𝜀₂²𝐶_𝜉/𝐿 + (8𝜀₂²𝐶²_𝜉/(𝐿²R²)) √(log(2𝑖max/𝜀₁))) √(2 log(2𝑖max/𝜀₁)) + 816(𝐶₁+D)² (3.90)
≤ (8𝜀₂²𝑑^{1/2}/𝐿 + 8𝜀₂²𝑑/(𝐿²R²)) · 8 log(2𝑖max/𝜀₁) + 816(𝐶₁+D)². (3.91)

Now note

𝑖max ≤ 10𝐿²R²(𝐶₁+D)²/𝜀₂⁴ (3.92)
2𝑖max/𝜀₁ ≤ 20𝐿²R²(𝐶₁+D)²/(𝜀₂⁴𝜀₁) (3.93)
≤ 200,000𝐿² max{(𝑑/𝐿) log(max{𝐿, 𝑑/𝐿, 𝐶₁+D, 1/𝜀₁}), (𝐶₁+D)²} (𝐶₁+D)²/(𝜀₂⁴𝜀₁) (3.94)
≤ 200,000𝐿² max{(𝑑/𝐿) max{𝐿, 𝑑/𝐿, 𝐶₁+D, 1/𝜀₁}, (𝐶₁+D)²} (𝐶₁+D)²/(𝜀₂⁴𝜀₁) (3.95)
log(2𝑖max/𝜀₁) ≤ log(200,000) + 11 log(max{𝐿, 𝑑/𝐿, 𝐶₁+D, 1/𝜀₁}). (3.96)
We want to show (3.91) ≤ R2; it suffices to show
8𝜀12√𝑑
𝐿8 log
Ç2𝑖max
𝜀1
å≤ R2
4(3.97)
8𝜀12𝑑
𝐿2R28
ñlog
Ç2𝑖max
𝜀1
åô 32
≤ R2
4(3.98)
816 (𝐶1 + D)2 ≤ R2
2. (3.99)
These inequalities hold because
R2 ≥ 10000𝑑
𝐿log
Çmax
®𝐿,
𝑑
𝐿,𝐶1 + D,
1
𝜀1
´å(3.100)
≥ 256𝜀2√𝑑
𝐿
Çlog(200, 000) + 11 log
Çmax
®𝐿,
𝑑
𝐿,𝐶1 + D,
1
𝜀1
´åå(3.101)
105
≥ 256𝜀2√𝑑
𝐿log
Ç2𝑖max
𝜀1
å(3.102)
R4 ≥ 108 𝑑2
𝐿2
Çlog
Çmax
®𝐿,
𝑑
𝐿,𝐶1 + D,
1
𝜀1
´åå2
≥ 256𝜀22𝑑
𝐿2
ñlog
Ç2𝑖max
𝜀1
åô 32
(3.103)
R2 ≥ 104 (𝐶1 + D)2 . (3.104)
3.7 Proof of offline theorem (Theorem 3.2.6)
The proof of Theorem 3.2.6 is similar to the proof of Theorem 3.2.4, except for some
key differences as to how the stochastic gradients are computed and how one defines
the functions “𝐹𝑡”.
We define $F_\beta := \beta F = \beta\sum_{k=1}^T f_k$, where the $\beta$'s will range over the sequence
\[
\beta_t = \begin{cases} 2^t/T, & 0 \le t < \lceil\log_2(T)\rceil\\ 1, & t = \lceil\log_2(T)\rceil. \end{cases} \tag{3.105}
\]
For this choice of $F_\beta$, the offline assumptions, proof, and algorithm are similar to those of the online case.
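As a quick illustration (our own sketch, not part of the thesis), the inverse-temperature schedule (3.105) can be computed as:

```python
import math

def beta_schedule(T):
    """Inverse temperatures beta_t = 2^t / T for 0 <= t < ceil(log2(T)), then beta = 1."""
    n = math.ceil(math.log2(T))
    betas = [2 ** t / T for t in range(n)]
    betas.append(1.0)  # the final epoch runs at the target distribution pi_T
    return betas

print(beta_schedule(1000)[:3], beta_schedule(1000)[-1])  # [0.001, 0.002, 0.004] 1.0
```

Note that the schedule has $\lceil\log_2(T)\rceil + 1$ entries, i.e., logarithmically many epochs.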
Differences in assumptions. We have that $F_\beta$ is $\beta TL$-smooth, which (except for Lemma 3.6.1) is the only way in which Assumption 3.2.1 is used in the proof of Theorem 3.2.4.
Moreover, Assumption 3.2.5 for the offline case implies that $\pi_T^\beta \propto e^{-F_\beta}$ satisfies Assumption 3.2.2 with constants $C$ and $k$ for every $t$. Since the minimizer $x_\beta^\star$ of $F_\beta$ does not change with $t$, $x_\beta^\star$ satisfies Assumption 3.2.3 with constant $\mathcal{D} = 0$.
Differences in algorithm. The step size used in Algorithm 5 at inverse temperature $\beta$ is $\eta_\beta = \frac{\eta_0}{\beta T}$, analogous to the step size used in Algorithm 4. Thus, we note that Algorithm 5 is similar to Algorithm 4 except for a few key differences:
1. The way in which the stochastic gradient $g_i^\beta$ is computed is different. Specifically, in the offline algorithm our stochastic gradient is computed as
\[
g_i^\beta = s + \frac{\beta T}{b}\sum_{k\in S}\left(G_k^{\text{new}} - G_k\right), \tag{3.106}
\]
where $S$ is a multiset of size $b$ chosen with replacement from $\{1,\dots,T\}$ (rather than from $\{1,\dots,t\}$).
2. There are logarithmically many epochs.
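For concreteness, here is a minimal sketch (ours; the array layout and the helper `grad_fn` are assumptions, not the thesis's notation) of the variance-reduced gradient estimate (3.106), with the gradient table refreshed on the sampled minibatch:

```python
import numpy as np

def saga_gradient(x, stored, beta, b, grad_fn, rng):
    """Estimate of grad F_beta(x) = beta * sum_k grad f_k(x), cf. (3.106).

    stored: T x d table of past gradients G_k (refreshed in place).
    In practice the sum s is maintained incrementally rather than recomputed.
    """
    T = stored.shape[0]
    s = beta * stored.sum(axis=0)         # s = beta * sum_k G_k
    S = rng.integers(0, T, size=b)        # multiset of size b, with replacement
    g = s.copy()
    for k in S:
        g_new = grad_fn(k, x)             # G_k^new = grad f_k(x)
        g += (beta * T / b) * (g_new - stored[k])
        stored[k] = g_new                 # refresh the table entry
    return g

# sanity check: if the table already holds grad f_k(x), the estimate is exact
rng = np.random.default_rng(0)
c = rng.normal(size=(5, 2))               # f_k(x) = ||x - c_k||^2 / 2
x = np.ones(2)
stored = x - c                            # exact gradients at x
g = saga_gradient(x, stored, beta=0.5, b=3, grad_fn=lambda k, z: z - c[k], rng=rng)
```

When the stored gradients are stale, the correction term $(G_k^{\text{new}} - G_k)$ is nonzero, but its expectation over $S$ keeps the estimate unbiased.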
We now give the proof in some detail.
Letting $X_i^\beta$ be the iterates at inverse temperature $\beta$, define
\[
G_\beta = \left\{\forall i,\ \left\|X_i^\beta - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta T}}\right\}. \tag{3.107}
\]
Lemma 3.7.1 (Analogue of Lemma 3.6.4). Assume that Assumptions 3.2.1 and 3.2.5 hold. Let $C = \left(2 + \frac{1}{k}\right)\log\left(\frac{A}{k^2}\right)$, $C_1 \ge C$, and suppose
\begin{align}
\eta_0 &\le \frac{\varepsilon_2^2}{Ld + 4L^2\mathcal{R}^2/b} \tag{3.108}\\
i_{\max} &\ge \frac{5C_1^2}{\eta_0\varepsilon_2^2}. \tag{3.109}
\end{align}
Suppose $\varepsilon_1 > 0$ is such that
\[
\mathbb{P}\left(\forall\, 0\le i\le i_{\max},\ \left\|X_i^\beta - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta T}}\ \middle|\ \left\|X_0^\beta - x^\star\right\| \le \frac{C_1}{\sqrt{\beta T}}\right) \ge 1 - \varepsilon_1. \tag{3.110}
\]
Suppose $\left\|X_0^\beta - x^\star\right\| \le \frac{2C_1}{\sqrt{\beta T}}$. Then

1. $\left\|\mathcal{L}(X^\beta) - \pi_T^\beta\right\|_{TV} \le \varepsilon_1 + \varepsilon_2$.

2. For $i \in [i_{\max}]$ chosen at random,
\[
\mathbb{P}\left(\left\|X_i^\beta - x^\star\right\| \le \frac{C_1}{\sqrt{\beta T}}\right) \ge 1 - (\varepsilon_1 + \varepsilon_2 + Ae^{-kC_1}). \tag{3.111}
\]
Proof. First we calculate the distance of the starting point from the stationary distribution:
\[
W_2^2(\delta_{X_0^\beta}, \pi_T^\beta) \le 2\left\|X_0^\beta - x^\star\right\|^2 + 2W_2^2(\delta_{x^\star}, \pi_T^\beta) \le \frac{8C_1^2}{\beta T} + \frac{2C^2}{\beta T} \le \frac{10C_1^2}{\beta T}. \tag{3.112}
\]
Define a toy Markov chain $\widetilde{X}_i^\beta$ coupled to $X_i^\beta$ as follows. Let $\widetilde{X}_0^\beta = X_0^\beta$ and
\[
\widetilde{X}_{i+1}^\beta = \begin{cases}
\widetilde{X}_i^\beta - \eta g_i^\beta + \sqrt{\eta}\,\xi_i, & \text{when } \left\|\widetilde{X}_j^\beta - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta T}} \text{ for all } 0\le j\le i\\
\widetilde{X}_i^\beta - \eta\beta\nabla F(\widetilde{X}_i^\beta), & \text{otherwise.}
\end{cases} \tag{3.113}
\]
By Lemma 3.6.1, the variance of $g_i^\beta$ is at most $\frac{\beta^2T^2L^2}{b}\max_{0\le j\le i}\left\|\widetilde{X}_i^\beta - \widetilde{X}_j^\beta\right\|^2$. If $\left\|\widetilde{X}_i^\beta - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta T}}$ for all $0\le i\le i_{\max}$, then $\left\|\widetilde{X}_i^\beta - \widetilde{X}_j^\beta\right\| \le \frac{2\mathcal{R}}{\sqrt{\beta T}}$ for all $0\le i,j\le i_{\max}$.
Then we can apply Lemma 3.6.3 with $\varepsilon = 2\varepsilon_2^2$, $L \leftarrow L\beta T$, $\sigma^2 \le \frac{(\beta T)^2L^2}{b}\cdot\frac{4\mathcal{R}^2}{\beta T} = \frac{4\beta TL^2\mathcal{R}^2}{b}$, and $W_2^2(\mu_0, \pi) \le \frac{10C_1^2}{\beta T}$. By Pinsker's inequality, for random $i \in [i_{\max}]$,
\[
\left\|\mathcal{L}(\widetilde{X}_i^\beta) - \pi_T^\beta\right\|_{TV} \le \sqrt{\tfrac{1}{2}\operatorname{KL}\left(\mathcal{L}(\widetilde{X}_i^\beta)\ \middle\|\ \pi_T^\beta\right)} \le \varepsilon_2. \tag{3.114}
\]
Under $G_\beta$, $X_i^\beta = \widetilde{X}_i^\beta$ for all $i \le i_{\max}$, so
\[
\left\|\mathcal{L}(X_i^\beta) - \pi_T^\beta\right\|_{TV} \le \mathbb{P}(G_\beta^c) + \left\|\mathcal{L}(\widetilde{X}_i^\beta) - \pi_T^\beta\right\|_{TV} \le \varepsilon_1 + \varepsilon_2. \tag{3.115}
\]
This shows part 1.
For part 2, note that by Assumption 3.2.2,
\[
\mathbb{P}_{X\sim\pi_T^\beta}\left[\|X - x^\star\| \ge \frac{C_1}{\sqrt{\beta T}}\right] \le Ae^{-kC_1}. \tag{3.116}
\]
Combining (3.115) and (3.116) gives part 2.
Theorem 3.7.2 (Theorem 3.2.6 with parameters). Suppose that Assumptions 3.2.1 and 3.2.5 hold, with $k \le 1$ and $\|X_0 - x^\star\| \le C$. Suppose Algorithm 5 is run with parameters $\eta_0$, $i_{\max}$ given by
\begin{align}
\varepsilon_1 &= \frac{\varepsilon}{3\left\lceil\log_2(T) + 1\right\rceil} \tag{3.117}\\
C_1 &= \left(2 + \frac{1}{k}\right)\log\left(\frac{A}{\varepsilon_1 k^2}\right) \tag{3.118}\\
\mathcal{R} &= 100\max\left\{\sqrt{\frac{d}{L}\log\left(\max\left\{L,\frac{d}{L},C_1,\frac{1}{\varepsilon_1}\right\}\right)},\ C_1\right\} \tag{3.119}\\
\eta_0 &= \frac{\varepsilon_1^2}{2L^2\mathcal{R}^2} \tag{3.120}\\
i_{\max} &= \left\lceil\frac{5C_1^2}{\eta_0\varepsilon_1^2}\right\rceil = \left\lceil\frac{10L^2\mathcal{R}^2C_1^2}{\varepsilon_1^4}\right\rceil \tag{3.121}
\end{align}
with any constant batch size $b \ge 4$. Then it outputs $X^1$ whose distribution $\widetilde{P}$ satisfies $\|\widetilde{P} - \pi_T\|_{TV} \le \varepsilon$, using $\widetilde{O}(T) + \operatorname{poly}\log(T)\operatorname{poly}(d, L, C, \varepsilon^{-1})$ gradient evaluations.
Proof. The proof is similar to the proof of Theorem 3.6.5, and we omit the details. We show by induction that
\[
\mathbb{P}\left(\left\|X_i^{\beta_s} - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta_s T}}\right) \ge 1 - 2s\varepsilon_1. \tag{3.122}
\]
The base case follows from $C \le C_1 \le \mathcal{R}$. The induction step follows from noting first that
\[
\left\|X_i^{\beta_s} - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{\beta_s T}} \implies \left\|X_0^{\beta_{s+1}} - x^\star\right\| \le \frac{2\mathcal{R}}{\sqrt{\beta_{s+1} T}}, \tag{3.123}
\]
and then noting that the conditions imply (for $\eta_\beta = \frac{\eta_0}{\beta T}$, $r_\beta = \frac{\mathcal{R}}{\sqrt{\beta T}}$, $S_\beta = 4\sqrt{\beta T}L\mathcal{R}$, $\sigma_\beta^2 = \frac{4\beta TL^2\mathcal{R}^2}{b}$, and $C_\xi = \sqrt{2d + 8\log\left(\frac{2i_{\max}}{\varepsilon_1}\right)}$) that
\begin{align}
\varepsilon_1 &\ge i_{\max}\exp\left(-\frac{\left(r_\beta^2 - \frac{4C_1^2}{\beta T} - i_{\max}\left[2\eta_\beta^2\left(S_\beta^2 + L^2(\beta T)^2 r_\beta^2\right) + \eta_\beta d\right]\right)^2}{2\left(2\eta_\beta S_\beta r_\beta + 2\sqrt{\eta_\beta}\,C_\xi\left(r_\beta + \eta_\beta S_\beta + \eta_\beta L(\beta T)r_\beta\right) + \eta_\beta C_\xi^2\right)^2}\right) \tag{3.124}\\
&\quad + \exp\left(-\frac{C_\xi^2 - d}{8}\right). \tag{3.125}
\end{align}
Then using Lemma 3.6.2, we get that (3.110) is satisfied with $\varepsilon_1$, and the induction step follows from item 2 of Lemma 3.7.1.
Finally, once we have $\left\|X_0^1 - x^\star\right\| \le \frac{\mathcal{R}}{\sqrt{T}}$, the conclusion about $X^1$ follows from item 1 of Lemma 3.7.1.
3.8 Simulations
We test our algorithm against other sampling algorithms on a synthetic dataset for
logistic regression. The dataset consists of 𝑇 = 1000 data points in dimension 𝑑 = 20.
We compare the marginal accuracies of the algorithms.
The data is generated as follows. First, $\theta \sim N(0, I_d)$ and $b \sim N(0, 1)$ are randomly generated. For each $1 \le t \le T$, a feature vector $x_t \in \mathbb{R}^d$ and output $y_t \in \{0, 1\}$ are generated by
\begin{align}
x_{t,i} &\sim \operatorname{Bernoulli}\left(\frac{s}{d}\right), \quad 1 \le i \le d \tag{3.126}\\
y_t &\sim \operatorname{Bernoulli}\left(\sigma(\theta^\top x_t + b)\right), \tag{3.127}
\end{align}
where the sparsity is $s = 5$ in our simulations, and $\sigma(x) = \frac{1}{1+e^{-x}}$ is the logistic function. We chose $x_t \in \{0,1\}^d$ because in applications, features are often indicators.
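The generative process (3.126)–(3.127) is a few lines of code (a sketch using the stated parameters; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, s = 1000, 20, 5                                 # data points, dimension, sparsity

theta = rng.normal(size=d)                            # theta ~ N(0, I_d)
b = rng.normal()                                      # b ~ N(0, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                   # logistic function

X = (rng.random((T, d)) < s / d).astype(float)        # x_{t,i} ~ Bernoulli(s/d)   (3.126)
y = (rng.random(T) < sigmoid(X @ theta + b)).astype(int)  # y_t, cf. (3.127)
```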
The algorithms are tested in an online setting as follows. At epoch $t$, each algorithm has access to $(x_s, y_s)$ for $s \le t$, and attempts to generate a sample from the posterior distribution
\[
p_t(\theta) \propto e^{-\frac{\|\theta\|^2}{2}}\,e^{-\frac{b^2}{2}}\prod_{s=1}^t \sigma(\theta^\top x_s + b);
\]
the time per epoch is limited to 0.1 seconds.
We estimate the quality of the samples at 𝑡 = 𝑇 = 1000, by saving the state of the
algorithm at 𝑡 = 𝑇 − 1, and re-running it 1000 times to collect 1000 samples. We
replicate this entire simulation 8 times, and the marginal accuracies of the runs are
given in Figure 3.1.
The marginal accuracy (MA) is a heuristic to compare accuracy of samplers (see e.g. [DMS17], [FOW11], and [C+17]). The marginal accuracy between the measure $\mu$ of a sample and the target $\pi$ is
\[
\mathrm{MA}(\mu, \pi) := 1 - \frac{1}{2d}\sum_{i=1}^d\|\mu_i - \pi_i\|_{TV},
\]
where $\mu_i$ and $\pi_i$ are the marginal distributions of $\mu$ and $\pi$ for the coordinate $x_i$. Since MALA is known to
sample from the correct stationary distribution for the class of distributions analyzed
in this paper, we let 𝜋 be the estimate of the true distribution obtained from 1000
samples generated from running MALA for a long time (1000 steps). We estimate
the TV distance by the TV distance between the histograms when the bin widths are
0.25 times the sample standard deviation for the corresponding coordinate of 𝜋.
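The estimator just described can be sketched as follows (our code; it assumes two arrays of samples, the second being the reference run that plays the role of $\pi$ and sets the bin widths):

```python
import numpy as np

def marginal_accuracy(samples_mu, samples_pi, bin_factor=0.25):
    """MA(mu, pi) = 1 - (1/(2d)) * sum_i TV(mu_i, pi_i), with each coordinate's
    TV distance estimated from histograms whose bin width is
    bin_factor * (sample std of pi's coordinate)."""
    d = samples_pi.shape[1]
    tv = 0.0
    for i in range(d):
        w = bin_factor * samples_pi[:, i].std()
        lo = min(samples_mu[:, i].min(), samples_pi[:, i].min())
        hi = max(samples_mu[:, i].max(), samples_pi[:, i].max())
        edges = np.arange(lo, hi + 2 * w, w)          # cover both sample ranges
        p, _ = np.histogram(samples_mu[:, i], bins=edges)
        q, _ = np.histogram(samples_pi[:, i], bins=edges)
        tv += 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()
    return 1.0 - tv / (2 * d)
```

Identical sample sets give MA = 1, while samples with disjoint supports give MA = 0.5 under this normalization.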
We compare our online SAGA-LD algorithm with SGLD, online Laplace approx-
imation, Polya-Gamma, and MALA. The Laplace method approximates the target
distribution with a multivariate Gaussian distribution. Here, one first finds the mode
of the target distribution using a deterministic optimization technique and then com-
putes the Hessian ∇2𝐹𝑡 of the log-posterior at the mode. The inverse of this Hessian
is the covariance matrix of the Gaussian. In the online version of the algorithm
we use, given in [CL11], to speed up optimization, only a quadratic approximation
(with diagonal Hessian) to the log-posterior is maintained. The Polya-Gamma chain
[DFE18] is a Markov chain specialized to sample from the posterior for logistic re-
111
Algorithm Mean marginal accuracy
SGLD 0.442Online Laplace 0.571
MALA 0.901Polya-Gamma 0.921
SAGA-LD 0.921
Figure 3.1: Marginal accuracies of 5 different sampling algorithms on online logisticregression, with 𝑇 = 1000 data points, dimension 𝑑 = 20, and time 0.1 seconds,averaged over 8 runs. SGLD and online Laplace perform much worse and are notpictured.
gression. Note that in contrast, our algorithm works more generally for any smooth
probability distribution over R𝑑.
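As a concrete illustration of the Laplace approximation described above (a generic offline, full-Hessian sketch of ours, not the online diagonal variant of [CL11]): find the mode by optimization, then use the inverse Hessian of the negative log-posterior at the mode as the Gaussian covariance.

```python
import numpy as np

def laplace_approximation(grad, hess, x0, steps=500, lr=0.1):
    """Gaussian approximation N(mode, H^{-1}), where H is the Hessian of the
    negative log-posterior at the mode (found here by plain gradient descent)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)              # descend the negative log-posterior
    cov = np.linalg.inv(hess(x))          # covariance = inverse Hessian at mode
    return x, cov

# toy check: for a Gaussian posterior N(mu, I) the approximation is exact
mu = np.array([1.0, -2.0])
mode, cov = laplace_approximation(grad=lambda x: x - mu,
                                  hess=lambda x: np.eye(2),
                                  x0=np.zeros(2))
```

For non-Gaussian posteriors the approximation is only as good as the quadratic fit of the log-posterior near the mode, which is why it underperforms on the logistic posterior above.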
The parameters are as follows. The step size at epoch $t$ is $\frac{0.1}{1+0.5t}$ for MALA, $\frac{0.01}{1+0.5t}$ for SGLD, and $\frac{0.05}{1+0.5t}$ for SAGA-LD. A smaller step size must be used with SGLD because of the increased variance. For MALA, a larger step size can be used because the Metropolis-Hastings acceptance step ensures the stationary distribution is correct. The batch size for SGLD and SAGA-LD is 64.
Our results show that SAGA-LD is competitive with the best sampler for logistic
regression, namely, the Polya-Gamma Markov chain.
3.9 Discussion and future work
Comparison to using a regularizer. Recall that one issue in proving Theo-
rem 3.2.4 is that we don’t assume the 𝑓𝑡 are strongly convex. One way to get around
this is to add a strongly convex regularizer, and use existing results for Langevin in
the strongly convex case; however, because we are not leveraging the concentration
that already exists (Assumption 3.2.2), the polynomial dependence is worse.
In the online case, one would have to add $\varepsilon_t\|x - \hat{x}_t\|^2$ to the objective, where $\hat{x}_t$ is an estimate of the mode $x_t^\star$. Assuming we have such an estimate, using results on Langevin in the strongly convex case, to get $\varepsilon$ TV-error we would require $\widetilde{O}\left(\frac{1}{\varepsilon^6}\right)$ steps per epoch, rather than $\widetilde{O}\left(\frac{1}{\varepsilon^4}\right)$ as in the current proof (see Theorem 3.6.5). (Specifically, use [DMM18, Corollary 22] with strong convexity $m = \varepsilon_t$ to get that $\widetilde{O}\left(\frac{1}{\varepsilon^3}\right)$ iterations are required to get KL-error $\varepsilon$, and apply Pinsker's inequality.)
Preconditioning. We would like to obtain similar bounds under more general
assumptions where the covariance matrix could change at each epoch and be ill-
conditioned. This type of distribution arises in reinforcement learning applications
such as Thompson sampling [DFE18], where the data is determined by the user’s
actions. If the user favors actions in certain “optimal” directions, the distribution will
have a much smaller covariance in those directions than in other directions, causing
the covariance matrix of the target distribution to become more ill-conditioned over
time.
Improved bounds for strongly convex functions. Suppose that we dropped the requirement of independence. Note that if we use SAGA-LD with the last sample from the previous epoch, we have a warm start for the previous distribution, and would be able to achieve TV error that decreases with $t$, using $\widetilde{O}_T(1)$ time per epoch. It seems possible to reduce the TV error to $O\left(\frac{\varepsilon}{t^{1/6}}\right)$ this way, and possibly to $O\left(\frac{\varepsilon}{t^{1/4}}\right)$ with stronger drift assumptions. These guarantees may also extend to subexponential distributions.
Distributions over discrete spaces. There has been work on stochastic methods
in the setting of discrete variables [DCW18] that could potentially be used to develop
analogous theory in the discrete case.
Bibliography
[AC93] James H Albert and Siddhartha Chib. “Bayesian analysis of binary and
polychotomous response data”. In: Journal of the American statistical
Association 88.422 (1993), pp. 669–679.
[ADH10] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. “Particle
markov chain Monte Carlo methods”. In: Journal of the Royal Statistical
Society: Series B (Statistical Methodology) 72.3 (2010), pp. 269–342.
[AG12] Shipra Agrawal and Navin Goyal. “Analysis of thompson sampling for
the multi-armed bandit problem”. In: Conference on Learning Theory.
2012, pp. 39–1.
[AG13a] Shipra Agrawal and Navin Goyal. “Further optimal regret bounds for
thompson sampling”. In: Artificial Intelligence and Statistics. 2013, pp. 99–
107.
[AG13b] Shipra Agrawal and Navin Goyal. “Thompson sampling for contextual
bandits with linear payoffs”. In: International Conference on Machine
Learning. 2013, pp. 127–135.
[AG16] Animashree Anandkumar and Rong Ge. “Efficient approaches for escap-
ing higher order saddle points in non-convex optimization”. In: Confer-
ence on learning theory. 2016, pp. 81–102.
[AHK12] Sanjeev Arora, Elad Hazan, and Satyen Kale. “The multiplicative weights
update method: a meta-algorithm and applications”. In: Theory of Com-
puting 8.1 (2012), pp. 121–164.
[Bak+08] Dominique Bakry, Franck Barthe, Patrick Cattiaux, and Arnaud Guillin.
“A simple proof of the Poincare inequality for a large class of probability
measures including the log-concave case”. In: Electron. Commun. Probab
13 (2008), pp. 60–66.
[Bal19] Jonathan Balkind. Who killed Granny? personal communication. 2019.
[BDT16] Rina Foygel Barber, Mathias Drton, and Kean Ming Tan. “Laplace ap-
proximation in high-dimensional Bayesian regression”. In: Statistical Anal-
ysis for High-Dimensional Data. Springer, 2016, pp. 15–36.
[BE85] Dominique Bakry and Michel Emery. “Diffusions hypercontractives”. In:
Seminaire de Probabilites XIX 1983/84. Springer, 1985, pp. 177–206.
[Bel+15] Alexandre Belloni, Tengyuan Liang, Hariharan Narayanan, and Alexan-
der Rakhlin. “Escaping the local minima via simulated annealing: Opti-
mization of approximately convex functions”. In: Conference on Learning
Theory. 2015, pp. 240–265.
[BEL18] Sebastien Bubeck, Ronen Eldan, and Joseph Lehec. “Sampling from a
log-concave distribution with Projected Langevin Monte Carlo”. In: Dis-
crete & Computational Geometry 59.4 (2018), pp. 757–783.
[BGK05] Anton Bovier, Veronique Gayrard, and Markus Klein. “Metastability
in reversible diffusion processes II: Precise asymptotics for small eigen-
values”. In: Journal of the European Mathematical Society 7.1 (2005),
pp. 69–99.
[BGL13] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geome-
try of Markov diffusion operators. Vol. 348. Springer Science & Business
Media, 2013.
[Bha78] RN Bhattacharya. “Criteria for recurrence and existence of invariant
measures for multidimensional diffusions”. In: The Annals of Probabil-
ity (1978), pp. 541–553.
[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. “Latent dirich-
let allocation”. In: Journal of machine Learning research 3.Jan (2003),
pp. 993–1022.
[Bov+02] Anton Bovier, Michael Eckhoff, Veronique Gayrard, and Markus Klein.
“Metastability and Low Lying Spectra in Reversible Markov Chains”. In:
Communications in mathematical physics 228.2 (2002), pp. 219–255.
[Bov+04] Anton Bovier, Michael Eckhoff, Veronique Gayrard, and Markus Klein.
“Metastability in reversible diffusion processes I: Sharp asymptotics for
capacities and exit times”. In: Journal of the European Mathematical
Society 6.4 (2004), pp. 399–424.
[Bro+11] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Hand-
book of markov chain monte carlo. CRC press, 2011.
[Bro+13] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and
Michael I Jordan. “Streaming variational bayes”. In: Advances in Neural
Information Processing Systems. 2013, pp. 1727–1735.
[C+17] Nicolas Chopin, James Ridgway, et al. “Leave Pima Indians alone: bi-
nary regression as a benchmark for Bayesian computation”. In: Statistical
Science 32.1 (2017), pp. 64–87.
[CB17] Trevor Campbell and Tamara Broderick. “Automated Scalable Bayesian
Inference via Hilbert Coresets”. In: arXiv preprint arXiv:1710.05053 (2017).
[CB18a] Trevor Campbell and Tamara Broderick. “Bayesian coreset construction
via greedy iterative geodesic ascent”. In: arXiv preprint arXiv:1802.01737
(2018).
[CB18b] Xiang Cheng and Peter L Bartlett. “Convergence of Langevin MCMC
in KL-divergence”. In: Algorithmic Learning Theory, PMLR. 83. 2018,
pp. 186–211.
[Cha+16] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, and Yann LeCun.
“Entropy-SGD: Biasing gradient descent into wide valleys”. In: arXiv
preprint arXiv:1611.01838 (2016).
[Cha+18] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and
Michael Jordan. “On the Theory of Variance Reduction for Stochastic
Gradient Monte Carlo”. In: Proceedings of the 35th International Con-
ference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause.
Vol. 80. Proceedings of Machine Learning Research. Stockholmsmassan,
Stockholm Sweden: PMLR, Oct. 2018, pp. 764–773. url: http://proceedings.
mlr.press/v80/chatterji18a.html.
[Che+17] Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan.
“Underdamped Langevin MCMC: A non-asymptotic analysis”. In: arXiv
preprint arXiv:1707.03663 (2017).
[Che+18] Xiang Cheng, Niladri S Chatterji, Yasin Abbasi-Yadkori, Peter L Bartlett,
and Michael I Jordan. “Sharp convergence rates for langevin dynamics
in the nonconvex setting”. In: arXiv preprint arXiv:1805.01648 (2018).
[CL06] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games.
Cambridge university press, 2006.
[CL11] O. Chapelle and L. Li. “An empirical evaluation of Thompson sampling”.
In: Advances in Neural Information Processing Systems. 2011.
[Cov11] Thomas M Cover. “Universal portfolios”. In: The Kelly Capital Growth
Investment Criterion: Theory and Practice. World Scientific, 2011, pp. 181–
209.
[CSI12] Ming-Hui Chen, Qi-Man Shao, and Joseph G Ibrahim. Monte Carlo
methods in Bayesian computation. Springer Science & Business Media,
2012.
[D+91] Persi Diaconis, Daniel Stroock, et al. “Geometric bounds for eigenvalues
of Markov chains”. In: The Annals of Applied Probability 1.1 (1991),
pp. 36–61.
[Dal16] Arnak S Dalalyan. “Theoretical guarantees for approximate sampling
from smooth and log-concave densities”. In: Journal of the Royal Statis-
tical Society: Series B (Statistical Methodology) (2016).
[Dal17] Arnak Dalalyan. “Further and stronger analogy between sampling and
optimization: Langevin Monte Carlo and gradient descent”. In: Proceed-
ings of the 2017 Conference on Learning Theory. Ed. by Satyen Kale
and Ohad Shamir. Vol. 65. Proceedings of Machine Learning Research.
Amsterdam, Netherlands: PMLR, July 2017, pp. 678–689. url: http:
//proceedings.mlr.press/v65/dalalyan17a.html.
[DCW18] Chris De Sa, Vincent Chen, and Wing Wong. “Minibatch Gibbs Sam-
pling on Large Graphical Models”. In: Proceedings of the 35th Inter-
national Conference on Machine Learning. Ed. by Jennifer Dy and An-
dreas Krause. Vol. 80. Proceedings of Machine Learning Research. Stock-
holmsmassan, Stockholm Sweden: PMLR, Oct. 2018, pp. 1165–1173.
url: http://proceedings.mlr.press/v80/desa18a.html.
[DFE18] Bianca Dumitrascu, Karen Feng, and Barbara E Engelhardt. “PG-TS:
Improved Thompson Sampling for Logistic Contextual Bandits”. In: Ad-
vances in neural information processing systems. 2018.
[DFK91] Martin Dyer, Alan Frieze, and Ravi Kannan. “A random polynomial-time
algorithm for approximating the volume of convex bodies”. In: Journal
of the ACM (JACM) 38.1 (1991), pp. 1–17.
[DHW12] Pierre Del Moral, Peng Hu, and Liming Wu. “On the concentration prop-
erties of interacting particle processes”. In: Foundations and Trends in Machine Learning 3.3–4 (2012), pp. 225–389.
[Dia09] Persi Diaconis. “The markov chain monte carlo revolution”. In: Bulletin
of the American Mathematical Society 46.2 (2009), pp. 179–205.
[Dia11] Persi Diaconis. “The mathematics of mixing things up”. In: Journal of
Statistical Physics 144.3 (2011), p. 445.
[DK17] Arnak S Dalalyan and Avetik G Karagulyan. “User-friendly guaran-
tees for the Langevin Monte Carlo with inaccurate gradient”. In: arXiv
preprint arXiv:1710.00095 (2017).
[DM16] Alain Durmus and Eric Moulines. “High-dimensional Bayesian inference
via the Unadjusted Langevin Algorithm”. In: (2016).
[DMM18] Alain Durmus, Szymon Majewski, and B lazej Miasojedow. “Analysis
of Langevin Monte Carlo via convex optimization”. In: arXiv preprint
arXiv:1802.09188 (2018).
[DMS17] Alain Durmus, Eric Moulines, and Eero Saksman. “On the convergence of
Hamiltonian Monte Carlo”. In: arXiv preprint arXiv:1705.00166 (2017).
[Dou+00] Arnaud Doucet, Nando De Freitas, Kevin Murphy, and Stuart Russell.
“Rao-Blackwellised particle filtering for dynamic Bayesian networks”. In:
Proceedings of the Sixteenth conference on Uncertainty in artificial intel-
ligence. Morgan Kaufmann Publishers Inc. 2000, pp. 176–183.
[Dub+16] Kumar Avinava Dubey, Sashank J Reddi, Sinead A Williamson, Barn-
abas Poczos, Alexander J Smola, and Eric P Xing. “Variance reduction
in stochastic gradient Langevin dynamics”. In: Advances in neural infor-
mation processing systems. 2016, pp. 1154–1162.
[Dwi+18] Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. “Log-
concave sampling: Metropolis-Hastings algorithms are fast!” In: Proceed-
ings of the 2018 Conference on Learning Theory, PMLR 75. 2018.
[Fos+18] Dylan J Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik
Sridharan. “Logistic Regression: The Importance of Being Improper”. In:
Proceedings of Machine Learning Research vol 75 (2018), pp. 1–42.
[FOW11] Christel Faes, John T Ormerod, and Matt P Wand. “Variational Bayesian
inference for parametric and nonparametric regression with missing data”.
In: Journal of the American Statistical Association 106.495 (2011), pp. 959–
971.
[FRT14] Cameron E Freer, Daniel M Roy, and Joshua B Tenenbaum. Towards
common-sense reasoning via conditional simulation: legacies of Turing
in Artificial Intelligence. 2014.
[GD17] Francois Giraud and Pierre Del Moral. “Nonasymptotic analysis of adap-
tive and annealed Feynman–Kac particle models”. In: Bernoulli 23.1
(2017), pp. 670–709.
[Ge+15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. “Escaping from saddle
points—online stochastic gradient for tensor decomposition”. In: Confer-
ence on Learning Theory. 2015, pp. 797–842.
[GLR18] Rong Ge, Holden Lee, and Andrej Risteski. “Beyond Log-concavity:
Provable Guarantees for Sampling Multi-modal Distributions using Sim-
ulated Tempering Langevin Monte Carlo”. In: Advances in Neural In-
formation Processing Systems 31. Curran Associates, Inc., 2018. url:
http://tiny.cc/glr17.
[Gru+09] Natalie Grunewald, Felix Otto, Cedric Villani, and Maria G Westdicken-
berg. “A two-scale approach to logarithmic Sobolev inequalities and the
hydrodynamic limit”. In: Annales de l’IHP Probabilites et statistiques.
Vol. 45. 2. 2009, pp. 302–351.
[HAK07] Elad Hazan, Amit Agarwal, and Satyen Kale. “Logarithmic regret al-
gorithms for online convex optimization”. In: Machine Learning 69.2-3
(2007), pp. 169–192.
[Har04] Gilles Harge. “A convex/log-concave correlation inequality for Gaussian
measure and an application to abstract Wiener spaces”. In: Probability
theory and related fields 130.3 (2004), pp. 415–440.
[Haz16] Elad Hazan. “Introduction to online convex optimization”. In: Founda-
tions and Trends R in Optimization 2.3-4 (2016), pp. 157–325.
[HCB16] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. “Coresets
for scalable bayesian logistic regression”. In: Advances in Neural Infor-
mation Processing Systems. 2016, pp. 4080–4088.
[HH13] John Hammersley and D. C. Handscomb. Monte carlo methods. Springer
Science & Business Media, 2013.
[HKL14] Elad Hazan, Tomer Koren, and Kfir Y Levy. “Logistic regression: Tight
bounds for stochastic and online optimization”. In: Conference on Learn-
ing Theory. 2014, pp. 197–209.
[Jin+17] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I
Jordan. “How to escape saddle points efficiently”. In: Proceedings of the
34th International Conference on Machine Learning-Volume 70. JMLR.
org. 2017, pp. 1724–1732.
[Jin+18] Chi Jin, Lydia T Liu, Rong Ge, and Michael I Jordan. “On the local min-
ima of the empirical risk”. In: Advances in Neural Information Processing
Systems. 2018, pp. 4901–4910.
[JS93] Mark Jerrum and Alistair Sinclair. “Polynomial-time approximation al-
gorithms for the Ising model”. In: SIAM Journal on computing 22.5
(1993), pp. 1087–1116.
[JSV04] Mark Jerrum, Alistair Sinclair, and Eric Vigoda. “A polynomial-time
approximation algorithm for the permanent of a matrix with nonnegative
entries”. In: Journal of the ACM (JACM) 51.4 (2004), pp. 671–697.
[KM15] Vladimir Koltchinskii and Shahar Mendelson. “Bounding the smallest
singular value of a random matrix without concentration”. In: Interna-
tional Mathematics Research Notices 2015.23 (2015), pp. 12991–13008.
[KW13] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”.
In: arXiv preprint arXiv:1312.6114 (2013).
[Lel09] Tony Lelievre. “A general two-scale criteria for logarithmic Sobolev in-
equalities”. In: Journal of Functional Analysis 256.7 (2009), pp. 2211–
2221.
[Lia05] Faming Liang. “Determination of normalizing constants for simulated
tempering”. In: Physica A: Statistical Mechanics and its Applications
356.2-4 (2005), pp. 468–480.
[Liu08] Jun S Liu. Monte Carlo strategies in scientific computing. Springer Sci-
ence & Business Media, 2008.
[LM00] Beatrice Laurent and Pascal Massart. “Adaptive estimation of a quadratic
functional by model selection”. In: Annals of Statistics (2000), pp. 1302–
1338.
[LMV19] Holden Lee, Oren Mangoubi, and Nisheeth K Vishnoi. “Online Sampling
from Log-Concave Distributions”. In: arXiv preprint arXiv:1902.08179
(2019).
[LR16] Yuanzhi Li and Andrej Risteski. “Algorithms and matching lower bounds
for approximately-convex optimization”. In: Advances in Neural Infor-
mation Processing Systems. 2016, pp. 4745–4753.
[LS93] Laszlo Lovasz and Miklos Simonovits. “Random walks in a convex body
and an improved volume algorithm”. In: Random structures & algorithms
4.4 (1993), pp. 359–412.
[LV06] Laszlo Lovasz and Santosh Vempala. “Fast Algorithms for Logconcave
Functions: Sampling, Rounding, Integration and Optimization”. In: Pro-
ceedings of the 47th Annual IEEE Symposium on Foundations of Com-
puter Science. FOCS ’06. Washington, DC, USA: IEEE Computer Soci-
ety, 2006, pp. 57–68. isbn: 0-7695-2720-5. doi: 10.1109/FOCS.2006.28.
url: http://dx.doi.org/10.1109/FOCS.2006.28.
[Mar+19] Anton Martinsson, Jianfeng Lu, Benedict Leimkuhler, and Eric Vanden-
Eijnden. “The simulated tempering method in the infinite switch limit
with adaptive weight learning”. In: Journal of Statistical Mechanics: The-
ory and Experiment 2019.1 (2019), p. 013207.
[Men14] Shahar Mendelson. “Learning without concentration”. In: Conference on
Learning Theory. 2014, pp. 25–39.
[MHB16] Stephan Mandt, Matthew Hoffman, and David Blei. “A Variational Anal-
ysis of Stochastic Gradient Algorithms”. In: Proceedings of The 33rd In-
ternational Conference on Machine Learning. Ed. by Maria Florina Bal-
can and Kilian Q. Weinberger. Vol. 48. Proceedings of Machine Learning
Research. New York, New York, USA: PMLR, 20–22 Jun 2016, pp. 354–
363. url: http://proceedings.mlr.press/v48/mandt16.html.
[Mou+19] Wenlong Mou, Nhat Ho, Martin J. Wainwright, Peter Bartlett, and
Michael I. Jordan. “Polynomial-time Algorithm for Power Posterior Sampling in Bayesian Mixture Models”. In: Preprint (2019).
[MP92] Enzo Marinari and Giorgio Parisi. “Simulated tempering: a new Monte
Carlo scheme”. In: EPL (Europhysics Letters) 19.6 (1992), p. 451.
[MR02] Neal Madras and Dana Randall. “Markov chain decomposition for con-
vergence rate analysis”. In: Annals of Applied Probability (2002), pp. 581–
606.
[MS17] Oren Mangoubi and Aaron Smith. “Rapid Mixing of Hamiltonian Monte
Carlo on Strongly Log-Concave Distributions”. In: arXiv preprint arXiv:1708.07114
(2017).
[MV17] Oren Mangoubi and Nisheeth K Vishnoi. “Convex Optimization with
Nonconvex Oracles”. In: arXiv preprint arXiv:1711.02621 (2017).
[Nag+17] Tigran Nagapetyan, Andrew B Duncan, Leonard Hasenclever, Sebastian
J Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. “The true cost of
stochastic gradient Langevin dynamics”. In: arXiv preprint arXiv:1706.02692
(2017).
[Nea00] Radford M Neal. “Markov chain sampling methods for Dirichlet process
mixture models”. In: Journal of computational and graphical statistics
9.2 (2000), pp. 249–265.
[Nea01] Radford M Neal. “Annealed importance sampling”. In: Statistics and
computing 11.2 (2001), pp. 125–139.
[Nea96] Radford M Neal. “Sampling from multimodal distributions using tem-
pered transitions”. In: Statistics and computing 6.4 (1996), pp. 353–366.
[Nic12] Richard Nickl. “Statistical Theory”. In: Statistical Laboratory, Depart-
ment of Pure Mathematics and Mathematical Statistics, University of
Cambridge (2012).
[NR17] Hariharan Narayanan and Alexander Rakhlin. “Efficient sampling from
time-varying log-concave distributions”. In: The Journal of Machine Learn-
ing Research 18.1 (2017), pp. 4017–4045.
[PJT15] Daniel Paulin, Ajay Jasra, and Alexandre Thiery. “Error Bounds for Se-
quential Monte Carlo Samplers for Multimodal Distributions”. In: arXiv
preprint arXiv:1509.08775 (2015).
[PP07] Sanghyun Park and Vijay S Pande. “Choosing weights for simulated
tempering”. In: Physical Review E 76.1 (2007), p. 016703.
[RMW14] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. “Stochas-
tic Backpropagation and Approximate Inference in Deep Generative Mod-
els”. In: International Conference on Machine Learning. 2014, pp. 1278–
1286.
[RRT17] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. “Non-convex
learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic
analysis”. In: Conference on Learning Theory. 2017, pp. 1674–1703.
[Rus+18] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband,
Zheng Wen, et al. “A tutorial on Thompson sampling”. In: Foundations and Trends in Machine Learning 11.1 (2018), pp. 1–96.
[Sch12] Nikolaus Schweizer. “Non-asymptotic error bounds for sequential MCMC
and stability of Feynman-Kac propagators”. In: arXiv preprint arXiv:1204.2382
(2012).
[SR11] David Sontag and Dan Roy. “Complexity of inference in latent dirichlet
allocation”. In: Advances in neural information processing systems. 2011,
pp. 1008–1016.
[SS09] Ashok N Srivastava and Mehran Sahami. Text mining: Classification,
clustering, and applications. Chapman and Hall/CRC, 2009.
[TD14] Christopher Tosh and Sanjoy Dasgupta. “Lower bounds for the Gibbs
sampler over mixtures of Gaussians”. In: International Conference on
Machine Learning. 2014, pp. 1467–1475.
[TRR18] Nicholas G Tawn, Gareth O Roberts, and Jeffrey S Rosenthal. “Weight-
Preserving Simulated Tempering”. In: arXiv preprint arXiv:1808.04782
(2018).
[Vem05] Santosh Vempala. “Geometric random walks: a survey”. In: Combinato-
rial and computational geometry 52.573-612 (2005), p. 2.
[VW19] Santosh S Vempala and Andre Wibisono. “Rapid Convergence of the Un-
adjusted Langevin Algorithm: Log-Sobolev Suffices”. In: arXiv preprint
arXiv:1903.08568 (2019).
[W+08] Martin J Wainwright, Michael I Jordan, et al. “Graphical models, exponential families, and variational inference”. In: Foundations and Trends in Machine Learning 1.1–2 (2008), pp. 1–305.
[WPB11] Chong Wang, John Paisley, and David Blei. “Online variational inference
for the hierarchical Dirichlet process”. In: Proceedings of the Fourteenth
International Conference on Artificial Intelligence and Statistics. 2011,
pp. 752–760.
[WSH09a] Dawn B Woodard, Scott C Schmidler, and Mark Huber. “Conditions for
rapid mixing of parallel and simulated tempering on multimodal distri-
butions”. In: The Annals of Applied Probability (2009), pp. 617–640.
[WSH09b] Dawn B Woodard, Scott C Schmidler, and Mark Huber. “Sufficient con-
ditions for torpid mixing of parallel and simulated tempering”. In: Elec-
tronic Journal of Probability 14 (2009), pp. 780–804.
[WT11] Max Welling and Yee W Teh. “Bayesian learning via stochastic gradient
Langevin dynamics”. In: Proceedings of the 28th International Confer-
ence on Machine Learning (ICML-11). 2011, pp. 681–688.
[YZM17] Nanyang Ye, Zhanxing Zhu, and Rafal Mantiuk. “Langevin Dynamics
with Continuous Tempering for Training Deep Neural Networks”. In: Ad-
vances in Neural Information Processing Systems 30. Ed. by I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and
R. Garnett. Curran Associates, Inc., 2017, pp. 618–626. url: http://
papers.nips.cc/paper/6664-langevin-dynamics-with-continuous-
tempering-for-training-deep-neural-networks.pdf.
[Zhe03] Zhongrong Zheng. “On swapping and simulated tempering algorithms”.
In: Stochastic Processes and their Applications 104.1 (2003), pp. 131–154.
[Zin03] Martin Zinkevich. “Online convex programming and generalized infinites-
imal gradient ascent”. In: Proceedings of the 20th International Confer-
ence on Machine Learning (ICML-03). 2003, pp. 928–936.
[ZLC17] Yuchen Zhang, Percy Liang, and Moses Charikar. “A Hitting Time Anal-
ysis of Stochastic Gradient Langevin Dynamics”. In: Conference on Learn-
ing Theory. 2017, pp. 1980–2022.
Appendix A
Background on Markov chains and
processes
A.1 Markov chains and processes
Throughout, we will use upper-case 𝑃 for probability measures, and lower-case 𝑝 for
the corresponding density function (although we will occasionally abuse notation and
let 𝑝 stand in for the measure, as well).
A discrete-time (time-invariant) Markov chain on Ω is a probability law on a
sequence of random variables (𝑋𝑡)𝑡∈N0 taking values in Ω, such that the next state
X_{t+1} only depends on the previous state X_t, in a fixed way. More formally, letting ℱ_t = σ((X_s)_{0≤s≤t}), there is a transition kernel T on Ω (i.e., T(x, ·) is a probability measure and T(·, A) is a measurable function for any measurable A) such that
P(𝑋𝑡+1 ∈ ·|ℱ𝑡) = 𝑇 (𝑋𝑡, ·). (A.1)
A stationary measure is 𝑃 such that if 𝑋0 ∼ 𝑃 , then 𝑋𝑡 ∼ 𝑃 for all 𝑡. The idea of
Markov chain Monte Carlo is to design a Markov chain whose stationary distribution
is 𝑃 with good mixing; that is, if 𝜋𝑡 is the probability distribution at time 𝑡, then
π_t → P rapidly as t → ∞.
The Markov chains we consider will be discretized versions of continuous-time
Markov processes, so we will mainly work with Markov processes (postponing dis-
cretization analysis until the end).
A continuous-time Markov process is defined not by a single transition kernel T but by a family of kernels (P_t)_{t≥0}, and a more natural object to work with is its generator.
Definition A.1.1. A (continuous-time, time-invariant) Markov process is given
by 𝑀 = (Ω, (𝑃𝑡)𝑡≥0), where each 𝑃𝑡 is a transition kernel. It defines the random
process (𝑋𝑡)𝑡≥0 by
P(X_{s+t} ∈ A | ℱ_s) = P(X_{s+t} ∈ A | X_s) = P_t(X_s, A) = ∫_A P_t(X_s, dy)
where ℱ_s = σ((X_r)_{0≤r≤s}). Define the action of 𝒫_t on functions by

(𝒫_t g)(x) = E_{y∼P_t(x,·)} g(y) = ∫_Ω g(y) P_t(x, dy). (A.2)
A stationary measure is 𝑃 such that if 𝑋0 ∼ 𝑃 , then 𝑋𝑡 ∼ 𝑃 for all 𝑡. A
Markov process with stationary measure 𝑃 is reversible if P𝑡 is self-adjoint with
respect to 𝐿2(𝑃 ), i.e., as measures 𝑃 (𝑑𝑥)𝑃𝑡(𝑥, 𝑑𝑦) = 𝑃 (𝑑𝑦)𝑃𝑡(𝑦, 𝑑𝑥).
Define the generator ℒ by

ℒg = lim_{t↓0} (𝒫_t g − g)/t, (A.3)

and let 𝒟(ℒ) denote the space of g ∈ L²(P) for which ℒg ∈ L²(P) is well-defined. If P is the unique stationary measure, define the Dirichlet form and the variance by

ℰ_M(g, h) = −⟨g, ℒh⟩_P (A.4)
Var_P(g) = ‖g − ∫_Ω g P(dx)‖²_{L²(P)} (A.5)
Note that in order for (P_t)_{t≥0} to be a valid Markov process, it must be the case that 𝒫_t 𝒫_u g = 𝒫_{t+u} g, i.e., (𝒫_t)_{t≥0} forms a Markov semigroup. All the Markov processes we consider will be reversible.
We will use the shorthand E (𝑔) := E (𝑔, 𝑔).
Definition A.1.2. A continuous Markov process (given by Definition A.1.1) satisfies
a Poincaré inequality with constant C if for all g ∈ 𝒟(ℒ),

ℰ_M(g, g) ≥ (1/C) Var_P(g). (A.6)
We will implicitly assume g ∈ 𝒟(ℒ) every time we write ℰ_M(g, g). The largest ρ such that ℰ_M(g, g) ≥ ρ Var_P(g) for all g is the spectral gap Gap(M) of the Markov process.
A Poincaré inequality implies rapid mixing: if C is minimal such that M satisfies a Poincaré inequality with constant C, so that Gap(M) = 1/C, it can be shown that¹

‖𝒫_t g − E_P g‖²_{L²(P)} ≤ e^{−t·Gap(M)} ‖g − E_P g‖²_{L²(P)} = e^{−t/C} ‖g − E_P g‖²_{L²(P)}. (A.7)
We can turn this into a statement about probability distributions, as follows. If the probability distribution at time t is π_t, then setting g = dπ_0/dP (the Radon–Nikodym
¹Note the subtle point that the Poincaré inequality as we defined it only makes sense for g ∈ 𝒟(ℒ), whereas (A.7) makes sense for g ∈ L²(P). For Langevin diffusion, it suffices to show the Poincaré inequality for g ∈ 𝒟(ℒ) to obtain (A.7) for all g ∈ L²(P); see [BGL13]. This will, however, not be an issue for us because we will start with a measure π_0 with smooth density.
derivative) and assuming ‖dπ_0/dP‖_{L²(P)} < ∞, we have

⟨𝒫_t f, dπ_0/dP⟩_{L²(P)} = ∫_Ω 𝒫_t f(x) π_0(dx) = ∫_Ω ∫_Ω f(y) P_t(x, dy) π_0(dx) (A.8)
= ∫_Ω f(y) π_t(dy) = ⟨f, dπ_t/dP⟩_{L²(P)}. (A.9)

If the Markov process is reversible, then ⟨𝒫_t f, dπ_0/dP⟩_{L²(P)} = ⟨f, 𝒫_t(dπ_0/dP)⟩_{L²(P)}. Hence for all f, ⟨f, 𝒫_t(dπ_0/dP)⟩_{L²(P)} = ⟨f, dπ_t/dP⟩_{L²(P)}, so 𝒫_t(dπ_0/dP) = dπ_t/dP. The χ² divergence is defined by χ²(Q||P) = ‖dQ/dP − 1‖²_{L²(P)}, so putting g = dπ_0/dP in (A.7) gives

χ²(π_t||P) ≤ e^{−t/C} χ²(π_0||P). (A.10)
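As a concrete illustration, the decay (A.10) can be checked numerically on a small toy chain (the chain below is a hypothetical example, not from the thesis), using π_t = π_0 e^{tL} and C = 1/Gap(M):

```python
import numpy as np

# Lazy random walk on a 3-point path; L = T - I generates the
# continuous-time chain, with stationary distribution p.
T = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
L = T - np.eye(3)
p = np.array([0.25, 0.5, 0.25])
assert np.allclose(p @ T, p)                      # stationarity

# Reversibility lets us symmetrize: S = D L D^{-1} with D = diag(sqrt(p))
# is symmetric, so e^{tL} = D^{-1} e^{tS} D and Gap(M) is the smallest
# nonzero eigenvalue of -S.
D, Dinv = np.diag(np.sqrt(p)), np.diag(1 / np.sqrt(p))
w, V = np.linalg.eigh(D @ L @ Dinv)
gap = np.sort(-w)[1]

chi2 = lambda q: np.sum((q / p - 1) ** 2 * p)     # chi^2(q || p)
pi0 = np.array([1.0, 0.0, 0.0])                   # start at the left state
for t in [0.5, 1.0, 2.0, 4.0]:
    pi_t = pi0 @ (Dinv @ (V * np.exp(t * w)) @ V.T @ D)   # pi_t = pi_0 e^{tL}
    assert chi2(pi_t) <= np.exp(-t * gap) * chi2(pi0) + 1e-12
print("gap =", gap)
```

For this chain the gap is 1/2, and the bound holds with room to spare (the true decay rate of χ² is in fact twice the gap).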
The following gives one way to prove a Poincaré inequality.
Theorem A.1.3 (Comparison theorem using canonical paths, [D+91]). Suppose Ω is finite. Let T : Ω × Ω → ℝ be a function with T(x, y) ≥ 0 for y ≠ x and ∑_{y∈Ω} T(x, y) = 1. (Think of T as a matrix in ℝ^{Ω×Ω} that operates on functions g : Ω → ℝ, i.e., g ∈ ℝ^Ω.) Let L = T − I, so that L(x, y) = T(x, y) for y ≠ x and L(x, x) = −∑_{y≠x} T(x, y). Consider the Markov process M generated by L (L acts as Lg(j) = ∑_{k≠j} [g(k) − g(j)]T(j, k)); let its Dirichlet form be ℰ(g, g) = −⟨g, Lg⟩ and its stationary distribution be p.

Suppose each pair x, y ∈ Ω, x ≠ y, is associated with a path γ_{x,y}. Define the congestion to be

ρ(γ) = max_{z,w∈Ω, z≠w} [ ( ∑_{γ_{x,y} ∋ (z,w)} |γ_{x,y}| p(x)p(y) ) / (p(z)T(z, w)) ]

where |γ| denotes the length of path γ. Then

Var_p(g) ≤ ρ(γ) ℰ(g, g).
Proof. Note that the statement in [D+91] is for discrete-time Markov chains; we show that our continuous-time result is a simple consequence.

Let ε > 0 be small enough such that T_ε = I + εL = I + ε(T − I) = (1 − ε)I + εT has all entries ≥ 0.² Note that the stationary distribution of M is the same as the stationary distribution of the discrete-time Markov chain generated by T_ε, namely, the (appropriately scaled) eigenvector of T corresponding to the eigenvalue 1. The Dirichlet form for T_ε is −⟨g, (T_ε − I)g⟩ = −ε⟨g, Lg⟩ = εℰ(g, g).

Applying [D+91, Proposition 1′] to T_ε (note T_ε(z, w) = εT(z, w) for z ≠ w) gives

Var_p(g) ≤ max_{z,w∈Ω, z≠w} [ ( ∑_{γ_{x,y} ∋ (z,w)} |γ_{x,y}| p(x)p(y) ) / (p(z)εT(z, w)) ] (εℰ(g, g)) = ρ(γ)ℰ(g, g). (A.11)
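To make Theorem A.1.3 concrete, the following toy computation (a hypothetical example, not from the thesis) evaluates the congestion ρ(γ) for a lazy walk on a 3-point path and checks the resulting Poincaré inequality numerically:

```python
import numpy as np

# Toy chain: lazy random walk on the path {0, 1, 2}, with canonical paths
# running along the line graph.
T = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
p = np.array([0.25, 0.5, 0.25])           # stationary distribution
assert np.allclose(p @ T, p)

# One path gamma_{x,y} per ordered pair x != y, following the line graph.
paths = {(x, y): [(k, k + 1) if x < y else (k, k - 1)
                  for k in range(x, y, 1 if x < y else -1)]
         for x in range(3) for y in range(3) if x != y}

# Congestion rho(gamma): max over directed edges (z, w) of
#   sum_{gamma_{x,y} through (z,w)} |gamma_{x,y}| p(x) p(y) / (p(z) T(z,w)).
rho = max(sum(len(g) * p[x] * p[y] for (x, y), g in paths.items() if (z, w) in g)
          / (p[z] * T[z, w])
          for z in range(3) for w in range(3) if z != w and T[z, w] > 0)

# Check Var_p(g) <= rho * E(g, g) for the generator L = T - I.
L = T - np.eye(3)
dirichlet = lambda g: -p @ (g * (L @ g))  # E(g, g) = -<g, L g>_p
g = np.random.default_rng(0).standard_normal(3)
var = p @ (g - p @ g) ** 2
assert var <= rho * dirichlet(g) + 1e-12
print("congestion rho =", rho)
```

Here ρ(γ) = 2, which for this chain happens to equal 1/Gap(M), so the canonical-path bound is tight.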
A.2 Langevin diffusion
Langevin Monte Carlo is an algorithm for sampling from a measure P with density function p(x) ∝ e^{−f(x)}, given access to ∇f, the gradient of the negative log-pdf. We will always assume that ∫_{ℝ^d} e^{−f(x)} dx < ∞ and f ∈ C²(ℝ^d).
The continuous version, overdamped Langevin diffusion (often simply called Langevin diffusion), is a stochastic process described by the stochastic differential equation

dX_t = −∇f(X_t) dt + √2 dW_t (A.12)

where W_t is the Wiener process (Brownian motion). For us, the crucial fact is that Langevin diffusion converges to the stationary distribution given by p(x) ∝ e^{−f(x)}. The Dirichlet form is

ℰ_M(g, g) = ‖∇g‖²_{L²(P)}. (A.13)

²Alternatively, note that nothing in their proof actually depends on T having all entries ≥ 0, so taking ε = 1 is fine.
Since this depends in a natural way on P, we will also write it as ℰ_P(g, g). A Poincaré inequality for Langevin diffusion thus takes the form

ℰ_P(g, g) = ∫_Ω ‖∇g‖² P(dx) ≥ (1/C) Var_P(g). (A.14)

Showing mixing for Langevin diffusion reduces to showing such an inequality. For strongly log-concave measures, this is a classical result.
Theorem A.2.1 ([BGL13]). Let f be ρ-strongly convex and differentiable. Then for g ∈ 𝒟(ℰ_P), the measure P with density function p(x) ∝ e^{−f(x)} satisfies the Poincaré inequality

ℰ_P(g, g) ≥ ρ Var_P(g).

In particular, this holds for f(x) = ‖x − μ‖²/2 with ρ = 1, giving a Poincaré inequality for the Gaussian distribution.
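As a minimal illustration of how (A.12) is used algorithmically (a sketch with a standard Gaussian as an assumed toy target; this is not code from the thesis), the Euler–Maruyama discretization of Langevin diffusion — the unadjusted Langevin algorithm — samples approximately from p(x) ∝ e^{−f(x)}:

```python
import numpy as np

# Unadjusted Langevin algorithm (ULA): the Euler-Maruyama discretization of
# (A.12), x_{k+1} = x_k - eta * grad_f(x_k) + sqrt(2 * eta) * xi_k.
# Toy target (an assumption for this sketch): f(x) = x^2 / 2, i.e. N(0, 1).
rng = np.random.default_rng(0)
grad_f = lambda x: x

eta, n_steps = 0.01, 2_000
x = np.zeros(10_000)                 # 10,000 chains run in parallel
for _ in range(n_steps):
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal(x.size)

# After n_steps * eta = 20 time units the chains are close to stationarity;
# the empirical moments match N(0, 1) up to O(eta) discretization bias.
print(x.mean(), x.var())
```

The O(η) bias is visible if η is made large; the discretization analysis in the thesis quantifies exactly this kind of error.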
Appendix B
Appendix for Chapter 2
B.1 General log-concave densities
In this section we generalize the main theorem from gaussian to log-concave densities.
B.1.1 Simulated tempering for log-concave densities
First we rework Section 2.7 for log-concave densities.
Theorem B.1.1 (cf. Theorem 2.7.1). Suppose f_0 satisfies Assumption 2.4.1(2) (f_0 is κ-strongly convex, K-smooth, and has minimum at 0).

Let M be the continuous simulated tempering chain for the distributions

p_i ∝ ∑_{j=1}^{m} w_j e^{−β_i f_0(x−μ_j)} (B.1)

with rate Ω(r/D²), relative probabilities r_i, and temperatures 0 < β_1 < · · · < β_L = 1, where

D = max{ max_j ‖μ_j‖, κ^{1/2}/(d^{1/2}K) } (B.2)
β_1 = Θ( κ/(dK²D²) ) (B.3)
β_{i+1}/β_i ≤ 1 + κ/(Kd(ln(K/κ) + 1)) (B.4)
L = Θ( (Kd(ln(K/κ) + 1)²/κ) ln(dKD/κ) ) (B.5)
r = min_i r_i / max_i r_i. (B.6)

Then M satisfies the Poincaré inequality

Var(g) ≤ O(L²D²/r²) ℰ(g, g) = O( K²d²D²(ln(K/κ) + 1)⁴ ln(dKD/κ)² / (κ²r²) ) ℰ(g, g). (B.7)
Proof. Note that forcing D ≥ κ^{1/2}/(d^{1/2}K) (through the maximum in (B.2)) ensures β_1 = O(1).

The proof follows that of Theorem 2.7.1, except that we need to use Lemmas D.2.7 and D.2.6 to bound the χ²-divergences. Steps 1 and 2 are the same: we consider the decomposition where p_{i,j} ∝ e^{−β_i f_0(x−μ_j)} and note ℰ_{i,j} satisfies the Poincaré inequality

Var_{p_{i,j}}(g_i) ≤ (1/(κβ_i)) ℰ_{i,j}(g_i, g_i) = O(D²) ℰ_{i,j}(g_i, g_i). (B.8)

By Lemma D.2.7,

χ²(p_{i−1,j}||p_{i,j}) ≤ e^{ (1/2)|1−β_{i−1}/β_i|·Kd/κ − K|1−β_{i−1}/β_i|(√(ln(K/κ)) + 5)² } (B.9)
· ( (1 − (K/κ)|1 − β_{i−1}/β_i|)(1 + |1 − β_{i−1}/β_i|) )^{−d/2} − 1 = O(1). (B.10)

By Lemma D.2.6,

χ²(p_{1,j′}||p_{1,j}) ≤ e^{ (1/2)β_1κ(2D)² + √(β_1)K(2D)√(d/κ)(√(ln(K/κ)) + 5) } (B.11)
· ( e^{K(2D)√(d/κ)} + √(β_1)K(2D)√(4π/κ) e^{ 2√(β_1)K(2D)√d/√κ + β_1K²(2D)²/(2κ) } ) − 1 = O(1). (B.12)

The rest of the proof is the same.
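For intuition, here is a minimal numerical sketch of simulated tempering Langevin in one dimension. It is a toy illustration, not the thesis's Algorithm 1: it uses only two temperature levels, equal relative probabilities, no partition-function reweighting, and ad hoc parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, betas, eta = 4.0, [0.05, 1.0], 0.01   # modes at +-mu; two temperatures

def density(x, beta):                     # p_beta(x), unnormalized
    return (0.5 * np.exp(-(x - mu) ** 2 / 2)
            + 0.5 * np.exp(-(x + mu) ** 2 / 2)) ** beta

def grad_logp(x, beta):                   # beta * (log mixture density)'
    a, b = np.exp(-(x + mu) ** 2 / 2), np.exp(-(x - mu) ** 2 / 2)
    return -beta * ((x + mu) * a + (x - mu) * b) / (a + b)

x, i, xs = mu, 1, []
for step in range(200_000):
    # Langevin step at the current temperature level.
    x = x + eta * grad_logp(x, betas[i]) + np.sqrt(2 * eta) * rng.standard_normal()
    if step % 10 == 0:                    # Metropolis swap between levels
        j = 1 - i
        if rng.random() < min(1.0, density(x, betas[j]) / density(x, betas[i])):
            i = j
    if i == 1:
        xs.append(x)                      # keep samples at beta = 1 (crude:
xs = np.array(xs)                         # no per-level burn-in)
print("fraction of beta=1 samples left of 0:", (xs < 0).mean())
```

Unlike plain Langevin, which would remain in the mode it started in, the excursions to the flat high-temperature level let the chain cross between the two modes, so both appear among the β = 1 samples.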
Theorem B.1.2 (cf. Theorem 2.7.6). Suppose f_0 satisfies Assumption 2.4.1(2) (f_0 is κ-strongly convex, K-smooth, and has minimum at 0).

Suppose ∑_{j=1}^{m} w_j = 1, w_min = min_{1≤j≤m} w_j > 0, and D = max_{1≤j≤m} ‖μ_j‖. Let M be the continuous simulated tempering chain for the distributions

p_i ∝ ( ∑_{j=1}^{m} w_j e^{−f_0(x−μ_j)} )^{β_i} (B.13)

with rate O(r/D²), relative probabilities r_i, and temperatures 0 < β_1 < · · · < β_L = 1 satisfying the same conditions as in Theorem B.1.1. Then M satisfies the Poincaré inequality

Var(g) ≤ O( L²D²/(r²w³_min) ) ℰ(g, g) = O( K²d²D²(ln(K/κ) + 1)⁴ ln(dKD/κ)² / (κ²r²w³_min) ) ℰ(g, g). (B.14)
Proof. Let p̃_i be the probability distributions in Theorem 2.7.1 with the same parameters as p_i, and let p̃ be the stationary distribution of that simulated tempering chain. By Theorem B.1.1, Var_{p̃}(g) = O(L²D²/r²) ℰ_{p̃}(g, g). By Lemma 2.7.3, p_i/p̃_i ∈ [1, 1/w_min] · (Z̃_i/Z_i). Now use Lemma 2.7.5 with e^Δ = 1/w_min.
B.1.2 Proof of main theorem for log-concave densities
Next we rework Section 2.9 for log-concave densities, and prove the main theorem for
log-concave densities, Theorem 2.4.3.
Lemma B.1.3 (cf. Lemma 2.9.2). Suppose f_0 satisfies Assumption 2.4.1(2) (f_0 is κ-strongly convex, K-smooth, and has minimum at 0).

Suppose that Algorithm 1 is run on f(x) = −ln( ∑_{j=1}^{m} w_j e^{−f_0(x−μ_j)} ) with temperatures 0 < β_1 < · · · < β_ℓ ≤ 1, ℓ ≤ L, with partition function estimates Ẑ_1, . . . , Ẑ_ℓ satisfying

(Ẑ_i/Z_i) / (Ẑ_1/Z_1) ∈ [ (1 − 1/L)^{i−1}, (1 + 1/L)^{i−1} ] (B.15)

for all 1 ≤ i ≤ ℓ. Suppose ∑_{j=1}^{m} w_j = 1, w_min = min_{1≤j≤m} w_j > 0, D = max{ max_j ‖μ_j‖, κ^{1/2}/(d^{1/2}K) }, K ≥ 1, and the parameters satisfy

λ = Θ(1/D²) (B.16)
β_1 = O( κ/(dK²D²) ) (B.17)
β_{i+1}/β_i ≤ 1 + κ/(Kd(ln(K/κ) + 1)) (B.18)
L = Θ( (Kd(ln(K/κ) + 1)²/κ) ln(dKD/κ) ) (B.19)
T = (L²D²/w³_min) · d ln(ℓ/(εw_min)) ln(K/κ) (B.20)
η = O( min{ ε/(D²K^{7/2}(DK/κ^{1/2} + d^{1/2})T), ε/(D^{5/2}K^{3/2}((K/κ)^{1/2} + 1)), ε/(D²K²dT) } ). (B.21)

Let q⁰ be the distribution (1, N(0, (1/(κβ_1))I_d)) on [ℓ] × ℝ^d. The distribution q^T after running time T satisfies ‖p − q^T‖_1 ≤ ε.

Setting ε = O(1/(ℓL)) above and taking n = Ω(L² ln(1/δ)) samples, with probability 1 − δ the estimate

Ẑ_{ℓ+1} = r̂Ẑ_ℓ, r̂ := (1/n) ∑_{j=1}^{n} e^{(−β_{ℓ+1}+β_ℓ)f(x_j)} (B.22)

also satisfies (B.15).
Proof. Begin as in the proof of Lemma 2.9.2. Let p_{β,i} ∝ e^{−βf_0(x−μ_i)} be a probability density function.

Write ‖p − q^T‖_1 ≤ ‖p − p^T‖_1 + ‖p^T − q^T‖_1. Bound the first term by ‖p − p^T‖_1 ≤ √(χ²(P^T||P)) ≤ e^{−T/(2C)} √(χ²(P⁰||P)), where C is the upper bound on the Poincaré constant in Theorem B.1.2. As in (2.203), we get

χ²(p⁰||p) = O(ℓ/w_min) ( 1 + ∑_{j=1}^{m} w_j χ²( N(0, (1/(κβ_1))I_d) || p_{β_1,j} ) ). (B.23)

By Lemma D.2.8 with strong convexity constants κβ_1 and Kβ_1, this is

O( (ℓ/w_min)(K/κ)^{d/2} e^{Kβ_1D²} ) = O( (ℓ/w_min)(K/κ)^{d/2} ) (B.24)

when β_1 = O(1/(KD²)). Thus for T = Ω( C d ln(ℓ/(εw_min)) ln(K/κ) ), ‖p − p^T‖_1 ≤ ε/3.

Again conditioning on the event A that N_T = max{n : T_n ≤ T} = O(Tλ), we get by Lemma 2.8.1 that

‖p^T(·|A) − q^T(·|A)‖_1 = O( η²D⁶K⁷(D²K²/κ + d)T_n + η²D⁵(K/κ + 1) + ηD²K²dT ). (B.25)

Choosing η as in the statement, we get ‖p − q^T‖_1 ≤ ε as before. Finally, apply Lemma 2.9.1, checking the assumptions are satisfied using Lemma D.3.2. The assumptions of Lemma D.3.2 hold, as

(β_{i+1} − β_i)/β_i = O( 1 / ( αKD² + (d/κ)(1 + ln(K/κ)) + (1/κ) ln(1/w_min) ) ). (B.26)
Proof of Theorem 2.4.3. This follows from Lemma B.1.3 in exactly the same way that
the main theorem for gaussians (Theorem 2.4.2) follows from Lemma 2.9.2.
B.2 Perturbation tolerance
The proof of Theorem 2.4.4 will follow immediately from Lemma B.2.2, which is a
straightforward analogue of Lemma 2.9.2.
B.2.1 Simulated tempering for distribution with perturba-
tion
First, we consider the mixing time of the continuous tempering chain, analogously to
Theorem B.1.2:
Theorem B.2.1 (cf. Theorem B.1.2). Suppose f_0 satisfies Assumption 2.4.1. Let M be the continuous simulated tempering chain with rate O(r/D²), relative probabilities r_i, and temperatures 0 < β_1 < · · · < β_L = 1 satisfying the same conditions as in Lemma B.2.2. Then M satisfies the Poincaré inequality

Var(g) ≤ O( e^{3Δ}L²D²/(r²w³_min) ) ℰ(g, g) = O( e^{3Δ}K²d²D²(ln(K/κ) + 1)⁴ ln(dKD/κ)² / (κ²r²w³_min) ) ℰ(g, g). (B.27)
Proof. The proof is almost the same as that of Theorem B.1.2. Let p̃_i be the probability distributions in Theorem 2.7.1 with the same parameters as p_i, and let p̃ be the stationary distribution of that simulated tempering chain. By Theorem B.1.1, Var_{p̃}(g) = O(L²D²/r²) ℰ_{p̃}(g, g). By Lemma 2.7.3, p_i/p̃_i ∈ [1, 1/w_min] · (Z̃_i/Z_i). Now use Lemma 2.7.5 with e^Δ substituted by e^Δ/w_min.
B.2.2 Proof of main theorem with perturbations
Lemma B.2.2 (cf. Lemma B.1.3). Suppose that Algorithm 1 is run on f(x) = −ln( ∑_{j=1}^{m} w_j e^{−f_0(x−μ_j)} ) with temperatures 0 < β_1 < · · · < β_ℓ ≤ 1, ℓ ≤ L, with partition function estimates Ẑ_1, . . . , Ẑ_ℓ satisfying

(Ẑ_i/Z_i) / (Ẑ_1/Z_1) ∈ [ (1 − 1/L)^{i−1}, (1 + 1/L)^{i−1} ] (B.28)

for all 1 ≤ i ≤ ℓ. Suppose ∑_{j=1}^{m} w_j = 1, w_min = min_{1≤j≤m} w_j > 0, D = max{ max_j ‖μ_j‖, κ^{1/2}/(d^{1/2}K) }, K ≥ 1, and the parameters satisfy

λ = Θ(1/D²) (B.29)
β_1 = O( min{ Δ, κ/(dK²D²) } ) (B.30)
β_{i+1}/β_i ≤ min{ Δ, 1 + κ/(Kd(ln(K/κ) + 1)) } (B.31)
L = Θ( (Kd(ln(K/κ) + 1)²/κ) ln(dKD/κ) ) (B.32)
T = ( e^{3Δ}L²D²/w³_min ) · d ln(ℓ/(εw_min)) ln(K/κ) (B.33)
η = O( min{ ε/(D²(K+τ)^{7/2}(D(K+τ)/κ^{1/2} + d^{1/2})T), ε/(D^{5/2}(K+τ)^{3/2}(((K+τ)/κ)^{1/2} + 1)), ε/(D²(K+τ)²dT) } ). (B.34)

Let q⁰ be the distribution (1, N(0, (1/(κβ_1))I_d)) on [ℓ] × ℝ^d. The distribution q^T after running time T satisfies ‖p − q^T‖_1 ≤ ε.

Setting ε = O(1/(ℓL)) above and taking n = Ω(L² ln(1/δ)) samples, with probability 1 − δ the estimate

Ẑ_{ℓ+1} = r̂Ẑ_ℓ, r̂ := (1/n) ∑_{j=1}^{n} e^{(−β_{ℓ+1}+β_ℓ)f(x_j)} (B.35)

also satisfies (B.28).
We prove this theorem by showing that each ingredient of the proof tolerates perturbations of f.
Discretization
We now verify all the discretization lemmas continue to hold with perturbations.
The proof of Lemma 2.8.3, combined with the fact that ‖∇f − ∇f̃‖_∞ ≤ Δ, gives

Lemma B.2.3 (Perturbed reach of continuous chain). Let P^β_T(X) be the Markov kernel corresponding to evolving the Langevin diffusion

dX_t = −β∇f̃(X_t) dt + dB_t

with f̃ and D as defined in Section 2.2, for time T. Then

E[‖X_t − x*‖²] ≲ E[‖X_0 − x*‖²] + ( 400βD²K²τ²/κ + d )T.

Proof. The proof proceeds exactly as in Lemma 2.8.3.
Furthermore, since ‖∇²f(x) − ∇²f̃(x)‖_2 ≤ τ for all x ∈ ℝ^d, from Lemma 2.8.4 we get

Lemma B.2.4 (Perturbed Hessian bound). ‖∇²f̃(x)‖_2 ≤ 4(DK)² + τ for all x ∈ ℝ^d.
As a consequence, the analogue of Lemma 2.8.5 gives:
Lemma B.2.5 (Bounding interval drift). In the setting of Lemma 2.8.5, let x ∈ ℝ^d, i ∈ [L], and let η ≤ α/(σ + τ)². Then

KL(P_T(x, i) || P̂_T(x, i)) ≤ (4D⁶η⁷(K + τ)⁷/3)( ‖x − x*‖₂² + 8Td ) + dTD²η(K + τ)².
Putting these together, we get the analogue of Lemma 2.8.1:
Lemma B.2.6. Fix times 0 < T_1 < · · · < T_n ≤ T. Let p^T, q^T : [L] × ℝ^d → ℝ be defined as follows.

1. p^T is the continuous simulated tempering Markov process as in Definition 2.5.1, but with fixed transition times T_1, . . . , T_n. The component chains are Langevin diffusions on p_i ∝ ( ∑_{j=1}^{m} w_j e^{−f_0(x−μ_j)} )^{β_i}.

2. q^T is the discretized version as in Algorithm 1, again with fixed transition times T_1, . . . , T_n, and with step size η ≤ σ²/2.

Then

KL(P^T || Q^T) ≲ η²D⁶(K + τ)⁶( D²(K + τ)²/κ + d )T_n + η²D³(K + τ)³ max_i E_{x∼P⁰(·,i)} ‖x − x*‖₂² + ηD²(K + τ)²dT,

where x* is the maximizer of ∑_{j=1}^{m} w_j e^{−f_0(x−μ_j)} and satisfies ‖x*‖ = O(D), where D = max_j ‖μ_j‖.
Putting it all together
Finally, we prove Lemma B.2.2.
Proof of Lemma B.2.2. The proof is analogous to the one of Lemma 2.9.2 in com-
bination with the Lemmas from the previous subsections, so we just point out the
differences.
We bound 𝜒2(‹𝑃 ||𝑄0) as follows: by the proof of Lemma 2.9.2, we have 𝜒2(𝑃 ||𝑄0) =
𝑂(
ℓ𝑤min
𝐾𝑑2
). By the definition of 𝜒2, this means
∫𝑞0(𝑥)2
𝑝(𝑥)𝑑𝑥 ≤ 𝑂
Çℓ
𝑤min
𝐾𝑑2
å
143
This in turn implies that
𝜒2(‹𝑃 ||𝑄0) ≤∫
(𝑞0(𝑥))2
𝑝(𝑥)𝑑𝑥 ≤ 𝑂
Çℓ
𝑤min
𝐾𝑑2 𝑒Δ
åThen, analogously as in Lemma 2.9.2, we get
∥∥∥𝑝𝑇 (·|𝐴)− 𝑞𝑇 (·|𝐴)∥∥∥1
= 𝑂
Ñ𝜂2𝐷6(𝐾 + 𝜏)7
Ç𝐷2 (𝐾 + 𝜏)2
𝜅+ 𝑑
å𝑇𝜂 (B.36)
+ 𝜂2𝐷5
Ç𝐾
𝜅+ 1
å+ 𝜂𝐷2(𝐾 + 𝜏)2𝑑𝑇
é. (B.37)
Choosing 𝜂 as in the statement of the lemma,∥∥∥𝑝− 𝑞𝑇
∥∥∥1≤ 𝜀 follows. The rest of the
lemma is identical to Lemma 2.9.2.
B.3 Continuous version of decomposition theorem
We consider the case where I is continuous, and the Markov process is the Langevin process. We will take I = Ω^{(1)} ⊆ ℝ^{d_1} and each Ω_i to be a fixed Ω^{(2)} ⊆ ℝ^{d_2}, so the space is Ω^{(1)} × Ω^{(2)} ⊆ ℝ^{d_1} × ℝ^{d_2} = ℝ^{d_1+d_2}.

The proof of the following is very similar to the proof of Theorem 2.6.3, and to Lemma 5 in [Mou+19]. The main difference is that they bound using an L^∞ norm over Ω^{(1)} × Ω^{(2)}, while we bound using an L^∞ norm over just Ω^{(1)}; thus this bound is stronger, and can prevent having to consider restrictions. We also note that there is an analogue of the theorem for log-Sobolev inequalities [Lel09; Gru+09].
Theorem B.3.1 (Poincaré inequality from marginal and conditional distributions). Consider a probability measure π with C¹ density on Ω = Ω^{(1)} × Ω^{(2)}, where Ω^{(1)} ⊆ ℝ^{d_1} and Ω^{(2)} ⊆ ℝ^{d_2} are closed sets. For X = (X_1, X_2) ∼ P with probability density function p (i.e., P(dx) = p(x) dx and P(dx_2|x_1) = p(x_2|x_1) dx_2), suppose that

∙ the marginal distribution of X_1 satisfies a Poincaré inequality with constant C_1;

∙ for any x_1 ∈ Ω^{(1)}, the conditional distribution X_2|X_1 = x_1 satisfies a Poincaré inequality with constant C_2.

Then π satisfies a Poincaré inequality with constant

C̃ = max{ C_2( 1 + 2C_1 ‖ ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ‖_{L^∞(Ω^{(1)})} ), 2C_1 }. (B.38)

Note that an alternate way to write the integral is

∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 = ∫_{Ω^{(2)}} ∇_{x_1}p(x_2|x_1) · ∇_{x_1}(ln p(x_2|x_1)) dx_2. (B.39)
By adapting this theorem it would be possible, with additional work, to show that Langevin on the space [β_0, 1] × Ω, where the first coordinate is the temperature (that is, including the temperature as a coordinate in the Langevin algorithm), will mix. In the proof we will draw analogies between the discrete and continuous cases.
Proof. To make the analogy, let

ℰ_↔(g, g) = ℰ^{(1)}(g, g) := ∫_Ω ‖∇_{x_1}g(x)‖² P(dx) (B.40)
ℰ_↕(g, g) = ℰ^{(2)}(g, g) := ∫_Ω ‖∇_{x_2}g(x)‖² P(dx) (B.41)

and note ℰ = ℰ_↔ + ℰ_↕ = ℰ^{(1)} + ℰ^{(2)}.

Let P̄ be the x_1-marginal of P, i.e., P̄(A) = P(A × Ω^{(2)}). Given g ∈ L²(Ω^{(1)} × Ω^{(2)}), define ḡ ∈ L²(Ω^{(1)}) by ḡ(x_1) = E_{x_2∼P(·|x_1)}[g(x)]. Analogously to (2.68),

Var_P(g) = ∫_{Ω^{(1)}} ( ∫_{Ω^{(2)}} (g(x) − E_P g)² P(dx_2|x_1) ) P̄(dx_1) (B.42)
= ∫_{Ω^{(1)}} ( ∫_{Ω^{(2)}} (g(x) − E_{x_2∼P(·|x_1)}g(x))² P(dx_2|x_1) + ( E_{x_2∼P(·|x_1)}[g(x)] − E_P[g(x)] )² ) P̄(dx_1) (B.43)
≤ ∫_{Ω^{(1)}} C_2 ℰ_{P(·|x_1)}(g, g) P̄(dx_1) + Var_{P̄}(ḡ) (B.44)
≤ C_2 ℰ^{(2)}(g, g) + C_1 ℰ_{P̄}(ḡ, ḡ). (B.45)

The second term is analogous to the B term in (2.72). (There is no A term.) Note ℰ_{P̄}(ḡ, ḡ) = ∫_{Ω^{(1)}} ‖∇_{x_1}ḡ(x_1)‖² P̄(dx_1), and we can expand ∇_{x_1}ḡ(x_1) by differentiating under the integral sign:

∇_{x_1}ḡ(x_1) = ∇_{x_1}( ∫_{Ω^{(2)}} g(x) P(dx_2|x_1) ) (B.46)
= ∫_{Ω^{(2)}} ∇_{x_1}g(x) P(dx_2|x_1) + ∫_{Ω^{(2)}} g(x) ∇_{x_1}p(x_2|x_1) dx_2. (B.47)

Hence by Cauchy–Schwarz (compare with (2.84)),

ℰ_{P̄}(ḡ, ḡ) ≤ 2[ ∫_Ω ‖∇_{x_1}g(x)‖² P(dx_2|x_1)P̄(dx_1) + ∫_{Ω^{(1)}} ‖ ∫_{Ω^{(2)}} g(x)∇_{x_1}p(x_2|x_1) dx_2 ‖² P̄(dx_1) ]. (B.48)

The first term is 2ℰ^{(1)}(g, g). The second term is bounded by Lemma D.1.2, the continuous analogue of Lemma D.1.1, with g taken to be g(x_1, ·) and p_{x_1}(x_2) = p(x_2|x_1):

∫_{Ω^{(1)}} ‖ ∫_{Ω^{(2)}} g(x)∇_{x_1}p(x_2|x_1) dx_2 ‖² P̄(dx_1) (B.49)
≤ ∫_{Ω^{(1)}} Var_{P(·|x_1)}[g(x)] ( ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ) P̄(dx_1) (B.50)
≤ ∫_{Ω^{(1)}} C_2 ℰ_{P(·|x_1)}(g, g) P̄(dx_1) · ‖ ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ‖_{L^∞(Ω^{(1)})} (B.51)
(by the L¹–L^∞ inequality) (B.52)
= C_2 ℰ^{(2)}(g, g) ‖ ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ‖_{L^∞(Ω^{(1)})}. (B.53)

Hence, recalling (B.45) and (B.48),

ℰ_{P̄}(ḡ, ḡ) ≤ 2ℰ^{(1)}(g, g) + 2C_2 ℰ^{(2)}(g, g) ‖ ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ‖_{L^∞(Ω^{(1)})} (B.54)
Var_P(g) ≤ C_2 ℰ^{(2)}(g, g) + C_1 ℰ_{P̄}(ḡ, ḡ) (B.55)
≤ max{ C_2( 1 + 2C_1 ‖ ∫_{Ω^{(2)}} ‖∇_{x_1}p(x_2|x_1)‖²/p(x_2|x_1) dx_2 ‖_{L^∞(Ω^{(1)})} ), 2C_1 } ℰ(g, g). (B.56)
B.4 Examples
It might be surprising that sampling from a mixture of gaussians requires a complicated Markov chain such as simulated tempering. However, many simple strategies seem to fail.
Langevin with few restarts One natural strategy to try is simply to run Langevin
a polynomial number of times from randomly chosen locations. While the time to
“escape” a mode and enter a different one could be exponential, we may hope that
each of the different runs “explores” the individual modes, and we somehow stitch
the runs together. The difficulty with this is that when the means of the gaussians
are not well-separated, it’s difficult to quantify how far each of the individual runs
will reach and thus how to combine the various runs.
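This failure mode of plain Langevin is easy to see numerically. In the hypothetical example below (an illustration, not from the thesis), Langevin started in one mode of a well-separated two-Gaussian mixture essentially never visits the other:

```python
import numpy as np

# Langevin on the mixture 0.5 N(-mu, 1) + 0.5 N(mu, 1) in one dimension.
rng = np.random.default_rng(0)
mu, eta = 6.0, 0.01

def grad_f(x):                       # gradient of f = -log(mixture density)
    a, b = np.exp(-(x + mu) ** 2 / 2), np.exp(-(x - mu) ** 2 / 2)
    return ((x + mu) * a + (x - mu) * b) / (a + b)

x, xs = -mu, []                      # start in the left mode
for _ in range(100_000):
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal()
    xs.append(x)
xs = np.array(xs)
# The chain should spend ~half its time right of 0, but essentially never
# crosses: the escape time is exponential in the barrier height ~ mu^2/2.
print("fraction of time in the right mode:", (xs > 0).mean())
```

A run from a random start would explore whichever mode it lands in, which is exactly why stitching independent runs together is delicate when the modes are not well separated.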
Recovering the means of the gaussians Another natural strategy would be
to try to recover the means of the gaussians in the mixture by performing gradient
descent on the log-pdf with a polynomial number of random restarts. The hope
would be that maybe the local minima of the log-pdf correspond to the means of the
gaussians, and with enough restarts, we should be able to find them.
Unfortunately, without substantial modifications this strategy also seems not to work: for instance, in dimension 𝑑, consider a mixture of 𝑑 + 1 gaussians, 𝑑 of them
with means on the corners of a 𝑑-dimensional simplex with a side-length substantially
smaller than the diameter 𝐷 we are considering, and one in the center of the simplex.
In order to discover the mean of the gaussian in the center, we would have to have a
starting point extremely close to the center of the simplex, which in high dimensions
seems difficult.
Additionally, this doesn’t address the issue of robustness to perturbations at all. Though there are algorithms to optimize “approximately” convex functions [Bel+15; LR16], they can typically handle only very small perturbations.
Gaussians with different covariance Our result requires all the gaussians to have the same variance. This is necessary: even if the variances of the gaussians differ only by a factor of 2, there are examples where a simulated tempering chain takes exponential time to converge [WSH09b]. Intuitively, this is illustrated in Figure B.1.
The figure on the left shows the distribution at low temperature: in this case the two modes are separate, and both have significant mass. The figure on the right shows the distribution at high temperature. Note that although in this case the two modes are connected, the volume of the mode with smaller variance is much smaller
(exponentially small in 𝑑). Therefore in high dimensions, even though the modes
can be connected at high temperature, the probability mass associated with a small
variance mode is too small to allow fast mixing.
In the next section, we show that even if we do not restrict to this particular simulated tempering chain, no algorithm can efficiently and robustly sample from a mixture of two Gaussians with different covariances.
Figure B.1: Mixture of two gaussians with different covariances at different temperatures
B.5 Lower bound when Gaussians have different
variance
In this section, we give a lower bound showing that in high dimensions, if the Gaus-
sians can have different covariance matrices, results similar to our Theorem 2.4.2
cannot hold. In particular, we construct a negative log density function f̃ that is close to the negative log density of a mixture of two Gaussians (with different variances), and show that any algorithm must query the function at exponentially many locations in order to sample from the distribution. More precisely, we prove the following theorem:
Theorem B.5.1. There exists a function f̃ that is close to the negative log density f of a mixture of two Gaussians: ‖f − f̃‖_∞ ≤ log 2, and for all x, ‖∇f(x) − ∇f̃(x)‖ ≤ O(d) and ‖∇²f(x) − ∇²f̃(x)‖ ≤ O(d). Let p be the distribution whose density function is proportional to exp(−f̃). There exist constants c > 0, C > 0 such that when d ≥ C, any algorithm with at most 2^{cd} queries to f̃ and ∇f̃ cannot generate a distribution that is within TV-distance 0.3 of p.
In order to prove this theorem, we will first specify the mixture of two Gaussians. Consider a uniform mixture of the two Gaussian distributions N(0, 2I) and N(u, I) (u ∈ ℝ^d) in ℝ^d.

Definition B.5.2. Let f_1 = ‖x‖²/4 + (d/2) log(2√2 π) and f_2 = ‖x − u‖²/2 + (d/2) log(2π). The negative log density f used in the lower bound is

f = −log( (1/2)(e^{−f_1} + e^{−f_2}) ).
In order to prove the lower bound, we will show that there is a function f̃ close to f such that f̃ behaves exactly like the single Gaussian N(0, 2I) on almost all points. Intuitively, any algorithm with only queries to f̃ will not be able to distinguish it from a single Gaussian, and therefore will not be able to find the second component N(u, I). More precisely, we have

Lemma B.5.3. When ‖u‖² ≥ 4d log 2, for any point x outside of the ball with center 2u and radius 1.5‖u‖, we have e^{−f_1(x)} ≥ e^{−f_2(x)}.
Proof. The lemma follows from a simple calculation. In order for e^{−f_1(x)} ≥ e^{−f_2(x)}, since e^x is monotone, we need

−‖x − u‖²/2 ≤ −‖x‖²/4 − (d/4) log 2.

This is a quadratic inequality in x; rearranging the terms, we get

‖x − 2u‖² ≥ d log 2 + 2‖u‖².

Since d log 2 ≤ 0.25‖u‖², this is always satisfied whenever ‖x − 2u‖ ≥ 1.5‖u‖, and hence e^{−f_1(x)} ≥ e^{−f_2(x)}.
The lemma shows that outside of this ball, the contribution from the first Gaussian dominates. Intuitively, we try to make f̃ = f_1 outside of this ball, and f̃ = f inside the ball. To make the function continuous, we shift between the two functions gradually. More precisely, we define f̃ as follows:

Definition B.5.4. The function

f̃(x) = g(x)f_1(x) + (1 − g(x))f(x). (B.57)

Here the function g(x) (see Definition B.5.6) satisfies

g(x) = 1 if ‖x − 2u‖ ≥ 1.6‖u‖; g(x) = 0 if ‖x − 2u‖ ≤ 1.5‖u‖; and g(x) ∈ [0, 1] otherwise.

Also, g(x) is twice differentiable with all first and second order derivatives bounded.
With a carefully constructed g(x), it is possible to prove that f̃ is pointwise close to f in function value, gradient, and Hessian, as stated in the lemma below. Since these are routine calculations, we leave the construction of g(x) and the verification of this lemma to the end of this section.

Lemma B.5.5. For the functions f and f̃ defined in Definitions B.5.2 and B.5.4, if ‖u‖² ≥ 4d log 2, there exists a large enough constant C such that

‖f − f̃‖_∞ ≤ log 2,
‖∇f(x) − ∇f̃(x)‖ ≤ C‖u‖ for all x,
‖∇²f(x) − ∇²f̃(x)‖ ≤ C‖u‖² for all x.
Now we are ready to prove the main theorem:
Proof of Theorem B.5.1. We will show that if we pick u to be a uniformly random vector with ‖u‖² = 8d log 2, there exists a constant c > 0 such that for any algorithm, with probability at least 1 − exp(−cd), in the first exp(cd) iterations of the algorithm there will be no query x ≠ 0 such that cos θ(x, u) ≥ 3/5.

First, by standard concentration inequalities, we know that for any fixed vector x ≠ 0 and a uniformly random u,

Pr[cos θ(x, u) ≥ 3/5] ≤ exp(−c′d)

for some constant c′ > 0 (c′ = 0.01 suffices).

Now, for any algorithm, consider running the algorithm with oracle access to f_1 and to f̃ respectively (if the algorithm is randomized, we also couple the random choices of the algorithm in the two runs). Suppose that when the oracle is f_1 the queries are x_1, x_2, . . . , x_t, and when the oracle is f̃ the queries are x̃_1, . . . , x̃_t.

Let c = c′/2. When t ≤ exp(cd), by a union bound we know that with probability at least 1 − exp(−cd), we have cos θ(x_i, u) < 3/5 for all i ≤ t. On the other hand, every point y in the ball with center 2u and radius 1.6‖u‖ has cos θ(y, u) ≥ 3/5. So we know ‖x_i − 2u‖ > 1.6‖u‖, and hence f_1(x_i) = f̃(x_i) for all i ≤ t (the derivatives are also the same). Therefore, the algorithm gets the same responses whether it has access to f_1 or f̃. This implies x̃_i = x_i for all i ≤ t.

Now, to see why this implies that the output distribution of the last point is far from p, note that when d is large enough, p has mass at least 0.4 in the ball ‖x − 2u‖ ≤ 1.6‖u‖ (because essentially all the mass of the second Gaussian is inside this ball), while the algorithm has probability less than 0.1 of outputting any point in this region. Therefore the TV distance is at least 0.3, and this finishes the proof.
B.5.1 Construction of 𝑔 and closeness of two functions
Now we finish the details of the proof by construction a function 𝑔.
Definition B.5.6. Let ℎ(𝑥) be the following function:
ℎ(𝑥) =
1 𝑥 ≥ 1
0 𝑥 ≤ 0
𝑥2(1− 𝑥)2 + (1− (1− 𝑥)2)2 𝑥 ∈ [0, 1]
We then define 𝑔(𝑥) to be 𝑔(𝑥) := ℎ(10(‖𝑥−2𝑢‖‖𝑢‖ − 1.5
)).
For this function we can prove:

Lemma B.5.7. The function g defined above satisfies

g(x) = 1 for ‖x − 2u‖ ≥ 1.6‖u‖; g(x) = 0 for ‖x − 2u‖ ≤ 1.5‖u‖; and g(x) ∈ [0, 1] otherwise.

Also, g(x) is twice differentiable. There exists a large enough constant C_g > 0 such that for all x,

‖∇g(x)‖ ≤ C_g‖u‖ and ‖∇²g(x)‖ ≤ C_g(‖u‖² + 1).
Proof. First we prove the properties of h(x). Let h_0(x) = x²(1 − x)² + (1 − (1 − x)²)²; it is easy to check that h_0(0) = h′_0(0) = h″_0(0) = 0, h_0(1) = 1, and h′_0(1) = h″_0(1) = 0. Therefore the entire function h(x) is twice differentiable.

Also, we know h′_0(x) = 2x(4x² − 9x + 5), which is nonnegative for x ∈ [0, 1], so h(x) is monotone on [0, 1]. The second derivative is h″_0(x) = 24x² − 36x + 10. Just using the naive bound (the sum of the absolute values of the individual terms), we get |h′_0(x)| ≤ 36 and |h″_0(x)| ≤ 60 for all x ∈ [0, 1]. (We could of course compute better bounds, but this is not important for the proof.)
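The closed-form derivatives and the bounds used above are easy to double-check numerically (a quick sanity check, not part of the thesis):

```python
import numpy as np

# Check the derivative computations for h0 in the proof of Lemma B.5.7.
x = np.linspace(0.0, 1.0, 10_001)
h0 = x ** 2 * (1 - x) ** 2 + (1 - (1 - x) ** 2) ** 2

# Claimed closed forms for h0' and h0''.
h0p = 2 * x * (4 * x ** 2 - 9 * x + 5)
h0pp = 24 * x ** 2 - 36 * x + 10

# Compare against numerical differentiation on the grid.
assert np.allclose(np.gradient(h0, x), h0p, atol=1e-2)
assert np.allclose(np.gradient(h0p, x), h0pp, atol=1e-2)

# h0' >= 0 on [0, 1] (monotonicity), and the bounds |h0'| <= 36, |h0''| <= 60.
assert (h0p >= -1e-12).all()
assert np.abs(h0p).max() <= 36 and np.abs(h0pp).max() <= 60
print("h0 derivative checks pass")
```

The same grid check confirms h_0(0) = 0, h_0(1) = 1, and h′_0(0) = h′_0(1) = 0, so h interpolates smoothly between the two constant pieces.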
Now consider the function g. We know that when ‖x − 2u‖ ∈ [1.5, 1.6]·‖u‖,

∇g(x) = h′( 10( ‖x − 2u‖/‖u‖ − 1.5 ) ) · 10(x − 2u).

Therefore ‖∇g(x)‖ ≤ 36 × 10 × ‖x − 2u‖ ≤ C_g‖u‖ (when C_g is a large enough constant).

For the second order derivative, we know

∇²g(x) = 100 h″( 10( ‖x − 2u‖/‖u‖ − 1.5 ) ) (x − 2u)(x − 2u)^⊤ + 10 h′( 10( ‖x − 2u‖/‖u‖ − 1.5 ) ) I.

Again, by the bounds on h′ and h″, we know there exist large enough constants so that ‖∇²g(x)‖ ≤ C_g(‖u‖² + 1).
Finally, we can prove Lemma B.5.5.

Proof of Lemma B.5.5. We first show that the function values are close. When ‖x − 2u‖ ≤ 1.5‖u‖, by definition f̃(x) = f(x). When ‖x − 2u‖ ≥ 1.5‖u‖, by the properties of g we know f̃(x) is between f(x) and f_1(x). Now by Lemma B.5.3, in this range e^{−f_1(x)} ≥ e^{−f_2(x)}, so f_1(x) − log 2 ≤ f(x) ≤ f_1(x). As a result we know |f(x) − f̃(x)| ≤ log 2.

Next we consider the gradient. Again, when ‖x − 2u‖ ≤ 1.5‖u‖ the two functions (and all their derivatives) are the same. When ‖x − 2u‖ ∈ [1.5, 1.6]·‖u‖, we have

∇f̃(x) = g(x)∇f_1(x) + (1 − g(x))∇f(x) + (f_1(x) − f(x))∇g(x).

By Lemma B.5.7 we have upper bounds for g(x) and ‖∇g(x)‖; also, both ‖∇f_1(x)‖ and ‖∇f(x)‖ can easily be bounded by O(1)‖u‖ in this range, so ‖∇f(x) − ∇f̃(x)‖ ≤ C‖u‖ for a large enough constant C.
When ‖𝑥− 2𝑢‖ ≥ 1.6‖𝑢‖, we know ∇𝑓(𝑥) = ∇𝑓1(𝑥). Calculation shows
∇𝑓1(𝑥)−∇𝑓(𝑥) =𝑒−𝑓2(𝑥)
𝑒−𝑓1(𝑥) + 𝑒−𝑓2(𝑥)(∇𝑓1(𝑥)−∇𝑓2(𝑥)).
When ‖𝑥‖ ≤ 50‖𝑢‖, we have ‖∇𝑓1(𝑥) − ∇𝑓2(𝑥)‖ ≤ 2‖𝑥‖ + 2‖𝑢‖ ≤ 𝑂(1)‖𝑢‖.
When ‖𝑥‖ ≥ 50‖𝑢‖, it is easy to check that 𝑒−𝑓2(𝑥)
𝑒−𝑓1(𝑥)+𝑒−𝑓2(𝑥)≤ exp−‖𝑥‖2/5 and
‖∇𝑓1(𝑥)−∇𝑓2(𝑥)‖ ≤ 2‖𝑥‖, therefore in this case the difference in gradient bounded
by exp(−𝑡2/5)2𝑡 which is always small.
Finally we can check the Hessian. Once again, when \(\|x-2u\| \le 1.5\|u\|\) the two functions are the same. When \(\|x-2u\| \in [1.5, 1.6]\|u\|\), we have
\[\begin{aligned} \nabla^2 \tilde f(x) = {}& g(x)\nabla^2 f_1(x) + (1-g(x))\nabla^2 f(x) \\ & + (\nabla f_1(x) - \nabla f(x))(\nabla g(x))^\top + (\nabla g(x))(\nabla f_1(x) - \nabla f(x))^\top \\ & + (f_1 - f)\nabla^2 g(x). \end{aligned}\]
In this case we get bounds for \(g(x), \nabla g(x), \nabla^2 g(x)\) from Lemma B.5.7; \(\|\nabla f_1(x)\|, \|\nabla f(x)\|\) can still be bounded by \(O(1)\|u\|\), and \(\|\nabla^2 g(x)\|, \|\nabla^2 f_1(x)\|\) can be bounded by \(O(\|u\|^2)\) and \(O(1)\) respectively. Therefore we know \(\|\nabla^2 \tilde f(x) - \nabla^2 f(x)\| \le C\|u\|^2\) for a large enough constant \(C\).
When \(\|x-2u\| \ge 1.6\|u\|\), we have \(\tilde f(x) = f_1(x)\), and
\[\begin{aligned} \nabla^2 f_1(x) - \nabla^2 f(x) = {}& \frac{e^{-f_2(x)}}{e^{-f_1(x)} + e^{-f_2(x)}}\,(\nabla^2 f_1(x) - \nabla^2 f_2(x)) \\ & + \frac{e^{-f_1(x)-f_2(x)}}{(e^{-f_1(x)} + e^{-f_2(x)})^2}\,(\nabla f_1(x) - \nabla f_2(x))(\nabla f_1(x) - \nabla f_2(x))^\top. \end{aligned}\]
Here the first term is always bounded by a constant (because \(e^{-f_2(x)}\) is smaller, and \(\nabla^2 f_1(x) - \nabla^2 f_2(x) = I/2\)). For the second term, by arguments similar to before, when \(\|x\| \le 50\|u\|\) it is bounded by \(O(1)\|u\|^2\). When \(\|x\| \ge 50\|u\|\) we can check that \(\frac{e^{-f_1(x)-f_2(x)}}{(e^{-f_1(x)} + e^{-f_2(x)})^2} \le \exp(-\|x\|^2/5)\) and \(\|(\nabla f_1(x) - \nabla f_2(x))(\nabla f_1(x) - \nabla f_2(x))^\top\| \le 4\|x\|^2\); therefore the second term is bounded by \(\exp(-t^2/5)\cdot 4t^2\) with \(t = \|x\|\), which is no larger than a constant. Combining all the cases, we know there exists a large enough constant \(C\) such that \(\|\nabla^2 \tilde f(x) - \nabla^2 f(x)\| \le C\|u\|^2\) for all \(x\).
Appendix C
Appendix for Chapter 3
C.1 Proof for logistic regression application
C.1.1 Theorem for general posterior sampling, and application to logistic regression
We show that under some general conditions—roughly, that we see data in all directions—
the posterior distribution concentrates. We specialize to logistic regression and show
that the posterior for logistic regression concentrates under reasonable assumptions.
The proof shares elements with the proof of the Bernstein-von Mises theorem (see
e.g. [Nic12]), which says that under some weak smoothness and integrability assump-
tions, the posterior distribution after seeing iid data (asymptotically) approaches a
normal distribution. However, we only need to prove a weaker result—not that the
posterior distribution is close to normal, but just 𝛼𝑇 -strongly log concave in a neigh-
borhood of the MLE, for some 𝛼 > 0; hence, we get good, nonasymptotic bounds.
This is true under more general assumptions; in particular, the data do not have to be iid, as long as we observe data "in all directions."
Theorem C.1.1 (Validity of the assumptions for posterior sampling). Suppose that \(\|\theta_0\| \le B\) and \(x_t \sim P_x(\cdot|x_{1:t-1}, \theta_0)\). Let \(f_t\), \(t \ge 1\), be such that \(P_x(x_t|x_{1:t-1}, \theta) \propto e^{-f_t(\theta)}\), and let \(\pi_t(\theta)\) be the posterior distribution, \(\pi_t(\theta) \propto e^{-\sum_{k=0}^t f_k(\theta)}\). Suppose there are \(M, L, r, \sigma_{\min}, T_{\min} > 0\) and \(\alpha, \beta \ge 0\) such that the following conditions hold:
1. For each 𝑡, 1 ≤ 𝑡 ≤ 𝑇 , 𝑓𝑡(𝜃) is twice continuously differentiable and convex.
2. (Gradients have bounded variation) For each 𝑡, given 𝑥1:𝑡−1,
‖∇𝑓𝑡(𝜃)− E[∇𝑓𝑡(𝜃)|𝑥1:𝑡−1]‖ ≤𝑀. (C.1)
3. (Smoothness) Each 𝑓𝑡 is 𝐿-smooth, for 1 ≤ 𝑡 ≤ 𝑇 .
4. (Strong convexity in neighborhood) Let
\[ I_T(\theta) := \frac{1}{T}\sum_{t=1}^T \nabla^2 f_t(\theta). \tag{C.2} \]
Then for \(T \ge T_{\min}\), with probability \(\ge 1 - \frac{\varepsilon}{2}\),
\[ \forall \theta \in B(\theta_0, r),\quad I_T(\theta) \succeq \sigma_{\min} I_d. \tag{C.3} \]
5. 𝑓0(𝜃) is 𝛼-strongly convex and 𝛽-smooth, and has minimum at 𝜃 = 0.
Let \(\theta^\star_T\) be the minimizer of \(\sum_{t=0}^T f_t(\theta)\), i.e., the MAP estimate for \(\theta\) after observing \(x_{1:T}\). Letting
\[ C = \max\left\{1,\; M\sqrt{2d\log\left(\frac{2d}{\varepsilon}\right)},\; \frac{4d}{\sigma_{\min}}\right\} \]
and \(c = \frac{\alpha}{\sigma_{\min}}\), if \(T \ge T_{\min}\) is such that \(\frac{C\sqrt{T} + \beta B}{\sigma_{\min} T + \alpha} + \frac{C}{\sqrt{T + c}} < r\), then with probability \(1 - \varepsilon\),
the following hold:
1. \(\|\theta^\star_T - \theta_0\| \le \frac{C\sqrt{T} + \beta B}{\sigma_{\min} T + \alpha}\).
2. For \(C' \ge 0\),
\[ \mathbb{P}_{\theta\sim\pi_T}\left(\|\theta - \theta^\star_T\| \ge \frac{C'}{\sqrt{T+c}}\right) \le \frac{K_1}{\sigma_{\min} C\sqrt{T+c}}\left(\frac{(LT+\beta)e}{d}\right)^{\frac d2} e^{\frac12\sigma_{\min}C^2 - \frac{\sigma_{\min}CC'}{2}} \]
for some constant \(K_1\).
The strong convexity condition is analogous to a small-ball inequality [KM15; Men14]
for the sample Fisher information matrix in a neighborhood of the true parameter
value. In the iid case we have concentration (which is necessary for a central limit
theorem to hold, as in the Bernstein-von Mises Theorem); in the non-iid case we do
not necessarily have concentration, but the small-ball inequality can still hold.
We show that under reasonable conditions on the data-generating distribution, logistic regression satisfies the above conditions. Let \(\varphi(x) = \frac{1}{1+e^{-x}}\) be the logistic function. Note that \(\varphi(-x) = 1 - \varphi(x)\).
Applying Theorem C.1.1 to the setting of logistic regression, we will obtain the
following.
Lemma C.1.2. In the setting of Problem 3.2.1 (logistic regression), suppose that \(\|\theta_0\| \le B\) and the \(u_t \sim P_u\) are iid, where \(P_u\) is a distribution that satisfies the following: for \(u \sim P_u\),
1. (Bounded) \(\|u\|_2 \le M\) with probability 1.
2. (Minimal eigenvalue of Fisher information matrix)
\[ I(\theta_0) := \int_{\mathbb{R}^d} \varphi(u^\top\theta_0)\varphi(-u^\top\theta_0)\, uu^\top \, dP_u \succeq \sigma I_d \tag{C.4} \]
for \(\sigma > 0\).
Let
\[ C = \max\left\{1,\; 2M\sqrt{2d\log\left(\frac{2d}{\varepsilon}\right)},\; \frac{4ed}{\sigma}\right\}. \tag{C.5} \]
Then for \(t > \max\left\{\frac{M^4\log(\frac{2d}{\varepsilon})}{8\sigma^2},\; 4M^2\left(\frac{2eC}{\sigma}+1\right)^2,\; \frac{4eMB\alpha}{\sigma}\right\}\), we have
1. \(\nabla f_k(\theta)\) is \(\frac{M^2}{4}\)-Lipschitz for all \(k \in \mathbb{N}\).
2. For any \(C' \ge 0\) and \(c = \frac{2e\alpha}{\sigma}\),
\[ \mathbb{P}_{\theta\sim\pi_t}\left(\|\theta - \theta^\star_t\| \ge \frac{C'}{\sqrt{T+c}}\right) \le \frac{K_1}{\sigma C\sqrt{T+c}}\left(\frac{\left(\frac{M^2}{4}T+\alpha\right)e}{d}\right)^{\frac d2} e^{\frac{\sigma C^2}{4e} - \frac{\sigma CC'}{4e}} \tag{C.6} \]
for some constant \(K_1\).
3. With probability \(1-\varepsilon\), \(\|\theta^\star_t - \theta_0\| \le \frac{C\sqrt{t} + \alpha B}{\sigma t/(2e) + \alpha}\).
Remark C.1.3. We explain the condition \(I(\theta_0) = \int_{\mathbb{R}^d} \varphi(u^\top\theta_0)\varphi(-u^\top\theta_0)uu^\top\,dP_u \succeq \sigma I_d\). Note that \(\varphi(x)\varphi(-x)\) is bounded away from 0 in a neighborhood of \(x = 0\), and then decays to 0 exponentially in \(x\). Thus, \(I(\theta_0)\) is essentially the second moment, where we ignore vectors that are too large in the direction of \(\pm\theta_0\).
More precisely, we have the following implication:
\[ \mathbb{E}_u\left[uu^\top \mathbf{1}_{|u^\top\theta_0| \le C_1}\right] \succeq \sigma I_d \implies \int_{\mathbb{R}^d} \varphi(u^\top\theta_0)\varphi(-u^\top\theta_0)uu^\top\,dP_u \succeq \varphi(C_1)(1-\varphi(C_1))\,\sigma I_d, \tag{C.7} \]
because \(\varphi(x)\varphi(-x) \ge \varphi(C_1)(1-\varphi(C_1))\) whenever \(|x| \le C_1\). Theorem 3.2.2 is stated with \(C_1 = 2\).
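As a quick numerical sanity check of this implication (a stdlib-only sketch; the helper names are ours, not the thesis's): on \(|x| \le C_1\) the weight \(\varphi(x)\varphi(-x)\) is symmetric and decreasing in \(|x|\), so its minimum is the endpoint value \(\varphi(C_1)(1-\varphi(C_1))\).

```python
import math

def phi(x):
    # logistic function phi(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def logistic_weight(x):
    # phi(x) * phi(-x) = phi(x) * (1 - phi(x))
    return phi(x) * (1.0 - phi(x))

def min_weight_on_interval(c1, steps=2001):
    # grid minimum of phi(x)phi(-x) over |x| <= c1; the minimum sits at x = +-c1
    return min(logistic_weight(-c1 + 2.0 * c1 * i / (steps - 1))
               for i in range(steps))
```

With \(C_1 = 2\) the floor is \(\varphi(2)(1-\varphi(2)) \approx 0.105\), the constant lost when passing from the truncated second moment to \(I(\theta_0)\).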
C.1.2 Proof of Theorem C.1.1
Proof of Theorem C.1.1. Let \(\mathcal{E}\) be the event that (C.3) holds.
Step 1: We bound \(\|\theta^\star_T - \theta_0\|\) with high probability.
We show that with high probability \(\sum_{t=0}^T \nabla f_t(\theta_0)\) is close to 0. Since \(\sum_{t=0}^T \nabla f_t(\theta^\star_T) = 0\), the gradients at \(\theta_0\) and \(\theta^\star_T\) are then close, and by strong convexity we conclude that \(\theta_0\) and \(\theta^\star_T\) are close.
First note that \(\mathbb{E}[f_t(\theta)|x_{1:t-1}] = \int_{\mathbb{R}^d} -\log P_x(x_t|x_{1:t-1}, \theta)\, dP_x(\cdot|x_{1:t-1}, \theta_0)\) is a KL divergence plus the entropy of \(P_x(\cdot|x_{1:t-1}, \theta_0)\) (i.e., a cross-entropy), and hence is minimized at \(\theta = \theta_0\).
Hence \(\frac1T\sum_{t=1}^T \mathbb{E}[\nabla f_t(\theta_0)|x_{1:t-1}] = 0\). Thus by Lemma D.4.2 applied to
\[ \sum_{t=1}^T \nabla f_t(\theta_0) = \sum_{t=1}^T \left[\nabla f_t(\theta_0) - \mathbb{E}[\nabla f_t(\theta_0)|x_{1:t-1}]\right], \tag{C.8} \]
we have by Chernoff's inequality that
\[ \mathbb{P}\left(\left\|\sum_{t=1}^T \nabla f_t(\theta_0)\right\| \ge C\sqrt{T}\right) \le 2d\,e^{-\frac{C^2}{2M^2 d}} \le \frac{\varepsilon}{2} \tag{C.9} \]
when \(\frac{C^2}{2M^2 d} \ge \log\left(\frac{4d}{\varepsilon}\right)\), which happens when \(C \ge M\sqrt{2d\log\left(\frac{4d}{\varepsilon}\right)}\).
Let \(\mathcal{A}\) be the event that \(\left\|\frac1T\sum_{t=1}^T \nabla f_t(\theta_0)\right\| < \frac{C}{\sqrt T}\). Then under \(\mathcal{A}\), for any unit vector \(w\),
\[ \frac1T\sum_{t=0}^T \nabla f_t(\theta_0)^\top w \ge -\frac{C}{\sqrt T} - \frac1T\beta\|\theta_0\| \ge -\frac{C}{\sqrt T} - \frac{\beta B}{T}. \tag{C.10} \]
Let \(w = \frac{\theta^\star_T - \theta_0}{\|\theta^\star_T - \theta_0\|}\). Under the event \(\mathcal{E}\), for \(s > 0\),
\[ \frac1T\sum_{t=0}^T \nabla f_t(\theta_0 + sw)^\top w \ge -\frac{C}{\sqrt T} - \frac{\beta B}{T} + \left(\sigma_{\min} + \frac{\alpha}{T}\right)\min\{s, r\}. \tag{C.11} \]
Hence, if \(\min\{s, r\} > \frac{C\sqrt T + \beta B}{\sigma_{\min}T + \alpha}\), then \(\sum_{t=0}^T \nabla f_t(\theta_0 + sw)^\top w > 0\). Considering \(s = \|\theta^\star_T - \theta_0\|\), for which this gradient is 0, this means that
\[ \|\theta^\star_T - \theta_0\| \le \frac{C\sqrt T + \beta B}{\sigma_{\min}T + \alpha}. \tag{C.12} \]
Step 2: For \(c = \frac{\alpha}{\sigma_{\min}}\), we bound \(\mathbb{P}_{\theta\sim\pi_T}\left(\|\theta - \theta^\star_T\| \ge \frac{C'}{\sqrt{T+c}}\right)\).
Under \(\mathcal{E}\), \(\frac1T\sum_{t=1}^T f_t(\theta)\) is \(\sigma_{\min}\)-strongly convex for \(\theta \in B\left(\theta^\star_T, \frac{C}{\sqrt{T+c}}\right) \subset B(\theta_0, r)\), and \(f_0(\theta)\) is \(\alpha\)-strongly convex.
Let \(r' = r - \frac{C\sqrt T + \beta B}{\sigma_{\min}T + \alpha}\). Under \(\mathcal{A}\), \(B(\theta^\star_T, r') \subset B(\theta_0, r)\). Thus under \(\mathcal{E}\cap\mathcal{A}\), letting
\(w(\theta) := \frac{\theta - \theta^\star_T}{\|\theta - \theta^\star_T\|}\),
\[ \forall \theta \in B(\theta^\star_T, r') \subset B(\theta_0, r),\quad \sum_{t=0}^T \nabla f_t(\theta)^\top w(\theta) \ge (T\sigma_{\min} + \alpha)\|\theta - \theta^\star_T\|. \tag{C.13} \]
Suppose \(T\) is such that \(\frac{C}{\sqrt{T+c}} < r'\), i.e., \(\frac{C\sqrt T + \beta B}{\sigma_{\min}T + \alpha} + \frac{C}{\sqrt{T+c}} < r\). By shifting, we may assume that \(\sum_{t=0}^T f_t(\theta^\star_T) = 0\). Because \(f_t(\theta)\) is \(L\)-smooth for \(1 \le t \le T\) and \(\beta\)-smooth for \(t = 0\),
\[ \sum_{t=0}^T f_t(\theta) \le \frac{LT+\beta}{2}\|\theta - \theta^\star_T\|^2. \tag{C.14} \]
Then for all \(\theta \in B\left(\theta^\star_T, \frac{C}{\sqrt{T+c}}\right)^c\),
\[\begin{aligned} \sum_{t=0}^T f_t(\theta) &\ge \sum_{t=0}^T f_t\left(\theta^\star_T + \tfrac{C}{\sqrt{T+c}}w(\theta)\right) + \sum_{t=0}^T\left[f_t(\theta) - f_t\left(\theta^\star_T + \tfrac{C}{\sqrt{T+c}}w(\theta)\right)\right] && \text{(C.15)} \\ &\ge \frac12(T\sigma_{\min}+\alpha)\frac{C^2}{T+c} + (T\sigma_{\min}+\alpha)\frac{C}{\sqrt{T+c}}\left(\|\theta-\theta^\star_T\| - \frac{C}{\sqrt{T+c}}\right) && \text{(C.16)} \\ &\ge \frac12\sigma_{\min}C^2 + \sigma_{\min}C\sqrt{T+c}\left(\|\theta-\theta^\star_T\| - \frac{C}{\sqrt{T+c}}\right). && \text{(C.17)} \end{aligned}\]
Thus for any \(C' \ge 0\),
\[ \int_{\mathbb{R}^d} e^{-\sum_{t=0}^T f_t(\theta)}\,d\theta \ge \int_{\mathbb{R}^d} e^{-\frac{LT+\beta}{2}\|\theta-\theta^\star_T\|^2}\,d\theta = \left(\frac{2\pi}{LT+\beta}\right)^{\frac d2} \tag{C.18} \]
\[\begin{aligned} \int_{B\left(\theta^\star_T, \frac{C'}{\sqrt{T+c}}\right)^c} e^{-\sum_{t=0}^T f_t(\theta)}\,d\theta &\le \int_{B\left(\theta^\star_T, \frac{C'}{\sqrt{T+c}}\right)^c} e^{-\frac12\sigma_{\min}C^2}\,e^{-\sigma_{\min}C\sqrt{T+c}\left(\|\theta-\theta^\star_T\| - \frac{C}{\sqrt{T+c}}\right)}\,d\theta && \text{(C.19)} \\ &= \int_{\frac{C'}{\sqrt{T+c}}}^\infty \mathrm{Vol}_{d-1}(\mathbb{S}^{d-1})\,\gamma^{d-1}\,e^{\frac12\sigma_{\min}C^2}\,e^{-\sigma_{\min}C\sqrt{T+c}\,\gamma}\,d\gamma && \text{(C.20)} \\ &= \int_{\frac{C'}{\sqrt{T+c}}}^\infty \mathrm{Vol}_{d-1}(\mathbb{S}^{d-1})\,e^{\frac12\sigma_{\min}C^2}\,e^{-\left(\sigma_{\min}C\sqrt{T+c}\,\gamma - (d-1)\log\gamma\right)}\,d\gamma. && \text{(C.21)} \end{aligned}\]
Now, when \(C \ge \max\left\{\frac{2(d-1)}{\sigma_{\min}}, 1\right\}\), we have (using \(\log\gamma \le \gamma\) and \(\sqrt{T+c} \ge 1\))
\[\begin{aligned} \sigma_{\min}C\sqrt{T+c}\,\gamma - (d-1)\log\gamma &\ge \sigma_{\min}C\sqrt{T+c}\,\gamma - (d-1)\gamma && \text{(C.22)} \\ &\ge \sigma_{\min}C\sqrt{T+c}\,\gamma - \frac{\sigma_{\min}C\sqrt{T+c}\,\gamma}{2} && \text{(C.23)} \\ &= \frac{\sigma_{\min}C\sqrt{T+c}\,\gamma}{2}. && \text{(C.24)} \end{aligned}\]
Then by Stirling's formula, for some \(K_1\),
\[\begin{aligned} \text{(C.21)} &\le \mathrm{Vol}_{d-1}(\mathbb{S}^{d-1})\,e^{\frac12\sigma_{\min}C^2}\int_{\frac{C'}{\sqrt{T+c}}}^\infty e^{-\frac{\sigma_{\min}C\sqrt{T+c}\,\gamma}{2}}\,d\gamma && \text{(C.25)} \\ &\le \frac{2\pi^{\frac d2}}{\Gamma\left(\frac d2\right)}\,e^{\frac12\sigma_{\min}C^2}\,\frac{2}{\sigma_{\min}C\sqrt{T+c}}\,e^{-\frac{\sigma_{\min}CC'}{2}} && \text{(C.26)} \\ &\le \frac{K_1}{\sigma_{\min}C\sqrt{T+c}}\left(\frac{2\pi e}{d}\right)^{\frac d2} e^{\frac12\sigma_{\min}C^2 - \frac{\sigma_{\min}CC'}{2}}. && \text{(C.27)} \end{aligned}\]
We bound \(\mathbb{P}_{\theta\sim\pi_T}\left(\|\theta-\theta^\star_T\| \ge \frac{C'}{\sqrt{T+c}}\right)\). By (C.18) and (C.21),
\[\begin{aligned} \mathbb{P}_{\theta\sim\pi_T}\left(\|\theta-\theta^\star_T\| \ge \frac{C'}{\sqrt{T+c}}\right) &= \frac{\int_{B\left(\theta^\star_T, \frac{C'}{\sqrt{T+c}}\right)^c} e^{-\sum_{t=0}^T f_t(\theta)}\,d\theta}{\int_{\mathbb{R}^d} e^{-\sum_{t=0}^T f_t(\theta)}\,d\theta} && \text{(C.28)} \\ &\le \frac{K_1}{\sigma_{\min}C\sqrt{T+c}}\left(\frac{LT+\beta}{2\pi}\right)^{\frac d2}\left(\frac{2\pi e}{d}\right)^{\frac d2} e^{\frac12\sigma_{\min}C^2 - \frac{\sigma_{\min}CC'}{2}} && \text{(C.29)} \\ &= \frac{K_1}{\sigma_{\min}C\sqrt{T+c}}\left(\frac{(LT+\beta)e}{d}\right)^{\frac d2} e^{\frac12\sigma_{\min}C^2 - \frac{\sigma_{\min}CC'}{2}} && \text{(C.30)} \end{aligned}\]
as needed. The requirements on \(C\) are \(C \ge \max\left\{1,\; M\sqrt{2d\log\left(\frac{4d}{\varepsilon}\right)},\; \frac{2d}{\sigma_{\min}}\right\}\), so the theorem follows.
C.1.3 Online logistic regression: Proof of Lemma C.1.2 and Theorem 3.2.2
To prove Lemma C.1.2, we will apply Theorem C.1.1. To do this, we need to verify the conditions in Theorem C.1.1.
Lemma C.1.4. Under the assumptions of Lemma C.1.2,
1. (Gradients have bounded variation) For all \(t\), \(\|\nabla f_t(\theta)\| \le M\) and \(\|\nabla f_t(\theta) - \mathbb{E}\nabla f_t(\theta)\| \le 2M\).
2. (Smoothness) For all \(t\), \(f_t\) is \(\frac{M^2}{4}\)-smooth.
3. (Strong convexity in neighborhood) For \(T \ge \frac{M^4\log(\frac d\varepsilon)}{8\sigma^2}\),
\[ \mathbb{P}\left(\forall\theta\in B\left(\theta_0, \frac1M\right),\; \sum_{t=1}^T \nabla^2 f_t(\theta) \succeq \frac{\sigma}{2e}T I_d\right) \ge 1-\varepsilon. \tag{C.31} \]
Proof. First, we calculate the Hessian of the negative log-likelihood. If \(f_t(\theta) = -\log\varphi(yu^\top\theta)\), then
\[ \nabla f_t(\theta) = -y\,\frac{\varphi(yu^\top\theta)\varphi(-yu^\top\theta)}{\varphi(yu^\top\theta)}\,u = -y\,\varphi(-yu^\top\theta)\,u \tag{C.32} \]
\[ \nabla^2 f_t(\theta) = \varphi(-yu^\top\theta)\varphi(yu^\top\theta)\,uu^\top. \tag{C.33} \]
Note that ‖∇𝑓𝑡(𝜃)‖ ≤ ‖𝑢‖ ≤𝑀 , so the first point follows.
To obtain the expected values, note that \(y = 1\) with probability \(\varphi(u^\top\theta_0)\), and \(y = -1\) with probability \(1 - \varphi(u^\top\theta_0)\), so that
\[\begin{aligned} \mathbb{E}[\nabla^2 f_t(\theta)] &= \mathbb{E}_{(u,y)}\left[\varphi(-yu^\top\theta)\varphi(yu^\top\theta)\,uu^\top\right] && \text{(C.34)} \\ &= \mathbb{E}_u\left[\varphi(u^\top\theta_0)\varphi(-u^\top\theta)\varphi(u^\top\theta)\,uu^\top + (1-\varphi(u^\top\theta_0))\varphi(-u^\top\theta)\varphi(u^\top\theta)\,uu^\top\right] && \text{(C.35)} \\ &= \mathbb{E}_u\left[\varphi(u^\top\theta)(1-\varphi(u^\top\theta))\,uu^\top\right]. && \text{(C.36)} \end{aligned}\]
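The closed forms (C.32)-(C.33) are easy to verify numerically in one dimension; the following stdlib-only sketch (function names are ours) compares them against finite differences of \(f_t(\theta) = -\log\varphi(yu\theta)\).

```python
import math

def phi(x):
    return 1.0 / (1.0 + math.exp(-x))

def f(theta, u, y):
    # one-observation negative log-likelihood: -log phi(y * u * theta)
    return -math.log(phi(y * u * theta))

def grad_f(theta, u, y):
    # (C.32): -y * phi(-y u theta) * u
    return -y * phi(-y * u * theta) * u

def hess_f(theta, u, y):
    # (C.33): phi(-y u theta) * phi(y u theta) * u^2; note it does not depend
    # on the sign of y, which is what makes (C.34)-(C.36) collapse
    return phi(-y * u * theta) * phi(y * u * theta) * u * u

def num_grad(theta, u, y, h=1e-6):
    return (f(theta + h, u, y) - f(theta - h, u, y)) / (2.0 * h)

def num_hess(theta, u, y, h=1e-4):
    return (f(theta + h, u, y) - 2.0 * f(theta, u, y) + f(theta - h, u, y)) / (h * h)
```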
By assumption, at \(\theta = \theta_0\) this satisfies \(\mathbb{E}_u[\varphi(u^\top\theta_0)(1-\varphi(u^\top\theta_0))uu^\top] \succeq \sigma I_d\).
Next, we show that \(\sum_{t=1}^T \nabla^2 f_t(\theta_0)\) is lower-bounded with high probability. Note that \(\|\nabla^2 f_t(\theta_0)\| = \left\|\varphi(-yu^\top\theta_0)\varphi(yu^\top\theta_0)\,uu^\top\right\| \le \frac14 M^2\). (So the second point follows.) By the Matrix Chernoff bound,
\[ \mathbb{P}\left(\sum_{t=1}^T \nabla^2 f_t(\theta_0) \not\succeq \frac\sigma2 T I_d\right) \le d\,e^{-\frac{2\cdot 4^2}{M^4}T\left(\frac\sigma2\right)^2} = d\,e^{-\frac{8\sigma^2 T}{M^4}} \le \varepsilon \tag{C.37} \]
when \(T \ge \frac{M^4\log(\frac d\varepsilon)}{8\sigma^2}\).
Finally, we show that if the minimum eigenvalue of this matrix is bounded away from 0 at \(\theta_0\), then it is also bounded away from 0 in a neighborhood. To see this, note
\[ \frac{\varphi(x+c)(1-\varphi(x+c))}{\varphi(x)(1-\varphi(x))} = \frac{e^{x+c}}{(1+e^{x+c})^2}\cdot\frac{(1+e^x)^2}{e^x} \ge \frac{e^c}{e^{2c}} = e^{-c}. \tag{C.38} \]
Therefore, if \(\sum_{t=1}^T \nabla^2 f_t(\theta_0) \succeq \sigma' I_d\), then for \(\|\theta-\theta_0\|_2 \le \frac1M\) we have \(|u^\top\theta - u^\top\theta_0| \le 1\), so by (C.38),
\[ \sum_{t=1}^T \nabla^2 f_t(\theta) = \sum_{t=1}^T \varphi(u_t^\top\theta)(1-\varphi(u_t^\top\theta))\,u_tu_t^\top \tag{C.39} \]
\[ \succeq \sum_{t=1}^T e^{-1}\varphi(u_t^\top\theta_0)(1-\varphi(u_t^\top\theta_0))\,u_tu_t^\top \succeq \frac{\sigma'}{e} I_d. \tag{C.40} \]
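The ratio bound (C.38) can also be checked numerically; the sketch below (stdlib only, names ours) verifies \(\varphi(x+c)(1-\varphi(x+c)) \ge e^{-|c|}\varphi(x)(1-\varphi(x))\) on a grid, which is the form actually used for shifts of either sign.

```python
import math

def logistic_weight(x):
    # phi(x) * (1 - phi(x))
    p = 1.0 / (1.0 + math.exp(-x))
    return p * (1.0 - p)

def ratio_bound_holds(x, c, tol=1e-12):
    # checks logistic_weight(x + c) / logistic_weight(x) >= e^{-|c|}
    return logistic_weight(x + c) / logistic_weight(x) >= math.exp(-abs(c)) - tol
```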
Therefore,
\[ \mathbb{P}\left(\exists\theta\in B\left(\theta_0,\frac1M\right) :\ \sum_{t=1}^T \nabla^2 f_t(\theta) \not\succeq \frac{\sigma}{2e}T I_d\right) \le \mathbb{P}\left(\sum_{t=1}^T \nabla^2 f_t(\theta_0) \not\succeq \frac\sigma2 T I_d\right) \le \varepsilon. \tag{C.41} \]
Proof of Lemma C.1.2. Part 1 was already shown in Lemma C.1.4.
Lemma C.1.4 shows that the conditions of Theorem C.1.1 are satisfied with \(M \mapsto 2M\), \(L = \frac{M^2}{4}\), \(r = \frac1M\), \(\sigma_{\min} = \frac{\sigma}{2e}\), \(T_{\min} = \frac{M^4\log(\frac{2d}{\varepsilon})}{8\sigma^2}\). Also, \(\alpha = \beta\). We further need to check that the condition on \(t\) implies that \(\frac{C\sqrt t + \beta B}{\sigma_{\min}t + \alpha} + \frac{C}{\sqrt t} < \frac1M\). We have, noting \(\sigma_{\min} \le L\) (the strong convexity is at most the smoothness),
\[ \frac{C\sqrt t + \beta B}{\sigma_{\min}t + \alpha} + \frac{C}{\sqrt t} \le \left(\frac{C}{\sigma_{\min}} + 1\right)\frac{1}{\sqrt{t + \frac\alpha L}} + \frac{\beta B}{\sigma_{\min}\left(t + \frac{\alpha}{\sigma_{\min}}\right)}, \tag{C.42} \]
so it suffices to have each term be \(< \frac{1}{2M}\), and this holds when \(t > 4M^2\left(\frac{C}{\sigma_{\min}} + 1\right)^2 = 4M^2\left(\frac{2eC}{\sigma} + 1\right)^2\) and \(t > \frac{2MB\beta}{\sigma_{\min}} = \frac{4eMB\alpha}{\sigma}\).
Parts 2 and 3 then follow immediately.
Proof of Theorem 3.2.2. Redefine \(\sigma\) such that \(I(\theta_0) \succeq \sigma I_d\) holds. (By Remark C.1.3, this \(\sigma\) is a constant factor times the \(\sigma\) in Theorem 3.2.2.) Theorem 3.2.2 follows from Theorem 3.2.4 once we show that Assumptions 3.2.1, 3.2.2, and 3.2.3 are satisfied. Assumption 3.2.1 is satisfied with \(L_0 = \alpha\) and \(L = \frac{M^2}{4}\). The rest will follow from Lemma C.1.2, except that we need bounds to cover the case \(t \le T_{\min} := \max\left\{\frac{M^4\log(\frac{2d}{\varepsilon})}{8\sigma^2},\; \frac{16e^2M^2C^2}{\sigma^2},\; \frac{4eMB\alpha}{\sigma}\right\}\) as well.
Showing that Assumption 3.2.2 holds. Note \(L \ge \sigma\), so \(\frac{C'}{\sqrt{T + \frac\alpha L}} \ge \frac{C'}{\sqrt{T + \frac{2e\alpha}{\sigma}}}\). For \(t > T_{\min}\), item 2 of Lemma C.1.2 shows Assumption 3.2.2 is satisfied with \(c = \frac\alpha L\) (where \(L = \frac{M^2}{4}\)), \(A_1 = \frac{K_1}{\sigma C}\left(\frac{\left(\frac{M^2}{4}T+\alpha\right)e}{d}\right)^{\frac d2} e^{\frac{\sigma C^2}{4e}}\), and \(k_1 = \frac{\sigma C}{4e}\).
For \(t \le T_{\min}\), we use Lemma D.2.5, which says that if \(p(x) \propto e^{-f(x)}\) on \(\mathbb{R}^d\) and \(f\) is \(\kappa\)-strongly convex and \(K\)-smooth, and \(x^\star = \operatorname{argmin}_x f(x)\), then
\[ \mathbb{P}_{x\sim p}\left(\|x-x^\star\|^2 \ge \frac1\kappa\left(\sqrt d + \sqrt{2t + d\log\left(\frac K\kappa\right)}\right)^2\right) \le e^{-t}. \tag{C.43} \]
In our case, \(\sum_{s=0}^t f_s(x)\) is \(\alpha\)-strongly convex and \((\alpha + T_{\min}L)\)-smooth, so with \(\kappa = \alpha\) and \(K = \alpha + T_{\min}L\),
\[\begin{aligned} \mathbb{P}_{x\sim p}(\|x-x^\star\| \ge \gamma) &\le \exp\left(-\frac{(\gamma\sqrt\kappa - \sqrt d)^2 - d\log\left(\frac K\kappa\right)}{2}\right) && \text{(C.44)} \\ &= e^{\frac d2\left(-1+\log\left(\frac K\kappa\right)\right)}\,e^{\gamma\sqrt{\kappa d} - \frac{\gamma^2\kappa}{2}} && \text{(C.45)} \\ &\le e^{\frac d2\left(-1+\log\left(\frac K\kappa\right)\right) - \left(\gamma - 2\sqrt{\frac d\kappa}\right)\sqrt{\kappa d}}. && \text{(C.46)} \end{aligned}\]
Thus for \(t \le T_{\min}\),
\[ \mathbb{P}_{\theta\sim\pi_t}(\|\theta - \theta^\star_t\| \ge \gamma) \le A_2 e^{-k_2\gamma} \tag{C.47} \]
with
\[ A_2 = e^{\frac d2\left(-1+\log\left(\frac K\kappa\right)\right)} = e^{\frac d2\left(-1+\log\left(\frac{T_{\min}L+\alpha}{\alpha}\right)\right)} \tag{C.48} \]
\[ k_2 = \frac{\sqrt{\kappa d}}{\sqrt{T_{\min} + \frac\alpha L}} = \frac{\sqrt{\alpha d}}{\sqrt{T_{\min} + \frac\alpha L}}. \tag{C.49} \]
Take \(A = \max\{A_1, A_2\}\) and \(k = \min\{k_1, k_2\}\), and note that \(\log(A)\) and \(k^{-1}\) are polynomial in all parameters and \(\log(T)\).
Showing that Assumption 3.2.3 holds. For \(t > T_{\min}\), item 3 of Lemma C.1.2 shows that with probability at least \(1-\varepsilon\) (using \(L \ge \sigma\)),
\[ \|\theta^\star_t - \theta_0\| \le \frac{C\sqrt t + \alpha B}{\sigma t/(2e) + \alpha} \le \left(\frac{C}{\sigma/(2e)} + \frac{\alpha B}{(\sigma/(2e))\sqrt{t + \frac{2e\alpha}{\sigma}}}\right)\frac{1}{\sqrt{t + \frac\alpha L}}. \tag{C.50} \]
Now consider \(t \le T_{\min}\). Since \(F_t\) is strongly convex, the minimizer \(\theta^\star_t\) of \(F_t\) is the unique point where \(\nabla F_t(\theta^\star_t) = 0\). Moreover, \(\left\|\sum_{k=1}^t \nabla f_k(\theta)\right\| \le T_{\min}M\) for \(t \le T_{\min}\). Therefore, since \(f_0\) is \(\alpha\)-strongly convex, we have that \(\|\nabla F_t(\theta)\| = \left\|\nabla f_0(\theta) + \sum_{k=1}^t \nabla f_k(\theta)\right\| > 0\) for all \(\|\theta\| > T_{\min}M\alpha^{-1}\). Therefore, we must have \(\|\theta^\star_t\| \le T_{\min}M\alpha^{-1}\) for all \(t \le T_{\min}\), and hence
\[ \|\theta^\star_t - \theta_0\| \le T_{\min}M\alpha^{-1} + B \quad \forall t \le T_{\min}. \tag{C.51} \]
Set \(\mathcal{D} = 2\max\left\{(T_{\min}M\alpha^{-1} + B)\sqrt{T_{\min} + \frac\alpha L},\; \frac{C}{\sigma/(2e)} + \frac{\sqrt\alpha B}{\sqrt{\sigma/(2e)}}\right\}\). Then Equations (C.50) and (C.51) and the triangle inequality imply that if \(t < \tau\), then \(\|\theta^\star_t - \theta^\star_\tau\| \le \frac{\mathcal{D}}{\sqrt{t + \frac\alpha L}}\). To get Assumption 3.2.3 to hold with probability at least \(1-\varepsilon\) for all \(t, \tau < T\), substitute \(\varepsilon \mapsto \frac\varepsilon T\). \(\mathcal{D}\) is polynomial in all parameters and \(\log(T)\).
Appendix D
Calculations on probability
distributions
D.1 Chi-squared and KL inequalities
Lemma D.1.1. Let \(P, Q\) be probability measures on \(\Omega\) such that \(Q \ll P\), \(\chi^2(Q||P) < \infty\), and let \(g : \Omega \to \mathbb{R}\) satisfy \(g \in L^2(P)\). Then
\[ \left(\int_\Omega g(x)\,P(dx) - \int_\Omega g(x)\,Q(dx)\right)^2 \le \operatorname{Var}_P(g)\,\chi^2(Q||P). \tag{D.1} \]
Proof. Noting that \(\int_\Omega P(dx) - Q(dx) = 0\) and using Cauchy-Schwarz,
\[\begin{aligned} \left(\int_\Omega g(x)\,P(dx) - \int_\Omega g(x)\,Q(dx)\right)^2 &= \left(\int_\Omega \left(g(x) - \mathbb{E}_P[g(x)]\right)(P(dx) - Q(dx))\right)^2 && \text{(D.2)} \\ &\le \left(\int_\Omega \left(g(x) - \mathbb{E}_P[g(x)]\right)^2 P(dx)\right)\left(\int_\Omega \left(1 - \frac{dQ}{dP}\right)^2 P(dx)\right) && \text{(D.3)} \\ &= \operatorname{Var}_P(g)\,\chi^2(Q||P). && \text{(D.4)} \end{aligned}\]
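On a finite state space the inequality is a three-line computation; a stdlib-only sanity-check sketch (function names are ours):

```python
def chi_sq(q, p):
    # chi^2(Q || P) = sum_x (q(x)/p(x))^2 p(x) - 1 on a finite state space
    return sum((qi / pi) ** 2 * pi for qi, pi in zip(q, p)) - 1.0

def var_p(g, p):
    # Var_P(g)
    m = sum(gi * pi for gi, pi in zip(g, p))
    return sum((gi - m) ** 2 * pi for gi, pi in zip(g, p))

def mean_gap_sq(g, p, q):
    # (E_P g - E_Q g)^2, the left-hand side of (D.1)
    ep = sum(gi * pi for gi, pi in zip(g, p))
    eq = sum(gi * qi for gi, qi in zip(g, q))
    return (ep - eq) ** 2
```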
The continuous analogue of Lemma D.1.1 is the following.
Lemma D.1.2. Let \(\Omega = \Omega^{(1)} \times \Omega^{(2)}\) with \(\Omega^{(1)} \subseteq \mathbb{R}^{d_1}\). Suppose \(P_{x_1}\) is a probability measure on \(\Omega^{(2)}\) for each \(x_1 \in \Omega^{(1)}\), with density function \(p_{x_1}\) (with respect to some reference measure \(dx\)); suppose \(g : \Omega^{(2)} \to \mathbb{R}\) satisfies \(g \in L^2(P_{x_1})\), and \(\int_{\Omega^{(2)}} \frac{\|\nabla_{x_1}p_{x_1}(x_2)\|^2}{p_{x_1}(x_2)}\,dx_2 < \infty\). Then
\[ \left\|\int_{\Omega^{(2)}} g(x_2)\,\nabla_{x_1}(p_{x_1}(x_2))\,dx_2\right\|^2 \le \operatorname{Var}_{P_{x_1}}(g)\left(\int_{\Omega^{(2)}} \frac{\|\nabla_{x_1}p_{x_1}(x_2)\|^2}{p_{x_1}(x_2)}\,dx_2\right). \tag{D.5} \]
Proof. Because each \(P_{x_1}\) is a probability measure,
\[ \int_{\Omega^{(2)}} \nabla_{x_1}p_{x_1}(x_2)\,dx_2 = \nabla_{x_1}\int_{\Omega^{(2)}} p_{x_1}(x_2)\,dx_2 = \nabla_{x_1}(1) = 0. \tag{D.6} \]
Hence
\[\begin{aligned} \left\|\int_{\Omega^{(2)}} g(x_2)\,\nabla_{x_1}p_{x_1}(x_2)\,dx_2\right\|^2 &= \left\|\int_{\Omega^{(2)}} \left[g(x_2) - \mathbb{E}_{P_{x_1}}[g(x_2)]\right]\nabla_{x_1}p_{x_1}(x_2)\,dx_2\right\|^2 && \text{(D.8)} \\ &\le \left(\int_{\Omega^{(2)}} \left[g(x_2) - \mathbb{E}_{P_{x_1}}[g(x_2)]\right]^2 p_{x_1}(x_2)\,dx_2\right)\left(\int_{\Omega^{(2)}} \frac{\|\nabla_{x_1}p_{x_1}(x_2)\|^2}{p_{x_1}(x_2)}\,dx_2\right) && \text{(D.9)} \\ &= \operatorname{Var}_{P_{x_1}}(g)\left(\int_{\Omega^{(2)}} \frac{\|\nabla_{x_1}p_{x_1}(x_2)\|^2}{p_{x_1}(x_2)}\,dx_2\right). && \text{(D.10)} \end{aligned}\]
Lemma D.1.3. Let \(P\) be a probability measure and \(Q\) a nonnegative measure on \(\Omega\). Define the measure \(R\) by \(R = \min\left\{\frac{dQ}{dP}, 1\right\}P\). (If \(p, q\) are the density functions of \(P, Q\), then the density function of \(R\) is simply \(r(x) = \min\{p(x), q(x)\}\).) Let \(\delta\) be the overlap \(\delta = R(\Omega)\), and \(\widetilde R = \frac R\delta = \frac{R}{R(\Omega)}\) the normalized overlap measure. Then
\[ \chi^2(\widetilde R||P) \le \frac1\delta. \tag{D.12} \]
Proof. We make a change of variable to \(u = \frac{dQ}{dP}\). Let \(F(u) = P\left(\left\{x : \frac{dQ}{dP}(x) \le u\right\}\right)\). Then
\[\begin{aligned} \chi^2(\widetilde R||P) &= \int_\Omega \left(\frac{\min\left\{1, \frac{dQ}{dP}\right\}}{\delta} - 1\right)^2 P(dx) && \text{(D.13)} \\ &= \int \left(\frac{\min\{1, u\}}{\delta} - 1\right)^2 dF(u) \quad \text{(Stieltjes integral)} && \text{(D.14)} \\ &\le \frac{1}{\delta^2}\int \left(\min\{1, u\}\right)^2 dF(u) && \text{(D.15)} \\ &\le \frac{1}{\delta^2}\int \min\{1, u\}\,dF(u). && \text{(D.16)} \end{aligned}\]
Now note that \(\int \min\{1, u\}\,dF(u) = \int_\Omega \min\left\{1, \frac{dQ}{dP}\right\}P(dx) = \delta\). Hence (D.16) is at most \(\frac{1}{\delta^2}\delta = \frac1\delta\).
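A discrete sanity check of this bound (stdlib-only sketch, names ours):

```python
def overlap_chi_sq(p, q):
    # r = min(p, q) pointwise, delta = total mass of r, r_tilde = r / delta;
    # returns (chi^2(r_tilde || p), 1/delta) on a finite state space
    r = [min(pi, qi) for pi, qi in zip(p, q)]
    delta = sum(r)
    chi = sum((ri / delta / pi) ** 2 * pi for ri, pi in zip(r, p)) - 1.0
    return chi, 1.0 / delta
```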
Lemma D.1.4. If \(P, P_i\) are probability measures on \(\Omega\) such that \(P = \sum_{i=1}^n w_iP_i\) (where the \(w_i > 0\) sum to 1), and \(Q \ll P, P_i\), then
\[ \chi^2(Q||P) \le \sum_{i=1}^n w_i\,\chi^2(Q||P_i). \tag{D.17} \]
This inequality follows from convexity of \(f\)-divergences; for completeness we include a proof.
Proof. By Cauchy-Schwarz,
\[\begin{aligned} \chi^2(Q||P) &= \left(\int_\Omega \left(\frac{dQ}{dP}\right)^2 P(dx)\right) - 1 && \text{(D.18)} \\ &= \int_\Omega \left(\sum_{i=1}^n w_i\frac{dP_i}{dP}\frac{dQ}{dP_i}\right)^2 P(dx) - 1 && \text{(D.19)} \\ &\le \int_\Omega \left(\sum_{i=1}^n w_i\frac{dP_i}{dP}\right)\left(\sum_{i=1}^n w_i\left(\frac{dQ}{dP_i}\right)^2\frac{dP_i}{dP}\right) P(dx) - 1 && \text{(D.20)} \\ &= \sum_{i=1}^n w_i\left(\int_\Omega \left(\frac{dQ}{dP_i}\right)^2 P_i(dx) - 1\right) = \sum_{i=1}^n w_i\,\chi^2(Q||P_i), && \text{(D.21)} \end{aligned}\]
using \(\sum_{i=1}^n w_i\frac{dP_i}{dP} = 1\) in the last step.
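The mixture inequality (D.17) is easy to confirm on a finite state space (stdlib-only sketch, names ours):

```python
def chi_sq(q, p):
    # chi^2(Q || P) on a finite state space
    return sum((qi / pi) ** 2 * pi for qi, pi in zip(q, p)) - 1.0

def mixture(dists, weights):
    # P = sum_i w_i P_i, computed pointwise
    return [sum(w * d[x] for w, d in zip(weights, dists))
            for x in range(len(dists[0]))]
```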
Lemma D.1.5. Suppose \(P, \widetilde P\) are probability distributions on \(\Omega\) such that \(\frac{d\widetilde P}{dP} \le K\). Then for any probability measure \(Q \ll P\),
\[ \chi^2(Q||P) \le K\chi^2(Q||\widetilde P) + K - 1. \tag{D.22} \]
Proof.
\[\begin{aligned} \chi^2(Q||P) &= \int_\Omega \left(\frac{dQ}{dP}\right)^2 P(dx) - 1 = \int_\Omega \left(\frac{dQ}{d\widetilde P}\right)^2\left(\frac{d\widetilde P}{dP}\right)^2 P(dx) - 1 && \text{(D.23)} \\ &\le K\left(\int_\Omega \left(\frac{dQ}{d\widetilde P}\right)^2 \widetilde P(dx)\right) - 1 = K(\chi^2(Q||\widetilde P) + 1) - 1. && \text{(D.24)} \end{aligned}\]
Lemma D.1.6. Let \(W\) and \(W'\) be probability measures over \(I\), with densities \(w(i)\), \(w'(i)\) with respect to a reference measure \(di\), such that \(\operatorname{KL}(W||W') < \infty\). For each \(i \in I\), suppose \(P_i, Q_i\) are probability measures over \(\Omega\). Then
\[ \operatorname{KL}\left(\int_I w(i)P_i\,di \,\Big\|\, \int_I w'(i)Q_i\,di\right) \le \operatorname{KL}(W||W') + \int_I w(i)\operatorname{KL}(P_i||Q_i)\,di. \]
Proof. Overloading notation, we will use \(\operatorname{KL}(a||b)\) for two measures \(a, b\) even if they are not necessarily probability distributions, with the obvious definition. Using the convexity of KL divergence,
\[\begin{aligned} \operatorname{KL}\left(\int_I w(i)P_i\,di \,\Big\|\, \int_I w'(i)Q_i\,di\right) &= \operatorname{KL}\left(\int_I w(i)P_i\,di \,\Big\|\, \int_I w(i)\,Q_i\frac{w'(i)}{w(i)}\,di\right) \\ &\le \int_I w(i)\operatorname{KL}\left(P_i \,\Big\|\, Q_i\frac{w'(i)}{w(i)}\right) di \\ &= \int_I w(i)\log\left(\frac{w(i)}{w'(i)}\right) di + \int_I w(i)\operatorname{KL}(P_i||Q_i)\,di \\ &= \operatorname{KL}(W||W') + \int_I w(i)\operatorname{KL}(P_i||Q_i)\,di. \end{aligned}\]
D.2 Chi-squared divergence calculations for log-
concave distributions
We calculate the chi-squared divergence between log-concave distributions at different
temperatures, and at different locations. In the gaussian case there is a closed formula
(Lemma D.2.1). The general case is more involved (Lemmas D.2.6 and D.2.7), and
the bound is in terms of the strong convexity and smoothness constants.
Lemma D.2.1. For a matrix \(\Sigma\), let \(|\Sigma|\) denote its determinant. The \(\chi^2\) divergence between \(N(\mu_1, \Sigma_1)\) and \(N(\mu_2, \Sigma_2)\) is
\[\begin{aligned} &\chi^2(N(\mu_2,\Sigma_2)||N(\mu_1,\Sigma_1)) \\ &= \frac{|\Sigma_1|^{\frac12}}{|\Sigma_2|}\left|2\Sigma_2^{-1} - \Sigma_1^{-1}\right|^{-\frac12} \\ &\quad\cdot \exp\left(\frac12(2\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1)^\top(2\Sigma_2^{-1} - \Sigma_1^{-1})^{-1}(2\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1) + \frac12\mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_2^\top\Sigma_2^{-1}\mu_2\right) - 1. \end{aligned}\]
In particular, in the cases of equal mean or equal variance,
\[ \chi^2(N(\mu,\Sigma_2)||N(\mu,\Sigma_1)) = \frac{|\Sigma_1|^{\frac12}}{|\Sigma_2|}\left|2\Sigma_2^{-1} - \Sigma_1^{-1}\right|^{-\frac12} - 1 \tag{D.28} \]
\[ \chi^2(N(\mu_2,\Sigma)||N(\mu_1,\Sigma)) = \exp\left[(\mu_2-\mu_1)^\top\Sigma^{-1}(\mu_2-\mu_1)\right] - 1. \tag{D.29} \]
Proof.
\[\begin{aligned} &\chi^2(N(\mu_2,\Sigma_2)||N(\mu_1,\Sigma_1)) + 1 && \text{(D.30)} \\ &= \frac{1}{(2\pi)^{\frac d2}}\frac{|\Sigma_1|^{\frac12}}{|\Sigma_2|}\int_{\mathbb{R}^d} \exp\left[-\frac12\left(2(x-\mu_2)^\top\Sigma_2^{-1}(x-\mu_2) - (x-\mu_1)^\top\Sigma_1^{-1}(x-\mu_1)\right)\right] dx && \text{(D.31)} \\ &= \frac{1}{(2\pi)^{\frac d2}}\frac{|\Sigma_1|^{\frac12}}{|\Sigma_2|}\int_{\mathbb{R}^d} \exp\left[-\frac12\left(x^\top(2\Sigma_2^{-1}-\Sigma_1^{-1})x + 2x^\top\Sigma_1^{-1}\mu_1 - 4x^\top\Sigma_2^{-1}\mu_2 - \mu_1^\top\Sigma_1^{-1}\mu_1 + 2\mu_2^\top\Sigma_2^{-1}\mu_2\right)\right] dx && \text{(D.32--33)} \\ &= \frac{1}{(2\pi)^{\frac d2}}\frac{|\Sigma_1|^{\frac12}}{|\Sigma_2|}\,e^{c}\int_{\mathbb{R}^d} \exp\left[-\frac12\,x'^\top(2\Sigma_2^{-1}-\Sigma_1^{-1})x'\right] dx' && \text{(D.34)} \end{aligned}\]
where
\[ x' := x - (2\Sigma_2^{-1}-\Sigma_1^{-1})^{-1}(2\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1) \tag{D.35} \]
\[ c := \frac12(2\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1)^\top(2\Sigma_2^{-1}-\Sigma_1^{-1})^{-1}(2\Sigma_2^{-1}\mu_2 - \Sigma_1^{-1}\mu_1) + \frac12\mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_2^\top\Sigma_2^{-1}\mu_2. \tag{D.36} \]
Integrating gives the result. For the equal variance case,
\[ c = \frac12(2\mu_2-\mu_1)^\top\Sigma^{-1}(2\mu_2-\mu_1) + \frac12\mu_1^\top\Sigma^{-1}\mu_1 - \mu_2^\top\Sigma^{-1}\mu_2 = (\mu_2-\mu_1)^\top\Sigma^{-1}(\mu_2-\mu_1). \tag{D.37} \]
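The equal-variance case (D.29) can be confirmed numerically in one dimension; a stdlib-only sketch (names ours) comparing a midpoint-rule integral of \(\int q^2/p\,dx - 1\) against the closed form:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def chi_sq_numeric(mu1, mu2, sigma, lo=-10.0, hi=10.0, n=120000):
    # midpoint-rule approximation of  int q(x)^2 / p(x) dx - 1,
    # with q = N(mu2, sigma^2) and p = N(mu1, sigma^2)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        q = gauss_pdf(x, mu2, sigma)
        p = gauss_pdf(x, mu1, sigma)
        total += (q * q / p) * h
    return total - 1.0
```

For \(\mu_1 = 0\), \(\mu_2 = 0.5\), \(\sigma = 1\), the closed form gives \(e^{0.25} - 1 \approx 0.284\).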
The following theorem is essential in generalizing from gaussian to log-concave densities.
Theorem D.2.2 (Harge, [Har04]). Suppose the \(d\)-dimensional gaussian \(N(0,\Sigma)\) has density \(\gamma\). Let \(p = h\cdot\gamma\) be a probability density, where \(h\) is log-concave. Let \(g : \mathbb{R}^d \to \mathbb{R}\) be convex. Then
\[ \int_{\mathbb{R}^d} g\left(x - \mathbb{E}_p x\right)p(x)\,dx \le \int_{\mathbb{R}^d} g(x)\gamma(x)\,dx. \tag{D.38} \]
Lemma D.2.3 (\(\chi^2\)-tail bound). Let \(\gamma = N\left(0, \frac1\kappa I_d\right)\). Then
\[ \forall y \ge \sqrt{\frac d\kappa},\quad \mathbb{P}_{x\sim\gamma}(\|x\| \ge y) \le e^{-\frac\kappa2\left(y - \sqrt{\frac d\kappa}\right)^2}. \tag{D.39} \]
Proof. By the \(\chi^2_d\) tail bound in [LM00], for all \(t \ge 0\),
\[ \mathbb{P}_{x\sim\gamma}\left(\|x\|^2 \ge \frac1\kappa(\sqrt d + \sqrt{2t})^2\right) \le \mathbb{P}_{x\sim\gamma}\left(\|x\|^2 \ge \frac1\kappa(d + 2(\sqrt{dt} + t))\right) \le e^{-t} \tag{D.40} \]
\[ \implies \forall y \ge \sqrt{\frac d\kappa},\quad \mathbb{P}_{x\sim\gamma}(\|x\| \ge y) \le e^{-\left(\frac{\sqrt\kappa y - \sqrt d}{\sqrt2}\right)^2} = e^{-\frac\kappa2\left(y - \sqrt{\frac d\kappa}\right)^2}. \tag{D.41} \]
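A quick Monte Carlo sketch (stdlib only; names and parameters ours) of the norm-tail bound (D.39):

```python
import math
import random

def empirical_norm_tail(d, kappa, y, trials=20000, seed=1):
    # empirical P(||x|| >= y) for x ~ N(0, (1/kappa) I_d)
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(kappa)
    hits = 0
    for _ in range(trials):
        r2 = sum(rng.gauss(0.0, sd) ** 2 for _ in range(d))
        if math.sqrt(r2) >= y:
            hits += 1
    return hits / trials

def norm_tail_bound(d, kappa, y):
    # e^{-(kappa/2)(y - sqrt(d/kappa))^2}, valid for y >= sqrt(d/kappa)
    return math.exp(-0.5 * kappa * (y - math.sqrt(d / kappa)) ** 2)
```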
Lemma D.2.4. Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a \(\kappa\)-strongly convex and \(K\)-smooth function and let \(P\) be a probability measure with density function \(p(x) \propto e^{-f(x)}\). Let \(x^* = \operatorname{argmin}_x f(x)\) and \(\bar x = \mathbb{E}_P\,x\). Then
\[ \|x^* - \bar x\| \le \sqrt{\frac d\kappa}\left(\sqrt{\ln\left(\frac K\kappa\right)} + 5\right). \tag{D.42} \]
Proof. We establish concentration around both the mode \(x^*\) and the mean \(\bar x\). This will imply that the mode and the mean are close. Without loss of generality, assume \(x^* = 0\) and \(f(0) = 0\).
For the mode, note that by Lemma D.2.3, for all \(r \ge \sqrt{\frac d\kappa}\),
\[ \int_{\|x\|\ge r} e^{-f(x)}\,dx \le \int_{\|x\|\ge r} e^{-\frac12\kappa\|x\|^2}\,dx \le \left(\frac{2\pi}{\kappa}\right)^{\frac d2} e^{-\frac\kappa2\left(r - \sqrt{\frac d\kappa}\right)^2} \tag{D.43} \]
\[ \int_{\|x\|<r} e^{-f(x)}\,dx \ge \int_{\|x\|<r} e^{-\frac12 K\|x\|^2}\,dx \ge \left(\frac{2\pi}{K}\right)^{\frac d2}\left(1 - e^{-\frac K2\left(r - \sqrt{\frac d\kappa}\right)^2}\right). \tag{D.44} \]
Let \(r = \sqrt{\frac d\kappa}\left(\sqrt{\ln\left(\frac K\kappa\right)} + 3\right)\). Then
\[ \int_{\|x\|\ge r} e^{-f(x)}\,dx \le \left(\frac{2\pi}{\kappa}\right)^{\frac d2} e^{-\frac d2\left(\ln\left(\frac K\kappa\right)+2\right)} \le \left(\frac{2\pi}{K}\right)^{\frac d2} e^{-d} \tag{D.45} \]
\[ \int_{\|x\|<r} e^{-f(x)}\,dx \ge \left(\frac{2\pi}{K}\right)^{\frac d2}\left(1 - e^{-\frac K2\left(r - \sqrt{\frac d\kappa}\right)^2}\right) \ge \left(\frac{2\pi}{K}\right)^{\frac d2}\left(1 - e^{-\frac{Kd}{2\kappa}\left(2+\ln\left(\frac K\kappa\right)\right)}\right) \ge \left(\frac{2\pi}{K}\right)^{\frac d2}(1 - e^{-d}). \tag{D.46--47} \]
Thus
\[ \mathbb{P}_{x\sim P}(\|x\| \ge r) = \frac{\int_{\|x\|\ge r} e^{-f(x)}\,dx}{\int_{\|x\|\ge r} e^{-f(x)}\,dx + \int_{\|x\|<r} e^{-f(x)}\,dx} \le e^{-d} \le \frac12. \tag{D.48} \]
Now we show concentration around the mean. By adding a constant to \(f\), we may assume that \(p(x) = e^{-f(x)}\). Note that because \(f\) is \(\kappa\)-strongly convex, \(p\) is the product of \(\gamma(x)\) with a log-concave function, where \(\gamma(x)\) is the density of \(N(0, \frac1\kappa I_d)\). By Harge's Theorem D.2.2,
\[ \int_{\mathbb{R}^d} \|x - \bar x\|^2\,p(x)\,dx \le \int_{\mathbb{R}^d} \|x\|^2\,\gamma(x)\,dx = \frac d\kappa. \tag{D.49} \]
By Markov's inequality,
\[ \mathbb{P}_{x\sim P}\left(\|x - \bar x\| \ge \sqrt{\frac{2d}{\kappa}}\right) = \mathbb{P}_{x\sim P}\left(\|x - \bar x\|^2 \ge \frac{2d}{\kappa}\right) \le \frac12. \tag{D.50} \]
Let \(B_r(x)\) denote the ball of radius \(r\) around \(x\). By (D.48) and (D.50), the balls \(B_{\sqrt{\frac d\kappa}\left(\sqrt{\ln(\frac K\kappa)}+3\right)}(x^*)\) and \(B_{\sqrt{\frac{2d}{\kappa}}}(\bar x)\) intersect. Thus \(\|\bar x - x^*\| \le \sqrt{\frac d\kappa}\left(\sqrt{\ln\left(\frac K\kappa\right)} + 5\right)\).
Lemma D.2.5 (Concentration around mode for log-concave distributions). Suppose \(f : \mathbb{R}^d \to \mathbb{R}\) is \(\kappa\)-strongly convex and \(K\)-smooth. Let \(P\) be the probability measure with density function \(p(x) \propto e^{-f(x)}\), and let \(x^* = \operatorname{argmin}_x f(x)\). Then
\[ \mathbb{P}_{x\sim P}\left(\|x - x^*\|^2 \ge \frac1\kappa\left(\sqrt d + \sqrt{2t + d\ln\left(\frac K\kappa\right)}\right)^2\right) \le e^{-t}. \tag{D.51} \]
Proof. By (D.43) and (D.44),
\[ \mathbb{P}_{x\sim P}(\|x - x^*\| \ge r) \le \left(\frac K\kappa\right)^{\frac d2} e^{-\frac\kappa2\left(r - \sqrt{\frac d\kappa}\right)^2}. \tag{D.52} \]
Substituting in \(r = \frac{1}{\sqrt\kappa}\left(\sqrt d + \sqrt{2t + d\ln\left(\frac K\kappa\right)}\right)\) gives the lemma.
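For a Gaussian, where \(f = \frac\kappa2 x^2\) (so \(K = \kappa\) and, in one dimension, \(d = 1\)), the bound can be checked exactly via the complementary error function (stdlib-only sketch, names ours):

```python
import math

def mode_tail_exact(t):
    # exact P(|x - x*| >= (sqrt(d) + sqrt(2t)) / sqrt(kappa)) for the
    # one-dimensional density p(x) ~ e^{-kappa x^2 / 2} (K = kappa, d = 1);
    # after standardizing, kappa cancels and the threshold is 1 + sqrt(2t)
    z = 1.0 + math.sqrt(2.0 * t)
    return math.erfc(z / math.sqrt(2.0))
```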
Lemma D.2.6 (\(\chi^2\)-divergence between translates). Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a \(\kappa\)-strongly convex and \(K\)-smooth function, and let \(p(x) \propto e^{-f(x)}\) be a probability density. Let \(\|\mu\| = D\). Then
\[ \chi^2(p(x)||p(x-\mu)) \le e^{\frac12\kappa D^2 + KD\sqrt{\frac d\kappa}\left(\sqrt{\ln(\frac K\kappa)}+5\right)}\left(e^{KD\sqrt{\frac d\kappa}} + KD\sqrt{\frac{4\pi}{\kappa}}\,e^{\frac{2KD\sqrt d}{\sqrt\kappa} + \frac{K^2D^2}{2\kappa}}\right) - 1. \tag{D.53} \]
Proof. Without loss of generality, suppose \(f\) attains its minimum at 0, or equivalently, \(\nabla f(0) = 0\). We bound
\[ \chi^2(p(x)||p(x-\mu)) + 1 = \int_{\mathbb{R}^d} \frac{e^{-2f(x)}}{e^{-f(x-\mu)}}\,dx = \int_{\mathbb{R}^d} e^{-f(x)}e^{f(x-\mu)-f(x)}\,dx \le \int_{\mathbb{R}^d} e^{-f(x)}e^{KD\|x\| + \frac12\kappa D^2}\,dx. \tag{D.54--55} \]
Note that because \(f\) is \(\kappa\)-strongly convex, \(p\) is the product of \(\gamma(x)\) with a log-concave function, where \(\gamma(x)\) is the density of \(N(0, \frac1\kappa I_d)\). Let \(\bar x = \mathbb{E}_{x\sim p}\,x\) be the average value of \(x\) under \(p\). Applying Harge's Theorem D.2.2 with \(g(x) = e^{KD\|x+\bar x\|}\) and \(p(x) = e^{-f(x)}\) gives
\[ \int_{\mathbb{R}^d} e^{-f(x)}e^{KD\|x\|}\,dx \le \int_{\mathbb{R}^d} \gamma(x)e^{KD\|x+\bar x\|}\,dx. \tag{D.56} \]
Furthermore,
\[ \int_{\mathbb{R}^d} \gamma(x)e^{KD\|x+\bar x\|}\,dx \le e^{KD\|\bar x\|}\left(e^{KD\sqrt{\frac d\kappa}} + \int_{\sqrt{\frac d\kappa}}^\infty \mathbb{P}_{x\sim\gamma}(\|x\| \ge y)\,KDe^{KDy}\,dy\right), \tag{D.57} \]
where we used the identity \(\int f(x)\,p(x)\,dx \le f(y_0) + \int_{y_0}^\infty \mathbb{P}_{x\sim p}(x \ge y)f'(y)\,dy\), valid when \(f\) is an increasing function (applied to \(\|x\|\), with \(y_0 = \sqrt{d/\kappa}\)). By Lemma D.2.3,
\[ \forall y \ge \sqrt{\frac d\kappa},\quad \mathbb{P}_{x\sim\gamma}(\|x\| \ge y) \le e^{-\frac\kappa2\left(y - \sqrt{\frac d\kappa}\right)^2} \tag{D.58} \]
\[\begin{aligned} \implies \int_{\sqrt{d/\kappa}}^\infty \mathbb{P}_{x\sim\gamma}(\|x\|\ge y)\,KDe^{KDy}\,dy &\le KD\int_{\sqrt{d/\kappa}}^\infty e^{-\frac\kappa2\left(y-\sqrt{\frac d\kappa}\right)^2 + KDy}\,dy && \text{(D.59)} \\ &= KD\int_{\sqrt{d/\kappa}}^\infty e^{-\frac\kappa2\left[\left(y-\sqrt{\frac d\kappa}-\frac{KD}{\kappa}\right)^2 - \frac{2KD\sqrt d}{\kappa^{3/2}} - \frac{K^2D^2}{\kappa^2}\right]}\,dy && \text{(D.60)} \\ &\le KD\,e^{\frac{2KD\sqrt d}{\sqrt\kappa} + \frac{K^2D^2}{2\kappa}}\int_{\mathbb{R}} e^{-\frac\kappa2\left(y-\sqrt{\frac d\kappa}-\frac{KD}{\kappa}\right)^2}\,dy && \text{(D.61)} \\ &\le KD\sqrt{\frac{4\pi}{\kappa}}\,e^{\frac{2KD\sqrt d}{\sqrt\kappa} + \frac{K^2D^2}{2\kappa}}. && \text{(D.62)} \end{aligned}\]
Putting together (D.55), (D.57), and (D.62), and using Lemma D.2.4 to bound \(\|\bar x\|\),
\[ \chi^2(p(x)||p(x-\mu)) \le e^{\frac12\kappa D^2 + KD\sqrt{\frac d\kappa}\left(\sqrt{\ln(\frac K\kappa)}+5\right)}\left(e^{KD\sqrt{\frac d\kappa}} + KD\sqrt{\frac{4\pi}{\kappa}}\,e^{\frac{2KD\sqrt d}{\sqrt\kappa} + \frac{K^2D^2}{2\kappa}}\right) - 1. \tag{D.63} \]
Lemma D.2.7 (\(\chi^2\)-divergence between temperatures). Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a \(\kappa\)-strongly convex and \(K\)-smooth function, and let \(P, P_\beta\) be probability measures with density functions \(p(x) \propto e^{-f(x)}\), \(p_\beta(x) \propto e^{-\beta f(x)}\). Suppose \(\beta_1, \beta_2 > 0\) and \(\left|1 - \frac{\beta_1}{\beta_2}\right| < \frac\kappa K\). Then
\[ \chi^2(P_{\beta_2}||P_{\beta_1}) \le e^{\frac12\left|1-\frac{\beta_1}{\beta_2}\right|\frac{Kd}{\kappa - K\left|1-\frac{\beta_1}{\beta_2}\right|}\left(\sqrt{\ln(\frac K\kappa)}+5\right)^2}\left(\left(1 - \frac K\kappa\left|1-\frac{\beta_1}{\beta_2}\right|\right)\left(1 + \left|1-\frac{\beta_1}{\beta_2}\right|\right)\right)^{-\frac d2} - 1. \tag{D.64} \]
Proof. Without loss of generality, suppose \(f\) attains its minimum at 0 (equivalently, \(\nabla f(0) = 0\)), and \(f(0) = 0\). We bound
\[ \chi^2(P_{\beta_2}||P_{\beta_1}) + 1 = \frac{\int_{\mathbb{R}^d} e^{-\beta_1 f(x)}\,dx\,\int_{\mathbb{R}^d} e^{(\beta_1-2\beta_2)f(x)}\,dx}{\left(\int_{\mathbb{R}^d} e^{-\beta_2 f(x)}\,dx\right)^2}. \tag{D.65} \]
Let \(\bar x = \mathbb{E}_{x\sim P_{\beta_2}}x\) be the average value of \(x\) under \(p_{\beta_2}\). Note that because \(f\) is \(\kappa\)-strongly convex, \(e^{-\beta_2 f(x)}\) is the product of \(\gamma(x)\) with a log-concave function, where \(\gamma(x)\) is the density of \(N(0, \frac{1}{\beta_2\kappa}I_d)\). Applying Harge's Theorem D.2.2 to \(g_1(x) = e^{(\beta_2-\beta_1)f(x+\bar x)}\) and \(g_2(x) = e^{(\beta_1-\beta_2)f(x+\bar x)}\), we get
\[ \text{(D.65)} \le \int_{\mathbb{R}^d} e^{(\beta_2-\beta_1)f(x+\bar x)}\gamma(x)\,dx \cdot \int_{\mathbb{R}^d} e^{(\beta_1-\beta_2)f(x+\bar x)}\gamma(x)\,dx. \tag{D.66} \]
Because \(f\) is \(\kappa\)-strongly convex and \(K\)-smooth, and \(f(0) = 0\) is the minimum of \(f\),
\[ \text{(D.66)} \le \frac{\int_{\mathbb{R}^d} e^{|\beta_2-\beta_1|\frac K2\|x+\bar x\|^2}e^{-\frac{\beta_2\kappa}{2}\|x\|^2}\,dx\,\int_{\mathbb{R}^d} e^{-|\beta_2-\beta_1|\frac\kappa2\|x+\bar x\|^2}e^{-\frac{\beta_2\kappa}{2}\|x\|^2}\,dx}{\left(\int_{\mathbb{R}^d} e^{-\frac{\beta_2\kappa}{2}\|x\|^2}\,dx\right)^2}. \tag{D.67} \]
Using the identity
\[ a\|x+\bar x\|^2 + b\|x\|^2 = (a+b)\left\|x + \frac{a}{a+b}\bar x\right\|^2 + \frac{ab}{a+b}\|\bar x\|^2, \tag{D.68--69} \]
we get, using Lemma D.2.4 (\(\cdots\) denotes quantities not involving \(x\) that we will not need),
\[\begin{aligned} \text{(D.67)} &= \frac{1}{\left(\int_{\mathbb{R}^d} e^{-\frac{\beta_2\kappa}{2}\|x\|^2}\,dx\right)^2}\, e^{\frac{K\kappa|\beta_2-\beta_1|\beta_2}{2(\kappa\beta_2 - K|\beta_2-\beta_1|)}\|\bar x\|^2}\int_{\mathbb{R}^d} e^{\left(\frac K2|\beta_2-\beta_1| - \frac\kappa2\beta_2\right)\|x+\cdots\|^2}\,dx && \text{(D.70)} \\ &\qquad\cdot\, e^{-\frac{\kappa|\beta_1-\beta_2|\beta_2}{2\kappa(\beta_2 - |\beta_2-\beta_1|)}\|\bar x\|^2}\int_{\mathbb{R}^d} e^{\left(-\frac\kappa2\beta_2 - \frac{|\beta_2-\beta_1|\kappa}{2}\right)\|x+\cdots\|^2}\,dx && \text{(D.71)} \\ &\le e^{\frac{|\beta_2-\beta_1|}{2}\frac{K\kappa\beta_2}{\kappa\beta_2 - K|\beta_2-\beta_1|}\|\bar x\|^2}\left(\frac{2\pi}{\kappa\beta_2 - K|\beta_2-\beta_1|}\right)^{\frac d2}\left(\frac{2\pi}{\kappa(\beta_2 + |\beta_2-\beta_1|)}\right)^{\frac d2}\left(\frac{2\pi}{\kappa\beta_2}\right)^{-d} && \text{(D.72)} \\ &\le e^{\frac12\left|1-\frac{\beta_1}{\beta_2}\right|\frac{Kd}{\kappa - K\left|1-\frac{\beta_1}{\beta_2}\right|}\left(\sqrt{\ln(\frac K\kappa)}+5\right)^2}\left(\left(1 - \frac K\kappa\left|1-\frac{\beta_1}{\beta_2}\right|\right)\left(1 + \left|1-\frac{\beta_1}{\beta_2}\right|\right)\right)^{-\frac d2}. && \text{(D.73)} \end{aligned}\]
Lemma D.2.8 (\(\chi^2\) divergence between gaussian and log-concave distribution). Suppose that the probability measure \(P\) has density function \(p(x) \propto e^{-f(x-\mu)}\), where \(f\) is \(\kappa\)-strongly convex, \(K\)-smooth, and attains its minimum at 0. Let \(D = \|\mu\|\). Then
\[ \chi^2\left(N\left(0, \frac1K I_d\right)\Big\|P\right) \le \left(\frac K\kappa\right)^{\frac d2} e^{KD^2}. \tag{D.74} \]
Proof. We calculate
\[ p(x) = \frac{e^{-f(x-\mu)}}{\int_{\mathbb{R}^d} e^{-f(u-\mu)}\,du} \ge \frac{e^{-\frac K2\|x-\mu\|^2}}{\int_{\mathbb{R}^d} e^{-\frac\kappa2\|u-\mu\|^2}\,du} = \left(\frac{\kappa}{2\pi}\right)^{\frac d2} e^{-\frac K2\|x-\mu\|^2}. \tag{D.75} \]
Then
\[\begin{aligned} \chi^2\left(N\left(0, \frac1K I_d\right)\Big\|P\right) &= \int_{\mathbb{R}^d} \frac{\left(\frac{K}{2\pi}\right)^d e^{-K\|x\|^2}}{p(x)}\,dx - 1 && \text{(D.76)} \\ &\le \left(\frac{K}{2\pi}\right)^d\left(\frac{2\pi}{\kappa}\right)^{\frac d2}\int_{\mathbb{R}^d} e^{-K\left(\|x\|^2 - \frac12\|x-\mu\|^2\right)}\,dx && \text{(D.77)} \\ &= \left(\frac K\kappa\right)^{\frac d2}\left(\frac{K}{2\pi}\right)^{\frac d2}\int_{\mathbb{R}^d} e^{-\frac K2\|x+\mu\|^2 + K\|\mu\|^2}\,dx && \text{(D.78)} \\ &\le \left(\frac K\kappa\right)^{\frac d2} e^{KD^2}. && \text{(D.79)} \end{aligned}\]
D.3 A probability ratio calculation
Lemma D.3.1. Suppose that \(f(x) = -\ln\left[\sum_{i=1}^n w_ie^{-\frac{\|x-\mu_i\|^2}{2}}\right]\), \(p(x) \propto e^{-f(x)}\), and for \(\alpha \ge 0\) let \(p_\alpha(x) \propto e^{-\alpha f(x)}\), \(Z_\alpha = \int_{\mathbb{R}^d} e^{-\alpha f(x)}\,dx\). Suppose that \(\|\mu_i\| \le D\) for all \(i\). If \(\alpha < \beta\), then
\[ \left[\int_A \min\{p_\alpha(x), p_\beta(x)\}\,dx\right]\Big/p_\beta(A) \ge \min_x \frac{p_\alpha(x)}{p_\beta(x)} \ge \frac{Z_\beta}{Z_\alpha} \tag{D.80} \]
\[ \frac{Z_\beta}{Z_\alpha} \in \left[\frac12 e^{-2(\beta-\alpha)\left(D + \frac{1}{\sqrt\alpha}\left(\sqrt d + 2\sqrt{\ln\left(\frac{2}{w_{\min}}\right)}\right)\right)^2},\; 1\right]. \tag{D.81} \]
Choosing \(\beta - \alpha = O\left(\frac{1}{D^2 + \frac d\alpha + \frac1\alpha\ln\left(\frac{1}{w_{\min}}\right)}\right)\), this quantity is \(\Omega(1)\).
This is a special case of the following more general lemma.
Lemma D.3.2. Suppose that \(f(x) = -\ln\left[\sum_{i=1}^n w_ie^{-f_i(x)}\right]\), where \(f_i(x) = f_0(x-\mu_i)\) and \(f_0\) is \(\kappa\)-strongly convex and \(K\)-smooth. Let \(P\), \(P_\alpha\) (for \(\alpha > 0\)) be probability measures with densities \(p(x) \propto e^{-f(x)}\) and \(p_\alpha(x) \propto e^{-\alpha f(x)}\). Let \(Z_\alpha = \int_{\mathbb{R}^d} e^{-\alpha f(x)}\,dx\). Suppose that \(\|\mu_i\| \le D\) for all \(i\).
Let \(C = D + \frac{1}{\sqrt{\alpha\kappa}}\left(\sqrt d + \sqrt{d\ln\left(\frac K\kappa\right) + 2\ln\left(\frac{2}{w_{\min}}\right)}\right)\). If \(\alpha < \beta\), then
\[ \left[\int_A \min\{p_\alpha(x), p_\beta(x)\}\,dx\right]\Big/p_\beta(A) \ge \min_x \frac{p_\alpha(x)}{p_\beta(x)} \ge \frac{Z_\beta}{Z_\alpha} \tag{D.82} \]
\[ \frac{Z_\beta}{Z_\alpha} \in \left[\frac12 e^{-\frac12(\beta-\alpha)KC^2},\; 1\right]. \tag{D.83} \]
If \(\beta - \alpha = O\left(\frac{1}{K\left(D^2 + \frac{d}{\alpha\kappa}\left(1+\ln\left(\frac K\kappa\right)\right) + \frac{1}{\alpha\kappa}\ln\left(\frac{1}{w_{\min}}\right)\right)}\right)\), then this quantity is \(\Omega(1)\).
Proof. Let \(\widetilde P_\alpha\) be the probability measure with density function \(\widetilde p_\alpha(x) \propto \sum_{i=1}^n w_ie^{-\alpha f_0(x-\mu_i)}\), and let \(\widetilde P_{\alpha,i}\) denote its \(i\)th component, with density \(\propto e^{-\alpha f_0(x-\mu_i)}\). By Lemma 2.7.3 and Lemma D.2.5, since \(\alpha f_0\) is \(\alpha\kappa\)-strongly convex,
\[\begin{aligned} \mathbb{P}_{x\sim P_\alpha}(\|x\| \ge C) &\le \frac{1}{w_{\min}}\mathbb{P}_{x\sim\widetilde P_\alpha}(\|x\| \ge C) && \text{(D.84)} \\ &\le \frac{1}{w_{\min}}\sum_{i=1}^n w_i\,\mathbb{P}_{x\sim\widetilde P_{\alpha,i}}(\|x\| \ge C) && \text{(D.85)} \\ &\le \frac{1}{w_{\min}}\sum_{i=1}^n w_i\,\mathbb{P}_{x\sim\widetilde P_{\alpha,i}}\left(\|x-\mu_i\|^2 \ge (C-D)^2\right) && \text{(D.86)} \\ &= \frac{1}{w_{\min}}\sum_{i=1}^n w_i\,\mathbb{P}_{x\sim\widetilde P_{\alpha,i}}\left(\|x-\mu_i\|^2 \ge \frac{1}{\alpha\kappa}\left(\sqrt d + \sqrt{d\ln\left(\frac K\kappa\right) + 2\ln\left(\frac{2}{w_{\min}}\right)}\right)^2\right) && \text{(D.87)} \\ &\le \frac{1}{w_{\min}}\cdot\frac{w_{\min}}{2} = \frac12. && \text{(D.88)} \end{aligned}\]
Thus, using \(f(x) \ge 0\),
\[\begin{aligned} \left[\int_A \min\{p_\alpha(x), p_\beta(x)\}\,dx\right]\Big/p_\beta(A) &\ge \int_A \min\left\{\frac{p_\alpha(x)}{p_\beta(x)}, 1\right\}p_\beta(x)\,dx\Big/p_\beta(A) && \text{(D.89)} \\ &\ge \int_A \min\left\{\frac{Z_\beta}{Z_\alpha}e^{(\beta-\alpha)f(x)}, 1\right\}p_\beta(x)\,dx\Big/p_\beta(A) && \text{(D.90)} \\ &\ge \frac{Z_\beta}{Z_\alpha}. && \text{(D.91)} \end{aligned}\]
Moreover,
\[ \frac{Z_\beta}{Z_\alpha} = \frac{\int e^{-\beta f(x)}\,dx}{\int e^{-\alpha f(x)}\,dx} = \int_{\mathbb{R}^d} e^{(-\beta+\alpha)f(x)}p_\alpha(x)\,dx \ge \int_{\|x\|\le C} e^{(-\beta+\alpha)f(x)}p_\alpha(x)\,dx \ge \frac12 e^{-(\beta-\alpha)\max_{\|x\|\le C}f(x)} \ge \frac12 e^{-\frac12(\beta-\alpha)KC^2}. \tag{D.92--96} \]
D.4 Other facts
Lemma D.4.1. Let \((N_T)_{T\ge0}\) be a Poisson process with rate \(\lambda\). Then there is a constant \(C\) such that
\[ \mathbb{P}(N_T \ge n) \le \left(\frac{Cn}{T\lambda}\right)^{-n}. \tag{D.97} \]
Proof. Assume \(n > T\lambda\). We have by Stirling's formula
\[\begin{aligned} \mathbb{P}(N_T \ge n) &= e^{-\lambda T}\sum_{m=n}^\infty \frac{(\lambda T)^m}{m!} && \text{(D.98)} \\ &\le e^{-\lambda T}\,\frac{1}{n!}\,\frac{1}{1 - \frac{\lambda T}{n}}\,(\lambda T)^n && \text{(D.99)} \\ &= e^{-\lambda T}\,O\left(n^{-\frac12}\left(\frac{e\lambda T}{n}\right)^n\right) && \text{(D.100)} \\ &\le \left(\frac{Cn}{\lambda T}\right)^{-n} && \text{(D.101)} \end{aligned}\]
for some \(C\), since \(e^{-\lambda T} \ge e^{-n}\).
Lemma D.4.2. Suppose that \(X_t\) is a sequence of random variables in \(\mathbb{R}^d\) such that for each \(t\), \(\|X_t - \mathbb{E}[X_t|X_{1:t-1}]\|_\infty \le M\) (with probability 1). Let \(S_T = \sum_{t=1}^T \mathbb{E}[X_t|X_{1:t-1}]\) (a random variable depending on \(X_{1:T}\)). Then
\[ \mathbb{P}\left(\left\|\sum_{t=1}^T X_t - S_T\right\|_2 \ge c\right) \le 2d\,e^{-\frac{c^2}{2TM^2d}}. \tag{D.102} \]
Proof. By Azuma's inequality, for each \(1 \le j \le d\),
\[ \mathbb{P}\left(\left|\sum_{t=1}^T (X_t)_j - (S_T)_j\right| \ge c\right) \le 2e^{-\frac{c^2}{2TM^2}}. \tag{D.103} \]
By a union bound,
\[ \mathbb{P}\left(\left\|\sum_{t=1}^T X_t - S_T\right\|_2 \ge c\right) \le \sum_{j=1}^d \mathbb{P}\left(\left|\sum_{t=1}^T (X_t)_j - (S_T)_j\right| \ge \frac{c}{\sqrt d}\right) \le 2d\,e^{-\frac{c^2}{2TM^2d}}. \tag{D.104} \]
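A Monte Carlo sketch (stdlib only; names, parameters, and the \(\pm M\) coordinate distribution are ours) checking the vector bound (D.102):

```python
import math
import random

def empirical_tail(T, d, M, c, trials=5000, seed=0):
    # X_t has independent coordinates equal to +-M with probability 1/2, so
    # E[X_t | X_{1:t-1}] = 0 and the martingale-difference bound holds with M
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        coords = [0.0] * d
        for _ in range(T):
            for j in range(d):
                coords[j] += M if rng.random() < 0.5 else -M
        if math.sqrt(sum(v * v for v in coords)) >= c:
            hits += 1
    return hits / trials

def azuma_union_bound(T, d, M, c):
    # 2 d exp(-c^2 / (2 T M^2 d)), the right-hand side of (D.102)
    return 2.0 * d * math.exp(-c * c / (2.0 * T * M * M * d))
```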
Lemma D.4.3. Suppose that \(\pi\) is a distribution with \(\mathbb{P}_{\theta\sim\pi}(\|\theta-\theta_0\| \ge \gamma) \le Ae^{-k\gamma}\) for some \(\theta_0\). Then
\[ \mathbb{E}_{\theta\sim\pi}\left[\|\theta-\theta_0\|^2\right] \le \left(2 + \frac1k\right)\log\left(\frac{A}{k^2}\right). \]
Proof. Without loss of generality, \(\theta_0 = 0\). Then
\[\begin{aligned} \mathbb{E}_{\theta\sim\pi}[\|\theta\|^2] &= \int_0^\infty 2\gamma\,\mathbb{P}_{\theta\sim\pi}(\|\theta\| \ge \gamma)\,d\gamma && \text{(D.105)} \\ &\le \gamma_0 + \int_{\gamma_0}^\infty 2\gamma\,\mathbb{P}_{\theta\sim\pi}(\|\theta\| \ge \gamma)\,d\gamma && \text{(D.106)} \\ &\le \gamma_0 + \int_{\gamma_0}^\infty 2\gamma Ae^{-k\gamma}\,d\gamma \quad \text{by assumption} && \text{(D.107)} \\ &= \gamma_0 + A\left(\left.-\frac{2\gamma}{k}e^{-k\gamma}\right|_{\gamma_0}^\infty - \int_{\gamma_0}^\infty -\frac2k e^{-k\gamma}\,d\gamma\right) \quad \text{integration by parts} && \text{(D.108)} \\ &= \gamma_0 + A\left(\frac{2\gamma_0}{k}e^{-k\gamma_0} + \frac{2}{k^2}e^{-k\gamma_0}\right). && \text{(D.109)} \end{aligned}\]
Set \(\gamma_0 = \frac{\log\left(\frac{A}{k^2}\right)}{k}\). Then this is \(\le \left(2 + \frac1k\right)\log\left(\frac{A}{k^2}\right)\), as desired.
Nomenclature
𝒟(L ) Domain of L
E Dirichlet form
ℒ(𝑋) Distribution of random variable 𝑋
L Generator of Markov process
\(\widetilde O_T\) Subscript \(T\) means that only the dependence on \(T\) is shown
Ω State space
Π(𝜇, 𝜈) Set of all possible couplings of random vectors \((X, Y)\) with marginals \(X \sim \mu\) and \(Y \sim \nu\)
𝑃𝑡 Family of kernels defining Markov process
P𝑡 P𝑡𝑔(𝑥) = E𝑦∼𝑃𝑡(𝑥,·)𝑔(𝑦) =∫Ω 𝑔(𝑦)𝑃𝑡(𝑥, 𝑑𝑦)
Var𝑃 Variance with respect to 𝑃
𝛽 Inverse temperature
𝑑 Dimension
𝜂 Step size
𝜑 Logistic function \(\varphi(x) = \frac{1}{1+e^{-x}}\)
𝜉 Gaussian noise 𝜉 ∼ 𝑁(0, 𝐼)
𝐶 Poincare constant for projected Markov process (Chapter 2)
𝐷 Bound on centers ‖𝜇𝑖‖ ≤ 𝐷 (Chapter 2)
∆ Perturbation of \(f\): \(\|\tilde f - f\|_\infty \le \Delta\) (Chapter 2)
E \(\mathcal{E}(g, g) = \sum_{i\in I}\mathcal{E}_i(g, g)\) (Chapter 2)
E↔ \(\mathcal{E}^\leftrightarrow(g, g) = -\sum_{i,j\in I} w_i\langle g, (T_{i,j} - \operatorname{Id})g\rangle_{P_i}\) (Chapter 2)
𝐾 Smoothness of 𝑓0 (Chapter 2)
𝐿 Number of temperatures (Chapter 2)
L Generator of projected Markov process (Chapter 2)
𝑀 Projected Markov process (Chapter 2)
𝑀st Simulated tempering chain (Chapter 2)
𝑃 Stationary measure of projected Markov process (Chapter 2)
\(\widetilde P_{i,j}\) \(\widetilde P_{i,j}(dx) = \int_{y\in\Omega_j}\widetilde Q_{i,j}(dx, dy)\) (Chapter 2)
𝑄𝑖,𝑗 Stationary distribution of \(P_i\), times transition kernel: \(P_i(dx)T_{i,j}(x, dy)\) (Chapter 2)
𝑄𝑗,𝑘 Minimum of probability measures \(P_j, P_k\): \(\min\left\{\frac{dP_k}{dP_j}, 1\right\}P_j\) (Chapter 2)
\(\widetilde Q_{i,j}\) Normalized version of \(Q_{i,j}\) (Chapter 2)
𝑆 Subset of 𝐼 × 𝐼 chosen to contain pairs (𝑖, 𝑗) such that 𝑃𝑖, 𝑃𝑗 are close. (Chap-
ter 2)
𝑆↔ Subset of 𝐼×𝐼 chosen to contain pairs (𝑖, 𝑗) such that there is a lot of probability
flow between 𝑃𝑖 and 𝑃𝑗. (Chapter 2)
𝑇 Transition probabilities of projected Markov process (Chapter 2)
𝑇𝑖,𝑗 Transition between different components, in the general density decomposition
theorem (Chapter 2)
𝑍𝑖 Partition function of \(i\)th temperature, \(\int_{\mathbb{R}^d} e^{-\beta_i f(x)}\,dx\) (Chapter 2)
\(\widehat Z_i\) Estimate of partition function \(Z_i\) (Chapter 2)
𝛿𝑗,𝑘 Overlap between measures \(P_j, P_k\): \(\int_\Omega \min\left\{\frac{dP_k}{dP_j}, 1\right\}P_j(dx)\) (Chapter 2)
𝑓0 Base function (Chapter 2)
𝑓𝑖 Translate of base function 𝑓𝑖(𝑥) = 𝑓0(𝑥− 𝜇𝑖) (Chapter 2)
𝜅 Strong convexity of 𝑓0 (Chapter 2)
𝜆 Rate of simulated tempering (Chapter 2)
𝜇𝑖 Center of 𝑖th component (Chapter 2)
𝑟𝑖 Relative probabilities (Chapter 2)
𝜏 Bound on gradient and Hessian perturbations: \(\|\nabla\tilde f - \nabla f\|_\infty \le \tau\), \(\|\nabla^2\tilde f - \nabla^2 f\|_\infty \le \tau\) (Chapter 2)
𝑤min Minimum weight 𝑤min = min𝑖𝑤𝑖 (Chapter 2)
𝑤𝑖 Weight of 𝑖th component (Chapter 2)
𝐴 Exponential concentration in Assumptions 3.2.2 and 3.2.5: \(\mathbb{P}_{X\sim\pi_t}\left(\|X - x^\star_t\| \ge \frac{\gamma}{\sqrt{t+c}}\right) \le Ae^{-k\gamma}\) (Chapter 3)
B Bound on ‖𝜃0‖ (Chapter 3)
𝐶 Bound on second moment from Assumptions 3.2.2 and 3.2.5: \(m_2^{\frac12} := \left(\mathbb{E}_{x\sim\pi_t}\|x - x^\star_t\|_2^2\right)^{\frac12} \le \frac{C}{\sqrt{t+c}}\) for \(C = \left(2+\frac1k\right)\log\left(\frac{A}{k^2}\right)\) (Chapter 3)
𝐶 ′ Acceptance radius \(C' = 2.5(C_1 + \mathcal{D})\) (Chapter 3)
D Drift parameter in Assumption 3.2.3 (Chapter 3)
𝐹𝑡 \(F_t = \sum_{k=0}^t f_k\) (Chapter 3)
𝐺𝛽 \(G_\beta = \left\{\forall i,\ \|X^\beta_i - x^\star\| \le \frac{R}{\sqrt{\beta T}}\right\}\) (Chapter 3)
𝐺𝑡 \(G_t = \left\{\forall s \le t,\ \forall 0 \le i \le i_s,\ \|X^s_i - x^\star_s\| \le \frac{R}{\sqrt{s + L_0/L}}\right\}\) (Chapter 3)
𝐻𝑡 \(H_t = \left\{\forall s \le t \text{ s.t. } s \text{ is a power of 2 or } s = 0,\ \|X^s - x^\star_s\| \le \frac{C_1}{\sqrt{s + L_0/L}}\right\}\) (Chapter 3)
𝐿 Smoothness (Lipschitz constant of gradient) in Assumption 3.2.1 (Chapter 3)
𝐿0 Smoothness (Lipschitz constant of gradient) of 𝑓0 in Assumption 3.2.1 (Chapter
3)
𝑀 Bound on input vectors to logistic regression (Chapter 3)
𝑃𝑢 Distribution of input vectors (Chapter 3)
𝑋 𝑡 sample returned at epoch 𝑡, 𝑋 𝑡 = 𝑋 𝑡𝑖𝑡 (Chapter 3)
𝑋 𝑡𝑖 𝑖th iterate at epoch 𝑡 (Chapter 3)
𝑏 batch size (Chapter 3)
𝑐 Offset in Assumptions 3.2.2 and 3.2.3 (Chapter 3)
𝜂0 initial step size (Chapter 3)
𝑔𝑖 Stochastic gradient at step 𝑖 (Chapter 3)
𝑖max number of steps (Chapter 3)
𝑘 Exponential concentration in Assumptions 3.2.2 and 3.2.5: \(\mathbb{P}_{X\sim\pi_t}\left(\|X - x^\star_t\| \ge \frac{\gamma}{\sqrt{t+c}}\right) \le Ae^{-k\gamma}\) (Chapter 3)
𝜋𝛽𝑇 Distribution at inverse temperature \(\beta\): \(\pi^\beta_T(x) \propto e^{-\beta\sum_{t=1}^T f_t(x)}\) (Chapter 3)
𝜋𝑡 Distribution at epoch \(t\): \(\pi_t \propto e^{-\sum_{k=0}^t f_k}\) (Chapter 3)
𝜃0 True value of parameter (Chapter 3)
𝑥⋆𝑡 minimizer of 𝐹𝑡 (Chapter 3)