
Sparse Variational Inference: Bayesian Coresets from Scratch

Trevor Campbell
Department of Statistics
University of British Columbia
Vancouver, BC V6T 1Z4
[email protected]

Boyan Beronov
Department of Computer Science
University of British Columbia
Vancouver, BC V6T 1Z4
[email protected]

Abstract

The proliferation of automated inference algorithms in Bayesian statistics has provided practitioners newfound access to fast, reproducible data analysis and powerful statistical models. Designing automated methods that are also both computationally scalable and theoretically sound, however, remains a significant challenge. Recent work on Bayesian coresets takes the approach of compressing the dataset before running a standard inference algorithm, providing both scalability and guarantees on posterior approximation error. But the automation of past coreset methods is limited because they depend on the availability of a reasonable coarse posterior approximation, which is difficult to specify in practice. In the present work we remove this requirement by formulating coreset construction as sparsity-constrained variational inference within an exponential family. This perspective leads to a novel construction via greedy optimization, and also provides a unifying information-geometric view of present and past methods. The proposed Riemannian coreset construction algorithm is fully automated, requiring no problem-specific inputs aside from the probabilistic model and dataset. In addition to being significantly easier to use than past methods, experiments demonstrate that past coreset constructions are fundamentally limited by the fixed coarse posterior approximation; in contrast, the proposed algorithm is able to continually improve the coreset, providing state-of-the-art Bayesian dataset summarization with orders-of-magnitude reduction in KL divergence to the exact posterior.

1 Introduction

Bayesian statistical models are powerful tools for learning from data, with the ability to encode complex hierarchical dependence and domain expertise, as well as coherently quantify uncertainty in latent parameters. In practice, however, exact Bayesian inference is typically intractable, and we must use approximate inference algorithms such as Markov chain Monte Carlo (MCMC) [1; 2, Ch. 11, 12] and variational inference (VI) [3, 4]. Until recently, implementations of these methods were created on a per-model basis, requiring expert input to design the MCMC transition kernels or derive VI gradient updates. But developments in automated tools—e.g., automatic differentiation [5, 6], "black-box" gradient estimates [7], and Hamiltonian transition kernels [8, 9]—have obviated much of this expert input, greatly expanding the repertoire of Bayesian models accessible to practitioners.

In modern data analysis problems, automation alone is insufficient; inference algorithms must also be computationally scalable—to handle the ever-growing size of datasets—and provide theoretical guarantees on the quality of their output such that statistical practitioners may confidently use them in failure-sensitive settings. Here the standard set of tools falls short. Designing correct MCMC schemes in the large-scale data setting is a challenging, problem-specific task [10–12]; and despite recent results in asymptotic theory [13–16], it is difficult to assess the effect of the variational family on VI approximations for finite data, where a poor choice can result in severe underestimation of posterior uncertainty [17, Ch. 21]. Other scalable Bayesian inference algorithms have largely been developed by modifying standard inference algorithms to handle distributed or streaming data processing [10, 11, 18–29], which tend to have no guarantees on inferential quality and require extensive model-specific expert tuning.

Bayesian coresets ("core of a dataset") [30–32] are an alternative approach—based on the notion that large datasets often contain a significant fraction of redundant data—that summarize and sparsify the data as a preprocessing step before running a standard inference algorithm such as MCMC or VI. In contrast to other large-scale inference techniques, Bayesian coreset construction is computationally inexpensive, simple to implement, and provides theoretical guarantees relating coreset size to posterior approximation quality. However, state-of-the-art algorithms formulate coreset construction as a sparse regression problem in a Hilbert space, which involves the choice of a weighted L2 inner product [31]. If left to the user, the choice of weighting distribution significantly reduces the overall automation of the approach; and current methods for finding the weighting distribution programmatically are generally as expensive as posterior inference on the full dataset itself. Further, even if an appropriate inner product is specified, computing it exactly is typically intractable, requiring the use of finite-dimensional projections for approximation [31]. Although the problem in finite dimensions can be studied using well-known techniques from sparse regression, compressed sensing, random sketching, boosting, and greedy approximation [33–51], these projections incur an unknown error in the construction process in practice, and preclude asymptotic consistency as the coreset size grows.

In this work, we provide a new formulation of coreset construction as exponential family variational inference with a sparsity constraint. The fact that coresets form a sparse subset of an exponential family is crucial in two regards. First, it enables tractable unbiased Kullback-Leibler (KL) divergence gradient estimation, which is used in the development of a novel coreset construction algorithm based on greedy optimization. In contrast to past work, this algorithm is fully automated, with no problem-specific inputs aside from the probabilistic model and dataset. Second, it provides a unifying view and strong theoretical underpinnings of both the present and past coreset constructions through Riemannian information geometry. In particular, past methods are shown to operate in a single tangent space of the coreset manifold; our experiments show that this fundamentally limits the quality of the coreset constructed with these methods. In contrast, the proposed method proceeds along the manifold towards the posterior target, and is able to continually improve its approximation. Furthermore, new relationships between the optimization objective of past approaches and the coreset posterior KL divergence are derived. The paper concludes with experiments demonstrating that, compared with past methods, Riemannian coreset construction is both easier to use and provides orders-of-magnitude reduction in KL divergence to the exact posterior.

2 Background

In the problem setting of the present paper, we are given a probability density π(θ) for variables θ ∈ Θ that decomposes into N potentials $(f_n(\theta))_{n=1}^N$ and a base density $\pi_0(\theta)$,

$$\pi(\theta) := \frac{1}{Z}\exp\left(\sum_{n=1}^N f_n(\theta)\right)\pi_0(\theta), \qquad (1)$$

where Z is the (unknown) normalization constant. Such distributions arise frequently in a number of scenarios: for example, in Bayesian statistical inference problems with conditionally independent data given θ, the functions $f_n$ are the log-likelihood terms for the N data points, $\pi_0$ is the prior density, and π is the posterior; or in undirected graphical models, the functions $f_n$ and $\log\pi_0$ might represent N + 1 potentials. The algorithms and analysis in the present work are agnostic to their particular meaning, but for clarity we will focus on the setting of Bayesian inference throughout.
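To make the decomposition concrete, the following minimal sketch (an illustration added here, not code from the paper) writes the potentials and the unnormalized log-density of Eq. (1) for the Gaussian-mean model used later in Section 5.1; all function names are ours.

```python
import numpy as np

def make_potentials(X):
    """Return f(theta): the N-vector of log-likelihood potentials f_n(theta) = log N(x_n | theta, I)."""
    def f(theta):
        d = X.shape[1]
        sq = np.sum((X - theta) ** 2, axis=1)          # squared distances to each data point
        return -0.5 * sq - 0.5 * d * np.log(2 * np.pi)
    return f

def log_joint_unnormalized(theta, f, mu0=None):
    """Unnormalized log pi(theta) = sum_n f_n(theta) + log pi_0(theta), up to an additive constant
    (standard normal prior assumed for this sketch)."""
    mu0 = np.zeros_like(theta) if mu0 is None else mu0
    log_prior = -0.5 * np.sum((theta - mu0) ** 2)
    return np.sum(f(theta)) + log_prior

# Usage: X = np.random.randn(1000, 2); f = make_potentials(X); log_joint_unnormalized(np.zeros(2), f)
```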

As it is often intractable to compute expectations under π exactly, practitioners have turned to approximate algorithms. Markov chain Monte Carlo (MCMC) methods [1, 8, 9], which return approximate samples from π, remain the gold standard for this purpose. But since each sample typically requires at least one evaluation of a function proportional to π with computational cost Θ(N), in the large-N setting it is expensive to obtain sufficiently many samples to provide high confidence in empirical estimates. To reduce the cost of MCMC, we can instead run it on a small, weighted subset of data known as a Bayesian coreset [30], a concept originating from the computational geometry and optimization literature [52–57]. Let $w \in \mathbb{R}^N_{\geq 0}$ be a sparse vector of nonnegative weights such that only $M \ll N$ are nonzero, i.e. $\|w\|_0 := \sum_{n=1}^N \mathbb{1}[w_n > 0] \leq M$. Then we approximate the full log-density with a w-reweighted sum with normalization $Z(w) > 0$ and run MCMC on the approximation¹,

$$\pi_w(\theta) := \frac{1}{Z(w)}\exp\left(\sum_{n=1}^N w_n f_n(\theta)\right)\pi_0(\theta), \qquad (2)$$

where $\pi_1 = \pi$ corresponds to the full density. If $M \ll N$, evaluating a function proportional to $\pi_w$ is much less expensive than doing so for the original π, resulting in a significant reduction in MCMC computation time. The major challenge posed by this approach, then, is to find a set of weights w that renders $\pi_w$ as close as possible to π while maintaining sparsity. Past work [31, 32] formulated this as a sparse regression problem in a Hilbert space with the $L^2(\hat\pi)$ norm for some weighting distribution $\hat\pi$ and vectors² $g_n := f_n - \mathbb{E}_{\hat\pi}[f_n]$,

$$w^\star = \arg\min_{w \in \mathbb{R}^N}\; \mathbb{E}_{\hat\pi}\left[\left(\sum_{n=1}^N g_n(\theta) - \sum_{n=1}^N w_n g_n(\theta)\right)^{\!2}\right] \quad \text{s.t.} \quad w \geq 0,\ \|w\|_0 \leq M. \qquad (3)$$

As the expectation is generally intractable to compute exactly, a Monte Carlo approximation is used in its place: taking samples $(\theta_s)_{s=1}^S \overset{\text{i.i.d.}}{\sim} \hat\pi$ and setting $\hat g_n = \frac{1}{\sqrt{S}}\left[g_n(\theta_1) - \bar g_n, \ldots, g_n(\theta_S) - \bar g_n\right]^T \in \mathbb{R}^S$, where $\bar g_n = \frac{1}{S}\sum_{s=1}^S g_n(\theta_s)$, yields a linear finite-dimensional sparse regression problem in $\mathbb{R}^S$,

$$w^\star = \arg\min_{w \in \mathbb{R}^N}\; \left\|\sum_{n=1}^N \hat g_n - \sum_{n=1}^N w_n \hat g_n\right\|_2^2 \quad \text{s.t.} \quad w \geq 0,\ \|w\|_0 \leq M, \qquad (4)$$

which can be solved with sparse optimization techniques [31, 32, 34–36, 45, 46, 48, 58–60]. However, there are two drawbacks inherent to the Hilbert space formulation. First, the use of the $L^2(\hat\pi)$ norm requires the selection of the weighting function $\hat\pi$, posing a barrier to the full automation of coreset construction. There is currently no guidance in the literature on how to select $\hat\pi$, nor on the effect of different choices. We show in Sections 4 and 5 that using such a fixed weighting $\hat\pi$ fundamentally limits the quality of coreset construction. Second, the inner products typically cannot be computed exactly, requiring a Monte Carlo approximation. This adds noise to the construction and precludes asymptotic consistency (in the sense that $\pi_w \not\to \pi_1$ as the sparsity budget $M \to \infty$). Addressing these drawbacks is the focus of the present work.
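For concreteness, a minimal sketch of the finite-dimensional projection behind Eq. (4) (our illustration, not the reference implementation of [31, 32]): potentials are evaluated at samples from a user-chosen weighting distribution $\hat\pi$, centered, and scaled to form the vectors $\hat g_n$, after which any sparse nonnegative regression routine can be applied.

```python
import numpy as np

def hilbert_projection(f, pi_hat_samples):
    """Build the finite-dimensional vectors g_hat (an N x S matrix) used in Eq. (4).

    f: callable returning the N-vector of potentials f(theta).
    pi_hat_samples: S samples assumed drawn from the weighting distribution pi_hat
                    (choosing pi_hat is exactly the step this paper removes).
    """
    F = np.stack([f(theta) for theta in pi_hat_samples], axis=1)   # N x S matrix of f_n(theta_s)
    G = F - F.mean(axis=1, keepdims=True)                          # center: g_n(theta_s) - gbar_n
    return G / np.sqrt(F.shape[1])                                 # scale by 1/sqrt(S)

def hilbert_objective(G_hat, w):
    """Squared error ||sum_n g_hat_n - sum_n w_n g_hat_n||_2^2 from Eq. (4)."""
    r = G_hat.sum(axis=0) - w @ G_hat
    return float(r @ r)

# Usage sketch: G = hilbert_projection(f, samples); then minimize hilbert_objective over sparse
# nonnegative w, e.g. with Frank-Wolfe [58-60] or greedy iterative geodesic ascent (GIGA) [32].
```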

3 Bayesian coresets from scratch

In this section, we provide a new formulation of Bayesian coreset construction as variational inference over an exponential family with sparse natural parameters, and develop an iterative greedy algorithm for optimization.

3.1 Sparse exponential family variational inference

We formulate coreset construction as a sparse variational inference problem,

$$w^\star = \arg\min_{w \in \mathbb{R}^N}\; \mathrm{D_{KL}}\left(\pi_w \,\|\, \pi_1\right) \quad \text{s.t.} \quad w \geq 0,\ \|w\|_0 \leq M. \qquad (5)$$

Expanding the objective and denoting expectations under $\pi_w$ as $\mathbb{E}_w$,

$$\mathrm{D_{KL}}\left(\pi_w\,\|\,\pi_1\right) = \log Z(1) - \log Z(w) - \sum_{n=1}^N (1 - w_n)\,\mathbb{E}_w\left[f_n(\theta)\right]. \qquad (6)$$

¹Throughout, $[N] := \{1, \ldots, N\}$, $\mathbf{1}$ and $\mathbf{0}$ are the constant vectors of all 1s / 0s respectively (the dimension will be clear from context), $\mathbb{1}_A$ is the indicator vector for $A \subseteq [N]$, and $\mathbb{1}_n$ is the indicator vector for $n \in [N]$.

²In [31], the $\mathbb{E}_{\hat\pi}[f_n]$ term was missing; it is necessary to account for the shift-invariance of potentials.


Eq. (6) illuminates the major challenges with the variational approach posed in Eq. (5). First, the normalization constant Z(w) of $\pi_w$—itself a function of the weights w—is unknown; typically, the form of the approximate distribution is known fully in variational inference. Second, even if the constant were known, computing the objective in Eq. (5) requires taking expectations under $\pi_w$, which is in general just as difficult as the original problem of sampling from the true posterior $\pi_1$.

Two key insights in this work address these issues and lead to both the development of a new coreset construction algorithm (Algorithm 1) and a more comprehensive understanding of the coreset construction literature (Section 4). First, the coresets form a sparse subset of an exponential family: the nonnegative weights form the natural parameter $w \in \mathbb{R}^N_{\geq 0}$, the component potentials $(f_n(\theta))_{n=1}^N$ form the sufficient statistic, $\log Z(w)$ is the log partition function, and $\pi_0$ is the base density,

$$\pi_w(\theta) := \exp\left(w^T f(\theta) - \log Z(w)\right)\pi_0(\theta), \qquad f(\theta) := \left[\,f_1(\theta)\ \cdots\ f_N(\theta)\,\right]^T. \qquad (7)$$

Using the well-known fact that the gradient of an exponential family log-partition function is the mean of the sufficient statistic, $\mathbb{E}_w[f(\theta)] = \nabla_w \log Z(w)$, we can rewrite the optimization Eq. (5) as

$$w^\star = \arg\min_{w \in \mathbb{R}^N}\; \log Z(1) - \log Z(w) - (1 - w)^T \nabla_w \log Z(w) \quad \text{s.t.} \quad w \geq 0,\ \|w\|_0 \leq M. \qquad (8)$$

Taking the gradient of this objective function and noting again that, for an exponential family, the Hessian of the log-partition function $\log Z(w)$ is the covariance of the sufficient statistic,

$$\nabla_w \mathrm{D_{KL}}\left(\pi_w\,\|\,\pi_1\right) = -\nabla^2_w \log Z(w)\,(1 - w) = -\mathrm{Cov}_w\left[f, f^T(1 - w)\right], \qquad (9)$$

where $\mathrm{Cov}_w$ denotes covariance under $\pi_w$. In other words, increasing the weight $w_n$ by a small amount decreases $\mathrm{D_{KL}}(\pi_w\|\pi_1)$ by an amount proportional to the covariance of the nth potential $f_n(\theta)$ with the residual error $\sum_{n=1}^N f_n(\theta) - \sum_{n=1}^N w_n f_n(\theta)$ under $\pi_w$. If required, it is not difficult to use the connection between derivatives of $\log Z(w)$ and moments of the sufficient statistic under $\pi_w$ to derive 2nd and higher order derivatives of $\mathrm{D_{KL}}(\pi_w\|\pi_1)$.

This provides a natural tool for optimizing the coreset construction objective in Eq. (5)—Monte Carlo estimates of sufficient statistic moments—and enables coreset construction without both the problematic selection of a Hilbert space (i.e., $\hat\pi$) and finite-dimensional projection error from past approaches. But obtaining Monte Carlo estimates requires sampling from $\pi_w$; the second key insight in this work is that as long as we build up the sparse approximation w incrementally, the iterates will themselves be sparse. Therefore, using a standard Markov chain Monte Carlo algorithm [9] to obtain samples from $\pi_w$ for gradient estimation is actually not expensive—with cost O(M) instead of O(N)—despite the potentially complicated form of $\pi_w$.
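As a concrete illustration (our sketch, assuming samples from the current coreset posterior $\pi_w$ are supplied by an external MCMC routine such as NUTS [9]), the gradient in Eq. (9) is estimated by an empirical covariance of the potential evaluations:

```python
import numpy as np

def kl_grad_estimate(f, w, samples):
    """Monte Carlo estimate of grad_w KL(pi_w || pi_1) = -Cov_w[f, f^T(1-w)] from Eq. (9).

    f: callable returning the N-vector of potentials f(theta).
    w: current coreset weights, shape (N,).
    samples: S samples assumed drawn (approximately) from pi_w by an external MCMC routine.
    """
    F = np.stack([f(theta) for theta in samples])     # S x N potential evaluations
    G = F - F.mean(axis=0, keepdims=True)              # centered potentials g_s
    # -(1/S) sum_s g_s g_s^T (1 - w): empirical covariance of f with the residual f^T(1 - w)
    return -(G.T @ (G @ (1.0 - w))) / F.shape[0]
```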

3.2 Greedy selection

One option to build up a coreset incrementally is to use a greedy approach (Algorithm 1) to select and subsequently reweight a single potential function at a time. For greedy selection, the naïve approach is to select the potential that provides the largest local decrease in KL divergence around the current weights w, i.e., selecting the potential with the largest covariance with the residual error per Eq. (9). However, since the weight $w_{n^\star}$ will then be optimized over $[0, \infty)$, the selection of the next potential to add should be invariant to scaling each potential $f_n$ by any positive constant. Thus we propose the use of the correlation—rather than the covariance—between $f_n$ and the residual error $f^T(1 - w)$ as the selection criterion:

$$n^\star = \arg\max_{n \in [N]} \begin{cases} \left|\mathrm{Corr}_w\!\left[f_n, f^T(1 - w)\right]\right| & w_n > 0 \\ \mathrm{Corr}_w\!\left[f_n, f^T(1 - w)\right] & w_n = 0. \end{cases} \qquad (10)$$

Although seemingly ad-hoc, this modification will be placed on a solid information-geometric theoretical foundation in Proposition 1 (see also Eq. (34) in Appendix A). Note that since we do not have access to the exact correlations, we must use Monte Carlo estimates via sampling from $\pi_w$ for greedy selection. Given S samples $(\theta_s)_{s=1}^S \overset{\text{i.i.d.}}{\sim} \pi_w$, these are given by the N-dimensional vector

$$\mathrm{Corr} = \mathrm{diag}\left[\frac{1}{S}\sum_{s=1}^S g_s g_s^T\right]^{-\frac{1}{2}}\left(\frac{1}{S}\sum_{s=1}^S g_s g_s^T (1 - w)\right), \qquad g_s := \begin{bmatrix} f_1(\theta_s) \\ \vdots \\ f_N(\theta_s) \end{bmatrix} - \frac{1}{S}\sum_{r=1}^S \begin{bmatrix} f_1(\theta_r) \\ \vdots \\ f_N(\theta_r) \end{bmatrix}, \qquad (11)$$


where $\mathrm{diag}[\cdot]$ returns a diagonal matrix with the same diagonal entries as its argument. The details of using the correlation estimate (Eq. (11)) in the greedy selection rule (Eq. (10)) to add points to the coreset are shown in lines 4–9 of Algorithm 1. Note that this computation has cost O(NS). If N is large enough that computing the entire vectors $g_s \in \mathbb{R}^N$ is cost-prohibitive, one may instead compute $g_s$ in Eq. (11) only for indices in $\mathcal{I} \cup \mathcal{U}$—where $\mathcal{I} = \{n \in [N] : w_n > 0\}$ is the set of active indices, and $\mathcal{U}$ is a uniformly selected subsample of $U \in [N]$ indices—and perform greedy selection only within these indices.
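A minimal sketch (ours) of the selection rule in Eqs. (10)–(11), again assuming samples from the current $\pi_w$ are provided externally:

```python
import numpy as np

def greedy_select(f, w, samples):
    """Greedy selection of the next potential via Eqs. (10)-(11).

    Returns the index n* maximizing the Monte Carlo correlation between f_n and the
    residual f^T(1 - w) under pi_w (absolute value for active indices, signed otherwise).
    """
    F = np.stack([f(theta) for theta in samples])        # S x N
    G = F - F.mean(axis=0, keepdims=True)                 # centered potentials g_s
    S = F.shape[0]
    cov = (G.T @ (G @ (1.0 - w))) / S                     # (1/S) sum_s g_s g_s^T (1 - w)
    sd = np.sqrt(np.mean(G ** 2, axis=0)) + 1e-12         # diag[(1/S) sum_s g_s g_s^T]^(1/2)
    corr = cov / sd
    score = np.where(w > 0, np.abs(corr), corr)           # Eq. (10): |Corr_n| if active, Corr_n otherwise
    return int(np.argmax(score))
```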

3.3 Weight update

After selecting a new potential function $n^\star$, we add it to the active set of indices $\mathcal{I} \subseteq [N]$ and update the weights by optimizing

$$w^\star = \arg\min_{v \in \mathbb{R}^N}\; \mathrm{D_{KL}}\left(\pi_v\,\|\,\pi\right) \quad \text{s.t.} \quad v \geq 0,\ (\mathbf{1} - \mathbb{1}_{\mathcal{I}})^T v = 0. \qquad (12)$$

In particular, we run T steps of generating S samples $(\theta_s)_{s=1}^S \overset{\text{i.i.d.}}{\sim} \pi_w$, computing a Monte Carlo estimate D of the gradient $\nabla_w \mathrm{D_{KL}}(\pi_w\|\pi_1)$ based on Eq. (9),

$$D := -\frac{1}{S}\sum_{s=1}^S g_s g_s^T (1 - w) \in \mathbb{R}^N, \qquad g_s \text{ as in Eq. (11)}, \qquad (13)$$

and taking a stochastic gradient step $w_n \leftarrow w_n - \gamma_t D_n$ at step $t \in [T]$ for each $n \in \mathcal{I}$, using a typical learning rate $\gamma_t \propto t^{-1}$. The details of the weight update step are shown in lines 10–15 of Algorithm 1. As in the greedy selection step, the cost of each gradient step is O(NS), due to the $g_s^T \mathbf{1}$ term in the gradient. If N is large enough that this computation is cost-prohibitive, one can use $g_s$ computed only for indices in $\mathcal{I} \cup \mathcal{U}$, where $\mathcal{U}$ is a uniformly selected subsample of $U \in [N]$ indices.
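Putting Eqs. (12)–(13) together, the weight update is a stochastic gradient loop over the active indices; a sketch (ours), where `sample_coreset_posterior` is a hypothetical helper wrapping an MCMC sampler for $\pi_w$:

```python
import numpy as np

def update_weights(f, w, active, sample_coreset_posterior, T=100, S=100, gamma0=1.0):
    """Stochastic gradient descent on D_KL(pi_w || pi_1) over the active indices (Eqs. (12)-(13))."""
    w = w.copy()
    for t in range(1, T + 1):
        samples = sample_coreset_posterior(w, S)        # assumed external MCMC step, cost O(M)
        F = np.stack([f(theta) for theta in samples])   # S x N
        G = F - F.mean(axis=0, keepdims=True)
        D = -(G.T @ (G @ (1.0 - w))) / S                # gradient estimate, Eq. (13)
        w[active] -= (gamma0 / t) * D[active]           # step only on active indices, gamma_t = gamma0/t
        w = np.maximum(w, 0.0)                          # clip to respect w >= 0 in Eq. (12) (our addition)
    return w
```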

4 The information geometry of coreset construction

The perspective of coresets as a sparse exponential family also enables the use of information geometry to derive a unifying connection between the variational formulation and previous constructions. In particular, the family of coreset posteriors defines a Riemannian statistical manifold $\mathcal{M} = \{\pi_w\}_{w \in \mathbb{R}^N_{\geq 0}}$ with chart $\mathcal{M} \to \mathbb{R}^N_{\geq 0}$, endowed with the Fisher information metric G [61, p. 33, 34],

$$G(w) = \int \pi_w(\theta)\,\nabla_w \log \pi_w(\theta)\,\nabla_w \log \pi_w(\theta)^T \,\mathrm{d}\theta = \nabla^2_w \log Z(w) = \mathrm{Cov}_w\left[f\right]. \qquad (14)$$

For any differentiable curve $\gamma : [0, 1] \to \mathbb{R}^N_{\geq 0}$, the metric defines a notion of path length,

$$L(\gamma) = \int_0^1 \sqrt{\frac{\mathrm{d}\gamma(t)}{\mathrm{d}t}^T G(\gamma(t))\, \frac{\mathrm{d}\gamma(t)}{\mathrm{d}t}}\; \mathrm{d}t, \qquad (15)$$

and a constant-speed curve of minimal length between any two points $w, w' \in \mathbb{R}^N_{\geq 0}$ is referred to as a geodesic [61, Thm. 5.2]. The geodesics are the generalization of straight lines in Euclidean space to curved Riemannian manifolds, such as $\mathcal{M}$. Using this information-geometric view, Proposition 1 shows that both Hilbert coreset construction (Eq. (3)) and the proposed greedy sparse variational inference procedure (Algorithm 1) attempt to directionally align the $\hat w \to w$ and $\hat w \to \mathbf{1}$ geodesics on $\mathcal{M}$ for $\hat w, w, \mathbf{1} \in \mathbb{R}^N_{\geq 0}$ (reference, coreset, and true posterior weights, respectively), as illustrated in Fig. 1. The key difference is that Hilbert coreset construction uses a fixed reference point $\hat w$—corresponding to $\hat\pi$ in Eq. (3)—and thus operates entirely in a single tangent space of $\mathcal{M}$, while the proposed greedy method uses $\hat w = w$ and thus improves its tangent space approximation as the algorithm iterates. For this reason, we refer to the method in Section 3 as a Riemannian coreset construction algorithm. In addition to this unification of coreset construction methods, the geometric perspective also provides the means to show that the Hilbert coresets objective bounds the symmetrized coreset KL divergence $\mathrm{D_{KL}}(\pi_w\|\pi) + \mathrm{D_{KL}}(\pi\|\pi_w)$ if the Riemannian metric does not vary too much, as shown in Proposition 2. Incidentally, Lemma 3 in Appendix A—which is used to prove Proposition 2—also provides a nonnegative unbiased estimate of the symmetrized coreset KL divergence, which may be used for performance monitoring in practice.
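For completeness, the identity in Eq. (14) follows in one line from the exponential family form in Eq. (7); a short derivation (added here):

$$\nabla_w \log \pi_w(\theta) = f(\theta) - \nabla_w \log Z(w) = f(\theta) - \mathbb{E}_w[f] \quad\Longrightarrow\quad G(w) = \mathbb{E}_w\!\left[(f - \mathbb{E}_w[f])(f - \mathbb{E}_w[f])^T\right] = \mathrm{Cov}_w[f] = \nabla^2_w \log Z(w).$$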


Algorithm 1 Greedy sparse stochastic variational inference

1: procedure SPARSEVI($f$, $\pi_0$, $S$, $T$, $(\gamma_t)_{t=1}^\infty$, $M$)
2:   $w \leftarrow 0 \in \mathbb{R}^N$, $\mathcal{I} \leftarrow \emptyset$
3:   for $m = 1, \ldots, M$ do
        ▷ Take S samples from the current coreset posterior approximation $\pi_w$
4:     $(\theta_s)_{s=1}^S \overset{\text{i.i.d.}}{\sim} \pi_w \propto \exp(w^T f(\theta))\,\pi_0(\theta)$
        ▷ Compute the N-dimensional potential vector for each sample
5:     $f_s \leftarrow f(\theta_s) \in \mathbb{R}^N$ for $s \in [S]$, and $\bar f \leftarrow \frac{1}{S}\sum_{s=1}^S f_s$
6:     $g_s \leftarrow f_s - \bar f$ for $s \in [S]$
        ▷ Estimate correlations between the potentials and the residual error
7:     $\mathrm{Corr} \leftarrow \mathrm{diag}\left[\frac{1}{S}\sum_{s=1}^S g_s g_s^T\right]^{-\frac{1}{2}}\left(\frac{1}{S}\sum_{s=1}^S g_s g_s^T (1 - w)\right) \in \mathbb{R}^N$
        ▷ Add the best next potential to the coreset
8:     $n^\star \leftarrow \arg\max_{n \in [N]}\ |\mathrm{Corr}_n|\,\mathbb{1}[n \in \mathcal{I}] + \mathrm{Corr}_n\,\mathbb{1}[n \notin \mathcal{I}]$
9:     $\mathcal{I} \leftarrow \mathcal{I} \cup \{n^\star\}$
        ▷ Update all the active weights in $\mathcal{I}$ via stochastic gradient descent on $\mathrm{D_{KL}}(\pi_w\|\pi)$
10:    for $t = 1, \ldots, T$ do
          ▷ Use samples from $\pi_w$ to estimate the gradient
11:      $(\theta_s)_{s=1}^S \overset{\text{i.i.d.}}{\sim} \pi_w \propto \exp(w^T f(\theta))\,\pi_0(\theta)$
12:      $f_s \leftarrow f(\theta_s) \in \mathbb{R}^N$ for $s \in [S]$, and $\bar f \leftarrow \frac{1}{S}\sum_{s=1}^S f_s$
13:      $g_s \leftarrow f_s - \bar f$ for $s \in [S]$
14:      $D \leftarrow -\frac{1}{S}\sum_{s=1}^S g_s g_s^T (1 - w)$
          ▷ Take a stochastic gradient step for active indices in $\mathcal{I}$
15:      $w \leftarrow w - \gamma_t I_{\mathcal{I}} D$, where $I_{\mathcal{I}} := \sum_{n \in \mathcal{I}} \mathbb{1}_n \mathbb{1}_n^T$ is the diagonal indicator matrix for $\mathcal{I}$
16:    end for
17:  end for
18:  return $w$
19: end procedure

Proposition 1. Suppose $\hat\pi$ in Eq. (3) satisfies $\hat\pi = \pi_{\hat w}$ for a set of weights $\hat w \in \mathbb{R}^N_{\geq 0}$. For $u, v \in \mathbb{R}^N_{\geq 0}$, let $\xi_{u\to v}$ denote the initial tangent of the $u \to v$ geodesic on $\mathcal{M}$, and $\langle\cdot,\cdot\rangle_u$ denote the inner product under the Riemannian metric $G(u)$ with induced norm $\|\cdot\|_u$. Then Hilbert coreset construction in Eq. (3) is equivalent to

$$w^\star = \arg\min_{w \in \mathbb{R}^N}\; \left\|\xi_{\hat w \to \mathbf{1}} - \xi_{\hat w \to w}\right\|_{\hat w} \quad \text{s.t.} \quad w \geq 0,\ \|w\|_0 \leq M, \qquad (16)$$

and each greedy selection step of Riemannian coreset construction in Eq. (10) is equivalent to

$$n^\star = \arg\min_{n \in [N],\, t_n \in \mathbb{R}}\; \left\|\xi_{w \to \mathbf{1}} - \xi_{w \to w + t_n \mathbb{1}_n}\right\|_{w} \quad \text{s.t.} \quad \forall n \notin \mathcal{I},\ t_n > 0. \qquad (17)$$

Proposition 2. Suppose $\hat\pi$ in Eq. (3) satisfies $\hat\pi = \pi_{\hat w}$ for a set of weights $\hat w \in \mathbb{R}^N_{\geq 0}$. Then if $J_{\hat\pi}(w)$ is the objective function in Eq. (3),

$$\mathrm{D_{KL}}\left(\pi\,\|\,\pi_w\right) + \mathrm{D_{KL}}\left(\pi_w\,\|\,\pi\right) \leq C_{\hat\pi}(w) \cdot J_{\hat\pi}(w), \qquad (18)$$

where $C_{\hat\pi}(w) := \mathbb{E}_{U \sim \mathrm{Unif}[0,1]}\left[\lambda_{\max}\left(G(\hat w)^{-1/2}\, G((1 - U)w + U\mathbf{1})\, G(\hat w)^{-1/2}\right)\right]$. In particular, if $\nabla^2_w \log Z(w)$ is constant in $w \in \mathbb{R}^N_{\geq 0}$, then $C_{\hat\pi}(w) = 1$.

5 Experiments

In this section, we compare the quality of coresets constructed via the proposed SparseVI greedy coreset construction method, uniform random subsampling, and Hilbert coreset construction (GIGA [32]). In particular, for GIGA we used a 100-dimensional random projection generated from a Gaussian $\hat\pi$ with two parametrizations: one with mean and covariance set using the moments of the exact posterior (Optimal), which is a benchmark but is not possible to achieve in practice; and one with mean and covariance uniformly distributed between the prior and the posterior with 75% relative noise added (Realistic), to simulate the choice of $\hat\pi$ without exact posterior information.


Figure 1: Information-geometric view of greedy coreset construction on the coreset manifold $\mathcal{M}$. (1a): Hilbert coreset construction, with weighting distribution $\pi_{\hat w}$, full posterior π, coreset posterior $\pi_w$, and arrows denoting initial geodesic directions from $\hat w$ towards new datapoints. (1b): Riemannian coreset construction, with the path of posterior approximations $\pi_{w_t}$, $t = 0, \ldots, 3$, and arrows denoting initial geodesic directions towards new datapoints to add within each tangent plane.


Figure 2: (2a): Synthetic comparison of coreset construction methods. Solid lines show the median KL divergence over 10 trials, with 25th and 75th percentiles shown by shaded areas. (2b): 2D projection of coresets after 0, 1, 5, 20, 50, and 100 iterations via SparseVI. True/coreset posterior and 2σ-predictive ellipses are shown in black/blue respectively. Coreset points are black with radius denoting weight.

Experiments were performed on a machine with an Intel i7 8700K processor and 32GB memory; code is available at www.github.com/trevorcampbell/bayesian-coresets.

5.1 Synthetic Gaussian posterior inference

We first compared the coreset construction algorithms on a synthetic example involving posterior inference for the mean of a d-dimensional Gaussian with Gaussian observations,

$$\theta \sim \mathcal{N}(\mu_0, \Sigma_0), \qquad x_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, \Sigma), \quad n = 1, \ldots, N. \qquad (19)$$

We selected this example because it decouples the evaluation of the coreset construction methods from the concerns of stochastic optimization and approximate posterior inference: the coreset posterior $\pi_w$ is a Gaussian $\pi_w = \mathcal{N}(\mu_w, \Sigma_w)$ with closed-form expressions for the parameters as well as covariance (see Appendix B for the derivation),

$$\Sigma_w = \left(\Sigma_0^{-1} + \sum_{n=1}^N w_n \Sigma^{-1}\right)^{-1}, \qquad \mu_w = \Sigma_w\left(\Sigma_0^{-1}\mu_0 + \Sigma^{-1}\sum_{n=1}^N w_n x_n\right), \qquad (20)$$

$$\mathrm{Cov}_w\left[f_n, f_m\right] = \tfrac{1}{2}\,\mathrm{tr}\,\Psi^T\Psi + \nu_m^T \Psi \nu_n, \qquad (21)$$


Figure 3: (3a): Coreset construction on regression of housing prices using radial basis functions in the UK Land Registry data. Solid lines show the median KL divergence over 10 trials, with 25th and 75th percentiles shown by shaded areas. (3b): Posterior mean contours with coresets of size 0–300 via SparseVI compared with the exact posterior. Posterior and final coreset highlighted in the top row. Coreset points are black with radius denoting weight.

where $\Sigma = QQ^T$, $\nu_n := Q^{-1}(x_n - \mu_w)$, and $\Psi := Q^{-1}\Sigma_w Q^{-T}$. Thus the greedy selection and weight update can be performed without Monte Carlo estimation. We set $\Sigma_0 = \Sigma = I$, $\mu_0 = 0$, $d = 200$, and $N = 1{,}000$. We used a learning rate of $\gamma_t = t^{-1}$, $T = 100$ weight update optimization iterations, and $M = 200$ greedy iterations, although note that this is an upper bound on the size of the coreset as the same data point may be selected multiple times. The results in Fig. 2 demonstrate that the use of a fixed weighting function $\hat\pi$ (and thus, a fixed tangent plane on the coreset manifold) fundamentally limits the quality of coresets constructed via past algorithms. In contrast, the proposed greedy algorithm is "manifold-aware" and is able to continually improve the approximation, resulting in orders-of-magnitude improvements in KL divergence to the true posterior.
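Because everything is Gaussian in this example, the quantities required by the greedy selection and weight update steps are available exactly; a minimal numpy sketch (ours) of Eqs. (20)–(21):

```python
import numpy as np

def gaussian_coreset_posterior(w, X, Sig0, Sig, mu0):
    """Closed-form coreset posterior N(mu_w, Sig_w) for the Gaussian-mean model, Eq. (20)."""
    Sig0inv, Siginv = np.linalg.inv(Sig0), np.linalg.inv(Sig)
    Sigw = np.linalg.inv(Sig0inv + w.sum() * Siginv)      # sum_n w_n Sig^{-1} = (sum w) Sig^{-1}
    muw = Sigw @ (Sig0inv @ mu0 + Siginv @ (w @ X))       # w @ X = sum_n w_n x_n
    return muw, Sigw

def potential_cov(n, m, X, Sig, muw, Sigw):
    """Exact Cov_w[f_n, f_m] from Eq. (21); Cholesky factor used as the square root Q of Sig."""
    Q = np.linalg.cholesky(Sig)
    Qinv = np.linalg.inv(Q)
    Psi = Qinv @ Sigw @ Qinv.T
    nu_n, nu_m = Qinv @ (X[n] - muw), Qinv @ (X[m] - muw)
    return 0.5 * np.trace(Psi.T @ Psi) + nu_m @ Psi @ nu_n
```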

5.2 Bayesian radial basis function regression

Next, we compared the coreset construction algorithms on Bayesian basis function regression for $N = 10{,}000$ records of house sale log-price $y_n \in \mathbb{R}$ as a function of latitude/longitude coordinates $x_n \in \mathbb{R}^2$ in the UK.³ The regression problem involved inference for the coefficients $\alpha \in \mathbb{R}^K$ in a linear combination of radial basis functions $b_k(x) = \exp\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right)$, $k = 1, \ldots, K$,

$$y_n = b_n^T\alpha + \epsilon_n, \qquad \epsilon_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \qquad b_n = \left[\,b_1(x_n)\ \cdots\ b_K(x_n)\,\right]^T, \qquad \alpha \sim \mathcal{N}(\mu_0, \sigma_0^2 I). \qquad (22)$$

We generated 50 basis functions for each of 6 scales $\sigma_k \in \{0.2, 0.4, 0.8, 1.2, 1.6, 2.0\}$ by generating means $\mu_k$ uniformly from the data, and added one additional near-constant basis with scale 100 and mean corresponding to the mean latitude and longitude of the data. This resulted in $K = 301$ total basis functions and thus a 301-dimensional regression problem. We set the prior and noise parameters $\mu_0$, $\sigma_0^2$, $\sigma^2$ equal to the empirical mean, second moment, and variance of the price paid $(y_n)_{n=1}^N$ across the whole dataset, respectively. As in Section 5.1, the posterior and log-likelihood covariances are available in closed form, and all algorithmic steps can be performed without Monte Carlo. In particular, $\pi_w = \mathcal{N}(\mu_w, \Sigma_w)$, where (see Appendix B for the derivation)

$$\Sigma_w = \left(\Sigma_0^{-1} + \sigma^{-2}\sum_{n=1}^N w_n b_n b_n^T\right)^{-1} \quad \text{and} \quad \mu_w = \Sigma_w\left(\Sigma_0^{-1}\mu_0 + \sigma^{-2}\sum_{n=1}^N w_n y_n b_n\right), \qquad (23)$$

$$\mathrm{Cov}_w\left[f_n, f_m\right] = \sigma^{-4}\left(\nu_n \nu_m \beta_n^T\beta_m + \tfrac{1}{2}(\beta_n^T\beta_m)^2\right), \qquad (24)$$

where $\nu_n := y_n - \mu_w^T b_n$, $\Sigma_w = LL^T$, and $\beta_n := L^T b_n$. We used a learning rate of $\gamma_t = t^{-1}$, $T = 100$ optimization steps, and $M = 300$ greedy iterations, although again note that this is an upper bound on the coreset size.
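A brief sketch (ours) of the basis construction described above and the closed-form coreset posterior of Eq. (23); the helper names and the fixed random seed are illustrative assumptions:

```python
import numpy as np

def make_rbf_features(X, scales=(0.2, 0.4, 0.8, 1.2, 1.6, 2.0), per_scale=50, seed=0):
    """Radial basis features: per_scale bases at each scale with means drawn uniformly from the
    data, plus one near-constant basis (scale 100) centered at the data mean."""
    rng = np.random.default_rng(seed)
    centers, widths = [], []
    for s in scales:
        idx = rng.choice(X.shape[0], size=per_scale, replace=False)
        centers.append(X[idx])
        widths.extend([s] * per_scale)
    centers.append(X.mean(axis=0, keepdims=True))
    widths.append(100.0)
    C, sig = np.vstack(centers), np.asarray(widths)
    def features(x):                                   # b(x) in R^K
        return np.exp(-0.5 * np.sum((x - C) ** 2, axis=1) / sig ** 2)
    return features

def coreset_posterior(w, B, y, mu0, sig0sq, sigsq):
    """Closed-form N(mu_w, Sig_w) of Eq. (23); B is the N x K matrix with rows b_n^T."""
    K = B.shape[1]
    Sigw = np.linalg.inv(np.eye(K) / sig0sq + (B.T * w) @ B / sigsq)   # sum_n w_n b_n b_n^T
    muw = Sigw @ (mu0 / sig0sq + B.T @ (w * y) / sigsq)                # sum_n w_n y_n b_n
    return muw, Sigw
```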

³This dataset was constructed by merging housing prices from the UK land registry data https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads with latitude & longitude coordinates from the Geonames postal code data http://download.geonames.org/export/zip/.


Figure 4: The results of the logistic (4a) and Poisson (4b) regression experiments. Plots show the median KL divergence (estimated using the Laplace approximation [62] and normalized by the value for the prior) across 10 trials, with 25th and 75th percentiles shown by shaded areas. From top to bottom, (4a) shows the results for logistic regression on synthetic, chemical reactivities, and phishing websites data, while (4b) shows the results for Poisson regression on synthetic, bike trips, and airport delays data. See Appendix C for details.

The results in Fig. 3 generally align with those from the previous synthetic experiment. The proposed sparse variational inference formulation builds coresets of comparable quality to Hilbert coreset construction (when given the exact posterior for $\hat\pi$) up to a size of about 150. Beyond this point, past methods become limited by their fixed tangent plane approximation while the proposed method continues to improve. This experiment also highlights the sensitivity of past methods to the choice of $\hat\pi$: uniform subsampling outperforms GIGA with a realistic choice of $\hat\pi$.

5.3 Bayesian logistic and Poisson regression

Finally, we compared the methods on logistic and Poisson regression applied to six datasets (details may be found in Appendix C) with $N = 500$ and dimension ranging from 2 to 15. We used $M = 100$ greedy iterations, $S = 100$ samples for Monte Carlo covariance estimation, and $T = 500$ optimization iterations with learning rate $\gamma_t = 0.5\,t^{-1}$. Fig. 4 shows the result of this test, demonstrating that the proposed greedy sparse VI method successfully recovers a coreset with divergence from the exact posterior as low or lower than GIGA, without having the benefit of a user-specified weighting function. Note that there is a computational price to pay for this level of automation; Fig. 5 in Appendix C shows that SparseVI is significantly slower than Hilbert coreset construction via GIGA [32], primarily due to the expensive gradient descent weight update. However, if we remove GIGA (Optimal) from consideration due to its unrealistic use of $\hat\pi \approx \pi_1$, SparseVI is the only practical coreset construction algorithm that reduces the KL divergence to the posterior appreciably for reasonable coreset sizes. We leave improvements to computational cost for future work.
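The KL divergences reported in Fig. 4 are estimated with the Laplace approximation [62]; a minimal sketch (ours) of such an estimate, approximating both the coreset and exact posteriors by Gaussians at their modes and evaluating the Gaussian KL in closed form (using the BFGS inverse-Hessian as the covariance is an assumption of this sketch; in practice one would compute the exact Hessian of the log-density at the mode):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approx(neg_log_joint, theta0):
    """Gaussian Laplace approximation: the mode of the (unnormalized) density, with the BFGS
    inverse-Hessian estimate at the mode used as a rough covariance."""
    res = minimize(neg_log_joint, theta0, method="BFGS")
    return res.x, np.asarray(res.hess_inv)

def gaussian_kl(mu0, Sig0, mu1, Sig1):
    """KL( N(mu0, Sig0) || N(mu1, Sig1) ) in closed form."""
    d = mu0.size
    Sig1inv = np.linalg.inv(Sig1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(Sig0)
    _, logdet1 = np.linalg.slogdet(Sig1)
    return 0.5 * (np.trace(Sig1inv @ Sig0) + diff @ Sig1inv @ diff - d + logdet1 - logdet0)

# Usage sketch: fit (mu_w, Sig_w) to the coreset negative log-joint and (mu, Sig) to the
# full-data negative log-joint, then report gaussian_kl(mu_w, Sig_w, mu, Sig).
```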

6 Conclusion

This paper introduced sparse variational inference for Bayesian coreset construction. By exploiting the fact that coreset posteriors form an exponential family, a greedy algorithm as well as a unifying Riemannian information-geometric view of present and past coreset constructions were developed. Future work includes extending sparse VI to improved optimization techniques beyond greedy methods, and reducing computational cost.

Acknowledgments T. Campbell and B. Beronov are supported by Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants. T. Campbell is additionally supported by an NSERC Discovery Launch Supplement.


References

[1] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer, 2nd edition, 2004.

[2] Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin. Bayesian data analysis. CRC Press, 3rd edition, 2013.

[3] Michael Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.

[4] Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[5] Atılım Güneş Baydin, Barak Pearlmutter, Alexey Radul, and Jeffrey Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018.

[6] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18:1–45, 2017.

[7] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, 2014.

[8] Radford Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov chain Monte Carlo, chapter 5. CRC Press, 2011.

[9] Matthew Hoffman and Andrew Gelman. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1351–1381, 2014.

[10] Rémi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18:1–43, 2017.

[11] Steven Scott, Alexander Blocker, Fernando Bonassi, Hugh Chipman, Edward George, and Robert McCulloch. Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11:78–88, 2016.

[12] Michael Betancourt. The fundamental incompatibility of Hamiltonian Monte Carlo and data subsampling. In International Conference on Machine Learning, 2015.

[13] Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. The Annals of Statistics, 2018 (to appear).

[14] Yixin Wang and David Blei. Frequentist consistency of variational Bayes. Journal of the American Statistical Association, 0(0):1–15, 2018.

[15] Yun Yang, Debdeep Pati, and Anirban Bhattacharya. α-variational inference with statistical guarantees. The Annals of Statistics, 2018 (to appear).

[16] Badr-Eddine Chérief-Abdellatif and Pierre Alquier. Consistency of variational Bayes inference for estimation and model selection in mixtures. Electronic Journal of Statistics, 12:2995–3035, 2018.

[17] Kevin Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012.

[18] Matthew Hoffman, David Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14:1303–1347, 2013.

[19] Maxim Rabinovich, Elaine Angelino, and Michael Jordan. Variational consensus Monte Carlo. In Advances in Neural Information Processing Systems, 2015.

[20] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia Wilson, and Michael Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, 2013.

[21] Trevor Campbell, Julian Straub, John W. Fisher III, and Jonathan How. Streaming, distributed variational inference for Bayesian nonparametrics. In Advances in Neural Information Processing Systems, 2015.

[22] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, 2011.

[23] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In International Conference on Machine Learning, 2012.


[24] Rémi Bardenet, Arnaud Doucet, and Chris C Holmes. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In International Conference on Machine Learning, pages 405–413, 2014.

[25] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC land: cutting the Metropolis-Hastings budget. In International Conference on Machine Learning, 2014.

[26] Dougal Maclaurin and Ryan Adams. Firefly Monte Carlo: exact MCMC with subsets of data. In Conference on Uncertainty in Artificial Intelligence, 2014.

[27] Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP: scalable Bayes via barycenters of subset posteriors. In International Conference on Artificial Intelligence and Statistics, 2015.

[28] Reihaneh Entezari, Radu Craiu, and Jeffrey Rosenthal. Likelihood inflating sampling algorithm. arXiv:1605.02113, 2016.

[29] Elaine Angelino, Matthew Johnson, and Ryan Adams. Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning, 9(1–2):1–129, 2016.

[30] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for Bayesian logistic regression. In Advances in Neural Information Processing Systems, 2016.

[31] Trevor Campbell and Tamara Broderick. Automated scalable Bayesian inference via Hilbert coresets. Journal of Machine Learning Research, 20(15):1–38, 2019.

[32] Trevor Campbell and Tamara Broderick. Bayesian coreset construction via greedy iterative geodesic ascent. In International Conference on Machine Learning, 2018.

[33] Kenneth Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4), 2010.

[34] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015.

[35] Francesco Locatello, Michael Tschannen, Gunnar Rätsch, and Martin Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. In Advances in Neural Information Processing Systems, 2017.

[36] Andrew Barron, Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Approximation and learning by greedy algorithms. The Annals of Statistics, 36(1):64–94, 2008.

[37] Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Uncertainty in Artificial Intelligence, 2010.

[38] Robert Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

[39] Ferenc Huszar and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Uncertainty in Artificial Intelligence, 2012.

[40] Yoav Freund and Robert Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

[41] Emmanuel Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[42] Emmanuel Candès and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.

[43] David Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[44] Holger Boche, Robert Calderbank, Gitta Kutyniok, and Jan Vybíral. A survey of compressed sensing. In Holger Boche, Robert Calderbank, Gitta Kutyniok, and Jan Vybíral, editors, Compressed Sensing and its Applications: MATHEON Workshop 2013. Birkhäuser, 2015.

[45] Stéphane Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

[46] Sheng Chen, Stephen Billings, and Wan Luo. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50(5):1873–1896, 1989.


[47] Scott Chen, David Donoho, and Michael Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 1999.

[48] Joel Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.

[49] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1):267–288, 1996.

[50] Leo Geppert, Katja Ickstadt, Alexander Munteanu, Jens Quedenfeld, and Christian Sohler. Random projections for Bayesian regression. Statistics and Computing, 27:79–101, 2017.

[51] Daniel Ahfock, William Astle, and Sylvia Richardson. Statistical properties of sketching algorithms. arXiv:1706.03665, 2017.

[52] Pankaj Agarwal, Sariel Har-Peled, and Kasturi Varadarajan. Geometric approximation via coresets. Combinatorial and computational geometry, 52:1–30, 2005.

[53] Michael Langberg and Leonard Schulman. Universal ε-approximators for integrals. In Proceedings of the 21st Annual ACM–SIAM Symposium on Discrete Algorithms, pages 598–607, 2010.

[54] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, pages 569–578, 2011.

[55] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the 24th Annual ACM–SIAM Symposium on Discrete Algorithms, pages 1434–1453, 2013.

[56] Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv:1703.06476, 2017.

[57] Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv:1612.00889, 2016.

[58] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.

[59] Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 35:110–119, 1986.

[60] Martin Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In International Conference on Machine Learning, 2013.

[61] Shun-ichi Amari. Information Geometry and its Applications. Springer, 2016.

[62] Luke Tierney and Joseph Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, 1986.
