Transcript of "Probabilistic Inference and Learning with Stein's Method"
(slides: …lmackey/papers/gsd_ksd-slides.pdf)

Page 1:

Probabilistic Inference and Learning with

Stein’s Method

Lester Mackey

Microsoft Research New England

November 6, 2019

Collaborators: Jackson Gorham, Andrew Duncan, Sebastian Vollmer, Jonathan Huggins, Wilson Chen, Alessandro Barp, Francois-Xavier Briol, Mark Girolami, Chris Oates, Murat Erdogdu, and Ohad Shamir


Pages 2-7:

Motivation: Large-scale Posterior Inference

Example: Bayesian logistic regression
1. Fixed feature vectors: vl ∈ Rd for each datapoint l = 1, . . . , L
2. Binary class labels: Yl ∈ {0, 1}, P(Yl = 1 | vl, β) = 1 / (1 + e^(−⟨β, vl⟩))
3. Unknown parameter vector: β ∼ N(0, I)

Generative model simple to express
Posterior distribution over unknown parameters is complex
Normalization constant unknown, exact integration intractable

Standard inferential approach: Use Markov chain Monte Carlo (MCMC) to (eventually) draw samples from the posterior distribution

Benefit: Approximates intractable posterior expectations EP[h(Z)] = ∫X p(x)h(x)dx with asymptotically exact sample estimates EQ[h(X)] = (1/n) ∑_{i=1}^n h(xi)

Problem: Each new MCMC sample point xi requires iterating over the entire observed dataset: prohibitive when the dataset is large!

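To make the setup concrete, here is a minimal sketch (not from the slides) of the unnormalized log-posterior and its gradient for the Bayesian logistic regression model above; ∇ log p is the only model quantity the Stein discrepancies introduced later will need, and all function and variable names here are illustrative.

```python
import numpy as np

def log_posterior_and_grad(beta, V, y):
    """Unnormalized log p(beta | data) and its gradient for Bayesian logistic
    regression with a N(0, I) prior: the normalizing constant is never needed.

    beta : (d,) parameter vector
    V    : (L, d) matrix of fixed feature vectors v_l
    y    : (L,) binary labels in {0, 1}
    """
    logits = V @ beta                                    # <beta, v_l> for each datapoint l
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * beta @ beta                       # N(0, I) prior, up to a constant
    grad = V.T @ (y - 1.0 / (1.0 + np.exp(-logits))) - beta
    return log_lik + log_prior, grad

# Given MCMC draws x_1, ..., x_n, a posterior expectation E_P[h(Z)] is estimated by
# the sample average E_Q[h(X)] = (1/n) * sum_i h(x_i), e.g. samples.mean(axis=0)
# for an (n, d) array of draws when h is the identity (the posterior mean).
```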

Pages 8-9:

Motivation: Large-scale Posterior Inference

Question: How do we scale Markov chain Monte Carlo (MCMC) posterior inference to massive datasets?

MCMC Benefit: Approximates intractable posterior expectations EP[h(Z)] = ∫X p(x)h(x)dx with asymptotically exact sample estimates EQ[h(X)] = (1/n) ∑_{i=1}^n h(xi)

Problem: Each point xi requires iterating over the entire dataset!

Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]
Approximate the standard MCMC procedure in a manner that makes use of only a small subset of datapoints per sample
Reduced computational overhead leads to faster sampling and reduced Monte Carlo variance
Introduces asymptotic bias: the target distribution is not stationary
Hope that, for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced


Pages 10-15:

Motivation: Large-scale Posterior Inference

Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]
Hope that, for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced

Introduces new challenges
How do we compare and evaluate samples from approximate MCMC procedures?
How do we select samplers and their tuning parameters?
How do we quantify the bias-variance trade-off explicitly?

Difficulty: Standard evaluation criteria like effective sample size, trace plots, and variance diagnostics assume convergence to the target distribution and do not account for asymptotic bias

This talk: Introduce new quality measures suitable for comparing the quality of approximate MCMC samples


Pages 16-22:

Quality Measures for Samples

Challenge: Develop a measure suitable for comparing the quality of any two samples approximating a common target distribution

Given
A continuous target distribution P with support X = Rd and density p
p known up to normalization, integration under P is intractable
Sample points x1, . . . , xn ∈ X
Define the discrete distribution Qn with, for any function h, EQn[h(X)] = (1/n) ∑_{i=1}^n h(xi), used to approximate EP[h(Z)]
We make no assumption about the provenance of the xi

Goal: Quantify how well EQn approximates EP in a manner that
I. Detects when a sample sequence is converging to the target
II. Detects when a sample sequence is not converging to the target
III. Is computationally feasible


Pages 23-28:

Integral Probability Metrics

Goal: Quantify how well EQn approximates EP
Idea: Consider an integral probability metric (IPM) [Muller, 1997]

dH(Qn, P) = sup_{h ∈ H} |EQn[h(X)] − EP[h(Z)]|

Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions H
When H is sufficiently large, convergence of dH(Qn, P) to zero implies that (Qn)n≥1 converges weakly to P (Requirement II)

Problem: Integration under P is intractable! ⇒ Most IPMs cannot be computed in practice

Idea: Only consider functions with EP[h(Z)] known a priori to be 0
Then the IPM computation only depends on Qn!
How do we select this class of test functions?
Will the resulting discrepancy measure track sample sequence convergence (Requirements I and II)?
How do we solve the resulting optimization problem in practice?

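As a toy illustration of the definition (my own, not from the slides): when H is a small finite class and the target's expectations happen to be known, the IPM is just a maximum of absolute differences. The slide's point is that for a real posterior the EP[h(Z)] terms are unavailable, which is exactly what motivates restricting attention to test functions whose target expectation is known to be 0.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2000)            # sample points defining Q_n; toy target P = N(0, 1)

# A small, hand-picked test-function class H with known expectations under P
H = [
    (lambda z: z,         0.0),              # E_P[Z] = 0
    (lambda z: z**2,      1.0),              # E_P[Z^2] = 1
    (lambda z: np.cos(z), np.exp(-0.5)),     # E_P[cos Z] = e^{-1/2}
]

# d_H(Q_n, P) = max_{h in H} |E_{Q_n}[h(X)] - E_P[h(Z)]| for this finite H
ipm = max(abs(h(x).mean() - ep) for h, ep in H)
print(ipm)
```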

Pages 29-33:

Stein's Method

Stein's method [1972] provides a recipe for controlling convergence:

1. Identify an operator T and a set G of functions g : X → Rd with EP[(T g)(Z)] = 0 for all g ∈ G.
T and G together define the Stein discrepancy [Gorham and Mackey, 2015]
S(Qn, T, G) := sup_{g ∈ G} |EQn[(T g)(X)]| = dT G(Qn, P),
an IPM-type measure with no explicit integration under P

2. Lower bound S(Qn, T, G) by a reference IPM dH(Qn, P)
⇒ (Qn)n≥1 converges to P whenever S(Qn, T, G) → 0 (Req. II)
Performed once, in advance, for large classes of distributions

3. Upper bound S(Qn, T, G) by any means necessary to demonstrate convergence to 0 (Requirement I)

Standard use: As an analytical tool to prove convergence
Our goal: Develop the Stein discrepancy into a practical quality measure


Pages 34-37:

Identifying a Stein Operator T

Goal: Identify an operator T for which EP[(T g)(Z)] = 0 for all g ∈ G

Approach: Generator method of Barbour [1988, 1990], Gotze [1991]
Identify a Markov process (Zt)t≥0 with stationary distribution P
Under mild conditions, its infinitesimal generator
(Au)(x) = lim_{t→0} (E[u(Zt) | Z0 = x] − u(x)) / t
satisfies EP[(Au)(Z)] = 0

Overdamped Langevin diffusion: dZt = (1/2)∇ log p(Zt) dt + dWt
Generator: (APu)(x) = (1/2)⟨∇u(x), ∇ log p(x)⟩ + (1/2)⟨∇, ∇u(x)⟩
Stein operator: (TPg)(x) := ⟨g(x), ∇ log p(x)⟩ + ⟨∇, g(x)⟩ [Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]

Depends on P only through ∇ log p; computable even if p cannot be normalized!
Multivariate generalization of the density method operator (T g)(x) = g(x) (d/dx) log p(x) + g′(x) [Stein, Diaconis, Holmes, and Reinert, 2004]
EP[(TPg)(Z)] = 0 for all g : X → Rd in the classical Stein set
G‖·‖ = {g : sup_{x ≠ y} max(‖g(x)‖∗, ‖∇g(x)‖∗, ‖∇g(x) − ∇g(y)‖ / ‖x − y‖) ≤ 1}

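A minimal sketch (my own, with illustrative names) of the Langevin Stein operator acting on a vector-valued test function g: it consumes only ∇ log p and the divergence of g, so the normalizing constant of p never appears.

```python
import numpy as np

def langevin_stein_operator(g, div_g, grad_log_p, x):
    """(T_P g)(x) = <g(x), grad log p(x)> + <grad, g(x)>.

    g          : function R^d -> R^d
    div_g      : function returning the divergence <grad, g(x)> at x
    grad_log_p : function returning grad log p(x) (no normalization required)
    """
    return g(x) @ grad_log_p(x) + div_g(x)

# Toy check for P = N(0, I_d), where grad log p(x) = -x: with g(x) = x,
# (T_P g)(x) = d - <x, x>, which has mean 0 under P (g is integrable enough here).
d = 3
g = lambda x: x
div_g = lambda x: float(d)
grad_log_p = lambda x: -x

rng = np.random.default_rng(0)
draws = rng.normal(size=(50000, d))
vals = np.array([langevin_stein_operator(g, div_g, grad_log_p, z) for z in draws])
print(vals.mean())   # close to 0, illustrating E_P[(T_P g)(Z)] = 0
```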

Pages 38-39:

Detecting Convergence and Non-convergence

Goal: Show that the classical Stein discrepancy S(Qn, TP, G‖·‖) → 0 if and only if (Qn)n≥1 converges to P

In the univariate case (d = 1), it is known that for many targets P, S(Qn, TP, G‖·‖) → 0 only if the Wasserstein distance dW‖·‖(Qn, P) → 0 [Stein, Diaconis, Holmes, and Reinert, 2004, Chatterjee and Shao, 2011, Chen, Goldstein, and Shao, 2011]
Few multivariate targets have been analyzed (see [Reinert and Rollin, 2009, Chatterjee and Meckes, 2008, Meckes, 2009] for the multivariate Gaussian)

New contribution [Gorham, Duncan, Vollmer, and Mackey, 2019]

Theorem (Stein Discrepancy-Wasserstein Equivalence)
If the Langevin diffusion couples at an integrable rate and ∇ log p is Lipschitz, then S(Qn, TP, G‖·‖) → 0 ⇔ dW‖·‖(Qn, P) → 0.

Examples: strongly log-concave P, Bayesian logistic regression or robust Student's t regression with Gaussian priors, Gaussian mixtures
Conditions not necessary: template for bounding S(Qn, TP, G‖·‖)


Page 40: Probabilistic Inference and Learning with Stein's …lmackey/papers/gsd_ksd-slides.pdfProbabilistic Inference and Learning with Stein’s Method Lester Mackey Microsoft Research New

A Simple Example

●●

●●

●●●●

● ●●

●●●

●●

●●

0.10

0.03

0.01

100 1000 10000Number of sample points, n

Ste

in d

iscr

epan

cy

Gaussian ScaledStudent's t

●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

−1.0−0.5

0.00.51.0

−1.0−0.5

0.00.51.0

−1.0−0.5

0.00.51.0

n = 300

n = 3000

n = 30000

−6 −3 0 3 6 −6 −3 0 3 6x

g

Gaussian ScaledStudent's t

●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●

●●●●●●●

●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

[Figure: sample points, recovered functions g, and test functions h = TP g for Gaussian vs. scaled Student's t samples; panels n = 300, 3000, 30000 over x ∈ [−6, 6]]

For target P = N(0, 1), compare i.i.d. N(0, 1) sample sequence Q1:n to scaled Student's t sequence Q′1:n with matching variance

Expect S(Q1:n, TP, G‖·‖,Q,G1) → 0 and S(Q′1:n, TP, G‖·‖,Q,G1) ↛ 0

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 10 / 33
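The two sequences being compared can be generated as in the short sketch below (the Student's t degrees of freedom are an assumed value, not stated on the slide); the t draws are rescaled so their variance matches the N(0, 1) target.

```python
# Sketch of the two sample sequences: i.i.d. N(0,1) draws vs. Student's t draws
# rescaled to unit variance.  The degrees of freedom nu is an assumed choice.
import numpy as np

rng = np.random.default_rng(0)
n, nu = 30000, 5
gaussian_seq = rng.standard_normal(n)                        # Q_{1:n}
t_seq = rng.standard_t(nu, size=n) / np.sqrt(nu / (nu - 2))  # Q'_{1:n}, variance matched to 1
```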


A Simple Example

[Figure: left, Stein discrepancy (log scale, 0.01 to 0.10) vs. number of sample points n (100 to 10000) for Gaussian and scaled Student's t samples; middle and right, functions g and h = TP g for n = 300, 3000, 30000 over x ∈ [−6, 6]]

Middle: Recovered optimal functions g

Right: Associated test functions h(x) := (TP g)(x) which best discriminate sample Q from target P

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 11 / 33


Selecting Sampler Hyperparameters

[Figure: top, log median diagnostic (ESS and spanner Stein discrepancy) vs. step size ε (1e−04 to 1e−02); bottom, SGLD sample scatter plots in (x1, x2) for step sizes ε = 5e−05, 5e−03, 5e−02]

Target posterior density: p(x) ∝ π(x) ∏_{l=1}^L π(yl | x)

Stochastic Gradient Langevin Dynamics [Welling and Teh, 2011]

xk+1 ∼ N(xk + (ε/2)(∇ log π(xk) + (L/|Bk|) ∑_{l∈Bk} ∇ log π(yl | xk)), εI)

Random batch Bk of datapoints used to draw each sample point
Step size ε too small ⇒ slow mixing
Step size ε too large ⇒ sampling from very different distribution
Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias

ESS maximized at ε = 5 × 10⁻², Stein minimized at ε = 5 × 10⁻³

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 12 / 33
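As a reading aid, here is a minimal numpy sketch of one SGLD update implementing the formula above; the model interface (grad_log_prior, grad_log_lik) and the minibatch handling are assumptions for illustration, not the method's reference code.

```python
# One SGLD step: x_{k+1} ~ N(x_k + (eps/2)(grad log prior(x_k)
#                 + (L/|B_k|) sum_{l in B_k} grad log lik(y_l | x_k)), eps I).
import numpy as np

def sgld_step(x, grad_log_prior, grad_log_lik, data, eps, batch_size, rng):
    """grad_log_lik(y, x) returns the gradient in x of log pi(y | x)."""
    L = len(data)
    batch = rng.choice(L, size=batch_size, replace=False)   # random minibatch B_k
    drift = grad_log_prior(x) + (L / batch_size) * sum(
        grad_log_lik(data[l], x) for l in batch)
    return x + 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(x.shape)
```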


Alternative Stein Sets G

Goal: Identify a more “user-friendly” Stein set G than the classical

Approach: Reproducing kernels k : X × X → R [Oates, Girolami, and Chopin, 2016, Chwialkowski, Strathmann, and Gretton, 2016, Liu, Lee, and Jordan, 2016]

A reproducing kernel k is symmetric (k(x, y) = k(y, x)) and positive semidefinite (∑_{i,l} ci cl k(zi, zl) ≥ 0, ∀ zi ∈ X, ci ∈ R)

Gaussian: k(x, y) = exp(−½‖x − y‖₂²), IMQ: k(x, y) = (1 + ‖x − y‖₂²)^(−1/2)

Generates a reproducing kernel Hilbert space (RKHS) Kk

Define the kernel Stein set [Gorham and Mackey, 2017]

Gk := {g = (g1, . . . , gd) | ‖v‖∗ ≤ 1 for vj := ‖gj‖Kk}

Yields closed-form kernel Stein discrepancy (KSD)

S(Qn, TP, Gk) = ‖w‖ for wj := √(∑_{i,i′=1}^n k_j^0(xi, xi′))

Reduces to parallelizable pairwise evaluations of Stein kernels

k_j^0(x, y) := (1/(p(x)p(y))) ∇_{xj} ∇_{yj} (p(x) k(x, y) p(y))
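For concreteness, a minimal numpy sketch (not the authors' implementation) of the IMQ kernel Stein discrepancy of an equally weighted sample follows; it spells out the Langevin Stein kernel for k(x, y) = (c² + ‖x − y‖₂²)^β and assumes a score function ∇ log p is available.

```python
# IMQ kernel Stein discrepancy of an equally weighted sample, using the Langevin
# Stein kernel k0(x,y) = <s(x),s(y)> k + <s(x), grad_y k> + <s(y), grad_x k>
#                        + trace(grad_x grad_y k), with s = grad log p.
import numpy as np

def imq_ksd(X, grad_log_p, c=1.0, beta=-0.5):
    """X: (n, d) sample; grad_log_p maps (n, d) points to their (n, d) scores."""
    n, d = X.shape
    S = grad_log_p(X)
    diff = X[:, None, :] - X[None, :, :]            # x_i - x_j, shape (n, n, d)
    r2 = np.sum(diff ** 2, axis=-1)                 # squared distances
    base = c ** 2 + r2
    k = base ** beta                                # IMQ kernel values
    grad_coef = 2.0 * beta * base ** (beta - 1.0)   # so grad_x k = grad_coef * (x - y)
    trace_term = -2.0 * beta * (d * base ** (beta - 1.0)
                                + 2.0 * (beta - 1.0) * r2 * base ** (beta - 2.0))
    ss = S @ S.T                                    # <s(x_i), s(x_j)>
    sx_diff = np.einsum('id,ijd->ij', S, diff)      # <s(x_i), x_i - x_j>
    sy_diff = np.einsum('jd,ijd->ij', S, diff)      # <s(x_j), x_i - x_j>
    k0 = ss * k + grad_coef * (sy_diff - sx_diff) + trace_term
    return np.sqrt(k0.sum()) / n                    # S(Q_n, T_P, G_k) for equal weights
```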

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 13 / 33


Detecting Non-convergence

Goal: Show (Qn)n≥1 converges to P whenever S(Qn, TP ,Gk)→ 0

Theorem (Univariate KSD detects non-convergence [Gorham and Mackey, 2017])

Suppose P ∈ P and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If d = 1, then (Qn)n≥1 converges weakly to P whenever S(Qn, TP, Gk) → 0.

P is the set of targets P with Lipschitz ∇ log p and distant strong log concavity (〈∇ log(p(x)/p(y)), y − x〉 / ‖x − y‖₂² ≥ k for ‖x − y‖₂ ≥ r)

Includes Bayesian logistic and Student's t regression with Gaussian priors, Gaussian mixtures with common covariance, ...

Justifies use of KSD with popular Gaussian, Matérn, or inverse multiquadric kernels k in the univariate case

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 14 / 33


Detecting Non-convergence

Goal: Show (Qn)n≥1 converges to P whenever S(Qn, TP ,Gk)→ 0

In higher dimensions, KSDs based on common kernels fail to detect non-convergence, even for Gaussian targets P

Theorem (KSD fails with light kernel tails [Gorham and Mackey, 2017])

Suppose d ≥ 3, P = N(0, Id), and α := (1/2 − 1/d)⁻¹. If k(x, y) and its derivatives decay at an o(‖x − y‖₂^(−α)) rate as ‖x − y‖₂ → ∞, then S(Qn, TP, Gk) → 0 for some (Qn)n≥1 not converging to P.

Gaussian (k(x, y) = exp(−½‖x − y‖₂²)) and Matérn kernels fail for d ≥ 3

Inverse multiquadric kernels (k(x, y) = (1 + ‖x − y‖₂²)^β) with β < −1 fail for d > 2β/(1 + β)

The violating sample sequences (Qn)n≥1 are simple to construct

Problem: Kernels with light tails ignore excess mass in the tails

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 15 / 33


Detecting Non-convergence

Goal: Show (Qn)n≥1 converges to P whenever S(Qn, TP ,Gk)→ 0

Consider the inverse multiquadric (IMQ) kernel

k(x, y) = (c² + ‖x − y‖₂²)^β for some β < 0, c ∈ R.

IMQ KSD fails to detect non-convergence when β < −1

However, IMQ KSD detects non-convergence when β ∈ (−1, 0)

Theorem (IMQ KSD detects non-convergence [Gorham and Mackey, 2017])

Suppose P ∈ P and k(x, y) = (c² + ‖x − y‖₂²)^β for β ∈ (−1, 0). If S(Qn, TP, Gk) → 0, then (Qn)n≥1 converges weakly to P.

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 16 / 33


Detecting Convergence

Goal: Show S(Qn, TP ,Gk)→ 0 whenever (Qn)n≥1 converges to P

Proposition (KSD detects convergence [Gorham and Mackey, 2017])

If k ∈ C_b^(2,2) and ∇ log p Lipschitz and square integrable under P, then S(Qn, TP, Gk) → 0 whenever the Wasserstein distance dW‖·‖₂(Qn, P) → 0.

Covers Gaussian, Matérn, IMQ, and other common bounded kernels k

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 17 / 33


Selecting Samplers

Stochastic Gradient Fisher Scoring (SGFS) [Ahn, Korattikara, and Welling, 2012]

Approximate MCMC procedure designed for scalability

Approximates Metropolis-adjusted Langevin algorithm but does not use Metropolis-Hastings correction
Target P is not the stationary distribution

Goal: Choose between two variants

SGFS-f inverts a d × d matrix for each new sample point
SGFS-d inverts a diagonal matrix to reduce sampling time

MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]

10000 images, 51 features, binary label indicating whether the image is of a 7 or a 9

Bayesian logistic regression posterior P

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 18 / 33


Selecting Samplers

[Figure: left, IMQ kernel Stein discrepancy vs. number of sample points n for SGFS-d and SGFS-f; right, best- and worst-aligned bivariate marginals, e.g. (x7, x51) and (x8, x42) for SGFS-d, (x32, x34) and (x2, x25) for SGFS-f]

Left: IMQ KSD quality comparison for SGFS Bayesian logistic regression (no surrogate ground truth used)

Right: SGFS sample points (n = 5 × 10⁴) with bivariate marginal means and 95% confidence ellipses (blue) that align best and worst with surrogate ground truth sample (red)

Both suggest small speed-up of SGFS-d (0.0017s per sample vs. 0.0019s for SGFS-f) outweighed by loss in inferential accuracy

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 19 / 33


Beyond Sample Quality Comparison

Goodness-of-fit testing

Chwialkowski, Strathmann, and Gretton [2016] used the KSD S(Qn, TP, Gk) to test whether a sample was drawn from a target distribution P (see also Liu, Lee, and Jordan [2016])

Test with default Gaussian kernel k experienced considerable loss of power as the dimension d increased

We recreate their experiment with IMQ kernel (β = −1/2, c = 1)

For n = 500, generate sample (xi)_{i=1}^n with xi = zi + ui e1, zi i.i.d. ∼ N(0, Id) and ui i.i.d. ∼ Unif[0, 1]. Target P = N(0, Id).

Compare with standard normality test of Baringhaus and Henze [1988]

Table: Mean power of multivariate normality tests across 400 simulations

           d=2    d=5    d=10   d=15   d=20   d=25
B&H        1.0    1.0    1.0    0.91   0.57   0.26
Gaussian   1.0    1.0    0.88   0.29   0.12   0.02
IMQ        1.0    1.0    1.0    1.0    1.0    1.0
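The perturbed sample used in this experiment, and the KSD statistic it feeds, can be sketched as below by reusing the imq_ksd sketch given after the kernel Stein set slide; the wild bootstrap used to calibrate the actual test is omitted here.

```python
# Perturbed sample x_i = z_i + u_i e_1 with z_i ~ N(0, I_d), u_i ~ Unif[0,1],
# scored against the target P = N(0, I_d); d = 10 is one of the dimensions studied.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
z = rng.standard_normal((n, d))
x = z.copy()
x[:, 0] += rng.uniform(0.0, 1.0, size=n)       # shift the first coordinate only
score = lambda X: -X                           # grad log p for N(0, I_d)
stat = imq_ksd(x, score, c=1.0, beta=-0.5)     # large values favor rejecting Q = P
```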

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 20 / 33


Beyond Sample Quality Comparison

Improving sample quality

Given sample points (xi)_{i=1}^n, can minimize KSD S(Qn, TP, Gk) over all weighted samples Qn = ∑_{i=1}^n qn(xi) δ_{xi} for qn a probability mass function

Liu and Lee [2016] do this with Gaussian kernel k(x, y) = exp(−‖x − y‖₂²/h)

Bandwidth h set to median of the squared Euclidean distance between pairs of sample points

We recreate their experiment with the IMQ kernel k(x, y) = (1 + ‖x − y‖₂²/h)^(−1/2)
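A rough sketch of the reweighting step follows (an exponentiated-gradient solver over the simplex, not Liu and Lee's exact procedure): minimize the quadratic form q⊤ K0 q, where K0 collects the Stein kernel evaluations k0(xi, xj) for the chosen kernel.

```python
# Reweight a fixed sample to reduce its KSD: minimize q^T K0 q over the
# probability simplex by exponentiated gradient (a sketch, not the paper's solver).
import numpy as np

def reweight_sample(K0, n_iters=1000, step=0.1):
    n = K0.shape[0]
    q = np.full(n, 1.0 / n)             # start from the equally weighted sample
    for _ in range(n_iters):
        grad = 2.0 * K0 @ q             # gradient of the quadratic form
        q = q * np.exp(-step * grad)    # multiplicative update
        q /= q.sum()                    # renormalize onto the simplex
    return q                            # weights q_n(x_i) of the improved sample
```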

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 21 / 33


Improving Sample Quality

[Figure: average MSE ‖EP Z − EQ̃n X‖₂²/d vs. dimension d (2 to 100) for the initial Qn and the Gaussian- and IMQ-KSD reweighted samples]

MSE averaged over 500 simulations (±2 standard errors)

Target P = N (0, Id)

Starting sample Qn = (1/n) ∑_{i=1}^n δ_{xi} for xi i.i.d. ∼ P, n = 100.

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 22 / 33


Generating High-quality Samples

Stein Variational Gradient Descent (SVGD) [Liu and Wang, 2016]

Uses KSD to repeatedly update locations of n sample points:
xi ← xi + (ε/n) ∑_{l=1}^n (k(xl, xi) ∇ log p(xl) + ∇_{xl} k(xl, xi))

Approximates gradient step in KL divergence (but convergence is unclear)
Simple to implement (but each update costs n² time)
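A compact sketch of one SVGD update is given below; the squared-exponential kernel, step size ε, and bandwidth h are assumed fixed here, whereas practical implementations adapt them.

```python
# One SVGD update: x_i <- x_i + (eps/n) sum_l [k(x_l, x_i) grad log p(x_l)
#                                              + grad_{x_l} k(x_l, x_i)].
import numpy as np

def svgd_step(X, grad_log_p, eps=0.1, h=1.0):
    """X: (n, d) particles; grad_log_p maps (n, d) points to (n, d) scores."""
    diff = X[:, None, :] - X[None, :, :]            # x_l - x_i, shape (n, n, d)
    k = np.exp(-np.sum(diff ** 2, axis=-1) / h)     # k(x_l, x_i)
    grad_k = -2.0 / h * diff * k[:, :, None]        # grad_{x_l} k(x_l, x_i)
    phi = (k.T @ grad_log_p(X) + grad_k.sum(axis=0)) / X.shape[0]
    return X + eps * phi
```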

Stein Points [Chen, Mackey, Gorham, Briol, and Oates, 2018]

Greedily minimizes KSD by constructing Qn = (1/n) ∑_{i=1}^n δ_{xi} with

xn ∈ argmin_x S(((n−1)/n) Qn−1 + (1/n) δx, TP, Gk)
   = argmin_x ∑_{j=1}^d [k_j^0(x, x)/2 + ∑_{i=1}^{n−1} k_j^0(xi, x)]

Can generate sample sequence from scratch or, under budget constraints, optimize existing locations by coordinate descent
Sends KSD to zero at O(√(log(n)/n)) rate
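A toy sketch of the greedy criterion above over a finite candidate set follows (the papers instead optimize numerically or over Markov chain proposals); stein_kernel(x, y) is assumed to return the Stein kernel k0(x, y) summed over coordinates, e.g. the IMQ Stein kernel from the earlier sketch.

```python
# Greedy Stein point selection: at each step pick the candidate minimizing
# k0(x, x)/2 + sum_i k0(x_i, x) given the points chosen so far.
import numpy as np

def greedy_stein_points(candidates, stein_kernel, n_points):
    chosen = []
    for _ in range(n_points):
        best, best_val = None, np.inf
        for x in candidates:
            val = 0.5 * stein_kernel(x, x) + sum(stein_kernel(xi, x) for xi in chosen)
            if val < best_val:
                best, best_val = x, val
        chosen.append(best)
    return np.array(chosen)
```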

Stein Point MCMC [Chen, Barp, Briol, Gorham, Girolami, Mackey, and Oates, 2019]

Suffices to optimize over iterates of a Markov chain

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 23 / 33


Generating High-quality Samples

[Figure: SP-MCMC vs. MCMC sample points]

Stein Point MCMC [Chen, Barp, Briol, Gorham, Girolami, Mackey, and Oates, 2019]

Greedily minimizes KSD by constructing Qn = (1/n) ∑_{i=1}^n δ_{xi} with

xn ∈ argmin_x S(((n−1)/n) Qn−1 + (1/n) δx, TP, Gk)
   = argmin_x ∑_{j=1}^d [k_j^0(x, x)/2 + ∑_{i=1}^{n−1} k_j^0(xi, x)]

Suffices to optimize over iterates of a Markov chain

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 24 / 33


Generating High-quality Samples

Goodwin oscillator: kinetic model of oscillatory enzymatic control

Measure mRNA and protein product (yi)_{i=1}^{40} at times (ti)_{i=1}^{40}

yi = g(u(ti)) + εi, εi i.i.d. ∼ N(0, σ²I)

du(t)/dt = fθ(t, u), u(0) = u0 ∈ R⁸

P is posterior of log(θ) ∈ R¹⁰ given y, t and log Γ(2, 1) priors
Evaluating likelihood requires solving the ODE numerically

[Figure: log KSD vs. log n_eval for MALA, RWM, SVGD, MED, SP, SP-MALA LAST, SP-MALA INFL, SP-RWM LAST, SP-RWM INFL]

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 25 / 33


Generating High-quality Samples

[Figure: log EP vs. log n_eval for RWM, SVGD, and SP-RWM INFL]

IGARCH model of financial time series with time-varying volatility

Daily percentage returns y = (yt)_{t=1}^{2000} from S&P 500 modeled as

yt = σt εt, εt i.i.d. ∼ N(0, 1)

σt² = θ1 + θ2 y²_{t−1} + (1 − θ2) σ²_{t−1}

P is posterior of θ1 > 0, θ2 ∈ (0, 1) given y and uniform priors
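A minimal sketch of the IGARCH(1,1) log-likelihood defining this posterior is given below; the initial variance sigma2_0 is an assumed value, the first observation's contribution is dropped, and the uniform priors are omitted.

```python
# Log-likelihood of y under y_t = sigma_t * eps_t, eps_t ~ N(0,1),
# sigma_t^2 = theta1 + theta2 * y_{t-1}^2 + (1 - theta2) * sigma_{t-1}^2.
import numpy as np

def igarch_log_lik(theta1, theta2, y, sigma2_0=1.0):
    sigma2, ll = sigma2_0, 0.0
    for t in range(1, len(y)):
        sigma2 = theta1 + theta2 * y[t - 1] ** 2 + (1.0 - theta2) * sigma2
        ll += -0.5 * (np.log(2.0 * np.pi * sigma2) + y[t] ** 2 / sigma2)
    return ll
```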

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 26 / 33


Future Directions

Many opportunities for future development

1 Improving scalability while maintaining convergence control

Subsampling of likelihood terms in ∇ log p
Inexpensive approximations of kernel matrix
Finite set Stein discrepancies [Jitkrittum, Xu, Szabo, Fukumizu, and Gretton, 2017]: low-rank kernel, linear runtime (but convergence control unclear)
Random feature Stein discrepancies [Huggins and Mackey, 2018]: stochastic low-rank kernel, near-linear runtime + high probability convergence control when (Qn)n≥1 moments uniformly bounded

2 Exploring the impact of Stein operator choice

An infinite number of operators T characterize P
How is discrepancy impacted? How do we select the best T?
Thm: If ∇ log p bounded and k ∈ C₀^(1,1), then S(Qn, TP, Gk) → 0 for some (Qn)n≥1 not converging to P
Diffusion Stein operators (T g)(x) = (1/p(x)) 〈∇, p(x) a(x) g(x)〉 of Gorham, Duncan, Vollmer, and Mackey [2019] may be appropriate for heavy tails

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 27 / 33


Future Directions

Many opportunities for future development

3 Addressing other inferential tasks

Training generative adversarial networks [Wang and Liu, 2016] and variational autoencoders [Pu, Gan, Henao, Li, Han, and Carin, 2017]
Non-convex optimization

Example (Optimization with Discretized Diffusions [Erdogdu, Mackey, and Shamir, 2018])

To minimize f(x), choose a(x) ⪰ cI with a(x)∇f(x) Lipschitz and distantly dissipative (〈a(x)∇f(x) − a(y)∇f(y), x − y〉 / ‖x − y‖₂² ≥ k for ‖x − y‖₂ ≥ r)

Approximate target sequence pn(x) ∝ e^(−γn f(x)) using Markov chain
xn+1 ∼ N(xn − (εn/2) a(xn)∇f(xn) + (εn/(2γn)) 〈∇, a(xn)〉, (εn/γn) a(xn))

Thm: min_{1≤i≤n} E f(xi) → min_x f(x) (with explicit error bounds) for appropriate εn and γn when ∇f, ∇a, and a^(1/2) are Lipschitz

Mackey (MSR) Inference and Learning with Stein’s Method November 6, 2019 28 / 33


Future Directions

Non-convex optimization example [Erdogdu, Mackey, and Shamir, 2018]:
min_x f(x) = 5 log(1 + ‖x‖₂²/2), with a(x) = (1 + ‖x‖₂²/2) I, so a(x)∇f(x) = 5x (a numerical sketch follows below)
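To make the example concrete, here is a hedged usage sketch for this particular f and a. It reuses the hypothetical run_diffusion helper sketched above; the starting point, step size, and inverse temperature are arbitrary illustration values.

```python
import numpy as np

# f(x) = 5 log(1 + ||x||^2 / 2), a(x) = (1 + ||x||^2 / 2) I, so a(x) grad f(x) = 5 x
def grad_f(x):
    return 5.0 * x / (1.0 + 0.5 * np.dot(x, x))

def a(x):
    return (1.0 + 0.5 * np.dot(x, x)) * np.eye(x.size)

def div_a(x):
    # i-th entry of <grad, a(x)>: sum_j d a_ij / d x_j = d/dx_i (1 + ||x||^2 / 2) = x_i
    return x.copy()

# xs = run_diffusion(np.full(2, 3.0), grad_f, a, div_a, n_steps=2000, eps=1e-2, gamma=5.0)
```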


Future Directions

Many opportunities for future development

1 Improving scalability while maintaining convergence control
Subsampling of likelihood terms in ∇ log p
Inexpensive approximations of the kernel matrix
Finite set Stein discrepancies [Jitkrittum, Xu, Szabo, Fukumizu, and Gretton, 2017]
Random feature Stein discrepancies [Huggins and Mackey, 2018]

2 Exploring the impact of Stein operator choice
An infinite number of operators T characterize P
How is the discrepancy impacted? How do we select the best T?
Thm: If ∇ log p is bounded and k ∈ C_0^(1,1), then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P
Diffusion Stein operators (T g)(x) = (1/p(x)) ⟨∇, p(x) m(x) g(x)⟩ of Gorham, Duncan, Vollmer, and Mackey [2019] may be appropriate for heavy tails (a worked expansion follows below)

3 Addressing other inferential tasks
Training generative adversarial networks [Wang and Liu, 2016] and variational autoencoders [Pu, Gan, Henao, Li, Han, and Carin, 2017]
Non-convex optimization [Erdogdu, Mackey, and Shamir, 2018]
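As a small aside on the diffusion Stein operators mentioned above, expanding the divergence with the product rule (a sketch; m(x) denotes the matrix-valued coefficient from the slide) shows that the operator depends on p only through ∇ log p, so unnormalized target densities suffice:

(T g)(x) = (1/p(x)) ⟨∇, p(x) m(x) g(x)⟩ = ⟨∇ log p(x), m(x) g(x)⟩ + ⟨∇, m(x) g(x)⟩,

since ∇p(x)/p(x) = ∇ log p(x); taking m(x) = I recovers the Langevin Stein operator (T g)(x) = ⟨∇ log p(x), g(x)⟩ + ⟨∇, g(x)⟩.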


References I

S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, ICML'12, 2012.

A. D. Barbour. Stein's method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. ISSN 0021-9002. A celebration of applied probability.

A. D. Barbour. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3):297–322, 1990. ISSN 0178-8051. doi: 10.1007/BF01197887.

L. Baringhaus and N. Henze. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1):339–348, 1988.

S. Chatterjee and E. Meckes. Multivariate normal approximation using exchangeable pairs. ALEA Lat. Am. J. Probab. Math. Stat., 4:257–283, 2008. ISSN 1980-0436.

S. Chatterjee and Q. Shao. Nonnormal approximation by Stein's method of exchangeable pairs with application to the Curie-Weiss model. Ann. Appl. Probab., 21(2):464–483, 2011. ISSN 1050-5164. doi: 10.1214/10-AAP712.

L. Chen, L. Goldstein, and Q. Shao. Normal approximation by Stein's method. Probability and its Applications. Springer, Heidelberg, 2011. ISBN 978-3-642-15006-7. doi: 10.1007/978-3-642-15007-4.

W. Y. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. Oates. Stein points. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 844–853, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

W. Y. Chen, A. Barp, F.-X. Briol, J. Gorham, M. Girolami, L. Mackey, and C. Oates. Stein point Markov chain Monte Carlo. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1011–1021, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/chen19b.html.

K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, ICML, 2016.

M. A. Erdogdu, L. Mackey, and O. Shamir. Global non-convex optimization with discretized diffusions. In Advances in Neural Information Processing Systems, pages 9694–9703, 2018.

J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Adv. NIPS 28, pages 226–234. Curran Associates, Inc., 2015.

J. Gorham and L. Mackey. Measuring sample quality with kernels. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 1292–1301. PMLR, 2017.


References II

J. Gorham, A. B. Duncan, S. J. Vollmer, and L. Mackey. Measuring sample quality with diffusions. Ann. Appl. Probab., 29(5):2884–2928, 10 2019. doi: 10.1214/19-AAP1467. URL https://doi.org/10.1214/19-AAP1467.

F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.

J. Huggins and L. Mackey. Random feature Stein discrepancies. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1903–1913. Curran Associates, Inc., 2018.

W. Jitkrittum, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems, 2017.

A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. of 31st ICML, ICML'14, 2014.

Q. Liu and J. Lee. Black-box importance sampling. arXiv:1610.05247, Oct. 2016. To appear in AISTATS 2017.

Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv:1608.04471, Aug. 2016.

Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. of 33rd ICML, volume 48 of ICML, pages 276–284, 2016.

L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. Electron. Commun. Probab., 21:14 pp., 2016. doi: 10.1214/16-ECP15.

E. Meckes. On Stein's method for multivariate normal approximation. In High dimensional probability V: the Luminy volume, volume 5 of Inst. Math. Stat. Collect., pages 153–178. Inst. Math. Statist., Beachwood, OH, 2009. doi: 10.1214/09-IMSCOLL511.

A. Müller. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab., 29(2):429–443, 1997.

C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016. ISSN 1467-9868. doi: 10.1111/rssb.12185.

Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, pages 4237–4246, 2017.


References III

G. Reinert and A. Röllin. Multivariate normal approximation with Stein's method of exchangeable pairs under a general linearity condition. Ann. Probab., 37(6):2150–2173, 2009. ISSN 0091-1798. doi: 10.1214/09-AOP467.

C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. II: Probability theory, pages 583–602. Univ. California Press, Berkeley, Calif., 1972.

C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein's method: expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser., pages 1–26. Inst. Math. Statist., Beachwood, OH, 2004.

D. Wang and Q. Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv:1611.01722, Nov. 2016.

M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.


Comparing Discrepancies

[Figure: two panel pairs. Left pair ("i.i.d. from mixture target P" / "i.i.d. from single mixture component"): discrepancy value vs. number of sample points n for the IMQ KSD, graph Stein discrepancy, and Wasserstein distance. Right pair (d = 1 / d = 4): computation time (sec) vs. n. Inset: an example test-function pair g and h = T_P g.]

Left: Samples drawn i.i.d. from either the bimodal Gaussian mixture target p(x) ∝ e^(−(x+1.5)²/2) + e^(−(x−1.5)²/2) or a single mixture component.

Right: Discrepancy computation time using d cores in d dimensions.
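Every Stein discrepancy in this comparison only needs the score ∇ log p of the (unnormalized) target, so a minimal sketch of that score for the bimodal mixture above may be useful; the function name and defaults are illustrative only, with the means and unit variances read off the displayed density.

```python
import numpy as np

def score_bimodal_mixture(x, means=(-1.5, 1.5), var=1.0):
    """d/dx log p(x) for p(x) proportional to sum_k exp(-(x - mu_k)^2 / (2 var))."""
    x = np.asarray(x, dtype=float)
    comps = np.stack([np.exp(-(x - m) ** 2 / (2.0 * var)) for m in means])  # unnormalized component densities
    resp = comps / comps.sum(axis=0)                                        # component responsibilities
    comp_scores = np.stack([-(x - m) / var for m in means])                 # per-component scores
    return (resp * comp_scores).sum(axis=0)
```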


The Importance of Kernel Choice

[Figure: kernel Stein discrepancy vs. number of sample points n for Gaussian, Matérn, and inverse multiquadric base kernels, in dimensions d = 5, 8, 20, computed for an i.i.d. sample from the target P and for an off-target sample.]

Target P = N(0, I_d)

Off-target Q_n has all ‖x_i‖₂ ≤ 2 n^(1/d) log n and ‖x_i − x_j‖₂ ≥ 2 log n

Gaussian and Matérn KSDs are driven to 0 by an off-target sequence that does not converge to P

The IMQ KSD (β = −1/2, c = 1) does not have this deficiency (a computational sketch follows below)
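For readers who want to reproduce comparisons like this one, here is a minimal NumPy sketch of the IMQ kernel Stein discrepancy with β = −1/2 and c = 1. The closed-form Stein kernel below follows from differentiating the base kernel k(x, y) = (c² + ‖x − y‖²)^β; treat it as an illustrative implementation under those assumptions rather than the authors' reference code.

```python
import numpy as np

def imq_ksd(X, score_X, beta=-0.5, c=1.0):
    """Kernel Stein discrepancy with IMQ base kernel k(x, y) = (c^2 + ||x - y||^2)^beta.

    X       : (n, d) array of sample points
    score_X : (n, d) array of target scores grad log p(x_i)
    Returns sqrt(mean_{i,j} k_P(x_i, x_j)) for the Langevin Stein kernel k_P.
    """
    n, d = X.shape
    diffs = X[:, None, :] - X[None, :, :]                 # pairwise differences x_i - x_j, shape (n, n, d)
    sqdists = np.sum(diffs ** 2, axis=-1)                 # squared distances, shape (n, n)
    s = c ** 2 + sqdists
    k = s ** beta                                         # base kernel values
    gradx_k = 2.0 * beta * s[..., None] ** (beta - 1) * diffs            # grad_x k(x_i, x_j)
    trace_term = (-2.0 * beta * d * s ** (beta - 1)
                  - 4.0 * beta * (beta - 1) * sqdists * s ** (beta - 2))  # <grad_x, grad_y> k
    bx = score_X[:, None, :]                              # grad log p(x_i)
    by = score_X[None, :, :]                              # grad log p(x_j)
    kp = (np.sum(bx * by, axis=-1) * k                    # <b(x_i), b(x_j)> k
          + np.sum(by * gradx_k, axis=-1)                 # <b(x_j), grad_x k>
          - np.sum(bx * gradx_k, axis=-1)                 # <b(x_i), grad_y k> = -<b(x_i), grad_x k>
          + trace_term)
    return np.sqrt(max(np.mean(kp), 0.0))
```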


Selecting Sampler Hyperparameters

Setup [Welling and Teh, 2011]

Consider the posterior distribution P induced by L datapoints y_l drawn i.i.d. from the Gaussian mixture likelihood
Y_l | X ∼ (1/2) N(X_1, 2) + (1/2) N(X_1 + X_2, 2)
under Gaussian priors on the parameters X ∈ R²: X_1 ∼ N(0, 10) ⊥⊥ X_2 ∼ N(0, 1)

Draw 100 datapoints y_l with parameters (x_1, x_2) = (0, 1); this induces a posterior with a second mode at (x_1, x_2) = (1, −1)

For a range of parameters ε, run approximate slice sampling for 148,000 datapoint likelihood evaluations and store the resulting posterior sample Q_n

Use the minimum IMQ KSD (β = −1/2, c = 1) to select an appropriate ε (see the selection sketch below)

Compare with a standard MCMC parameter selection criterion, effective sample size (ESS), a measure of Markov chain autocorrelation

Compute the median of each diagnostic over 50 random sequences
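A hedged sketch of that selection step follows: for each candidate ε, the stored sample and its target scores are fed to the imq_ksd helper sketched earlier (a hypothetical function, as are the sample dictionary and score routine below, standing in for the experiment's actual outputs).

```python
def select_epsilon(samples_by_eps, score_fn):
    """Pick the eps whose stored sample minimizes the IMQ KSD (beta = -1/2, c = 1).

    samples_by_eps : dict mapping eps -> (n, d) array of sample points Q_n
    score_fn       : callable returning grad log p at an (n, d) array of points
    Assumes the imq_ksd sketch from earlier is in scope.
    """
    ksd_by_eps = {eps: imq_ksd(X, score_fn(X)) for eps, X in samples_by_eps.items()}
    best_eps = min(ksd_by_eps, key=ksd_by_eps.get)
    return best_eps, ksd_by_eps
```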


Selecting Samplers

Setup

MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]: 10,000 images, 51 features, binary label indicating whether the image shows a 7 or a 9

Bayesian logistic regression posterior P

L independent observations (y_l, v_l) ∈ {1, −1} × R^d with P(Y_l = 1 | v_l, X) = 1/(1 + exp(−⟨v_l, X⟩))

Flat improper prior on the parameters X ∈ R^d

Use the IMQ KSD (β = −1/2, c = 1) to compare SGFS-f to SGFS-d, drawing 10^5 sample points and discarding the first half as burn-in (the posterior score used by the KSD is sketched below)

For external support, compare bivariate marginal means and 95% confidence ellipses with a surrogate ground-truth Hamiltonian Monte Carlo chain with 10^5 sample points [Ahn, Korattikara, and Welling, 2012]
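Because the KSD comparison only requires the posterior score, here is a minimal sketch of ∇_X log p(X | data) for this flat-prior Bayesian logistic regression with labels y_l ∈ {1, −1} (an illustrative implementation; the variable names are not from the original experiments).

```python
import numpy as np

def logistic_posterior_score(X, V, y):
    """grad_X log p(X | y, V) for Bayesian logistic regression with a flat prior.

    X : (d,) parameter vector
    V : (L, d) matrix of feature vectors v_l
    y : (L,) labels in {1, -1}
    The posterior is proportional to prod_l sigma(y_l <v_l, X>), sigma(t) = 1 / (1 + exp(-t)).
    """
    t = y * (V @ X)                          # y_l <v_l, X>
    sigma_neg = 1.0 / (1.0 + np.exp(t))      # sigma(-t_l) = 1 - sigma(t_l)
    return V.T @ (y * sigma_neg)             # sum_l y_l sigma(-t_l) v_l
```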


The Importance of Tightness

Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P

A sequence (Q_n)_{n≥1} is uniformly tight if for every ε > 0, there is a finite number R(ε) such that sup_n Q_n(‖X‖₂ > R(ε)) ≤ ε

Intuitively, no mass in the sequence escapes to infinity

Theorem (KSD detects tight non-convergence [Gorham and Mackey, 2017])
Suppose that P ∈ P and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If (Q_n)_{n≥1} is uniformly tight and S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.

Good news, but, ideally, the KSD would detect non-tight sequences automatically...
