arXiv:1906.05323v2 [cs.NE] 25 Oct 2019 · 2011; Blundell et al. 2015; Paisley, Blei, and Jordan...

Specifying Weight Priors in Bayesian Deep Neural Networks with Empirical Bayes

Ranganath Krishnan∗

[email protected] Labs

Mahesh Subedar∗[email protected]

Intel Labs

Omesh [email protected]

Intel Labs

Abstract

Stochastic variational inference for Bayesian deep neural net-work (DNN) requires specifying priors and approximate pos-terior distributions over neural network weights. Specifyingmeaningful weight priors is a challenging problem, particu-larly for scaling variational inference to deeper architecturesinvolving high dimensional weight space. We propose MOdelPriors from Empirical Bayes using DNN (MOPED) methodto choose informed prior distributions for Bayesian neural net-work weights. We formulate a two-stage hierarchical modeling,first find the maximum likelihood estimates of weights withDNN, and then set the weight priors using empirical Bayes ap-proach to infer the posterior with variational inference. We em-pirically evaluate the proposed approach on real-world tasksincluding image classification, video activity recognition andaudio classification with varying complex neural network ar-chitectures. We also evaluate our proposed approach on di-abetic retinopathy diagnosis task and benchmark with thestate-of-the-art Bayesian deep learning techniques. We demon-strate MOPED method enables scalable variational inferenceand provides reliable uncertainty quantification.

1 IntroductionUncertainty estimation in deep neural network (DNN) predic-tions is essential for designing reliable and robust AI systems.Bayesian deep neural networks (Neal 1995; Gal 2016) hasallowed bridging deep learning and probabilistic Bayesiantheory to quantify uncertainty by borrowing the strengthsof both methodologies. Variational inference (VI) (Blei, Ku-cukelbir, and McAuliffe 2017) is an analytical approximationtechnique to infer the posterior distribution of model parame-ters. VI methods formulate the Bayesian inference problemas an optimization-based approach which lends itself to thestochastic gradient descent based optimization used in train-ing DNN models. VI with generalizable formulations (Graves2011; Blundell et al. 2015; Paisley, Blei, and Jordan 2012;Ranganath, Gerrish, and Blei 2013) has renewed interest inBayesian neural networks.

The recent research in Bayesian Deep Learning (BDL) isfocused on scaling the VI to more complex models to improvethe model performance while managing complexity of the

∗Equal Contribution

solution. The scalability of VI in Bayesian DNNs to practicalapplications involving deep models and large-scale datasets isan open problem. On the contrary, DNNs are shown to havestructural benefits (Bengio, Courville, and Vincent 2013)which helps them in learning complex models on largerdatasets. The convergence speed and performance (Good-fellow, Bengio, and Courville 2016) of DNN models heavilydepend on the initialization of model weights and other hy-per parameters. The transfer learning approaches (Shin et al.2016) demonstrate the benefit of fine tuning the pretrainedDNN models from adjacent domains in order to achieve fasterconvergence and better accuracies.

Variational inference for Bayesian DNN involves choosingprior distributions and approximate posterior distributionsover neural network weights. In a pure Bayesian approach,prior distribution is specified before any data is observed.But specifying meaningful priors in large Bayesian DNNmodels with high dimensional weight space is a challeng-ing problem, as it is practically difficult to have prior be-lief on millions of parameters that we intend to estimatethrough Bayesian inference. It is an active area of researchin Bayesian Deep Learning community (Wu et al. 2018;Nalisnick, Hernández-Lobato, and Smyth 2019; Sun et al.2019; Atanov et al. 2018). Empirical Bayes (Robbins 1956;Casella 1992) methods estimates prior distribution from thedata, which is in contrast to typical Bayesian approach. Basedon Empirical Bayes and transfer learning approaches, wepropose MOdel Priors from Empirical Bayes using DNN(MOPED) method to initialize the weight priors in BayesianDNNs, which in our experiments have shown to providegood prior knowledge for the weights in large models. Ourempirical results indicate this approach guarantees conver-gence of the model along with better uncertainty estimateswithout sacrificing the accuracies provided by state-of-the-artdeterministic DNNs.

Our main contributions include:

• We propose MOPED method to choose informed priordistributions for Bayesian neural network weights usingEmpirical Bayes framework. MOPED advances the currentstate-of-the-art by enabling scalable variational inferencefor large models applied to real-world tasks.

• We demonstrate with thorough empirical experiments on

arX

iv:1

906.

0532

3v2

[cs

.NE

] 2

5 O

ct 2

019

real-world tasks that the MOPED method provides bet-ter model performance, reliable uncertainty quantificationand faster training convergence. We benchmark MOPEDagainst state-of-the-art Bayesian deep learning baselinesprovided in (Filos et al. 2019) and demonstrate better ac-curacy and predictive uncertainty estimates.

The rest of the document is organized as below. We pro-vide background material for the proposed approach includ-ing Bayesian neural networks, variational inference, Empir-ical Bayes and different types of uncertainties. Then, thedetails of proposed method for initializing the weight priorsin Bayesian DNN models is presented. Followed by empiricalexperiments and results supporting the claims of proposedMOPED method with various real-world tasks and bench-mark with state-of-the-art BDL techniques (baselines).

2 Background2.1 Bayesian neural networksBayesian neural networks provide a probabilistic interpreta-tion of deep learning models by placing distributions over theneural network weights (Neal 1995). Given training datasetD = {x, y} with inputs x = {x1, ..., xN} and their corre-sponding outputs y = {y1, ..., yN}, in parametric Bayesiansetting we would like to infer a distribution over parametersw as a function y = fw(x) that represents the DNN model.A prior distribution is assigned over the weights p(w) thatcaptures our prior belief as to which parameters would havelikely generated the outputs before observing any data. Giventhe evidence data p(y|x), prior distribution and model likeli-hood p(y |x,w), the goal is to infer the posterior distributionover the weights p(w|D):

p(w|D) =p(y |x,w) p(w)∫p(y |x,w) p(w) dw

(1)

Computing the posterior distribution p(w|D) is often in-tractable, some of the previously proposed techniques toachieve an analytically tractable inference include MarkovChain Monte Carlo (MCMC) sampling based probabilisticinference (Neal 2012; Welling and Teh 2011), variationalinference (Graves 2011; Ranganath, Gerrish, and Blei 2013;Blundell et al. 2015), expectation propagation (EP) (Minka2001) and Monte Carlo dropout approximate inference(Galand Ghahramani 2016) .

Predictive distribution is obtained through multiplestochastic forward passes through the network during theprediction phase while sampling from the posterior distribu-tion of network parameters through Monte Carlo estimators.Equation 2 shows the predictive distribution of the output y∗given new input x∗:

p(y∗|x∗, D) =

∫p(y∗|x∗, w) p(w |D)dw

p(y∗|x∗, D) ≈ 1

T

T∑i=1

p(y∗|x∗, wi) , wi ∼ p(w |D)

(2)

where, T is number of Monte Carlo samples.

2.2 Variational inferenceVariational inference approximates a complex probabilitydistribution p(w|D) with a simpler distribution qθ(w), pa-rameterized by variational parameters θ while minimiz-ing the Kullback-Leibler (KL) divergence (Bishop 2006).Minimizing the KL divergence is equivalent to maximiz-ing the log evidence lower bound (ELBO) (Bishop 2006;Gal and Ghahramani 2016), as shown in Equation 3.

L :=

∫qθ(w) log p(y|x,w) dw −KL[qθ(w)||p(w)] (3)

In mean field variation inference, weights are modeledwith fully factorized Gaussian distribution parameterized byvariational parameters µ and σ.

qθ(w) := N (w |µ, σ) (4)The variational distribution qθ(w) and its parameters µ andσ are learnt while optimizing the cost function ELBO withthe stochastic gradient steps.

A wide variety of variational techniques (Peterson andHartman 1989; Neal and Hinton 1998; Ghahramani and Beal2001) are proposed for different class of model specific for-mulations. (Paisley, Blei, and Jordan 2012; Ranganath, Ger-rish, and Blei 2013) presented a more generic formulation ofvariational inference with iterative updates using stochasticoptimization. (Graves 2011) proposed fully factorized Gaus-sian posteriors and a differentiable loss function. (Blundellet al. 2015) proposed a Bayes by Backprop method whichlearns probability distribution on the weights of the neuralnetwork by minimizing loss function. (Wen et al. 2018) pro-posed a Flipout method to apply pseudo-independent weightperturbations to decorrelate the gradients within mini-batches.Flipout method is effective in regularization over many differ-ent neural network architectures and achieves linear variancereduction.

2.3 Empirical BayesEmpirical Bayes (EB) (Casella 1992) methods lie in betweenfrequestist and Bayesian statistical approaches as it attemptsto leverage strengths from both methodologies. EB methodsare considered as approximation to a fully Bayesian treat-ment of a hierarchical Bayes model. EB methods estimatesprior distribution from the data, which is in contrast to typi-cal Bayesian approach. The idea of Empirical Bayes is notnew and the original formulation of Empirical Bayes datesback to 1950s (Robbins 1956), which is non-parametric EB.Since then, many parametric formulations has been proposedand used in wide variety of applications. However, we arenot aware of any prior art using EB in variational inferencefor Bayesian deep neural networks. We use parametric Em-pirical Bayes approach in our proposed method for meanfield variational inference in Bayesian deep neural network,where weights are modeled with fully factorized Gaussiandistribution.

Parametric EB specifies a family of prior distributionsp(w|λ) where λ is a hyper-parameter. Analogous to Equa-tion 1, posterior distribution can be obtained with EB as givenby Equation 5.

p(w|D,λ) =p(y |x,w) p(w |λ)∫p(y |x,w) p(w |λ) dw

(5)

Bayesian DNN Validation Accuracy

Complexity Bayesian DNN

Dataset Modality Architecture (# parameters) DNN MFVI MOPED_MFVI

UCF-101 Video ResNet-101 C3D 170,838,181 0.851 0.029 0.867UrbanSound8K Audio VGGish 144,274,890 0.817 0.143 0.819

Diabetic Retinopathy Images VGG 21,242,689 0.842 0.843 0.857

CIFAR-10 ImagesResnet-56 1,714,250 0.926 0.896 0.927Resnet-20 546,314 0.911 0.878 0.916

MNIST Images LeNeT 1,090,856 0.994 0.993 0.995Fashion-MNIST Images SCNN 442,218 0.921 0.906 0.923

Table 1: Accuracies for architectures with different complexities and input modalities. Mean field variational inference with MOPEDinitialization (MOPED_MFVI) obtains reliable uncertainty estimates from Bayesian DNNs while achieving similar or better accuracy asthe deterministic DNNs. Mean field variational inference with random priors (MFVI) has convergence issues (shown in red) for complexarchitectures, while the proposed method shows consistent results. DNN and MFVI accuracy numbers for diabetic retinopathy dataset areobtained from BDL benchmarks.

2.4 Uncertainty QuantificationUncertainty estimation is essential to build reliable and ro-bust AI systems, which is pivotal to understand system’sconfidence in predictions and decision-making. BayesianDNNs allows us to capture uncertainty in deep learning mod-els from Bayesian perspective. Bayesian modelling allowsto capture different types of uncertainties: “Aleatoric” un-certainty and “Epistemic”uncertainty (Gal 2016). Aleatoricuncertainty, also known as input uncertainty, captures noiseinherent with observation. Epistemic uncertainty, also knownas model uncertainty captures lack of knowledge in represent-ing model parameters, specifically in the scenario of limiteddata.

We evaluate the model uncertainty using Bayesian activelearning by disagreement (BALD) (Houlsby et al. 2011),which quantifies mutual information between parameter pos-terior distribution and predictive distribution.

BALD := H(y∗|x∗, D)− Ep(w|D)[H(y∗|x∗, w)] (6)

where, H(y∗|x∗, D) is the predictive entropy as shown inEquation 7. Predictive entropy captures a combination ofinput uncertainty and model uncertainty.

H(y∗|x∗, D) := −K−1∑i=0

piµ ∗ log piµ (7)

and piµ is predictive mean probability of ith class fromT Monte Carlo samples, and K is total number of outputclasses.

3 MOPED: informed weight priorsMOPED advances the current state-of-the-art in variationalinference for Bayesian DNNs by providing a way for specify-ing meaningful prior distributions and approximate posteriordistributions over weights using Empirical Bayes framework.Empirical Bayes framework borrows strengths from bothclassical (frequentist) and Bayesian statistical methodolo-gies.

We formulate a two-stage hierarchical modeling approach,first find the maximum likelihood estimates of weights withDNN, and then set the weight priors using empirical Bayesapproach to infer the posterior with variational inference.

We illustrate our proposed approach on mean-field vari-ational inference (MFVI). For MFVI in Bayesian DNNs,weights are modeled with fully factorized Gaussian distri-bution parameterized by variational parameters, i.e. eachweight is independently sampled from the Gaussian dis-tribution w = N (w, σ), where w is mean and varianceσ = log(1 + exp(ρ)). In order to ensure non-negative vari-ance, σ is expressed in terms of softplus function with uncon-strained parameter ρ. Here we propose to set the w mean ofweight priors from the weights obtained through maximumlikelihood estimation using DNN of equivalent architecture.

w = wd; ρ ∼ N (ρ,∆ρ)

w ∼ N (wd, log(1 + eρ))(8)

where, wd represents weights obtained from deterministicDNN model, and (ρ, ∆ρ) are hyper parameters (mean andvariance of Gaussian perturbation for ρ).

For Bayesian DNNs of complex architectures involvingvery high dimensional weight space (hundreds of millionsof parameters), choice of ρ can be sensitive as values of theweights can vary by large margin with each other.

So, we propose to initialize the weight priors with wd andρ given in Equation 9:

w = wd ; ρ = log(eδ|wd| − 1)

w ∼ N (wd, δ | wd |))(9)

where, δ is initial perturbation factor for the weight in termsof percentage of the pretrained deterministic weight values.

In the next section, we demonstrate the benefits of MOPEDmethod for variational inference with extensive empirical ex-periments. We showcase the proposed MOPED method helpsBayesian DNN architectures to achieve better model perfor-mance, faster training convergence and reliable uncertaintyestimates.

Bayesian DNN AuPR AuROC

Dataset Archiectures MFVI MOPED_MFVI MFVI MOPED_MFVI

UCF-101 ResNet-101 C3D 0.0174 0.9186 0.6217 0.9967Urban Sound 8K VGGish 0.1166 0.8972 0.551 0.9811

CIFAR-10ResNet-20 0.9265 0.9622 0.9877 0.9941ResNet-56 0.9225 0.9799 0.987 0.9970

MNIST LeNet 0.9996 0.9997 0.9999 0.9999Fashion-MNIST SCNN 0.9722 0.9784 0.9962 0.9969

Table 2: Comparison of AUC of precision-recall (AUPR) and ROC (auROC) for models with varying complexities. MOPED methodoutperforms training with random initialization of weight priors.

(a) Training convergence curves (b) AuPR curves (c) Precision-recall

Figure 1: Comparison of MOPED and random initialization of priors for Bayesian ResNet-20 and ResNet-56 architectures. (a) trainingconvergence, (b) AuPR for different percentage of retained data based on model uncertainty and (c) precision-recall plots. Plot (a) demonstratesMOPED method achieves faster convergence speed and attains higher accuracy values than the random initialization of priors. Plot (b) showsthe precision-recall AUC values for different percentage of retained data based on model uncertainty. The MOPED method provides betterAUPR and reliable uncertainty estimates than the random initialization of priors. Plot (c) showcases the precision-recall curves. (Plots are bestviewed in color.)

4 Related WorkDeterministic pretraining (Molchanov, Ashukha, and Vetrov2017; Sønderby et al. 2016) has been used to improve modeltraining for variational probabilistic models. Molchanov,Ashukha, and Vetrov use a pretrained network for Sparse Vari-ational Dropout method by training the original architecturewithout Sparse Variational Dropout until full convergence.Sønderby et al. use a warm-up method for variational-autoencoder by rescaling the KL-divergence term with a scalarterm β, which is increased linearly from 0 to 1 during thefirst N epochs of training. Whereas, in our method we use thepoint-estimates from pretrained standard DNN of the samearchitecture to set the informed priors and the model is op-timized with MFVI using full-scale KL-divergence term inELBO.

Choosing weight priors in Bayesian neural networks (Wuet al. 2018; Atanov et al. 2018; Sun et al. 2019) is an activearea of research. Atanov et al. propose implicit priors forvariational inference in convolutional neural networks thatexploit generative models. Nguyen et al. use prior with zeromean and unit variance, and initialize the optimizer at the

mean of the MLE model and a very small initial variance forsmall-scale MNIST experiments. Wu et al. modify ELBOwith a deterministic approximation of reconstruction termand use Empirical Bayes procedure for selecting variance ofprior in KL term (with zero prior mean). The authors also cau-tion about inherent scaling of their method could potentiallylimit its practical use for networks with large hidden size. Allof these works have been demonstrated only on small-scalemodels and simple datasets like MNIST. In our method, weretain the stochastic property of the expected log-likelihoodterm with Monte Carlo sampling, and use Empirical Bayesapproach to specify both mean and variance of weight pri-ors and initialize the optimizer based on pretrained DNN.We demonstrate our method on large-scale Bayesian DNNmodels with complex datasets on real-world tasks.

5 ExperimentsWe evaluate proposed method on real-world applications in-cluding image and audio classification, and video activityrecognition. We consider multiple architectures with vary-ing complexity to show the scalability of method in training

(a) Bayesian ResNet-20 (CIFAR-10)

(b) Bayesian ResNet-101 C3D (UCF-101)

Figure 2: Precision-recall AUC (AUPR) plots with different δ scalefactors for initializing variance values in MOPED method.

deep Bayesian models. Our experiments include: a) ResNet-101 C3D (Hara, Kataoka, and Satoh 2018) for video ac-tivity classification on UCF-101(Soomro, Zamir, and Shah2012) dataset b) VGGish(Hershey et al. 2017) for audio clas-sification on UrbanSound8K (Salamon, Jacoby, and Bello2014) dataset, c) Modified version of VGG(Filos et al. 2019)for diabetic retinopathy detection, d) LeNet architecture forMNIST (LeCun et al. 1998) digit classification, and e) Sim-ple convolutional neural network (SCNN) consisting of twoconvolutional layers followed by two dense layers for imageclassification on Fashion-MNIST (Xiao, Rasul, and Vollgraf2017) datasets.

We implemented above Bayesian DNN models and trainedthem using Tensorflow and Tensorflow-Probability (Dillonet al. 2017) frameworks. The variational layers are modeledusing Flipout (Wen et al. 2018), an efficient method thatdecorrelates the gradients within a mini-batch by implicitlysampling pseudo-independent weight perturbations for eachinput. The model weights obtained from the trained DNNmodels are used in MOPED method to initialize Gaussianpriors over weights (Equation 8 and 9 ), as described inMOPED section.

During inference phase, predictive distributions are ob-tained by performing multiple stochastic forward passes overthe network while sampling from posterior distribution of theweights (40 Monte Carlo samples in our experiments). Weevaluate the model uncertainty and predictive uncertaintyusing Bayesian active learning by disagreement (BALD)(Houlsby et al. 2011) (Equation 6) and predictive entropy(Equation 7), respectively. Following (Filos et al. 2019), quan-titative comparison of uncertainty estimates are made bycalculating area under the curve (AUC) of precision-recall(AUPR) values by retaining different percentages (0.5 to 1.0)of most certain test samples (i.e. ignoring most uncertainpredictions based on model/predictive uncertainty estimates).

In Table 1, classification accuracies for architectures withvarious model complexity are presented. Bayesian DNNswith priors initialized with MOPED method achieves simi-lar or better predictive accuracies as compared to equivalentDNN models. Bayesian DNNs with random initializationof Gaussian priors has difficulty in converging to optimalsolution for larger models (ResNet-101 C3D and VGGish)with hundreds of millions of trainable parameters. It is evi-dent from these results that MOPED method guarantees thetraining convergence even for the complex models.

In Figure 1, comparison of mean field variational inferencewith MOPED method (MOPED_MFVI) and mean field vari-ational inference with random initialization of priors (MFVI)is shown for Bayesian ResNet-20 and ResNet-56 architec-tures trained on CIFAR-10 dataset. The AUPR plots cap-ture the precision-recall AUC values for different percent-age of most certain predictions based on the model uncer-tainty estimates. Figure 1 (a) shows faster convergence ofMOPED_MFVI method, while achieving better accuracyvalues. Figure 1 (b) shows that MOPED_MFVI provideshigher AUPR values than MFVI and also that AUPR in-creases as most uncertain predictions are ignored based onthe model uncertainty, indicating the method enables supe-rior performance and provides reliable uncertainty estimates.Figure 1 (c) shows MOPED_MFVI method provides bet-ter precision-recall values compared to MFVI. We show theresults for different selection of ρ values (Equation 8). InFigure 2, we show AUPR plots with different δ values asmentioned in Equation 9.

5.1 Benchmarking uncertainty estimatesBayesian Deep Learning (BDL) benchmarks (Filos et al.2019) is an open-source framework for evaluating deep prob-abilistic machine learning models and their application toreal-world problems. BDL-benchmarks assess both the scal-ability and effectiveness of different techniques for uncer-tainty estimation. The proposed MOPED_MFVI method iscompared with state-of-the-art baseline methods availablein BDL-benchmarks suite on diabetic retinopathy detectiontask (Kaggle ). The evaluation methodology assesses the tech-niques by their diagnostic accuracy and area under receiver-operating-characteristic (AUC-ROC) curve, as a function ofpercentage of retained data based on predictive uncertaintyestimates. It is expected that the models with well-calibrateduncertainty improve their performance (detection accuracyand AUC-ROC) as most certain data is retrained.

50% data retrained 75% data retrained 100% data retrainedMethod AUC Accuracy AUC Accuracy AUC Accuracy

MC Dropout 0.878 0.913 0.852 0.871 0.821 0.845Mean-field VI 0.866 0.881 0.84 0.850 0.821 0.843

Deep Ensembles 0.872 0.899 0.849 0.861 0.818 0.846Deterministic 0.849 0.861 0.823 0.849 0.82 0.842

Ensemble MC Dropout 0.881 0.924 0.854 0.881 0.825 0.853MOPED Mean-field VI 0.948 0.937 0.935 0.914 0.915 0.857

Random 0. 818 0.848 0.820 0.843 0.820 0.842

Table 3: Comparison of Area under the receiver-operating characteristic curve (AUC) and classification accuracy as a function of retained datafrom the BDL benchmark suite. The proposed method demonstrates an improvement of superior performance compared to all the baselinemodels.

(a) Binary Accuracy (b) AUC-ROC

Figure 3: Benchmarking MOPED_MFVI with state-of-art Bayesian deep learning techniques on diabetic retinopathy diagnosis task usingBDL-benchmarks. Accuracy and area under the receiver-operating characteristic curve (AUC-ROC) plots for varied percentage of retained databased on predictive uncertainty. MOPED_MFVI performs better than the other baselines from BDL-benchmarks suite. Shading shows thestandard error.

We have followed the evaluation methodology presented inthe BDL-benchmarks to compare the accuracy and uncertainestimates obtained from our method. We used the hyper-parameters and model architectures as mentioned in BDL-benchmarks. The results for the BDL baseline methods arepresented from (Filos et al. 2019). In Table 3 and Figure 3,quantitative evaluation of AUC and accuracy values for BDLbaseline methods and MOPED_MFVI are presented. Theproposed MOPED_MFVI method demonstrates significantlysuperior performance compared to other state-of-the-art BDLtechniques.

5.2 Robustness to out-of-distribution dataWe evaluate the uncertainty estimates obtained fromMOPED_MFVI to detect out-of-distribution data. Out-of-distribution samples are data points which fall far off fromthe training data distribution. We evaluate two sets of out-of-distribution detection experiments. In the first set, we

Figure 4: Accuracy vs Confidence curves: Networks trained onMNIST and tested on both MNIST and the NotMNIST (out-of-distribution) test sets.

(a) Model uncertainty(Bayesian ResNet-56)

(b) Predictive uncertainty(Bayesian ResNet-56)

(c) Model uncertainty(Bayesian ResNet-101 C3D)

(d) Predictive uncertainty(Bayesian ResNet-101 C3D)

Figure 5: Density histograms obtained from in- and out-of-distribution samples. Bayesian DNN model uncertainty estimates indicate higheruncertainty for out-of-distribution samples as compared to the in-distribution samples.

use CIFAR-10 as the in-distribution samples trained us-ing ResNet-56 Bayesian DNN model. TinyImageNet (Rus-sakovsky et al. 2015) and SVHN (Goodfellow et al. 2013)datasets are used as out-of-distribution samples which werenot seen during the training phase. The density histograms(area under the histogram is normalized to one) for uncer-tainty estimates obtained from the Bayesian DNN models areplotted in Figure 5. The density histograms in Figure 5 (a)& (b) indicate higher uncertainty estimates for the out-of-distribution samples and lower uncertainty values for thein-distribution samples. A similar trend is observed in thesecond set using UCF-101 (Soomro, Zamir, and Shah 2012)and Moments-in-Time (MiT) (Monfort et al. 2018) videoactivity recognition datasets as the in- and out-of-distributiondata, respectively. These results confirm the uncertainty esti-mates obtained from proposed method are reliable and canidentify out-of-distribution data.

In order to evaluate robustness of our method(MOPED_MFVI), we compare state-of-the-art probabilisticdeep learning methods for prediction accuracy as a functionof model confidence. Following the experiments in (Laksh-minarayanan, Pritzel, and Blundell 2017), we trained ourmodel on MNIST training set and tested it on a mix of exam-ples from MNIST and NotMNIST1 (out-of-distribution) testset. The accuracy as a function of confidence plots shouldincrease monotonically, as higher accuracy is expected formore confident results. A robust model should provide lowconfidence for out-of-distribution samples while providinghigh confidence for correct prediction from in-distributionsamples. The proposed variational inference method withMOPED priors provides more robust results as compared tothe MC Dropout (Gal and Ghahramani 2016) and deep modelensembles (Lakshminarayanan, Pritzel, and Blundell 2017)approaches (shown in Figure 4).

6 ConclusionsWe proposed MOPED method that specifies the weight pri-ors in Bayesian deep neural networks using empirical Bayes

1http://yaroslavvb.blogspot.co.uk/2011/09/notmnist-dataset.html

framework. We demonstrated with thorough empirical experi-ments that MOPED enables scalable variational inference forBayesian DNNs.We demonstrated the proposed method out-performs state-of-the-art Bayesian deep learning techniquesusing BDL-benchmarks suite. We also showed the uncer-tainty estimates obtained from the proposed method are reli-able to identify out-of-distribution data. The results supportproposed approach provides better model performance andreliable uncertainty estimates on real-world tasks with largescale complex models.

References[Atanov et al. 2018] Atanov, A.; Ashukha, A.; Struminsky,

K.; Vetrov, D.; and Welling, M. 2018. The deep weight prior.[Bengio, Courville, and Vincent 2013] Bengio, Y.; Courville,A.; and Vincent, P. 2013. Representation learning: A reviewand new perspectives. IEEE transactions on pattern analysisand machine intelligence 35(8):1798–1828.

[Bishop 2006] Bishop, C. M. 2006. Pattern recognitionand machine learning (information science and statistics)springer-verlag new york. Inc. Secaucus, NJ, USA.

[Blei, Kucukelbir, and McAuliffe 2017] Blei, D. M.; Ku-cukelbir, A.; and McAuliffe, J. D. 2017. Variational inference:A review for statisticians. Journal of the American StatisticalAssociation 112(518):859–877.

[Blundell et al. 2015] Blundell, C.; Cornebise, J.;Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertaintyin neural networks. arXiv preprint arXiv:1505.05424.

[Casella 1992] Casella, G. 1992. Illustrating empirical bayesmethods. Chemometrics and intelligent laboratory systems16(2):107–125.

[Dillon et al. 2017] Dillon, J. V.; Langmore, I.; Tran, D.;Brevdo, E.; Vasudevan, S.; Moore, D.; Patton, B.; Alemi,A.; Hoffman, M.; and Saurous, R. A. 2017. Tensorflowdistributions. arXiv preprint arXiv:1711.10604.

[Filos et al. 2019] Filos, A.; Farquhar, S.; Gomez, A. N.; Rud-ner, T. G. J.; Kenton, Z.; Smith, L.; Alizadeh, M.; de Kroon,A.; and Gal, Y. 2019. Benchmarking bayesian deep learningwith diabetic retinopathy diagnosis.

[Gal and Ghahramani 2016] Gal, Y., and Ghahramani, Z.2016. Dropout as a bayesian approximation: Representingmodel uncertainty in deep learning. In international confer-ence on machine learning, 1050–1059.

[Gal 2016] Gal, Y. 2016. Uncertainty in deep learning. Uni-versity of Cambridge.

[Ghahramani and Beal 2001] Ghahramani, Z., and Beal, M. J.2001. Propagation algorithms for variational bayesian learn-ing. In Advances in neural information processing systems,507–513.

[Goodfellow, Bengio, and Courville 2016] Goodfellow, I.;Bengio, Y.; and Courville, A. 2016. Deep learning. MITpress.

[Goodfellow et al. 2013] Goodfellow, I. J.; Bulatov, Y.; Ibarz,J.; Arnoud, S.; and Shet, V. 2013. Multi-digit number recog-nition from street view imagery using deep convolutionalneural networks. arXiv preprint arXiv:1312.6082.

[Graves 2011] Graves, A. 2011. Practical variational infer-ence for neural networks. In Advances in neural informationprocessing systems, 2348–2356.

[Hara, Kataoka, and Satoh 2018] Hara, K.; Kataoka, H.; andSatoh, Y. 2018. Can spatiotemporal 3d cnns retrace thehistory of 2d cnns and imagenet? In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition(CVPR), 6546–6555.

[Hershey et al. 2017] Hershey, S.; Chaudhuri, S.; Ellis, D. P.;Gemmeke, J. F.; Jansen, A.; Moore, R. C.; Plakal, M.; Platt,D.; Saurous, R. A.; Seybold, B.; et al. 2017. Cnn architecturesfor large-scale audio classification. In Acoustics, Speechand Signal Processing (ICASSP), 2017 IEEE InternationalConference on, 131–135. IEEE.

[Houlsby et al. 2011] Houlsby, N.; Huszár, F.; Ghahramani,Z.; and Lengyel, M. 2011. Bayesian active learningfor classification and preference learning. arXiv preprintarXiv:1112.5745.

[Kaggle ] Kaggle. Diabetic retinopathy detection chal-lenge. https://www.kaggle.com/c/diabetic-retinopathy-detection/overview/description.

[Lakshminarayanan, Pritzel, and Blundell 2017]Lakshminarayanan, B.; Pritzel, A.; and Blundell, C.2017. Simple and scalable predictive uncertainty estimationusing deep ensembles. In Advances in Neural InformationProcessing Systems, 6402–6413.

[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.;Haffner, P.; et al. 1998. Gradient-based learning applied todocument recognition. Proceedings of the IEEE 86(11):2278–2324.

[Minka 2001] Minka, T. P. 2001. Expectation propagation forapproximate bayesian inference. In Proceedings of the Sev-enteenth conference on Uncertainty in artificial intelligence,362–369. Morgan Kaufmann Publishers Inc.

[Molchanov, Ashukha, and Vetrov 2017] Molchanov, D.;Ashukha, A.; and Vetrov, D. 2017. Variational dropoutsparsifies deep neural networks. In Proceedings of the 34thInternational Conference on Machine Learning-Volume 70,2498–2507. JMLR. org.

[Monfort et al. 2018] Monfort, M.; Zhou, B.; Bargal, S. A.;Andonian, A.; Yan, T.; Ramakrishnan, K.; Brown, L.; Fan, Q.;Gutfruend, D.; Vondrick, C.; et al. 2018. Moments in timedataset: one million videos for event understanding. arXivpreprint arXiv:1801.03150.

[Nalisnick, Hernández-Lobato, and Smyth 2019] Nalisnick,E.; Hernández-Lobato, J. M.; and Smyth, P. 2019. Dropoutas a structured shrinkage prior. In International Conferenceon Machine Learning, 4712–4722.

[Neal and Hinton 1998] Neal, R. M., and Hinton, G. E. 1998.A view of the em algorithm that justifies incremental, sparse,and other variants. In Learning in graphical models. Springer.355–368.

[Neal 1995] Neal, R. M. 1995. BAYESIAN LEARNING FORNEURAL NETWORKS. Ph.D. Dissertation, Citeseer.

[Neal 2012] Neal, R. M. 2012. Bayesian learning for neuralnetworks, volume 118. Springer Science & Business Media.

[Nguyen et al. 2017] Nguyen, C. V.; Li, Y.; Bui, T. D.; andTurner, R. E. 2017. Variational continual learning. arXivpreprint arXiv:1710.10628.

[Paisley, Blei, and Jordan 2012] Paisley, J.; Blei, D.; and Jor-dan, M. 2012. Variational bayesian inference with stochasticsearch. arXiv preprint arXiv:1206.6430.

[Peterson and Hartman 1989] Peterson, C., and Hartman, E.1989. Explorations of the mean field theory learning algo-rithm. Neural Networks 2(6):475–494.

[Ranganath, Gerrish, and Blei 2013] Ranganath, R.; Gerrish,S.; and Blei, D. M. 2013. Black box variational inference.arXiv preprint arXiv:1401.0118.

[Robbins 1956] Robbins, H. 1956. An empirical bayes ap-proach to statistics. Herbert Robbins Selected Papers 41–47.

[Russakovsky et al. 2015] Russakovsky, O.; Deng, J.; Su, H.;Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.;Khosla, A.; Bernstein, M.; et al. 2015. Imagenet largescale visual recognition challenge. International journal ofcomputer vision 115(3):211–252.

[Salamon, Jacoby, and Bello 2014] Salamon, J.; Jacoby, C.;and Bello, J. P. 2014. A dataset and taxonomy for urbansound research. In Proceedings of the 22nd ACM interna-tional conference on Multimedia, 1041–1044. ACM.

[Shin et al. 2016] Shin, H.-C.; Roth, H. R.; Gao, M.; Lu, L.;Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; and Summers, R. M.2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristicsand transfer learning. IEEE transactions on medical imaging35(5):1285–1298.

[Sønderby et al. 2016] Sønderby, C. K.; Raiko, T.; Maaløe,L.; Sønderby, S. K.; and Winther, O. 2016. Ladder variationalautoencoders. In Advances in neural information processingsystems, 3738–3746.

[Soomro, Zamir, and Shah 2012] Soomro, K.; Zamir, A. R.;and Shah, M. 2012. Ucf101: A dataset of 101 humanactions classes from videos in the wild. arXiv preprintarXiv:1212.0402.

[Sun et al. 2019] Sun, S.; Zhang, G.; Shi, J.; and Grosse, R.

https://www.kaggle.com/c/diabetic-retinopathy-detection/overview/description

https://www.kaggle.com/c/diabetic-retinopathy-detection/overview/description

2019. Functional variational bayesian neural networks. arXivpreprint arXiv:1903.05779.

[Welling and Teh 2011] Welling, M., and Teh, Y. W. 2011.Bayesian learning via stochastic gradient langevin dynam-ics. In Proceedings of the 28th International Conference onMachine Learning (ICML-11), 681–688.

[Wen et al. 2018] Wen, Y.; Vicol, P.; Ba, J.; Tran, D.; andGrosse, R. 2018. Flipout: Efficient pseudo-independentweight perturbations on mini-batches. arXiv preprintarXiv:1803.04386.

[Wu et al. 2018] Wu, A.; Nowozin, S.; Meeds, E.; Turner,R. E.; Hernández-Lobato, J. M.; and Gaunt, A. L. 2018.Deterministic variational inference for robust bayesian neuralnetworks.

[Xiao, Rasul, and Vollgraf 2017] Xiao, H.; Rasul, K.; andVollgraf, R. 2017. Fashion-mnist: a novel image dataset forbenchmarking machine learning algorithms. arXiv preprintarXiv:1708.07747.

arXiv:1906.05323v2 [cs.NE] 25 Oct 2019 · 2011; Blundell et al. 2015; Paisley, Blei, and Jordan...

Documents

Transcript of arXiv:1906.05323v2 [cs.NE] 25 Oct 2019 · 2011; Blundell et al. 2015; Paisley, Blei, and Jordan...