
Uncertainty Quantified Matrix Completion using Bayesian Hierarchical Matrix Factorization

Farideh Fazayeli, Arindam Banerjee
Dept. of Computer Science & Engg
Univ of Minnesota, Twin Cities
Email: {farideh, banerjee}@cs.umn.edu

Jens Kattge, Franziska Schrodt
Biogeochemistry
Max Planck Institute
Email: {jkattge, fschrodt}@bgc-jena.mpg.de

Peter B. Reich
Dept. of Forest Resources
Univ of Minnesota, Twin Cities
Email: [email protected]

Abstract—Low-rank matrix completion methods have been successful in a variety of settings such as recommendation systems. However, most of the existing matrix completion methods only provide a point estimate of missing entries, and do not characterize uncertainties of the predictions. In this paper, we propose a Bayesian hierarchical probabilistic matrix factorization (BHPMF) model to 1) incorporate hierarchical side information, and 2) provide uncertainty quantified predictions. The former yields significant performance improvements in the problem of plant trait prediction, a key problem in ecology, by leveraging the taxonomic hierarchy in the plant kingdom. The latter is helpful in identifying predictions of low confidence, which can in turn be used to guide field work for data collection efforts. A Gibbs sampler is designed for inference in the model. Further, we propose a multiple inheritance BHPMF (MI-BHPMF) which can work with a general directed acyclic graph (DAG) structured hierarchy, rather than a tree. We present comprehensive experimental results on the problem of plant trait prediction using the largest database of plant traits, where BHPMF shows strong empirical performance in uncertainty quantified trait prediction, outperforming the state-of-the-art based on point estimates. Further, we show that BHPMF is more accurate when it is confident, whereas the error is high when the uncertainty is high.

I. INTRODUCTION

Plant traits are morphological, anatomical, biochemical, physiological or phenological features of individuals or their component organs or tissues [1]. Plant traits are key to understanding and ameliorating the response of biodiversity and terrestrial ecosystems to human disturbance and expected environmental changes. Meanwhile, the plant kingdom is characterized by a taxonomic hierarchy: individual plants are categorized into species, species into genera, genera into families, and families into phylogenetic groups. In 2007, the TRY project (www.try-db.org) was established to combine different trait databases worldwide. TRY has thus become the world's largest trait database, and it provides a unique opportunity to systematically analyze plant traits and to improve research in global change science [1].

One of the main challenges in trait analysis with TRY is data sparsity (the database contains 4.2M trait measurements on 1.7M plants for 1000 traits) [1]. In ecology, a common way to estimate missing values is to use the species mean (MEAN). MEAN is relatively accurate due to the low variation of trait values within species [1]. From a matrix completion perspective, MEAN provides an unusually strong and robust baseline. However, MEAN has been shown to introduce large errors and biases in some cases [2]. Understanding intra-specific variability of traits, in order to gain better insights into how species adapt to changing environmental resources and climatic conditions, is a key goal in modern ecology, and MEAN is ineffective for that purpose.

In the recent literature, low-rank matrix completion methods have been successful in a variety of settings [3]–[9]. Such methods broadly come in two flavors: from an optimization perspective, usually based on rank/trace-norm constraints [10], [11], or probabilistic models based on latent factors [7], [8]. In most settings, the given sparse matrix $X \in \mathbb{R}^{N \times M}$ is approximated by a low-rank matrix $\hat{X} = UV^T$ where $U \in \mathbb{R}^{N \times D}$ and $V \in \mathbb{R}^{M \times D}$. The latent factors $u_n \in \mathbb{R}^D$, for each row $n$, and the latent factors $v_m \in \mathbb{R}^D$, for each column $m$ of matrix $X$, are estimated, usually based on alternating optimization [4], [12]. Once the latent factors have been estimated, the inner product of $u_n$ and $v_m$ gives the prediction for the missing entry $x_{nm}$.
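To make the alternating-optimization recipe concrete, here is a minimal sketch of alternating least squares for matrix completion; this is our illustration rather than the exact method of [4], [12], and the regularizer lam, iteration count, and all names are illustrative choices.

import numpy as np

def als_complete(X, mask, D=10, lam=0.1, n_iters=20, seed=0):
    """Alternating least squares for low-rank matrix completion (sketch).

    X    : (N, M) array; values at unobserved entries are ignored
    mask : (N, M) boolean array, True where X is observed
    Returns U (N, D), V (M, D) with X ~= U @ V.T on observed entries.
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape
    U = rng.normal(scale=0.1, size=(N, D))
    V = rng.normal(scale=0.1, size=(M, D))
    for _ in range(n_iters):
        for n in range(N):                      # update row factors u_n
            Vo = V[mask[n]]                     # factors of observed columns
            A = Vo.T @ Vo + lam * np.eye(D)
            U[n] = np.linalg.solve(A, Vo.T @ X[n, mask[n]])
        for m in range(M):                      # update column factors v_m
            Uo = U[mask[:, m]]
            A = Uo.T @ Uo + lam * np.eye(D)
            V[m] = np.linalg.solve(A, Uo.T @ X[mask[:, m], m])
    return U, V

The prediction for a missing entry $x_{nm}$ is then U[n] @ V[m].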

A key limitation of most matrix factorization (MF) models is the inability to use domain knowledge such as hierarchical side information. In fact, as we illustrate in Section IV-D, applying the PMF (Probabilistic Matrix Factorization) model [7], which does not incorporate the plant taxonomic hierarchy, leads to performance worse than the simple algorithm MEAN, which uses the domain knowledge [13]. In a recent work, hierarchical information is incorporated into MF in three different ways: hierarchical regularization, agglomerate fitting, and residual fitting [14]. In another work, hierarchical PMF (HPMF) was proposed for predicting missing values [13]: latent factors corresponding to the higher levels of the hierarchy are used as priors over the latent factors of the lower levels. Inference in the model was done using alternating optimization, leading to a point estimate of the missing values.

One of the main drawbacks of point estimates in the context of matrix completion is that they provide no uncertainty quantification. In other words, there is no way of telling when a prediction is confident and when the model is unsure about the prediction. In most scientific disciplines, an uncertainty quantified prediction is essential for understanding the predictions and planning subsequent steps, including additional data collection efforts to reduce uncertainties. In some applications like plant trait prediction, we are therefore interested in inferring a distribution for each prediction, which motivates the use of a Bayesian approach to the problem. BPMF is proposed in [8] as a Bayesian generalization of PMF by maintaining a distribution over all possible covariance


matrices. A Gibbs sampler is applied for inference in BPMF, yielding a distribution for each prediction.

In this paper, we present a model and inference algorithms for uncertainty quantified matrix completion while incorporating a given hierarchy as side information. The development is more general than that in [13], [14] since we provide uncertainty quantified predictions, i.e., both the mean and the standard deviation of a missing entry, where low standard deviations imply high confidence predictions. The model is also more general than that in [8] since we incorporate the given hierarchical side information as part of the model. The key contributions of the paper are as follows:

(i) We propose Bayesian HPMF (BHPMF), a fully Bayesian model that incorporates the hierarchical side information and provides uncertainty quantified estimates of the missing values. Unlike several existing hierarchical Bayesian models [6], [9], [15], in BHPMF the structure of the hierarchy is determined by side information as opposed to being derived from the data. In particular, both sets of latent factors, over rows and columns at each level of the hierarchy, serve as priors for the next level in the hierarchy. We propose a Gibbs sampler for inference in BHPMF. While there are several latent factors at multiple levels, the Markov blanket of each factor is small and independent of the number of levels, yielding an efficient Gibbs sampler. We explore variants of BHPMF which consider different implementations of the Gibbs sampler, including element-wise and block-wise sampling, as well as samplers for correlated HPMF models.

(ii) We introduce a Bayesian hierarchical multiple inheritance PMF (MI-BHPMF) model which assumes a directed acyclic graph (DAG) structured hierarchical prior, rather than the tree structure assumed in HPMF and BHPMF. We show that the Gibbs sampler can be easily generalized to this setting.

(iii) We present extensive results on the plant trait prediction problem in Section IV. We illustrate higher accuracy for trait prediction with BHPMF as compared to HPMF, MEAN, and PMF. Further, we show that BHPMF is more accurate when it is confident, whereas the error is high when the uncertainty is high. We identify the spatial coverage of the least confident regions. Finally, we illustrate the performance of MI-BHPMF on movie recommendation using genre information.

The rest of the paper is organized as follows. In Section II, we give a brief overview of related work. Section III presents the BHPMF model. We present the experimental results for plant trait prediction in Section IV. In Section V, we introduce a hierarchical multiple inheritance model with preliminary results on movie recommendation. We conclude in Section VI.

II. RELATED WORK

Low-rank MF algorithms provide powerful techniques for matrix completion [3]–[9]. It has been shown that rank constrained minimization problems can be formulated as trace norm constrained problems, which are convex and can be written with semi-definite constraints [16]. Moreover, Srebro et al. proposed maximum margin matrix factorization as a convex, infinite dimensional alternative to low-rank matrix factorization [17]. Several important variants of low-rank matrix factorization have been investigated, including PMF [7] and its Bayesian

Fig. 1. (a) BHPMF and (b) MI-BHPMF schematics at level $\ell$. In spite of the size of the model, the Gibbs sampler is efficient since the Markov blanket is small and independent of the number of levels. MI-BHPMF supports multiple inheritance.

generalization [8], [9], as well as generalizations to probabilistic tensor factorization [3], [18], [19]. A non-linear MF using Gaussian process latent variable models is proposed in [5]. However, one major drawback of the above methods is the inability to incorporate side information.

In order to consider side information, several approaches have been proposed to combine MF with topic modeling [6], [20], [21]. Kernelized PMF was developed to incorporate covariance functions based on kernels over rows and columns in the context of latent factor models for matrix completion [22]. Moreover, probabilistic matrix addition is proposed in [23] to capture covariance structure among rows and among columns at the same time by adding the latent matrices. In a recent work in online advertising [14], hierarchical side information is incorporated into MF in three different ways: hierarchical regularization, agglomerate fitting, and residual fitting. Hierarchical PMF, which incorporates the taxonomic hierarchy into PMF, is the state-of-the-art for plant trait prediction [13].

III. BHPMF

In this section, we propose a fully Bayesian model (BHPMF) and an inference procedure that incorporates the hierarchical side information and provides uncertainty quantified estimates of the missing trait values.

A. Model specification

We illustrate the BHPMF model in the context of plant trait prediction. However, it can be applied to any problem with hierarchical side information.

Denote the data matrix at each level $\ell$ with $X^{(\ell)} \in \mathbb{R}^{N^{(\ell)} \times M}$, for $\ell$ running from the top level 1 (e.g., phylogenetic groups) to the bottom level $L$ (e.g., individual plants). Each row $n$ and column $m$ of $X^{(\ell)}$ has a latent factor $u_n^{(\ell)} \in \mathbb{R}^D$ and $v_m^{(\ell)} \in \mathbb{R}^D$, respectively. We denote the latent factor matrices at level $\ell$ with $U^{(\ell)} \in \mathbb{R}^{N^{(\ell)} \times D}$ and $V^{(\ell)} \in \mathbb{R}^{M \times D}$ (Figure 1(a)).

The generative process of BHPMF at level $\ell$ is given as follows:

1) Generate $u_n^{(\ell)} \sim \mathcal{N}(u_{p(n)}^{(\ell-1)}, \sigma_u^2 I)$, $[n]_1^{N^{(\ell)}}$, where $p(n)$ is the parent node of $n$ in the upper level.
2) Generate $v_{:d}^{(\ell)} \sim \mathcal{N}(v_{:d}^{(\ell-1)}, [K_v^{(\ell)}]^{-1})$, $[d]_1^{D}$.
3) Generate $x_{nm}^{(\ell)} \sim \mathcal{N}(\langle u_n^{(\ell)}, v_m^{(\ell)}\rangle, \sigma^2)$ for each non-missing entry, where $\langle\cdot,\cdot\rangle$ is the inner product.

Here $v_{:d}^{(\ell)} \in \mathbb{R}^M$ is column $d$ of $V^{(\ell)}$ and $K_v^{(\ell)}$ is the trait precision matrix (inverse of the covariance matrix) at level $\ell$.
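For concreteness, a minimal sketch of this generative process with a diagonal trait prior (so the columns of $V^{(\ell)}$ receive independent isotropic noise) might look as follows; the parents encoding of the taxonomy and all variable names are our own assumptions.

import numpy as np

def bhpmf_generate(parents, M, D, sigma_u, sigma_v, seed=0):
    """Draw one sample of the latent factors from the BHPMF prior (sketch).

    parents : list of length L; parents[l][n] gives the parent index of
              row n at level l (the top level hangs off a single zero root).
    Returns per-level factors U[l] (N_l, D) and V[l] (M, D); observed
    entries would then follow x_nm ~ N(<u_n, v_m>, sigma^2).
    """
    rng = np.random.default_rng(seed)
    U, V = [], []
    prev_U = np.zeros((1, D))        # root prior mean for the top level
    prev_V = np.zeros((M, D))
    for par in parents:
        idx = np.asarray(par)
        # u_n^(l) ~ N(u_{p(n)}^(l-1), sigma_u^2 I): children centered on parents.
        U_l = prev_U[idx] + sigma_u * rng.standard_normal((len(idx), D))
        # v^(l) centered on v^(l-1); with a diagonal K_v this is isotropic noise
        # (we treat sigma_v as a standard deviation for simplicity).
        V_l = prev_V + sigma_v * rng.standard_normal((M, D))
        U.append(U_l); V.append(V_l)
        prev_U, prev_V = U_l, V_l
    return U, V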

We use a Gibbs sampling procedure [24] to draw samples of the latent matrices from the joint posterior. In spite of the size of the model, the Gibbs sampler is efficient since the Markov blanket is small and independent of the number of levels (Figure 1(a)). We consider two different types of covariance structure priors for the trait factors: a diagonal covariance and a full covariance matrix.

B. Sampling $U$

Let $C(n) = \{c_i(n)\}$ be the set of child nodes of $n$, with $c_i(n)$ the $i$-th child node. Consider $U_{-n}$, the matrix obtained from $U$ by discarding the $n$-th row. Let $\delta_{nm}^{(\ell)} = 1$ if $x_{nm}^{(\ell)}$ is non-missing and 0 otherwise.

Given the Markov blanket of $u_n^{(\ell)}$, i.e., $\{x_n^{(\ell)}, V^{(\ell)}, u_{p(n)}^{(\ell-1)}, u_{C(n)}^{(\ell+1)}\}$, $u_n^{(\ell)}$ is independent of the other variables (Figure 1(a)). Therefore, the conditional probability of $U^{(\ell)}$ can be factorized into the product of the conditional probabilities of its rows $\{u_n^{(\ell)}\}_{n=1}^{N^{(\ell)}}$. By applying Bayes' rule, and given that the product of multiple Gaussian distributions is another Gaussian distribution, it can be shown that the conditional probability of $u_n^{(\ell)}$ is a Gaussian distribution

$$p\left(u_n^{(\ell)} \,\middle|\, x_n^{(\ell)}, V^{(\ell)}, u_{p(n)}^{(\ell-1)}, u_{C(n)}^{(\ell+1)}\right) \sim \mathcal{N}\left(u_n^{(\ell)} \,\middle|\, \mu_n^{*(\ell)}, \Sigma_n^{*(\ell)}\right)$$
$$\sim \prod_m \left[\mathcal{N}\left(x_{nm}^{(\ell)} \,\middle|\, \langle u_n^{(\ell)}, v_m^{(\ell)}\rangle, \sigma^2\right)\right]^{\delta_{nm}^{(\ell)}} \prod_i \mathcal{N}\left(u_{c_i(n)}^{(\ell+1)} \,\middle|\, u_n^{(\ell)}, \sigma_u^2 I\right) \mathcal{N}\left(u_n^{(\ell)} \,\middle|\, u_{p(n)}^{(\ell-1)}, \sigma_u^2 I\right) \quad (1)$$

where $|\cdot|$ denotes set cardinality,

$$\Sigma_n^{*(\ell)} = \left[\frac{|C(n)|+1}{\sigma_u^2}\, I + \frac{1}{\sigma^2}\sum_m \delta_{nm}^{(\ell)}\, v_m^{(\ell)} v_m^{(\ell)T}\right]^{-1} \quad (2)$$
$$\mu_n^{*(\ell)} = \Sigma_n^{*(\ell)}\left[\frac{u_{p(n)}^{(\ell-1)} + \sum_i u_{c_i(n)}^{(\ell+1)}}{\sigma_u^2} + \frac{\sum_m \delta_{nm}^{(\ell)}\, x_{nm}^{(\ell)}\, v_m^{(\ell)}}{\sigma^2}\right].$$
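A direct transcription of the Gaussian conditional (1)–(2) for a single row factor could look like the following sketch; this is hypothetical code, not the authors' implementation, and the argument names are ours.

import numpy as np

def sample_u(x_obs, V_obs, u_parent, children, sigma_u, sigma, rng):
    """Draw u_n^(l) from the Gaussian conditional of Eqs. (1)-(2) (sketch).

    x_obs    : (K,) observed entries of row n at this level
    V_obs    : (K, D) column factors for those entries
    u_parent : (D,) parent factor u_{p(n)}^(l-1)
    children : (|C(n)|, D) child factors u_{c_i(n)}^(l+1), possibly empty
    """
    D = u_parent.shape[0]
    # Precision: ((|C(n)| + 1)/sigma_u^2) I + (1/sigma^2) sum_m delta v_m v_m^T
    prec = ((len(children) + 1) / sigma_u**2) * np.eye(D) \
           + (V_obs.T @ V_obs) / sigma**2
    cov = np.linalg.inv(prec)
    # Mean: Sigma* [(u_parent + sum_i u_child_i)/sigma_u^2 + sum_m x_nm v_m / sigma^2]
    b = (u_parent + children.sum(axis=0)) / sigma_u**2 \
        + (V_obs.T @ x_obs) / sigma**2
    return rng.multivariate_normal(cov @ b, cov)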

C. Sampling $V$

a) Block-wise Sampling: When $K_v^{(\ell)} = \frac{1}{\sigma_v} I$, each row of the latent matrix $V^{(\ell)}$ can be sampled in parallel, similar to sampling matrix $U^{(\ell)}$ (Section III-B).

b) Element-wise Sampling: When $K_v^{(\ell)}$ is a full matrix, each column $d$ of $V^{(\ell)}$ is drawn from $\mathcal{N}(V_{:d}^{(\ell)} \mid V_{:d}^{(\ell-1)}, [K_v^{(\ell)}]^{-1})$. Because of conditional dependencies, unlike sampling $U^{(\ell)}$ in Section III-B, matrix $V^{(\ell)}$ is sampled element-wise.

By applying Bayes' rule, the conditional probability of $v_{md}^{(\ell)}$ can be written as

$$p\left(v_{md}^{(\ell)} \,\middle|\, V_{-m,-d}^{(\ell)}, X^{(\ell)}, U^{(\ell)}, v_{:d}^{(\ell-1)}, v_{:d}^{(\ell+1)}\right) \sim p\left(x_{:m}^{(\ell)} \,\middle|\, v_m^{(\ell)}, U^{(\ell)}\right) p\left(v_{md}^{(\ell)} \,\middle|\, v_{-m,d}^{(\ell)}, v_{:d}^{(\ell-1)}\right) p\left(v_{md}^{(\ell)} \,\middle|\, v_{-m,d}^{(\ell)}, v_{:d}^{(\ell+1)}\right). \quad (3)$$

It can be shown that the individual distributions are univariate Gaussians, as follows. Consider

$$p\left(x_{:m}^{(\ell)} \,\middle|\, v_m^{(\ell)}, U^{(\ell)}\right) = \prod_n \mathcal{N}\left(\langle u_n, v_m\rangle, \sigma^2\right). \quad (4)$$

Given (4), the conditional probability of $v_{md}^{(\ell)}$ is obtained as

$$p\left(v_{md}^{(\ell)} \,\middle|\, v_{m,-d}^{(\ell)}, x_{:m}^{(\ell)}, U^{(\ell)}\right) \sim \mathcal{N}\left(\mu_x^{(\ell)}, \frac{1}{\sigma_x^{(\ell)}}\right) \quad (5)$$

where $\sigma_x^{(\ell)} = \frac{\sum_n [u_{nd}^{(\ell)}]^2}{\sigma^2}$, $\mu_x^{(\ell)} = \frac{\sum_n u_{nd}^{(\ell)} \beta_n^{(\ell)}}{\sum_n [u_{nd}^{(\ell)}]^2}$, and $\beta_n^{(\ell)} = x_{nm}^{(\ell)} - \sum_{h=1, h \neq d}^{K} u_{nh}^{(\ell)} v_{mh}^{(\ell)}$.

Since the prior of each column of $V^{(\ell)}$ is $\mathcal{N}(v_{:d}^{(\ell)} \mid v_{:d}^{(\ell-1)}, [K_v^{(\ell)}]^{-1})$, the conditional probabilities of $v_{md}^{(\ell)}$ can be obtained as follows:

$$p\left(v_{md}^{(\ell)} \,\middle|\, v_{-m,d}^{(\ell)}, v_{:d}^{(\ell-1)}\right) \sim \mathcal{N}\left(\mu_{m1}^{(\ell)}, \frac{1}{\sigma_m}\right)$$
$$p\left(v_{md}^{(\ell)} \,\middle|\, v_{-m,d}^{(\ell)}, v_{:d}^{(\ell+1)}\right) \sim \mathcal{N}\left(\mu_{m2}^{(\ell)}, \frac{1}{\sigma_m}\right) \quad (6)$$

where $\sigma_m = K_v^{(\ell)}(m,m)$,

$$\mu_{m1}^{(\ell)} = v_{md}^{(\ell-1)} - \frac{K_v^{(\ell)}(m,-m)}{K_v^{(\ell)}(m,m)}\left[v_{-m,d}^{(\ell)} - v_{-m,d}^{(\ell-1)}\right],$$
$$\mu_{m2}^{(\ell)} = v_{md}^{(\ell+1)} - \frac{K_v^{(\ell)}(m,-m)}{K_v^{(\ell)}(m,m)}\left[v_{-m,d}^{(\ell)} - v_{-m,d}^{(\ell+1)}\right].$$

From (5) and (6), we can write (3) as

$$p\left(v_{md}^{(\ell)} \,\middle|\, V_{-m,-d}^{(\ell)}, X^{(\ell)}, U^{(\ell)}, v_{:d}^{(\ell-1)}, v_{:d}^{(\ell+1)}\right) \sim \mathcal{N}\left(\mu_{md}^{*(\ell)}, \sigma_{md}^{*(\ell)}\right) \quad (7)$$

where $\sigma_{md}^{*(\ell)} = (2\sigma_m + \sigma_x^{(\ell)})^{-1}$ and

$$\mu_{md}^{*(\ell)} = \sigma_{md}^{*(\ell)}\left[\sigma_x^{(\ell)} \mu_x^{(\ell)} + \sigma_m\left(\mu_{m1}^{(\ell)} + \mu_{m2}^{(\ell)}\right)\right].$$
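Putting (5)–(7) together, one element-wise update might be sketched as follows; the indexing helpers and names are our assumptions, not the authors' code. Note that $\sigma_x^{(\ell)}$ and $\sigma_m$ act as precisions here.

import numpy as np

def sample_v_md(m, d, X, mask, U, V, V_up, V_down, K, sigma, rng):
    """Draw v_md^(l) element-wise per Eqs. (5)-(7) with full precision K (sketch).

    X, mask : (N, M) data and observation indicator at this level
    U, V    : current factors (N, D) and (M, D)
    V_up    : V^(l-1); V_down : V^(l+1); K : (M, M) trait precision K_v^(l)
    """
    obs = mask[:, m]
    u_d = U[obs, d]
    # Likelihood part (5): beta_n = x_nm - sum_{h != d} u_nh v_mh
    beta = X[obs, m] - U[obs] @ V[m] + u_d * V[m, d]
    prec_x = (u_d @ u_d) / sigma**2          # sigma_x^(l): a precision
    lik_term = (u_d @ beta) / sigma**2       # equals sigma_x * mu_x
    # Prior parts (6) from the levels above and below.
    idx = np.arange(V.shape[0]) != m
    w = K[m, idx] / K[m, m]                  # K(m, -m) / K(m, m)
    mu1 = V_up[m, d] - w @ (V[idx, d] - V_up[idx, d])
    mu2 = V_down[m, d] - w @ (V[idx, d] - V_down[idx, d])
    prec_m = K[m, m]                         # sigma_m: also a precision
    # Combine per (7): scalar Gaussian with variance sigma*_{md}.
    var = 1.0 / (2 * prec_m + prec_x)
    mu = var * (lik_term + prec_m * (mu1 + mu2))
    return mu + np.sqrt(var) * rng.standard_normal()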

D. BHPMF Inference

We consider three different sampling procedures, based on the selection of $K_v^{(\ell)}$ at each level, as follows.

a) Block-wise Sampler: For a given sparse matrix $X$, the sampler updates the latent factor matrices $(U^{(\ell)}, V^{(\ell)})$ at every level $\ell$. At each level $\ell$, $U^{(\ell)}$ is sampled block-wise using (1). Using $K_v^{(\ell)} = \frac{1}{\sigma_v} I$ for all levels $\ell = 1 \cdots L$, $V^{(\ell)}$ is sampled block-wise. To incorporate the taxonomic information, we use the following procedure. Each sample at the lowest level is obtained by sampling the upper level matrices iteratively. At each iteration, we first do a bottom-up pass to sample $(U^{(L)}, V^{(L)})$ through $(U^{(1)}, V^{(1)})$, followed by a top-down pass to sample $(U^{(1)}, V^{(1)})$ through $(U^{(L)}, V^{(L)})$, and repeat the procedure to generate enough samples (Algorithm 1).

b) Element-wise Sampler: At each level $\ell$, similar to the block-wise sampler, $U^{(\ell)}$ is sampled using (1). To incorporate trait correlations into the sampler, a full covariance matrix $K_v^{(\ell)}$ is used for $\ell = 1 \cdots L$. Therefore, the matrix $V^{(\ell)}$ is sampled element-wise. The sampling procedure is mostly similar to the block-wise sampler (Algorithm 1), except that line 6 in Algorithm 1 is replaced with the following lines:

6a: for $iter = 1 \cdots MaxIteration$ do
6b:   for $m = 1 \cdots M$ do
6c:     for $d = 1 \cdots D$ do
          sample $v_{md}^{(\ell)}$ using (7):
          $p\left(v_{md}^{t+1(\ell)} \,\middle|\, V_{-m,-d}^{t+1(\ell)}, X^{(\ell)}, U^{t+1(\ell)}, v_{:d}^{t(\ell-1)}, v_{:d}^{t(\ell+1)}\right)$

where $MaxIteration$ is chosen empirically. Updating $V^{(\ell)}$ more than once at each iteration yields a stable matrix before updating the upper level matrices. Similar changes are applied to line 9 in Algorithm 1.

c) Mixture Sampler: At each level $\ell$, similar to the block-wise sampler, $U^{(\ell)}$ is sampled using (1). For $\ell = 1 \cdots (L-1)$, $K_v^{(\ell)} = \frac{1}{\sigma_v} I$ is used and $V^{(\ell)}$ is sampled block-wise. At the lowest level $L$ (the plant level), a full covariance matrix $K_v^{(L)}$ is used and $V^{(L)}$ is sampled element-wise from (7).

Algorithm 1 BHPMF - Block-wise Sampler
1: for $\ell = 1, \cdots, L$ do
2:   Initialize model parameters $\{U^{1(\ell)}, V^{1(\ell)}\}$
3: for $t = 1, \cdots, T$ do
4:   for $\ell = L, \cdots, 1$ do  ⊳ bottom-up
5:     for $n = 1 \cdots N$, sample $u_n^{(\ell)}$ in parallel using (1):
         $u_n^{t+1(\ell)} \sim p\left(u_n^{t(\ell)} \,\middle|\, x_n^{(\ell)}, V^{t(\ell)}, u_{p(n)}^{t(\ell-1)}, u_{C(n)}^{t(\ell+1)}\right)$
6:     for $m = 1 \cdots M$, sample $v_m^{(\ell)}$ in parallel:
         $v_m^{t+1(\ell)} \sim p\left(v_m^{t(\ell)} \,\middle|\, x_m^{(\ell)}, U^{t+1(\ell)}, v_m^{t(\ell-1)}, v_m^{t(\ell+1)}\right)$
7:   for $\ell = 1, \cdots, L$ do  ⊳ top-down
8:     for $n = 1 \cdots N$, sample $u_n^{(\ell)}$ in parallel using (1):
         $u_n^{t+2(\ell)} \sim p\left(u_n^{t+1(\ell)} \,\middle|\, x_n^{(\ell)}, V^{t+1(\ell)}, u_{p(n)}^{t+1(\ell-1)}, u_{C(n)}^{t+1(\ell+1)}\right)$
9:     for $m = 1 \cdots M$, sample $v_m^{(\ell)}$ in parallel:
         $v_m^{t+2(\ell)} \sim p\left(v_m^{t+1(\ell)} \,\middle|\, x_m^{(\ell)}, U^{t+2(\ell)}, v_m^{t+1(\ell-1)}, v_m^{t+1(\ell+1)}\right)$
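The pass structure of Algorithm 1 can be sketched as scaffolding like the following, where sample_U_level and sample_V_level are stand-ins for the block-wise Gaussian draws of Sections III-B and III-C; the state layout is an illustrative assumption.

def bhpmf_block_sampler(levels, T, sample_U_level, sample_V_level):
    """Block-wise Gibbs sampler: alternating bottom-up / top-down passes (sketch).

    levels : list of per-level states (data, U, V, parent/child links)
    T      : number of Gibbs iterations; discard burn-in and thin afterwards
    """
    L = len(levels)
    samples = []
    for t in range(T):
        for l in reversed(range(L)):            # bottom-up pass (lines 4-6)
            levels[l]["U"] = sample_U_level(levels, l)   # Eq. (1), rows in parallel
            levels[l]["V"] = sample_V_level(levels, l)   # diagonal K_v, block-wise
        for l in range(L):                      # top-down pass (lines 7-9)
            levels[l]["U"] = sample_U_level(levels, l)
            levels[l]["V"] = sample_V_level(levels, l)
        # Keep the bottom-level factors; predictions average <u_n, v_m> over samples.
        samples.append((levels[-1]["U"].copy(), levels[-1]["V"].copy()))
    return samples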

IV. BHPMF EXPERIMENTAL RESULTS

Here, we present the results for trait prediction.

A. Dataset

In our experiments, we use a cleaned subset of the TRY database – the world's largest database of plant traits [1] – where taxonomic hierarchy information is available for all entries. This subset is a matrix containing 78,300 plants and 13 traits. The percentage of missing entries varies from 49.63% to 92.33% per trait; in total, 79.9% of the entries are missing. Starting from the top of the taxonomic hierarchy, there are 6 phylogenetic groups, 358 families, 3793 genera, 14,320 species, and 78,300 plants.

Given the plant × trait matrix and the taxonomic hierarchy, trait data matrices at the upper levels, such as the species × trait matrix, the genus × trait matrix, etc., are constructed. For example, a species × trait matrix can be constructed by averaging over the plants in the same species.
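With the data in long form, constructing such upper-level matrices amounts to a group-by mean; a toy sketch follows (the column names and data are hypothetical).

import pandas as pd

# Toy long-form table: one row per plant, NaN marks a missing trait value.
df = pd.DataFrame({
    "plant_id": [1, 2, 3, 4],
    "species": ["A", "A", "B", "B"],
    "genus":   ["G1", "G1", "G2", "G2"],
    "trait_sla":    [10.0, 12.0, None, 8.0],
    "trait_height": [None, 3.0, 5.0, 7.0],
})
traits = ["trait_sla", "trait_height"]

# species x trait matrix: average observed values within each species;
# pandas' mean skips NaNs, so the sparsity is handled automatically.
species_by_trait = df.groupby("species")[traits].mean()
# genus x trait matrix, one level further up the hierarchy.
genus_by_trait = df.groupby("genus")[traits].mean()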

B. Baselines

Mean: Given the plant × trait training matrix, upper level matrices are constructed to provide the species mean, genus mean, etc., using the taxonomic information. For example, the species mean of trait m is the average of trait m among plants in the same species with trait m available. To predict a missing trait m of plant n, we use the first available mean (species, then genus, etc.) starting from the lowest level.
PMF: We run PMF [7] on the plant × trait matrix directly. Note that PMF is unable to use the taxonomic information.
HPMF: The results of HPMF are obtained from 5 top-down and bottom-up passes in total, as in [13].

TABLE I. RMSE of species MEAN, PMF, HPMF, and BHPMF. Latent dimension k = 15 for the matrix factorization methods.

Method                          RMSE
PMF                             0.8993 ± 0.0210
MEAN                            0.5753 ± 0.0024
HPMF                            0.5009 ± 0.0034
BHPMF - Block-wise Sampler      0.4567 ± 0.0021

C. Methodology

The most common evaluation measure for prediction accuracy is the root mean square error (RMSE), given as

$$RMSE = \sqrt{\frac{1}{T}\sum_{i=1}^{N}\sum_{j=1}^{M} \delta_{ij}\,(x_{ij} - \hat{x}_{ij})^2}$$

where $x_{ij}$ is the actual trait value, $\hat{x}_{ij}$ is the predicted value for plant $i$ and trait $j$, and $T$ is the total number of non-missing test entries.
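As a reference point, the measure is a one-line sketch in code, with mask marking the non-missing test entries:

import numpy as np

def rmse(X_true, X_pred, mask):
    """RMSE over the T non-missing test entries indicated by mask."""
    err = (X_true - X_pred)[mask]
    return float(np.sqrt(np.mean(err**2)))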

For uncertainty evaluation, we report our results based on a model's confidence vs. accuracy curve. We use the standard deviation to measure the degree of confidence in trait prediction, and RMSE to measure the model's accuracy. The hypothesis is that when we are confident in the predictions on the test set, the achieved accuracy is high, i.e., the RMSE should decrease with decreasing standard deviation.

In order to run BHPMF, the latent matrices $U^{(\ell)}$ and $V^{(\ell)}$ for $\ell = 1 \cdots L$ are initialized randomly. The parameters for the different BHPMF samplers are as follows.
Block-wise sampler: The burn-in period was set to 200 with a lag of 2 and a final number of 400 samples.
Element-wise sampler and Mixture sampler: The burn-in period was set to 700 with a lag of 2 and a final number of 400 samples. The element-wise sampler has been tested with $K_v^{(\ell)} = \frac{1}{\sigma_v} I$ and $K_v^{(\ell)} = K^*$, where $K^*$ is the precision matrix estimated by mGLasso [25].

D. Results

In this section, we evaluate BHPMF along different dimensions: a comparison between the different samplers, an uncertainty evaluation analysis, and prediction accuracy.

1) Different Types of Samplers: Figure 2(a) compares the different BHPMF samplers with respect to RMSE per iteration. All of the samplers reach a stationary state with an increasing number of iterations. Interestingly, the block-wise sampler with $\{K_v^{(\ell)} = \frac{1}{\sigma_v} I\}_{\ell=1}^{L}$ outperforms all other samplers. While mixture sampling with both types of covariance matrix behaves almost identically, the element-wise sampler with full covariance $\{K_v^{(\ell)}\}_{\ell=1}^{L}$ improves on the element-wise sampler with a diagonal covariance matrix.

2) Uncertainty Evaluation: The experiment runs as follows: we sort all the data points in the test sets in ascending order of their standard deviation (Std), and divide the test sets evenly into 10 parts according to ascending Std, i.e., the first part (Batch 1) contains the 10% of data points with the lowest Std, the second part (Batch 2) contains the next 10%, and so on. We calculate the RMSE on these 10 parts separately and draw a curve, as sketched below.
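A minimal sketch of this batching procedure (our illustration; names are hypothetical):

import numpy as np

def confidence_accuracy_curve(y_true, y_mean, y_std, n_batches=10):
    """RMSE per batch after sorting test points by ascending predictive Std."""
    order = np.argsort(y_std)                    # most confident points first
    batches = np.array_split(order, n_batches)   # 10 roughly equal parts
    return [float(np.sqrt(np.mean((y_true[b] - y_mean[b])**2))) for b in batches]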


Fig. 2. (a) RMSE of the different BHPMF samplers with increasing number of iterations; the block-wise sampler outperforms the others. (b) BHPMF for all traits and (c) MI-BHPMF for all movies, with the inverse of prediction confidence (Std) on the x-axis and the prediction error (RMSE) on the y-axis. The errors are small (more accurate) when the Std is small (more confident).

Fig. 3. Spatial coverage of (a) all observations, (b) the most confident group, and (c) the least confident group. Trait measurements in China or South Africa are more frequent in the uncertain groups (c). Additional measurements in densely covered regions like China may improve the accuracy.

Figure 2(b) illustrates the curve on the 13-trait TRY data set. It is observed that the RMSE increases monotonically with increasing Std. From the Std vs. RMSE curve, we conclude that when we are confident about our predictions, the predictions are accurate. In other words, the model's accuracy decreases monotonically with decreasing confidence, i.e., the less confidence the model has, the worse it performs. Uncertainty quantification is thus not only a tool to measure how accurate the predicted trait values are; it also identifies the areas of less confident predictions, which can in turn be used to guide field work for data collection efforts. Similar results are observed when considering each trait separately.

In order to identify areas of limited confidence, we explored the spatial coverage of the batches with different uncertainties (Figure 3). Trait measurements are scattered over a wider range of the world as we move to more uncertain batches, i.e., from Batch 1 to Batch 10. In particular, trait measurements in China or South Africa appear more often in the uncertain batches. Additional measurements, even in densely covered regions like China or South Africa, may improve the accuracy.

3) Prediction Accuracy: We also compared the point estimates derived from BHPMF with MEAN, PMF, and HPMF. As shown in Section IV-D2, the block-wise sampler outperforms the other sampler types; therefore, we only provide point estimation results for the block-wise sampler. BHPMF provides the point estimate of each missing trait value by averaging over all generated samples. The RMSE results are shown in Table I. BHPMF outperforms all the other models, which means BHPMF not only provides uncertainty quantification but also improves on the point estimates of current trait value prediction methods.

V. MULTIPLE INHERITANCE BHPMF

The multiple inheritance model we present here can be viewed as a generalization of the BHPMF model which assumes a directed acyclic graph (DAG) structured hierarchical prior, rather than the tree structure of BHPMF. In the BHPMF model, $u_n^{(\ell)}$ and $v_m^{(\ell)}$ are generated from a single Gaussian distribution. In the case of multiple inheritance, the BHPMF model can be generalized to generate each latent factor from a product of Gaussian distributions involving a subset of parents. The Markov blanket of multiple inheritance BHPMF (MI-BHPMF) is illustrated in Figure 1(b). The construction has parallels with product of experts models [26] and multiplicative mixture models (MMM) [27], [28]. A key difference here is that the DAG structure is assumed to be known, and the combinatorial inference of MMM [27], [28] is avoided.

The generalized model of MI-BHPMF at level $\ell$ is:

1) Generate $u_n^{(\ell)} \sim \prod_i \mathcal{N}(u_{p_i(n)}^{(\ell-1)}, \sigma_u^2 I)$, $[n]_1^{N^{(\ell)}}$.
2) Generate $v_{:d}^{(\ell)} \sim \mathcal{N}(v_{:d}^{(\ell-1)}, [K_v^{(\ell)}]^{-1})$, $[d]_1^{D}$.
3) Generate $x_{nm}^{(\ell)} \sim \mathcal{N}(\langle u_n^{(\ell)}, v_m^{(\ell)}\rangle, \sigma^2)$ for each non-missing entry.

where $p_i(n)$ is the $i$-th parent of $n$ in the upper level. In principle, $V$ can also have multiple inheritance; here, we discuss the case where only $U$ has multiple inheritance. The conditional probability of $u_n^{(\ell)}$ is

$$p\left(u_n^{(\ell)} \,\middle|\, x_n^{(\ell)}, V^{(\ell)}, u_{P(n)}^{(\ell-1)}, u_{C(n)}^{(\ell+1)}\right) \sim \mathcal{N}\left(u_n^{(\ell)} \,\middle|\, \mu_n^{*(\ell)}, \Sigma_n^{*(\ell)}\right)$$

where $P(n) = \{p_i(n)\}$ is the set of parent nodes of $n$,


$$\Sigma_n^{*(\ell)} = \left[\frac{|C(n)| + |P(n)|}{\sigma_u^2}\, I + \frac{\sum_m \delta_{nm}^{(\ell)}\, v_m^{(\ell)} v_m^{(\ell)T}}{\sigma^2}\right]^{-1} \quad (8)$$
$$\mu_n^{*(\ell)} = \Sigma_n^{*(\ell)}\left[\frac{\sum_i u_{p_i(n)}^{(\ell-1)} + \sum_j u_{c_j(n)}^{(\ell+1)}}{\sigma_u^2} + \frac{\sum_m \delta_{nm}^{(\ell)}\, x_{nm}^{(\ell)}\, v_m^{(\ell)}}{\sigma^2}\right].$$
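A sketch of the multiple-parent update (8), mirroring the single-parent sketch in Section III-B (hypothetical code, names ours):

import numpy as np

def sample_u_mi(x_obs, V_obs, parents, children, sigma_u, sigma, rng):
    """Draw u_n^(l) under multiple inheritance, per Eq. (8) (sketch).

    parents  : (|P(n)|, D) parent factors u_{p_i(n)}^(l-1)
    children : (|C(n)|, D) child factors u_{c_j(n)}^(l+1)
    """
    D = parents.shape[1]
    # Precision now counts parents and children: (|C(n)| + |P(n)|) / sigma_u^2.
    prec = ((len(children) + len(parents)) / sigma_u**2) * np.eye(D) \
           + (V_obs.T @ V_obs) / sigma**2
    cov = np.linalg.inv(prec)
    b = (parents.sum(axis=0) + children.sum(axis=0)) / sigma_u**2 \
        + (V_obs.T @ x_obs) / sigma**2
    return rng.multivariate_normal(cov @ b, cov)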

The sampling procedure is mostly similar to Algorithm 1, except that line 5 is replaced with the following line:

5: for $n = 1 \cdots N$, sample $u_n^{(\ell)}$ in parallel using (8):
     $u_n^{t+1(\ell)} \sim p\left(u_n^{t(\ell)} \,\middle|\, x_n^{(\ell)}, V^{t(\ell)}, u_{P(n)}^{t(\ell-1)}, u_{C(n)}^{t(\ell+1)}\right)$

We present some preliminary results of evaluating the multiple inheritance model on the MovieLens data set. The data set contains 1M ratings for 3900 movies by 6040 users. The genre of each movie has been extracted from IMDB [20]; there are 25 movie types (genres). A hierarchy over movies can be built by grouping movies by genre, where each movie may belong to more than one genre. Figure 2(c) shows the RMSE-Std curve on the MovieLens data set. Similar to BHPMF, the RMSE increases monotonically with increasing standard deviation, meaning that MI-BHPMF is accurate (small RMSE) when it is confident (small standard deviation).
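For illustration, such a genre-based DAG can be encoded as a parent map where a movie may have several parents; the data below is toy and hypothetical.

# Toy encoding of a DAG hierarchy over movies from (possibly multiple) genres.
movie_genres = {
    "Alien": ["Horror", "Sci-Fi"],   # multiple inheritance: two parents
    "Up": ["Animation"],
}
genres = sorted({g for gs in movie_genres.values() for g in gs})
genre_idx = {g: i for i, g in enumerate(genres)}
# parents[movie] lists its upper-level (genre) nodes, playing the role of P(n) in Eq. (8).
parents = {movie: [genre_idx[g] for g in gs] for movie, gs in movie_genres.items()}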

VI. CONCLUSIONS

While uncertainty quantification of a prediction is essential to understanding the prediction itself, most matrix completion methods give only a point estimate of missing entries without any uncertainty quantification. This paper shows how to derive uncertainty quantified estimates of missing values in sparse matrices. We propose BHPMF to incorporate the hierarchical side information and provide uncertainty quantified estimates of the missing values. We developed a Gibbs sampling procedure for inference in the model. We observe that block-wise sampling with a diagonal covariance as the traits' prior outperforms element-wise sampling, which uses a full covariance trait structure as the prior over traits. BHPMF with block-wise sampling provides higher point estimation accuracy than PMF, HPMF (the state-of-the-art for trait prediction), and MEAN (frequently used as a baseline in the ecology community).

We then generalized BHPMF to incorporate hierarchical multiple inheritance side information (MI-BHPMF), and showed that the Gibbs sampler readily generalizes to this setting. We hypothesized that BHPMF and MI-BHPMF are accurate (small RMSE) when they are confident (small standard deviation), whereas the error is high when the uncertainty is high. Using 13 plant traits from the world's largest plant trait database (TRY) and the MovieLens data set, we demonstrated that this hypothesis holds in all cases. Quantified uncertainty estimates based on BHPMF and MI-BHPMF thus help to identify areas of limited confidence, which can be used to inform future trait data collection efforts.

ACKNOWLEDGMENT

The authors acknowledge the support of NSF via grants IIS-0953274, IIS-1029711, IIS-0916750, and IIS-0812183.

REFERENCES

[1] J. Kattge, S. Diaz, S. Lavorel et al., "TRY – a global database of plant traits," Global Change Biology, vol. 17, no. 9, pp. 2905–2935, 2011.
[2] V. Cordlandwehr, R. L. Meredith et al., "Do plant traits retrieved from a database accurately predict on-site measurements?" Journal of Ecology, vol. 101, no. 3, pp. 662–670, 2013.
[3] E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Mørup, "Scalable tensor factorizations with missing data," in SDM, 2010.
[4] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," IEEE Computer, 2009.
[5] N. Lawrence and R. Urtasun, "Non-linear matrix factorization with Gaussian processes," in ICML, 2009.
[6] I. Porteous, A. Asuncion, and M. Welling, "Bayesian matrix factorization with side information and Dirichlet process mixtures," in AAAI, 2010.
[7] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," in NIPS, 2007.
[8] R. Salakhutdinov and A. Mnih, "Bayesian probabilistic matrix factorization using Markov chain Monte Carlo," in ICML, 2008.
[9] A. Singh and G. Gordon, "A Bayesian matrix factorization model for relational data," in UAI, 2010.
[10] R. Salakhutdinov and N. Srebro, "Collaborative filtering in a non-uniform world: Learning with the weighted trace norm," in NIPS, 2010.
[11] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, "Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization," Journal of the ACM, 2009.
[12] P. Jain, P. Netrapalli, and S. Sanghavi, "Low-rank matrix completion using alternating minimization," in Symposium on Theory of Computing, 2013.
[13] H. Shan, J. Kattge, P. B. Reich, A. Banerjee, F. Schrodt, and M. Reichstein, "Gap filling in the plant kingdom—trait prediction using hierarchical probabilistic matrix factorization," in ICML, 2012.
[14] A. K. Menon, K. Chitrapura, S. Garg, D. Agarwal, and N. Kota, "Response prediction using collaborative filtering with hierarchies and side-information," in KDD. ACM, 2011, pp. 141–149.
[15] C. Wang, J. Paisley, and D. Blei, "Online variational inference for the hierarchical Dirichlet process," in AISTATS, 2011.
[16] M. Fazel, H. Hindi, and S. Boyd, "A rank minimization heuristic with application to minimum order system approximation," in ACC, vol. 6. IEEE, 2001, pp. 4734–4739.
[17] N. Srebro, J. Rennie, and T. Jaakkola, "Maximum-margin matrix factorization," in NIPS, 2005.
[18] I. Sutskever, R. Salakhutdinov, and J. Tenenbaum, "Modelling relational data using Bayesian clustered tensor factorization," in NIPS, 2009.
[19] L. Xiong, X. Chen, T. Huang, J. G. Schneider, and J. G. Carbonell, "Temporal collaborative filtering with Bayesian probabilistic tensor factorization," in SDM, 2010.
[20] H. Shan and A. Banerjee, "Generalized probabilistic matrix factorizations for collaborative filtering," in ICDM, 2010.
[21] C. Wang and D. M. Blei, "Collaborative topic modeling for recommending scientific articles," in KDD. ACM, 2011, pp. 448–456.
[22] T. Zhou, H. Shan, A. Banerjee et al., "Kernelized probabilistic matrix factorization: Exploiting graphs and side information," in SDM, 2012.
[23] A. Agovic, A. Banerjee, and S. Chatterjee, "Probabilistic matrix addition," in ICML, 2011.
[24] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 721–741, 1984.
[25] M. Kolar and E. P. Xing, "Estimating sparse precision matrices from data with missing values," in ICML, 2012.
[26] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[27] Q. Fu and A. Banerjee, "Multiplicative mixture models for overlapping clustering," in ICDM, 2008, pp. 791–796.
[28] K. Heller and Z. Ghahramani, "A nonparametric Bayesian approach to modeling overlapping clusters," in AISTATS, 2007.
