Page 1

Maximum Likelihood Matrix Completion Under Sparse Factor Models:

Error Guarantees and Efficient Algorithms

Jarvis Haupt

Department of Electrical and Computer Engineering, University of Minnesota

Institute for Computational and Experimental Research in Mathematics (ICERM)
Workshop on Approximation, Integration, and Optimization

October 1, 2014

Page 2

Section 1

Background and Motivation

Page 3

A Classical Example

Sampling Theorem: (Whittaker/Kotelnikov/Nyquist/Shannon, 1930's-1950's)

Original Signal (Red); Samples (Black)

Accurate Recovery (and Imputation) via Ideal Low-Pass Filtering

when Original Signal is Bandlimited

Basic "Formula" for Inference: To draw inferences from limited data (or here, to impute missing elements), one needs to leverage underlying structure in the signal being inferred.

Page 4

A Contemporary Example

Matrix Completion: (Candes & Recht; Keshavan et al.; Candes & Tao; Candes & Plan; Negahban & Wainwright; Koltchinskii et al.; Davenport et al.; ... 2009-)

Samples

Accurate Recovery (and Imputation) via Convex Optimization

when Original Matrix is Low-Rank

The low-rank modeling assumption is commonly utilized in collaborative filtering applications (e.g. the Netflix prize), to describe settings where each observed value depends on only a few latent factors or features.

Page 5

Beyond Low Rank Models?

Low-Rank Models: All columns of the matrix are well-approximated as vectors in a common linear subspace.

Union of Subspaces Model: All columns of the matrix are well-approximated as vectors in a union of linear subspaces.

Union of subspaces models are at the essence of sparse subspace clustering (Elhamifar & Vidal; Soltanolkotabi et al.; Eriksson et al.; Balzano et al.) and dictionary learning (Olshausen & Field; Aharon et al.; Mairal et al.; ...).

Here, we examine the efficacy of such models in matrix completion tasks.

Page 6

Section 2

Problem Statement

Page 7

"Sparse Factor" Data Models

We assume the unknown $X^* \in \mathbb{R}^{n_1 \times n_2}$ we seek to estimate admits a factorization of the form

$$X^* = D^* A^*, \qquad D^* \in \mathbb{R}^{n_1 \times r}, \ A^* \in \mathbb{R}^{r \times n_2},$$

where

• $\|D^*\|_{\max} \triangleq \max_{i,j} |D^*_{i,j}| \leq 1$ (essentially to fix scaling ambiguities)

• $\|A^*\|_{\max} \leq A_{\max}$ for a constant $0 < A_{\max} \leq (n_1 \vee n_2)$

• $\|X^*\|_{\max} \leq X_{\max}/2$ for a constant $X_{\max} \geq 1$

Our Focus: Sparse factor models, characterized by (approximately or exactly) sparse $A^*$.
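
For intuition, here is a minimal sketch (my own illustration, not from the slides) of generating a matrix satisfying this sparse factor model; the dimensions and per-column sparsity mirror the synthetic experiment later in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 100, 1000, 20        # ambient dimensions and inner dimension r
k = 4                            # nonzeros per column of A*
A_max = 1.0

D_star = rng.uniform(-1.0, 1.0, size=(n1, r))      # enforces ||D*||_max <= 1
A_star = np.zeros((r, n2))
for j in range(n2):                                 # exactly k-sparse columns of A*
    support = rng.choice(r, size=k, replace=False)
    A_star[support, j] = rng.uniform(-A_max, A_max, size=k)

X_star = D_star @ A_star   # X* = D* A*; rescale if ||X*||_max <= X_max/2 must hold exactly
```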

Page 9

Observation Model

We observe $X^*$ only at a subset $S \subseteq \{1, 2, \ldots, n_1\} \times \{1, 2, \ldots, n_2\}$ of its locations. For some $\gamma \in (0, 1]$, each $(i,j)$ is in $S$ independently with probability $\gamma$; we interpret $\gamma = m (n_1 n_2)^{-1}$, so that $m = \gamma n_1 n_2$ is the nominal number of observations.

The observations $\{Y_{i,j}\}_{(i,j) \in S} \triangleq Y_S$ are conditionally independent given $S$, and are modeled via the joint density

$$p_{X^*_S}(Y_S) = \prod_{(i,j) \in S} \underbrace{p_{X^*_{i,j}}(Y_{i,j})}_{\text{scalar densities}}.$$
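
A small sketch of this observation model (my own; the function name and the use of NaN placeholders for unobserved entries are assumptions, and any of the scalar likelihoods discussed later can be plugged in as noise_fn):

```python
import numpy as np

def sample_observations(X_star, gamma, noise_fn, rng):
    """Keep each entry (i,j) independently with probability gamma, and return
    noisy observations Y at the kept locations (NaN elsewhere) plus the mask S."""
    n1, n2 = X_star.shape
    mask = rng.random((n1, n2)) < gamma
    Y = np.where(mask, noise_fn(X_star, rng), np.nan)
    return Y, mask

# e.g. Gaussian observations (anticipating the AWGN model discussed later):
# Y, mask = sample_observations(X_star, gamma=0.4,
#     noise_fn=lambda X, r: X + 0.25 * r.standard_normal(X.shape),
#     rng=np.random.default_rng(1))
```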

Page 10

Estimation Approach

We estimate $X^*$ via a sparsity-penalized maximum likelihood approach: for $\lambda > 0$, we take

$$\widehat{X} = \arg\min_{X = DA \in \mathcal{X}} \left\{ -\log p_{X_S}(Y_S) + \lambda \cdot \|A\|_0 \right\}.$$

The set $\mathcal{X}$ of candidate reconstructions is any subset of $\mathcal{X}'$, where

$$\mathcal{X}' \triangleq \left\{ X = DA : D \in \mathcal{D},\ A \in \mathcal{A},\ \|X\|_{\max} \leq X_{\max} \right\},$$

where

• $\mathcal{D}$: the set of all matrices $D \in \mathbb{R}^{n_1 \times r}$ whose elements are discretized to one of $L$ uniformly-spaced values in the range $[-1, 1]$

• $\mathcal{A}$: the set of all matrices $A \in \mathbb{R}^{r \times n_2}$ whose elements either take the value zero, or are discretized to one of $L$ uniformly-spaced values in the range $[-A_{\max}, A_{\max}]$
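
As a concrete reading of this objective, here is a sketch (my own; the discretized sets and the max-norm constraint are omitted for brevity, and neg_log_lik is an assumed callable implementing one of the scalar models discussed later):

```python
import numpy as np

def penalized_objective(D, A, Y, mask, lam, neg_log_lik):
    """Evaluate -log p_{X_S}(Y_S) + lam * ||A||_0 for the candidate X = DA,
    where neg_log_lik(x, y) returns the scalar value -log p_x(y)."""
    X = D @ A
    data_fit = np.sum(neg_log_lik(X[mask], Y[mask]))   # sum over observed (i,j) in S
    return data_fit + lam * np.count_nonzero(A)

# e.g. Gaussian noise with variance sigma2 (up to an additive constant):
# obj = penalized_objective(D, A, Y, mask, lam=0.1,
#                           neg_log_lik=lambda x, y: (y - x)**2 / (2 * sigma2))
```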

Page 12

Section 3

Error Bounds

Page 13

A General "Sparse Factor" Matrix Completion Error Guarantee

Theorem (A. Soni, S. Jain, J.H., and S. Gonella, 2014)

Let $\beta > 0$ and set $L = (n_1 \vee n_2)^{\beta}$. If $C_D$ satisfies $C_D \geq \max_{X \in \mathcal{X}} \max_{i,j} D(p_{X^*_{i,j}} \| p_{X_{i,j}})$, then for any $\lambda \geq 2 (\beta + 2) \left(1 + \frac{2 C_D}{3}\right) \log(n_1 \vee n_2)$, the sparsity-penalized ML estimate

$$\widehat{X} = \arg\min_{X = DA \in \mathcal{X}} \left\{ -\log p_{X_S}(Y_S) + \lambda \cdot \|A\|_0 \right\}$$

satisfies the (normalized, per-element) error bound

$$\frac{\mathbb{E}_{S, Y_S}\left[ -2 \log A(p_{\widehat{X}}, p_{X^*}) \right]}{n_1 n_2} \leq \frac{8 C_D \log m}{m} + 3 \min_{X = DA \in \mathcal{X}} \left\{ \frac{D(p_{X^*} \| p_X)}{n_1 n_2} + \left( \lambda + \frac{4 C_D (\beta + 2) \log(n_1 \vee n_2)}{3} \right) \left( \frac{n_1 r + \|A\|_0}{m} \right) \right\}.$$

Here:

$A(p_X, p_{X^*}) \triangleq \prod_{i,j} A(p_{X_{i,j}}, p_{X^*_{i,j}})$, where $A(p_{X_{i,j}}, p_{X^*_{i,j}}) \triangleq \mathbb{E}_{p_{X^*_{i,j}}}\left[ \sqrt{p_{X_{i,j}} / p_{X^*_{i,j}}} \right]$ is the Hellinger affinity;

$D(p_{X^*} \| p_X) \triangleq \sum_{i,j} D(p_{X^*_{i,j}} \| p_{X_{i,j}})$, where $D(p_{X^*_{i,j}} \| p_{X_{i,j}}) \triangleq \mathbb{E}_{p_{X^*_{i,j}}}\left[ \log(p_{X^*_{i,j}} / p_{X_{i,j}}) \right]$ is the KL divergence.

Next, we instantiate this result for some specific cases (using a specific choice of $\beta$, $\lambda$).

Page 15

Additive White Gaussian Noise Model

Suppose each observation is corrupted by zero-mean AWGN with known variance $\sigma^2$, so that

$$p_{X^*_S}(Y_S) = \frac{1}{(2\pi\sigma^2)^{|S|/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{(i,j) \in S} (Y_{i,j} - X^*_{i,j})^2 \right).$$

Let $\mathcal{X} = \mathcal{X}'$, essentially (a discretization of) a set of rank- and max-norm-constrained matrices.

Gaussian Noise (Exact Sparse Factor Model)

If $A^*$ is exactly sparse with $\|A^*\|_0$ nonzero elements, the sparsity-penalized ML estimate satisfies

$$\frac{\mathbb{E}_{S, Y_S}\left[ \|X^* - \widehat{X}\|_F^2 \right]}{n_1 n_2} = O\left( (\sigma^2 + X_{\max}^2) \left( \frac{n_1 r + \|A^*\|_0}{m} \right) \log(n_1 \vee n_2) \right).$$

Page 16

AWGN – Our Result in Context

Gaussian Noise (Exact Sparse Factor Model)

If $A^*$ is exactly sparse with $\|A^*\|_0$ nonzero elements, the sparsity-penalized ML estimate satisfies

$$\frac{\mathbb{E}_{S, Y_S}\left[ \|X^* - \widehat{X}\|_F^2 \right]}{n_1 n_2} = O\left( (\sigma^2 + X_{\max}^2) \left( \frac{n_1 r + \|A^*\|_0}{m} \right) \log(n_1 \vee n_2) \right).$$

Compare with the result of (Koltchinskii et al., 2011): when $X^*$ is max-norm- and rank-constrained, nuclear-norm penalized optimization yields an estimate satisfying

$$\frac{\|X^* - \widehat{X}\|_F^2}{n_1 n_2} = O\left( (\sigma^2 + X_{\max}^2) \left( \frac{(n_1 + n_2) r}{m} \right) \log(n_1 \vee n_2) \right)$$

with high probability.

Note: Our guarantees can have improved error performance in the case where $\|A^*\|_0 \ll n_2 r$. The two bounds coincide when $A^*$ is not sparse (take $\|A^*\|_0 = n_2 r$ in our error bounds).
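
As a rough, back-of-the-envelope illustration of this comparison (my own; constants and the common noise and log factors are ignored), the two sampling-complexity factors can be evaluated for the dimensions of the synthetic experiment later in the talk:

```python
n1, n2, r = 100, 1000, 20       # dimensions from the synthetic experiment
A_star_nnz = 4 * n2             # ||A*||_0 with 4 nonzeros per column
m = int(0.4 * n1 * n2)          # roughly 40% sampling rate

sparse_factor_rate = (n1 * r + A_star_nnz) / m     # factor in our bound
low_rank_rate = (n1 + n2) * r / m                  # factor in the low-rank bound

print(f"sparse-factor rate ~ {sparse_factor_rate:.2f}")   # ~0.15
print(f"low-rank rate      ~ {low_rank_rate:.2f}")        # ~0.55
```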

Page 17

AWGN Model (Extension to Approximately Sparse Factor Model)

Recall: For $p \leq 1$, a vector $x \in \mathbb{R}^n$ is said to belong to a weak-$\ell_p$ ball of radius $R > 0$, denoted $x \in w\ell_p(R)$, if its ordered elements $|x_{(1)}| \geq |x_{(2)}| \geq \cdots \geq |x_{(n)}|$ satisfy

$$|x_{(i)}| \leq R\, i^{-1/p} \quad \text{for all } i \in \{1, 2, \ldots, n\}.$$

With this, we can state a variant of the above for the case where the columns of $A^*$ are approximately sparse.

Gaussian Noise (Approximately Sparse Factor Model)

Consider the same Gaussian noise model described above. If for some $p \leq 1$ all columns of $A^*$ belong to a weak-$\ell_p$ ball of radius $A_{\max}$, then for $\alpha = 1/p - 1/2$,

$$\frac{\mathbb{E}_{S, Y_S}\left[ \|X^* - \widehat{X}\|_F^2 \right]}{n_1 n_2} = O\left( A_{\max}^2 \left( \frac{n_2}{m} \right)^{\frac{2\alpha}{2\alpha + 1}} + (\sigma^2 + X_{\max}^2) \left( \frac{n_1 r}{m} + \left( \frac{n_2}{m} \right)^{\frac{2\alpha}{2\alpha + 1}} \right) \log(n_1 \vee n_2) \right).$$

Note: $\left( \frac{n_2}{m} \right)^{\frac{2\alpha}{2\alpha + 1}} \leq n_2\, m^{-\frac{2\alpha}{2\alpha + 1}}$ $\Leftarrow$ aggregate error of estimating $n_2$ vectors in $w\ell_p$ from noisy observations.
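
For concreteness, here is a small sketch (my own; the function name and the NumPy-based check are assumptions) that tests the weak-$\ell_p$ ball condition above for a given vector:

```python
import numpy as np

def in_weak_lp_ball(x, p, R):
    """Check whether x lies in the weak-l_p ball of radius R,
    i.e. the i-th largest magnitude is at most R * i**(-1/p)."""
    mags = np.sort(np.abs(x))[::-1]              # |x_(1)| >= |x_(2)| >= ...
    i = np.arange(1, x.size + 1)
    return bool(np.all(mags <= R * i ** (-1.0 / p)))

# Example: a geometrically decaying vector satisfies the condition for p = 1, R = 2
x = 2.0 * 0.5 ** np.arange(10)
print(in_weak_lp_ball(x, p=1.0, R=2.0))          # True
```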

Page 19

Additive Laplace Noise Model

Suppose each observation is corrupted by additive Laplace noise with known parameter $\tau > 0$, so that

$$p_{X^*_S}(Y_S) = \left( \frac{\tau}{2} \right)^{|S|} \exp\left( -\tau \sum_{(i,j) \in S} |Y_{i,j} - X^*_{i,j}| \right).$$

Let $\mathcal{X} = \mathcal{X}'$, essentially (a discretization of) a set of rank- and max-norm-constrained matrices.

Laplace Noise (Exact Sparse Factor Model)

If $A^*$ is exactly sparse with $\|A^*\|_0$ nonzero elements, the sparsity-penalized ML estimate satisfies

$$\frac{\mathbb{E}_{S, Y_S}\left[ \|X^* - \widehat{X}\|_F^2 \right]}{n_1 n_2} = O\Bigg( \underbrace{\left( \frac{1}{\tau^2} + X_{\max}^2 \right)}_{O(\text{variance} \,+\, X_{\max}^2)} \; \tau X_{\max} \underbrace{\left( \frac{n_1 r + \|A^*\|_0}{m} \right)}_{\substack{\text{"parametric-like" form, similar to the} \\ \text{sparse-model Gaussian-noise case}}} \log(n_1 \vee n_2) \Bigg).$$

Can also obtain results for the approximately sparse case here, analogously to above...

Page 20

Poisson-distributed Observations

Suppose that each element of $X^*$ satisfies $X^*_{i,j} \geq X_{\min}$ for some $X_{\min} > 0$, and that each observation is Poisson-distributed, so that $Y_S \in \mathbb{N}^{|S|}$ and

$$p_{X^*_S}(Y_S) = \prod_{(i,j) \in S} \frac{(X^*_{i,j})^{Y_{i,j}}\, e^{-X^*_{i,j}}}{(Y_{i,j})!}.$$

Let $\mathcal{X} = \{ X \in \mathcal{X}' : X_{i,j} \geq 0 \text{ for all } (i,j) \in \{1, \ldots, n_1\} \times \{1, \ldots, n_2\} \}$ (to allow only non-negative rate estimates).

Poisson-distributed Observations (Exact Sparse Factor Model)

If $A^*$ is exactly sparse with $\|A^*\|_0$ nonzero elements, the sparsity-penalized ML estimate satisfies

$$\frac{\mathbb{E}_{S, Y_S}\left[ \|X^* - \widehat{X}\|_F^2 \right]}{n_1 n_2} = O\Bigg( \underbrace{\left( X_{\max} + X_{\max}^2\, \frac{X_{\max}}{X_{\min}} \right)}_{\substack{O(\text{worst-case variance} \,+\, X_{\max}^2) \\ \text{when } X_{\max}/X_{\min} = O(1)}} \left( \frac{n_1 r + \|A^*\|_0}{m} \right) \log(n_1 \vee n_2) \Bigg).$$

Can also obtain results for the approximately sparse case here, analogously to above...

Page 21

One-bit Observations

Let $F : \mathbb{R} \to [0, 1]$ be a differentiable link function with $f(t) = \frac{d}{dt} F(t)$. Suppose each observation $Y_{i,j}$ for $(i,j) \in S$ is Bernoulli$(F(X^*_{i,j}))$-distributed, so that

$$p_{X^*_S}(Y_S) = \prod_{(i,j) \in S} \left[ F(X^*_{i,j}) \right]^{Y_{i,j}} \left[ 1 - F(X^*_{i,j}) \right]^{1 - Y_{i,j}}.$$

Assume $F(X_{\max}) < 1$, $F(-X_{\max}) > 0$, and $\inf_{|t| \leq X_{\max}} f(t) > 0$.

One-bit Observations (Exact Sparse Factor Model)

If $A^*$ is exactly sparse with $\|A^*\|_0$ nonzero elements, the sparsity-penalized ML estimate satisfies

$$\frac{\mathbb{E}_{S, Y_S}\left[ \|X^* - \widehat{X}\|_F^2 \right]}{n_1 n_2} = O\left( \left( \frac{c_{F, X_{\max}}}{c'_{F, X_{\max}}} \right) \left( \frac{1}{c_{F, X_{\max}}} + X_{\max}^2 \right) \left( \frac{n_1 r + \|A^*\|_0}{m} \right) \log(n_1 \vee n_2) \right),$$

where

$$c_{F, X_{\max}} \triangleq \left( \sup_{|t| \leq X_{\max}} \frac{1}{F(t)(1 - F(t))} \right) \cdot \left( \sup_{|t| \leq X_{\max}} f^2(t) \right), \qquad c'_{F, X_{\max}} \triangleq \inf_{|t| \leq X_{\max}} \frac{f^2(t)}{F(t)(1 - F(t))}.$$

Can also obtain results for the approximately sparse case here, analogously to above...
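
As a quick numerical illustration (my own, not in the slides), the constants $c_{F, X_{\max}}$ and $c'_{F, X_{\max}}$ can be evaluated for a logistic link $F(t) = 1/(1 + e^{-t})$, for which $f(t) = F(t)(1 - F(t))$ and the relevant suprema and infima over $|t| \leq X_{\max}$ are attained at $t = 0$ or at the endpoints:

```python
import numpy as np

def logistic_link_constants(x_max):
    """Evaluate c_{F,Xmax} and c'_{F,Xmax} for the logistic link F(t) = 1/(1+exp(-t)).
    Here f(t) = F(t)(1-F(t)), which is maximized at t = 0 and minimized on
    [-x_max, x_max] at the endpoints t = +/- x_max."""
    F = lambda t: 1.0 / (1.0 + np.exp(-t))
    f = lambda t: F(t) * (1.0 - F(t))             # derivative of the logistic link
    c = (1.0 / f(x_max)) * f(0.0) ** 2            # sup 1/(F(1-F)) times sup f^2
    c_prime = f(x_max)                            # f^2/(F(1-F)) = f, minimized at the endpoints
    return c, c_prime

print(logistic_link_constants(1.0))
```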

Page 22

Comparisons to "One-bit Matrix Completion"

One-bit Observations (Exact Sparse Factor Model)

If $A^*$ is exactly sparse with $\|A^*\|_0$ nonzero elements, the sparsity-penalized ML estimate satisfies

$$\frac{\mathbb{E}_{S, Y_S}\left[ \|X^* - \widehat{X}\|_F^2 \right]}{n_1 n_2} = O\left( \left( \frac{c_{F, X_{\max}}}{c'_{F, X_{\max}}} \right) \left( \frac{1}{c_{F, X_{\max}}} + X_{\max}^2 \right) \left( \frac{n_1 r + \|A^*\|_0}{m} \right) \log(n_1 \vee n_2) \right).$$

Compare with the low-rank recovery result of (Davenport et al., 2012): maximum likelihood optimization over a set of max-norm- and nuclear-norm-constrained candidates yields an estimate satisfying

$$\frac{\|X^* - \widehat{X}\|_F^2}{n_1 n_2} = O\left( C_{F, X_{\max}}\, X_{\max} \sqrt{\frac{(n_1 + n_2) r}{m}} \right)$$

with high probability, where $C_{F, X_{\max}}$ is analogous to the $(c_{F, X_{\max}} / c'_{F, X_{\max}})$ factor in our bounds.

Extra loss of $X_{\max} \log(n_1 \vee n_2)$ in our bound, but faster "parametric-like" dependence on $m$ (in addition to the "sparse factor" improvement). Lower bounds for the "sparse factor" model are still open (we think!).

Page 23

Section 4

Algorithmic Approach

Page 24

A Non-Convex Problem...

Our optimizations take the general form

$$\min_{D \in \mathbb{R}^{n_1 \times r},\, A \in \mathbb{R}^{r \times n_2},\, X \in \mathbb{R}^{n_1 \times n_2}} \ \sum_{i,j} -s_{i,j} \log p_{X_{i,j}}(Y_{i,j}) + I_{\mathcal{X}}(X) + I_{\mathcal{D}}(D) + I_{\mathcal{A}}(A) + \lambda \|A\|_0 \quad \text{s.t.} \quad X = DA,$$

where $s_{i,j} = 1$ if $(i,j) \in S$ (and $0$ otherwise) and $I_{\mathcal{X}}(\cdot)$, $I_{\mathcal{D}}(\cdot)$, $I_{\mathcal{A}}(\cdot)$ are indicator functions.

Multiple sources of non-convexity:

• the $\ell_0$ regularizer

• the discretized sets $\mathcal{D}$ and $\mathcal{A}$

• the inherent bilinearity of the model!

We propose an approach based on the Alternating Direction Method of Multipliers (ADMM).

Page 27

A General-Purpose ADMM-based Approach

We form the augmented Lagrangian

$$\mathcal{L}(D, A, X, \Lambda) = -\sum_{i,j} s_{i,j} \log p_{X_{i,j}}(Y_{i,j}) + I_{\mathcal{X}}(X) + I_{\mathcal{D}}(D) + I_{\mathcal{A}}(A) + \lambda \|A\|_0 + \mathrm{tr}\left( \Lambda^\top (X - DA) \right) + \frac{\rho}{2} \|X - DA\|_F^2,$$

where $\Lambda$ is the Lagrange multiplier for the equality constraint and $\rho > 0$ is a parameter, and solve:

(S1) $\quad X^{k+1} := \arg\min_{X \in \mathbb{R}^{n_1 \times n_2}} \mathcal{L}(D^k, A^k, X, \Lambda^k)$

(S2) $\quad A^{k+1} := \arg\min_{A \in \mathbb{R}^{r \times n_2}} \mathcal{L}(D^k, A, X^{k+1}, \Lambda^k)$

(S3) $\quad D^{k+1} := \arg\min_{D \in \mathbb{R}^{n_1 \times r}} \mathcal{L}(D, A^{k+1}, X^{k+1}, \Lambda^k)$

(S4) $\quad \Lambda^{k+1} = \Lambda^k + \rho (X^{k+1} - D^{k+1} A^{k+1})$.
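
A minimal, self-contained sketch of the resulting loop, assuming a Gaussian likelihood (so that S1 has the closed form shown on the next slide), with simple box projections standing in for the indicator functions and a single hard-thresholding pass for S2; the function name, defaults, and these simplifications are my own, not the authors' implementation:

```python
import numpy as np

def admm_sparse_factor_gaussian(Y, mask, r, lam, sigma2, rho=1.0,
                                A_max=1.0, X_max=1.0, n_iters=200, seed=0):
    """Minimal ADMM sketch for the sparse-factor completion objective with a
    Gaussian likelihood. Steps S1-S4 follow the slide above, but each
    subproblem gets only a single simplified pass per outer iteration."""
    rng = np.random.default_rng(seed)
    n1, n2 = Y.shape
    D = rng.uniform(-1.0, 1.0, size=(n1, r))
    A = rng.uniform(-A_max, A_max, size=(r, n2))
    Lam = np.zeros((n1, n2))
    s = mask.astype(float)                        # s_ij = 1 iff (i,j) was observed
    Yf = np.nan_to_num(Y)                         # unobserved entries never enter the fit

    for _ in range(n_iters):
        # S1: elementwise prox of the Gaussian negative log-likelihood (closed form)
        Z = D @ A - Lam / rho
        X = (s * Yf / sigma2 + rho * Z) / (s / sigma2 + rho)
        X = np.clip(X, -X_max, X_max)             # box projection standing in for I_X

        # S2: one hard-thresholding (majorization-minimization) pass for A
        T = X + Lam / rho
        step = 1.0 / max(np.linalg.norm(D, 2) ** 2, 1e-12)
        A = A + step * (D.T @ (T - D @ A))        # gradient step on (rho/2)||T - DA||_F^2
        A[A ** 2 < 2.0 * lam * step / rho] = 0.0  # hard threshold (prox of the l0 term)
        A = np.clip(A, -A_max, A_max)

        # S3: least-squares update of D, projected onto the box [-1, 1]
        D = np.clip(T @ np.linalg.pinv(A), -1.0, 1.0)

        # S4: dual ascent on the coupling constraint X = DA
        Lam = Lam + rho * (X - D @ A)

    return D, A, D @ A
```

With Y and mask produced as in the earlier observation-model sketch, a call such as admm_sparse_factor_gaussian(Y, mask, r=20, lam=0.1, sigma2=0.0625) returns factor estimates and the completed matrix D @ A; these parameter values are illustrative only.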

Page 28

Efficiently Solvable Subproblems

We relax $\mathcal{D}$, $\mathcal{A}$, $\mathcal{X}$ to closed convex sets, and solve S1-S4 iteratively, as follows...

Step S1: After simplification, the solution can be written in terms of scalar prox functions:

$$X^{k+1}_{i,j} = \arg\min_{X_{i,j} \in \mathbb{R}} -s_{i,j} \log p_{X_{i,j}}(Y_{i,j}) + I_{\mathcal{X}}(X_{i,j}) + \frac{\rho}{2} \left( X_{i,j} - (D^k A^k)_{i,j} + (\Lambda^k)_{i,j}/\rho \right)^2 \triangleq \mathrm{prox}_{-s_{i,j} \log p_{\cdot}(Y_{i,j}) + I_{\mathcal{X}}(\cdot)} \left( (D^k A^k)_{i,j} - (\Lambda^k)_{i,j}/\rho \right).$$

(Closed form for three of our examples; use Newton's Method for the one-bit model with a probit or logit link.)

Step S2: The subproblem takes the form

$$\min_{A \in \mathbb{R}^{r \times n_2}} \ I_{\mathcal{A}}(A) + \lambda \|A\|_0 + \frac{\rho}{2} \|X^{k+1} - D^k A + \Lambda^k / \rho\|_F^2.$$

(Solved via "majorization-minimization"; Iterative Hard Thresholding (Blumensath & Davies, 2008).)

Step S3: The subproblem takes the form

$$\min_{D \in \mathbb{R}^{n_1 \times r}} \ I_{\mathcal{D}}(D) + \frac{\rho}{2} \|X^{k+1} - D A^{k+1} + \Lambda^k / \rho\|_F^2.$$

(Efficiently solved via Newton's Method or in closed form.)
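
For the one-bit model with a logit link, the scalar prox in Step S1 has no closed form; a minimal sketch of the Newton iteration alluded to above (my own; the function name, fixed iteration count, and initialization are assumptions) might look like:

```python
import numpy as np

def prox_onebit_logit(y, s, v, rho, x_max, n_newton=20):
    """Scalar prox for Step S1 under the one-bit (Bernoulli-logit) model:
    minimize over x:  -s * [y*log F(x) + (1-y)*log(1-F(x))] + (rho/2)*(x - v)**2,
    with F(x) = 1/(1+exp(-x)), followed by projection onto [-x_max, x_max].
    Here v plays the role of (D^k A^k)_{ij} - (Lambda^k)_{ij}/rho from the slide above."""
    x = v
    for _ in range(n_newton):
        F = 1.0 / (1.0 + np.exp(-x))
        grad = -s * (y - F) + rho * (x - v)       # derivative of the objective
        hess = s * F * (1.0 - F) + rho            # strictly positive, so Newton is safe
        x -= grad / hess
    return float(np.clip(x, -x_max, x_max))
```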

Page 31

Section 5

Experimental Results

Page 32

A Comparison with Synthetic Data

Preliminary Experimental Results: We evaluated each of these methods on matrices of size $100 \times 1000$ with $r = 20$ and 4 nonzero elements per column of $A^*$, for varying sampling rates (and different likelihood models). For each, we evaluated the average (over 5 trials) normalized reconstruction error as a function of the sampling rate.

Gaussian and Laplace noises have the same variances.

For sampling rates $\gamma > 10^{-0.4} \approx 40\%$, the error exhibits the predicted decay (slope of $\approx -1$ on the log-log scale).

[Figure: average normalized reconstruction error, $\log_{10}\!\big(\mathbb{E}\|\widehat{X} - X^*\|_F^2 / (n_1 n_2)\big)$, versus $\log_{10}(\gamma)$, for the Gaussian, Laplace, and Poisson observation models.]

Page 33

Imaging Example – Gaussian Noise

Original $512 \times 512$ image reshaped into a $256 \times 1024$ matrix ($0.005 \leq X^*_{i,j} \leq 1.05$ for all $i, j$)

Inner dimension r = 25, noise standard deviation: σ = 0.01, sampling rate = 50%

[Figure panels: Original Image, Samples, Estimated Image, Estimated A]

Page 34

Imaging Example – Laplace Noise

Original $512 \times 512$ image reshaped into a $256 \times 1024$ matrix ($0.005 \leq X^*_{i,j} \leq 1.05$ for all $i, j$)

Inner dimension $r = 25$, noise standard deviation $\sqrt{2}/\tau = 0.01$, sampling rate = 50%

[Figure panels: Original Image, Samples, Estimated Image, Estimated A]

Page 35

Imaging Example – Poisson-distributed Observations

Original $512 \times 512$ image reshaped into a $256 \times 1024$ matrix ($0.005 \leq X^*_{i,j} \leq 1.05$ for all $i, j$)

Inner dimension r = 25, sampling rate = 50%

[Figure panels: Original Image, Samples, Estimated Image, Estimated A]

Page 36

Imaging Example – One-bit Observations

Original $512 \times 512$ image reshaped into a $256 \times 1024$ matrix ($0.005 \leq X^*_{i,j} \leq 1.05$ for all $i, j$)

Inner dimension r = 25, sampling rate = 50%

[Figure panels: Original Image, Samples, Estimated Image, Estimated A]

Page 37

Section 6

Acknowledgments

Page 38

Acknowledgments

Collaborators/Co-authors:

Akshay Soni (UMN ECE PhD Student), Swayambhoo Jain (UMN ECE PhD Student), and Prof. Stefano Gonella (UMN Civil Engr.)

Research Support: NSF EARS (Enhancing Access to the Radio Spectrum) Program; DARPA Young Faculty Award

[email protected]

www.ece.umn.edu/~jdhaupt

(Special thanks to Prof. Julian Wolfson, UMN Dept. of Biostatistics, for the Beamer Template!)
