
Latent Feature Lasso
Ian E.H. Yen¹, Wei-Cheng Lee², Sung-En Chang², Arun S. Suggala¹, Shou-De Lin² and Pradeep Ravikumar¹

¹Carnegie Mellon University. ²National Taiwan University

Abstract

▶ In this work, we propose a novel convex estimator (Latent Feature Lasso) for the Latent Feature Model.

▶ To the best of our knowledge, this is the first method for LFMs with low-order polynomial runtime and sample complexity that requires no restrictive assumptions on the data distribution.

▶ In experiments, the Latent Feature Lasso significantly outperforms other methods when the number of latent features is large.

▶ The method enjoys an $O(ND + DK^2)$ runtime per iteration, more scalable than the typical $O(NDK^2)$ of existing approaches.

Latent Feature Models

▶ The Latent Feature Model (LFM) is a generalization of the Mixture Model, where each observation is an additive combination of latent features.

[Figure: the analogy between settings. Discriminative: Multiclass Classification vs. Multilabel Classification. Generative: Mixture Model vs. Latent Feature Model.]

▶ In the Latent Feature Model, each observation satisfies
$$x_n = W^\top z_n + \epsilon_n$$
where $x_n \in \mathbb{R}^D$ is the observation, $W \in \mathbb{R}^{K \times D}$ the feature dictionary, $z_n \in \{0,1\}^K$ the binary latent indicators, and $\epsilon_n \in \mathbb{R}^D$ the noise.

▶ The Mixture Model is a special case with $\|z_n\|_0 = 1$.
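
For concreteness, here is a minimal sketch of sampling from this model; the Bernoulli prior on $z_n$, the Gaussian dictionary, and the noise level are illustrative assumptions, not specified by the poster.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 500, 20, 5                          # samples, ambient dimension, latent features
sigma = 0.1                                   # assumed noise level (illustrative)

W = rng.normal(size=(K, D))                   # feature dictionary W: K x D
Z = (rng.random((N, K)) < 0.3).astype(float)  # binary indicators z_n, here Bernoulli(0.3)
eps = sigma * rng.normal(size=(N, D))         # noise eps_n

X = Z @ W + eps                               # row n is x_n = W^T z_n + eps_n
```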

Related Works & Results

▶ Goal: Find a dictionary $W: K \times D$ and latent indicators $Z: N \times K$ that best approximate the observations $X: N \times D$.

▶ Existing Approaches:
  ▶ MCMC, Variational (Indian Buffet Process): no finite-time guarantee.
  ▶ Spectral Method (Tung, 2014): $O(DK^6)$ sample complexity, assuming $z \sim \mathrm{Ber}(\pi)$, $x \sim \mathcal{N}(W^\top z, \sigma)$.
  ▶ Matrix Factorization (Slawski et al., 2013): $O(NK\,2^K)$ runtime complexity for exact recovery (noiseless).
▶ This Paper:
  ▶ A convex estimator: the Latent Feature Lasso.
  ▶ Low-order polynomial runtime and sample complexity.
  ▶ No restrictive assumption on $p(X)$; even allows model mis-specification.

Convex Formulation via Atomic Norm

▶ Empirical Risk Minimization:
$$\min_{Z \in \{0,1\}^{N \times K}} \; \min_{W \in \mathbb{R}^{K \times D}} \; \frac{1}{2N}\|X - ZW\|_F^2 + \frac{\tau}{2}\|W\|_F^2,$$
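
For fixed $Z$, the inner minimization over $W$ is a ridge regression with the closed form $W = (Z^\top Z + N\tau I)^{-1} Z^\top X$; a minimal sketch, with illustrative variable names:

```python
import numpy as np

def fit_dictionary(X, Z, tau):
    """Closed-form ridge solution of min_W 1/(2N)||X - ZW||_F^2 + (tau/2)||W||_F^2."""
    N, K = Z.shape
    return np.linalg.solve(Z.T @ Z + N * tau * np.eye(K), Z.T @ X)

# e.g. W_hat = fit_dictionary(X, Z, tau=0.1) with the synthetic data above
```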

▶ Given $Z$, the dual problem w.r.t. $W$ is:
$$\min_{M = ZZ^\top \in \{0,1\}^{N \times N}} \; \underbrace{\max_{A \in \mathbb{R}^{N \times D}} \; -\frac{1}{2N^2\tau}\,\mathrm{tr}(AA^\top M) - \frac{1}{N}\sum_{i=1}^{N} L^*(x_i, -A_{i,:})}_{g(M)}.$$

▶ Key insight: the function $g(M)$ is convex w.r.t. $M$.
▶ Enforce the structure $M = ZZ^\top$ via an atomic norm.

▶ Let $\mathcal{S} := \{k \mid z_k \in \{0,1\}^N\}$. We define the Atomic Norm:
$$\|M\|_{\mathcal{S}} := \min_{c \geq 0} \; \sum_{k \in \mathcal{S}} c_k \quad \text{s.t.} \quad M = \sum_{k \in \mathcal{S}} c_k z_k z_k^\top.$$
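
For very small $N$, this norm can be computed exactly by enumerating all $2^N - 1$ nonzero atoms and solving the resulting linear program; a brute-force sketch for illustration only (it is not part of the estimator, and the example matrix is made up):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def atomic_norm(M):
    """Brute-force ||M||_S = min sum_k c_k  s.t.  M = sum_k c_k z_k z_k^T, c >= 0 (tiny N only)."""
    N = M.shape[0]
    atoms = [np.outer(z, z).ravel()
             for z in itertools.product([0, 1], repeat=N) if any(z)]
    A_eq = np.stack(atoms, axis=1)            # N^2 x (2^N - 1), one column per atom
    res = linprog(c=np.ones(A_eq.shape[1]), A_eq=A_eq, b_eq=M.ravel(),
                  bounds=(0, None), method="highs")
    return res.fun if res.success else np.inf

z1, z2 = np.array([1, 1, 0]), np.array([0, 1, 1])
M = 2.0 * np.outer(z1, z1) + 0.5 * np.outer(z2, z2)
print(atomic_norm(M))                         # prints approximately 2.5
```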

▶ The Latent Feature Lasso estimator:
$$\min_{M} \; g(M) + \lambda\|M\|_{\mathcal{S}}.$$

▶ Equivalently, one can solve the estimator by
$$\min_{c \in \mathbb{R}_+^{|\mathcal{S}|}} \; g\Big(\sum_{k \in \mathcal{S}} c_k z_k z_k^\top\Big) + \lambda\|c\|_1$$
Question: How to optimize with $|\mathcal{S}| = 2^N$ variables?

Greedy Coordinate Descent via MAX-CUT

▶ At each iteration, we find the coordinate of steepest descent:
$$j^* = \arg\max_{j} \; -\nabla_j f(c) = \arg\max_{z \in \{0,1\}^N} \; \langle -\nabla g(M), zz^\top \rangle \qquad (1)$$
which is a Boolean Quadratic problem similar to MAX-CUT:
$$\max_{z \in \{0,1\}^N} \; z^\top C z$$

▶ This can be solved to a 3/5-approximation by rounding the solution of a special type of SDP, using an $O(ND)$ iterative solver.
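
The poster's algorithm rounds that SDP; as a stand-in for intuition, here is a simple greedy bit-flip heuristic for the same Boolean quadratic objective (an assumption of this sketch, not the 3/5-approximation scheme used in the paper):

```python
import numpy as np

def boolean_quadratic_greedy(C, n_restarts=10, seed=0):
    """Heuristic for max_{z in {0,1}^N} z^T C z via greedy single-bit flips."""
    rng = np.random.default_rng(seed)
    N = C.shape[0]
    best_z, best_val = np.zeros(N), -np.inf
    for _ in range(n_restarts):
        z = rng.integers(0, 2, size=N).astype(float)
        improved = True
        while improved:
            improved = False
            for i in range(N):
                z_flip = z.copy()
                z_flip[i] = 1.0 - z_flip[i]
                if z_flip @ C @ z_flip > z @ C @ z + 1e-12:   # flip bit i if it helps
                    z, improved = z_flip, True
        if z @ C @ z > best_val:
            best_z, best_val = z, z @ C @ z
    return best_z, best_val

# in the greedy coordinate step one would call this with C = -grad g(M)
```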

Active-Set Algorithm

0. $\mathcal{A} = \emptyset$, $c = 0$.
for $t = 1 \ldots T$ do
  1. Find an approximate greedy atom $zz^\top$ by the MAX-CUT-like problem: $\max_{z \in \{0,1\}^N} \langle -\nabla g(M), zz^\top \rangle$.
  2. Add $zz^\top$ to the active set $\mathcal{A}$.
  3. Refine $c_{\mathcal{A}}$ via the Proximal Gradient Method on:
     $$\min_{c \geq 0} \; g\Big(\sum_{k \in \mathcal{A}} c_k z_k z_k^\top\Big) + \lambda\|c\|_1$$
  4. Eliminate $\{z_k z_k^\top \mid c_k = 0\}$ from $\mathcal{A}$.
end for

▶ Finding the approximate greedy coordinate costs $O(ND)$ (via SDP).
▶ Evaluating $\nabla g(M)$ is a least-squares problem of cost $O(DK^2)$.
▶ Each iteration costs $\underbrace{O(ND)}_{\text{MAX-CUT}} + \underbrace{O(DK^2)}_{\text{Least-Squares}}$.
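
To convey the control flow of steps 1 to 4, here is a schematic of the active-set loop on a toy smooth loss $f(c) = \tfrac{1}{2}\|\sum_k c_k z_k z_k^\top - M_{\mathrm{target}}\|_F^2$ standing in for the dual objective $g(M)$; it reuses the hypothetical boolean_quadratic_greedy helper from the sketch above and is not the paper's implementation.

```python
import numpy as np

def active_set_toy(M_target, lam=0.1, T=10, eta=0.01, inner_steps=200):
    """Schematic active-set loop on f(c) = 1/2 ||sum_k c_k z_k z_k^T - M_target||_F^2 + lam*||c||_1."""
    N = M_target.shape[0]
    atoms, c = [], np.zeros(0)
    for t in range(T):
        M = sum((ck * np.outer(z, z) for ck, z in zip(c, atoms)), np.zeros((N, N)))
        G = M - M_target                            # gradient of the toy loss w.r.t. M
        z_new, _ = boolean_quadratic_greedy(-G)     # step 1: approximate greedy atom
        atoms.append(z_new)                         # step 2: grow the active set
        c = np.append(c, 0.0)
        for _ in range(inner_steps):                # step 3: proximal gradient on c >= 0
            M = sum((ck * np.outer(z, z) for ck, z in zip(c, atoms)), np.zeros((N, N)))
            grad = np.array([np.sum((M - M_target) * np.outer(z, z)) for z in atoms])
            c = np.maximum(c - eta * (grad + lam), 0.0)   # prox of lam*||c||_1 over c >= 0
        keep = c > 0                                # step 4: drop zero-weight atoms
        atoms = [z for z, k in zip(atoms, keep) if k]
        c = c[keep]
    return atoms, c
```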

Runtime Complexity

MCMC: $(NDK^2)\,T$
Variational: $(NDK^2)\,T$
MF-Binary: $(NK)\,2^K$
BP-Means: $(NDK^3)\,T$
Spectral: $ND + K^5\log(K)$
LatentLasso: $(ND + K^2D)\,T$

Theoretical Results: Risk Bound

Let the population risk of a dictionary $W$ be
$$r(W) := \mathbb{E}\Big[\min_{z \in \{0,1\}^K} \tfrac{1}{2}\|x - W^\top z\|^2\Big].$$
Let $W^*$ be an optimal dictionary of size $K$. The algorithm outputs $\hat{W}$ with
$$r(\hat{W}) \le r(W^*) + \epsilon$$
as long as
$$t = \Omega\Big(\frac{K}{\epsilon}\Big) \quad \text{and} \quad N = \Omega\Big(\frac{DK}{\epsilon^3}\log\Big(\frac{RK}{\epsilon\rho}\Big)\Big).$$

▶ The result trades off between risk and sparsity.
▶ No assumption on $x$ except boundedness.
▶ The sample complexity is (quasi-)linear in $D$ and $K$.
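
The population risk above can be estimated on held-out data by enumerating $z \in \{0,1\}^K$ for each sample, which is tractable for moderate $K$; a minimal sketch (the function name is illustrative):

```python
import itertools
import numpy as np

def empirical_risk(X, W):
    """Estimate r(W) = E[min_z 1/2 ||x - W^T z||^2] by enumerating z in {0,1}^K."""
    K = W.shape[0]
    Z_all = np.array(list(itertools.product([0, 1], repeat=K)), dtype=float)  # 2^K x K
    recon = Z_all @ W                                              # all candidate W^T z
    d2 = ((X[:, None, :] - recon[None, :, :]) ** 2).sum(axis=2)    # N x 2^K squared distances
    return 0.5 * d2.min(axis=1).mean()
```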

Identifiability

Let $\mathrm{rank}(\Theta^*) = K$. The decomposition $\Theta^* = Z^*W^*$ is unique if
1. $Z^*: N \times K$ and $W^*: K \times D$ are both of rank $K$.
2. $\mathrm{span}(Z^*) \cap \{0,1\}^N \setminus \{0\} = \{Z^*_{:,j}\}_{j=1}^{K}$.
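
For small $N$, condition 2 can be verified by brute force: test every nonzero binary vector for membership in the column span of $Z^*$ and compare the survivors against the columns of $Z^*$. A sketch of that check, assuming $Z^*$ is given as a 0/1 float matrix (illustrative only):

```python
import itertools
import numpy as np

def check_identifiability(Z_star, W_star, tol=1e-9):
    """Brute-force check of the two identifiability conditions (small N only)."""
    N, K = Z_star.shape
    cond1 = (np.linalg.matrix_rank(Z_star) == K and
             np.linalg.matrix_rank(W_star) == K)
    cols = {tuple(col) for col in Z_star.T}               # columns of Z*
    in_span = set()
    for v in itertools.product([0.0, 1.0], repeat=N):     # all binary vectors
        if not any(v):
            continue
        v = np.array(v)
        coef = np.linalg.lstsq(Z_star, v, rcond=None)[0]
        if np.linalg.norm(Z_star @ coef - v) < tol:        # v lies in span(Z*)
            in_span.add(tuple(v))
    cond2 = in_span == cols
    return cond1 and cond2
```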

Theoretical Results: Exact Recovery (noiseless)

Let $X = Z^*W^*$, and let $(Z_{\mathcal{A}}, W_{\mathcal{A}})$ be a solution of the Latent Feature Lasso. If identifiability holds and $W_{\mathcal{A}}$ has full row rank, then
$$\{Z_{:,j}\}_{j \in \mathcal{A}} = \{Z^*_{:,j}\}_{j=1}^{K}, \qquad \{W_{j,:}\}_{j \in \mathcal{A}} = \{W^*_{j,:}\}_{j=1}^{K}.$$

Experiments on Synthetic Data

Experiments on Real Data

Mail: {ianyen,pradeepr,inderjit}@cs.utexas.edu, [email protected], [email protected]