Sparse Coding and Automatic Relevance Determination for Multi-way models
description
Transcript of Sparse Coding and Automatic Relevance Determination for Multi-way models
Informatics and Mathematical Modelling / Intelligent Signal Processing
1Sparse’09 8 April 2009
Sparse Coding and Automatic Relevance Determination for
Multi-way models
Morten Mørup DTU Informatics
Intelligent Signal ProcessingTechnical University of Denmark
Joint work with Lars Kai HansenDTU Informatics
Intelligent Signal ProcessingTechnical University of Denmark
Informatics and Mathematical Modelling / Intelligent Signal Processing
2Sparse’09 8 April 2009
Multi-way modeling has become an important tool for Unsupervised Learning and are in frequent use today in a variety of fields including Psychometrics (Subject x Task x Time) Chemometrics (Sample x Emission x Absorption) NeuroImaging(Channel x Time x Trial) Textmining (Term x Document x Time/User) Signal Processing (ICA, i.e. diagonalization of Cummulants)
Vector: 1-way array/ 1st order tensor,
Matrix: 2-way array/ 2nd order tensor,
3-way array/3rd order tensor
Informatics and Mathematical Modelling / Intelligent Signal Processing
3Sparse’09 8 April 2009
Some Tensor algebra
Matricizing X X(n)
N-mode multiplication
i.e. for matrix multiplication CA=A x1C , ABT=A x2B
I x JK Matrix
K x IJ Matrix
J x KI Matrix
XI x J x K
X(1)
X(2)
X(3)
Informatics and Mathematical Modelling / Intelligent Signal Processing
4Sparse’09 8 April 2009
Tensor DecompositionThe two most commonly used tensor decomposition models are the CandeComp/PARAFAC (CP) model and the Tucker model
Informatics and Mathematical Modelling / Intelligent Signal Processing
5Sparse’09 8 April 2009
CP-model unique Tucker model not unique What constitutes an adequate number of
components, i.e. determining J for CP and J1,J2,…,JN for Tucker
is an open problem, particularly difficult for the Tucker model as the number of components are specified for each modality separately
Informatics and Mathematical Modelling / Intelligent Signal Processing
6Sparse’09 8 April 2009
Agenda To use sparse coding to Simplify the Tucker core forming
a unique representation as well as enable interpolation between the Tucker (full core) and CP (diagonal core) model
To use sparse coding to turn off excess components in the CP and Tucker model and thereby select the model order.
To tune the pruning strength from data by Automatic Relevance Determination (ARD).
Informatics and Mathematical Modelling / Intelligent Signal Processing
7Sparse’09 8 April 2009
Sparse Coding
Bruno A. Olshausen David J. Field
Nature, 1996
Preserve Information Preserve Sparsity (Simplicity)
Tradeoff parameter
Informatics and Mathematical Modelling / Intelligent Signal Processing
8Sparse’09 8 April 2009
Sparse Coding (cont.)
Patch image Vectorize patches
…
Image patch 1 to N
A cor
resp
onds t
o Gab
or
like
feat
ures r
esem
bling
simple
cell b
ehav
ior i
n
V1 of t
he bra
in!
Sparse coding attempts to both learn dictionary and encoding at the same time(This problem is not convex!)
Informatics and Mathematical Modelling / Intelligent Signal Processing
9Sparse’09 8 April 2009
Automatic Relevance Determination (ARD)
Automatic Relevance Determination (ARD) is a hierarchical Bayesian approach widely used for model selection
In ARD hyper-parameters explicitly represents the relevance of different features by defining the range of variation. (i.e., Range of variation0 Feature removed)
Informatics and Mathematical Modelling / Intelligent Signal Processing
10Sparse’09 8 April 2009
A Bayesian formulation of the Lasso / BPD problem
Informatics and Mathematical Modelling / Intelligent Signal Processing
11Sparse’09 8 April 2009
ARD in reality a L0-norm optimization scheme. As such ARD based on Laplace prior corresponds to L0-norm optimization by re-weighted L1-norm
L0 norm by re-weighted L2 follows by imposing Gaussian prior instead of Laplace
Informatics and Mathematical Modelling / Intelligent Signal Processing
12Sparse’09 8 April 2009
Sparse Tucker decomposition by ARD
Brakes into standard Lasso/BPD sub-problems of the form
Informatics and Mathematical Modelling / Intelligent Signal Processing
13Sparse’09 8 April 2009
Solving efficiently the Lasso/BPD sub-problems
Notice, in general the alternating sub-problems for Tucker estimation have J<I!
Informatics and Mathematical Modelling / Intelligent Signal Processing
14Sparse’09 8 April 2009
The Tucker ARD Algorithm
Informatics and Mathematical Modelling / Intelligent Signal Processing
15Sparse’09 8 April 2009
Results
Tucker(10,10,10) models were fitted to the data, given are below the extracted cores
Fluorescene data: emission x excitation x samples
Synthetic Tucker(3,4,5) Tucker(3,6,4) Tucker(3,3,3) Tucker(?,4,4) Tucker(4,4,4)
Informatics and Mathematical Modelling / Intelligent Signal Processing
16Sparse’09 8 April 2009
Sparse Coding in fact 2-way case of Tucker model The presented ARD framework extracts a 20
component model when trained on the natural image data of (Olshausen and Field, Nature 1996)
Patch image Vectorize patches
…
Image patch 1 to N
Informatics and Mathematical Modelling / Intelligent Signal Processing
17Sparse’09 8 April 2009
Conclusion Imposing sparseness on the core
enable to interpolate between Tucker and CP model
ARD based on MAP estimation form a simple framework to tune the pruning in sparse coding models giving the model order.
ARD framework especially useful for the Tucker model where the order is specified for each mode separately which makes exhaustive evaluation of all potential models expensive.
ARD-estimation based on MAP is closely related to L0 norm estimation based on reweighted L1 and L2 norm. Thus, ARD form a principled framework for learning the sparsity pattern in CS.
Informatics and Mathematical Modelling / Intelligent Signal Processing
18Sparse’09 8 April 2009
http://mmds.imm.dtu.dk