
Generalized Multilinear Model for Dimensionality Reduction of Binary Tensors

Jakub Mažgut∗

Institute of Informatics and Software Engineering
Faculty of Informatics and Information Technologies

Slovak University of Technology in Bratislava
Ilkovičova 3, 842 16 Bratislava, Slovakia

[email protected]

Abstract
Generalized multilinear model for dimensionality reduction of binary tensors (GMM-DR-BT) is a technique for computing low-rank approximations of multi-dimensional data objects, tensors. The model exposes a latent structure that represents the dominant trends in the binary tensorial data while retaining as much information as possible. Several models for computing low-rank approximations of tensors already exist, but to the best of our knowledge there is at present no principled framework for binary tensors, even though binary tensors occur in many real-world applications such as gait recognition, document analysis or graph mining.

In the GMM-DR-BT model formulation, to account for the binary nature of the data, each tensor element is modeled by a Bernoulli noise distribution. To extract the dominant trends in the data, the natural parameters of the Bernoulli distributions are constrained, by employing the Tucker model, to lie in a subspace spanned by a reduced set of basis (principal) tensors. The Bernoulli distribution is a member of the exponential family with helpful analytical properties that allow us to derive an iterative scheme for estimating the basis tensors and other model parameters via maximum likelihood.

Furthermore, we extended the fully unsupervised GMM-DR-BT model to the semi-supervised setting by forcing the model to search for a natural parameter subspace that represents a user-specified compromise between the modelling quality and the degree of class separation.

∗Recommended by thesis supervisor: Dr. Peter Tiňo. Defended at Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava on August 30, 2012.

© Copyright 2012. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.
Mažgut, J. Generalized Multilinear Model for Dimensionality Reduction of Binary Tensors. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 4, No. 3 (2012) 35-46

We evaluated and compared the proposed GMM-DR-BT technique with existing real-valued and nonnegative tensor decomposition methods in several experiments involving a variety of high-dimensional datasets. The results suggest that the GMM-DR-BT model is better suited for modeling binary tensors than its real-valued and nonnegative counterparts.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data mining; I.5.1 [Pattern Recognition]: Models; I.5.2 [Pattern Recognition]: Design Methodology - Pattern analysis

Keywords
unsupervised tensor decomposition, Tucker model, binary data, semi-supervised decomposition

1. Introduction
At present an increasing number of data processing tasks involve manipulation of multi-dimensional objects, also known as tensors, whose elements are addressed by more than two indices. Intuitively, tensors can be imagined as the multidimensional arrays of numbers found in programming languages. In many practical problems, such as gait [16], hand posture [18] or face recognition [12], hyperspectral image processing [21] and text document analysis [4], the data tensors are specified in a high-dimensional space where the number of dimensions greatly exceeds the number of samples. Furthermore, such data usually contain a large amount of redundancy and lie in a subspace of the input space [10]. Applying pattern recognition or machine learning methods directly to such data spaces can result in high computational and memory requirements, as well as poor generalization. To address this "curse of dimensionality", a wide range of decomposition methods have been introduced to compress the data while capturing the 'dominant' trends. Making the learning machines operate on this compressed data space may not only boost their generalization performance but can also crucially increase their interpretability.

Traditional dimensionality reduction and feature extraction methods such as principal component analysis (PCA) [19] were designed to handle data objects in the form of vectors. For a tensorial data decomposition, the data items first need to be vectorized before these methods


can be applied. Besides the higher computational and memory requirements, the vectorization of data tensors breaks the higher-order dependencies present in the natural data structure that can potentially lead to more compact and useful representations [16]. New methods capable of processing multi-dimensional tensors in their natural structure have been introduced in [4, 16, 17] (for real-valued tensors), [29] (for nonnegative tensors) and [2] (for symmetric tensors). Such techniques, however, are not suitable for processing binary tensors. To the best of our knowledge, there is at present no principled systematic framework for the decomposition of binary tensors. Yet binary tensors arise in many real-world applications, such as gait recognition [16], document analysis [4] or graph objects represented by adjacency tensors. Thus, in the tensor decomposition domain there is still a need for a method that explicitly takes into account the binary nature of such tensorial data.

In this work, we present a novel decomposition method, the generalized multilinear model for dimensionality reduction of binary tensors (GMM-DR-BT), which is able to handle binary tensorial data in their natural multidimensional form and which extracts the hidden underlying structure of the analyzed data in an unsupervised manner. Furthermore, we extended the model to the semi-supervised setting by forcing the model to search for a natural parameter subspace which represents a user-specified compromise between the modelling quality and the degree of class separation.

2. Notation and Basic Tensor Algebra
Before we present the basic concepts of the domain, let us introduce our notation and the necessary basic tensor algebra. Vectors are denoted by lowercase boldface letters (e.g. $\mathbf{u}$), matrices by italic uppercase letters (e.g. $U$), and tensors by calligraphic letters (e.g. $\mathcal{A}$). Elements of an $N$-th order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ are addressed by $N$ indices $i_n$ ranging from 1 to $I_n$, $n = 1, 2, \dots, N$. For convenience, we will often write a particular index setting $(i_1, i_2, \dots, i_N) \in \Upsilon = \{1, 2, \dots, I_1\} \times \{1, 2, \dots, I_2\} \times \dots \times \{1, 2, \dots, I_N\}$ for a tensor element using vector notation $i = (i_1, i_2, \dots, i_N)$, so that instead of writing $\mathcal{A}_{i_1, i_2, \dots, i_N}$ we write $\mathcal{A}_i$.

A scalar product of two tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ is defined as $\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i \in \Upsilon} \mathcal{A}_i \cdot \mathcal{B}_i$, and the Frobenius norm of a tensor $\mathcal{A}$ is defined as

$$||\mathcal{A}||_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle} = \sqrt{\sum_{i \in \Upsilon} \mathcal{A}_i \cdot \mathcal{A}_i}. \quad (1)$$

An arbitrary tensor can be multiplied by a matrix (a 2nd-order tensor) using n-mode products. The n-mode product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ by a matrix $U \in \mathbb{R}^{J \times I_n}$ is a tensor $(\mathcal{A} \times_n U)$ given by

$$(\mathcal{A} \times_n U)_{i_1, \dots, i_{n-1}, j, i_{n+1}, \dots, i_N} = \sum_{i_n = 1}^{I_n} \mathcal{A}_{i_1, \dots, i_{n-1}, i_n, i_{n+1}, \dots, i_N} \cdot U_{j, i_n}, \quad (2)$$

for $j \in \{1, 2, \dots, J\}$.

Another important and widely used procedure is the unfolding of a tensor. Unfolding, also known as matricization, is a process of reordering the elements of an Nth-order tensor into a matrix along one of the tensor modes. Unfolding of tensor $\mathcal{A}$ along the n-mode is denoted by $A_{(n)}$. Detailed information with illustrations of various third-order tensor unfoldings can be found in [28].
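For illustration, a minimal numpy sketch of the mode-n unfolding and the n-mode product (2) could look as follows (these helper functions are our own, not part of the original paper); the same two operations underlie most of the algorithms discussed later.

```python
import numpy as np

def unfold(A, n):
    """Mode-n unfolding A_(n): mode n becomes the rows, the remaining
    modes are flattened into the columns."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def mode_product(A, U, n):
    """n-mode product (A x_n U) with a matrix U of shape (J, I_n), cf. eq. (2):
    sums A over index i_n against U[j, i_n]."""
    # tensordot contracts axis 1 of U with axis n of A and puts the new
    # axis first; move it back to position n.
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

# small usage check: a random 3rd-order tensor multiplied along mode n = 1
A = np.random.rand(4, 5, 6)
U = np.random.rand(3, 5)          # J = 3, I_2 = 5
B = mode_product(A, U, 1)         # shape (4, 3, 6)
assert B.shape == (4, 3, 6)
```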

3. Tensor Decomposition Methods
As mentioned before, the number of data processing tasks that involve manipulation of massive tensorial data has increased enormously over the last few years. This leads to a strong demand for dimensionality reduction and feature extraction techniques that are able to process tensorial data in their natural multi-dimensional structure. In this work we focus on tensor decomposition methods based on a reduced-rank representation. Such methods extract the dominant information from large-scale data by finding a lower dimensional (more compact) representation of the original data with no or little loss of information. Such techniques and representations are used for several purposes, from noise removal, data compression, model reduction and blind source separation to visualization.

The main idea behind tensor decomposition methods is analogous to matrix decomposition methods, where a given matrix is decomposed as a linear combination of rank-1 matrices. In the tensor domain, the given tensor is decomposed (factorized) as a linear combination of rank-1 tensors. An Nth-order rank-1 tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ can be obtained as an outer product of $N$ non-zero vectors $\mathbf{u}^{(n)} \in \mathbb{R}^{I_n}$, $n = 1, 2, \dots, N$:

$$\mathcal{A} = \mathbf{u}^{(1)} \circ \mathbf{u}^{(2)} \circ \dots \circ \mathbf{u}^{(N)}.$$

In the tensor decomposition literature, there are two main concepts for decomposing a given tensor. The concepts are represented by two different models, namely Tucker [26] and PARAFAC [11]. Both models have their origins in, and were first investigated in, the areas of psychometrics (e.g. [26]) and chemometrics (e.g. [3]). Nowadays, the models are employed and extended in numerous disciplines, from neuroscience, social network analysis and web mining to computer vision. A survey of state-of-the-art applications and extensions can be found in [1].

3.1 PARAFAC Model
The parallel factor analysis model, or simply the PARAFAC model (e.g. [1]), was proposed by Harshman in the seventies [11]. At the same time, Carroll and Chang [5] independently of Harshman proposed a canonical decomposition (CANDECOMP) model that is considered equivalent to the PARAFAC model. The basic principle of the model is to decompose any given tensor as a sum of a finite number of rank-1 tensors. The rank-1 tensors are called components in the PARAFAC decomposition. In the context of the different rank definitions [14], the PARAFAC model tries to find the best rank-R representation.

If the number of components $R$ is equal to the true rank of the analyzed Nth-order tensor $\mathcal{A}$, the decomposition is called a rank decomposition and is defined as

$$\mathcal{A} = \sum_{r=1}^{R} \mathbf{u}_r^{(1)} \circ \mathbf{u}_r^{(2)} \circ \dots \circ \mathbf{u}_r^{(N)}, \quad (3)$$

where $R = \mathrm{rank}(\mathcal{A})$ and the r-th rank-1 tensor is equal to an outer product of $N$ vectors $(\mathbf{u}_r^{(1)} \circ \mathbf{u}_r^{(2)} \circ \dots \circ \mathbf{u}_r^{(N)})$.

Besides the exact rank decomposition, the PARAFAC model


can also be used to find a lower-rank approximation of the given tensor $\mathcal{A}$. If $R < \mathrm{rank}(\mathcal{A})$, then we get an approximation

$$\mathcal{A} = \sum_{r=1}^{R} \mathbf{u}_r^{(1)} \circ \mathbf{u}_r^{(2)} \circ \dots \circ \mathbf{u}_r^{(N)} + \mathcal{E}, \quad (4)$$

where $\mathcal{E}$ denotes the residue between the real tensor and its approximation.
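As a toy illustration (our own sketch, not from the paper), the rank-R PARAFAC reconstruction of a 3rd-order tensor can be written with a single einsum over factor matrices whose r-th columns hold the vectors $\mathbf{u}_r^{(n)}$:

```python
import numpy as np

def parafac_reconstruct(factors):
    """Sum of R rank-1 tensors: factors[n] has shape (I_n, R) and its r-th
    column is the vector u_r^(n) (3rd-order case, cf. eq. (4) without E)."""
    U1, U2, U3 = factors
    return np.einsum('ir,jr,kr->ijk', U1, U2, U3)

# usage: a rank-2 approximation of a 4 x 5 x 6 tensor
R = 2
factors = [np.random.rand(I, R) for I in (4, 5, 6)]
A_hat = parafac_reconstruct(factors)   # shape (4, 5, 6)
```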

3.2 Tucker Model
The Tucker decomposition model, also called three-mode factor analysis, was first introduced by Tucker in 1964 [26]. Tucker proposed a model only for third-order tensors; later, in 2000, the group around De Lathauwer [7] proposed a generalization of the model to arbitrary Nth-order tensors. Besides that, the authors showed that the Tucker decomposition is a convincing generalization of the matrix singular value decomposition.

Consider now an orthonormal basis $\{\mathbf{u}_1^{(n)}, \mathbf{u}_2^{(n)}, \dots, \mathbf{u}_{I_n}^{(n)}\}$ for the n-mode space $\mathbb{R}^{I_n}$. The (column) vectors $\mathbf{u}_k^{(n)}$ can be stored as columns of the basis matrix $U^{(n)} = (\mathbf{u}_1^{(n)}, \mathbf{u}_2^{(n)}, \dots, \mathbf{u}_{I_n}^{(n)})$. Any tensor $\mathcal{A}$ can be decomposed into the product

$$\mathcal{A} = \mathcal{Q} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 \dots \times_N U^{(N)},$$

with the expansion coefficients stored in the Nth-order tensor $\mathcal{Q} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$. The expansion of $\mathcal{A}$ can also be written as

$$\mathcal{A} = \sum_{i \in \Upsilon} \mathcal{Q}_i \cdot (\mathbf{u}_{i_1}^{(1)} \circ \mathbf{u}_{i_2}^{(2)} \circ \dots \circ \mathbf{u}_{i_N}^{(N)}). \quad (5)$$

In other words, tensor $\mathcal{A}$ is expressed as a linear combination of $\prod_{n=1}^{N} I_n$ rank-1 basis tensors $(\mathbf{u}_{i_1}^{(1)} \circ \mathbf{u}_{i_2}^{(2)} \circ \dots \circ \mathbf{u}_{i_N}^{(N)})$. In addition, from the orthonormality of the basis sets, the tensor $\mathcal{Q}$ of expansion coefficients can be obtained as

$$\mathcal{Q} = \mathcal{A} \times_1 (U^{(1)})^T \times_2 (U^{(2)})^T \times_3 \dots \times_N (U^{(N)})^T.$$

Note that the orthonormality of the basis $\{\mathbf{u}_1^{(n)}, \mathbf{u}_2^{(n)}, \dots, \mathbf{u}_{I_n}^{(n)}\}$ for the n-mode space $\mathbb{R}^{I_n}$ can be relaxed. It can easily be shown that as long as, for each mode $n = 1, 2, \dots, N$, the vectors $\mathbf{u}_1^{(n)}, \mathbf{u}_2^{(n)}, \dots, \mathbf{u}_{I_n}^{(n)}$ are linearly independent, the basis tensors $(\mathbf{u}_{i_1}^{(1)} \circ \mathbf{u}_{i_2}^{(2)} \circ \dots \circ \mathbf{u}_{i_N}^{(N)})$, $i \in \Upsilon$, will be linearly independent as well.

In many practical problems, one can assume that only a reduced set of basis tensors in the expansion (5) is enough to sufficiently approximate the original tensor:

$$\mathcal{A} = \sum_{i \in K} \mathcal{Q}_i \cdot (\mathbf{u}_{i_1}^{(1)} \circ \mathbf{u}_{i_2}^{(2)} \circ \dots \circ \mathbf{u}_{i_N}^{(N)}) + \mathcal{E} \quad (6)$$

$$= \mathcal{Q} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 \dots \times_N U^{(N)} + \mathcal{E}, \quad (7)$$

where $K \subset \Upsilon$ and $\mathcal{E}$ denotes the residue between the real tensor and its model approximation. In the Tucker decomposition literature the basis matrices $U^{(n)}$ are also called factors or loadings [20], and the tensor of expansion coefficients is called the core tensor of reduced dimension.
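In code, the reduced Tucker model (7) is simply a sequence of n-mode products of the core tensor with the basis matrices. The sketch below (our own illustration, not from the paper) reconstructs a tensor from a core of size $R_1 \times \dots \times R_N$ and basis matrices $U^{(n)} \in \mathbb{R}^{I_n \times R_n}$, and computes the core of a given tensor by multiplying with the transposed bases (exact when the columns are orthonormal).

```python
import numpy as np

def mode_product(A, U, n):
    # n-mode product (eq. (2)); same helper as in the earlier sketch
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

def tucker_reconstruct(Q, U_list):
    """A_hat = Q x_1 U^(1) x_2 ... x_N U^(N)  (cf. eq. (7), dropping the residue E)."""
    A_hat = Q
    for n, U in enumerate(U_list):          # U^(n) has shape (I_n, R_n)
        A_hat = mode_product(A_hat, U, n)
    return A_hat

def tucker_core(A, U_list):
    """Q = A x_1 (U^(1))^T x_2 ... x_N (U^(N))^T; exact for orthonormal columns
    (cf. the expression for Q above), otherwise an approximation."""
    Q = A
    for n, U in enumerate(U_list):
        Q = mode_product(Q, U.T, n)
    return Q
```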

3.3 PARAFAC versus Tucker Model
The main difference between the Tucker and the PARAFAC model is that the Tucker model allows interactions between different components while the PARAFAC model does not. The levels of interaction are defined by the values of the expansion tensor $\mathcal{Q}$. In fact, the PARAFAC model can be considered a special case of the Tucker model with nonzero elements only on the super-diagonal of the expansion tensor.

Another difference between the models is the uniqueness of the decomposition. The PARAFAC model has a unique solution up to permutation and scale. On the other hand, Tucker-based decompositions are not unique [14]. We can rotate the expansion coefficient tensor $\mathcal{Q}$ without affecting the fit, so long as we apply the inverse rotation to the basis matrices. In order to achieve uniqueness, several constraints such as orthogonality, sparsity and nonnegativity are imposed on the factor matrices and the expansion coefficient tensor. Detailed information and a brief survey can be found in [14].

There exist many extensions and applications of both models in various domains, so it is not an easy matter to choose the right one for the desired analysis. In our work, we focus on compact representations (decompositions) of the original data that preserve as much information as possible. According to the results of the experiments published in [27], the Tucker decompositions clearly preserve more information than the PARAFAC decompositions. The authors used both models to decompose, compress and subsequently reconstruct data tensors. Then, the reconstructed tensors were compared with the originals using a mean square error criterion. At the same compression ratio, the Tucker model clearly outperformed the PARAFAC model. These results can be explained by the higher flexibility of the Tucker model, since it allows interactions between the factors. A detailed survey of Tucker-based models, their computational methods and applications can be found in (e.g. [1, 14]).

3.4 Extensions of the Tucker Model
The existing Tucker decomposition techniques can be divided into two main groups, namely real-valued and nonnegative tensor decomposition models. There exist several models in each group, but in this section we focus only on two essential models to point out the main trends in both groups. An extensive survey of existing models can be found in (e.g. [1, 14, 6]).

3.4.1 Multilinear Principal Component Analysis
The basic PARAFAC and Tucker models were designed to process one data tensor. However, many actual pattern recognition tasks require simultaneous processing of a set of tensors in order to extract the dominant features across all the data samples (tensors). Let us assume that we have a set of N-mode tensors $\mathcal{A}_m$, where $m = 1, \dots, M$, and that only a reduced number of basis tensors in the expansion (5) is sufficient to approximate all tensors in a given dataset. Then each data tensor $\mathcal{A}_m$ can be expressed as their linear combination

$$\mathcal{A}_m \approx \sum_{i \in K} \mathcal{Q}_{m,i} \cdot (\mathbf{u}_{i_1}^{(1)} \circ \mathbf{u}_{i_2}^{(2)} \circ \dots \circ \mathbf{u}_{i_N}^{(N)}), \quad (8)$$

where $K \subset \Upsilon$. In other words, the tensors in the given dataset can be found 'close' to the $|K|$-dimensional hyperplane in the tensor space spanned by the basis rank-1 tensors $(\mathbf{u}_{i_1}^{(1)} \circ \mathbf{u}_{i_2}^{(2)} \circ \dots \circ \mathbf{u}_{i_N}^{(N)})$, $i \in K$. Each tensor $\mathcal{A}_m$ can then be represented through its expansion coefficients $\mathcal{Q}_{m,i}$, $i \in K$.


Based on the principle of simultaneous decomposition of a set of tensors, Lu et al. proposed in [16] a tensor decomposition technique, multilinear principal component analysis (MPCA), for real-valued tensors. The MPCA technique can be considered a generalization of the classical PCA method to arbitrary N-th order tensors. The main objective of MPCA is to find a basis (projection) matrix $U^{(n)}$ for each n-mode space $\mathbb{R}^{I_n}$, $n = 1, \dots, N$, which maximizes the total variance of the expansion coefficient tensors (projected data tensors):

$$\Psi_{\mathcal{Q}} = \sum_{m=1}^{M} ||\mathcal{Q}_m - \bar{\mathcal{Q}}||_F, \quad (9)$$

where $\bar{\mathcal{Q}}$ denotes the mean of the expansion coefficient tensors $\mathcal{Q}_m$. This is analogous to the objective of PCA, which finds a projection that maximizes the variance of the projected data vectors instead of tensors.

According to the authors of [16], there is no known optimal solution which allows for the simultaneous optimization of the $N$ projection matrices $U^{(n)}$. They used a local optimization procedure that computes one projection matrix at a time while the others are kept fixed. Briefly, the projection matrix $U^{(n)}$ consists of the $R_n$ eigenvectors with the largest eigenvalues of the matrix

$$\Phi^{(n)} = \sum_{m=1}^{M} (A_{m(n)} - \bar{A}_{(n)}) \cdot U_{\Phi^{(n)}} \cdot U_{\Phi^{(n)}}^T \cdot (A_{m(n)} - \bar{A}_{(n)})^T,$$

where $U_{\Phi^{(n)}} = (U^{(n+1)} \otimes U^{(n+2)} \otimes \dots \otimes U^{(N)} \otimes U^{(1)} \otimes U^{(2)} \otimes \dots \otimes U^{(n-1)})$, $\otimes$ denotes the Kronecker product [8], and $A_{m(n)}$ represents the unfolding of tensor $\mathcal{A}_m$ along the n-mode, so that $A_{m(n)} \in \mathbb{R}^{I_n \times (I_1 \cdots I_{n-1} I_{n+1} \cdots I_N)}$ [7]. The projection matrices are updated sequentially, and the whole process is repeated until the cost function (9) converges or the maximum number of iterations is reached. According to [16], altering the ordering of the projection matrix computation did not lead to any significant improvements in practical situations.

3.4.2 Nonnegative Tucker Decomposition
The main motivation behind the development of nonnegative decomposition methods was the problem of performing dimensionality reduction for inherently nonnegative data, e.g. environmental data, chemical concentrations or color intensities [24]. The proposed solution was to represent such data as a linear combination of basis vectors and mixing coefficients, both with nonnegative elements. These nonnegativity constraints, which allow only additive, not subtractive, combinations, respect the inherent nonnegativity of the analyzed data and avoid physically absurd and uninterpretable factors [25].

The first nonnegative Tucker decomposition algorithms, introduced in [13], use an alternating least squares technique to optimize various global cost functions. The cost functions measure discrepancies between the original data tensor $\mathcal{A}$ and the reduced-rank approximation $\hat{\mathcal{A}}$ of the data tensor. According to the Tucker decomposition scheme defined by equation (6), the reduced-rank approximation is given by

$$\mathcal{A} \approx \hat{\mathcal{A}} = \mathcal{Q} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}, \quad (10)$$

where for the nonnegative decomposition the tensor of expansion coefficients $\mathcal{Q} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ and the n-mode basis matrices $U^{(n)}$ are constrained to have only nonnegative elements. Note that the initial restriction of the Tucker model regarding the orthonormality of the column vectors of the basis matrices is relaxed. It can easily be shown that as long as for each mode the vectors are linearly independent, the basis tensors will be linearly independent as well.

The first listed model for the nonnegative Tucker decomposition is based on minimization of the widely used Frobenius norm (1) of the difference between the data tensor $\mathcal{A}$ and its approximation by the model $\hat{\mathcal{A}}$:

$$D_F(\mathcal{A} \,||\, \hat{\mathcal{A}}) = ||\mathcal{A} - \hat{\mathcal{A}}||_F. \quad (11)$$

The model was proposed by Kim et al. in [13]; since it optimizes the Frobenius norm, it performs a nonnegative Tucker decomposition under the assumption of independent and identically distributed Gaussian noise. Other existing models incorporate different cost functions (e.g. KL-divergence, α-divergence) that are suitable for different types of noise.

The authors of [13] derived a multiplicative algorithm for updating the model parameters. The n-mode basis matrices $U^{(n)}$ of the Tucker model are sequentially updated by

$$U^{(n)} \leftarrow U^{(n)} \circledast \frac{A_{(n)} \, (Q_U^{(n)})^T}{U^{(n)} \, Q_U^{(n)} \, (Q_U^{(n)})^T}, \quad (12)$$

where

$$Q_U^{(n)} = \left[\mathcal{Q} \times_1 U^{(1)} \times_2 \cdots \times_{n-1} U^{(n-1)} \times_{n+1} U^{(n+1)} \cdots \times_N U^{(N)}\right]_{(n)}. \quad (13)$$

The binary operator $/$ represents element-wise division and the operator $\circledast$ represents the Hadamard product (element-wise multiplication). While the matrices $U^{(n)}$ are updated, the rest of the parameters are fixed to their current values.

Besides the basis matrices, the core tensor $\mathcal{Q}$ also needs to be updated in each cycle of the iteration, while the basis matrices are fixed to their current values. The updating algorithm for the core tensor has the form

$$\mathcal{Q} \leftarrow \mathcal{Q} \circledast \frac{\mathcal{A} \times_1 (U^{(1)})^T \cdots \times_N (U^{(N)})^T}{\mathcal{Q} \times_1 (U^{(1)})^T U^{(1)} \cdots \times_N (U^{(N)})^T U^{(N)}}. \quad (14)$$

According to the authors, the algorithm was directly derived from a previously proposed nonnegative matrix factorization updating algorithm, so the monotonic convergence analysis done for the NMF method in [15] can be directly applied to this extended version of the model for tensors as well.
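A minimal numpy sketch of the multiplicative factor update (12)-(13) is given below (our own illustration under the equations above, not the authors' code); a small constant guards the elementwise division.

```python
import numpy as np

def mode_product(A, U, n):
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

def unfold(A, n):
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def ntd_update_factor(A, Q, U_list, n, eps=1e-12):
    """One multiplicative update of U^(n) (eqs. (12)-(13)); all quantities
    are assumed elementwise nonnegative."""
    # Q_U^(n): core expanded by every factor except mode n, unfolded along n
    B = Q
    for s, U in enumerate(U_list):
        if s != n:
            B = mode_product(B, U, s)
    QU = unfold(B, n)                      # shape (R_n, prod of other data dims)
    An = unfold(A, n)                      # shape (I_n, prod of other data dims)
    numer = An @ QU.T                      # A_(n) (Q_U^(n))^T
    denom = U_list[n] @ (QU @ QU.T) + eps  # U^(n) Q_U^(n) (Q_U^(n))^T
    return U_list[n] * (numer / denom)     # Hadamard product and division
```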

4. GMM-DR-BT Model
In the previous sections we summarized the two main groups of tensor decomposition methods: models for real-valued decomposition and models for nonnegative decomposition. Such techniques, however, are not suitable for processing binary tensors. Thus, in the tensor decomposition domain there is still a need for a method that explicitly takes into account the binary nature of such tensorial data.

To close this gap we propose the generalized multilinear model for dimensionality reduction of binary tensors (GMM-DR-BT). The model is based on an extension of


the multilinear Tucker concept (6) for real-valued tensorial data. The extension is analogous to the generalization of linear models to the exponential family of distributions. We use the Tucker model as a multilinear 'predictor' and a logistic function as the link function to link the real-valued multilinear 'predictions' with the response variables, in our case the binary elements of the data tensors. A probabilistic framework is used to formally define the model and to derive rules for the parameter estimation, in an analogous manner to Schein et al. for binary vectorial data [23].

4.1 The Model
Consider a binary Nth-order tensor $\mathcal{A} \in \{0,1\}^{I_1 \times I_2 \times \dots \times I_N}$. Assume we are given a set of $M$ such tensors $D = \{\mathcal{A}_1, \mathcal{A}_2, \dots, \mathcal{A}_M\}$. Each element $\mathcal{A}_{m,i}$ of the tensor $\mathcal{A}_m$, $m = 1, 2, \dots, M$, is independently generated from a Bernoulli distribution with mean $\mathcal{P}_{m,i}$ (all mean parameters for the data are collected in a tensor $\mathcal{P} \in [0,1]^{M \times I_1 \times I_2 \times \dots \times I_N}$ of order $N+1$). Assuming independence among the data tensors, the model likelihood reads

$$L(\mathcal{P}) = \prod_{m=1}^{M} \prod_{i \in \Upsilon} P(\mathcal{A}_{m,i} \,|\, \mathcal{P}_{m,i}) = \prod_{m=1}^{M} \prod_{i \in \Upsilon} \mathcal{P}_{m,i}^{\mathcal{A}_{m,i}} \cdot (1 - \mathcal{P}_{m,i})^{1 - \mathcal{A}_{m,i}}. \quad (15)$$

A more detailed model description can be found in thethesis.

Our goal is to find a lower dimensional representation of the binary tensors in $D$ while still capturing the data distribution well. The mean Bernoulli parameters are confined to the interval $[0,1]$. To solve our problem in an unbounded domain, we rewrite the Bernoulli distribution using the log-odds parameters $\theta_{m,i} = \log[\mathcal{P}_{m,i} / (1 - \mathcal{P}_{m,i})]$ and the logistic link function $\sigma(\theta_{m,i}) = (1 + e^{-\theta_{m,i}})^{-1} = \mathcal{P}_{m,i}$. We thus obtain for each data tensor $\mathcal{A}_m$, $m = 1, 2, \dots, M$:

$$P(\mathcal{A}_m \,|\, \theta_m) = \prod_{i \in \Upsilon} \sigma(\theta_{m,i})^{\mathcal{A}_{m,i}} \cdot \sigma(-\theta_{m,i})^{1 - \mathcal{A}_{m,i}}, \quad (16)$$

so the model log-likelihood takes the form

$$L(\Theta) = \log \prod_{m=1}^{M} P(\mathcal{A}_m \,|\, \theta_m) = \sum_{m=1}^{M} \sum_{i \in \Upsilon} \mathcal{A}_{m,i} \log \sigma(\theta_{m,i}) + (1 - \mathcal{A}_{m,i}) \log \sigma(-\theta_{m,i}), \quad (17)$$

where we collect all the natural parameters $\theta_{m,i}$ in a tensor $\Theta \in \mathbb{R}^{M \times I_1 \times I_2 \times \dots \times I_N}$. Now $\theta_{m,i} \in \mathbb{R}$, which allows us to incorporate the real-valued multilinear Tucker model. By using the Tucker model we constrain all the Nth-order parameter tensors $\theta_m$ to lie in a subspace spanned by a reduced set of rank-1 basis tensors $(\mathbf{u}_{r_1}^{(1)} \circ \mathbf{u}_{r_2}^{(2)} \circ \dots \circ \mathbf{u}_{r_N}^{(N)})$, where $r_n \in \{1, 2, \dots, R_n\}$ and $R_n \leq I_n$, $n = 1, 2, \dots, N$. The indices $r = (r_1, r_2, \dots, r_N)$ take values from the set $\rho = \{1, 2, \dots, R_1\} \times \{1, 2, \dots, R_2\} \times \dots \times \{1, 2, \dots, R_N\}$.

Furthermore, we allow for an Nth-order bias tensor $\Delta \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$, so that the parameter tensors $\theta_m$ are constrained onto an affine space. Using (6) we get

$$\theta_{m,i} = \sum_{r \in \rho} \mathcal{Q}_{m,r} \cdot \prod_{n=1}^{N} u_{r_n, i_n}^{(n)} + \Delta_i, \quad (18)$$

and the log-likelihood is evaluated as

$$L = \sum_{m=1}^{M} \sum_{i \in \Upsilon} \mathcal{A}_{m,i} \log \sigma\!\left( \sum_{r \in \rho} \mathcal{Q}_{m,r} \cdot \prod_{n=1}^{N} u_{r_n, i_n}^{(n)} + \Delta_i \right) + (1 - \mathcal{A}_{m,i}) \log \sigma\!\left( -\sum_{r \in \rho} \mathcal{Q}_{m,r} \cdot \prod_{n=1}^{N} u_{r_n, i_n}^{(n)} - \Delta_i \right). \quad (19)$$

4.2 Parameter Estimation
To get analytical parameter updates in the maximum likelihood framework, we use the trick of [23] and take advantage of the fact that while the log-likelihood (19) of the constrained model is not convex in all the parameters jointly, it is convex in any one of these parameters if the others are fixed. The trick is to derive the analytical updates from a lower bound on the log-likelihood using

$$\log \sigma(\theta) \geq -\log 2 + \frac{\theta}{2} - \log \cosh\!\left(\frac{\hat{\theta}}{2}\right) - (\theta^2 - \hat{\theta}^2) \, \frac{\tanh\frac{\hat{\theta}}{2}}{4\hat{\theta}}, \quad (20)$$

where $\hat{\theta}$ stands for the current value of the individual natural parameters $\theta_{m,i}$ of the Bernoulli noise models $P(\mathcal{A}_{m,i} \,|\, \theta_{m,i})$ and $\theta$ stands for the future estimate of the parameters, given the current parameter values. This leads to an iterative scheme where the model parameters are fitted by alternating between the least squares updates for the basis vectors $\mathbf{u}_{r_n}^{(n)}$, the expansion coefficients $\mathcal{Q}_{m,r}$ and the bias tensor $\Delta$. While one set of parameters is updated, the others are held fixed. This procedure is repeated until the log-likelihood converges to a desired degree of precision. The update rules lead to a monotonic increase in the log-likelihood.

Derivation of the parameter updates is rather involved and, due to space limitations, we refer the interested reader to the thesis, where details of the derivations as well as the complete update formulas can be found. In the following text, we present just the final updating formulas.

4.2.1 Updates for n-mode Space Basis
Holding the bias tensor $\Delta$ and the expansion coefficients $\mathcal{Q}_{m,r}$, $m = 1, 2, \dots, M$, $r \in \rho$, fixed, we obtain an update rule for the n-mode space basis $\{\mathbf{u}_1^{(n)}, \mathbf{u}_2^{(n)}, \dots, \mathbf{u}_{R_n}^{(n)}\}$. For each n-mode and each of its coordinates $j \in \{1, 2, \dots, I_n\}$, the basis vectors are updated by solving the linear system

$$\sum_{t=1}^{R_n} u_{t,j}^{(n)} \, K_{q,t,j}^{(n)} = S_{q,j}^{(n)}, \quad (21)$$

where

$$S_{q,j}^{(n)} = \sum_{m=1}^{M} \sum_{i \in \Upsilon_{-n}} \left( 2\mathcal{A}_{m,[i,j|n]} - 1 - T_{m,[i,j|n]} \, \Delta_{[i,j|n]} \right) B_{m,i,q}^{(n)}, \quad (22)$$


$$K_{q,t,j}^{(n)} = \sum_{m=1}^{M} \sum_{r \in \rho_{-n}} \mathcal{Q}_{m,[r,t|n]} \times \sum_{i \in \Upsilon_{-n}} T_{m,[i,j|n]} \, B_{m,i,q}^{(n)} \prod_{s=1, s \neq n}^{N} u_{r_s, i_s}^{(s)}, \quad (23)$$

$$B_{m,i,q}^{(n)} = \sum_{r \in \rho_{-n}} \mathcal{Q}_{m,[r,q|n]} \cdot \prod_{s=1, s \neq n}^{N} u_{r_s, i_s}^{(s)}, \quad (24)$$

$T_{m,i}$ denotes $\tanh(\theta_{m,i}/2) / \theta_{m,i}$ evaluated at the current parameter values, and $q = 1, 2, \dots, R_n$. Here $\Upsilon_{-n}$ and $\rho_{-n}$ denote the index sets $\Upsilon$ and $\rho$ with the n-th mode left out, and $[i, j|n]$ (respectively $[r, t|n]$) denotes the full index obtained by inserting the coordinate $j$ (respectively $t$) at mode $n$ into the reduced index. Note that the updates for the coefficients of the basis vectors for different modes $n$ and coordinates $j \in \{1, 2, \dots, I_n\}$ are conveniently decoupled.

4.2.2 Updates for Expansion Coefficients
When updating the expansion coefficients $\mathcal{Q}_{m,r}$, the bias tensor $\Delta$ and the basis sets $\{\mathbf{u}_1^{(n)}, \mathbf{u}_2^{(n)}, \dots, \mathbf{u}_{R_n}^{(n)}\}$ for all modes $n = 1, 2, \dots, N$ are kept fixed to their current values. The update rule for the expansion coefficients $\mathcal{Q}_{m,r}$ of the m-th input tensor $\mathcal{A}_m$ can be obtained by solving the set of linear equations

$$T_{v,m} = \sum_{r \in \rho} P_{v,r,m} \, \mathcal{Q}_{m,r}, \quad (25)$$

where

$$T_{v,m} = \sum_{i \in \Upsilon} (2\mathcal{A}_{m,i} - 1 - T_{m,i} \, \Delta_i) \, C_{v,i}, \quad (26)$$

$$P_{v,r,m} = \sum_{i \in \Upsilon} T_{m,i} \, C_{v,i} \, C_{r,i}, \quad (27)$$

$C_{r,i}$ denotes $\prod_{n=1}^{N} u_{r_n, i_n}^{(n)}$, and $v \in \rho$ is a basis index. In terms of these equations, the expansion coefficient updates for different input tensors are conveniently decoupled.
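Because these updates decouple over the input tensors, each $\mathcal{Q}_m$ can be refreshed by building the vector (26) and the matrix (27) over the flattened index set $\rho$ and solving one small linear system. A possible numpy sketch (our own illustration, not the authors' code; the helper builds the matrix of flattened rank-1 basis tensors $C_{r,i}$) is:

```python
import numpy as np
from functools import reduce

def basis_tensors_C(U_list):
    """C_{r,i} = prod_n u^(n)_{r_n,i_n} for every r in rho: row r holds one
    flattened rank-1 basis tensor over the element index set Upsilon."""
    # outer product over modes: shape (R_1, I_1, R_2, I_2, ...)
    C = reduce(lambda X, U: np.multiply.outer(X, U.T), U_list, np.array(1.0))
    R = int(np.prod([U.shape[1] for U in U_list]))
    # bring the R_n axes to the front, then flatten basis and element indices
    order = list(range(0, 2 * len(U_list), 2)) + list(range(1, 2 * len(U_list), 2))
    return C.transpose(order).reshape(R, -1)

def update_Q_m(A_m, Theta_m, Delta, C):
    """Solve eq. (25) for one data tensor, with T_i = tanh(theta_i/2)/theta_i
    evaluated at the current natural parameters Theta_m."""
    a = A_m.ravel().astype(float)
    d = Delta.ravel()
    th = Theta_m.ravel()
    T = np.where(np.abs(th) > 1e-8, np.tanh(th / 2.0) / th, 0.5)  # limit at theta -> 0
    rhs = C @ (2.0 * a - 1.0 - T * d)       # vector (26) over v in rho
    P = (C * T) @ C.T                       # matrix (27) over (v, r)
    return np.linalg.solve(P, rhs)          # flattened Q_m over r in rho
```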

4.2.3 Updates for the Bias Tensor
As before, holding the expansion coefficients $\mathcal{Q}_{m,r}$, $m = 1, 2, \dots, M$, $r \in \rho$, and the basis sets $\{\mathbf{u}_1^{(n)}, \mathbf{u}_2^{(n)}, \dots, \mathbf{u}_{R_n}^{(n)}\}$ for all modes $n = 1, 2, \dots, N$ fixed, we obtain a simple update rule for the bias tensor:

$$\Delta_j = \frac{\sum_{m=1}^{M} \left( 2\mathcal{A}_{m,j} - 1 - T_{m,j} \cdot \sum_{r \in \rho} \mathcal{Q}_{m,r} \, C_{r,j} \right)}{\sum_{m=1}^{M} T_{m,j}}. \quad (28)$$
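The bias update (28) is elementwise; a compact numpy sketch (our own illustration, where QC holds the Tucker part $\sum_{r} \mathcal{Q}_{m,r} C_{r,i}$ of the natural parameters for every training tensor) is:

```python
import numpy as np

def update_bias(A, QC, Theta):
    """Eq. (28): A, QC and Theta are stacks of shape (M, I_1, ..., I_N) holding
    the data tensors, the Tucker parts sum_r Q_{m,r} C_{r,i}, and the current
    natural parameters; T = tanh(theta/2)/theta is the variational weight."""
    T = np.where(np.abs(Theta) > 1e-8, np.tanh(Theta / 2.0) / Theta, 0.5)
    numer = np.sum(2.0 * A - 1.0 - T * QC, axis=0)
    denom = np.sum(T, axis=0)
    return numer / denom
```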

4.2.4 Decomposing Unseen Binary Tensors
Note that our model is not generative; however, it is straightforward to find expansion coefficients for an Nth-order tensor $\mathcal{A}' \in \{0,1\}^{I_1 \times I_2 \times \dots \times I_N}$ not included in the training set $D$. One simply needs to solve for the expansion coefficients in the natural parameter space, given that the parameters are confined onto the affine subspace of the tensor parameter space found in the training phase. Recall that the affine subspace is determined by the bias tensor $\Delta$ and the basis sets $\{\mathbf{u}_1^{(n)}, \mathbf{u}_2^{(n)}, \dots, \mathbf{u}_{R_n}^{(n)}\}$, one for each mode $n = 1, 2, \dots, N$. These are kept fixed.

The likelihood (19), to be maximized with respect to the expansion coefficients stored in the tensor $\mathcal{Q}$, reads

$$L(\mathcal{Q}; \mathcal{A}') = \sum_{i \,:\, \mathcal{A}'_i = 1} \log \sigma\!\left( \sum_{r \in \rho} \mathcal{Q}_r \, C_{r,i} + \Delta_i \right) + \sum_{i \,:\, \mathcal{A}'_i = 0} \log \sigma\!\left( -\sum_{r \in \rho} \mathcal{Q}_r \, C_{r,i} - \Delta_i \right). \quad (29)$$

Any optimization technique can be used. The quantities $C_{r,i}$ and $\Delta_i$ are constants given by the trained model.

The tensor $\mathcal{Q}$ can be initialized by first finding the closest data tensor from the training dataset $D$ to $\mathcal{A}'$ in the Hamming distance sense,

$$m(\mathcal{A}') = \arg\min_{m = 1, 2, \dots, M} \sum_{i \in \Upsilon} |\mathcal{A}'_i - \mathcal{A}_{m,i}|,$$

and then setting the initial value of $\mathcal{Q}$ to the expansion coefficient tensor of $\mathcal{A}_{m(\mathcal{A}')}$.

When using gradient ascent,

$$\mathcal{Q}_v \leftarrow \mathcal{Q}_v + \eta \, \frac{\partial L(\mathcal{Q}; \mathcal{A}')}{\partial \mathcal{Q}_v},$$

the updates take the form

$$\mathcal{Q}_v \leftarrow \mathcal{Q}_v + \eta \sum_{i \in \Upsilon} C_{v,i} \left[ \mathcal{A}'_i - \sigma\!\left( \sum_{r \in \rho} \mathcal{Q}_r \, C_{r,i} + \Delta_i \right) \right], \quad (30)$$

where $\eta > 0$.
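A possible numpy sketch of this gradient ascent projection (our own illustration; C is the matrix of flattened rank-1 basis tensors $C_{r,i}$ as in the earlier sketch, and eta is a small step size) is:

```python
import numpy as np

def project_unseen(A_new, C, Delta, Q0, eta=0.1, n_iter=200):
    """Maximize (29) over the expansion coefficients of an unseen binary tensor
    by gradient ascent (eq. (30)); the basis (C) and bias (Delta) stay fixed."""
    a = A_new.ravel().astype(float)
    d = Delta.ravel()
    Q = Q0.copy()                      # e.g. coefficients of the nearest training tensor
    for _ in range(n_iter):
        sigma = 1.0 / (1.0 + np.exp(-(C.T @ Q + d)))   # predicted Bernoulli means
        Q = Q + eta * (C @ (a - sigma))                # gradient step, eq. (30)
    return Q
```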

5. Experiments
In this section, we compare our proposed generalized multilinear model for dimensionality reduction of binary tensors (GMM-DR-BT) with several existing tensor decomposition methods. We evaluated how well their compact representations preserve the information. The examined models are used to compress and subsequently reconstruct the data to measure the discrepancy between the original and reconstructed data. To achieve a fair comparison, we employ two real-valued and two nonnegative tensor decomposition models, namely the tensor latent semantic indexing model (TensorLSI)¹ [4], the multilinear principal component analysis model (MPCA) [16], the nonnegative Tucker decomposition model employing a least square error function, and a similar model employing the KL-divergence [13].

5.1 Outline of the Experiments
To get a proper comparison of our proposed model with other existing models, we tested the ability of compression on three different datasets whose data tensors differ in order, complexity of the underlying structure and sparsity. On each dataset, all the models were used to find principal subspaces spanned by different numbers of basis tensors or vectors, using a portion of the data samples assigned for training.

After training the models, tensors that were not included in the training (the hold-out set) were "compressed" by projecting them onto the principal subspace, thus obtaining their low dimensional representations (projections).

¹ The TensorLSI model is only capable of processing 2nd-order tensors (matrices).


In order to evaluate the amount of preserved information, the compressed representations were reconstructed back into the original binary tensor space. Note that since the models we consider represent binary tensors through continuous values in $\mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, a straightforward deterministic reconstruction in the binary space is not appropriate. Therefore we used the area under the ROC curve (AUC), designed to compare different real-valued predictions of binary data, to evaluate the amount of preserved information in the compressed representations of the original data.
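For completeness, the AUC of a reconstruction can be computed directly from the real-valued reconstruction scores and the binary ground truth, treating every tensor element as one prediction; a minimal sketch using scikit-learn (our own illustration, not the evaluation code used in the thesis) is:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def reconstruction_auc(A_true, A_scores):
    """AUC of real-valued reconstruction scores against the binary tensor,
    with every tensor element treated as one prediction."""
    return roc_auc_score(A_true.ravel().astype(int), A_scores.ravel())

# usage: e.g. scores = sigma(theta) for GMM-DR-BT, or the real-valued
# reconstruction produced by MPCA / NTD / TensorLSI
```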

For a fair comparison, the decomposition methods are compared based on the AUC with respect to the number of free parameters. In general, the free parameters correspond to the basis vectors and the offset (bias tensor). For TensorLSI and MPCA the offset represents the mean tensor (used to center the data); for GMM-DR-BT it represents the bias tensor; the NTD models do not have a bias vector/tensor, and centering of the data is not appropriate considering the nonnegativity constraint. For this reason we exclude the centering of the data and the bias tensor from the models in this type of experiment. If we exclude the offset from the models, then for the GMM-DR-BT, MPCA and NTD models the number of free parameters is equal to $\sum_{n=1}^{N} R_n \cdot I_n$. TensorLSI with $R$ basis tensors has $R \cdot \sum_{n=1}^{N} \min\{R, I_n\}$ free parameters.

5.2 Synthetic Data
In order to evaluate the ability of the models to find compact data representations, we generated 5 datasets of 3rd-order binary tensors of size $(I_1, I_2, I_3) = (15, 15, 15)$. Each dataset has 4,500 tensors and was sampled from underlying subspaces of the natural parameter space spanned by 30 randomly generated, linearly independent basis tensors. A detailed description of the generating process can be found in the thesis.

From each dataset we held out one third (1,500) of the binary tensors as a test set and let the models find the latent subspace on the remaining (3,000) tensors (training set). The performance of the examined models in compressing and subsequently reconstructing the sets of synthetic tensors, measured by the mean and standard deviation of the AUC values across all 5 test sets of binary tensors, is summarized in Figure 1. As can be seen from the figure, the GMM-DR-BT model outperforms all the real-valued and nonnegative counterpart models.

5.3 DNA Sequences
The DNA sequence dataset includes almost 62,000 DNA subsequences. In brief, each sequence is represented by a 2nd-order binary tensor (matrix) with 31 rows and 250 columns. Rows represent short subsequences, terms, which are widespread sequences of nucleotides that have or may have a biological significance. Columns represent positions in the DNA sequence, and binary elements indicate the presence/absence of a term in the sequence at a given position. Detailed information about the dataset and the DNA sequence representation by a binary tensor can be found in the thesis.

From the dataset of 62,000 sequences we randomly sampled 5 groups, each of 4,500 sequences. On each group, all examined decomposition models were used to find a latent subspace spanned by different numbers of basis tensors, using 3,000 of the sequences (binary matrices).

The hold-out sets of sequences were projected onto the latent space and then reconstructed back into the original tensor space to measure the discrepancy between the projected and original data by the AUC. The procedure is similar to the one used in the previous experiment with synthetic data.

Reconstruction results in terms of AUC for the different dimensionalities of the latent spaces are shown in Figure 2. Our proposed GMM-DR-BT model clearly outperforms the other decomposition models except in the case of the smallest latent subspace, where the TensorLSI model achieved the best accuracy in terms of AUC.

Furthermore, the subsequences in the dataset originate from two different functional regions of genomic sequences. In the thesis we illustrate how well our GMM-DR-BT method can reveal biologically meaningful dominant trends, and we closely analyzed the topographic organization of DNA sequences in the latent space. Several well known biological characteristics were pointed out by our GMM-DR-BT analysis.

5.4 USF Gait Challenge Dataset
The dataset of gait silhouette video sequences, USF HumanID "Gait Challenge", was created by Sarkar et al. [22] and is considered a benchmark set for gait recognition systems. Each data sample represents a person walking in elliptical paths in front of a camera. In our experiment we used dataset version 1.7, which was binarized by [16] to obtain the binary silhouettes. The dataset consists of 731 binary gait samples of size (32x22x10).

For our experiment we randomly divided the samples into 5 disjoint groups and performed a 5-fold cross-validation to measure the amount of preserved information in the reconstructed tensors. As in the previous experiments, we used the AUC for the evaluation. Results are summarized in Figure 3, and once again our GMM-DR-BT model clearly outperforms the other tensor models.

6. Semi-Supervised Extension of GMM-DR-BT
So far we have considered the GMM-DR-BT model only as an unsupervised dimensionality reduction method for binary tensors. However, many problems in the machine learning domain involve decompositions which to a certain degree preserve the label information provided for some data items. Decomposition methods that aim to benefit from both labeled and unlabeled data and make a compromise between the quality of the data representation and the degree of class separation are called semi-supervised decomposition methods.

In this section we propose an extension of our GMM-DR-BT model to the semi-supervised setting by forcing the model to search for a natural parameter subspace that represents a user-specified compromise between the modelling quality and the degree of class separation. We do so by extending the model likelihood function with a separability measure of the projected data samples from different classes.

6.1 The Model
To enforce class separability of data items living in a metric space, Globerson and Roweis introduce a distribution


[Figure 1 plot: "AUC Analysis of Hold-out Binary 3-D Synthetic Tensors"; AUC (roughly 0.55 to 1) versus the number of free parameters (45, 90, 135, 180, 225, 270, 315, 360) for GMM-DR-BT, MPCA, NTD-LS and NTD-KL; for every model the corresponding numbers of basis vectors per mode range from 1x1x1 to 8x8x8.]

Figure 1: AUC analysis of hold-out 3rd-order synthetic binary tensor reconstructions obtained by the models using different number of free parameters among 5 different sets of binary tensors. Table under the plot describes model settings for particular number of free parameters.

[Figure 2 plot: "AUC Analysis of Hold-out DNA Sequence Samples Reconstruction"; AUC (roughly 0.65 to 1) versus the number of free parameters (≈ 2,000 to ≈ 16,000) for GMM-DR-BT, LPCA, MPCA, TensorLSI, NTD-LS and NTD-KL; GMM-DR-BT, MPCA, NTD-LS and NTD-KL use from 1 x 8 to 8 x 64 row x column basis vectors, TensorLSI uses from 7 to 61 basis tensors.]

Figure 2: AUC analysis of DNA sequence hold-out 2nd-order binary tensor reconstructions obtained by the models using different number of free parameters among 5 different sets of binary tensors. Table under the plot describes model settings for particular number of free parameters.

[Figure 3 plot: "AUC Analysis of Hold-out Binary Gait Samples Reconstruction"; AUC (roughly 0.92 to 1) versus the number of free parameters (96 to 504) for GMM-DR-BT, MPCA, NTD-LS and NTD-KL; for every model the corresponding numbers of basis vectors per mode range from 2x1x1 to 10x7x3.]

Figure 3: AUC analysis of hold-out 3rd-order binary gait tensor reconstructions obtained by the models using different number of free parameters among 5 different sets of binary tensors. Table under the plot describes model settings for particular number of free parameters.


over data items $l$, given a single data point $m$ [9]:

$$p(l|m) = \frac{e^{-d(m,l)}}{\sum_{k \neq m} e^{-d(m,k)}}, \quad m \neq l, \quad (31)$$

where $d(m,l)$ is the distance between the points $m$ and $l$. Loosely speaking, given a particular data item $m$, under $p(l|m)$ we are more likely to pick data points closer to $m$ than the more distant ones. In the ideal situation, where all points in the same class are collapsed to a single point and are infinitely far from points of different classes, the conditional distributions (31) would become "bi-level" distributions [9]:

$$p_0(l|m) \propto \begin{cases} 1 & y_m = y_l \\ 0 & y_m \neq y_l, \end{cases} \quad (32)$$

where $y_m$ denotes the class label of data point $m$. In [9], maximal class separation under a given data model is achieved by tuning the model parameters so that the class divergence $\sum_m \mathrm{KL}[p_0(\cdot|m) \,\|\, p(\cdot|m)]$ is minimized. Minimizing $\sum_m \mathrm{KL}[p_0(\cdot|m) \,\|\, p(\cdot|m)]$ is equivalent to maximizing

$$\sum_m \frac{1}{c(y_m) - 1} \sum_{\substack{l: y_l = y_m \\ l \neq m}} \log p(l|m) \quad (33)$$

$$= \sum_m \frac{1}{c(y_m) - 1} \sum_{\substack{l: y_l = y_m \\ l \neq m}} \left( -d(m,l) - \log \sum_{k \neq m} e^{-d(m,k)} \right), \quad (34)$$

where $c(y_m)$ denotes the number of points in class $y_m$.

Any two natural parameter tensors $\theta_m$ and $\theta_l$ living in the tensor subspace represent tensors of Bernoulli distributions $P(\mathcal{A}_m|\theta_m)$ and $P(\mathcal{A}_l|\theta_l)$ given by (16). The distance between those Bernoulli tensors is quantified by the symmetric KL divergence $D(m,l)$,

$$D(m,l) = \sum_{i \in \Upsilon} \frac{\mathrm{KL}[\mathcal{P}_{m,i} \,\|\, \mathcal{P}_{l,i}] + \mathrm{KL}[\mathcal{P}_{l,i} \,\|\, \mathcal{P}_{m,i}]}{2}, \quad (35)$$

where the KL divergence between two Bernoulli distributions defined by their means is equal to

$$\mathrm{KL}[\mathcal{P}_{m,i} \,\|\, \mathcal{P}_{l,i}] = \sum_{x \in \{0,1\}} P(x|\mathcal{P}_{m,i}) \log \frac{P(x|\mathcal{P}_{m,i})}{P(x|\mathcal{P}_{l,i})} \quad (36)$$

and $\mathcal{P}_{m,i} = \sigma(\theta_{m,i}) = (1 + e^{-\theta_{m,i}})^{-1}$.

Using $D(m,l)$ as a metric on the tensor subspace of the Bernoulli natural parameters, (31) becomes

$$p(l|m) = \frac{e^{-D(m,l)}}{\sum_{k \neq m} e^{-D(m,k)}}, \quad m \neq l. \quad (37)$$

Given a subset of data tensors $D_\ell \subset D$ with class labels, the degree of projected class separation is quantified by (see (34))

$$F(D_\ell, y) = \sum_{m \in D_\ell} \frac{1}{c(y_m) - 1} \sum_{\substack{l: y_l = y_m \\ l \neq m}} \left( -D(m,l) - \log \sum_{\substack{k \in D_\ell \\ k \neq m}} e^{-D(m,k)} \right), \quad (38)$$

where $y$ is a $|D_\ell|$-dimensional vector that contains the labels of the data tensors in $D_\ell$.

We aim to find a tensor basis that simultaneously maximizes the log-likelihood (17) of all training tensors and the degree of projected class separation (38),

$$L(D) + \beta \, F(D_\ell, y), \quad (39)$$

where $\beta > 0$ is a regularization constant controlling the trade-off between data representation and separation.

To fit the tensor basis, any optimization technique can be used to maximize (39). We derived the update rules by using a basic gradient ascent method:

$$u_{q,j}^{(n)} \leftarrow u_{q,j}^{(n)} + \eta \left( \frac{\partial L(D)}{\partial u_{q,j}^{(n)}} + \beta \, \frac{\partial F(D_\ell, y)}{\partial u_{q,j}^{(n)}} \right), \quad (40)$$

$$\Delta_j \leftarrow \Delta_j + \eta \left( \frac{\partial L(D)}{\partial \Delta_j} + \beta \, \frac{\partial F(D_\ell, y)}{\partial \Delta_j} \right). \quad (41)$$

After each update cycle through the training set (updates of the projection space), the expansion coefficients (projections) of the data tensors were calculated as described in Section 4.2. A detailed derivation of the final updating formulas can be found in the thesis.

6.2 Experiments
To illustrate the workings of the semi-supervised tensor basis selection, we employed the proposed model to analyze the previously mentioned USF gait challenge dataset [22]. For this experiment, we selected the 4 persons (classes) from the dataset that had the highest number of samples (14). This gave us a dataset of 4 classes with a total of 56 binary tensors of size (32x22x10). We split the 56 tensors into two equally sized disjoint sets, a training and a hold-out set. Each set contains 7 samples from each class.

The training set of tensors was used to train the models by finding the tensor basis and projecting the training tensors onto that basis. After training the models, tensors that were not included in training (the hold-out set) were "compressed" by projecting them onto the principal subspace using (30). The hold-out set was used to verify whether the subspace found using the training set represents any global trends in the data.

The model setting (the number of basis vectors for each mode) was chosen to find 4 principal tensors, obtained as outer products of 2 first-mode, 2 second-mode and 1 third-mode basis vectors. Each training or hold-out tensor is thus represented by a 4-dimensional vector of expansion coefficients. To visualize the distribution of such representations, we used principal component analysis (PCA) and projected the real-valued 4-dimensional expansion vectors onto a two-dimensional space defined by the two leading principal vectors.

The results are presented in Figure 4. Plots in the first and second columns correspond to the training and hold-out sets, respectively. The first row represents a model with randomly chosen basis vectors for each mode. The second row corresponds to the completely unsupervised setting (β = 0). The third and fourth rows represent two different settings of class separation enforcement, β = 5 and β = 20, respectively. There is a certain degree of natural class separation visible in the tensor subspace found


[Figure 4: a 4 x 2 grid of two-dimensional PCA scatter plots of binary gait samples (Person 1 to Person 4); the columns show the training and hold-out sets, the rows show randomly chosen basis vectors for each mode, unsupervised decomposition (β = 0), and semi-supervised decomposition with β = 5 and β = 20.]

Figure 4: Two-dimensional PCA projections from 4 different tensor spaces of training and hold-out sets of binary gait tensors. The first row represents a tensor model with randomly chosen basis vectors for each mode. The second row corresponds to the completely unsupervised setting (β = 0). The third and fourth rows represent two different settings of class separation enforcement, β = 5 and β = 20, respectively.


in the unsupervised manner, without using any class label information. The random subspace position completely fails to discriminate between the four classes. Further imposition of pressure for more class separation yields a tensor basis giving a strong improvement in the class distribution over the completely unsupervised case, as can be seen in the plots of the third and fourth rows. More experiments with different datasets can be found in the thesis.

7. Conclusions and Contributions
Current data processing tasks often involve the manipulation of binary tensors. However, a principled systematic framework for the decomposition of binary tensors was missing. We closed this gap by introducing the generalized multilinear model for dimensionality reduction of binary tensors (GMM-DR-BT). The model is based on the Tucker model concept and can be considered a generalization of the existing logistic principal component analysis (LPCA) model for binary vectorial data decomposition. A probabilistic framework is used to derive the update rules for the model parameters. To account for the binary nature of the data, each tensor element is modeled by a Bernoulli noise distribution. To extract the dominant trends in the data, we constrain the natural parameters of the Bernoulli distributions to lie in a subspace spanned by a reduced set of basis (principal) tensors. We derived a simple closed-form iterative scheme for parameter estimation.

In the experiments involving synthetic and real-world datasets, we have shown that our GMM-DR-BT model is better suited for modeling binary tensors than the existing real-valued and non-negative tensor decomposition counterparts. The examined models were used to compress and subsequently reconstruct data to measure the discrepancy between the original and reconstructed data. Our proposed model clearly outperformed the other models.

We have also investigated the ability of the tensor-based models to perform unsupervised analysis of DNA subsequences (represented as binary tensors) from different functional regions. The detailed description and results of this experiment can be found in the thesis; they confirmed the conclusion from the previous experiments that our GMM-DR-BT model is better suited for modeling binary tensors than the existing counterparts.

Furthermore, we extended our GMM-DR-BT model to the semi-supervised setting by forcing the model to search for a natural parameter subspace that represents a user-specified compromise between the modeling quality and the degree of class separation. The results from analyzing the binary gait samples showed that imposing a combined pressure for modeling quality and class separation yields an improvement in the class distribution over the completely unsupervised case.

Acknowledgements. This work was partially supported by the Scientific Grant Agency of Slovak Republic, grant No. VG1/0508/09, and by the Slovak Research and Development Agency under contract No. APVV-0208-10.

References
[1] E. Acar and B. Yener. Unsupervised multiway data analysis: A literature survey. IEEE Trans. Knowl. Data En., 21(1):6–20, Jan. 2009.
[2] J. Brachat, P. Comon, B. Mourrain, and E. Tsigaridas. Symmetric tensor decomposition. Linear Algebra Appl., in press, corrected proof, 2010.
[3] R. Bro. PARAFAC. Tutorial and applications. Chemometrics and Intelligent Laboratory Systems, 38(2):149–171, 1997.
[4] D. Cai, X. He, and J. Han. Tensor space model for document analysis. In Proc. 29th Annu. ACM SIGIR Int. Conf. Research and Development in Information Retrieval, pages 625–626, Seattle, WA, Aug. 2006.
[5] J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, pages 283–319, 1970.
[6] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley Publishing, Chichester, 2009.
[7] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Applicat., 21(4):1253–1278, 2000.
[8] H. Eves. Elementary Matrix Theory. Dover Publications, April 1980.
[9] A. Globerson and S. Roweis. Metric learning by collapsing classes. Advances in Neural Information Processing Systems, 18:451–458, 2006.
[10] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. A survey of multilinear subspace learning for tensor data. Pattern Recognition, 44(7):1540–1551, Jul. 2011.
[11] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
[12] K. Jia and S. Gong. Multi-modal tensor face for simultaneous super-resolution and recognition. In 10th IEEE Int. Conf. Computer Vision, volume 2, pages 1683–1690, Beijing, Oct. 2005.
[13] Y. D. Kim and S. Choi. Nonnegative Tucker decomposition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
[14] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 2008. To appear (accepted June 2008).
[15] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13, 2001.
[16] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. MPCA: Multilinear principal component analysis of tensor objects. IEEE Trans. Neural Netw., 19:18–39, Jan. 2008.
[17] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. Uncorrelated multilinear discriminant analysis with regularization and aggregation for tensor object recognition. IEEE Trans. Neural Netw., 20:103–123, Jan. 2009.
[18] C. Nölker and H. Ritter. Visual recognition of continuous hand postures. IEEE Trans. Neural Netw., 13:983–994, July 2002.
[19] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6):559–572, 1901.
[20] A. Phan and A. Cichocki. Tensor decompositions for feature extraction and classification of high dimensional datasets. In Nonlinear Theory and Its Applications, IEICE (invited paper), October 2010.
[21] N. Renard and S. Bourennane. ICA-based multilinear algebra tools for dimensionality reduction in hyperspectral imagery. In IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 4, pages 1345–1348, Las Vegas, NV, Apr. 2008.
[22] S. Sarkar, P. J. Phillips, Z. Liu, I. R. Vega, P. Grother, and K. W. Bowyer. The HumanID gait challenge problem: Data sets, performance, and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:162–177, 2005.
[23] A. Schein, L. Saul, and L. Ungar. A generalized linear model for principal component analysis of binary data. In 9th Int. Workshop Artificial Intelligence and Statistics, Key West, FL, Jan. 2003.
[24] S. Sra and I. S. Dhillon. Nonnegative matrix approximation: Algorithms and applications. Technical Report TR-06-27, Dept. of Computer Sciences, University of Texas at Austin, Austin, TX 78712, USA, June 2006.
[25] J. A. Tropp. Literature survey: Non-negative matrix factorization. Technical report, Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78712, USA, March 2009.
[26] L. R. Tucker. The extension of factor analysis to three-dimensional matrices. In H. Gulliksen and N. Frederiksen, editors, Contributions to Mathematical Psychology, pages 110–127. Holt, Rinehart and Winston, New York, 1964.
[27] H. Wang and N. Ahuja. Rank-R approximation of tensors: Using image-as-matrix representation. In Computer Vision and Pattern Recognition, pages 346–353, 2005.
[28] H. Wang and N. Ahuja. A tensor approximation approach to dimensionality reduction. Int. J. Comput. Vision, 76:217–229, March 2008.
[29] S. Zafeiriou. Discriminant nonnegative tensor factorization algorithms. IEEE Trans. Neural Netw., 20:217–235, Feb. 2009.

Selected Papers by the Author
J. Mažgut, P. Tiňo, M. Bodén, H. Yan. Multilinear Decomposition and Topographic Mapping of Binary Tensors. In Artificial Neural Networks - ICANN 2010, LNCS 6352, pages 317–326, Thessaloniki, Greece, 2010. Springer.
J. Mažgut, M. Paulinyová, P. Tiňo. Using Dimensionality Reduction Method for Binary Data to Questionnaire Analysis. In Z. Kotásek et al. (Eds.), MEMICS 2011, LNCS 7119, pages 146–154, Lednice, Czech Republic, 2011. Springer.
J. Mažgut. Analyzing DNA Sequences Based on Oligomer-Position Information. In 5th Student Research Conference in Informatics and Information Technologies Bratislava, pages 186–193. Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, 2009.
J. Mažgut. The Analysis of Binary Data. In 6th Student Research Conference in Informatics and Information Technologies Bratislava, pages 499–505. Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, 2010.