
UNIVERSITÉ CATHOLIQUE DE LOUVAIN
ÉCOLE POLYTECHNIQUE DE LOUVAIN
DÉPARTEMENT D'INGÉNIERIE MATHÉMATIQUE

NONNEGATIVE MATRIX FACTORIZATION
ALGORITHMS AND APPLICATIONS

NGOC-DIEP HO

Thesis submitted in partial fulfillment of the requirements for the degree of Docteur en Sciences de l'Ingénieur

Dissertation committee:
Prof. Vincent Wertz (President)
Prof. Paul Van Dooren (Promoter)
Prof. Vincent Blondel (Promoter)
Prof. François Glineur
Prof. Yurii Nesterov
Prof. Bob Plemmons
Prof. Johan Suykens

June 2008


ACKNOWLEDGMENTS

This research has been supported by the Belgian Programme on Inter-university Poles of Attraction, initiated by the Belgian State, Prime Minister's Office for Science, Technology and Culture. It has also been supported by the ARC (Concerted Research Action) "Large Graphs and Networks" of the French Community of Belgium. I was also a FRIA fellow (Fonds pour la formation à la Recherche dans l'Industrie et dans l'Agriculture).


    TABLE OF CONTENTS

Acknowledgments

Table of contents

Notation glossary

Introduction

1 Preliminaries
  1.1 Matrix theory and linear algebra
  1.2 Optimization
  1.3 Low-rank matrix approximation

2 Nonnegative matrix factorization
  2.1 Problem statement
  2.2 Solution
  2.3 Exact factorization and nonnegative rank
  2.4 Extensions of nonnegative matrix factorization

3 Existing algorithms
  3.1 Lee and Seung algorithm
  3.2 Alternating least squares methods
  3.3 Gradient descent
  3.4 Scaling and stopping criterion
  3.5 Initializations

4 Rank-one residue iteration
  4.1 Motivation


  4.2 Column partition of variables
  4.3 Convergence
  4.4 Variants of the RRI method
  4.5 Regularizations
  4.6 Algorithms for NMF extensions
  4.7 Numerical experiments

5 Nonnegative matrix factorization with fixed row and column sums
  5.1 Problem statement
  5.2 Generalized KL divergence in NMF
  5.3 Application: stochastic matrix approximation

6 Weights in nonnegative matrix factorization
  6.1 Gradient information
  6.2 Methods
  6.3 Toward the weighted KL divergence
  6.4 Adding weights to existing NMF variants
  6.5 Application: feature extraction of face images

7 Symmetry in nonnegative matrix factorization
  7.1 Symmetric approximations
  7.2 Methods
  7.3 Application: graph clustering
  7.4 Application: correlation matrix approximation

Conclusion

Bibliography


    NOTATION GLOSSARY

R                 field of real numbers
R+                set of nonnegative real numbers
R^n_+             set of nonnegative real vectors of size n
R^{m×n}_+         set of m × n nonnegative real matrices
⇔                 if and only if
:=                equal by definition to
dim X             dimension of X
⟨· , ·⟩           generic inner product
‖·‖_p             p-norm (1 ≤ p ≤ +∞)
‖·‖_2             Euclidean norm (vectors) / spectral norm (matrices)
D(A‖B)            generalized Kullback-Leibler divergence
e_i               i-th unit vector: e_i = (0 ... 0 1 0 ... 0)^T, with the 1 in position i
1_{m×n}           vector or matrix of all ones
I_k               k × k identity matrix
X^T               transpose of the matrix X
X_{ij}            element located at the i-th row and the j-th column of X
X_{i:}            i-th row of the matrix X
X_{:j}            j-th column of the matrix X
vec(X)            vector formed by stacking the columns of X into one vector
rank(X)           rank of the matrix X
rank_{UV^T}(X)    nonnegative rank of the matrix X
rank_{VV^T}(X)    completely positive rank of the matrix X
det X             determinant of the square matrix X
trace(X)          trace of the square matrix X


λ_k(X)            k-th eigenvalue of the matrix X
Λ(X)              set of eigenvalues of the matrix X
ρ(X)              spectral radius of X: max_i |λ_i(X)|
σ_max(X)          maximal singular value of the matrix X
σ_min(X)          minimal singular value of the matrix X
A ⊗ B             Kronecker product of the matrices A and B
A ∘ B             Hadamard product of the matrices A and B
[A]/[B]           Hadamard division of the matrices A and B
[A]_+             projection of A onto the nonnegative orthant
D(v)              diagonal matrix with v on the main diagonal

Abbreviations and acronyms

NMF               Nonnegative Matrix Factorization
SNMF              Symmetric Nonnegative Matrix Factorization
SSNMF             Semi-Symmetric Nonnegative Matrix Factorization
WNMF              Weighted Nonnegative Matrix Factorization
SVD               Singular Value Decomposition


    INTRODUCTION

Every second in this modern era, tons of data are being generated. Think of the number of people online writing their blogs, designing their homepages and sharing their experiences through many other digital supports: videos, photos, etc. Think also of the data generated when decoding the genes of living creatures, and of the data acquired from outer space or even from our own planet.

Data only become useful once they have been processed. Faced with this fast-growing amount of data, there are several approaches to data processing: applying classical methods, or designing more powerful computing structures such as distributed computing, multicore processors, supercomputers, etc. But the growing amount and complexity of accumulated data seem to outweigh the growth of computing power, which is, at present, roughly doubling every year (cf. Moore's law [90]). One very popular approach is model reduction, which tries to reduce the complexity of a problem (or of the data) while keeping its essentials.

Besides, different types of data require different models to capture the insight they contain. Using the right model saves a lot of time; of course, a model believed to be right will only stand until a better one is found, and an ideal model may not exist. For instance, projection onto dominant subspaces computed with the Singular Value Decomposition (SVD) [50] has been proposed as the best model to reduce the complexity of data and of complicated systems: for the same reduced complexity, it offers the least error (with respect to some measures) compared to other models. But it is not the only such model, since the conic representation, or conic coding, [82] is also extensively used. Its properties favour the additive model of some types of data, while SVD-related techniques do not. In this thesis,


we focus on finding the best reduced conic representation of nonnegative data through Nonnegative Matrix Factorization (NMF). We will go through several issues that can be considered the building blocks of nonnegative matrix factorization.

For nonnegative data, we will see that this additive model offers a physical representation that is closer to reality than other techniques such as the SVD. But this is not for free. On the one hand, the SVD is known to have polynomial-time complexity: a full decomposition can be computed in a polynomial number of basic operations, namely O(nm min(m, n)), where n and m are the dimensions of the data matrix [50]. When only a partial SVD is needed, iterative methods can be applied with a computational load of mnr basic operations per iteration, where r, 1 ≤ r ≤ min(m, n), is the reduced dimension of the decomposition. Their convergence speed, measured by the number of iterations needed, has improved drastically, which allows us to process massive data sets. On the other hand, the NMF problem has recently been shown to be NP-hard [121], and the existence of a polynomial-time optimal algorithm is unknown. However, iterative methods with a low computational load per iteration are still possible. There are iterative methods whose computational load per iteration is roughly equivalent to that of the SVD, i.e. mnr, such as [81] and [86], as well as the one described in this thesis. But only acceptable solutions are to be expected rather than the optimal one, and restarts may be needed. The main aspects that differentiate these methods are then: to which solutions do they tend to converge? How fast do they converge? And how can they be driven towards solutions that possess some desired properties?

Part-based analysis

An ordinary object is usually a collection of simple parts connected by some relations between them. Building objects from basic parts is one of the simplest principles applicable to many human activities. Moreover, human vision is designed to detect the presence or the absence of features (parts) of a physical object. Thanks to these parts, a human can recognize and distinguish most objects [15].


Without taking into account the possible relations between parts, and assuming that we can establish a full list of all possible parts of all possible objects, there is one unified formula for composing the objects:

   Object_i = Part_1(b_i1) with Part_2(b_i2) with ...,

where

   b_ij = present,  if part j is present in object i,
          absent,   if part j is absent from object i.

Then we can represent object i by a list (b_i1, b_i2, ...), which can be simplified by replacing the statuses present and absent by 1 and 0. This model can be improved by taking into account the quantity of each part inside an object. The final recipe for making an object is then something like

   Object_i = b_i1 Part_1 + b_i2 Part_2 + ...,   where b_ij ≥ 0.

In reality, only some objects are available through observations. The task is then to detect parts from the observed objects and to use the detected parts to reconstitute these objects. This simple idea appears in many applications and will be illustrated in the following examples.

Image learning

Digital image processing has been a hot topic in recent years. This includes face recognition [54], optical character recognition [70], content-based image retrieval [109], etc. Each monochrome digital image is a rectangular array of pixels, and each pixel is represented by its light intensity. Since light intensity is measured by a nonnegative value, we can represent each image as a nonnegative matrix, where each element is a pixel. Color images can be coded in the same way, but with several nonnegative matrices.

An example is the Cambridge ORL face database. It contains 400 monochrome images of a frontal view of the faces of 40 persons (10 images per person). The size of each image is 112 × 92, with 256 gray levels per pixel. Some randomly selected images are shown on the left of the following figure.


We want to decompose those images as:

   Image_i = b_i1 Feature_1 + b_i2 Feature_2 + ...,

where b_ij ≥ 0 is the participation weight of feature j in image i. A procedure similar to that proposed in [80] is used to extract a list of pertinent features on which some sparsity constraints are imposed. These features are shown on the right of the above figure. Each of the images in the database is then reconstructed by a nonnegative mixture of those features.

The method that was used to construct these features guarantees not only a good reconstruction of the images but also the nonnegativity of the features. Therefore, each feature can again be considered as an image. Together with the participation of each feature in an image, one can establish the composition of every image in a very comprehensible way.

    Document classification

Textual data is an important source of information. The smallest meaningful units of text are words, and a sequence of words constitutes a document. Usually, each document is about a specific topic or category. In some cases, for instance news or school courses, these categories are specified. But in others, such as blogs and discussion groups, they may not


   Topic 1          Topic 2        Topic 3      Topic 4
   court            president      flowers      disease
   government       served         leaves       behavior
   council          governor       plant        glands
   culture          secretary      perennial    contact
   supreme          senate         flower       symptoms
   constitutional   congress       plants       skin
   rights           presidential   growing      pain
   justice          elected        annual       infection

be listed. Moreover, a classification is hardly unique: several different classifications can be defined. For instance, news articles can be classified not only by topic (economics, culture, sports, science, etc.) but also according to geographical region (Asia, Europe, Africa, etc.).

Without a grammar, a text can be seen as a set of words combined with their numbers of occurrences. Given a collection of texts, one wants to automatically discover the hidden classifications. The task is then to try to explain a text as:

   Text_i = b_i1 Topic_1 + b_i2 Topic_2 + ...,

where b_ij can be considered as the similarity, or the participation, of topic j in text i. Topics are characterized by a list of keywords, which describe their semantics.

The above table shows some topics discovered by Nonnegative Matrix Factorization from 30991 articles of the Grolier encyclopedia, as reported in Nature [80]. In each column of the table, a discovered topic is represented by a list of keywords taken from the 15276-word vocabulary. Both the topics and their keywords are retrieved automatically.

In reality, the topic of a document is often not pure, in the sense that each document can belong to a number of topics. Carrying on with the above example, the "Constitution of the United States" entry is, in fact, semantically a mixture of different categories with some weights. Their experiment shows that it is strongly related to the first two topics and weakly (almost not at all) related to the last two, which matches our intuition perfectly.
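To make the reading of such topics concrete, here is a minimal sketch of how keywords could be extracted from the factors of a term-document matrix. The variables A and vocabulary, the helper name top_keywords, and the factorization routine (any NMF algorithm, e.g. the multiplicative-update sketch above) are assumptions; this is not the exact setup used in [80].

    import numpy as np

    # Term-document matrix: A[i, j] = number of occurrences of word i in document j.
    # vocabulary: list of words, one per row of A (assumed to be available).
    # U, V = nmf_multiplicative(A, r=4)     # any NMF routine providing A ~ U V^T

    def top_keywords(U, vocabulary, n_words=8):
        """For each topic (column of U), return the words with the largest weights."""
        topics = []
        for j in range(U.shape[1]):
            order = np.argsort(U[:, j])[::-1][:n_words]
            topics.append([vocabulary[i] for i in order])
        return topics

    # Row j of V then gives the participation of each topic in document j.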


Having discovered a list of topics and their participation in each document, one can not only decide to which topics a document belongs, but also deal with the polysemy of words, detect new trends, reveal hidden categories, etc.

Why nonnegativity?

As we have seen, in the part-based analysis the presence or absence of parts creates recipes for making an object. This existential status of the parts is represented by nonnegative numbers, where 0 represents an absence and a positive number represents a presence with some degree. Furthermore, the objects themselves are also represented by sets of nonnegative numbers, e.g., numbers of occurrences or light intensities. Because of that, nonnegativity is a crucial feature that one needs to maintain during the analysis of the objects.

The part-based analysis is also referred to as the additive model, because of the absence of subtractions in the model. This follows from the construction of an object:

   Object_i = b_i1 Part_1 + b_i2 Part_2 + ...

Allowing subtractions, i.e., b_ij < 0, would break this purely additive interpretation, since parts could then cancel each other in the reconstruction.


Nonnegative Matrix Factorization (NMF) approximates a nonnegative matrix A by the product of two other nonnegative matrices U and V:

   A ≈ UV^T.

This factorization captures all the key ideas of the above examples of part-based analysis. The columns of U define the extracted parts (image features, document topics, etc.). The matrix V describes the participation of those parts in the original objects (images, documents, etc.).

The idea of approximating a nonnegative matrix A by the product UV^T of two nonnegative matrices U and V is not new. In fact, it is a generalization of the well-known K-Means method [88] from 1967, applied to nonnegative data. Suppose we have n nonnegative data vectors a_1, ..., a_n and r initial centroids u_1, ..., u_r representing r clusters C_1, ..., C_r. The K-Means method repeats the following two steps until convergence:

   1. For each a_i, assign it to C_j if u_j is the centroid nearest to a_i, with respect to the Euclidean distance.

   2. For each u_j, replace it with the arithmetic mean of all the a_i in C_j.

We can construct a matrix U by putting the vectors u_j in the columns of U, and create a matrix V such that

   V_ij = 1 if a_i ∈ C_j,  and  V_ij = 0 otherwise.

It turns out that the K-Means method tries to minimize the Euclidean distance between the matrices A and UV^T, where the columns of A are the data vectors a_i. Moreover, because each column of U is a mean of some nonnegative vectors, both matrices U and V are nonnegative. Mathematically, we solve

   min_{U, V} ‖A − UV^T‖_F^2,

where A and U are nonnegative matrices, V is a binary matrix in which each row contains one and only one element equal to 1, and ‖A − UV^T‖_F denotes the Euclidean distance between A and UV^T. The two iterative steps of the K-Means method above are, in fact, the optimal solutions of the following subproblems:


   (P1)  min_U ‖A − UV^T‖_F^2,        (P2)  min_V ‖A − UV^T‖_F^2,

with the special structure of V.

Nonnegative matrix factorization differs from the K-Means method only in the structure of the matrix V: instead of a binary matrix as above, in NMF V is taken to be an arbitrary nonnegative matrix. This little difference gives NMF more flexibility, but also makes the two subproblems above harder to solve optimally. However, we will still see the same iterations (P1) and (P2) in a number of NMF algorithms in this thesis.
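As an illustration of the two alternating steps above, written directly in the matrix form A ≈ UV^T, here is a minimal Python sketch. The helper name kmeans_as_alternating_ls and the convention that the data vectors are the columns of A are assumptions; this is not an optimized K-Means implementation.

    import numpy as np

    def kmeans_as_alternating_ls(A, r, n_iter=100, seed=0):
        """K-Means on the columns of A, written as the alternating steps (P1)/(P2):
        U holds the centroids in its columns, V is binary with one 1 per row."""
        m, n = A.shape
        rng = np.random.default_rng(seed)
        U = A[:, rng.choice(n, size=r, replace=False)].copy()   # initial centroids
        for _ in range(n_iter):
            # (P2): assign each column of A to its nearest centroid
            d = ((A[:, :, None] - U[:, None, :]) ** 2).sum(axis=0)   # n x r squared distances
            labels = d.argmin(axis=1)
            V = np.zeros((n, r))
            V[np.arange(n), labels] = 1.0
            # (P1): replace each centroid by the mean of its assigned columns
            counts = V.sum(axis=0)
            nonempty = counts > 0
            U[:, nonempty] = (A @ V[:, nonempty]) / counts[nonempty]
        return U, V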

K-Means had been applied successfully to many problems long before the introduction of NMF in the nineties. It is therefore not surprising that NMF, the generalized version of K-Means, has recently gained a lot of attention in many fields of application. We believe that preserving nonnegativity in the analysis of originally nonnegative data preserves essential properties of the data. The loss of some mathematical precision due to the nonnegativity constraint is compensated by a meaningful and comprehensible representation.


Thesis outline

The objective of this thesis is to provide a better understanding of nonnegative matrix factorization and to propose better algorithms for it. Chapter 2 is about various aspects of the nonnegative matrix factorization problem. Chapters 3 and 4 are about its algorithmic aspects. The last three chapters are devoted to some extensions and applications. Here is the outline of each chapter:

Chapter 1: Preliminaries. Some basic results and concepts used throughout the thesis are presented. Known results are stated without proof, but references are given instead. This chapter also serves as a concise introduction to the main notations.

Chapter 2: Nonnegative matrix factorization. This chapter is devoted to the introduction of the problem, the optimality conditions, the representations of the factorization, the solution for some easy cases, and the characterization of the local minima of the nonnegative matrix factorization problem. Exact nonnegative factorization and nonnegative ranks are also discussed, as well as some interesting extensions of the factorization, such as multilayer nonnegative matrix factorization and nonnegative tensor factorization.

Chapter 3: Existing algorithms. In this chapter, investigations are carried out to clarify some algorithmic aspects of the existing algorithms, such as the multiplicative updates, gradient-based methods and alternating least squares. Other algorithmic aspects like initializations and stopping conditions are also treated.

Chapter 4: Rank-one residue iteration. This chapter is an extension of the report [62], where we proposed to decouple the problem into rank-one approximations to create a new algorithm. A convergence analysis, numerical experiments and some extensions are also presented for this algorithm. Two other independent reports, [31] and [49], have also proposed this algorithm. Numerical experiments are summarized at the end of the chapter to compare the performance of the newly proposed method with existing ones. It is seen that this method has good and fast convergence and is suitable for large-scale problems. Moreover, it does not require


any parameter setting, which is an advantage over some other methods.

Chapter 5: Nonnegative matrix factorization with fixed row and column sums. We introduce a new problem in nonnegative matrix factorization in which the row and column sums of the original matrix are preserved in the approximation. After some discussion of the problem, we prove that by using the generalized Kullback-Leibler divergence one obtains such a factorization naturally. This also links the proposed method to Probabilistic Latent Semantic Analysis (pLSA) [65] and leads to some applications, such as the approximation of stochastic matrices and approximations that preserve the Perron vectors.

Chapter 6: Weights in nonnegative matrix factorization. This chapter incorporates weights into the nonnegative matrix factorization algorithms. We extend the multiplicative rules to take weights into account, and we point out a link between the weighted Euclidean distance and the weighted generalized Kullback-Leibler divergence. A numerical experiment is carried out on a database of human facial images, where weights are added to emphasize some parts of the images.

Chapter 7: Symmetry in nonnegative matrix factorization. Some symmetric structures are imposed on the nonnegative matrix factorization. While solving the exact symmetric nonnegative matrix factorization is a hard problem, related to the class of completely positive matrices, approximate methods can nevertheless be designed. Several variants are treated. At the end, we mention two applications: graph clustering and nonnegative factorization of correlation matrices.

    Some conclusions drawn from our research end the thesis.


    Related publications

2005. V.D. Blondel, N.-D. Ho and P. Van Dooren. Nonnegative matrix factorization - applications and extensions. Technical Report 005-35, CESAME, Université catholique de Louvain, Belgium.

2005. N.-D. Ho and P. Van Dooren. On the pseudo-inverse of the Laplacian of a bipartite graph. Applied Mathematics Letters, vol. 18, pp. 917-922, 2005.

2006. A. Vandendorpe, N.-D. Ho, S. Vanduffel and P. Van Dooren. On the parameterization of the CreditRisk+ model for estimating credit portfolio risk. To appear in Insurance: Mathematics and Economics.

2007. V.D. Blondel, N.-D. Ho and P. Van Dooren. Weighted nonnegative matrix factorization and face feature extraction. Submitted to Image and Vision Computing.

2007. N.-D. Ho and P. Van Dooren. Nonnegative matrix factorization with fixed row and column sums. Linear Algebra and Its Applications (2007), doi:10.1016/j.laa.2007.02.026.

2007. N.-D. Ho, P. Van Dooren and V.D. Blondel. Descent algorithms for nonnegative matrix factorization. Survey paper. To appear in Numerical Linear Algebra in Signals, Systems and Control.


1 PRELIMINARIES

This chapter introduces the basic results and concepts used throughout this thesis. Known results are only stated, without proof.

1.1 Matrix theory and linear algebra

An m × n real matrix is a table of real scalars with m rows and n columns. We have a square matrix when the number of rows is equal to the number of columns. The set of m × n real matrices is denoted by R^{m×n}. In this thesis, all matrices are real. We use uppercase letters for matrices. The i-th row of the matrix A is denoted by A_{i:}. The j-th column of the matrix A is denoted by A_{:j}. The element at the intersection of the i-th row and the j-th column of the matrix A is denoted by A_{ij} or [A]_{ij}.

A column vector is a matrix with only one column. Likewise, a row vector is a matrix with only one row. Unless explicitly stated otherwise, a vector is always a column vector. The set of all vectors of size n is R^n. Vectors are denoted by lowercase letters, except when they are parts of a matrix as described in the preceding paragraph.

An n × n square matrix A is said to be symmetric if A_{ij} = A_{ji} for all i, j. A diagonal matrix D is a square matrix having nonzero elements only on its main diagonal (i.e., D_{ij} = 0 for i ≠ j). We use D_x to denote a diagonal matrix with the vector x on its main diagonal (i.e., [D_x]_{ii} = x_i, i = 1, ..., n).

    Here are some special matrices:


Matrices whose elements are all equal to 1: 1_{1×n}, 1_{m×1}, 1_{m×n}, with

   1_{1×n} = (1, 1, ..., 1),   1_{m×1} = (1, 1, ..., 1)^T,   1_{m×n} = 1_{m×1} 1_{1×n}.

Unit vectors:

   e_i = (0, 0, ..., 1, ..., 0)^T,  with the 1 in the i-th position.

Identity matrices I_n: diagonal matrices whose diagonal elements are all equal to 1.

Permutation matrices: square matrices having in each row and in each column only one nonzero element, which is equal to 1.

Selection matrices: submatrices of permutation matrices.

1.1.1 Matrix manipulation

Here are some basic matrix operators:

Matrix transpose A^T: [A^T]_{ij} := A_{ji}. A is symmetric if and only if A^T = A.

Matrix addition C = A + B: C_{ij} := A_{ij} + B_{ij}.

Matrix product C = A.B: C_{ij} := Σ_k A_{ik} B_{kj}. The product dot is often omitted.

Matrix vectorization of A ∈ R^{m×n}: vec(A) ∈ R^{mn} is the vector obtained by stacking the columns A_{:1}, ..., A_{:n} of A on top of one another.

Kronecker product of a matrix A ∈ R^{m×n} and a matrix B:

   A ⊗ B = [ A_{11}B  ...  A_{1n}B
             ...           ...
             A_{m1}B  ...  A_{mn}B ].


An important relation between the matrix product and the Kronecker product is the following [118]:

   vec(A X B^T) = (B ⊗ A) vec(X).
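This identity is easy to check numerically; here is a small sketch (the matrix sizes are chosen arbitrarily and column-major flattening is used for vec):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((3, 4))
    X = rng.random((4, 5))
    B = rng.random((2, 5))

    lhs = (A @ X @ B.T).reshape(-1, order="F")       # vec(A X B^T), column stacking
    rhs = np.kron(B, A) @ X.reshape(-1, order="F")   # (B kron A) vec(X)
    assert np.allclose(lhs, rhs)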

We write A < B if A_{ij} < B_{ij} for all i, j, and similarly for A ≤ B, A > B and A ≥ B. For a scalar α ∈ R, we use A ≤ α, A ≥ α, A < α and A > α as abbreviations of A ≤ α1_{m×n}, A ≥ α1_{m×n}, A < α1_{m×n} and A > α1_{m×n}. The absolute matrix |A| is defined by [|A|]_{ij} = |A_{ij}| for all i, j.

We define the inner product of two real vectors x, y ∈ R^n as the real functional

   ⟨x, y⟩ = Σ_i x_i y_i = x^T y.

Nonzero vectors x, y ∈ R^n are said to be orthogonal if their inner product is zero: ⟨x, y⟩ = 0.

Considering a general matrix A ∈ R^{m×n} as a vector vec(A) ∈ R^{mn}, we can also define the inner product of two real matrices of the same size:

   ⟨A, B⟩ = vec(A)^T vec(B) = Σ_{ij} A_{ij} B_{ij} = trace(A^T B),

where the trace of A, trace(A), is the sum of the diagonal elements of A. This implies the following useful relation:

   ⟨I, ABC⟩ = ⟨A^T, BC⟩ = ⟨B^T A^T, C⟩ = ⟨C^T B^T A^T, I⟩ = trace(ABC).

A square matrix A is said to be invertible if there exists a matrix B such that

   AB = BA = I,

where B is called the inverse of A and is denoted by B = A^{-1}. While not all matrices have an inverse, the pseudo-inverse (or Moore-Penrose pseudo-inverse) generalizes the inverse, even to rectangular matrices. The uniquely defined pseudo-inverse A^+ of the matrix A satisfies the following four conditions:

   AA^+A = A,   A^+AA^+ = A^+,   (AA^+)^T = AA^+   and   (A^+A)^T = A^+A.


In particular, if A^T A is invertible, then A^+ = (A^T A)^{-1} A^T.

The matrix sum C = A + B is defined as C_{ij} = A_{ij} + B_{ij}. This operator is said to be elementwise or entrywise, since each entry of the result matrix C depends only on the entries of A and B at the same position. This is contrary to the usual matrix product C = AB, where the relations are no longer local. A simpler matrix product that is elementwise is the Hadamard product (or Schur product) C = A ∘ B, where C_{ij} = A_{ij} B_{ij} and A, B and C are m × n matrices. This helps considerably to simplify matrix formulas in many cases. Here are some properties of the Hadamard product [67]:

   A ∘ B = B ∘ A,
   A^T ∘ B^T = (A ∘ B)^T,
   (a ∘ b)(c ∘ d)^T = (a c^T) ∘ (b d^T) = (a d^T) ∘ (b c^T).

The following are some relations between the Hadamard product and other operators:

   1^T (A ∘ B) 1 = ⟨A, B⟩,
   A ∘ B = P^T (A ⊗ B) Q,  where P and Q are the selection matrices
   P = (e_1 ⊗ e_1, e_2 ⊗ e_2, ..., e_m ⊗ e_m)  and  Q = (e_1 ⊗ e_1, e_2 ⊗ e_2, ..., e_n ⊗ e_n).

Roughly speaking, A ∘ B is a submatrix of A ⊗ B.

From the definition of the Hadamard product, we can define other elementwise operators:

   Hadamard power:    [A^{∘r}]_{ij} = (A_{ij})^r,  r ∈ R,
   Hadamard division: C = [A]/[B] = A ∘ B^{∘(-1)}.
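These elementwise operators, and the relation 1^T(A ∘ B)1 = ⟨A, B⟩ above, can be checked directly with a few lines of numpy (a small sketch with arbitrary data):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((3, 4)) + 0.1
    B = rng.random((3, 4)) + 0.1

    had_prod = A * B        # Hadamard (elementwise) product
    had_div  = A / B        # Hadamard division [A]/[B]
    had_pow  = A ** 2.5     # Hadamard power

    # 1^T (A o B) 1 = <A, B> = trace(A^T B)
    assert np.isclose(had_prod.sum(), np.trace(A.T @ B))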


1.1.2 Vector subspaces

A linear subspace E of R^n is the set of all linear combinations of a set of vectors V = {v_1, v_2, ..., v_k} of R^n:

   E = { Σ_{i=1}^{k} α_i v_i  |  α_i ∈ R }.

E is also called the span of V, and V is called a spanning set of E. Given a subspace E, there are many spanning sets. Among them, a set from which no vector can be removed without changing the span is said to be linearly independent and is a basis of E. The cardinality of a basis of E is fixed and is called the dimension of E.

For example,

   E = span{ (1, 2, 1)^T, (1, 0, 0)^T }

is a subspace of R^3 and dim(E) = 2, since {(1, 2, 1)^T, (1, 0, 0)^T} is linearly independent. Following this, the rank of an m × n matrix A can also be defined as the dimension of the subspace spanned by the columns of A:

   rank(A) = dim(span(A_{:1}, A_{:2}, ..., A_{:n})) ≤ min(m, n).

A linear subspace is closed under addition and scalar multiplication, i.e.,

   u, v ∈ E ⟹ u + v ∈ E;    u ∈ E, α ∈ R ⟹ αu ∈ E.

1.1.3 Eigenvalues and eigenvectors

Central concepts in matrix analysis are the eigenvalues and eigenvectors of a square matrix. They provide essential information about the matrix. Related concepts for rectangular matrices are the so-called singular values and singular vectors. They play a crucial role in low-rank approximations that retain the dominating characteristics of the original matrix.

Definition 1.1. A scalar λ ∈ C is an eigenvalue of the matrix A ∈ C^{n×n} if there exists a nonzero vector x ∈ C^n such that Ax = λx. The vector x is called an eigenvector associated with the eigenvalue λ.


An n × n matrix has exactly n eigenvalues (counted with multiplicity). The set of all eigenvalues is denoted by Λ(A). The maximum modulus over Λ(A) is the spectral radius of A, denoted by ρ(A):

   ρ(A) = max{ |λ|  :  λ ∈ Λ(A) }.

In this thesis, only eigenvalues and eigenvectors of some symmetric matrices are investigated. For those matrices, the following well-known results can be established:

Theorem 1.2 (Spectral theorem). Let A be a real symmetric matrix. All eigenvalues and eigenvectors of A are real.

Moreover, a real symmetric matrix A is said to be positive semidefinite (respectively negative semidefinite) if all its eigenvalues are nonnegative (respectively nonpositive). If all the eigenvalues are positive (respectively negative), A is said to be positive definite (respectively negative definite).

A very useful tool in matrix analysis is the Singular Value Decomposition, defined in the following theorem:

Theorem 1.3. For any matrix A ∈ R^{m×n}, there exist orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} such that

   A = U Σ V^T,    (1.1)

where

   Σ = [ diag(σ_1, ..., σ_r)    0_{r×(n−r)}
         0_{(m−r)×r}            0_{(m−r)×(n−r)} ],    (1.2)

and the singular values σ_i are real and non-increasing:

   σ_1 ≥ ... ≥ σ_r > 0.    (1.3)

The proof and algorithms can be found in [50]. Moreover, the columns of U and V are eigenvectors of AA^T and A^T A, respectively.


1.1.4 Norms

A norm is used to measure the magnitude of a vector or a matrix. A norm on R^n (or R^{m×n}) is a real functional ‖·‖ on R^n (or R^{m×n}) that satisfies the following four conditions:

   ‖x‖ ≥ 0,  for all x ∈ R^n (or R^{m×n});
   ‖x‖ = 0  ⟺  x = 0;
   ‖αx‖ = |α| ‖x‖,  for all x ∈ R^n (or R^{m×n}) and α ∈ R;
   ‖x + y‖ ≤ ‖x‖ + ‖y‖,  for all x, y ∈ R^n (or R^{m×n}).

The most common norm is the Euclidean norm, or Frobenius norm, derived from the inner product:

   ‖x‖_F = sqrt(⟨x, x⟩),

where x can be either a vector or a matrix. This norm plays the central role in least squares problems, where one tries to minimize an error measured by this norm.

Popular norms are instances of the Hölder norms (p-norms):

   ‖x‖_p = ( Σ_{i=1}^{n} |x_i|^p )^{1/p},   p = 1, 2, ...,

of which the most commonly used are p = 1, p = 2 and p = ∞:

   1-norm:  ‖x‖_1 = |x_1| + |x_2| + ... + |x_n|,
   2-norm:  ‖x‖_2 = sqrt(|x_1|^2 + |x_2|^2 + ... + |x_n|^2),
   ∞-norm:  ‖x‖_∞ = max_i |x_i|.

For vectors, the 2-norm ‖·‖_2 is also the Frobenius norm ‖·‖_F. But this is no longer true for matrix p-norms, which are induced from the vector p-norms:

   ‖A‖_p = max_{x ≠ 0} ‖Ax‖_p / ‖x‖_p.

It is proved in [67] that

   1-norm:  ‖A‖_1 = max_j Σ_i |A_{ij}|,
   2-norm:  ‖A‖_2 = ( ρ(A^T A) )^{1/2},
   ∞-norm:  ‖A‖_∞ = max_i Σ_j |A_{ij}|.
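These conventions can be checked directly with numpy's norm routines (a small sketch on arbitrary data):

    import numpy as np

    x = np.array([3.0, -4.0, 1.0])
    A = np.array([[1.0, -2.0], [3.0, 4.0]])

    assert np.isclose(np.linalg.norm(x, 1), np.abs(x).sum())              # 1-norm
    assert np.isclose(np.linalg.norm(x, 2), np.sqrt((x ** 2).sum()))      # 2-norm
    assert np.isclose(np.linalg.norm(x, np.inf), np.abs(x).max())         # infinity-norm

    assert np.isclose(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max())       # max column sum
    assert np.isclose(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())  # max row sum
    assert np.isclose(np.linalg.norm(A, 2),
                      np.sqrt(np.linalg.eigvalsh(A.T @ A).max()))              # spectral norm
    assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt((A ** 2).sum()))       # Frobenius norm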


Since the main problem treated in this thesis is a constrained least squares problem, the Frobenius norm will be used extensively. Other norms will also be used to add further constraints to the main problem.

1.1.5 Convex cone and polyhedral cone

A set C ⊆ R^n is called a convex cone if it is closed under addition and nonnegative scalar multiplication, i.e.,

   u, v ∈ C ⟹ u + v ∈ C;    u ∈ C, α ≥ 0 ⟹ αu ∈ C.

A polyhedral cone is a convex cone nonnegatively generated by a finite set of vectors V = {v_1, v_2, ..., v_k} of R^n:

   C = { Σ_{i=1}^{k} α_i v_i  |  α_i ∈ R_+ }.

In this relation, C is also called the span of V, and V is called a spanning set of C. There exists a subset V* of V that nonnegatively generates C and from which no vector can be removed without changing the cone. V* is called the frame of C, and its cardinality is called the dimension of C.

1.1.6 Nonnegative matrices

Matrices whose elements are all nonnegative are called nonnegative matrices. We use R^n_+ and R^{m×n}_+ to denote the set of n-dimensional nonnegative vectors and the set of m × n nonnegative matrices, respectively. These subsets are, in fact, polyhedral cones and are usually called the nonnegative orthants.

A nonnegative matrix is called row-allowable if it has no zero row. Similarly, a nonnegative matrix is called column-allowable if it has no zero column. A nonnegative matrix is said to be column (row) stochastic if all its column (row) sums are equal to one. A nonnegative matrix is said to be doubly stochastic if it is both column stochastic and row stochastic.

The most important result for nonnegative matrices is the following:

Theorem 1.4 (Perron-Frobenius, see [8]). Let A be a square nonnegative matrix. Then there exist a largest-modulus eigenvalue of A which is nonnegative and a nonnegative eigenvector corresponding to it.


This vector is usually referred to as the Perron vector of the nonnegative matrix. For a rectangular nonnegative matrix, similar results can be established for the largest singular value and its corresponding singular vectors.

Given a subset V ⊆ R^{m×n} and a matrix A ∈ R^{m×n}, the nearest element of V to A (with respect to a given distance) is called the projection of A onto V, denoted by P_V(A). When the target set V is the nonnegative orthant and the considered distance is the Euclidean distance, the projection of A is denoted by [A]_+ and defined by

   [[A]_+]_{ij} = A_{ij} if A_{ij} > 0, and 0 otherwise,  i.e.  [[A]_+]_{ij} = max(0, A_{ij}).
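Both objects are cheap to compute. The sketch below shows the projection [A]_+ and a simple power iteration for approximating a Perron vector; the helper names project_nonnegative and perron_vector are ours, and the power iteration is a minimal sketch assuming the matrix has a single dominant eigenvalue and no zero iterate, not a robust general-purpose routine.

    import numpy as np

    def project_nonnegative(A):
        """Projection [A]_+ of A onto the nonnegative orthant (Euclidean distance)."""
        return np.maximum(A, 0.0)

    def perron_vector(A, n_iter=1000):
        """Power iteration for a nonnegative eigenvector of a square nonnegative matrix."""
        x = np.ones(A.shape[0]) / A.shape[0]
        for _ in range(n_iter):
            y = A @ x
            x = y / np.linalg.norm(y)
        return x, x @ (A @ x)      # approximate Perron vector and eigenvalue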

1.2 Optimization

Before presenting some basic results about optimization, we review the concepts of convex sets and convex functions.

1.2.1 Convex set and convex function

Definition 1.5 (Convex set). A set Ω is said to be convex if and only if for every u, v ∈ Ω we have

   λu + (1 − λ)v ∈ Ω,  for all λ ∈ [0, 1].

Clearly, convex cones and polyhedral cones are, by construction, convex sets, which implies that the set of m × n nonnegative matrices (the nonnegative orthant R^{m×n}_+) is also a convex set. The set R^{m×n}_+ is one of the main objects used in this thesis.

Definition 1.6 (Convex function). A function f defined on a convex set Ω is said to be convex if for every u, v ∈ Ω and every λ ∈ [0, 1] the following holds:

   f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v).

If for every λ ∈ (0, 1) and u ≠ v the inequality is strict, i.e.

   f(λu + (1 − λ)v) < λ f(u) + (1 − λ) f(v),

then f is said to be strictly convex.


For more details about convex sets and convex functions, see [21].

1.2.2 Optimality conditions

We now summarize some basic results on the optimization problem

   min_{x ∈ Ω} f(x),

where f is a real-valued function on the feasible set Ω ⊆ R^n. We distinguish two types of minima.

Definition 1.7 (Local minimum). A point x* ∈ Ω is said to be a local minimum of f over Ω if there exists an open neighborhood N(x*) of x* such that f(x*) ≤ f(x) for all x ∈ N(x*) ∩ Ω. It is a strict local minimum if f(x*) < f(x) for all x ∈ N(x*) ∩ Ω with x ≠ x*.

Definition 1.8 (Global minimum). A point x* ∈ Ω is said to be a global minimum of f over Ω if f(x*) ≤ f(x) for all x ∈ Ω. A point x* ∈ Ω is said to be a strict global minimum of f over Ω if f(x*) < f(x) for all x ∈ Ω with x ≠ x*.

Usually, unless f has some convexity properties, finding a global minimum is a very difficult task that requires global knowledge of the function f. On the other hand, finding local minima requires only knowledge of a neighborhood, and necessary conditions for local minima can easily be derived by differential calculus. This explains why, in our minimization problem, we will try to find a local minimum instead of a global one.

In order to set up necessary conditions satisfied by local minima, the basic idea is to look around a point using the concept of feasible directions. From a point x ∈ Ω, a vector d is a feasible direction if there is an ᾱ > 0 such that x + αd ∈ Ω for all α ∈ [0, ᾱ]. We have the following first-order necessary condition:

Proposition 1.9 ([87]). Let Ω be a subset of R^n and f a continuously differentiable function on Ω. If x* is a local minimum of f over Ω, then for every feasible direction d at x*, we have

   (∇f(x*))^T d ≥ 0.    (1.4)


Conversely, every point that satisfies condition (1.4) is called a stationary point. When x* is an interior point of Ω, every vector d is a feasible direction and (1.4) implies ∇f(x*) = 0.

If Ω is convex, all the feasible directions at x* can be generated by the vectors d = x − x*, with x ∈ Ω. Indeed, from the convexity of Ω we have

   x* + αd = x* + α(x − x*) = αx + (1 − α)x* ∈ Ω,  for all α ∈ [0, 1].

Therefore, a point x* is said to be a stationary point if it satisfies

   (∇f(x*))^T (x − x*) ≥ 0,  for all x ∈ Ω.

For the special case where f and Ω are convex, every local minimum is also a global minimum. Furthermore, the set of all such minima is convex. For more results and implications, see [87].

1.2.3 Karush-Kuhn-Tucker conditions

Let us consider the following constrained optimization problem:

   min f(x)   subject to   h_i(x) = 0 (i = 1, ..., k)   and   g_j(x) ≤ 0 (j = 1, ..., m),

where the h_i(x) = 0 are k equality constraints and the g_j(x) ≤ 0 are m inequality constraints. The following result is known as the Karush-Kuhn-Tucker necessary conditions (or KKT conditions):

Proposition 1.10 ([13]). Let x* be a local minimum of the above problem. Suppose that f, h_i and g_j are continuously differentiable functions from R^n to R, and that the gradients ∇h_i(x*) and ∇g_j(x*) are linearly independent. Then there exist unique constants λ_i (i = 1, ..., k) and μ_j (j = 1, ..., m) such that

   ∇f(x*) + Σ_{i=1}^{k} λ_i ∇h_i(x*) + Σ_{j=1}^{m} μ_j ∇g_j(x*) = 0,
   μ_j ≥ 0,   j = 1, ..., m,
   μ_j g_j(x*) = 0,   j = 1, ..., m.

This constrained problem is often associated with its Lagrange function

   L(x, λ_1, ..., λ_k, μ_1, ..., μ_m) = f(x) + Σ_{i=1}^{k} λ_i h_i(x) + Σ_{j=1}^{m} μ_j g_j(x),


where the λ_i (i = 1, ..., k) and μ_j (j = 1, ..., m) are the same as those in the KKT conditions and are called Lagrange multipliers.

1.2.4 Coordinate descent algorithm on a convex set

We briefly describe a method for solving the problem

   min_{x ∈ Ω} f(x),

where Ω ⊆ R^n is a Cartesian product of closed convex sets Ω_1, Ω_2, ..., Ω_m, with Ω_i ⊆ R^{n_i} (i = 1, ..., m) and Σ_i n_i = n. The variable x is partitioned accordingly as

   x = (x_1, ..., x_m),   where x_i ∈ Ω_i.

Algorithm 1 is called the coordinate descent algorithm.

Algorithm 1  Coordinate descent
   1: Initialize x_i, i = 1, ..., m
   2: repeat
   3:   for i = 1 to m do
   4:     Solve x_i = argmin_{ξ ∈ Ω_i} f(x_1, ..., x_{i−1}, ξ, x_{i+1}, ..., x_m)
   5:   end for
   6: until stopping condition
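A generic Python sketch of Algorithm 1 is given below. The exact per-block minimization of Step 4 is assumed to be supplied by the caller as solve_block, and the simple change-based stopping test is only one possible choice.

    import numpy as np

    def coordinate_descent(x0, solve_block, n_blocks, max_sweeps=100, tol=1e-8):
        """Block coordinate descent (Algorithm 1).
        x0:          list of initial blocks x_1, ..., x_m
        solve_block: function (i, x) -> argmin over block i of f, with the other
                     blocks of x held fixed (assumed to be solved exactly)
        """
        x = [np.array(b, dtype=float) for b in x0]
        for _ in range(max_sweeps):
            change = 0.0
            for i in range(n_blocks):
                new_block = solve_block(i, x)
                change += np.linalg.norm(new_block - x[i])
                x[i] = new_block
            if change < tol:        # simple stopping condition
                break
        return x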

If we assume that Step 4 of Algorithm 1 can be solved exactly and that the minimum is uniquely attained, then we have the following result. Because this result is used extensively in this thesis, we include its proof, taken from Proposition 2.7.1 in [13].

Theorem 1.11 (Convergence of the coordinate descent method). Suppose that f is a continuously differentiable function over the set Ω described above. Furthermore, suppose that for each i and x ∈ Ω, the minimum of

   min_{ξ ∈ Ω_i} f(x_1, ..., x_{i−1}, ξ, x_{i+1}, ..., x_m)

is uniquely attained. Let {x^k} be the sequence generated by Algorithm 1. Then every limit point of {x^k} is a stationary point.


Proof. Let

   z_i^k = (x_1^{k+1}, ..., x_i^{k+1}, x_{i+1}^k, ..., x_m^k).

Step 4 of Algorithm 1 implies

   f(x^k) ≥ f(z_1^k) ≥ f(z_2^k) ≥ ... ≥ f(z_{m−1}^k) ≥ f(z_m^k),   for all k.    (1.5)

Let x̄ = (x̄_1, ..., x̄_m) be a limit point of the sequence {x^k}. Notice that x̄ ∈ Ω because Ω is closed. Equation (1.5) implies that the sequence {f(x^k)} converges to f(x̄). It now remains to show that x̄ minimizes f over Ω.

Let {x^{k_j} | j = 0, 1, ...} be a subsequence of {x^k} that converges to x̄. We first show that {x_1^{k_j+1} − x_1^{k_j}} converges to zero as j → ∞. Assume the contrary or, equivalently, that {z_1^{k_j} − x^{k_j}} does not converge to zero. Let γ^{k_j} = ‖z_1^{k_j} − x^{k_j}‖. By possibly restricting to a subsequence of {k_j}, we may assume that there exists some γ̄ > 0 such that γ^{k_j} ≥ γ̄ for all j. Let s_1^{k_j} = (z_1^{k_j} − x^{k_j}) / γ^{k_j}. Thus z_1^{k_j} = x^{k_j} + γ^{k_j} s_1^{k_j}, ‖s_1^{k_j}‖ = 1, and s_1^{k_j} differs from zero only along the first block-component. Notice that s_1^{k_j} belongs to a compact set and therefore has a limit point s̄_1. By restricting to a further subsequence of {k_j}, we assume that s_1^{k_j} converges to s̄_1.

Let us fix some ε ∈ [0, 1]. Notice that 0 ≤ εγ̄ ≤ γ^{k_j}. Therefore, x^{k_j} + εγ̄ s_1^{k_j} lies on the segment joining x^{k_j} and x^{k_j} + γ^{k_j} s_1^{k_j} = z_1^{k_j}, and belongs to Ω because Ω is convex. Using the fact that z_1^{k_j} minimizes f over all x that differ from x^{k_j} along the first block-component, we obtain

   f(z_1^{k_j}) = f(x^{k_j} + γ^{k_j} s_1^{k_j}) ≤ f(x^{k_j} + εγ̄ s_1^{k_j}) ≤ f(x^{k_j}).

Since f(x^{k_j}) converges to f(x̄), Equation (1.5) shows that f(z_1^{k_j}) also converges to f(x̄). We now take the limit as j tends to infinity, to obtain f(x̄) ≤ f(x̄ + εγ̄ s̄_1) ≤ f(x̄). We conclude that f(x̄) = f(x̄ + εγ̄ s̄_1) for every ε ∈ [0, 1]. Since γ̄ s̄_1 ≠ 0, this contradicts the hypothesis that f is uniquely minimized when viewed as a function of the first block-component. This contradiction establishes that x_1^{k_j+1} − x_1^{k_j} converges to zero. In particular, z_1^{k_j} converges to x̄.


From Step 4 of Algorithm 1, we have

   f(z_1^{k_j}) ≤ f(x_1, x_2^{k_j}, ..., x_m^{k_j}),   for all x_1 ∈ Ω_1.

Taking the limit as j tends to infinity, we obtain

   f(x̄) ≤ f(x_1, x̄_2, ..., x̄_m),   for all x_1 ∈ Ω_1.

Using Proposition 1.9 over the convex set Ω_1, we conclude that

   ∇_1 f(x̄)^T (x_1 − x̄_1) ≥ 0,   for all x_1 ∈ Ω_1,

where ∇_i f denotes the gradient of f with respect to the component x_i.

Let us now consider the sequence {z_1^{k_j}}. We have already shown that z_1^{k_j} converges to x̄. A verbatim repetition of the preceding argument shows that x_2^{k_j+1} − x_2^{k_j} converges to zero and that ∇_2 f(x̄)^T (x_2 − x̄_2) ≥ 0 for every x_2 ∈ Ω_2. Continuing inductively, we obtain ∇_i f(x̄)^T (x_i − x̄_i) ≥ 0 for every x_i ∈ Ω_i and for every i. Adding these inequalities, and using the Cartesian product structure of the set Ω, we conclude that ∇f(x̄)^T (x − x̄) ≥ 0 for every x ∈ Ω.

    1.3 Low-rank matrix approximation

Low-rank approximation is a special case of the matrix nearness problem [58]. When only a rank constraint is imposed, the optimal approximation with respect to the Frobenius norm can be obtained from the Singular Value Decomposition.

We first investigate the problem without the nonnegativity constraint on the low-rank approximation. This is useful for understanding the properties of the approximation when the nonnegativity constraints are imposed but inactive. We begin with the well-known Eckart-Young theorem.

Theorem 1.12 (Eckart-Young). Let A ∈ R^{m×n} (m ≥ n) have the singular


value decomposition

   A = P Σ Q^T,   with   Σ = [ diag(σ_1, σ_2, ..., σ_n)
                                0_{(m−n)×n}            ],

where σ_1 ≥ σ_2 ≥ ... ≥ σ_n ≥ 0 are the singular values of A, and where P ∈ R^{m×m} and Q ∈ R^{n×n} are orthogonal matrices. Then, for 1 ≤ r ≤ n, the matrix

   A_r = P Σ_r Q^T,   with Σ_r obtained from Σ by setting σ_{r+1} = ... = σ_n = 0,

is a global minimizer of the problem

   min_{B ∈ R^{m×n}, rank(B) ≤ r}  (1/2) ‖A − B‖_F^2    (1.6)

and its error is

   (1/2) ‖A − A_r‖_F^2 = (1/2) Σ_{i=r+1}^{n} σ_i^2.

Moreover, if σ_r > σ_{r+1}, then A_r is the unique global minimizer.

The proof and other implications can be found, for instance, in [50]. The columns of P and Q are called the singular vectors of A; the singular vectors corresponding to the largest singular values are referred to as the dominant singular vectors.
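As a quick numerical illustration of Theorem 1.12, the following sketch truncates the SVD of a random matrix and checks that the error equals half the sum of the squared discarded singular values (the matrix sizes and the rank r = 2 are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((8, 6))
    r = 2

    P, s, QT = np.linalg.svd(A, full_matrices=False)   # A = P diag(s) Q^T
    A_r = P[:, :r] @ np.diag(s[:r]) @ QT[:r, :]        # truncated SVD

    error = 0.5 * np.linalg.norm(A - A_r, 'fro') ** 2
    assert np.isclose(error, 0.5 * (s[r:] ** 2).sum())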

Let us now look at the following modified problem:

   min_{X ∈ R^{m×r}, Y ∈ R^{n×r}}  (1/2) ‖A − XY^T‖_F^2,    (1.7)


where the rank constraint is implicit in the product XY^T, since the dimensions of X and Y guarantee that rank(XY^T) ≤ r. Conversely, every matrix of rank at most r can trivially be rewritten as a product XY^T, where X ∈ R^{m×r} and Y ∈ R^{n×r}. Therefore, Problems (1.6) and (1.7) are equivalent. But even when the product A_r = XY^T is unique, the pairs (XR^T, YR^{-1}), with R invertible, all yield the same product XY^T. In order to avoid this, we can always choose X and Y such that

   X = P D^{1/2}  and  Y = Q D^{1/2},    (1.8)

where P^T P = I_{r×r}, Q^T Q = I_{r×r} and D is an r × r nonnegative diagonal matrix. Doing this is equivalent to computing a compact SVD decomposition of the product A_r = XY^T = P D Q^T.

As usual for optimization problems, we calculate the gradients with respect to X and Y and set them equal to 0:

   ∇_X = XY^TY − AY = 0,   ∇_Y = YX^TX − A^TX = 0.    (1.9)

If we premultiply the first equation by A^T and the second by A, we obtain

   (A^TA)Y = (A^TX)Y^TY,   (AA^T)X = (AY)X^TX.    (1.10)

Replacing A^TX = YX^TX and AY = XY^TY in (1.10) yields

   (A^TA)Y = YX^TXY^TY,   (AA^T)X = XY^TYX^TX.    (1.11)

Replacing (1.8) in (1.11) yields

   (A^TA)QD^{1/2} = QDP^TPDQ^TQD^{1/2}   and   (AA^T)PD^{1/2} = PDQ^TQDP^TPD^{1/2}.

When D is invertible, this finally yields

   (A^TA)Q = QD^2   and   (AA^T)P = PD^2.

This shows that the columns of P and Q are singular vectors, and that the D_{ii}'s are nonzero singular values of A. Notice that if D is singular, one can discard the corresponding columns of P and Q and reduce the factorization to a smaller-rank approximation with the same properties. Without loss of generality, we can therefore focus on approximations of Problem (1.7) which are of exact rank r. We summarize the above reasoning in the following theorem.


Theorem 1.13. Let A ∈ R^{m×n} (m > n and rank(A) = t). If A_r (1 ≤ r ≤ t) is a rank-r stationary point of Problem (1.7), then there exist two orthogonal matrices P ∈ R^{m×m} and Q ∈ R^{n×n} such that

   A = P Σ Q^T   and   A_r = P Σ_r Q^T,

where

   Σ = [ diag(σ_1, σ_2, ..., σ_n)
         0_{(m−n)×n}            ],

Σ_r is obtained from Σ by setting σ_{r+1} = ... = σ_n = 0, and the σ_i's are unsorted singular values of A. Moreover, the approximation error is

   (1/2) ‖A − A_r‖_F^2 = (1/2) Σ_{i=r+1}^{t} σ_i^2.

This result shows that, if the singular values are all different, there are n!/(r!(n−r)!) possible stationary points A_r. When there are multiple singular values, there are infinitely many stationary points A_r, since there are infinitely many singular subspaces. The next result identifies the minima among all these stationary points. The other stationary points are saddle points, every neighborhood of which contains both lower and higher points.

Theorem 1.14. The only minima of Problem (1.7) are given by Theorem 1.12 and are global minima. All other stationary points are saddle points.

Proof. Let us assume that A_r is a stationary point given by Theorem 1.13 but not by Theorem 1.12. Then there always exists a permutation of the columns of P and Q, and of the diagonal elements of Σ and Σ_r, such that σ_{r+1} > σ_r. We then construct two points in the ε-neighborhood of A_r that yield an increase and a decrease, respectively, of the distance


    measure. They are obtained by taking:

   Σ_r(ε): obtained from Σ_r by replacing σ_1 with σ_1 + ε,   A_r(ε) = P Σ_r(ε) Q^T,

and

   Σ̃_r(ε): obtained from Σ_r by replacing the 2 × 2 diagonal block at rows and columns (r, r+1) with

      σ_r [ 1   ε
            ε   ε² ],

   Ã_r(ε) = P Σ̃_r(ε) Q^T.

    Clearly Ar()and Ar()are of rankr. Evaluating the distance measureyields

    A Ar()2F = 2r2 + (r+1 2)2 +t

    i=r+2

    2i

    = 2[2 2(r+1 r)] +t

    i=r+1

    2i

    t

    i=r+1

    2i =A Ar2F

    for all >0. Hence, for an arbitrarily small positive, we obtain

    A Ar()2F


When we add a nonnegativity constraint in the next chapter, the results of this section will help to identify stationary points at which all the nonnegativity constraints are inactive.


2 NONNEGATIVE MATRIX FACTORIZATION

This chapter is a presentation of the Nonnegative Matrix Factorization problem. It consists of the formulation of the problem, the description of the solutions and some observations. It gives the basics for the rest of this thesis. Some observations are studied more carefully in later chapters.

One could argue that the name Nonnegative Matrix Factorization may be misleading in some cases and that Nonnegative Matrix Approximation should be used instead. The term Factorization may be understood as an exact decomposition, such as the Cholesky decomposition, the LU decomposition, etc., where the input matrix is exactly factorized as a product of other matrices. However, Nonnegative Matrix Factorization has become so popular that it does stand for the problem of approximating a nonnegative matrix by a product of two nonnegative matrices. We continue to use this term, and refer to Exact Nonnegative Matrix Factorization for the exact case.

2.1 Problem statement

Nonnegative Matrix Factorization was first introduced by Paatero and Tapper in [97], but it gained popularity through the works of Lee and Seung [80]. They argue that nonnegativity is important in human perception, and they give two simple algorithms for finding a nonnegative representation of nonnegative data. Given an m × n nonnegative matrix A and an integer r > 0, the problem consists in finding two nonnegative matrices U ∈ R^{m×r}_+ and V ∈ R^{n×r}_+ whose product UV^T approximates A as well as possible (Problem 2.1),


where r is called the reduced rank. From now on, m and n will be used to denote the dimensions of the target matrix A, and r denotes the reduced rank of a factorization.

We rewrite the nonnegative matrix factorization as a standard nonlinear optimization problem:

   min_{U ≥ 0, V ≥ 0}  (1/2) ‖A − UV^T‖_F^2.

The associated Lagrangian function is

   L(U, V, μ, ν) = (1/2) ‖A − UV^T‖_F^2 − ⟨μ, U⟩ − ⟨ν, V⟩,

where μ and ν are two matrices of the same size as U and V, respectively, containing the Lagrange multipliers associated with the nonnegativity constraints U_{ij} ≥ 0 and V_{ij} ≥ 0. The Karush-Kuhn-Tucker conditions for the nonnegative matrix factorization problem then say that if (U, V) is a local minimum, there exist μ_{ij} ≥ 0 and ν_{ij} ≥ 0 such that:

   U ≥ 0,   V ≥ 0,    (2.1)
   ∇_U L = 0,   ∇_V L = 0,    (2.2)
   μ ∘ U = 0,   ν ∘ V = 0.    (2.3)

Developing (2.2), we have

   AV − UV^TV + μ = 0,   A^TU − VU^TU + ν = 0,

or

   μ = UV^TV − AV,   ν = VU^TU − A^TU.

Combining this with μ_{ij} ≥ 0, ν_{ij} ≥ 0 and (2.3) gives the following conditions:

   U ≥ 0,   V ≥ 0,    (2.4)
   ∇_U F = UV^TV − AV ≥ 0,   ∇_V F = VU^TU − A^TU ≥ 0,    (2.5)
   U ∘ (UV^TV − AV) = 0,   V ∘ (VU^TU − A^TU) = 0,    (2.6)

where F denotes the objective function (1/2)‖A − UV^T‖_F^2; the Lagrange multipliers associated with U and V are thus exactly the gradients of F with respect to U and V.
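For a given pair (U, V), conditions (2.4)-(2.6) are straightforward to check numerically. The following sketch simply evaluates the gradients and the KKT residuals up to a tolerance; the helper name nmf_kkt_residuals and the tolerance are assumptions, and this is a verification aid, not an algorithm for computing stationary points.

    import numpy as np

    def nmf_kkt_residuals(A, U, V, tol=1e-8):
        """Check the KKT conditions (2.4)-(2.6) for min 0.5*||A - U V^T||_F^2 with U, V >= 0."""
        F_U = U @ (V.T @ V) - A @ V        # gradient of F with respect to U
        F_V = V @ (U.T @ U) - A.T @ U      # gradient of F with respect to V
        nonneg   = (U.min() >= -tol) and (V.min() >= -tol)                            # (2.4)
        grad_ok  = (F_U.min() >= -tol) and (F_V.min() >= -tol)                        # (2.5)
        compl_ok = (np.abs(U * F_U).max() <= tol) and (np.abs(V * F_V).max() <= tol)  # (2.6)
        return nonneg and grad_ok and compl_ok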


Since the Euclidean distance is not convex with respect to both variables U and V at the same time, these conditions are only necessary; indeed, saddle points and maxima may also satisfy them. We call all the points that satisfy the above conditions stationary points.

Definition 2.2 (NMF stationary point). (U, V) is a stationary point of the NMF problem if and only if U and V satisfy the KKT conditions (2.4), (2.5) and (2.6).

Alternatively, a stationary point (U, V) of the NMF problem can also be defined by using the condition of Proposition 1.9 on the convex sets R^{m×r}_+ and R^{n×r}_+, that is,

   ⟨∇_U F, X − U⟩ + ⟨∇_V F, Y − V⟩ ≥ 0,   for all X ∈ R^{m×r}_+ and Y ∈ R^{n×r}_+,    (2.7)

which can be shown to be equivalent to the KKT conditions (2.4), (2.5) and (2.6). Indeed, it is trivial that the KKT conditions imply (2.7), and by carefully choosing particular values of X and Y in (2.7), one can easily prove that the KKT conditions hold.

Representing a rank-k matrix by a product UV^T is, in fact, rarely used because the representation is not unique. Since a nonnegative factorization is, by definition, in this form, the rest of this section tries to fix the uniqueness problem and to establish some simple relations between the approximations.

Let us consider the simplest nonnegative factorization problem, where the matrix A is just a scalar a. The problem (of rank one) is then to find two scalars x and y whose product approximates a. Problem 2.1 admits only exact approximations, (a − xy)² = 0, and we have an infinite number of solutions, given by the graph xy = a (Figure 2.1).

If we impose the unit norm condition on x (i.e. x = 1), then for this particular case there is only one solution (x = 1 and y = a).

To extend this scaling technique to higher dimensions, we continue to constrain the first factor U to have unit-norm columns. But this no longer guarantees the uniqueness of the approximations. Moreover, it is not easy to determine when and how uniqueness is obtainable.

Two approximations (U_1, V_1) and (U_2, V_2) are said to be equivalent iff they yield the same product, i.e. U_1 V_1^T = U_2 V_2^T.


    Figure 2.1: Graph a = xy

    From a stationary point(U, V), if we can find an invertible matrixSsuch that U=U S

    0 and V=V(S1)T

    0, have we constructed an

    equivalent stationary point(U, V)? By plugging the matrices Uand Vinto the KKT conditions, we can see that the answer is not always easy.Indeed, ifUand Vare made to be nonnegative, then according to theKKT conditions(2.5) and (2.6), we should also have:

    (UVTV AV)(S1)T 0,(VUTU ATU)S0,(US)

    (UVTV AV)(S1)T

    =0,

    (V(S1)T) (VUTU ATU)S=0.

    In particular, for a permutation matrixS, these conditions are easilychecked. In this case, all the columns ofUand Vare retained in Uand V, but in a permuted order, which generate essentially the samepoint. Note thatScan not be a nonnegativemonomial matrix(i.e. matrixcreated from a permutation matrix by replacing some elements equalto 1 by other positive numbers), since UandUare constrained to haveunit-norm columns.

For a general $S$, the study of the uniqueness of the stationary point is no longer easy and might be treated only on a case-by-case basis. For example, we remark that at some (stationary) points $(U, V)$, $S$ must


be a permutation matrix, otherwise the nonnegativity of $(\tilde U, \tilde V)$ will not be met. This implies that we can not generate other equivalent approximations. The following result helps to identify a class of them.

Lemma 2.3. If $U$ and $V$ both contain an $r \times r$ monomial matrix, then $S$ can only be a permutation matrix.

Proof. The assumption implies that we can select $r$ rows of $U$ to form an $r \times r$ monomial matrix $U_r$ and $r$ rows of $V$ to form an $r \times r$ monomial matrix $V_r$. Then the nonnegativity constraints imply

$$U_r S \ge 0 \quad\text{and}\quad V_r(S^{-1})^T \ge 0.$$

Since $U_r$ and $V_r$ are nonnegative monomial matrices, both $S$ and $S^{-1}$ must be nonnegative. This is only possible when $S$ is a monomial matrix$^1$. Moreover, since $U$ and $US$ are constrained to have unit-norm columns, $S$ can only be a permutation matrix.

Another way to consider the set of equivalent stationary points is to identify them with all the possible exact factorizations of the matrix $UV^T$, where $(U, V)$ is one known entry of the set. But there is no easy method to construct this set.

A better representation of the stationary point is similar to the singular value decomposition. We can use a triplet $(U, D, V)$ to represent an NMF stationary point. So, instead of solving Problem 2.1, we solve the following problem:

$$\{(u_i, d_i, v_i)\}_{i=1}^{r} = \arg\min_{\substack{u_i \ge 0,\; u_i^Tu_i = 1\\ v_i \ge 0,\; v_i^Tv_i = 1\\ d_i \ge 0}} \left\| A - \sum_{i=1}^{r} d_i u_i v_i^T \right\|_2^2,$$

$^1$ We have $S \ge 0$ and $S^{-1} \ge 0$, and the off-diagonal elements of $SS^{-1} = I_r$ and $S^{-1}S = I_r$ are zero. As a consequence, if $S_{ij} > 0$, we can conclude that $S^{-1}_{jk} = 0$ and $S^{-1}_{li} = 0$ for $k \ne i$ and $l \ne j$. Because $S^{-1}$ is invertible, hence can not contain zero rows or columns, $S^{-1}_{ji}$ is the only positive element on the $j$th row and $i$th column of $S^{-1}$. Reciprocally, $S^{-1}_{ji} > 0$ implies that $S_{ij}$ is the only positive element on the $i$th row and $j$th column of $S$. Since $S$ can not contain zero rows or columns, repeating the above reasoning through all the nonzero elements of $S$ yields the desired result.


or, in matrix representation, $U$ and $V$ are nonnegative matrices with unit-norm columns and $D$ is a nonnegative diagonal matrix. The matrix $A$ is then approximated by $UDV^T$. With this, we can also sort the components in decreasing order of the values of $D_{ii}$ (i.e. $D_{11} \ge D_{22} \ge \ldots \ge D_{rr}$). This helps to compare equivalent solutions.

In Chapter 4, we use this representation to design our iterative algorithm and point out its advantages.
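As a small illustration (a NumPy sketch, not from the thesis), any pair $(U, V)$ with nonzero columns can be rescaled into such a triplet with unit-norm columns and the scaling factors sorted in decreasing order, without changing the product:

```python
import numpy as np

def to_normalized_triplet(U, V):
    """Rescale (U, V) into (U, d, V) with unit 2-norm columns and d sorted
    decreasingly, so that U_old @ V_old.T == U @ diag(d) @ V.T."""
    nu = np.linalg.norm(U, axis=0)      # column norms of U (assumed nonzero)
    nv = np.linalg.norm(V, axis=0)      # column norms of V (assumed nonzero)
    Un, Vn = U / nu, V / nv             # unit-norm columns
    d = nu * nv                         # absorbed scaling factors
    order = np.argsort(-d)              # sort components by decreasing d_i
    return Un[:, order], d[order], Vn[:, order]

U = np.random.rand(5, 3)
V = np.random.rand(4, 3)
Un, d, Vn = to_normalized_triplet(U, V)
print(np.allclose(U @ V.T, Un @ np.diag(d) @ Vn.T))   # True: same product
```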

    2.2 Solution

There are two values of the reduced rank $r$ for which we can trivially identify the global solution: $r = 1$ and $r = \min(m, n)$. For $r = 1$, a pair of dominant singular vectors yields a global minimizer. And for $r = \min(m, n)$, $(U = A, V = I)$ is a global minimizer. Since most of the existing methods for the nonnegative matrix factorization are descent algorithms, we should pay attention to all local minimizers. For the rank-one case, they can easily be characterized.

    2.2.1 Rank one case

The rank-one NMF problem of a nonnegative matrix $A$ can be rewritten as

$$\min_{u \in \mathbb{R}^m_+,\; v \in \mathbb{R}^n_+} \frac{1}{2}\|A - uv^T\|_F^2 \tag{2.8}$$

and a complete analysis can be carried out. It is well known that any pair of nonnegative Perron vectors of $AA^T$ and $A^TA$ yields a global minimizer of this problem, but we can also show that the only stationary points of (2.8) are given by such vectors. The following theorem excludes the case where $u = 0$ and/or $v = 0$.

Theorem 2.4. The pair $(u, v)$ is a local minimizer of (2.8) if and only if $u$ and $v$ are nonnegative eigenvectors of $AA^T$ and $A^TA$ respectively, associated with the eigenvalue $\lambda = \|u\|_2^2\|v\|_2^2$.

Proof. The if part easily follows from Theorem 1.13. For the only if part we proceed as follows. Without loss of generality, we can permute the rows and columns of $A$ such that the corresponding vectors $u$ and $v$


are partitioned as $(u_+\; 0)^T$ and $(v_+\; 0)^T$ respectively, where $u_+, v_+ > 0$. Partition the corresponding matrix $A$ conformably as follows

$$A = \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22}\end{pmatrix},$$

then from (2.5) we have

$$\begin{pmatrix} u_+v_+^T & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix} v_+\\ 0\end{pmatrix} - \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22}\end{pmatrix}\begin{pmatrix} v_+\\ 0\end{pmatrix} \ge \begin{pmatrix} 0\\ 0\end{pmatrix}$$

and

$$\begin{pmatrix} v_+u_+^T & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix} u_+\\ 0\end{pmatrix} - \begin{pmatrix} A_{11}^T & A_{21}^T\\ A_{12}^T & A_{22}^T\end{pmatrix}\begin{pmatrix} u_+\\ 0\end{pmatrix} \ge \begin{pmatrix} 0\\ 0\end{pmatrix},$$

implying that $A_{21}v_+ \le 0$ and $A_{12}^Tu_+ \le 0$. Since $A_{21}, A_{12} \ge 0$ and $u_+, v_+ > 0$, we can conclude that $A_{12} = 0$ and $A_{21} = 0$. Then from (2.6) we have:

$$u_+ \circ \left(\|v_+\|_2^2\, u_+ - A_{11}v_+\right) = 0 \quad\text{and}\quad v_+ \circ \left(\|u_+\|_2^2\, v_+ - A_{11}^Tu_+\right) = 0.$$

Since $u_+, v_+ > 0$, we have:

$$\|v_+\|_2^2\, u_+ = A_{11}v_+ \quad\text{and}\quad \|u_+\|_2^2\, v_+ = A_{11}^Tu_+,$$

or

$$\|u_+\|_2^2\|v_+\|_2^2\, u_+ = A_{11}A_{11}^Tu_+ \quad\text{and}\quad \|u_+\|_2^2\|v_+\|_2^2\, v_+ = A_{11}^TA_{11}v_+.$$

Setting $\lambda = \|u_+\|_2^2\|v_+\|_2^2$ and using the block diagonal structure of $A$ yields the desired result.

Theorem 2.4 guarantees that all stationary points of the rank-one case are nonnegative singular vectors of a submatrix of $A$. These results imply that a global minimizer of the rank-one NMF can be calculated correctly based on the largest singular value and the corresponding singular vectors of the matrix $A$.
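A minimal NumPy sketch of this observation (illustration only; it assumes the dominant singular value is simple so that, by the Perron-Frobenius theorem, the dominant singular vectors can be taken elementwise nonnegative up to a sign flip):

```python
import numpy as np

def rank_one_nmf(A):
    """Rank-one NMF of a nonnegative matrix from its dominant singular triplet."""
    Us, s, Vt = np.linalg.svd(A)
    u, v = Us[:, 0], Vt[0, :]
    if u.sum() < 0:                                 # choose the nonnegative sign
        u, v = -u, -v
    u, v = np.maximum(u, 0.0), np.maximum(v, 0.0)   # clip tiny round-off negatives
    return s[0] * u, v                              # A ~ (s1 * u) v^T

A = np.random.rand(6, 4)
u, v = rank_one_nmf(A)
s = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(A - np.outer(u, v)), np.sqrt((s[1:] ** 2).sum()))  # equal values
```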

For ranks other than 1 and $\min(m, n)$, there are no longer trivial stationary points. In the next section, we try to derive some simple characteristics of the local minima of the nonnegative matrix factorization.


    2.2.2 Characteristics of local minima

The KKT conditions (2.6) help to characterize the stationary points of the NMF problem. Summing up all the elements of one of the conditions (2.6), we get:

$$0 = \sum_{ij}\left[U \circ (UV^TV - AV)\right]_{ij} = \left\langle U,\, UV^TV - AV\right\rangle = \left\langle UV^T,\, UV^T - A\right\rangle. \tag{2.9}$$

From that, we have some simple characteristics of the NMF solutions:

Theorem 2.5. Let $(U, V)$ be a stationary point of the NMF problem, then $UV^T \in \mathcal{B}\!\left(\frac{A}{2}, \frac{1}{2}\|A\|_F\right)$, the ball centered at $\frac{A}{2}$ and with radius $\frac{1}{2}\|A\|_F$.

Proof. From (2.9) it immediately follows that

$$\left\langle \frac{A}{2} - UV^T,\; \frac{A}{2} - UV^T\right\rangle = \left\langle \frac{A}{2},\; \frac{A}{2}\right\rangle,$$

which implies

$$UV^T \in \mathcal{B}\!\left(\frac{A}{2},\; \frac{1}{2}\|A\|_F\right).$$

Theorem 2.6. Let $(U, V)$ be a stationary point of the NMF problem, then

$$\frac{1}{2}\|A - UV^T\|_F^2 = \frac{1}{2}\left(\|A\|_F^2 - \|UV^T\|_F^2\right).$$

Proof. From (2.9), we have $\langle UV^T, A\rangle = \langle UV^T, UV^T\rangle$. Therefore,

$$\frac{1}{2}\left\langle A - UV^T,\; A - UV^T\right\rangle = \frac{1}{2}\left(\|A\|_F^2 - 2\langle UV^T, A\rangle + \|UV^T\|_F^2\right) = \frac{1}{2}\left(\|A\|_F^2 - \|UV^T\|_F^2\right).$$


Theorem 2.6 also suggests that at a stationary point $(U, V)$ of the NMF problem, we should have $\|A\|_F^2 \ge \|UV^T\|_F^2$. This norm inequality can also be found in [25] for less general cases where we have $\nabla_U F = 0$ and $\nabla_V F = 0$ at a stationary point. For this particular class of NMF stationary points, all the nonnegativity constraints on $U$ and $V$ are inactive. And all such stationary points are also stationary points of the unconstrained problem, characterized by Theorem 1.13.

We have seen in Theorem 1.13 that, for the unconstrained least-squares problem, the only stable stationary points are in fact global minima. Therefore, if the stationary points of the constrained problem are inside the nonnegative orthant (i.e. all constraints are inactive), we can then probably reach the global minimum of the NMF problem. This can be expected because the constraints may no longer prohibit the descent of the update.

The equality $\|A\|_F^2 = \|UV^T\|_F^2$ implied by Theorem 2.6 is only obtained when we have an exact factorization (i.e. $A = UV^T$), which will be the subject of the next section.

Let $A_r$ be the optimal rank-$r$ approximation of a nonnegative matrix $A$, which we obtain from the singular value decomposition, as indicated in Theorem 1.13. Then we can easily construct its nonnegative part $[A_r]_+$, which is obtained from $A_r$ by just setting all its negative elements equal to zero. This is in fact the closest matrix in the cone of nonnegative matrices to the matrix $A_r$, in the Frobenius norm (in that sense, it is its projection onto that cone). We now derive some bounds for the error $\|A - [A_r]_+\|_F$.

Theorem 2.7. Let $A_r$ be the best rank-$r$ approximation of a nonnegative matrix $A$, and let $[A_r]_+$ be its nonnegative part, then

$$\|A - [A_r]_+\|_F \le \|A - A_r\|_F.$$

Proof. This follows easily from the convexity of the cone of nonnegative matrices. Since both $A$ and $[A_r]_+$ are nonnegative and since $[A_r]_+$ is the closest matrix in that cone to $A_r$, we immediately obtain the inequality

$$\|A - A_r\|_F^2 \ge \|A - [A_r]_+\|_F^2 + \|A_r - [A_r]_+\|_F^2 \ge \|A - [A_r]_+\|_F^2,$$

from which the result readily follows.


If we now compare this bound with the nonnegative approximations, then we obtain the following inequalities. Let $U_*V_*^T$ be an optimal nonnegative rank-$r$ approximation of $A$ and let $UV^T$ be any stationary point of the KKT conditions for a nonnegative rank-$r$ approximation; then we have:

$$\|A - [A_r]_+\|_F^2 \le \|A - A_r\|_F^2 = \sum_{i=r+1}^{n}\sigma_i^2 \le \|A - U_*V_*^T\|_F^2 \le \|A - UV^T\|_F^2.$$
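The clipped truncated SVD $[A_r]_+$ is therefore a cheap reference point when judging the error of any computed NMF stationary point. A small NumPy sketch (illustration only) of the first inequality above:

```python
import numpy as np

def clipped_svd_approx(A, r):
    """Best rank-r approximation A_r (via SVD) and its nonnegative part [A_r]_+."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ar = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return Ar, np.maximum(Ar, 0.0)

A = np.random.rand(30, 20)
Ar, Ar_plus = clipped_svd_approx(A, r=5)
err_clipped = np.linalg.norm(A - Ar_plus, 'fro')
err_svd = np.linalg.norm(A - Ar, 'fro')
print(err_clipped <= err_svd)    # True, as stated by Theorem 2.7
```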

2.3 Exact factorization and nonnegative rank

In this section, we take a brief look at a stricter problem where we are interested only in the solutions for which the objective function is zero. This means that the matrix $A$ is exactly factorized by $UV^T$ (i.e. $A = UV^T$) with the same nonnegativity constraints on the factors. The smallest value of $r$, the inner rank of the factorization $UV^T$, that exactly factorizes $A$ is called the nonnegative rank of $A$, denoted by $\operatorname{rank}^+_{UV^T}(A)$. In [33], a nice treatment of the problem is carried out.

The existence of an exact factorization of inner rank $r$ is equivalent to determining $\operatorname{rank}^+_{UV^T}(A)$. For any $r > \operatorname{rank}^+_{UV^T}(A)$, we can trivially construct an exact nonnegative factorization of inner rank $r$ from a factorization $UV^T$ of minimal inner rank by adding zero columns to the factors $U$ and $V$.

For the nonnegative rank, the following results are well known and can be found in [33]. First, an upper bound and a lower bound of this number are easily computed.

Lemma 2.8. Let $A \in \mathbb{R}^{m\times n}_+$. Then $\operatorname{rank}(A) \le \operatorname{rank}^+_{UV^T}(A) \le \min(m, n)$.

Proof. Since we can not reconstruct the matrix with a factorization of lower inner rank than $\operatorname{rank}(A)$, the first inequality holds. The second comes from one of the trivial factorizations $I_mA$ and $AI_n$.

In certain cases, the first inequality holds with equality. For a rank-one nonnegative matrix $A$, we know that it can be represented by $uv^T$, where $u$ and $v$ are nonnegative. This implies that $\operatorname{rank}^+_{UV^T}(A) = \operatorname{rank}(A) = 1$. It is still true for a rank-two matrix, which is proved in [33] and [3].


Lemma 2.9. Let $A \in \mathbb{R}^{m\times n}_+$ where $\operatorname{rank}(A) = 2$. Then $\operatorname{rank}^+_{UV^T}(A) = 2$.

Proof. Since $A \ge 0$, the cone spanned by the columns of $A$ is a convex polyhedral cone contained in the nonnegative orthant. Moreover, $\operatorname{rank}(A) = 2$ implies that the cone is contained in a two-dimensional linear subspace. Therefore its spanning set, i.e. the columns of $A$, can be reduced to only two vectors called $u_1$ and $u_2$. Every column of $A$ is then represented by

$$A_{:i} = V_{1i}u_1 + V_{2i}u_2, \quad\text{with } V_{1i}, V_{2i} \ge 0.$$

Creating $U = (u_1\; u_2)$ and $V = \{V_{ij}\}$ gives the desired rank-two nonnegative factorization $UV^T$.

The two spanning vectors $u_1$ and $u_2$ in the proof of the preceding lemma are indeed a pair of columns of $A$ that has the largest angle between them, i.e.

$$(u_1, u_2) = \arg\min_{A_{:i},\, A_{:j}} \frac{A_{:i}^TA_{:j}}{\|A_{:i}\|\,\|A_{:j}\|}.$$

$V$ is computed by solving a least squares problem, which yields

$$V = A^TU(U^TU)^{-1} \ge 0.$$
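A small NumPy sketch of this rank-two construction (illustration only; it assumes $\operatorname{rank}(A) = 2$ and that $A$ has no zero columns):

```python
import numpy as np

def rank_two_nmf(A):
    """Exact nonnegative factorization A = U V^T of a nonnegative rank-2 matrix A."""
    An = A / np.linalg.norm(A, axis=0)              # normalize the columns
    C = An.T @ An                                   # pairwise cosines
    i, j = np.unravel_index(np.argmin(C), C.shape)  # pair with the largest angle
    U = A[:, [i, j]]                                # spanning columns u1, u2
    V = A.T @ U @ np.linalg.inv(U.T @ U)            # least-squares coefficients
    return U, V

B = np.random.rand(6, 2) @ np.random.rand(2, 8)     # a nonnegative rank-2 matrix
U, V = rank_two_nmf(B)
print(np.allclose(B, U @ V.T), V.min() >= -1e-10)   # exact and (numerically) nonnegative
```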

So far, we have seen that when $\operatorname{rank}(A)$ is 1, 2 or $\min(m, n)$, we can construct an exact nonnegative matrix factorization with the same rank. For matrices with other ranks, determining the nonnegative rank is very difficult. Indeed, Vavasis in [121] has recently proved the NP-hardness of the nonnegative matrix factorization. Therefore, all algorithms for solving the exact problem are expected to have a non-polynomial complexity. In [116], a method is proposed to create nonnegative matrix factorizations via extremal polyhedral cones. Another possibility is using the quantifier elimination algorithms [113] to check for the feasibility of factoring a nonnegative matrix $A$ by a nonnegative factorization of inner rank less than $r$. All these algorithms are finite. One such algorithm is given by Renegar [101] and was used in the nonnegative rank problem


in [33]. Recently, the same method has been applied to the completely positive rank [9] (cf. Chapter 7). This method is quite generic and can be applied to other factorizations in this thesis. Here, we describe briefly how to derive the computational complexity bound for a nonnegative factorization.

Consider a first-order formula over the reals having the form

$$(Q_1 x^{[1]} \in \mathbb{R}^{n_1})\ldots(Q_\omega x^{[\omega]} \in \mathbb{R}^{n_\omega})\; P(y, x^{[1]}, \ldots, x^{[\omega]}), \tag{2.10}$$

where the quantifiers $Q_k \in \{\exists, \forall\}$, the vector $y$ contains $n_0$ free (unquantified) variables and $P$ is a boolean function constructed from $M$ atomic true-false expressions

$$g_i(y, x^{[1]}, \ldots, x^{[\omega]})\ \Delta_i\ 0, \qquad i = 1, \ldots, M,$$

with the $g_i$ polynomials of degree less than $d$ and comparison operators $\Delta_i \in \{<, \le, =, \ne, \ge, >\}$. Then the Renegar algorithm requires at most $(Md)^{2^{O(\omega)}\prod_k n_k}$ multiplications and additions and at most $(Md)^{O(\sum_k n_k)}$ evaluations of $P$.

These complexity bounds are derived from the known constants $\omega$, $n_0$, $n_1$, $M$ and $d$, which can be easily computed for the standard nonnegative matrix factorization in the following lemma.

Lemma 2.10 ([101]). The Renegar algorithm requires at most $(6mn)^{2^{O(1)}m^2n^2}$ multiplications and additions and at most $(6mn)^{O(mn)}$ evaluations of $P$ to determine the feasibility of factorizing $A$ by a nonnegative factorization $UV^T$ of inner rank $r$.

Proof. We need to eliminate the quantifiers of the following formula:

$$\exists\, (U^T, V^T) \in \mathbb{R}^{r(m+n)}: \quad P\big(A, (U^T, V^T)\big),$$

where

$$P\big(A, (U^T, V^T)\big) = \bigwedge_{ij}\Big(\sum_k U_{ik}V_{jk} = A_{ij}\Big) \wedge \bigwedge_{ik}\big(U_{ik} \ge 0\big) \wedge \bigwedge_{jk}\big(V_{jk} \ge 0\big).$$

This configuration gives: $\omega = 1$, $n_0 = mn$, $n_1 = r(m+n) \le 2mn$, $M = mn + r(m+n) \le 3mn$ and $d = 2$. And the bounds follow.


The result implies that the problem of determining the nonnegative rank can be solved in finite time by looping through $r = 1, 2, \ldots, \min(m, n)$, since the upper bound of the nonnegative rank is $\min(m, n)$.

The above lemma can be easily extended to other nonnegative matrix factorizations such as: the multilayer nonnegative matrix factorization and the nonnegative tensor factorization in the next section, and the symmetric and semi-symmetric matrix factorizations in Chapter 7. For each problem, $\omega = 1$, $n_0$ is equal to the number of elements of the target matrix (or tensor), $n_1$ is the total number of elements of all the factors, $M = n_0 + n_1$ and $d$ is equal to the number of factors. Simple counting then yields upper complexity bounds for the Renegar algorithm for each feasibility problem.

Similar to the nonnegative rank, the completely positive rank $\operatorname{rank}_{UU^T}(A)$ and the semi-symmetric nonnegative rank $\operatorname{rank}_{USU^T}(A)$ (cf. Chapter 7) can also be computed in finite time using the Renegar algorithm. This is due to the existence of an upper bound on these ranks.

    2.4 Extensions of nonnegative matrix factorization

The essence of the nonnegative matrix factorization is to represent nonnegative data by a nonnegative combination of nonnegative basis vectors, usually called parts. To enlarge the representing capability of the method, improvements are made on how these bases are combined and on the structure of the bases. More specifically, in the standard nonnegative matrix factorization, each data vector $A_{:j} \in \mathbb{R}^n_+$ is represented by

$$A_{:j} = \sum_i V_{ji}U_{:i}, \quad\text{with } V_{ji} \in \mathbb{R}_+ \text{ and } U_{:i} \in \mathbb{R}^n_+.$$

We can list here two constructions of the $U_{:i}$'s that may improve the performance of the nonnegative matrix factorization.

    2.4.1 Multilayer nonnegative matrix factorization

We can assume that each $U_{:i}$ is approximated based on another set of bases $X_{:t}$. Again, each $U_{:i}$ is constructed by a nonnegative combination of the $X_{:t}$'s. So we can write

$$U \approx XX_1,$$


where $X$ and $X_1$ are nonnegative. With the same reasoning, one can assume that the $X_{:t}$'s are not the primitives and are constructed from another set of bases $[X_1]_{:t}$'s, and so on. This gives the formulation of the multilayer nonnegative matrix factorization:

Problem 2.11 (Multilayer nonnegative matrix factorization).

$$\min_{X_i \ge 0,\; V \ge 0} \frac{1}{2}\left\|A - X_0X_1\ldots X_kV^T\right\|_F^2,$$

where $A$ is the nonnegative data matrix and $V$ and the $X_i$'s are nonnegative matrices of compatible sizes.

This problem was studied in a number of works, e.g. [36], [29]. Another related problem is the Archetypal Analysis [35], where the above problem is restricted to only three layers, wherein the first layer consists of the data themselves. Each data column is approximated by a convex combination of a set of archetypes that are, in turn, convex combinations of the data columns. The problem to be solved is the following:

$$\min_{\substack{X \ge 0,\; V \ge 0\\ X^T\mathbf{1} = \mathbf{1},\; V\mathbf{1} = \mathbf{1}}} \frac{1}{2}\left\|A - (AX)V^T\right\|_F^2,$$

where each column of $AX$ is an archetype.

For the multilayer nonnegative matrix factorization, one can use the algorithms proposed in [36], [29] or the algorithm proposed in Section 4.6.1 to construct an approximation.
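For concreteness, a short NumPy sketch (illustration only; the factor sizes are arbitrary choices) of how a multilayer approximation $X_0X_1\ldots X_kV^T$ is assembled and its residual evaluated:

```python
import numpy as np
from functools import reduce

def multilayer_residual(A, Xs, V):
    """Frobenius residual ||A - X0 X1 ... Xk V^T||_F of a multilayer factorization."""
    approx = reduce(np.matmul, Xs) @ V.T    # chain the layers, then apply V^T
    return np.linalg.norm(A - approx, 'fro')

A = np.random.rand(20, 15)
Xs = [np.random.rand(20, 8), np.random.rand(8, 4)]   # two nonnegative layers X0, X1
V = np.random.rand(15, 4)
print(multilayer_residual(A, Xs, V))
```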

    2.4.2 Nonnegative Tensor Factorization

Data are, by nature, not restricted to nonnegative vectors, i.e. one-dimensional data. Data points can also live in higher dimensions, for example as $m \times n$ nonnegative matrices. And the additive model can be adapted to handle such data:

$$A_j \approx \sum_i V_{ji}U_i, \quad\text{where } V_{ji} \ge 0 \text{ and } A_j, U_i \in \mathbb{R}^{m\times n}_+.$$


If we further restrict the $U_i$'s to be nonnegative combinations of some rank-one nonnegative matrices represented by $x_jy_j^T$, where $x_j \in \mathbb{R}^m_+$ and $y_j \in \mathbb{R}^n_+$, then the problem of finding the $x_j$'s, $y_j$'s and $V_{ij}$'s from the $A_j$'s is an example of the following Nonnegative Tensor Factorization problem:

Problem 2.12 (Nonnegative Tensor Factorization).

$$\min_{u_{ij} \in \mathbb{R}^{n_i}_+} \frac{1}{2}\left\|A - \sum_{j=1}^{r} u_{1j} \circ u_{2j} \circ \ldots \circ u_{dj}\right\|_F^2,$$

where $A \in \mathbb{R}^{n_1\times n_2\times\ldots\times n_d}_+$ and $a \circ b$ stands for the outer product between two vectors or tensors $a$ and $b$.

An algorithm will be presented in Section 4.6.2. Other methods can be found in [124], [104] and [32].
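As a concrete illustration (a NumPy sketch, not from the thesis), for a third-order tensor ($d = 3$) the approximation is a sum of $r$ outer products whose residual can be evaluated as follows:

```python
import numpy as np

def cp_residual(A, U1, U2, U3):
    """||A - sum_j U1[:, j] o U2[:, j] o U3[:, j]||_F for a 3-way tensor A."""
    approx = np.einsum('ij,kj,lj->ikl', U1, U2, U3)   # sum of r rank-one tensors
    return np.linalg.norm(A - approx)

n1, n2, n3, r = 5, 4, 3, 2
U1, U2, U3 = (np.random.rand(n, r) for n in (n1, n2, n3))
A = np.einsum('ij,kj,lj->ikl', U1, U2, U3)            # an exactly factorizable tensor
print(cp_residual(A, U1, U2, U3))                     # ~ 0
```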


3 EXISTING ALGORITHMS

In this chapter, we briefly describe a number of existing algorithms for the nonnegative matrix factorization problem and related issues such as: algorithm initializations, stopping conditions and convergence.

We choose typical algorithms in three main categories: the multiplicative updates, the alternating least squares methods and the gradient based methods. This list is established based on the popularity of the algorithms in practice. The earliest algorithm is the alternating least squares method proposed by Paatero [97] for the positive matrix factorization. But the attention for this part-based analysis technique really took off after the introduction of the multiplicative updates of Lee and Seung [80]. The problem was then rebaptized as nonnegative matrix factorization. The simplicity of the multiplicative updates and the interpretability of the result helped to spread the influence of the nonnegative matrix factorization to almost all research fields: image processing [59] [53] [83], text processing [128] [103], music transcription [108], video analysis [34], bioinformatics [46], chemistry [45], etc. It was solved using the standard projected gradient method only in [86], where some advantages in large-scale problems are reported. Recently, a revised version of the alternating least squares has been proposed in [12], offering a faster implementation by sacrificing the convergence property. Other attempts try to make a change of variables to eliminate the nonnegativity constraints. For example, in [26], the change of variable $u = x^2$ is used and two gradient algorithms are proposed. But they are, reportedly, not very efficient. Here we also analyze a recently proposed method: the rank-one residue iteration algorithm, which will be investigated in detail in Chapter 4. Its fast convergence


without hidden parameters makes this method a good choice for current and future applications. Its variants, derived throughout the last four chapters of this thesis, demonstrate its flexibility when additional constraints are imposed.

We classify algorithms into two categories according to the search space: full-space search and (block-)coordinate search. Algorithms like standard gradient methods can belong to both categories.

Algorithms in the former category try to find updates for both $U$ and $V$ at the same time. This requires a search for a descent direction in the $(m+n)r$-dimensional space. Note also that the nonnegative matrix factorization problem in this full space is not convex, but the convergence of algorithms using the full-space search might be easier to prove.

Algorithms in the latter category, on the other hand, find updates for each (block) coordinate in order to guarantee the descent of the objective function. Usually, the search subspaces are chosen to make the objective function convex so that efficient methods can be applied. Such a simplification might lead to the loss of some convergence properties.

Most of the algorithms use the following column partitioning:

$$\frac{1}{2}\|A - UV^T\|_F^2 = \frac{1}{2}\sum_{i=1}^{n}\|A_{:,i} - U(V_{i,:})^T\|_2^2, \tag{3.1}$$

which shows that one can minimize with respect to each of the rows of $V$ independently. The problem thus decouples into smaller convex problems. This leads to the solution of quadratic problems of the form

$$\min_{v \ge 0} \frac{1}{2}\|a - Uv\|_2^2, \tag{3.2}$$

which is called Nonnegative Least Squares (NNLS).

Updates for the rows of $V$ are then alternated with updates for the rows of $U$ in a similar manner by transposing $A$ and $UV^T$.
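A minimal sketch of this column-wise decoupling (illustration only, using SciPy's nnls solver for the subproblems (3.2)):

```python
import numpy as np
from scipy.optimize import nnls

def update_V(A, U):
    """Solve min_{V >= 0} 1/2 ||A - U V^T||_F^2 one column of A (row of V) at a time."""
    n, r = A.shape[1], U.shape[1]
    V = np.zeros((n, r))
    for i in range(n):
        V[i, :], _ = nnls(U, A[:, i])        # NNLS subproblem (3.2) for column A[:, i]
    return V

A = np.random.rand(12, 8)
U = np.random.rand(12, 3)
V = update_V(A, U)                            # rows of V solve the decoupled problems
# the update of U is obtained the same way from the transposed problem: update_V(A.T, V)
print(np.linalg.norm(A - U @ V.T, 'fro'))
```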

We begin the chapter with the description of the three categories of algorithms. More emphasis is put on the multiplicative rules, since they are very popular but still lack a good convergence property. We will try to explain why this method may fail to converge to a local minimum. We end the chapter with two short discussions of the stopping conditions and the initialization methods.


3.1 Lee and Seung algorithm

Algorithm 2 Multiplicative Rules (Mult)

1: Initialize $U^0$, $V^0$ and $k = 0$
2: repeat
3:   $U^{k+1} = U^k \circ \dfrac{[AV^k]}{[U^k(V^k)^TV^k]}$
4:   $V^{k+1} = V^k \circ \dfrac{[A^TU^{k+1}]}{[V^k(U^{k+1})^TU^{k+1}]}$
5:   $k = k + 1$
6: until Stopping condition
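A compact NumPy sketch of these multiplicative updates (illustration only; a small constant is added to the denominators to avoid the divisions by zero discussed further below):

```python
import numpy as np

def nmf_multiplicative(A, r, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates of Algorithm 2 for min 1/2 ||A - U V^T||_F^2, U, V >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U, V = rng.random((m, r)), rng.random((n, r))
    for _ in range(n_iter):
        U *= (A @ V) / (U @ (V.T @ V) + eps)      # elementwise update of U
        V *= (A.T @ U) / (V @ (U.T @ U) + eps)    # elementwise update of V
    return U, V

A = np.random.rand(30, 20)
U, V = nmf_multiplicative(A, r=4)
print(np.linalg.norm(A - U @ V.T, 'fro') / np.linalg.norm(A, 'fro'))
```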

Theorem 3.1. The Euclidean distance $\|A - UV^T\|_F^2$ is non-increasing under the updating rules of Algorithm 2.

Theorem 3.1 is a shortened version of the one in the original paper of Lee and Seung [80]. The original theorem has an additional part claiming that the Euclidean distance is unchanged under the multiplicative rules only when it is at a stationary point. This is, in fact, not necessarily true, since if it converges, only the conditions (2.4) and (2.6) are satisfied at the fixed point. No proof is provided to show that the conditions (2.5) can be met. There are two main obstacles in the investigation of the convergence of these multiplicative rules.

The first one is that these multiplicative rules fail to make a sufficient descent of the cost function. To see this, we can rewrite (3.4) as a variable metric method [13]:

$$\begin{aligned}
\bar v &= v \circ \frac{[U^Ta]}{[U^TUv]} = v \circ \frac{[U^Ta + U^TUv - U^TUv]}{[U^TUv]}\\
&= v \circ \left(1 + \frac{[U^Ta - U^TUv]}{[U^TUv]}\right)\\
&= v - \frac{[v]}{[U^TUv]} \circ (U^TUv - U^Ta) = v - D_v\nabla_vF, \qquad (3.5)
\end{aligned}$$

where $D_v$ is a positive diagonal matrix with $D_{ii} = \frac{[v]_i}{[U^TUv]_i}$. With this


update, a necessary condition for a sufficient descent is that the eigenvalues of the matrix $D_v$ (i.e. the $D_{ii}$) must be bounded above and away from zero [13]. But this is not true for $D_v$ in general. Hence, the limit points of the algorithm may not be stationary.

The second obstacle is the possibility of zeros in $U$ and $V$ when considering the conditions (2.5). The situation that we want to avoid is a zero entry $V_{ij}$ whose partial derivative $\nabla_{V_{ij}}F$ is negative, since the multiplicative update then leaves it at zero. Possible remedies are replacing such an entry by a small positive constant $\epsilon > \epsilon_M$, as suggested in [85], or making a gradient descent step to the inside of the nonnegative orthant.

Remark: it is possible that $[V^k(U^k)^TU^k]_{ij} = 0$ for some $(i, j)$, which results in zero-division exceptions. We investigate the two following possible situations:


When $V^k_{ij} > 0$, $[V^k(U^k)^TU^k]_{ij} = 0$ implies that

$$0 = \sum_{l,t} V^k_{it}U^k_{lt}U^k_{lj} \ge V^k_{ij}\sum_l U^k_{lj}U^k_{lj}.$$

This occurs only when $U^k_{:j} = 0$, which is due to a rank-deficient approximation and can be fixed by generating a substitution for $U^k_{:j}$.

When $V^k_{ij} = 0$, we have a $0/0$ situation where

$$\nabla_{V_{ij}}F = -[A^TU^k]_{ij} \le 0.$$

Then, we should not replace $[V^k(U^k)^TU^k]_{ij}$ by $\epsilon$ (a small positive constant with $\epsilon > \epsilon_M$) as suggested by many works, because this will keep $V^{k+1}_{ij} = 0$, which is unfavorable for the multiplicative updates. Setting $V^{k+1}_{ij} = \epsilon > \epsilon_M$ is definitely a better choice.

The multiplicative rules are also extended to the weighted nonnegative matrix factorization (see Chapter 6), to the generalized Kullback-Leibler divergence (see Chapter 5) and to a broader class of cost functions, namely the Bregman divergences [36]. Many other extensions can be found in [59], [119], [83], [122], [68], etc.

    3.2 Alternating least squares methods

The first algorithm proposed for solving the nonnegative matrix factorization was the alternating least squares method [97]. It is known that, fixing either $U$ or $V$, the problem becomes a least squares problem with nonnegativity constraints.

Since the least squares problems in Algorithm 3 can be perfectly decoupled into smaller problems corresponding to the columns or rows of $A$, we can directly apply methods for the Nonnegative Least Squares problem to each of the small problems. Methods that can be applied are [79], [44], [24], etc.

A direct application of Theorem 1.11 can show that if the subproblems (3) and (4) in Algorithm 3 are exactly and uniquely solved, every limit


Algorithm 3 Alternating Least Square (ALS)

1: Initialize $U$ and $V$
2: repeat
3:   Solve: $\min_{V \ge 0} \frac{1}{2}\|A - UV^T\|_F^2$
4:   Solve: $\min_{U \ge 0} \frac{1}{2}\|A^T - VU^T\|_F^2$
5: until Stopping condition

Algorithm 4 Inexact Alternating Least Square (IALS)

1: Initialize $U$ and $V$
2: repeat
3:   Solve for $U$ in the equation: $UV^TV = AV$
4:   $U = [U]_+$
5:   Solve for $V$ in the equation: $VU^TU = A^TU$
6:   $V = [V]_+$
7: until Stopping condition

point of Algorithm 3 is a stationary point of the nonnegative matrix factorization problem.

But even with faster implementations of these algorithms, they can not match other methods in terms of running time. A modification has been made by replacing the exact solution of the nonnegative least squares problem by the projection of the solution of the unconstrained least squares problem onto the nonnegative orthant [12], as in Algorithm 4. This speeds up the algorithm by sacrificing the convergence property. Figure 3.2 is a typical example of the convergence of the Alternating Least Squares and the Inexact Alternating Least Squares. One can see that while the former always makes a descent update, the latter does not. The exact method also produces better approximation errors. But with the same number of iterations, it spends significantly more time than the inexact version does (3.435s vs. 0.02s). Note that the solver for the nonnegative least squares problem in this example is the standard Matlab function lsqnonneg. For a faster solver such as [24], it is reported that the exact method is still far behind in terms of running time. In practice, the exact Alternating Least Squares is seldom used because it is very inefficient. And its inexact version does not, in general, converge


Figure 3.2: Alternating Least Square (ALS) vs. Inexact Alternating Least Square (IALS)

to a stationary point. It is suggested to use the inexact version as a warming-up phase of a hybrid algorithm [48].

Two other versions, namely the Alternating Constrained Least Squares and the Alternating Hoyer-Constrained Least Squares, are also given in [12].
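Going back to Algorithm 4, a minimal NumPy sketch of the inexact scheme (illustration only; the tiny ridge term added before solving the normal equations is a numerical safeguard, not part of the algorithm as stated):

```python
import numpy as np

def nmf_inexact_als(A, r, n_iter=100, ridge=1e-12, seed=0):
    """Inexact ALS: solve the unconstrained normal equations, then clip at zero."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U, V = rng.random((m, r)), rng.random((n, r))
    I = np.eye(r)
    for _ in range(n_iter):
        # U (V^T V) = A V  =>  U = A V (V^T V)^{-1}, then project onto the orthant
        U = np.maximum(np.linalg.solve(V.T @ V + ridge * I, (A @ V).T).T, 0.0)
        # V (U^T U) = A^T U  =>  V = A^T U (U^T U)^{-1}, then project onto the orthant
        V = np.maximum(np.linalg.solve(U.T @ U + ridge * I, (A.T @ U).T).T, 0.0)
    return U, V

A = np.random.rand(30, 20)
U, V = nmf_inexact_als(A, r=4)
print(np.linalg.norm(A - U @ V.T, 'fro') / np.linalg.norm(A, 'fro'))
```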

    3.3 Gradient descent

We can consider the nonnegative matrix factorization as a nonlinear optimization problem on a convex set, which is the nonnegative orthant. We also know that the projection on this set is very simple and consists of setting any negative element to zero. In this case, the Projected Gradient scheme is often used and is characterized by the following three basic steps in each iteration:

- Calculating the gradient $\nabla F(x^k)$,
- Choosing a step size $\alpha_k$,
- Projecting the update onto the nonnegative orthant $\mathbb{R}^n_+$:

$$x^{k+1} = \left[x^k - \alpha_k \nabla F(x^k)\right]_+,$$

where $x^k$ is the variable. The last two steps can be merged into one iterative process and must guarantee a sufficient decrease of the objective


function as well as the nonnegativity of the new point. This generates an inner loop inside each iteration.

We will present two simple ways to carry out this idea in the nonnegative matrix factorization. Both methods use the negative gradient as the basic search direction. Only the step sizes are different.

In Section 3.3.3, some implementation issues will be pointed out, especially for the case where one chooses to use the gradient method in alternating iterations, i.e. minimizing with respect to $U$ and to $V$ in an alternating fashion.
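A minimal NumPy sketch of one such projected gradient update for $V$ with $U$ fixed (illustration only; the step size is chosen by a simple Armijo-type backtracking, one of several possible rules):

```python
import numpy as np

def projected_gradient_step_V(A, U, V, sigma=0.01, beta=0.5, alpha0=1.0):
    """One projected gradient step on F(V) = 1/2 ||A - U V^T||_F^2, with backtracking."""
    G = V @ (U.T @ U) - A.T @ U                     # gradient of F with respect to V
    F0 = 0.5 * np.linalg.norm(A - U @ V.T, 'fro') ** 2
    alpha = alpha0
    while True:
        Vn = np.maximum(V - alpha * G, 0.0)         # project onto the nonnegative orthant
        Fn = 0.5 * np.linalg.norm(A - U @ Vn.T, 'fro') ** 2
        if Fn - F0 <= sigma * np.sum(G * (Vn - V)) or alpha < 1e-12:
            return Vn                               # sufficient decrease reached
        alpha *= beta                               # shrink the step and retry

A = np.random.rand(20, 15)
U = np.random.rand(20, 4)
V = np.random.rand(15, 4)
V1 = projected_gradient_step_V(A, U, V)
print(np.linalg.norm(A - U @ V1.T, 'fro') < np.linalg.norm(A - U @ V.T, 'fro'))  # True
```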
