
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Fully Scalable Methods for Distributed Tensor Factorization

Kijung Shin, Lee Sael, and U Kang

Abstract—Given a high-order large-scale tensor, how can we decompose it into latent factors? Can we process it on commodity computers with limited memory? These questions are closely related to recommender systems, which have modeled rating data not as a matrix but as a tensor to utilize contextual information such as time and location. This increase in the order requires tensor factorization methods scalable with both the order and size of a tensor. In this paper, we propose two distributed tensor factorization methods, CDTF and SALS. Both methods are scalable with all aspects of data and show a trade-off between convergence speed and memory requirements. CDTF, based on coordinate descent, updates one parameter at a time, while SALS generalizes on the number of parameters updated at a time. In our experiments, only our methods factorized a 5-order tensor with 1 billion observable entries, 10M mode length, and 1K rank, while all other state-of-the-art methods failed. Moreover, our methods required several orders of magnitude less memory than their competitors. We implemented our methods on MAPREDUCE with two widely-applicable optimization techniques: local disk caching and greedy row assignment. They speeded up our methods up to 98.2× and also the competitors up to 5.9×.

Index Terms—Tensor Factorization, Tensor Completion, Distributed Computing, MapReduce, Hadoop


1 INTRODUCTION

RECOMMENDATION problems can be viewed as completing a partially observable user-item matrix whose entries are ratings. Matrix factorization (MF), which decomposes the input matrix into a user factor matrix and an item factor matrix such that their product approximates the input matrix, is one of the most widely-used methods for matrix completion [1], [2], [3]. To handle web-scale data, efforts were made to find distributed methods for MF [3], [4], [5].

On the other hand, there have been attempts to improve the accuracy of recommendation by using additional contextual information such as time and location. A straightforward way to utilize such extra factors is to model rating data as a partially observable tensor where additional modes correspond to the extra factors. Similar to matrix completion, tensor factorization (TF), which decomposes the input tensor into multiple factor matrices and a core tensor, has been used for tensor completion [6], [7], [8], [9].

• Kijung Shin is with the Computer Science Department, Carnegie Mellon University, USA. E-mail: [email protected]

• Lee Sael is with the Department of Computer Science, State University of New York Korea, Republic of Korea. E-mail: [email protected]

• U Kang is with the Department of Computer Science and Engineering, Seoul National University, Republic of Korea. E-mail: [email protected] (corresponding author)

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT and Future Planning (grants No. 2013R1A1A1064409 and NRF-2015R1C1A2A01055739), and by KEIT Korea under the "Global Advanced Technology Center" (10053204). The Institute of Engineering Research and the ICT at Seoul National University provided research facilities for this work.

TABLE 1: Summary of scalability results. The factors which each method is scalable with are checked. CDTF and SALS are the only methods scalable with all the factors.

                CDTF, SALS    ALS         PSGD    FLEXIFACT
                (Proposed)    [14], [3]   [10]    [11]
Order               X            X          X
Observations        X            X          X         X
Mode Length         X                                 X
Rank                X                                 X
Machines            X            X

As more extra information becomes available, a necessity for TF algorithms scalable with the order as well as the number of entries in a tensor has arisen. A promising way to find such algorithms is to extend distributed MF algorithms to higher orders. However, the extensions of existing methods [3], [10], [11] have limited scalability, as we explain in Section 2.3.

In this paper, we propose Coordinate Descent for Tensor Factorization (CDTF) and Subset Alternating Least Square (SALS), distributed tensor factorization methods scalable with all aspects of data. CDTF applies coordinate descent, which updates one parameter at a time, to TF. SALS, which includes CDTF as a special case, generalizes on the number of parameters updated at a time. These two methods have distinct advantages: CDTF is more memory efficient and applicable to more loss functions, while SALS converges faster to a better solution.

Our methods can be used in any application handling large-scale partially observable tensors, including social network analysis [12] and Web search [13].

The main contributions of our study are as follows:


Fig. 1: Scalability comparison of tensor factorization methods on the Hadoop cluster ((a) running time, (b) memory requirements). o.o.m.: out of memory; o.o.t.: out of time (takes more than a week). Only our proposed CDTF and SALS scaled up to the largest scale S4 (details are in Table 4), and they showed a trade-off: CDTF was more memory-efficient, and SALS ran faster. The scalability of ALS, PSGD, and FLEXIFACT with five reducers was limited due to their high memory requirements. In particular, ALS and PSGD required 186GB for S4, which is 493× the 387MB that CDTF required. The scalability of FLEXIFACT with 40 reducers was limited because of its rapidly increasing communication cost.

• Algorithm. We propose CDTF and SALS, scalable tensor factorization algorithms. Their distributed versions are the only methods scalable with all the following factors: the order and size of data, the number of parameters, and the number of machines (see Table 1).

• Analysis. We analyze our methods and their competitors in the general N-order setting in the following aspects: computational complexity, communication complexity, memory requirements, and convergence speed (see Table 3).

• Optimization. We implement our methods on MAPREDUCE with two widely-applicable optimization techniques: local disk caching and greedy row assignment. They speeded up not only our methods (up to 98.2×) but also the competitors (up to 5.9×) (see Figure 10).

• Experiment. We empirically confirm the superior scalability of our methods and that they require several orders of magnitude less memory than the competitors. Only our methods successfully analyzed a 5-order tensor with 1 billion observable entries, 10M mode length, and 1K rank, while all others failed (see Figure 1).

Our open-sourced code and the data we used are available at http://www.cs.cmu.edu/∼kijungs/codes/cdtf. In Section 2, we present preliminaries for tensor factorization. In Section 3, we describe our proposed CDTF and SALS methods. In Section 4, we present the optimization techniques used in our implementations on MAPREDUCE. In Section 5, we provide experimental results. After reviewing related work in Section 6, we conclude in Section 7.

2 NOTATIONS AND PRELIMINARIES

In this section, we describe notations (summarized in Table 2) and the preliminaries of tensor factorization and its distributed algorithms.

2.1 Tensor and the Notations

Tensors are multi-dimensional arrays that generalize vectors (1-order tensors) and matrices (2-order tensors) to higher orders.

TABLE 2: Table of symbols.

Symbol             Definition
X                  input tensor (∈ R^{I_1×I_2×...×I_N})
x_{i_1...i_N}      (i_1, ..., i_N)th entry of X
N                  order of X
I_n                length of the nth mode of X
A^(n)              nth factor matrix (∈ R^{I_n×K})
a^(n)_{i_n k}      (i_n, k)th entry of A^(n)
K                  rank of X
Ω                  set of indices of observable entries of X
Ω^(n)_{i_n}        subset of Ω whose nth mode's index is i_n
mS_n               set of rows of A^(n) assigned to machine m
R                  residual tensor (∈ R^{I_1×I_2×...×I_N})
r_{i_1...i_N}      (i_1, ..., i_N)th entry of R
M                  number of machines (reducers on MAPREDUCE)
T_out              number of outer iterations
T_in               number of inner iterations
λ (= λ_A)          regularization parameter for factor matrices
λ_b                regularization parameter for bias terms
C                  number of parameters updated at a time
η_0                initial learning rate

Like rows and columns in a matrix, an N-order tensor has N modes, whose lengths are denoted by I_1 through I_N, respectively. We denote tensors with variable order N by boldface Euler script letters (e.g., X). Matrices and vectors are denoted by boldface capitals (e.g., A) and boldface lowercases (e.g., a), respectively. We denote the entry of a tensor by the symbolic name of the tensor with its indices in subscript. For example, the (i_1, i_2)th entry of A is denoted by a_{i_1 i_2}, and the (i_1, ..., i_N)th entry of X is denoted by x_{i_1...i_N}. The i_1th row of A is denoted by a_{i_1 *}, and the i_2th column of A is denoted by a_{* i_2}.

2.2 Tensor Factorization

Our definition of tensor factorization is based on PARAFAC decomposition [15], the most popular decomposition method. We use L2 regularization, whose weighted form has been used in many recommender systems [1], [2], [3]. Other decomposition and regularization methods are also considered in Section 3.5.

Definition 1 (Partially Observable Tensor Factorization): Given an N-order tensor X (∈ R^{I_1×I_2×...×I_N}) with observable entries {x_{i_1...i_N} | (i_1, ..., i_N) ∈ Ω}, the rank-K factorization of X is to find factor matrices {A^(n) ∈ R^{I_n×K} | 1 ≤ n ≤ N} that minimize (1).


TABLE 3: Summary of distributed tensor factorization algorithms for partially observable tensors. The performance bottlenecks that prevent each algorithm from handling web-scale data are marked by asterisks (*). Only our proposed SALS and CDTF methods have no bottleneck. Communication complexity is measured by the number of parameters that each machine exchanges with the others. For simplicity, we assume that the workload of each algorithm is equally distributed across machines, that the length of every mode is equal to I, and that T_in of SALS and CDTF is set to one.

Algorithm         Computational complexity (per iteration)     Communication complexity (per iteration)   Memory requirements   Convergence speed
CDTF              O(|Ω|N^2 K/M)                                O(NIK)                                      O(NI)                 Fast
SALS              O(|Ω|NK(N+C)/M + NIKC^2/M)                   O(NIK)                                      O(NIC)                Fastest
ALS [14], [3]     O(|Ω|NK(N+K)/M + NIK^3/M)*                   O(NIK)                                      O(NIK)*               Fastest
PSGD [10]         O(|Ω|NK/M)                                   O(NIK)                                      O(NIK)*               Slow*
FLEXIFACT [11]    O(|Ω|NK/M)                                   O(M^{N-2}NIK)*                              O(NIK/M)              Fast

L(A^{(1)}, ..., A^{(N)}) = \sum_{(i_1,...,i_N) \in \Omega} \Big( x_{i_1...i_N} - \sum_{k=1}^{K} \prod_{n=1}^{N} a^{(n)}_{i_n k} \Big)^2 + \lambda \sum_{n=1}^{N} \|A^{(n)}\|_F^2    (1)

The loss function (1) depends only on the observable entries. Each factor matrix A^(n) corresponds to the latent feature vectors of the objects that the nth mode of X represents, and \sum_{k=1}^{K} \prod_{n=1}^{N} a^{(n)}_{i_n k} corresponds to the interaction among the features. Note that this problem is non-identifiable in that (1) may have many global minima corresponding to different factor matrices.
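To make (1) concrete, the following is a minimal NumPy sketch (ours, not the authors' implementation) that evaluates this loss for a tensor stored as a sparse list of observed entries; the names `observed` and `factors` are illustrative only.

```python
import numpy as np

def parafac_loss(observed, factors, lam):
    """Evaluate loss (1): squared error over observed entries plus L2 regularization.

    observed : list of ((i_1, ..., i_N), x_value) pairs, i.e., the entries indexed by Omega
    factors  : list of N factor matrices; factors[n] has shape (I_n, K)
    lam      : regularization parameter lambda
    """
    loss = 0.0
    for idx, x in observed:
        prod = np.ones(factors[0].shape[1])   # holds prod_n a^(n)_{i_n k} for each k
        for n, i_n in enumerate(idx):
            prod *= factors[n][i_n]
        loss += (x - prod.sum()) ** 2         # (x - sum_k prod_n a^(n)_{i_n k})^2
    loss += lam * sum((A ** 2).sum() for A in factors)   # lambda * sum_n ||A^(n)||_F^2
    return loss
```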

2.3 Distributed Methods for Tensor Factorization

In this section, we explain how widely-used distributed optimization methods are applied to partially observable tensor factorization (see Section 6 for distributed methods applicable only to fully observable tensors). Their performances are summarized in Table 3. Note that only our proposed CDTF and SALS methods, which are described in Sections 3 and 4, have no bottleneck in any aspect. The preliminary version of CDTF and SALS appeared in [16].

2.3.1 ALS: Alternating Least Square

ALS is a widely-used technique for fully observable tensor factorization [14] and was also used for partially observable tensors [17]. However, previous ALS approaches for partially observable tensors have scalability issues since they repeatedly estimate all unobservable entries, which can be far more numerous than the observed entries depending on the data. Fortunately, a recent ALS approach for partially observable matrices [3] is extensible to tensors and is parallelizable. This extended method is explained in this section.

ALS updates factor matrices one by one while keeping all other matrices fixed. When all other factor matrices are fixed, minimizing (1) is analytically solvable in terms of the updated matrix. We update each factor matrix A^(n) row by row, by the following rule, exploiting the independence between rows:

[a^{(n)}_{i_n 1}, ..., a^{(n)}_{i_n K}]^T \leftarrow \arg\min_{[a^{(n)}_{i_n 1}, ..., a^{(n)}_{i_n K}]^T} L(A^{(1)}, ..., A^{(N)}) = (B^{(n)}_{i_n} + \lambda I_K)^{-1} c^{(n)}_{i_n},    (2)

where B^(n)_{i_n} is a K by K matrix whose entries are

(B^{(n)}_{i_n})_{k_1 k_2} = \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \Big( \prod_{l \neq n} a^{(l)}_{i_l k_1} \prod_{l \neq n} a^{(l)}_{i_l k_2} \Big), \forall k_1, k_2,

c^(n)_{i_n} is a length-K vector whose entries are

(c^{(n)}_{i_n})_k = \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \Big( x_{i_1...i_N} \prod_{l \neq n} a^{(l)}_{i_l k} \Big), \forall k,

and I_K is the K by K identity matrix. Ω^(n)_{i_n} denotes the subset of Ω whose nth mode's index is i_n. This update rule is a special case of (9), which is proved in Theorem 1 in Section 3.3.1, since ALS is a special case of SALS (see Section 3.2).

Updating a row, a^(n)_{i_n *} for example, using (2) takes O(|Ω^(n)_{i_n}|K(N+K) + K^3), which consists of O(|Ω^(n)_{i_n}|NK) to calculate \prod_{l \neq n} a^{(l)}_{i_l 1} through \prod_{l \neq n} a^{(l)}_{i_l K} for all the entries in Ω^(n)_{i_n}, O(|Ω^(n)_{i_n}|K^2) to build B^(n)_{i_n}, O(|Ω^(n)_{i_n}|K) to build c^(n)_{i_n}, and O(K^3) to invert B^(n)_{i_n}. Thus, updating every row of every factor matrix once, which corresponds to a full ALS iteration, takes O(|Ω|NK(N+K) + K^3 Σ_{n=1}^{N} I_n).

In distributed environments, updating each factor matrix can be parallelized without affecting the correctness of ALS by distributing the rows of the factor matrix across machines and updating them simultaneously. The parameters updated by each machine are broadcast to all other machines. The number of parameters each machine exchanges with the others is O(KI_n) for each factor matrix A^(n) and O(K Σ_{n=1}^{N} I_n) per iteration. The memory requirements of ALS, however, cannot be distributed. Since the update rule (2) possibly depends on any entry of any fixed factor matrix, every machine is required to load all the fixed matrices into its memory. This high memory requirement of ALS, O(K Σ_{n=1}^{N} I_n) memory space per machine, has been noted as a scalability bottleneck even in matrix factorization [4], [5].
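To illustrate the row-wise rule (2), the sketch below is a simplified serial version (not the paper's distributed ALS): it accumulates B^(n)_{i_n} and c^(n)_{i_n} from the observed entries whose nth index is i_n and solves the resulting K×K system; `omega_n_in` is an assumed list of those entries.

```python
import numpy as np

def als_update_row(omega_n_in, factors, n, lam):
    """Compute the new row i_n of A^(n) by rule (2), with all other factor matrices fixed.

    omega_n_in : list of ((i_1, ..., i_N), x_value) pairs in Omega^(n)_{i_n}
    factors    : list of factor matrices; factors[l][i_l] is row i_l of A^(l)
    """
    K = factors[n].shape[1]
    B = np.zeros((K, K))
    c = np.zeros(K)
    for idx, x in omega_n_in:
        q = np.ones(K)                     # q_k = prod_{l != n} a^(l)_{i_l k}
        for l, i_l in enumerate(idx):
            if l != n:
                q *= factors[l][i_l]
        B += np.outer(q, q)                # accumulates (B^(n)_{i_n})_{k1 k2}
        c += x * q                         # accumulates (c^(n)_{i_n})_k
    return np.linalg.solve(B + lam * np.eye(K), c)
```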

2.3.2 PSGD: Parallelized Stochastic Gradient Descent

PSGD [10] is a distributed algorithm based on stochastic gradient descent (SGD). In PSGD, the observable entries of X are randomly divided among the M machines, which run SGD independently using the assigned entries. The updated parameters are averaged after each iteration. For each observable entry x_{i_1...i_N}, the parameters a^(n)_{i_n k} for all n and k, whose number is NK, are updated at once by the following rule:

a^{(n)}_{i_n k} \leftarrow a^{(n)}_{i_n k} - 2\eta \left( \frac{\lambda a^{(n)}_{i_n k}}{|\Omega^{(n)}_{i_n}|} - r_{i_1...i_N} \prod_{l \neq n} a^{(l)}_{i_l k} \right)    (3)

where r_{i_1...i_N} = x_{i_1...i_N} - \sum_{s=1}^{K} \prod_{l=1}^{N} a^{(l)}_{i_l s}. It takes O(NK) to calculate r_{i_1...i_N} and \prod_{l=1}^{N} a^{(l)}_{i_l k} for all k.


Once they are calculated, since \prod_{l \neq n} a^{(l)}_{i_l k} can be calculated as (\prod_{l=1}^{N} a^{(l)}_{i_l k}) / a^{(n)}_{i_n k}, calculating (3) takes O(1), and thus updating all the NK parameters takes O(NK). If we assume that the X entries are equally distributed across machines, the computational complexity per iteration is O(|Ω|NK/M). Averaging parameters can also be distributed, and in the process, O(K Σ_{n=1}^{N} I_n) parameters are exchanged by each machine. Like ALS, the memory requirements of PSGD cannot be distributed, i.e., all the machines are required to load all the factor matrices into their memory. Thus, O(K Σ_{n=1}^{N} I_n) memory space is required per machine. Moreover, PSGD tends to converge slowly due to the non-identifiability of (1).
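For illustration only, a single SGD step following (3) could be sketched as below (serial code, not the distributed PSGD of [10]); reusing the precomputed full product via division mirrors the O(NK) cost argument above, and `omega_sizes` is an assumed lookup for |Ω^(n)_{i_n}|.

```python
import numpy as np

def sgd_step(idx, x, factors, omega_sizes, lam, eta):
    """Apply update (3) to the NK parameters touched by the observed entry (idx, x)."""
    full = np.ones(factors[0].shape[1])
    for n, i_n in enumerate(idx):
        full *= factors[n][i_n]                  # prod_{l=1..N} a^(l)_{i_l k}
    r = x - full.sum()                           # residual r_{i_1...i_N}
    for n, i_n in enumerate(idx):
        others = full / factors[n][i_n]          # prod_{l != n} a^(l)_{i_l k}
        grad = lam * factors[n][i_n] / omega_sizes[n][i_n] - r * others
        factors[n][i_n] = factors[n][i_n] - 2.0 * eta * grad
```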

2.3.3 FlexiFaCT: Flexible Factorization of Coupled Tensors

FLEXIFACT [11] is another SGD-based algorithm that remedies the high memory requirements and slow convergence of PSGD. FLEXIFACT divides X into M^N blocks. Each set of M disjoint blocks not sharing common fibers (i.e., rows in a general nth mode) composes a stratum. FLEXIFACT processes one stratum of X at a time by distributing the M blocks composing the stratum across machines and processing them independently. The update rule is the same as (3), and the computational complexity per iteration is O(|Ω|NK/M), as in PSGD. However, contrary to PSGD, averaging is unnecessary since the set of parameters updated by each machine is disjoint from the sets updated by the others. In addition, the memory requirements of FLEXIFACT are distributed among the machines. Each machine only needs to load the parameters related to the block it processes, whose number is (K Σ_{n=1}^{N} I_n)/M, into its memory at a time. However, FLEXIFACT suffers from high communication cost. After processing one stratum, each machine sends the updated parameters to the machine which updates them using the next stratum. Each machine exchanges at most (K Σ_{n=2}^{N} I_n)/M parameters per stratum and M^{N-2} K Σ_{n=2}^{N} I_n per iteration, where M^{N-1} is the number of strata. Thus, the communication cost increases exponentially with the order of X and polynomially with the number of machines.

3 PROPOSED METHODS

In this section, we propose two scalable tensor factorization algorithms: CDTF (Section 3.1) and SALS (Section 3.2). After analyzing their theoretical properties (Section 3.3), we discuss how these methods are parallelized in distributed environments (Section 3.4) and applied to diverse loss functions (Section 3.5).

3.1 Coordinate Descent for Tensor Factorization

3.1.1 Update Rule

Algorithm 1: Serial version of CDTF
Input: X, K, λ
Output: A^(n) for all n
 1: initialize R and A^(n) for all n
 2: for outer_iter = 1..T_out do
 3:   for k = 1..K do
 4:     compute R̂ using (5)
 5:     for inner_iter = 1..T_in do
 6:       for n = 1..N do
 7:         for i_n = 1..I_n do
 8:           update a^(n)_{i_n k} using (6)
 9:     update R using (7)

Coordinate descent for tensor factorization (CDTF) is a tensor factorization algorithm based on coordinate descent and extends CCD++ [5] to higher orders. Coordinate descent (CD) is a widely-used optimization technique for minimizing a multivariate function. CD begins with an initial guess of parameters (i.e., variables) and iterates over the parameters, updating one at a time. When CD updates a parameter, it minimizes the loss function with respect to the updated parameter, while fixing all other parameters at their current values. This one-dimensional minimization problem is usually easier to solve than the original problem. CD usually iterates over the parameters multiple times, decreasing the loss function monotonically until convergence. CDTF applies coordinate descent to tensor factorization, whose parameters are the entries of the factor matrices (i.e., A^(1) through A^(N)) and whose loss function is (1). CDTF updates one entry of a factor matrix at a time by assigning it the optimal value minimizing (1). Equation (1) becomes a quadratic equation in terms of the updated parameter when all other parameters are fixed. This leads to the following update rule for each parameter a^(n)_{i_n k}:

a^{(n)}_{i_n k} \leftarrow \arg\min_{a^{(n)}_{i_n k}} L(A^{(1)}, ..., A^{(N)}) = \frac{\sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \Big( \big( r_{i_1...i_N} + \prod_{l=1}^{N} a^{(l)}_{i_l k} \big) \prod_{l \neq n} a^{(l)}_{i_l k} \Big)}{\lambda + \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \prod_{l \neq n} (a^{(l)}_{i_l k})^2}    (4)

where r_{i_1...i_N} = x_{i_1...i_N} - \sum_{s=1}^{K} \prod_{l=1}^{N} a^{(l)}_{i_l s}. Computing (4) from the beginning takes O(|Ω^(n)_{i_n}|NK), but we can reduce it to O(|Ω^(n)_{i_n}|N) by maintaining a residual tensor R up-to-date instead of calculating its entries every time. To maintain R up-to-date, after updating each parameter a^(n)_{i_n k}, we update the entries of R in {r_{i_1...i_N} | (i_1, ..., i_N) ∈ Ω^(n)_{i_n}} by the following rule:

r_{i_1...i_N} \leftarrow r_{i_1...i_N} + \big( (a^{(n)}_{i_n k})_{old} - a^{(n)}_{i_n k} \big) \prod_{l \neq n} a^{(l)}_{i_l k},

where (a^(n)_{i_n k})_old is the old parameter value.

3.1.2 Update Sequence

Column-wise Order. The update sequence of parameters is another important part of coordinate descent. CDTF adopts the following column-wise order:

\big( \underbrace{(a^{(1)}_{11} ... a^{(1)}_{I_1 1})}_{a^{(1)}_{*1}} ... \underbrace{(a^{(N)}_{11} ... a^{(N)}_{I_N 1})}_{a^{(N)}_{*1}} \big) ... \big( \underbrace{(a^{(1)}_{1K} ... a^{(1)}_{I_1 K})}_{a^{(1)}_{*K}} ... \underbrace{(a^{(N)}_{1K} ... a^{(N)}_{I_N K})}_{a^{(N)}_{*K}} \big),

where we update the entries in, for example, the kth column of a factor matrix and then move on to the kth column of the next factor matrix. After the kth columns of all the factor matrices are updated, the (k+1)th columns of the matrices are updated.


In this column-wise order, we can reduce the number of R updates. Updating, for example, the kth columns of all the factor matrices is equivalent to the rank-one factorization of the tensor R̂ whose (i_1, ..., i_N)th entry is \hat{r}_{i_1...i_N} = x_{i_1...i_N} - \sum_{s \neq k} \prod_{l=1}^{N} a^{(l)}_{i_l s}. R̂ can be computed from R by the following rule:

\hat{r}_{i_1...i_N} \leftarrow r_{i_1...i_N} + \prod_{n=1}^{N} a^{(n)}_{i_n k},    (5)

which takes O(N) for each entry and O(|Ω|N) for the entire R̂. Once R̂ is computed, we can compute the rank-one factorization (i.e., update a^(n)_{i_n k} for all n and i_n) without updating R̂. This is because the entries of R̂ do not depend on the parameters updated during the rank-one factorization. Each parameter a^(n)_{i_n k} is updated in O(|Ω^(n)_{i_n}|N) by the following rule:

a^{(n)}_{i_n k} \leftarrow \frac{\sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \big( \hat{r}_{i_1...i_N} \prod_{l \neq n} a^{(l)}_{i_l k} \big)}{\lambda + \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \prod_{l \neq n} (a^{(l)}_{i_l k})^2}.    (6)

We need to update R only once per rank-one factorization by the following rule:

r_{i_1...i_N} \leftarrow \hat{r}_{i_1...i_N} - \prod_{n=1}^{N} a^{(n)}_{i_n k},    (7)

which takes O(|Ω|N) for the entire R as in (5).

Inner Iterations. The number of updates on R can be reduced further by repeating the update of, for example, the kth columns multiple times before moving on to the (k+1)th columns. Since the repeated computation (i.e., rank-one factorization) does not require updating R, if the number of repetitions is T_in, R needs to be updated only once per T_in rank-one factorizations. With T_in inner iterations, the update sequence at the column level changes as follows:

(a^{(1)}_{*1} ... a^{(N)}_{*1})^{T_{in}} (a^{(1)}_{*2} ... a^{(N)}_{*2})^{T_{in}} ... (a^{(1)}_{*K} ... a^{(N)}_{*K})^{T_{in}}.

Algorithm 1 describes the serial version of CDTF with T_out outer iterations. We initialize the entries of A^(1) to zero and those of all other factor matrices to random values, which makes the initial value of R equal to X. Instead of computing R̂ (line 4) before the rank-one factorization, the entries of R̂ can be computed while computing (6) and (7). This can result in better performance on a disk-based system like MAPREDUCE by eliminating the disk I/O operations required to compute and store R̂.
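The whole of Algorithm 1 can be rendered compactly as the following serial sketch (our NumPy rendering under the notation above, not the authors' MAPREDUCE implementation); it stores R as a dictionary keyed by observed index tuples and applies (5), (6), and (7) in the column-wise order.

```python
import numpy as np

def cdtf_serial(observed, dims, K, lam, T_out=10, T_in=1):
    """Serial CDTF (Algorithm 1). observed: {(i_1,...,i_N): x}, dims: [I_1,...,I_N]."""
    N = len(dims)
    # A^(1) initialized to zero, so the initial residual R equals X
    factors = [np.zeros((dims[0], K))] + [np.random.rand(I, K) for I in dims[1:]]
    R = dict(observed)
    # group observed indices by (mode, row) to enumerate Omega^(n)_{i_n} quickly
    by_row = [[[] for _ in range(I)] for I in dims]
    for idx in observed:
        for n, i_n in enumerate(idx):
            by_row[n][i_n].append(idx)

    for _ in range(T_out):
        for k in range(K):
            # (5): R_hat = R plus the rank-one component of column k
            R_hat = {idx: r + np.prod([factors[n][idx[n], k] for n in range(N)])
                     for idx, r in R.items()}
            for _ in range(T_in):
                for n in range(N):
                    for i_n in range(dims[n]):
                        num, den = 0.0, lam
                        for idx in by_row[n][i_n]:
                            p = np.prod([factors[l][idx[l], k]
                                         for l in range(N) if l != n])
                            num += R_hat[idx] * p
                            den += p * p
                        factors[n][i_n, k] = num / den                   # (6)
            for idx in R:                                                # (7)
                R[idx] = R_hat[idx] - np.prod([factors[n][idx[n], k]
                                               for n in range(N)])
    return factors
```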

3.2 Subset Alternating Least Square

Subset alternating least square (SALS), another scalable tensor factorization algorithm, generalizes CDTF in terms of the number of parameters updated at a time. Figure 2 depicts the difference among CDTF, SALS, and ALS. Unlike CDTF, which updates each column of the factor matrices entry by entry, and ALS, which updates all K columns row by row, SALS updates each set of C (1 ≤ C ≤ K) columns row by row.

Fig. 2: Update rules of CDTF, SALS, and ALS ((a) CDTF, (b) SALS, (c) ALS). CDTF updates each column of the factor matrices entry by entry, SALS updates each set of C (1 ≤ C ≤ K) columns row by row, and ALS updates all K columns row by row.

Algorithm 2: Serial version of SALS
Input: X, K, λ
Output: A^(n) for all n
 1: initialize R and A^(n) for all n
 2: for outer_iter = 1..T_out do
 3:   for split_iter = 1..⌈K/C⌉ do
 4:     choose k_1, ..., k_C (from columns not updated yet)
 5:     compute R̂ using (8)
 6:     for inner_iter = 1..T_in do
 7:       for n = 1..N do
 8:         for i_n = 1..I_n do
 9:           update a^(n)_{i_n k_1}, ..., a^(n)_{i_n k_C} using (9)
10:     update R using (10)

SALS contains CDTF (C = 1) and ALS (C = K) as special cases. In other words, SALS and ALS are similar in that, when they update a factor matrix, they fix all other factor matrices at their current values and update the factor matrix in a row-wise manner by assigning the row vector minimizing (1) given all other parameters. However, SALS and ALS are different in that, when they update a row of a factor matrix, ALS updates all the K entries in the row at a time, while SALS updates only C (≤ K) entries in the row, fixing the other (K−C) entries. Our experimental results in Section 5 show that, with proper C, SALS enjoys both the high scalability of CDTF and the fast convergence of ALS, although SALS requires more memory space than CDTF.

Algorithm 2 describes the procedure of SALS. Line 4, where C columns are randomly chosen, and line 9, where C parameters are updated simultaneously, are the major differences from CDTF. Updating the C columns (k_1, ..., k_C) of all the factor matrices (lines 7 through 9) is equivalent to the rank-C factorization of R̂ whose (i_1, ..., i_N)th entry is \hat{r}_{i_1...i_N} = x_{i_1...i_N} - \sum_{k \notin \{k_1,...,k_C\}} \prod_{n=1}^{N} a^{(n)}_{i_n k}. R̂ can be computed from R by the following rule (line 5):

\hat{r}_{i_1...i_N} = r_{i_1...i_N} + \sum_{c=1}^{C} \prod_{n=1}^{N} a^{(n)}_{i_n k_c},    (8)

which takes O(NC) for each entry and O(|Ω|NC) for the entire R̂. After computing R̂, the parameters in the C columns are updated row by row as follows (line 9):

[a^{(n)}_{i_n k_1}, ..., a^{(n)}_{i_n k_C}]^T \leftarrow \arg\min_{[a^{(n)}_{i_n k_1}, ..., a^{(n)}_{i_n k_C}]^T} L(A^{(1)}, ..., A^{(N)}) = (B^{(n)}_{i_n} + \lambda I_C)^{-1} c^{(n)}_{i_n}    (9)


where B^(n)_{i_n} is a C by C matrix whose entries are

(B^{(n)}_{i_n})_{c_1 c_2} = \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \Big( \prod_{l \neq n} a^{(l)}_{i_l k_{c_1}} \prod_{l \neq n} a^{(l)}_{i_l k_{c_2}} \Big), \forall c_1, c_2,

c^(n)_{i_n} is a length-C vector whose entries are

(c^{(n)}_{i_n})_c = \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \Big( \hat{r}_{i_1...i_N} \prod_{l \neq n} a^{(l)}_{i_l k_c} \Big), \forall c,

and I_C is the C by C identity matrix. The correctness of this update rule is proved in Section 3.3.1. Computing (9) takes O(|Ω^(n)_{i_n}|C(N+C) + C^3), which consists of O(|Ω^(n)_{i_n}|NC) to compute \prod_{l \neq n} a^{(l)}_{i_l k_1} through \prod_{l \neq n} a^{(l)}_{i_l k_C} for all the entries in Ω^(n)_{i_n}, O(|Ω^(n)_{i_n}|C^2) to build B^(n)_{i_n}, O(|Ω^(n)_{i_n}|C) to build c^(n)_{i_n}, and O(C^3) to compute the inverse. Instead of computing the inverse, the Cholesky decomposition, which also takes O(C^3), can be used for speed and numerical stability. After repeating this rank-C factorization T_in times (line 6), R is updated by the following rule (line 10):

r_{i_1...i_N} \leftarrow \hat{r}_{i_1...i_N} - \sum_{c=1}^{C} \prod_{n=1}^{N} a^{(n)}_{i_n k_c},    (10)

which takes O(|Ω|NC) for the entire R as in (8).
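For concreteness, one row update under (9) can be sketched as follows (serial NumPy/SciPy code with assumed data structures, not the distributed implementation); it builds the C×C system from R̂ and solves it via the Cholesky decomposition mentioned above.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sals_update_row(omega_n_in, R_hat, factors, n, cols, lam):
    """Compute new values of a^(n)_{i_n k_1}, ..., a^(n)_{i_n k_C} by rule (9).

    omega_n_in : list of index tuples in Omega^(n)_{i_n}
    R_hat      : dict mapping index tuples to entries of R_hat
    cols       : the chosen column indices (k_1, ..., k_C)
    """
    C = len(cols)
    B = np.zeros((C, C))
    c = np.zeros(C)
    for idx in omega_n_in:
        q = np.ones(C)                         # q_c = prod_{l != n} a^(l)_{i_l k_c}
        for l, i_l in enumerate(idx):
            if l != n:
                q *= factors[l][i_l, cols]
        B += np.outer(q, q)
        c += R_hat[idx] * q
    # solve (B + lam*I) x = c via Cholesky for speed and numerical stability
    return cho_solve(cho_factor(B + lam * np.eye(C)), c)
```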

3.3 Theoretical Analysis

3.3.1 Convergence Analysis

We prove that CDTF and SALS monotonically decrease the loss function (1) until convergence.

Theorem 1 (Correctness of SALS): The update rule (9) minimizes (1) with respect to the updated parameters. That is,

\arg\min_{[a^{(n)}_{i_n k_1}, ..., a^{(n)}_{i_n k_C}]^T} L(A^{(1)}, ..., A^{(N)}) = (B^{(n)}_{i_n} + \lambda I_C)^{-1} c^{(n)}_{i_n}.

Proof:

\frac{\partial L}{\partial a^{(n)}_{i_n k_c}} = 0, \forall c, 1 \leq c \leq C

\Leftrightarrow \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \Big( \sum_{s=1}^{C} \prod_{l=1}^{N} a^{(l)}_{i_l k_s} - \hat{r}_{i_1...i_N} \Big) \prod_{l \neq n} a^{(l)}_{i_l k_c} + \lambda a^{(n)}_{i_n k_c} = 0, \forall c

\Leftrightarrow \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \Big( \sum_{s=1}^{C} a^{(n)}_{i_n k_s} \prod_{l \neq n} a^{(l)}_{i_l k_s} \prod_{l \neq n} a^{(l)}_{i_l k_c} \Big) + \lambda a^{(n)}_{i_n k_c} = \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \Big( \hat{r}_{i_1...i_N} \prod_{l \neq n} a^{(l)}_{i_l k_c} \Big), \forall c

\Leftrightarrow (B^{(n)}_{i_n} + \lambda I_C) [a^{(n)}_{i_n k_1}, ..., a^{(n)}_{i_n k_C}]^T = c^{(n)}_{i_n}.

This theorem also proves the correctness of CDTF, which is a special case of SALS, and leads to the following convergence property.

Theorem 2 (Convergence of CDTF and SALS): In CDTF and SALS, the loss function (1) decreases monotonically.

Proof: By Theorem 1, every update in CDTF and SALS minimizes (1) in terms of the updated parameters. Thus, the loss function never increases.

3.3.2 Complexity Analysis

In this section, we analyze the computational and space complexity of CDTF and SALS.

Theorem 3 (Computational complexity of SALS): The computational complexity of SALS (Algorithm 2) is O(T_out |Ω| N T_in K (N+C) + T_out T_in K C^2 Σ_{n=1}^{N} I_n).

Proof: As explained in Section 3.2, updating each set of C parameters (a^(n)_{i_n k_1}, ..., a^(n)_{i_n k_C}) using (9) (line 9 of Algorithm 2) takes O(|Ω^(n)_{i_n}|C(C+N) + C^3); and both computing R̂ (line 5) and updating R (line 10) take O(|Ω|NC). Thus, updating all the entries in the C columns (lines 8 through 9) takes O(|Ω|C(C+N) + I_n C^3), and the rank-C factorization (lines 7 through 9) takes O(|Ω|NC(N+C) + C^3 Σ_{n=1}^{N} I_n). As a result, an outer iteration, which repeats the rank-C factorization T_in K/C times and both the R̂ and R updates K/C times, takes O(|Ω| N T_in K (N+C) + T_in K C^2 Σ_{n=1}^{N} I_n) + O(|Ω|NK), where the second term is dominated.

Theorem 4 (Space complexity of SALS): The memory requirement of SALS (Algorithm 2) is O(C Σ_{n=1}^{N} I_n).

Proof: Since the R̂ computation (line 5 of Algorithm 2), the rank-C factorization (lines 7 through 9), and the R update (line 10) all depend only on the C chosen columns of the factor matrices, the number of whose entries is C Σ_{n=1}^{N} I_n, the other (K−C) columns need not be loaded into memory. Thus, the columns of the factor matrices can be loaded by turns depending on the (k_1, ..., k_C) values. Moreover, updating the C columns (lines 8 through 9) can be processed by streaming the entries of R̂ from disk and processing them one by one instead of loading them all at once, because the entries of B^(n)_{i_n} and c^(n)_{i_n} in (9) are sums of values calculated independently from each R̂ entry. Likewise, the R̂ computation and the R update can also be processed by streaming R and R̂, respectively.

These theorems are applied to CDTF, which is a special case of SALS.

Theorem 5 (Computational complexity of CDTF): The computational complexity of CDTF (Algorithm 1) is O(T_out |Ω| N^2 T_in K).

Theorem 6 (Space complexity of CDTF): The memory requirement of CDTF (Algorithm 1) is O(Σ_{n=1}^{N} I_n).

3.4 Parallelization in Distributed Environments

In this section, we describe the distributed versions of CDTF and SALS. We assume a distributed environment where machines do not share memory, such as in MAPREDUCE. The extension to shared-memory systems, where the only difference is that we do not need to broadcast updated parameters, is straightforward. The method to assign the rows of the factor matrices to machines (i.e., to decide mS_n for all n and m) is explained in Section 3.4.3; until then, we assume that the row assignments are given.


Fig. 3: Work and data distribution of CDTF and SALS in distributed environments when the input tensor is a 3-order tensor and the number of machines is four ((a) Machine 1, (b) Machine 2, (c) Machine 3, (d) Machine 4). We assume that the rows of the factor matrices are assigned to the machines sequentially. The colored region of the transpose of A^(n) in each sub-figure corresponds to the parameters updated by the corresponding machine, and that of X corresponds to the data distributed to that machine.

Algorithm 3: Distributed version of CDTF
Input: X, K, λ, mS_n for all m and n
Output: A^(n) for all n
 1: distribute the mΩ entries of X to each machine m
 2: Parallel (P): initialize the mΩ entries of R
 3: P: initialize A^(n) for all n
 4: for outer_iter = 1..T_out do
 5:   for k = 1..K do
 6:     P: compute the mΩ entries of R̂
 7:     for inner_iter = 1..T_in do
 8:       for n = 1..N do
 9:         P: update ma^(n)_{*k} using (6)
10:         P: broadcast ma^(n)_{*k}
11:     P: update the mΩ entries of R using (7)

3.4.1 Distributed Version of CDTF

Work and Data Distribution. Since the update rule (6) for each parameter a^(n)_{i_n k} does not depend on the other parameters in its column a^(n)_{*k}, updating the parameters in a column can be distributed across multiple machines and processed simultaneously without affecting the updated results and thus the correctness (Theorem 2). For each column a^(n)_{*k}, machine m updates the assigned parameters, ma^(n)_{*k} = {a^(n)_{i_n k} | i_n ∈ mS_n}. At the matrix level, the mS_n rows of each factor matrix A^(n) are updated by machine m. The entries of X necessary to update the assigned parameters are distributed to each machine. That is, the X entries in mΩ = ⋃_{n=1}^{N} ( ⋃_{i_n ∈ mS_n} Ω^(n)_{i_n} ) are distributed to each machine m, and the machine maintains and updates the R entries in mΩ.

The sets of X entries sent to the machines are not disjoint (i.e., m1Ω ∩ m2Ω ≠ ∅), and each entry of X can be sent to at most N machines. Thus, the computational cost for computing R̂ and updating R increases up to N times in distributed environments. However, since this cost is dominated by the others, the overall asymptotic complexity remains the same. Figure 3 shows an example of work and data distribution.

Communication. In order to update the parameters in, for example, the kth column of a factor matrix using (6), the kth columns of all the other factor matrices are required. Therefore, after updating the kth column of a factor matrix, the updated parameters are broadcast to all other machines so that they can be used to update the kth column of the next factor matrix. The broadcast parameters are also used to update the entries of R using (7). Therefore, after updating the assigned parameters in a^(n)_{*k}, each machine m broadcasts |mS_n| parameters and receives (I_n − |mS_n|) parameters from the other machines. The number of parameters each machine exchanges with the other machines is Σ_{n=1}^{N} I_n per rank-one factorization and K T_in Σ_{n=1}^{N} I_n per outer iteration. Algorithm 3 depicts the distributed version of CDTF.

Algorithm 4: Distributed version of SALS
Input: X, K, λ, mS_n for all m and n
Output: A^(n) for all n
 1: distribute the mΩ entries of X to each machine m
 2: Parallel (P): initialize the mΩ entries of R
 3: P: initialize A^(n) for all n
 4: for outer_iter = 1..T_out do
 5:   for split_iter = 1..⌈K/C⌉ do
 6:     choose (k_1, ..., k_C) (from columns not updated yet)
 7:     P: compute the mΩ entries of R̂
 8:     for inner_iter = 1..T_in do
 9:       for n = 1..N do
10:         P: update {a^(n)_{i_n k_c} | i_n ∈ mS_n, 1 ≤ c ≤ C} using (9)
11:         P: broadcast {a^(n)_{i_n k_c} | i_n ∈ mS_n, 1 ≤ c ≤ C}
12:     P: update the mΩ entries of R using (10)

3.4.2 Distributed Version of SALS

SALS is also parallelized in distributed environments without affecting its correctness. The work and data distribution of SALS is exactly the same as that of CDTF, and the only difference between the two methods is that in SALS each machine updates and broadcasts C columns at a time.

Algorithm 4 describes the distributed version of SALS. After updating the assigned parameters in C columns of a factor matrix A^(n) (line 10), each machine m sends C|mS_n| parameters and receives C(I_n − |mS_n|) parameters (line 11), and thus each machine exchanges C Σ_{n=1}^{N} I_n parameters during the rank-C factorization (lines 9 through 11). Since this factorization is repeated T_in K/C times, the total number of parameters each machine exchanges is K T_in Σ_{n=1}^{N} I_n per outer iteration, as in CDTF.

3.4.3 Row Assignment

The running time of the parallel steps in the distributed versions of CDTF and SALS depends on the longest running time among all machines. Specifically, the running times of lines 6, 9, and 11 in Algorithm 3 and lines 7, 10, and 12 in Algorithm 4 are proportional to max_m |mΩ^(n)|, where mΩ^(n) = ⋃_{i_n ∈ mS_n} Ω^(n)_{i_n}. Line 10 in Algorithm 3 and line 11 in Algorithm 4 are proportional to max_m |mS_n|. Therefore, it is important to assign the rows of the factor matrices to the machines (i.e., to decide mS_n) such that |mΩ^(n)| and |mS_n| are evenly distributed among all the machines.


Algorithm 5: Greedy row assignment in CDTF and SALS
Input: X, M
Output: mS_n for all m and n
 1: initialize |mΩ| to 0 for all m
 2: for n = 1..N do
 3:   initialize mS_n to ∅ for all m
 4:   initialize |mΩ^(n)| to 0 for all m
 5:   calculate |Ω^(n)_{i_n}| for all i_n
 6:   foreach i_n (in decreasing order of |Ω^(n)_{i_n}|) do
 7:     find m with |mS_n| < ⌈I_n/M⌉ and the smallest |mΩ^(n)|
 8:       (in case of a tie, choose the machine with smaller |mS_n|, and if still a tie, choose one with smaller |mΩ|)
 9:     add i_n to mS_n
10:     add |Ω^(n)_{i_n}| to |mΩ^(n)| and |mΩ|

For this purpose, we design a greedy assignment algorithm which aims to minimize max_m |mΩ^(n)| under the condition that |mS_n| is completely even (i.e., |mS_n| = I_n/M for all n, where M is the number of machines). For each factor matrix A^(n), we sort its rows in decreasing order of |Ω^(n)_{i_n}| and assign the rows one by one to the machine m that satisfies |mS_n| < ⌈I_n/M⌉ and currently has the smallest |mΩ^(n)|. Algorithm 5 describes the details of the algorithm, and a small sketch is given after the list below. The effects of greedy row assignment on actual running times are described in Section 5.5. In that section, the greedy row assignment is compared with two baselines:

• Sequential row assignment: Indices of the rows in each mode are divided into M sequential ranges, and the rows whose indices belong to the same range are assigned to the same machine (i.e., mS_n = {i_n ∈ N | I_n×(m−1)/M < i_n ≤ I_n×m/M}).

• Random row assignment: Each row in each mode is assigned to a randomly chosen machine.
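The sketch below renders Algorithm 5 in Python (serial, with illustrative names; `omega_counts[n][i_n]` stands for |Ω^(n)_{i_n}|); for each row it picks, among the machines that are not yet full, the one with the smallest current |mΩ^(n)|, breaking ties as in lines 7-8.

```python
import math

def greedy_row_assignment(omega_counts, M):
    """Greedy row assignment (Algorithm 5): returns S[m][n], the rows of A^(n) on machine m."""
    N = len(omega_counts)
    S = [[set() for _ in range(N)] for _ in range(M)]
    load_total = [0] * M                       # |mOmega| for each machine
    for n in range(N):
        I_n = len(omega_counts[n])
        cap = math.ceil(I_n / M)               # at most ceil(I_n / M) rows per machine
        load_mode = [0] * M                    # |mOmega^(n)| for each machine
        # iterate over rows in decreasing order of |Omega^(n)_{i_n}|
        for i_n in sorted(range(I_n), key=lambda i: -omega_counts[n][i]):
            candidates = [m for m in range(M) if len(S[m][n]) < cap]
            # smallest |mOmega^(n)|; ties: smaller |mS_n|, then smaller |mOmega|
            m = min(candidates,
                    key=lambda j: (load_mode[j], len(S[j][n]), load_total[j]))
            S[m][n].add(i_n)
            load_mode[m] += omega_counts[n][i_n]
            load_total[m] += omega_counts[n][i_n]
    return S
```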

3.5 Loss Functions and Updates

In this section, we discuss how CDTF and SALS are applied to minimize various loss functions which have been found useful in previous studies. We specifically consider the PARAFAC decomposition with L1 and weighted-L2 regularization and with non-negativity constraints. Coupled tensor factorization and factorization using the bias model are also considered.

3.5.1 PARAFAC with L1 Regularization

L1 regularization, which leads to sparse factor matrices, can be used in the PARAFAC decomposition instead of L2 regularization. Sparser factor matrices are easier to interpret and require less storage. With L1 regularization, we use the following loss function instead of (1):

L_{Lasso}(A^{(1)}, ..., A^{(N)}) = \sum_{(i_1,...,i_N) \in \Omega} \Big( x_{i_1...i_N} - \sum_{k=1}^{K} \prod_{n=1}^{N} a^{(n)}_{i_n k} \Big)^2 + \lambda \sum_{n=1}^{N} \|A^{(n)}\|_1.    (11)

To minimize (11), CDTF (see Section 3.1), which updates one parameter at a time while fixing the others, uses equation (12), instead of the original equation (6), for updating each parameter a^(n)_{i_n k}:

a^{(n)}_{i_n k} \leftarrow \arg\min_{a^{(n)}_{i_n k}} L_{Lasso}(A^{(1)}, ..., A^{(N)}) = \begin{cases} (\lambda - g)/d & \text{if } g > \lambda \\ -(\lambda + g)/d & \text{if } g < -\lambda \\ 0 & \text{otherwise} \end{cases}    (12)

where g = -2 \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \big( \hat{r}_{i_1...i_N} \prod_{l \neq n} a^{(l)}_{i_l k} \big) and d = 2 \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \prod_{l \neq n} (a^{(l)}_{i_l k})^2. The proof of this update rule can be found in Theorem 7 in [18]. If we simply replace (6) with (12), Algorithms 1 and 3 can be used to minimize (11) without other changes.
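A minimal sketch of the soft-thresholding rule (12) for one parameter, under the same assumed sparse layout as the earlier sketches (illustrative code, not the authors' implementation):

```python
import numpy as np

def cdtf_l1_update(omega_n_in, R_hat, factors, n, k, lam):
    """CDTF update (12) of a^(n)_{i_n k} under L1 regularization."""
    g, d = 0.0, 0.0
    N = len(factors)
    for idx in omega_n_in:                       # entries in Omega^(n)_{i_n}
        p = np.prod([factors[l][idx[l], k] for l in range(N) if l != n])
        g += -2.0 * R_hat[idx] * p               # g = -2 * sum r_hat * prod_{l != n} a
        d += 2.0 * p * p                         # d =  2 * sum (prod_{l != n} a)^2
    if g > lam:
        return (lam - g) / d
    if g < -lam:
        return -(lam + g) / d
    return 0.0
```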

However, SALS is not directly applicable to (11) since a closed form solution for minimizing the loss function with respect to C parameters does not exist if C ≥ 2. Although an iterative or suboptimal update rule can be used instead of (9), this is beyond the scope of this paper.

3.5.2 PARAFAC with Weighted L2 Regularization

Weighted L2 regularization [3] has been used in many recommender systems based on matrix factorization [1], [2], [3] since it effectively prevents overfitting. Applying this method to the PARAFAC decomposition results in the following loss function:

L(A^{(1)}, ..., A^{(N)}) = \sum_{(i_1,...,i_N) \in \Omega} \Big( x_{i_1...i_N} - \sum_{k=1}^{K} \prod_{n=1}^{N} a^{(n)}_{i_n k} \Big)^2 + \lambda \sum_{n=1}^{N} \sum_{i_n=1}^{I_n} |\Omega^{(n)}_{i_n}| \cdot \|a^{(n)}_{i_n *}\|_2^2.    (13)

CDTF and SALS can be used to minimize (13) if λ in (6) and (9) is replaced with λ|Ω^(n)_{i_n}|. This can be proved also by replacing λ in Theorem 1 and its proof with λ|Ω^(n)_{i_n}|. With these changes, Algorithms 1, 2, 3, and 4 can be used to minimize (13) without other changes.

3.5.3 PARAFAC with Non-negativity Constraint

The constraint that factor matrices have only non-negative entries has been adopted in many applications [11], [19] because the resulting factor matrices are easier to inspect. To minimize (1) under this non-negativity constraint, CDTF (see Section 3.1) updates each parameter a^(n)_{i_n k} by assigning it the non-negative value minimizing (1) when the other parameters are fixed at their current values. That is,

a^{(n)}_{i_n k} \leftarrow \arg\min_{a^{(n)}_{i_n k} \geq 0} L(A^{(1)}, ..., A^{(N)}) = \max\Big( \frac{-g}{d + 2\lambda}, 0 \Big),    (14)

where g = -2 \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \big( \hat{r}_{i_1...i_N} \prod_{l \neq n} a^{(l)}_{i_l k} \big) and d = 2 \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \prod_{l \neq n} (a^{(l)}_{i_l k})^2. See Theorem 8 in [18] for the proof of this update rule. Algorithms 1 and 3 can be used to minimize (1) under the non-negativity constraint if we simply replace (6) with (14).

However, SALS cannot be directly used with the non-negativity constraint since a closed form solution for minimizing (1) with respect to C parameters does not exist under the constraint if C ≥ 2. Although an iterative or suboptimal update rule can be used instead of (9), this is beyond the scope of this paper.

3.5.4 Coupled Tensor Factorization

The joint factorization of tensors sharing common modes (e.g., user-movie rating data and a social network among users) can provide a better understanding of latent factors [11], [20]. Suppose that an N_x-order tensor X and an N_y-order tensor Y share their first mode, without loss of generality. The coupled factorization of them decomposes X into factor matrices xA^(1) through xA^(N_x) and Y into yA^(1) through yA^(N_y) under the condition that xA^(1) = yA^(1). The loss function of the coupled factorization is the sum of the loss functions of the separate factorizations (i.e., equation (1)) as follows:

L_{Coupled}(xA^{(1)}, ..., xA^{(N_x)}, yA^{(1)}, ..., yA^{(N_y)}) = L(xA^{(1)}, ..., xA^{(N_x)}) + L(yA^{(1)}, ..., yA^{(N_y)}) - \lambda \|xA^{(1)}\|_F^2,    (15)

where the term -\lambda \|xA^{(1)}\|_F^2 prevents the shared matrix from being regularized twice.

The update rule for the entries in the non-shared factor matrices is the same as that in separate factorization. When updating the entries in the shared matrix, however, we should consider both tensors. In SALS (see Section 3.2), if L(xA^(1), ..., xA^(N_x)) is minimized at [a^(1)_{i_1 k_1}, ..., a^(1)_{i_1 k_C}]^T = (xB^(1)_{i_1} + λI_C)^{-1} xc^(1)_{i_1} and L(yA^(1), ..., yA^(N_y)) is minimized at (yB^(1)_{i_1} + λI_C)^{-1} yc^(1)_{i_1} (see (9) for the notations), then L_{Coupled} is minimized at (xB^(1)_{i_1} + yB^(1)_{i_1} + λI_C)^{-1} (xc^(1)_{i_1} + yc^(1)_{i_1}), given the other parameters. That is, to minimize (15), SALS updates each set of C parameters [a^(1)_{i_1 k_1}, ..., a^(1)_{i_1 k_C}]^T in the shared factor matrix by the following rule:

[a^{(1)}_{i_1 k_1}, ..., a^{(1)}_{i_1 k_C}]^T \leftarrow \arg\min_{[a^{(1)}_{i_1 k_1}, ..., a^{(1)}_{i_1 k_C}]^T} L_{Coupled}(xA^{(1)}, ..., xA^{(N_x)}, yA^{(1)}, ..., yA^{(N_y)}) = (xB^{(1)}_{i_1} + yB^{(1)}_{i_1} + \lambda I_C)^{-1} (xc^{(1)}_{i_1} + yc^{(1)}_{i_1}).    (16)

See Theorem 9 and Algorithm 6 in [18] for the proof of this update rule and the pseudocode for coupled tensor factorization. CDTF, a special case of SALS, can also be used for minimizing (15) in the same way.

3.5.5 Bias Model

The PARAFAC decomposition (see Section 2.2) captures interactions among modes that lead to different entry values in X. In recommendation problems, however, much of the variation in entry values (i.e., rates) is explained by each mode alone. For this reason, [3] added bias terms, which capture the effect of each mode, to their matrix factorization model. Likewise, we can add bias vectors {b^(n) ∈ R^{I_n} | 1 ≤ n ≤ N} to (1), resulting in the following loss function:

L_{Bias}(A^{(1)}, ..., A^{(N)}, b^{(1)}, ..., b^{(N)}) = \sum_{(i_1,...,i_N) \in \Omega} \Big( x_{i_1...i_N} - \mu - \sum_{n=1}^{N} b^{(n)}_{i_n} - \sum_{k=1}^{K} \prod_{n=1}^{N} a^{(n)}_{i_n k} \Big)^2 + \lambda_A \sum_{n=1}^{N} \|A^{(n)}\|_F^2 + \lambda_b \sum_{n=1}^{N} \|b^{(n)}\|_2^2,    (17)

where μ denotes the average entry value of X, and λ_A and λ_b denote the regularization parameters for the factor matrices and the bias terms, respectively.

With (17), each (i_1, ..., i_N)th entry of the residual tensor R is redefined as r_{i_1...i_N} = x_{i_1...i_N} - μ - Σ_{n=1}^{N} b^(n)_{i_n} - Σ_{k=1}^{K} Π_{n=1}^{N} a^(n)_{i_n k}. At each outer iteration, we first update the factor matrices by either CDTF or SALS and then update the bias vectors in the CDTF manner. That is, we update each parameter b^(n)_{i_n} at a time, while fixing the other parameters at their current values, by the following update rule:

b^{(n)}_{i_n} \leftarrow \arg\min_{b^{(n)}_{i_n}} L_{Bias}(A^{(1)}, ..., A^{(N)}, b^{(1)}, ..., b^{(N)}) = \sum_{(i_1,...,i_N) \in \Omega^{(n)}_{i_n}} \hat{r}_{i_1...i_N} / (\lambda_b + |\Omega^{(n)}_{i_n}|),    (18)

where \hat{r}_{i_1...i_N} = r_{i_1...i_N} + b^{(n)}_{i_n}. The proof of this update rule can be found in Theorem 10 in the supplementary document [18]. After updating each b^(n)_{i_n}, the entries of R in {r_{i_1...i_N} | (i_1, ..., i_N) ∈ Ω^(n)_{i_n}} should be updated by the following rule:

r_{i_1...i_N} \leftarrow r_{i_1...i_N} + (b^{(n)}_{i_n})_{old} - b^{(n)}_{i_n},    (19)

where (b^(n)_{i_n})_old is the old parameter value. Algorithm 7 in [18] gives the detailed pseudocode of SALS for the bias model. It includes CDTF for the bias model as a special case.
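As a small illustration of (18) and (19) (a sketch under the assumed structures used earlier; `R` is the residual dictionary redefined above, and `bias[n][i_n]` holds b^(n)_{i_n}):

```python
def update_bias(omega_n_in, R, bias, n, i_n, lam_b):
    """Update b^(n)_{i_n} by (18) and refresh the affected residual entries by (19)."""
    old = bias[n][i_n]
    # (18): each r_hat equals r plus the old bias value
    total = sum(R[idx] + old for idx in omega_n_in)
    new = total / (lam_b + len(omega_n_in))
    bias[n][i_n] = new
    for idx in omega_n_in:
        R[idx] += old - new                     # (19): r <- r + (b)_old - b_new
    return new
```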

4 OPTIMIZATION ON MAPREDUCE

In this section, we describe the optimization techniques used to implement CDTF and SALS on MAPREDUCE, which is one of the most widely used distributed platforms. These techniques can also be applied to diverse distributed algorithms, including ALS, FLEXIFACT, and PSGD. The effect of these techniques on actual running time is shown in Section 5.5.

4.1 Local Disk Caching

The typical MAPREDUCE implementation of CDTF and SALS without local disk caching runs each parallel step (e.g., the parameter (a^(n)_{*k}) update, the R update, and the R̂ computation in Algorithm 3) as a separate MAPREDUCE job, which incurs redistributing data across machines. Repeatedly redistributing data is extremely inefficient in SALS and CDTF due to their highly iterative nature. We avoid this inefficiency by caching data to local disk once they are distributed. In our implementation of CDTF and SALS with local disk caching, the X entries are distributed across machines and cached in their local disks during the map and reduce stages (see Algorithm 8 in the supplementary document [18] for the detailed pseudocode); and the rest of CDTF and SALS runs in the close stage (cleanup stage in Hadoop) using the cached data.

4.2 Direct Communication

In MAPREDUCE, it is generally assumed that reducers run independently and do not communicate directly with each other. However, CDTF and SALS require updated parameters to be broadcast to other reducers. We circumvent this problem by adapting the direct communication method used in FlexiFaCT [11]. Each reducer writes the parameters that it updated to the distributed file system, and the other reducers read these parameters from the distributed file system. See Section 2.2 of [18] for details, including a pseudocode.

TABLE 4: Scale of synthetic datasets. B: billion, M: million, K: thousand. The length of every mode is equal to I.

         S1       S2 (default)   S3       S4
N        2        3              4        5
I        300K     1M             3M       10M
|Ω|      30M      100M           300M     1B
K        30       100            300      1K

4.3 Greedy Row Assignment

We also apply the greedy row assignment, explained in Section 3.4.3, to our MAPREDUCE implementations. See Section 2.3 of [18] for the detailed MAPREDUCE implementation of the greedy row assignment.

5 EXPERIMENTS

Our experimental results answer the questions below:

• Q1: Data scalability (Section 5.2). How do CDTF, SALS, and their competitors scale with regard to the order, number of observations, mode length, and rank of an input tensor?

• Q2: Machine scalability (Section 5.3). How do CDTF, SALS, and their competitors scale with regard to the number of machines?

• Q3: Convergence (Section 5.4). How quickly and accurately do CDTF, SALS, and their competitors factorize real-world tensors?

• Q4: Optimization (Section 5.5). How much do local disk caching and greedy row assignment improve the speed of CDTF and SALS? Can these techniques be applied to other methods?

• Q5: Effects of C and T_in (Section 5.6). How do the number of columns updated at a time (C) and the number of inner iterations (T_in) affect the convergence of SALS and CDTF?

All experiments are focused on distributed methods, which are the most suitable to achieve our purpose of handling large-scale data. We compared our methods with ALS, FLEXIFACT, and PSGD, all of which can be parallelized in distributed environments, as explained in Section 2.3. Serial methods (e.g., [17], [21]) and methods not directly applicable to partially observable tensors (e.g., [22], [23]) were not considered.

5.1 Experimental Settings

We ran experiments on a 40-node Hadoop cluster. Each node had an Intel Xeon E5620 2.4GHz CPU. The maximum heap size per reducer was set to 8GB.

TABLE 5: Summary of real-world datasets.

             Movielens4    Netflix3      Yahoo-music4
N            4             3             4
I_1          71,567        2,649,429     1,000,990
I_2          65,133        17,770        624,961
I_3 / I_4    169 / 24      74 / -        133 / 24
|Ω|          9,301,274     99,072,112    252,800,275
|Ω|_test     698,780       1,408,395     4,003,960
K            20            40            80
λ (= λ_A)    0.01          0.02          1.0
λ_b          1.0           0.02          1.0
η_0          0.01          0.01          10^-5 (FLEXIFACT), 10^-4 (PSGD)

We used both synthetic (Table 4) and real-world (Table 5) datasets, most of which are available at http://www.cs.cmu.edu/∼kijungs/codes/cdtf. Synthetic tensors were created as synthetic matrices were created in [24]. That is, we first created ground-truth factor matrices with random entries. Then, we randomly sampled |Ω| entries of the created tensor and set each sampled entry x_{i_1...i_N} to δ + Σ_{k=1}^{K} Π_{n=1}^{N} a^(n)_{i_n k}, where δ denotes random noise with mean 0 and variance 1. The real-world tensors are preprocessed as follows:

• Movielens4^1: Movie rating data from MovieLens, an online movie recommender service. We converted them into a four-order tensor where the third and fourth modes correspond to (year, month) and hour-of-day when the movie was rated, respectively. The rates range from 1 to 5.

• Netflix3^2: Movie rating data used in the Netflix prize. We regarded them as a three-order tensor where the third mode is (year, month) when the movie was rated. The rates range from 1 to 5.

• Yahoo-music4^3: Music rating data used in KDD CUP 2011. We converted them into a four-order tensor in the same way as we did for Movielens4. The rates range from 0 to 100.

All the methods were implemented in Java with Hadoop 1.0.3. The local disk caching, the direct communication, and the greedy row assignment (see Section 4) were applied to all the methods if possible. All our implementations used weighted L2 regularization (see Section 3.5.2). For CDTF and SALS, T_in was set to 1 and C was set to 10, unless otherwise stated. The learning rate of FLEXIFACT and PSGD at the tth iteration was set to 2η_0/(1 + t), as in the open-sourced FLEXIFACT^4. The number of reducers was set to 5 for FLEXIFACT, 20 for PSGD, and 40 for the other methods, each of which led to the best performance in the machine scalability test in Section 5.3.

5.2 Data Scalability

5.2.1 Scalability with Each Factor (Figures 4-7)

We measured the scalability of CDTF, SALS, and the competitors with regard to the order, number of observations, mode length, and rank of an input tensor. When measuring the scalability with regard to a factor, the factor was scaled up from S1 to S4 while all other factors were fixed at S2 in Table 4.

1. http://grouplens.org/datasets/movielens
2. http://www.netflixprize.com
3. http://webscope.sandbox.yahoo.com/catalog.php?datatype=c
4. http://alexbeutel.com/l/flexifact/


Fig. 4: Scalability w.r.t. order ((a) running time, (b) memory requirements). FLEXIFACT was not scalable with order, while the other methods, including CDTF and SALS, were scalable with order.

Fig. 5: Scalability w.r.t. the number of observations ((a) running time, (b) memory requirements). All methods, including CDTF and SALS, were scalable with the number of observable entries.

When measuring the scalability with regard to each factor, that factor was scaled up from S1 to S4 while all other factors were fixed at S2 (see Table 4).

As seen in Figure 4, FLEXIFACT did not scale with the order because of its communication cost, which increases exponentially with the order. ALS and PSGD were not scalable with the mode length and the rank due to their high memory requirements, as Figures 6 and 7 show. They required up to 11.2GB, which is 48× the 234MB that CDTF required and 10× the 1,147MB that SALS required. Moreover, the running time of ALS increased rapidly with rank owing to its cubically increasing computational cost.

Only SALS and CDTF were scalable with all the factors, as summarized in Table 1. Their running times increased linearly with all the factors except the order, with which they increased slightly faster due to the quadratically increasing computational cost.

5.2.2 Overall Scalability (Figure 1 in Section 1)

We measured the scalability of the methods by scaling up all the factors simultaneously from S1 to S4. The scalability of FLEXIFACT with five machines, ALS, and PSGD was limited owing to their high memory requirements. ALS and PSGD required about 186GB to handle S4, which is 493× the 387MB that CDTF required and 100× the 1,912MB that SALS required. FLEXIFACT with 40 machines did not scale beyond S2 due to its rapidly increasing communication cost. Only CDTF and SALS scaled up to S4, and there was a trade-off between them: CDTF was more memory-efficient, and SALS ran faster.

5.3 Machine Scalability (Figure 8)

We measured the speed-ups (T5/TM, where TM is the running time with M reducers) and memory requirements of the methods on the S2-scale dataset while increasing the number of reducers.

Fig. 6: Scalability w.r.t. mode length ((a) running time, (b) memory requirements). o.o.m.: out of memory. ALS and PSGD did not scale with mode length, while CDTF, SALS, and FLEXIFACT did.

Fig. 7: Scalability w.r.t. rank ((a) running time, (b) memory requirements). o.o.m.: out of memory. ALS and PSGD did not scale with rank, while CDTF, SALS, and FLEXIFACT did.

Fig. 8: Machine scalability on the S2-scale dataset ((a) speed-up, (b) memory requirements). FLEXIFACT and PSGD did not scale with the number of machines, while CDTF, SALS, and ALS did.

The speed-ups of CDTF, SALS, and ALS increased linearly at the beginning and then flattened out slowly owing to their fixed communication cost, which does not depend on the number of reducers (see Table 3). The speed-up of PSGD flattened out quickly, and PSGD even slowed down slightly at 40 reducers because of the increased overhead. FLEXIFACT slowed down as the number of reducers increased because of its rapidly increasing communication cost. The memory requirements of FLEXIFACT decreased as the number of reducers increased, while the other methods required fixed amounts of memory.

5.4 Convergence (Figure 9)

We compared how quickly and accurately each method factorizes real-world tensors using the PARAFAC model (Section 2.2) and the bias model (Section 3.5.5). Accuracies were measured per iteration by the root mean square error (RMSE) on a held-out test set, which is the metric commonly used in recommender systems.
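Concretely, following the standard definition and writing $\hat{x}_{i_1 \cdots i_N}$ for the model estimate (for the PARAFAC model, $\hat{x}_{i_1 \cdots i_N} = \sum_{k=1}^{K} \prod_{n=1}^{N} a^{(n)}_{i_n k}$), the test RMSE is

\[
\mathrm{RMSE} = \sqrt{\frac{1}{|\Omega_{\mathrm{test}}|} \sum_{(i_1,\ldots,i_N) \in \Omega_{\mathrm{test}}} \left( x_{i_1 \cdots i_N} - \hat{x}_{i_1 \cdots i_N} \right)^2 }.
\]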



Fig. 9: Convergence speed on real-world datasets. (a)-(c) show the results with the PARAFAC model on Movielens4×10, Netflix3, and Yahoo-music4, respectively, and (d)-(f) show the results with the bias model (Section 3.5.5) on the same datasets. SALS and ALS converged fastest to the best solution, and CDTF followed them. CDTF and SALS, however, have much better scalability than ALS, as shown in Section 5.2.

We chose K, λ, and η0 by cross validation; the values used are listed in Table 5. Since the Movielens4 dataset (183MB) is too small to run on Hadoop, we increased its size by duplicating each user 10 times. Due to the non-convexity of (1), the methods converged to local minima with different accuracies.

In all datasets, SALS was comparable with ALS, which converged fastest to the best solution, and CDTF followed them. CDTF and SALS, however, have much better scalability than ALS, as shown in Section 5.2. PSGD converged slowest to the worst solution because of the non-identifiability of (1).

5.5 Optimization (Figure 10)

We measured how our proposed optimization techniques, namely the local disk caching and the greedy row assignment, affect the running time of CDTF, SALS, and the competitors on real-world datasets. The local disk caching speeded up CDTF up to 65.7×, SALS up to 15.5×, and the competitors up to 4.8×. The speed-ups of CDTF and SALS were the most significant because of their highly iterative nature. Additionally, the greedy row assignment speeded up CDTF up to 1.5×, SALS up to 1.3×, and the competitors up to 1.2× compared with the second-best row assignment method. It is not applicable to PSGD, which does not distribute parameters in a row-wise manner.

Fig. 10: Effects of the optimization techniques on running times on (a) Netflix3 and (b) Yahoo-music4. NC: no caching; LC: local disk caching; SEQ: sequential row assignment; RAN: random row assignment; GRE: greedy row assignment (see Section 3.4.3 for the row assignment methods). Our proposed optimization techniques (LC+GRE) significantly speeded up CDTF, SALS, and also their competitors.

5.6 Effects of C and Tin (Figures 11 and 12)

Figure 11 compares the convergence properties of SALS with different C values when Tin is fixed to one. As C increased, SALS tended to converge to better solutions (with lower test RMSE), although it required memory space proportional to C (see Theorem 4). Considering this trade-off, we suggest that C be set to the largest value at which the memory requirements do not exceed the available memory.

We also compared the convergence properties of CDTF with different Tin values in Figure 12. Although there was an exception (Tin = 1 on the Netflix3 dataset), CDTF tended to converge more stably to better solutions (with lower test RMSE) as Tin increased. For SALS with larger C values, however, the effect of inner iterations on convergence was marginal (see Figure 14 in the supplementary document [18] for detailed results). We therefore suggest that C be set first, as described above. Then, if C = 1 (or equivalently, if CDTF is used), Tin should be set to a sufficiently large value, which was ten in our experiments; otherwise, Tin can be set to one.
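This tuning rule can be summarized in the short sketch below. It is a hypothetical helper, not part of our implementation; the per-column-set memory estimate and all names are assumptions.

// Hypothetical tuning helper sketching the rule above; names and the
// per-column-set memory estimate are assumptions, not the authors' code.
public final class SalsTuning {

    // Largest C (at most the rank K) whose estimated memory footprint,
    // which grows roughly in proportion to C (cf. Theorem 4), fits the budget.
    static int chooseC(long availableMemoryBytes, long bytesPerColumnSet, int rankK) {
        int c = 1;
        while (c < rankK && (long) (c + 1) * bytesPerColumnSet <= availableMemoryBytes) {
            c++;
        }
        return c;
    }

    // Tin = 10 when C = 1 (i.e., CDTF); otherwise one inner iteration suffices.
    static int chooseTin(int c) {
        return (c == 1) ? 10 : 1;
    }

    public static void main(String[] args) {
        long heap = 8L * 1024 * 1024 * 1024;    // 8GB heap per reducer (Section 5.1)
        long perColumnSet = 512L * 1024 * 1024; // assumed 512MB per column set
        int c = chooseC(heap, perColumnSet, 80);
        System.out.println("C = " + c + ", Tin = " + chooseTin(c));
    }
}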

6 RELATED WORK

Matrix Factorization. Matrix factorization (MF) has been successfully used in many recommender systems [1], [2], [3]. The underlying intuition of MF for recommendation can be found in [2]. Two major approaches for large-scale MF are alternating least squares (ALS) and stochastic gradient descent (SGD). ALS is inherently parallelizable [3] but has high memory requirements and computational cost [5]. Efforts were made to parallelize SGD [4], [10], [25], including distributed stochastic gradient descent (DSGD) [4], which divides a matrix into disjoint blocks and processes them simultaneously. Recently, coordinate descent was also applied to large-scale MF [5].

Fully Observable Tensor Factorization. Fully observable tensor factorization is a basis for many applications, including chemometrics [14] and signal processing [26]. Comprehensive surveys on tensor factorization can be found in [15], [27].



Fig. 11: Effects of the number of columns updated at a time (C) on the convergence of SALS on (a) Netflix3 and (b) Yahoo-music4. SALS tended to converge to better solutions as C increased.

To factorize large-scale tensors, several approaches have been proposed. [22], [23], [28], [29] carefully reorder operations in the standard ALS or GD algorithm to reduce intermediate data, while [30] modifies the standard ALS algorithm itself for the same purpose. In [31], [32], a tensor is divided into small subtensors, and each subtensor is factorized; the factor matrices of the entire tensor are then reconstituted from those of the subtensors. In [26], [33], a tensor is compressed before being factorized. Among these methods, [22], [23], [28], [29], [31], [32] run on multiple machines in a distributed way. However, all these methods assume that input tensors are fully observable, without missing entries. Thus, they are not directly applicable to partially observable tensors (e.g., rating data where most entries are missing), which we consider in this work.

Partially Observable Tensor Factorization. The factorization of partially observable tensors (i.e., tensors with missing entries) has been used in many fields such as computer vision [34], chemometrics [17], social network analysis [12], and Web search [13]. Recently, it has also been used in recommender systems to utilize additional contextual information such as time and location [6], [7], [8], [9]. Moreover, even when input tensors are fully observable, [35] suggests that converting them to partially observable tensors by random sampling and factorizing the converted tensors can be computationally and space efficient. [17] proposes two approaches for partially observable tensor factorization: (1) combining imputation with the standard ALS algorithm, and (2) fitting the model only to the observable entries. Since the former approach has scalability issues, the latter approach has been taken in gradient-based optimization methods [11], [21] as well as in our methods. [11] is the state-of-the-art distributed algorithm for partially observable tensor factorization, but it has limited scalability w.r.t. the order and the number of machines, as shown in Section 5. Our methods overcome these limitations by using coordinate descent and its generalization instead of gradient-based optimization methods. Both our methods and [11] support coupled tensor factorization, and [28] can also be used for joint factorization with a partially observable matrix (see the Appendix of [28]), although [28] itself is a fully observable tensor factorization method.

Fig. 12: Effects of inner iterations (Tin) on the convergence of CDTF (or equivalently SALS with C = 1) on (a) Netflix3 and (b) Yahoo-music4. CDTF tended to converge stably to better solutions as Tin increased.


MAPREDUCE, Hadoop, and Alternative Frameworks. In this work, we implement our methods on Hadoop, an open-source implementation of MAPREDUCE [36], because of its wide availability. Hadoop is one of the most widely used distributed computing frameworks and is offered by many cloud services (EC2, Azure, Google Cloud, etc.), so it can be used without any hardware or setup expertise. For the same reason, a variety of data mining algorithms, including matrix and tensor factorization [4], [11], have been implemented on Hadoop. However, its limited computational model and its inefficiency due to the repeated redistribution of data have also been pointed out as problems [5], [37]. We address this problem by implementing local disk caching and broadcast communication at the application level, without modifying Hadoop itself. Another, possibly simpler, solution is to use other implementations of MAPREDUCE [38], [39] or Spark [40], which support local disk caching and broadcast communication as native features. In particular, Spark has gained rapid adoption due to its speed and rich feature set.
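To make the idea of application-level local disk caching concrete, the following sketch is an illustration only: the class, file path, and serialization format are assumptions, and our actual implementation is described in the supplementary document [18]. It caches a worker's input entries on its local disk during the first iteration so that later iterations read them locally instead of re-fetching them over the network.

import java.io.*;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal sketch of application-level local disk caching: entries streamed to a
// worker in the first iteration are serialized to a local file, and subsequent
// iterations replay them from that file. Names and format are illustrative only.
public final class LocalDiskCache {
    private final File cacheFile;

    public LocalDiskCache(String taskId) {
        // A local (non-HDFS) scratch file on the worker node; the path is an assumption.
        this.cacheFile = new File(System.getProperty("java.io.tmpdir"), "entries_" + taskId + ".cache");
    }

    // First iteration: write each observed entry (indices and value) to local disk.
    public void write(Iterator<double[]> entries) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(cacheFile)))) {
            while (entries.hasNext()) {
                out.writeObject(entries.next());
            }
        }
    }

    // Later iterations: replay the cached entries without touching the network.
    public List<double[]> read() throws IOException, ClassNotFoundException {
        List<double[]> entries = new ArrayList<>();
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(cacheFile)))) {
            while (true) {
                try {
                    entries.add((double[]) in.readObject());
                } catch (EOFException eof) {
                    break; // end of the cached stream
                }
            }
        }
        return entries;
    }
}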

7 CONCLUSION

In this paper, we propose two distributed algorithms for high-order and large-scale tensor factorization: CDTF and SALS. They decompose a tensor with a given rank through a series of lower-rank factorizations. CDTF and SALS have advantages over each other in terms of memory usage and convergence speed, respectively. We compared our methods with other state-of-the-art distributed methods both analytically and experimentally. Only CDTF and SALS are scalable with all aspects of data (i.e., order, the number of observable entries, mode length, and rank), and they successfully factorized a 5-order tensor with 1B observable entries, 10M mode length, and 1K rank, with up to 493× less memory requirement than the competitors. We implemented our methods on top of MAPREDUCE with two widely-applicable optimization techniques, which accelerated not only CDTF and SALS (up to 98.2×) but also the competitors (up to 5.9×).



REFERENCES

[1] P.-L. Chen, C.-T. Tsai, Y.-N. Chen, K.-C. Chou, C.-L. Li, C.-H. Tsai, K.-W. Wu, Y.-C. Chou, C.-Y. Li, W.-S. Lin, et al., "A linear ensemble of individual and blended models for music rating prediction," KDDCup 2011 Workshop, 2011.
[2] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, vol. 42, no. 8, pp. 30-37, 2009.
[3] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, "Large-scale parallel collaborative filtering for the netflix prize," in AAIM, pp. 337-348, 2008.
[4] R. Gemulla, E. Nijkamp, P. Haas, and Y. Sismanis, "Large-scale matrix factorization with distributed stochastic gradient descent," in KDD, 2011.
[5] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon, "Scalable coordinate descent approaches to parallel matrix factorization for recommender systems," in ICDM, 2012.
[6] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver, "Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering," in RecSys, 2010.
[7] A. Nanopoulos, D. Rafailidis, P. Symeonidis, and Y. Manolopoulos, "Musicbox: Personalized music recommendation based on cubic analysis of social tags," IEEE TASLP, vol. 18, no. 2, pp. 407-412, 2010.
[8] V. W. Zheng, B. Cao, Y. Zheng, X. Xie, and Q. Yang, "Collaborative filtering meets mobile recommendation: A user-centered approach," in AAAI, 2010.
[9] H. Lamba, V. Nagarajan, K. Shin, and N. Shajarisales, "Incorporating side information in tensor completion," in WWW Companion, 2016.
[10] R. McDonald, K. Hall, and G. Mann, "Distributed training strategies for the structured perceptron," in HLT-NAACL, 2010.
[11] A. Beutel, A. Kumar, E. E. Papalexakis, P. P. Talukdar, C. Faloutsos, and E. P. Xing, "Flexifact: Scalable flexible factorization of coupled tensors on hadoop," in SDM, 2014.
[12] D. M. Dunlavy, T. G. Kolda, and E. Acar, "Temporal link prediction using matrix and tensor factorizations," ACM TKDD, vol. 5, no. 2, p. 10, 2011.
[13] J.-T. Sun, H.-J. Zeng, H. Liu, Y. Lu, and Z. Chen, "Cubesvd: a novel approach to personalized web search," in WWW, 2005.
[14] A. Smilde, R. Bro, and P. Geladi, Multi-way analysis: applications in the chemical sciences. John Wiley & Sons, 2005.
[15] T. Kolda and B. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, 2009.
[16] K. Shin and U. Kang, "Distributed methods for high-dimensional and large-scale tensor factorization," in ICDM, 2014.
[17] G. Tomasi and R. Bro, "Parafac and missing values," Chemometr. Intell. Lab. Syst., vol. 75, no. 2, pp. 163-180, 2005.
[18] "Supplementary document (proofs, implementation details, and additional experiments)." Available at http://www.cs.cmu.edu/∼kijungs/codes/cdtf/supple.pdf.
[19] A. Cichocki and P. Anh-Huy, "Fast local algorithms for large scale nonnegative matrix and tensor factorizations," IEICE Trans. Fundamentals, vol. 92, no. 3, pp. 708-721, 2009.
[20] E. Acar, T. G. Kolda, and D. M. Dunlavy, "All-at-once optimization for coupled matrix and tensor factorizations," in MLG, 2011.
[21] E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Mørup, "Scalable tensor factorizations for incomplete data," Chemometr. Intell. Lab. Syst., vol. 106, no. 1, pp. 41-56, 2011.
[22] I. Jeon, E. E. Papalexakis, U. Kang, and C. Faloutsos, "Haten2: Billion-scale tensor decompositions," in ICDE, 2015.
[23] U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos, "Gigatensor: scaling tensor analysis up by 100 times - algorithms and discoveries," in KDD, 2012.
[24] F. Niu, B. Recht, C. Re, and S. J. Wright, "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent," in NIPS, 2011.
[25] M. Zinkevich, M. Weimer, A. J. Smola, and L. Li, "Parallelized stochastic gradient descent," in NIPS, 2010.
[26] N. D. Sidiropoulos and A. Kyrillidis, "Multi-way compressed sensing for sparse low-rank tensors," IEEE Signal Processing Letters, vol. 19, no. 11, pp. 757-760, 2012.
[27] L. Sael, I. Jeon, and U. Kang, "Scalable tensor mining," Big Data Research, vol. 2, no. 2, pp. 82-86, 2015.
[28] J. H. Choi and S. Vishwanathan, "Dfacto: Distributed factorization of tensors," in NIPS, 2014.
[29] B. Jeon, I. Jeon, L. Sael, and U. Kang, "Scout: Scalable coupled matrix-tensor factorization - algorithm and discoveries," in ICDE, pp. 811-822, 2016.
[30] A. H. Phan and A. Cichocki, "Parafac algorithms for large-scale problems," Neurocomputing, vol. 74, no. 11, pp. 1970-1984, 2011.
[31] A. L. de Almeida and A. Y. Kibangou, "Distributed large-scale tensor decomposition," in ICASSP, 2014.
[32] A. L. de Almeida and A. Y. Kibangou, "Distributed computation of tensor decompositions in collaborative networks," in CAMSAP, pp. 232-235, 2013.
[33] J. E. Cohen, R. C. Farias, and P. Comon, "Fast decomposition of large nonnegative tensors," IEEE Signal Processing Letters, vol. 22, no. 7, pp. 862-866, 2015.
[34] J. Liu, P. Musialski, P. Wonka, and J. Ye, "Tensor completion for estimating missing values in visual data," IEEE TPAMI, vol. 35, no. 1, pp. 208-220, 2013.
[35] N. Vervliet, O. Debals, L. Sorber, and L. De Lathauwer, "Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 71-79, 2014.
[36] J. Dean and S. Ghemawat, "Mapreduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[37] C. Teflioudi, F. Makari, and R. Gemulla, "Distributed matrix completion," in ICDM, 2012.
[38] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "Haloop: efficient iterative data processing on large clusters," PVLDB, vol. 3, no. 1-2, pp. 285-296, 2010.
[39] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, "Twister: a runtime for iterative mapreduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC), pp. 810-818, 2010.
[40] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in USENIX, 2010.

Kijung Shin is a Ph.D. student in the Computer Science Department of Carnegie Mellon University. He received his B.S. in Computer Science and Engineering at Seoul National University. His research interests include graph mining and scalable machine learning.

Lee Sael, aka Sael Lee, holds a joint position as an assistant professor in the Department of Computer Science at SUNY Korea and an assistant research professor in the Department of Computer Science at Stony Brook University. She received her Ph.D. in Computer Science from Purdue University, West Lafayette, IN, in 2010, and her B.S. in Computer Science from Korea University, Seoul, Republic of Korea, in 2005.

U Kang is an assistant professor in the Department of Computer Science and Engineering of Seoul National University. He received his Ph.D. in Computer Science at Carnegie Mellon University, and his B.S. in Computer Science and Engineering at Seoul National University. He won the 2013 SIGKDD Doctoral Dissertation Award, the 2013 New Faculty Award from Microsoft Research Asia, and two best paper awards. His research interests include data mining.