
NOMAD: Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion

Hyokun Yun
Purdue University
yun3@purdue.edu

Hsiang-Fu Yu
University of Texas, Austin
rofuyu@cs.utexas.edu

Cho-Jui Hsieh
University of Texas, Austin
cjhsieh@cs.utexas.edu

S. V. N. Vishwanathan
Purdue University
vishy@stat.purdue.edu

Inderjit Dhillon
University of Texas, Austin
inderjit@cs.utexas.edu

ABSTRACT

We develop an efficient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion). NOMAD is a decentralized algorithm with non-blocking communication between processors. One of the key features of NOMAD is that the ownership of a variable is asynchronously transferred between processors in a decentralized fashion. As a consequence, it is a lock-free parallel algorithm. In spite of being asynchronous, the variable updates of NOMAD are serializable, that is, there is an equivalent update ordering in a serial implementation. NOMAD outperforms synchronous algorithms which require explicit bulk synchronization after every iteration: our extensive empirical evaluation shows that our algorithm not only performs well in a distributed setting on commodity hardware, but also outperforms state-of-the-art algorithms on an HPC cluster, both in multi-core and distributed-memory settings.

1. INTRODUCTION

The aim of this paper is to develop an efficient parallel distributed algorithm for matrix completion. We are specifically interested in solving large industrial-scale matrix completion problems on commodity hardware with limited computing power, memory, and interconnect speed, such as the machines found in data centers. The widespread availability of cloud computing platforms such as Amazon Web Services (AWS) makes the deployment of such systems feasible.

However, existing algorithms for matrix completion are designed for conventional high performance computing (HPC) platforms. In order to deploy them on commodity hardware, we need to employ a large number of machines, which increases inter-machine communication. Since the network bandwidth in data centers is significantly lower and less reliable than the high-speed interconnects typically found in HPC hardware, this can often have disastrous consequences in terms of convergence speed or the quality of the solution.

In this paper, we present NOMAD (Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion), a new parallel algorithm for matrix completion with the following properties:

• Non-blocking communication: Processors exchange messages in an asynchronous fashion [6], and there is no bulk synchronization.

• Decentralized: Processors are symmetric to each other, and each processor does the same amount of computation and communication.

• Lock free: Using an owner-computes paradigm, we completely eliminate the need for locking variables.

• Fully asynchronous computation: Because of the lock-free nature of our algorithm, the variable updates in individual processors are fully asynchronous.

• Serializability: There is an equivalent update ordering in a serial implementation. In our algorithm, stale parameters are never used, and this empirically leads to faster convergence [17].

Our extensive empirical evaluation shows that our algorithm not only performs well in a distributed setting on commodity hardware, but also outperforms state-of-the-art algorithms on an HPC cluster, both in multi-core and distributed-memory settings. We show that our algorithm is significantly better than existing multi-core and multi-machine algorithms for the matrix completion problem.

This paper is organized as follows: Section 2 establishes some notation and introduces the matrix completion problem formally. Section 3 is devoted to describing NOMAD. We contrast NOMAD with existing work in Section 4. In Section 5, we present an extensive empirical comparison of NOMAD with various existing algorithms. Section 6 concludes the paper with a discussion.

2. BACKGROUND

Let A ∈ ℝ^{m×n} be a rating matrix, where m denotes the number of users and n the number of items. Typically m ≫ n, although the algorithms we consider in this paper do not depend on such an assumption. Furthermore, let Ω ⊆ {1, …, m} × {1, …, n} denote the observed entries of A; that is, (i, j) ∈ Ω implies that user i gave item j a rating of A_ij. The goal here is to accurately predict the unobserved ratings. For convenience, we define Ω_i to be the set of items rated by the i-th user, i.e., Ω_i := {j : (i, j) ∈ Ω}. Analogously, Ω̄_j := {i : (i, j) ∈ Ω} is the set of users who have rated item j. Also, let a_i^⊤ denote the i-th row of A.

One popular model for matrix completion finds matrices W ∈ ℝ^{m×k} and H ∈ ℝ^{n×k}, with k ≪ min(m, n), such that A ≈ WH^⊤. One way to understand this model is to realize that each row w_i^⊤ ∈ ℝ^k of W can be thought of as a k-dimensional embedding of the user. Analogously, each row h_j^⊤ ∈ ℝ^k of H is an embedding of the item in the same k-dimensional space. In order to predict the (i, j)-th entry of A, we simply use ⟨w_i, h_j⟩, where ⟨·, ·⟩ denotes the Euclidean inner product of two vectors. The goodness of fit of the model is measured by a loss function. While our optimization algorithm can work with an arbitrary separable loss, for ease of exposition we will only discuss the square loss ½ (A_ij − ⟨w_i, h_j⟩)². Furthermore, we need to enforce regularization to prevent over-fitting and to predict well on the unknown entries of A. Again, a variety of regularizers can be handled by our algorithm, but in this paper we will only focus on the following weighted square norm-regularization:

$$\frac{\lambda}{2} \sum_{i=1}^{m} |\Omega_i| \cdot \|\mathbf{w}_i\|^2 + \frac{\lambda}{2} \sum_{j=1}^{n} \bigl|\bar{\Omega}_j\bigr| \cdot \|\mathbf{h}_j\|^2,$$

where λ > 0 is a regularization parameter. Here, |·| denotes the cardinality of a set and ‖·‖ is the L2 norm of a vector. Putting everything together yields the following objective function:

$$\min_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{n \times k}} \; J(W, H) = \frac{1}{2} \sum_{(i,j) \in \Omega} \left( A_{ij} - \langle \mathbf{w}_i, \mathbf{h}_j \rangle \right)^2 + \frac{\lambda}{2} \left( \sum_{i=1}^{m} |\Omega_i| \cdot \|\mathbf{w}_i\|^2 + \sum_{j=1}^{n} \bigl|\bar{\Omega}_j\bigr| \cdot \|\mathbf{h}_j\|^2 \right). \qquad (1)$$

This can be further simplified and written as

$$J(W, H) = \frac{1}{2} \sum_{(i,j) \in \Omega} \left[ \left( A_{ij} - \langle \mathbf{w}_i, \mathbf{h}_j \rangle \right)^2 + \lambda \left( \|\mathbf{w}_i\|^2 + \|\mathbf{h}_j\|^2 \right) \right].$$

In the above equations, λ > 0 is a scalar which trades off the loss function with the regularizer.
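For concreteness, the objective (1) can be evaluated directly from the observed triples. The following is a small illustrative sketch (ours, not part of the original paper); `ratings` is assumed to be a list of (i, j, A_ij) triples for the entries in Ω, and the function name is ours.

```python
import numpy as np

def objective(W, H, ratings, lam):
    """Evaluate J(W, H) of Eq. (1).
    W: (m, k) array, H: (n, k) array,
    ratings: list of (i, j, A_ij) triples for (i, j) in Omega."""
    loss = sum((a_ij - W[i] @ H[j]) ** 2 for i, j, a_ij in ratings)
    # weighted regularizer: |Omega_i| * ||w_i||^2 and |Omega_bar_j| * ||h_j||^2
    n_user = np.zeros(W.shape[0])
    n_item = np.zeros(H.shape[0])
    for i, j, _ in ratings:
        n_user[i] += 1
        n_item[j] += 1
    reg = n_user @ (W ** 2).sum(axis=1) + n_item @ (H ** 2).sum(axis=1)
    return 0.5 * loss + 0.5 * lam * reg
```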

In the sequel, we will let w_il and h_jl, for 1 ≤ l ≤ k, denote the l-th coordinate of the vectors w_i and h_j, respectively. Furthermore, H_{Ω_i} (resp. W_{Ω̄_j}) will be used to denote the sub-matrix of H (resp. W) formed by collecting the rows corresponding to Ω_i (resp. Ω̄_j).

Note the following property of the above objective function (1): if we fix H, then the problem decomposes into m independent convex optimization problems, each of which has the following form:

$$\min_{\mathbf{w}_i \in \mathbb{R}^k} \; J_i(\mathbf{w}_i) = \frac{1}{2} \sum_{j \in \Omega_i} \left( A_{ij} - \langle \mathbf{w}_i, \mathbf{h}_j \rangle \right)^2 + \lambda\, \|\mathbf{w}_i\|^2. \qquad (2)$$

Analogously, if we fix W, then (1) decomposes into n independent convex optimization problems, each of which has the following form:

$$\min_{\mathbf{h}_j \in \mathbb{R}^k} \; J_j(\mathbf{h}_j) = \frac{1}{2} \sum_{i \in \bar{\Omega}_j} \left( A_{ij} - \langle \mathbf{w}_i, \mathbf{h}_j \rangle \right)^2 + \lambda\, \|\mathbf{h}_j\|^2.$$

The gradient and Hessian of J_i(w_i) can be easily computed:

$$\nabla J_i(\mathbf{w}_i) = M \mathbf{w}_i - \mathbf{b} \quad \text{and} \quad \nabla^2 J_i(\mathbf{w}_i) = M,$$

where we have defined M := H_{Ω_i}^⊤ H_{Ω_i} + λI and b := H_{Ω_i}^⊤ a_i.

We will now present three well-known optimization strategies for solving (1), which essentially differ in only two characteristics, namely the sequence in which the updates to the variables in W and H are carried out, and the level of approximation in the update.

2.1 Alternating Least Squares

A simple version of the Alternating Least Squares (ALS) algorithm updates the variables in the following order: w_1, w_2, …, w_m, h_1, h_2, …, h_n, w_1, and so on. Updates to w_i are computed by solving (2), which is in fact a least squares problem, and thus a single Newton update

$$\mathbf{w}_i \leftarrow \mathbf{w}_i - \left[ \nabla^2 J_i(\mathbf{w}_i) \right]^{-1} \nabla J_i(\mathbf{w}_i) \qquad (3)$$

yields the exact minimizer; this can be rewritten using M and b as w_i ← M^{-1} b. Updates to the h_j's are analogous.
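As a concrete illustration (ours, not the authors' code), the per-user ALS update w_i ← M^{-1} b can be computed with a dense linear solve; `H_omega_i` and `a_omega_i` are assumed to hold the rows of H and the observed ratings for the items in Ω_i.

```python
import numpy as np

def als_update_user(H_omega_i, a_omega_i, lam):
    """ALS update for one user (Eq. 3): w_i <- M^{-1} b with
    M = H_{Omega_i}^T H_{Omega_i} + lambda * I and b = H_{Omega_i}^T a_i."""
    k = H_omega_i.shape[1]
    M = H_omega_i.T @ H_omega_i + lam * np.eye(k)
    b = H_omega_i.T @ a_omega_i
    return np.linalg.solve(M, b)  # solve the k x k system rather than forming M^{-1}
```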

2.2 Coordinate Descent

The ALS update involves forming the Hessian and inverting it. In order to reduce the computational complexity, one can replace the Hessian by its diagonal approximation:

$$\mathbf{w}_i \leftarrow \mathbf{w}_i - \left[ \operatorname{diag}\left( \nabla^2 J_i(\mathbf{w}_i) \right) \right]^{-1} \nabla J_i(\mathbf{w}_i), \qquad (4)$$

which can be rewritten using M and b as

$$\mathbf{w}_i \leftarrow \mathbf{w}_i - \operatorname{diag}(M)^{-1} \left[ M \mathbf{w}_i - \mathbf{b} \right]. \qquad (5)$$

If we update one coordinate of w_i at a time, the update (5) can be written as

$$w_{il} \leftarrow w_{il} - \frac{\langle \mathbf{m}_l, \mathbf{w}_i \rangle - b_l}{m_{ll}}, \qquad (6)$$

where m_l is the l-th row of M, b_l is the l-th component of b, and m_ll is the l-th coordinate of m_l. If we choose the update sequence w_11, …, w_1k, w_21, …, w_2k, …, w_m1, …, w_mk, h_11, …, h_1k, h_21, …, h_2k, …, h_n1, …, h_nk, w_11, …, w_1k, and so on, then this recovers Cyclic Coordinate Descent (CCD) [15]. On the other hand, the update sequence w_11, …, w_m1, h_11, …, h_n1, w_12, …, w_m2, h_12, …, h_n2, and so on recovers the CCD++ algorithm of Yu et al. [26]. The CCD++ updates can be performed more efficiently than the CCD updates by maintaining a residual matrix [26].

2.3 Stochastic Gradient Descent

The stochastic gradient descent (SGD) algorithm for matrix completion can be motivated from its more classical version, gradient descent. Given an objective function f(θ) = (1/m) Σ_{i=1}^{m} f_i(θ), the gradient descent update is

$$\theta \leftarrow \theta - s_t \cdot \nabla_\theta f(\theta), \qquad (7)$$

where t denotes the iteration number and {s_t} is a sequence of step sizes. The stochastic gradient descent update replaces ∇_θ f(θ) by its unbiased estimate ∇_θ f_i(θ), which yields

$$\theta \leftarrow \theta - s_t \cdot \nabla_\theta f_i(\theta). \qquad (8)$$

It is significantly cheaper to evaluate ∇_θ f_i(θ) as compared to ∇_θ f(θ). One can show that for sufficiently large t, the above updates converge to a fixed point of f [16, 21]. The above update also enjoys desirable properties in terms of sample complexity, and hence is widely used in machine learning [7, 22].


[Figure 1: Illustration of updates used in matrix completion on the user-item graph. Three algorithms are shown: (a) alternating least squares and coordinate descent, (b) stochastic gradient descent. Black indicates that the value of the node is being updated, gray indicates that the value of the node is being read, and white nodes are neither read nor updated.]

For the matrix completion problem, note that for a fixed (i, j) pair, the gradient of the corresponding term of (1) can be written as

$$\nabla_{\mathbf{w}_i} J(W, H) = \left( \langle \mathbf{w}_i, \mathbf{h}_j \rangle - A_{ij} \right) \mathbf{h}_j + \lambda \mathbf{w}_i \quad \text{and} \quad \nabla_{\mathbf{h}_j} J(W, H) = \left( \langle \mathbf{w}_i, \mathbf{h}_j \rangle - A_{ij} \right) \mathbf{w}_i + \lambda \mathbf{h}_j.$$

Therefore, the SGD updates require sampling a random index (i_t, j_t) uniformly from the set of nonzero indices Ω and performing the updates

$$\mathbf{w}_{i_t} \leftarrow \mathbf{w}_{i_t} - s_t \cdot \left[ \left( \langle \mathbf{w}_{i_t}, \mathbf{h}_{j_t} \rangle - A_{i_t j_t} \right) \mathbf{h}_{j_t} + \lambda \mathbf{w}_{i_t} \right] \quad \text{and} \qquad (9)$$

$$\mathbf{h}_{j_t} \leftarrow \mathbf{h}_{j_t} - s_t \cdot \left[ \left( \langle \mathbf{w}_{i_t}, \mathbf{h}_{j_t} \rangle - A_{i_t j_t} \right) \mathbf{w}_{i_t} + \lambda \mathbf{h}_{j_t} \right]. \qquad (10)$$
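A small sketch (ours, for illustration) of the SGD updates (9) and (10) on a single observed rating; the old values of w_i and h_j are used on both right-hand sides:

```python
import numpy as np

def sgd_update(w_i, h_j, a_ij, s_t, lam):
    """One SGD step on the rating A_ij (Eqs. 9 and 10)."""
    err = w_i @ h_j - a_ij                 # <w_i, h_j> - A_ij
    w_new = w_i - s_t * (err * h_j + lam * w_i)
    h_new = h_j - s_t * (err * w_i + lam * h_j)
    return w_new, h_new
```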

3. NOMAD

In NOMAD, we use an optimization scheme based on SGD. In order to justify this choice, we find it instructive to first understand the updates performed by ALS, coordinate descent, and SGD on a bipartite graph, which is constructed as follows: the i-th user node corresponds to w_i, the j-th item node corresponds to h_j, and an edge (i, j) indicates that user i has rated item j (see Figure 1). Both the ALS update (3) and the coordinate descent update (6) for w_i require us to access the values of h_j for all j ∈ Ω_i. This is shown in Figure 1 (a), where the black node corresponds to w_i, while the gray nodes correspond to h_j for j ∈ Ω_i. On the other hand, the SGD update (9) to w_i only requires us to retrieve the value of h_j for a single random j ∈ Ω_i (Figure 1 (b)). What this means is that, in contrast to ALS or CCD, multiple SGD updates can be carried out simultaneously in parallel without interfering with each other. Put another way, SGD has a higher potential for finer-grained parallelism than the other approaches, and therefore we use it as our optimization scheme in NOMAD.

3.1 Description

For now, we will denote each parallel computing unit as a worker: in a shared-memory setting a worker is a thread, and in a distributed-memory architecture a worker is a machine. This abstraction allows us to present NOMAD in a unified manner. Of course, NOMAD can be used in a hybrid setting where multiple threads are spread across multiple machines; this will be discussed in Section 3.4.

In NOMAD, the users {1, 2, …, m} are split into p disjoint sets I_1, I_2, …, I_p, which are of approximately equal size.¹ This induces a partition of the rows of the ratings matrix A. The q-th worker stores n sets of indices Ω_j^(q), for j ∈ {1, …, n}, which are defined as

$$\Omega_j^{(q)} := \left\{ (i, j) \in \Omega : i \in I_q \right\},$$

as well as the corresponding values of A. Note that once the data is partitioned and distributed to the workers, it is never moved during the execution of the algorithm.

Recall that there are two types of parameters in matrix completion: user parameters w_i and item parameters h_j. In NOMAD, the w_i's are partitioned according to I_1, I_2, …, I_p; that is, the q-th worker stores and updates w_i for i ∈ I_q. The variables in W are partitioned at the beginning and never move across workers during the execution of the algorithm. On the other hand, the h_j's are split randomly into p partitions at the beginning, and their ownership changes as the algorithm progresses. At each point in time, an h_j variable resides in one and only one worker, and it moves to another worker after it is processed, independently of the other item variables. Hence, these are nomadic variables.²

Processing an item variable h_j at the q-th worker entails executing the SGD updates (9) and (10) on the ratings in the set Ω_j^(q). Note that these updates only require access to h_j and to w_i for i ∈ I_q; since the I_q's are disjoint, each w_i variable is accessed by only one worker. This is why communication of the w_i variables is not necessary. On the other hand, h_j is updated only by the worker that currently owns it, so there is no need for a lock; this is the popular owner-computes rule in parallel computing. See Figure 2.

We now formally define the NOMAD algorithm (see Algorithm 1 for detailed pseudo-code). Each worker q maintains its own concurrent queue, queue[q], which contains a list of items it has to process. Each element of the list consists of the index of an item j (1 ≤ j ≤ n) and the corresponding k-dimensional parameter vector h_j; this pair is denoted as (j, h_j). Each worker q pops a (j, h_j) pair from its own queue, queue[q], and runs stochastic gradient descent updates on Ω_j^(q), which is the set of ratings of item j locally stored in worker q (lines 16 to 21). This changes the values of w_i for i ∈ I_q and of h_j. After all the updates on item j are done, a uniformly random worker q′ is sampled (line 22), and the updated (j, h_j) pair is pushed into the queue of that worker q′ (line 23). Note that this is the only time a worker communicates with another worker. Also note that the nature of this communication is asynchronous and non-blocking. Furthermore, as long as there are items in the queue, the computations are completely asynchronous and decentralized. Moreover, all workers are symmetric; that is, there is no designated master or slave.

¹ An alternative strategy is to split the users such that each set has approximately the same number of ratings.
² Due to symmetry in the formulation of the matrix completion problem, one can also make the w_i's nomadic and partition the h_j's. Since the number of users is usually much larger than the number of items, this would lead to more communication, and therefore we make the h_j variables nomadic.

[Figure 2: Illustration of the NOMAD algorithm. (a) Initial assignment of W and H; each worker works only on the diagonal active area in the beginning. (b) After a worker finishes processing column j, it sends the corresponding item parameter h_j to another worker; here h_2 is sent from worker 1 to worker 4. (c) Upon receipt, the column is processed by the new worker; here worker 4 can now process column 2, since it owns the column. (d) During the execution of the algorithm, the ownership of the item parameters h_j changes.]

Algorithm 1: the basic NOMAD algorithm

 1: λ: regularization parameter
 2: {s_t}: step-size sequence
 3: // initialize parameters
 4: w_il ∼ UniformReal(0, 1/√k) for 1 ≤ i ≤ m, 1 ≤ l ≤ k
 5: h_jl ∼ UniformReal(0, 1/√k) for 1 ≤ j ≤ n, 1 ≤ l ≤ k
 6: // initialize queues
 7: for j ∈ {1, 2, …, n} do
 8:   q ∼ UniformDiscrete{1, 2, …, p}
 9:   queue[q].push((j, h_j))
10: end for
11: // start p workers
12: Parallel Foreach q ∈ {1, 2, …, p}
13:   while stop signal is not yet received do
14:     if queue[q] is not empty then
15:       (j, h_j) ← queue[q].pop()
16:       for (i, j) ∈ Ω_j^(q) do
17:         // SGD update
18:         t ← number of updates performed on (i, j)
19:         w_i ← w_i − s_t · [(⟨w_i, h_j⟩ − A_ij) h_j + λ w_i]
20:         h_j ← h_j − s_t · [(⟨w_i, h_j⟩ − A_ij) w_i + λ h_j]
21:       end for
22:       q′ ∼ UniformDiscrete{1, 2, …, p}
23:       queue[q′].push((j, h_j))
24:     end if
25:   end while
26: Parallel End
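To make the control flow of Algorithm 1 concrete, the following is a minimal shared-memory sketch (illustrative only, not the authors' C++ implementation): each worker is a thread, ownership of (j, h_j) pairs is transferred through per-worker queues, and a constant step size is used instead of the schedule s_t. All names and parameter values are ours.

```python
import queue
import threading
import time

import numpy as np

def nomad_worker(q, queues, W, local_ratings, s, lam, stop, rng):
    """One NOMAD worker (cf. Algorithm 1, lines 13-25). It exclusively owns the
    rows of W for its users, so no locks are needed; item parameters h_j travel
    between workers through the queues (owner-computes)."""
    while not stop.is_set():
        try:
            j, h_j = queues[q].get(timeout=0.1)          # line 15: pop a nomadic item
        except queue.Empty:
            continue
        for i, a_ij in local_ratings.get(j, []):         # lines 16-21: SGD on Omega_j^(q)
            w_i = W[i].copy()
            err = w_i @ h_j - a_ij                       # <w_i, h_j> - A_ij
            W[i] = w_i - s * (err * h_j + lam * w_i)
            h_j = h_j - s * (err * w_i + lam * h_j)
        queues[int(rng.integers(len(queues)))].put((j, h_j))  # lines 22-23: hand h_j on

if __name__ == "__main__":
    m, n, k, p = 200, 50, 10, 4
    rng = np.random.default_rng(0)
    W = rng.uniform(0, 1 / np.sqrt(k), (m, k))
    H = rng.uniform(0, 1 / np.sqrt(k), (n, k))
    owner = rng.integers(p, size=m)                      # user i lives on worker owner[i]
    local = [{} for _ in range(p)]                       # worker -> {j: [(i, A_ij), ...]}
    for _ in range(5000):
        i, j = int(rng.integers(m)), int(rng.integers(n))
        local[owner[i]].setdefault(j, []).append((i, float(rng.normal())))
    queues = [queue.Queue() for _ in range(p)]
    for j in range(n):                                   # lines 7-10: items start at random workers
        queues[int(rng.integers(p))].put((j, H[j].copy()))
    stop = threading.Event()
    threads = [threading.Thread(target=nomad_worker,
                                args=(q, queues, W, local[q], 0.01, 0.05, stop,
                                      np.random.default_rng(q + 1)))
               for q in range(p)]
    for t in threads:
        t.start()
    time.sleep(1.0)                                      # let the workers run briefly
    stop.set()
    for t in threads:
        t.join()
```

Because each worker's local ratings only involve the users it owns, the rows of W are updated without any locking, mirroring the owner-computes rule described above.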

3.2 Complexity Analysis

First, we consider the case when the problem is distributed across p workers, and study how the space and time complexity behave as a function of p. Each worker has to store a 1/p fraction of the m user parameters and approximately a 1/p fraction of the n item parameters. Furthermore, each worker also stores approximately a 1/p fraction of the |Ω| ratings. Since storing a row of W or H requires O(k) space, the space complexity per worker is O((mk + nk + |Ω|)/p). As for time complexity, we find it useful to make the following assumptions: performing the SGD updates in lines 16 to 21 takes a·k time, and communicating a (j, h_j) pair to another worker takes c·k time, where a and c are hardware-dependent constants. On average, each (j, h_j) pair corresponds to O(|Ω|/(np)) non-zero entries. Therefore, when a (j, h_j) pair is popped from queue[q] in line 15 of Algorithm 1, it takes a·(|Ω|k/(np)) time on average to process the pair. Since computation and communication can be done in parallel, as long as a·(|Ω|k/(np)) is higher than c·k, a worker thread is always busy and NOMAD scales linearly.

Suppose that |Ω| is fixed but the number of workers p increases; that is, we take a fixed-size dataset and distribute it across p workers. As expected, for a large enough value of p (which is determined by the hardware-dependent constants a and c), the cost of communication will overwhelm the cost of processing an item, thus leading to a slowdown.

On the other hand, suppose the work per worker is fixed; that is, |Ω| increases and the number of workers p increases proportionally. The average time a·(|Ω|k/(np)) to process an item remains constant, and NOMAD scales linearly.

Finally, we discuss the communication complexity of NOMAD. For this discussion, we focus on a single item parameter h_j, which consists of O(k) numbers. In order to be processed by all p workers once, it needs to be communicated p times; this requires O(kp) communication per item. There are n items, and if we make the simplifying assumption that during the execution of NOMAD each item is processed a constant number of times by each worker, then the total communication complexity is O(nkp).

3.3 Dynamic Load Balancing

As different workers have different numbers of ratings per item, the speed at which a worker processes the set of ratings Ω_j^(q) for an item j also varies among workers. Furthermore, in the distributed-memory setting, different workers might process updates at different rates due to differences in hardware and system load. NOMAD can handle this by dynamically balancing the workload of the workers: in line 22 of Algorithm 1, instead of sampling the recipient of a message uniformly at random, we can preferentially select a worker which has fewer items in its queue to process. To do this, a payload carrying information about the size of queue[q] is added to the messages that the workers send each other. The overhead of passing the payload information is just a single integer per message. This scheme allows us to dynamically load balance, and ensures that a slower worker receives a smaller amount of work compared to the others.
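The paper does not specify the exact preference rule, so the following is only an illustrative choice (an assumption on our part) that sends the nomadic pair to shorter queues with higher probability:

```python
import numpy as np

def pick_recipient(queue_sizes, rng):
    """Choose the worker that receives the nomadic (j, h_j) pair, giving
    higher probability to workers whose queues are currently shorter."""
    w = 1.0 / (np.asarray(queue_sizes, dtype=float) + 1.0)
    return int(rng.choice(len(w), p=w / w.sum()))
```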

3.4 Hybrid Architecture

In a hybrid architecture, we have multiple threads on a single machine as well as multiple machines distributed across the network. In this case, we make two improvements to the basic algorithm. First, in order to amortize the communication costs, we reserve two additional threads per machine for sending and receiving (j, h_j) pairs over the network. Intra-machine communication is much cheaper than machine-to-machine communication, since the former does not involve a network hop. Therefore, whenever a machine receives a (j, h_j) pair, it circulates the pair among all of its threads before sending the pair over the network. This is done by uniformly sampling a random permutation whose size equals the number of worker threads, and sending the item variable to each thread according to this permutation. Circulating a variable more than once was found not to improve convergence, and hence is not used in our algorithm.

3.5 Implementation Details

Multi-threaded MPI was used for inter-machine communication. Instead of communicating single (j, h_j) pairs, we follow the strategy of [23] and accumulate a fixed number of pairs (e.g., 100) before transmitting them over the network.

NOMAD can be implemented with lock-free data structures, since the only interaction between threads is via operations on the queue. We used the concurrent queue provided by Intel Threading Building Blocks (TBB) [3]. Although technically not lock-free, the TBB concurrent queue nevertheless scales almost linearly with the number of threads.

Since there is very minimal sharing of memory across threads in NOMAD, by carefully aligning the memory assignments in each thread with cache lines, we can exploit memory locality and avoid cache ping-pong. This results in near-linear scaling for the multi-threaded setting.

4. RELATED WORK

4.1 Map-Reduce and Friends

Since many machine learning algorithms are iterative in nature, a popular strategy to distribute them across multiple machines is to use bulk synchronization after every iteration. Typically, one partitions the data into chunks that are distributed to the workers at the beginning. A master communicates the current parameters, which are used to perform computations on the slaves. The slaves return their solutions during the bulk synchronization step, which are used by the master to update the parameters. The popularity of this strategy is partly thanks to the widespread availability of Hadoop [1], an open source implementation of the MapReduce framework [9].

All three optimization schemes for matrix completion, namely ALS, CCD++, and SGD, can be parallelized using a bulk synchronization strategy. This is relatively simple for ALS [27] and CCD++ [26], but a bit more involved for SGD [12, 18]. Suppose p machines are available. Then the Distributed Stochastic Gradient Descent (DSGD) algorithm of Gemulla et al. [12] partitions the indices of the users, {1, 2, …, m}, into mutually exclusive sets I_1, I_2, …, I_p, and the indices of the items into J_1, J_2, …, J_p. Now define

$$\Omega^{(q)} := \left\{ (i, j) \in \Omega : i \in I_q,\; j \in J_q \right\}, \quad 1 \le q \le p,$$

and suppose that each machine runs the SGD updates (9) and (10) independently, but machine q samples (i, j) pairs only from Ω^(q). By construction, the Ω^(q)'s are disjoint, and hence these updates can be run in parallel. A similar observation was also made by Recht and Ré [18]. A bulk synchronization step redistributes the sets J_1, J_2, …, J_p and the corresponding rows of H, which in turn changes the Ω^(q) processed by each machine, and the iteration proceeds (see Figure 3).
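For illustration (not the authors' code), the diagonal strata Ω^(q) used by DSGD in one inner epoch can be collected as follows, assuming `user_block[i]` and `item_block[j]` give the indices q of the sets I_q and J_q containing user i and item j:

```python
def dsgd_strata(ratings, user_block, item_block, p):
    """Collect Omega^(q) = {(i, j) in Omega : i in I_q, j in J_q} for each q.
    Only these diagonal blocks are processed in parallel before the next
    bulk synchronization step permutes the item blocks."""
    strata = [[] for _ in range(p)]
    for i, j, a_ij in ratings:
        q = user_block[i]
        if item_block[j] == q:
            strata[q].append((i, j, a_ij))
    return strata
```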

[Figure 3: Illustration of the DSGD algorithm with 4 workers. Initially, W and H are partitioned as shown on the left. Each worker runs SGD on its active area, as indicated. After each worker completes processing the data points in its own active area, the columns of the item parameters H^⊤ are exchanged randomly and the active area changes. This process is repeated for each iteration.]

Unfortunately, bulk synchronization based algorithms have two major drawbacks. First, the communication and computation steps are done in sequence. What this means is that when the CPU is busy, the network is idle, and vice versa. The second issue is that they suffer from what is widely known as the curse of the last reducer [4, 24]. In other words, all machines have to wait for the slowest machine to finish before proceeding to the next iteration. Zhuang et al. [28] report that DSGD suffers from this problem even in the shared-memory setting.

DSGD++ is an algorithm proposed by Teflioudi et al. [25] to address the first issue discussed above. Instead of using p partitions, DSGD++ uses 2p partitions. While the p workers are processing p partitions, the other p partitions are sent over the network. This keeps both the network and the CPU busy simultaneously. However, DSGD++ also suffers from the curse of the last reducer.

Another attempt to alleviate the problems of bulk synchronization in the shared-memory setting is the FPSGD algorithm of Zhuang et al. [28]: given p threads, FPSGD partitions the parameters into more than p sets and uses a task manager thread to distribute the partitions. When a thread finishes updating one partition, it requests another partition from the task manager. It is unclear how to extend this idea to the distributed-memory setting.

In NOMAD, we sidestep all the drawbacks of bulk synchronization. Like DSGD++, we simultaneously keep the network and the CPU busy. On the other hand, like FPSGD, we effectively load balance between the threads. To understand why NOMAD enjoys both of these benefits, it is instructive to contrast the data partitioning schemes underlying DSGD, DSGD++, FPSGD, and NOMAD (see Figure 4). Given p workers, DSGD divides the rating matrix A into p × p blocks. DSGD++ improves upon DSGD by further dividing each block into 1 × 2 sub-blocks (Figure 4 (a) and (b)). On the other hand, FPSGD splits A into p′ × p′ blocks with p′ > p (Figure 4 (c)), while NOMAD uses p × n blocks (Figure 4 (d)). In terms of communication, there is no difference between the various partitioning schemes: all of them require O(nkp) communication for each item to be processed a constant number of times. However, having smaller blocks means that NOMAD has much more flexibility in assigning blocks to processors, and hence a better ability to exploit parallelism. Because NOMAD operates at the level of individual item parameters h_j, it can dynamically load balance by assigning fewer columns to a slower worker. A pleasant side effect of such a fine-grained partitioning, coupled with the lock-free nature of the updates, is that one does not require sophisticated scheduling algorithms to achieve good performance. Consequently, NOMAD outperforms DSGD, DSGD++, and FPSGD.

[Figure 4: Comparison of data partitioning schemes between algorithms: (a) DSGD, (b) DSGD++, (c) FPSGD, (d) NOMAD. An example active area of stochastic gradient sampling is marked in gray.]

4.2 Asynchronous Algorithms

There is growing interest in designing machine learning algorithms that do not perform bulk synchronization. See, for instance, the randomized (block) coordinate descent methods of Richtárik and Takáč [20] and the Hogwild algorithm of Recht et al. [19]. A relatively new approach to asynchronous parallelism is to use a so-called parameter server. A parameter server is either a single machine or a distributed set of machines which caches the current values of the parameters. Workers store local copies of the parameters and perform updates on them, and periodically synchronize their local copies with the parameter server. The parameter server receives updates from all workers, aggregates them, and communicates them back to the workers. The earliest work on a parameter server that we are aware of is due to Smola and Narayanamurthy [23], who propose using a parameter server for collapsed Gibbs sampling in Latent Dirichlet Allocation. PowerGraph [13], upon which the latest version of the GraphLab toolkit is based, is also essentially based on the idea of a parameter server. However, the difference in the case of PowerGraph is that the responsibility for the parameters is distributed across multiple machines, but at the added expense of synchronizing the copies.

Very roughly speaking, the asynchronously parallel version of the ALS algorithm in GraphLab works as follows: the w_i and h_j variables are distributed across multiple machines, and whenever w_i is being updated with equation (3), the values of the h_j's for j ∈ Ω_i are retrieved across the network and read-locked until the update is finished. GraphLab provides functionality such as network communication and a distributed locking mechanism to implement this. However, frequently acquiring read-locks over the network can be expensive. In particular, a popular user who has rated many items will require read locks on a large number of items, and this will lead to a vast amount of communication and delays in updates on those items. GraphLab provides a complex job scheduler which attempts to minimize this cost, but then the efficiency of parallelization depends on the difficulty of the scheduling problem and the effectiveness of the scheduler.

In our empirical evaluation, NOMAD performs significantly better than GraphLab. The reasons are not hard to see. First, because of the lock-free nature of NOMAD, we completely avoid acquiring expensive network locks. Second, we use SGD, which allows us to exploit finer-grained parallelism as compared to ALS, and also leads to faster convergence. In fact, the GraphLab framework is not well suited for SGD (personal communication with the developers of GraphLab). Finally, because of the finer-grained data partitioning scheme used in NOMAD, unlike GraphLab, whose performance heavily depends on the underlying scheduling algorithms, we do not require a complicated scheduling mechanism.

4.3 Numerical Linear Algebra

The concepts of asynchronous and non-blocking updates have also been studied in numerical linear algebra. To avoid the load balancing problem and to reduce processor idle time, asynchronous numerical methods were first proposed over four decades ago by Chazan and Miranker [8]. Given an operator H : ℝ^m → ℝ^m, to find the fixed-point solution x* such that H(x*) = x*, a standard Gauss-Seidel-type procedure performs the update x_i = (H(x))_i sequentially (or randomly). Using the asynchronous procedure, each computational node asynchronously conducts updates on each variable (or a subset), x_i^new = (H(x))_i, and then overwrites x_i in common memory with x_i^new. Theory and applications of this asynchronous method have been widely studied (see the literature review of Frommer and Szyld [11] and the seminal textbook by Bertsekas and Tsitsiklis [6]). The concept of this asynchronous fixed-point update is very closely related to the Hogwild algorithm of Recht et al. [19] and to the so-called Asynchronous SGD (ASGD) method proposed by Teflioudi et al. [25]. Unfortunately, such algorithms are non-serializable; that is, there may not exist an equivalent update ordering in a serial implementation. In contrast, our NOMAD algorithm is not only asynchronous but also serializable, and therefore achieves faster convergence in practice.

On the other hand, non-blocking communication has also been proposed to accelerate iterative solvers in a distributed setting. For example, Hoefler et al. [14] presented a distributed conjugate gradient implementation with non-blocking collective MPI operations for solving linear systems. However, this algorithm still requires synchronization at each CG iteration, so it is very different from our NOMAD algorithm.

4.4 Discussion

We remark that, among the algorithms we have discussed so far, NOMAD is the only distributed-memory algorithm which is both asynchronous and lock-free. Other parallel versions of SGD, such as DSGD and DSGD++, are lock-free but not fully asynchronous; therefore, the cost of synchronization increases as the number of machines grows [28]. On the other hand, the GraphLab implementation of ALS [17] is asynchronous but not lock-free, and therefore depends on a complex job scheduler to reduce the side-effects of using locks.

5. EXPERIMENTS

In this section, we evaluate the empirical performance of NOMAD with extensive experiments. For the distributed-memory experiments, we compare NOMAD with DSGD [12], DSGD++ [25], and CCD++ [26]. We also compare against GraphLab, but the quality of the results produced by GraphLab is significantly worse than that of the other methods, and therefore the plots for this experiment are relegated to Appendix F. For the shared-memory experiments, we pitch NOMAD against FPSGD [28] (which has been shown to outperform DSGD in single-machine experiments) as well as CCD++. Our experiments are designed to answer the following questions:

• How does NOMAD scale with the number of cores on a single machine? (Section 5.2)

• How does NOMAD scale as a fixed-size dataset is distributed across multiple machines? (Section 5.3)

• How does NOMAD perform on a commodity hardware cluster? (Section 5.4)

• How does NOMAD scale when both the size of the data as well as the number of machines grow? (Section 5.5)

Since the objective function (1) is non-convex, different optimizers will converge to different solutions. Factors which affect the quality of the final solution include 1) the initialization strategy, 2) the sequence in which the ratings are accessed, and 3) the step-size decay schedule. It is clearly not feasible to consider the combinatorial effect of all these factors on each algorithm. However, we believe that the overall trend of our results is not affected by these factors.

5.1 Experimental Setup

Publicly available code for FPSGD³ and CCD++⁴ was used in our experiments. For DSGD and DSGD++, which we had to implement ourselves because the code is not publicly available, we closely followed the recommendations of Gemulla et al. [12] and Teflioudi et al. [25], and in some cases made improvements based on our experience. For a fair comparison, all competing algorithms were tuned for optimal performance on our hardware. The code and scripts required for reproducing the experiments are readily available for download from https://sites.google.com/site/hyokunyun/software. Parameters used in our experiments are summarized in Table 1.

Table 1: Dimensionality parameter k, regularization parameter λ (1), and step-size schedule parameters α, β (11).

Name           k     λ      α        β
Netflix        100   0.05   0.012    0.05
Yahoo! Music   100   1.00   0.00075  0.01
Hugewiki       100   0.01   0.001    0

For all experiments except the ones in Section 5.5, we work with three benchmark datasets, namely Netflix, Yahoo! Music, and Hugewiki (see Table 2 for more details).

³ http://www.csie.ntu.edu.tw/~cjlin/libmf
⁴ http://www.cs.utexas.edu/~rofuyu/libpmf

Table 2: Dataset details.

Name               Rows         Columns   Non-zeros
Netflix [5]        2,649,429    17,770    99,072,112
Yahoo! Music [10]  1,999,990    624,961   252,800,275
Hugewiki [2]       50,082,603   39,780    2,736,496,604

The same training and test dataset partition is used consistently for all algorithms in every experiment. Since our goal is to compare optimization algorithms, we do very minimal parameter tuning. For instance, we used the same regularization parameter λ for each dataset as reported by Yu et al. [26] and shown in Table 1; we study the effect of the regularization parameter on the convergence of NOMAD in Appendix A. By default, we use k = 100 for the dimension of the latent space; we study how the dimension of the latent space affects the convergence of NOMAD in Appendix B. All algorithms were initialized with the same initial parameters: we set each entry of W and H by independently sampling a uniform random variable in the range (0, 1/√k) [26, 28].

We compare the solvers in terms of Root Mean Square Error (RMSE) on the test set, which is defined as

$$\sqrt{\frac{\sum_{(i,j) \in \Omega_{\text{test}}} \left( A_{ij} - \langle \mathbf{w}_i, \mathbf{h}_j \rangle \right)^2}{|\Omega_{\text{test}}|}},$$

where Ω_test denotes the ratings in the test set.

All experiments except the ones reported in Section 5.4 are run using the Stampede Cluster at the University of Texas, a Linux cluster where each node is outfitted with 2 Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi Coprocessor (MIC Architecture). For the single-machine experiments (Section 5.2), we used nodes in the largemem queue, which are equipped with 1TB of RAM and 32 cores. For all other experiments, we used nodes in the normal queue, which are equipped with 32 GB of RAM and 16 cores (only 4 out of the 16 cores were used for computation). Inter-machine communication on this system is handled by MVAPICH2.

For the commodity hardware experiments in Section 5.4, we used m1.xlarge instances of Amazon Web Services, which are equipped with 15GB of RAM and four cores. We utilized all four cores in each machine: NOMAD and DSGD++ use two cores for computation and two cores for network communication, while DSGD and CCD++ use all four cores for both computation and communication. Inter-machine communication on this system is handled by MPICH2.

Since FPSGD uses single-precision arithmetic, the experiments in Section 5.2 are performed using single-precision arithmetic, while all other experiments use double-precision arithmetic. All algorithms are compiled with the Intel C++ compiler, with the exception of the experiments in Section 5.4, where we used gcc, which is the only compiler toolchain available on the commodity hardware cluster. For ready reference, exceptions to the experimental settings specific to each section are summarized in Table 3.

Table 3: Exceptions to each experiment.

Section 5.2:
• run on the largemem queue (32 cores, 1TB RAM)
• single-precision floating point used

Section 5.4:
• run on m1.xlarge (4 cores, 15GB RAM)
• compiled with gcc
• MPICH2 for MPI implementation

Section 5.5:
• synthetic datasets

The convergence speed of stochastic gradient descent methods depends on the choice of the step-size schedule. The schedule we used for NOMAD is

$$s_t = \frac{\alpha}{1 + \beta \cdot t^{1.5}}, \qquad (11)$$

where t is the number of SGD updates that were performed on a particular user-item pair (i, j). DSGD and DSGD++, on the other hand, use an alternative strategy called bold-driver [12]: here, the step size is adapted by monitoring the change of the objective function.
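For reference, schedule (11) is straightforward to implement; a one-line sketch (ours), with t counted per user-item pair as stated above:

```python
def step_size(t, alpha, beta):
    """Step-size schedule of Eq. (11): s_t = alpha / (1 + beta * t**1.5)."""
    return alpha / (1.0 + beta * t ** 1.5)
```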

5.2 Scaling in Number of Cores

For the first experiment, we fixed the number of cores to 30 and compared the performance of NOMAD, FPSGD⁵, and CCD++ (Figure 5). On Netflix (left), NOMAD not only converges to a slightly better quality solution (RMSE 0.914 vs. 0.916 for the others), but is also able to reduce the RMSE rapidly right from the beginning. On Yahoo! Music (middle), NOMAD converges to a slightly worse solution than FPSGD (RMSE 21.894 vs. 21.853), but as in the case of Netflix, the initial convergence is more rapid. On Hugewiki, the difference is smaller, but NOMAD still outperforms the others. The initial speed of CCD++ on Hugewiki is comparable to NOMAD, but the quality of its solution starts to deteriorate in the middle. Note that the performance of CCD++ here is better than what was reported in Zhuang et al. [28], since they used double-precision floating point arithmetic for CCD++. In other experiments (not reported here) we varied the number of cores, and found that the relative difference in performance between NOMAD, FPSGD, and CCD++ is very similar to that observed in Figure 5.

For the second experiment, we varied the number of cores from 4 to 30 and plot the scaling behavior of NOMAD (Figures 6 and 7). Figure 6 (left) shows how the test RMSE changes as a function of the number of updates on Yahoo! Music. Interestingly, as we increased the number of cores, the test RMSE decreased faster. We believe this is because when we increase the number of cores, the rating matrix A is partitioned into smaller blocks; recall that we split A into p × n blocks, where p is the number of parallel workers. Therefore, the communication between workers becomes more frequent, and each SGD update is based on fresher information (see also Section 3.2 for a mathematical analysis). This effect was more strongly observed on Yahoo! Music than on the others, since Yahoo! Music has a much larger number of items (624,961 vs. 17,770 for Netflix and 39,780 for Hugewiki), and therefore more communication is needed to circulate the new information to all workers. Results for the other datasets are provided in Figure 18 in Appendix D.

⁵ Since the current implementation of FPSGD in LibMF only reports CPU execution time, we divide this by the number of threads and use it as a proxy for wall clock time.

[Figure 6: Left: test RMSE of NOMAD as a function of the number of updates on Yahoo! Music (machines=1, λ=1.00, k=100) when the number of cores is varied (4, 8, 16, 30). Right: number of updates of NOMAD per core per second as a function of the number of cores, for Netflix, Yahoo! Music, and Hugewiki.]

On the other hand, to assess the efficiency of computation, we define average throughput as the average number of ratings processed per core per second, and plot it for each dataset in Figure 6 (right) while varying the number of cores. If NOMAD exhibits linear scaling in terms of the speed at which it processes ratings, the average throughput should remain constant.⁶ On Netflix, the average throughput indeed remains almost constant as the number of cores changes. On Yahoo! Music and Hugewiki, the throughput decreases to about 50% as the number of cores is increased to 30. We believe this is mainly due to cache locality effects.

Now we study how much speed-up NOMAD can achieve by increasing the number of cores. In Figure 7, we set the y-axis to be the test RMSE and the x-axis to be the total CPU time expended, which is given by the number of seconds elapsed multiplied by the number of cores. We plot the convergence curves for cores = 4, 8, 16, and 30. If the curves overlap, this shows that we achieve linear speed-up as we increase the number of cores. This is indeed the case for Netflix and Hugewiki. In the case of Yahoo! Music, we observe that the speed of convergence increases as the number of cores increases. This, we believe, is again due to the decrease in the block size, which leads to faster convergence.

⁶ Note that since we use single-precision floating point arithmetic in this section to match the implementation of FPSGD, the throughput of NOMAD is about 50% higher than in the other experiments.

5.3 Scaling as a Fixed Dataset is Distributed Across Workers

In this subsection, we use 4 computation threads per machine. For the first experiment, we fix the number of machines to 32 (64 for Hugewiki) and compare the performance of NOMAD with DSGD, DSGD++, and CCD++ (Figure 8). On Netflix and Hugewiki, NOMAD converges much faster than its competitors: not only is its initial convergence faster, but it also discovers a better quality solution. On Yahoo! Music, all four methods perform similarly. This is because the cost of network communication relative to the size of the data is much higher for Yahoo! Music: while Netflix and Hugewiki have 5,575 and 68,635 non-zero ratings per item, respectively, Yahoo! Music has only 404 ratings per item. Therefore, when Yahoo! Music is divided equally across 32 machines, each item has only 10 ratings on average per machine. Hence, the cost of sending and receiving the item parameter vector h_j for one item j across the network is higher than that of executing SGD updates on the ratings of the item locally stored within the machine, Ω_j^(q). As a consequence, the cost of network communication dominates the overall execution time of all algorithms, and little difference in convergence speed is found between them.

[Figure 5: Comparison of NOMAD, FPSGD, and CCD++ on a single machine with 30 computation cores, on Netflix (λ=0.05), Yahoo! Music (λ=1.00), and Hugewiki (λ=0.01), all with k=100; x-axis: seconds, y-axis: test RMSE.]

[Figure 7: Test RMSE of NOMAD as a function of computation time (time in seconds times the number of cores) when the number of cores is varied (4, 8, 16, 30), on Netflix, Yahoo! Music, and Hugewiki (machines=1, k=100).]

For the second experiment, we varied the number of machines from 1 to 32 and plot the scaling behavior of NOMAD (Figures 10 and 9). Figure 10 (left) shows how the test RMSE decreases as a function of the number of updates on Yahoo! Music. Again, if NOMAD scales linearly, the average throughput has to remain constant; here, we observe an improvement in convergence speed when 8 or more machines are used. This is again the effect of smaller block sizes, which was discussed in Section 5.2. On Netflix, a similar effect was present but less significant; on Hugewiki, we did not see any notable difference between configurations (see Figure 19 in Appendix D).

In Figure 10 (right), we plot the average throughput (the number of updates per machine per core per second) as a function of the number of machines. On Yahoo! Music, the average throughput goes down as we increase the number of machines because, as mentioned above, each item has a small number of ratings. On Hugewiki, we observe almost linear scaling, and on Netflix, the average throughput even improves as we increase the number of machines; we believe this is because of cache locality effects. As we partition the users into smaller and smaller blocks, the probability of a cache miss on the user parameters w_i within a block decreases, and on Netflix this makes a meaningful difference: indeed, there are only 480,189 users in Netflix who have at least one rating. When this is equally divided into 32 machines, each machine contains only 11,722 active users on average. Therefore, the w_i variables take only 11MB of memory, which is smaller than the size of the L3 cache (20MB) of the machine we used, and this leads to an increase in the number of updates per machine per core per second.

Now we study how much speed-up NOMAD can achieve by increasing the number of machines. In Figure 9, we set the y-axis to be the test RMSE and the x-axis to be the number of seconds elapsed multiplied by the total number of cores used in the configuration. Again, all lines will coincide with each other if NOMAD shows linear scaling. On Netflix, with 2 and 4 machines we observe a mild slowdown, but with more than 4 machines NOMAD exhibits super-linear scaling. On Yahoo! Music, we observe super-linear scaling with respect to the speed of a single machine in all configurations, but the highest speedup is seen with 16 machines. On Hugewiki, linear scaling is observed in every configuration.

5.4 Scaling on Commodity Hardware

In this subsection, we analyze the scaling behavior of NOMAD on commodity hardware. Using Amazon Web Services (AWS), we set up a computing cluster that consists of 32 machines; each machine is of type m1.xlarge and is equipped with a quad-core Intel Xeon E5430 CPU and 15GB of RAM. Network bandwidth among these machines is reported to be approximately 1Gb/s.⁷

Since NOMAD and DSGD++ dedicate two threads to network communication on each machine, only two cores are available for computation.⁸ In contrast, bulk synchronization algorithms such as DSGD and CCD++, which separate computation and communication, can utilize all four cores for computation.

⁷ http://epamcloud.blogspot.com/2013/03/testing-amazon-ec2-network-speed.html
⁸ Since network communication is not computation-intensive, for DSGD++ we used four computation threads instead of two and got better results; we thus report results with four computation threads for DSGD++.

[Figure 8: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on an HPC cluster: Netflix (machines=32), Yahoo! Music (machines=32), and Hugewiki (machines=64), all with cores=4 and k=100; x-axis: seconds, y-axis: test RMSE.]

[Figure 9: Test RMSE of NOMAD as a function of computation time (time in seconds times the number of machines times the number of cores per machine) on an HPC cluster when the number of machines is varied, on Netflix, Yahoo! Music, and Hugewiki.]

In spite of this disadvantage, Figure 11 shows that NOMAD outperforms all other algorithms in this setting as well. In this plot, we fixed the number of machines to 32; on Netflix and Hugewiki, NOMAD converges more rapidly to a better solution. Recall that on Yahoo! Music, all four algorithms performed very similarly on the HPC cluster in Section 5.3. However, on commodity hardware, NOMAD outperforms the other algorithms. This shows that the efficiency of network communication plays a very important role in commodity hardware clusters, where the communication is relatively slow. On Hugewiki, however, the number of columns is very small compared to the number of ratings, and thus network communication plays a smaller role in this dataset compared to the others. Therefore, the initial convergence of DSGD is a bit faster than that of NOMAD, as it uses all four cores for computation while NOMAD uses only two. Still, the overall convergence speed is similar, and NOMAD finds a better quality solution.

As in Section 5.3, we increased the number of machines from 1 to 32 and studied the scaling behavior of NOMAD. The overall trend was identical to what we observed in Figures 10 and 9; due to page constraints, the plots for this experiment can be found in Appendix C.

55 Scaling as both Dataset Size and Numberof Machines Grows

In previous sections (Section 53 and Section 54) westudied the scalability of algorithms by partitioning a fixedamount of data into increasing number of machines In real-world applications of collaborative filtering however the

0 05 1 15 2 25 3

middot1010

22

23

24

25

26

27

number of updates

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 5 10 15 20 25 300

1

2

3

4

middot106

number of machines

up

date

sp

erm

ach

ine

per

core

per

sec

Netflix cores=4 λ = 005 k = 100

Netflix

Yahoo

Hugewiki

Figure 10 Results on HPC cluster when the number ofmachines is varied Left Test RMSE of NOMAD as a func-tion of the number of updates on Netflix and Yahoo MusicRight Number of updates of NOMAD per machine per coreper second as a function of the number of machines

In real-world applications of collaborative filtering, however, the size of the data should grow over time as new users are added to the system. Therefore, to match the increased amount of data with an equivalent amount of physical memory and computational power, the number of machines should increase as well. The aim of this section is to compare the scaling behavior of NOMAD and that of other algorithms in this realistic scenario.

To simulate such a situation, we generated synthetic datasets whose characteristics resemble real data: the number of ratings for each user and each item is sampled from the corresponding empirical distribution of the Netflix data.


[Figure: test RMSE vs. seconds on Netflix, Yahoo Music, and Hugewiki (machines=32, cores=4, k=100); curves for NOMAD, DSGD, DSGD++, and CCD++.]
Figure 11: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a commodity hardware cluster.

As we increase the number of machines from 4 to 32, we fix the number of items to be the same as that of Netflix (17,770) and increase the number of users in proportion to the number of machines (480,189 times the number of machines⁹). Therefore, the expected number of ratings in each dataset is proportional to the number of machines (99,072,112 times the number of machines) as well.

Conditioned on the number of ratings for each user and item, the nonzero locations are sampled uniformly at random. Ground-truth user parameters w_i and item parameters h_j are generated from a 100-dimensional standard isotropic Gaussian distribution, and for each rating A_ij, Gaussian noise with mean zero and standard deviation 0.1 is added to the "true" rating ⟨w_i, h_j⟩.
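For concreteness, the following NumPy sketch (our own illustration, not the authors' generator; the helper name make_synthetic is hypothetical) produces a dataset in this spirit. It matches the user-side rating counts by resampling from an empirical degree distribution and places the nonzero locations uniformly at random; matching the item-side counts exactly would require a degree-constrained bipartite construction.

```python
import numpy as np

def make_synthetic(user_degree_samples, n_users, n_items, k=100,
                   noise_std=0.1, seed=0):
    """Generate synthetic ratings in coordinate form (rows, cols, vals)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_users, k))     # ground-truth user factors
    H = rng.standard_normal((n_items, k))     # ground-truth item factors
    # Resample a rating count for every user from the empirical distribution.
    counts = rng.choice(user_degree_samples, size=n_users, replace=True)

    rows, cols, vals = [], [], []
    for i in range(n_users):
        size = min(int(counts[i]), n_items)
        js = rng.choice(n_items, size=size, replace=False)   # uniform locations
        ratings = H[js] @ W[i] + rng.normal(0.0, noise_std, size=size)
        rows.extend([i] * size)
        cols.extend(js)
        vals.extend(ratings)
    return np.asarray(rows), np.asarray(cols), np.asarray(vals), W, H
```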

Figure 12 shows that the comparative advantage of NOMAD against DSGD, DSGD++, and CCD++ increases as we grow the scale of the problem. NOMAD clearly outperforms the other methods in all configurations. DSGD++ is very competitive at the small scale, but as the size of the problem grows, NOMAD shows better scaling behavior.

6. CONCLUSION AND FUTURE WORK
From our experimental study we conclude that:

• On a single machine, NOMAD shows near-linear scaling up to 30 threads.
• When a fixed size dataset is distributed across multiple machines, NOMAD shows near-linear scaling up to 32 machines.
• Both in shared-memory and distributed-memory settings, NOMAD exhibits superior performance against state-of-the-art competitors; in a commodity hardware cluster, the comparative advantage is more conspicuous.
• When both the size of the data and the number of machines grow, the scaling behavior of NOMAD is much nicer than that of its competitors.

Although we only discussed the matrix completion problem in this paper, it is worth noting that the idea of NOMAD is more widely applicable. Specifically, the ideas discussed in this paper can be easily adapted as long as the objective function can be written as

\[
f(W, H) = \sum_{(i,j) \in \Omega} f_{ij}(w_i, h_j).
\]

As part of our ongoing work, we are investigating ways to rewrite Support Vector Machines (SVMs) and binary logistic regression as saddle point problems which have the above structure.

⁹ 480,189 is the number of users in Netflix who have at least one rating.

Inference in Latent Dirichlet Allocation (LDA) using a collapsed Gibbs sampler has a structure similar to the stochastic gradient descent updates for matrix factorization. An additional complication in LDA is that the variables need to be normalized. We are also investigating how the NOMAD framework can be used for LDA.

7. ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their constructive comments. We thank the Texas Advanced Computing Center at the University of Texas and the Research Computing group at Purdue University for providing infrastructure and timely support for our experiments. Computing experiments on commodity hardware were made possible by an AWS in Education Machine Learning Research Grant Award. This material is partially based upon work supported by the National Science Foundation under grant nos. IIS-1219015 and CCF-1117055.

References
[1] Apache Hadoop, 2009. http://hadoop.apache.org/core/.
[2] GraphLab datasets, 2013. http://graphlab.org/downloads/datasets/.
[3] Intel Thread Building Blocks, 2013. https://www.threadingbuildingblocks.org.
[4] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.
[5] R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations, 9(2):75–79, 2007.
[6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.
[7] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, pages 351–368, 2011.
[8] D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2:199–222, 1969.


[Figure: test RMSE vs. seconds on synthetic data for machines=4, machines=16, and machines=32 (cores=4, k=100, λ=0.01); curves for NOMAD, DSGD, DSGD++, and CCD++.]
Figure 12: Comparison of algorithms when both dataset size and the number of machines grow. Left: 4 machines, middle: 16 machines, right: 32 machines.

[9] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[10] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! Music dataset and KDD-Cup'11. Journal of Machine Learning Research - Proceedings Track, 18:8–18, 2012.
[11] A. Frommer and D. B. Szyld. On asynchronous iterations. Journal of Computational and Applied Mathematics, 123:201–216, 2000.
[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the Conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.
[13] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, pages 17–30, 2012.
[14] T. Hoefler, P. Gottschling, W. Rehm, and A. Lumsdaine. Optimizing a conjugate gradient solver with non-blocking operators. Parallel Computing, 2007.
[15] C. J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the Conference on Knowledge Discovery and Data Mining, pages 1064–1072, August 2011.
[16] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences. Springer, New York, 1978.
[17] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the International Conference on Very Large Data Bases, pages 716–727, 2012.
[18] B. Recht and C. Re. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, June 2013.
[19] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the Conference on Advances in Neural Information Processing Systems, pages 693–701, 2011.
[20] P. Richtarik and M. Takac. Distributed coordinate descent method for learning with big data, 2013. URL http://arxiv.org/abs/1310.2059.
[21] H. E. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
[22] S. Shalev-Schwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the International Conference on Machine Learning, pages 928–935, 2008.
[23] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Proceedings of the International Conference on Very Large Data Bases, pages 703–710, 2010.
[24] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the International Conference on the World Wide Web, pages 607–614, 2011.
[25] C. Teflioudi, F. Makari, and R. Gemulla. Distributed matrix completion. In Proceedings of the International Conference on Data Mining, pages 655–664, 2012.
[26] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774, 2012.
[27] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proceedings of the Conference on Algorithmic Aspects in Information and Management, pages 337–348, 2008.
[28] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proceedings of the ACM Conference on Recommender Systems, pages 249–256, 2013.


APPENDIX
A. EFFECT OF THE REGULARIZATION PARAMETER
In this subsection we study the convergence behavior of NOMAD as we change the regularization parameter λ (Figure 13). Note that on the Netflix data (left), for non-optimal choices of the regularization parameter, the test RMSE increases from the initial solution as the model overfits or underfits the training data. While NOMAD reliably converges in all cases on Netflix, the convergence is notably faster with higher values of λ; this is expected because regularization smooths the objective function and makes the optimization problem easier to solve. On the other datasets, the speed of convergence was not very sensitive to the selection of the regularization parameter.

B. EFFECT OF THE LATENT DIMENSION
In this subsection we study the convergence behavior of NOMAD as we change the dimensionality parameter k (Figure 14). In general, convergence is faster for smaller values of k, as the computational cost of the SGD updates (9) and (10) is linear in k. On the other hand, the model gets richer with higher values of k: as its parameter space expands, it becomes capable of picking up weaker signals in the data, with the risk of overfitting. This is observed in Figure 14 with Netflix (left) and Yahoo Music (right). On Hugewiki, however, small values of k were sufficient to fit the training data, and the test RMSE suffers from overfitting with higher values of k. Nonetheless, NOMAD reliably converged in all cases.

C. SCALING ON COMMODITY HARDWARE
In this section we augment Section 5.4 by providing the actual plots of the experiment. We increase the number of machines from 1 to 32 and plot how the convergence of NOMAD is affected by the number of machines. As in Section 5.4, we used m1.xlarge machines from Amazon Web Services (AWS), which have a quad-core Intel Xeon E5430 CPU and 15GB of RAM each.
The overall pattern is identical to what was found in Figures 19, 10, and 9 of Section 5.3. Figure 15 shows how the test RMSE decreases as a function of the number of updates; as in Figure 19, the speed of convergence is faster with a larger number of machines, as updated information is exchanged more frequently. Figure 16 shows the number of updates performed per second in each computation core of each machine: NOMAD exhibits linear scaling on Netflix and Hugewiki, but slows down on Yahoo Music due to the extreme sparsity of the data. Figure 17 compares the convergence speed of different settings when the same amount of computational power is given to each; on every dataset we observe linear to super-linear scaling up to 32 machines.

D. TEST RMSE AS A FUNCTION OF THE NUMBER OF UPDATES ON THE HPC CLUSTER
In this section we plot the test RMSE as a function of the number of updates on the HPC cluster, which was not included in the main text due to page constraints. Figure 18 shows single-machine multi-threaded experiments, and Figure 19 shows multi-machine distributed-memory experiments.

E. COMPARISON OF ALGORITHMS FOR DIFFERENT VALUES OF THE REGULARIZATION PARAMETER
In this section we augment the experiments in Section 5.3 by comparing the performance of NOMAD, CCD++, and DSGD for different values of the regularization parameter λ. Figure 20 shows the result of the experiment. As NOMAD and DSGD are both stochastic gradient descent methods, they behave similarly to each other when the regularization parameter is changed. On the other hand, CCD++, which decreases the objective function more greedily, behaves differently.
For small values of λ, CCD++ seems to overfit the model due to its greedy strategy; it generally converges to a worse solution than the others. For high values of λ, however, the strategy of CCD++ is advantageous, and it shows rapid initial convergence. Note that in all cases NOMAD is competitive as compared to both CCD++ and DSGD.

F. COMPARISON WITH GRAPHLAB
PowerGraph 2.2 [17], the latest version of GraphLab¹⁰, was used in our experiments. Since the Intel compiler is unable to compile GraphLab, we compiled it with gcc. The rest of the experimental setting is the same as in Section 5.1.
GraphLab provides a number of optimization algorithms for matrix completion. However, we found that the Stochastic Gradient Descent (SGD) implementation of GraphLab does not converge. According to the GraphLab developers, this is because the abstraction currently provided by GraphLab is not suitable for SGD¹¹. Therefore we use the Alternating Least Squares (ALS) algorithm for minimizing the objective function (1).
Although each machine in our HPC cluster is equipped with 32 GB of RAM and we distribute the work across 32 machines in multi-machine experiments, we had to tune the nfibers parameter to avoid out-of-memory problems. In spite of tuning many different parameters, we were unable to run GraphLab on the Hugewiki data in any setting. For the other datasets we tried both the synchronous and asynchronous engines of GraphLab and report the best of the two settings.
Figure 21 shows results of single-machine multi-threaded experiments, while Figure 22 and Figure 23 show multi-machine experiments on the HPC cluster and the commodity cluster, respectively. Clearly, NOMAD not only converges orders of magnitude faster than GraphLab in every setting, but the quality of the solution is also significantly better.
Compared to the multi-machine setting with a large number of machines (32) but a small number of cores (4) per machine, GraphLab converges faster when it is run in a single-machine setting with a large number of cores (30). We conjecture that this is because locking and unlocking a variable over a network is expensive. In contrast, since NOMAD is a lock-free algorithm, it scales well in both settings.

¹⁰ Available for download from https://github.com/graphlab-code/graphlab.
¹¹ GraphLab provides a different algorithm named biassgd; however, it is based on a model different from (1).


[Figure: test RMSE vs. seconds on Netflix, Yahoo Music, and Hugewiki (machines=8, cores=4, k=100), with one curve per value of λ.]
Figure 13: Convergence behavior of NOMAD when the regularization parameter λ is varied.

[Figure: test RMSE vs. seconds on Netflix, Yahoo Music, and Hugewiki (machines=8, cores=4), with curves for k = 10, 20, 50, 100.]
Figure 14: Convergence behavior of NOMAD when the latent dimension k is varied.

[Figure: test RMSE vs. number of updates on Netflix, Yahoo Music, and Hugewiki (cores=4, k=100), with one curve per number of machines.]
Figure 15: Test RMSE of NOMAD as a function of the number of updates on a commodity hardware cluster when the number of machines is varied.

[Figure: number of updates per machine per core per second vs. number of machines on Netflix, Yahoo Music, and Hugewiki (cores=4, k=100).]
Figure 16: Number of updates of NOMAD per machine per core per second as a function of the number of machines on a commodity hardware cluster.


[Figure: test RMSE vs. seconds × machines × cores on Netflix, Yahoo Music, and Hugewiki (cores=4, k=100), with one curve per number of machines.]
Figure 17: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a commodity hardware cluster when the number of machines is varied.

[Figure: test RMSE vs. number of updates on Netflix, Yahoo Music, and Hugewiki (machines=1, k=100), with curves for 4, 8, 16, and 30 cores.]
Figure 18: Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied.

[Figure: test RMSE vs. number of updates on Netflix, Yahoo Music, and Hugewiki (cores=4, k=100), with one curve per number of machines.]
Figure 19: Test RMSE of NOMAD as a function of the number of updates on a HPC cluster when the number of machines is varied.


[Figure: fifteen panels of test RMSE vs. seconds (Netflix and Yahoo Music with machines=32, Hugewiki with machines=64; cores=4, k=100) for NOMAD, DSGD, and CCD++, with one row per value of λ.]
Figure 20: Comparison of NOMAD, DSGD, and CCD++ on a HPC cluster when the regularization parameter λ is varied. The value of λ increases from top to bottom.


Although GraphLab biassgd is based on a model different from (1), for completeness we provide comparisons with it on the commodity hardware cluster. Unfortunately, GraphLab biassgd crashed when we ran it on more than 16 machines, so we had to run it on only 16 machines and assume that GraphLab scales linearly up to 32 machines in order to generate the plots in Figure 23. Again, NOMAD was orders of magnitude faster than GraphLab and converged to a better solution.


[Figure: test RMSE vs. seconds on Netflix and Yahoo Music (machines=1, cores=30, k=100) for NOMAD and GraphLab ALS.]
Figure 21: Comparison of NOMAD and GraphLab on a single machine with 30 computation cores.

[Figure: test RMSE vs. seconds on Netflix and Yahoo Music (machines=32, cores=4, k=100) for NOMAD and GraphLab ALS.]
Figure 22: Comparison of NOMAD and GraphLab on a HPC cluster.

[Figure: test RMSE vs. seconds on Netflix and Yahoo Music (machines=32, cores=4, k=100) for NOMAD, GraphLab ALS, and GraphLab biassgd.]
Figure 23: Comparison of NOMAD and GraphLab on a commodity hardware cluster.



$\Omega \subseteq \{1, \ldots, m\} \times \{1, \ldots, n\}$ denote the observed entries of $A$; that is, $(i, j) \in \Omega$ implies that user $i$ gave item $j$ a rating of $A_{ij}$. The goal here is to accurately predict the unobserved ratings. For convenience, we define $\Omega_i$ to be the set of items rated by the $i$-th user, i.e., $\Omega_i := \{j : (i, j) \in \Omega\}$. Analogously, $\bar{\Omega}_j := \{i : (i, j) \in \Omega\}$ is the set of users who have rated item $j$. Also, let $a_i^{\top}$ denote the $i$-th row of $A$.

One popular model for matrix completion finds matrices $W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{n \times k}$, with $k \ll \min(m, n)$, such that $A \approx W H^{\top}$. One way to understand this model is to realize that each row $w_i^{\top} \in \mathbb{R}^k$ of $W$ can be thought of as a $k$-dimensional embedding of the user. Analogously, each row $h_j^{\top} \in \mathbb{R}^k$ of $H$ is an embedding of the item in the same $k$-dimensional space. In order to predict the $(i, j)$-th entry of $A$, we simply use $\langle w_i, h_j \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product of two vectors. The goodness of fit of the model is measured by a loss function. While our optimization algorithm can work with an arbitrary separable loss, for ease of exposition we will only discuss the square loss $\frac{1}{2}\left(A_{ij} - \langle w_i, h_j \rangle\right)^2$. Furthermore, we need to enforce regularization to prevent over-fitting and to predict well on the unknown entries of $A$. Again, a variety of regularizers can be handled by our algorithm, but we will only focus on the following weighted square norm-regularization in this paper:
\[
\frac{\lambda}{2} \sum_{i=1}^{m} |\Omega_i| \cdot \|w_i\|^2 + \frac{\lambda}{2} \sum_{j=1}^{n} |\bar{\Omega}_j| \cdot \|h_j\|^2,
\]
where $\lambda > 0$ is a regularization parameter. Here $|\cdot|$ denotes the cardinality of a set and $\|\cdot\|$ is the $L_2$ norm of a vector. Putting everything together yields the following objective function:

\[
\min_{W \in \mathbb{R}^{m \times k},\, H \in \mathbb{R}^{n \times k}} \; J(W, H) = \frac{1}{2} \sum_{(i,j) \in \Omega} \left(A_{ij} - \langle w_i, h_j \rangle\right)^2 + \frac{\lambda}{2} \left( \sum_{i=1}^{m} |\Omega_i| \cdot \|w_i\|^2 + \sum_{j=1}^{n} |\bar{\Omega}_j| \cdot \|h_j\|^2 \right). \tag{1}
\]

This can be further simplified and written as
\[
J(W, H) = \frac{1}{2} \sum_{(i,j) \in \Omega} \left[ \left(A_{ij} - \langle w_i, h_j \rangle\right)^2 + \lambda \left( \|w_i\|^2 + \|h_j\|^2 \right) \right].
\]

In the above equations, $\lambda > 0$ is a scalar which trades off the loss function with the regularizer.
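As a quick illustration of how the objective (1) is evaluated in practice, here is a minimal NumPy sketch assuming the observed ratings are stored in coordinate (COO) form; the function name and data layout are our own choices, not the authors' code.

```python
import numpy as np

def objective(rows, cols, vals, W, H, lam):
    """Objective (1) for ratings given in coordinate (COO) form.

    rows[t], cols[t], vals[t] encode one observed rating A[i, j];
    W is m x k, H is n x k, lam is the regularization parameter.
    """
    err = vals - np.sum(W[rows] * H[cols], axis=1)   # A_ij - <w_i, h_j>
    loss = 0.5 * np.sum(err ** 2)
    # Weighted regularizer: each w_i / h_j is penalized once per observed rating.
    reg = 0.5 * lam * (np.sum(W[rows] ** 2) + np.sum(H[cols] ** 2))
    return loss + reg
```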

In the sequel, we will let $w_{il}$ and $h_{jl}$, for $1 \le l \le k$, denote the $l$-th coordinate of the column vectors $w_i$ and $h_j$, respectively. Furthermore, $H_{\Omega_i}$ (resp. $W_{\bar{\Omega}_j}$) will be used to denote the sub-matrix of $H$ (resp. $W$) formed by collecting the rows corresponding to $\Omega_i$ (resp. $\bar{\Omega}_j$).

Note the following property of the above objective function (1): if we fix $H$, then the problem decomposes into $m$ independent convex optimization problems, each of which has the following form:
\[
\min_{w_i \in \mathbb{R}^k} \; J_i(w_i) = \frac{1}{2} \sum_{j \in \Omega_i} \left(A_{ij} - \langle w_i, h_j \rangle\right)^2 + \frac{\lambda}{2} \|w_i\|^2. \tag{2}
\]

Analogously, if we fix $W$, then (1) decomposes into $n$ independent convex optimization problems, each of which has the following form:
\[
\min_{h_j \in \mathbb{R}^k} \; J_j(h_j) = \frac{1}{2} \sum_{i \in \bar{\Omega}_j} \left(A_{ij} - \langle w_i, h_j \rangle\right)^2 + \frac{\lambda}{2} \|h_j\|^2.
\]

The gradient and Hessian of $J_i(w_i)$ can be easily computed:
\[
\nabla J_i(w_i) = M w_i - b \quad \text{and} \quad \nabla^2 J_i(w_i) = M,
\]
where we have defined $M := H_{\Omega_i}^{\top} H_{\Omega_i} + \lambda I$ and $b := H_{\Omega_i}^{\top} a_i$.

We will now present three well-known optimization strategies for solving (1), which essentially differ in only two characteristics, namely the sequence in which updates to the variables in $W$ and $H$ are carried out, and the level of approximation in the update.

2.1 Alternating Least Squares
A simple version of the Alternating Least Squares (ALS) algorithm updates variables in the following order: $w_1, w_2, \ldots, w_m, h_1, h_2, \ldots, h_n, w_1, \ldots$ and so on. Updates to $w_i$ are computed by solving (2), which is in fact a least squares problem, and thus the following Newton update gives us
\[
w_i \leftarrow w_i - \left[\nabla^2 J_i(w_i)\right]^{-1} \nabla J_i(w_i), \tag{3}
\]
which can be rewritten using $M$ and $b$ as $w_i \leftarrow M^{-1} b$. Updates to the $h_j$'s are analogous.
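A minimal NumPy sketch of the ALS update (3) for a single user follows; the function signature and data layout (the item indices and ratings of user i passed as arrays) are our own illustrative choices.

```python
import numpy as np

def als_update_user(i, ratings_i, omega_i, W, H, lam):
    """One exact ALS update for user i, i.e. w_i <- M^{-1} b.

    omega_i:   array of item indices j rated by user i,
    ratings_i: the corresponding ratings A[i, j] in the same order.
    """
    H_i = H[omega_i]                              # |Omega_i| x k sub-matrix
    M = H_i.T @ H_i + lam * np.eye(H.shape[1])
    b = H_i.T @ ratings_i
    W[i] = np.linalg.solve(M, b)                  # solve M w_i = b, no explicit inverse
    return W[i]
```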

2.2 Coordinate Descent
The ALS update involves forming the Hessian and inverting it. In order to reduce the computational complexity, one can replace the Hessian by its diagonal approximation:
\[
w_i \leftarrow w_i - \left[\operatorname{diag}\left(\nabla^2 J_i(w_i)\right)\right]^{-1} \nabla J_i(w_i), \tag{4}
\]
which can be rewritten using $M$ and $b$ as
\[
w_i \leftarrow w_i - \operatorname{diag}(M)^{-1} \left[ M w_i - b \right]. \tag{5}
\]
If we update one component of $w_i$ at a time, the update (5) can be written as
\[
w_{il} \leftarrow w_{il} - \frac{\langle m_l, w_i \rangle - b_l}{m_{ll}}, \tag{6}
\]
where $m_l$ is the $l$-th row of the matrix $M$, $b_l$ is the $l$-th component of $b$, and $m_{ll}$ is the $l$-th coordinate of $m_l$.
If we choose the update sequence $w_{11}, \ldots, w_{1k}, w_{21}, \ldots, w_{2k}, \ldots, w_{m1}, \ldots, w_{mk}, h_{11}, \ldots, h_{1k}, h_{21}, \ldots, h_{2k}, \ldots, h_{n1}, \ldots, h_{nk}, w_{11}, \ldots, w_{1k}, \ldots$ and so on, then this recovers Cyclic Coordinate Descent (CCD) [15]. On the other hand, the update sequence $w_{11}, \ldots, w_{m1}, h_{11}, \ldots, h_{n1}, w_{12}, \ldots, w_{m2}, h_{12}, \ldots, h_{n2}, \ldots$ and so on recovers the CCD++ algorithm of Yu et al. [26]. The CCD++ updates can be performed more efficiently than the CCD updates by maintaining a residual matrix [26].
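The diagonal-approximation update (6) can be sketched in the same setting as the ALS helper above; the function below is illustrative rather than the CCD++ implementation of [26], which additionally maintains a residual matrix.

```python
import numpy as np

def ccd_update_user(i, ratings_i, omega_i, W, H, lam):
    """One sweep of update (6) over the coordinates of w_i."""
    k = H.shape[1]
    H_i = H[omega_i]
    M = H_i.T @ H_i + lam * np.eye(k)
    b = H_i.T @ ratings_i
    for l in range(k):
        # w_il <- w_il - (<m_l, w_i> - b_l) / m_ll
        W[i, l] -= (M[l] @ W[i] - b[l]) / M[l, l]
    return W[i]
```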

2.3 Stochastic Gradient Descent
The stochastic gradient descent (SGD) algorithm for matrix completion can be motivated from its more classical version, gradient descent. Given an objective function $f(\theta) = \frac{1}{m} \sum_{i=1}^{m} f_i(\theta)$, the gradient descent update is
\[
\theta \leftarrow \theta - s_t \cdot \nabla_{\theta} f(\theta), \tag{7}
\]
where $t$ denotes the iteration number and $s_t$ is a sequence of step sizes. The stochastic gradient descent update replaces $\nabla_{\theta} f(\theta)$ by its unbiased estimate $\nabla_{\theta} f_i(\theta)$, for a randomly chosen index $i$, which yields
\[
\theta \leftarrow \theta - s_t \cdot \nabla_{\theta} f_i(\theta). \tag{8}
\]
It is significantly cheaper to evaluate $\nabla_{\theta} f_i(\theta)$ as compared to $\nabla_{\theta} f(\theta)$. One can show that for sufficiently large $t$ the above updates will converge to a fixed point of $f$ [16, 21]. The above update also enjoys desirable properties in terms of sample complexity and hence is widely used in machine learning [7, 22].


[Figure: two panels showing a bipartite user-item graph; (a) alternating least squares and coordinate descent, (b) stochastic gradient descent.]
Figure 1: Illustration of updates used in matrix completion. Three algorithms are shown here: (a) alternating least squares and coordinate descent, (b) stochastic gradient descent. Black indicates that the value of the node is being updated, gray indicates that the value of the node is being read. White nodes are neither being read nor updated.

For the matrix completion problem, note that for a fixed $(i, j)$ pair the (stochastic) gradient of (1) can be written as
\[
\nabla_{w_i} J(W, H) = \left(\langle w_i, h_j \rangle - A_{ij}\right) h_j + \lambda w_i \quad \text{and} \quad
\nabla_{h_j} J(W, H) = \left(\langle w_i, h_j \rangle - A_{ij}\right) w_i + \lambda h_j.
\]
Therefore, the SGD updates require sampling a random index $(i_t, j_t)$ uniformly from the set of nonzero indices $\Omega$ and performing the updates
\[
w_{i_t} \leftarrow w_{i_t} - s_t \cdot \left[ \left(\langle w_{i_t}, h_{j_t} \rangle - A_{i_t j_t}\right) h_{j_t} + \lambda w_{i_t} \right] \quad \text{and} \tag{9}
\]
\[
h_{j_t} \leftarrow h_{j_t} - s_t \cdot \left[ \left(\langle w_{i_t}, h_{j_t} \rangle - A_{i_t j_t}\right) w_{i_t} + \lambda h_{j_t} \right]. \tag{10}
\]
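In code, one SGD step on a single observed rating amounts to the following sketch; the in-place NumPy update is our own illustration, and a tuned implementation would use a step-size schedule such as (11) rather than a constant step.

```python
def sgd_update(i, j, a_ij, W, H, lam, step):
    """One stochastic gradient step, i.e. updates (9) and (10), on rating A[i, j]."""
    err = W[i] @ H[j] - a_ij                 # <w_i, h_j> - A_ij
    grad_w = err * H[j] + lam * W[i]         # both gradients use the old values
    grad_h = err * W[i] + lam * H[j]
    W[i] -= step * grad_w
    H[j] -= step * grad_h
```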

3. NOMAD
In NOMAD we use an optimization scheme based on SGD. In order to justify this choice, we find it instructive to first understand the updates performed by ALS, coordinate descent, and SGD on a bipartite graph which is constructed as follows: the $i$-th user node corresponds to $w_i$, the $j$-th item node corresponds to $h_j$, and an edge $(i, j)$ indicates that user $i$ has rated item $j$ (see Figure 1). Both the ALS update (3) and the coordinate descent update (6) for $w_i$ require us to access the values of $h_j$ for all $j \in \Omega_i$. This is shown in Figure 1 (a), where the black node corresponds to $w_i$ while the gray nodes correspond to $h_j$ for $j \in \Omega_i$. On the other hand, the SGD update to $w_i$ (9) only requires us to retrieve the value of $h_j$ for a single random $j \in \Omega_i$ (Figure 1 (b)). What this means is that, in contrast to ALS or CCD, multiple SGD updates can be carried out simultaneously in parallel without interfering with each other. Put another way, SGD has a higher potential for finer-grained parallelism than the other approaches, and therefore we use it as our optimization scheme in NOMAD.

3.1 Description
For now we will denote each parallel computing unit as a worker: in a shared memory setting a worker is a thread, and in a distributed memory architecture a worker is a machine. This abstraction allows us to present NOMAD in a unified manner. Of course, NOMAD can be used in a hybrid setting where there are multiple threads spread across multiple machines; this will be discussed in Section 3.4.

In NOMAD, the users $\{1, 2, \ldots, m\}$ are split into $p$ disjoint sets $I_1, I_2, \ldots, I_p$ which are of approximately equal size¹. This induces a partition of the rows of the ratings matrix $A$. The $q$-th worker stores $n$ sets of indices $\Omega_j^{(q)}$, for $j \in \{1, \ldots, n\}$, which are defined as
\[
\Omega_j^{(q)} := \left\{ (i, j) \in \Omega : i \in I_q \right\},
\]
as well as the corresponding values of $A$. Note that once the data is partitioned and distributed to the workers, it is never moved during the execution of the algorithm.
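A simple way to realize this data layout is sketched below; the round-robin split of users into I_1, ..., I_p and the dictionary-of-lists representation of the sets Ω_j^(q) are our own illustrative choices (the paper only requires the sets I_q to be of approximately equal size).

```python
import numpy as np

def partition_ratings(rows, cols, vals, m, p):
    """Build the per-worker index structure used in the sketches below.

    Ratings are given in coordinate form (rows, cols, vals). The returned
    structure maps worker q and item j to the list of (i, A_ij) pairs in
    Omega_j^(q).
    """
    owner_of_user = np.arange(m) % p          # a simple balanced split of users
    ratings = [dict() for _ in range(p)]      # ratings[q][j] -> list of (i, A_ij)
    for i, j, a_ij in zip(rows, cols, vals):
        q = owner_of_user[i]
        ratings[q].setdefault(j, []).append((i, a_ij))
    return owner_of_user, ratings
```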

Recall that there are two types of parameters in matrix completion: user parameters $w_i$ and item parameters $h_j$. In NOMAD, the $w_i$'s are partitioned according to $I_1, I_2, \ldots, I_p$; that is, the $q$-th worker stores and updates $w_i$ for $i \in I_q$. The variables in $W$ are partitioned at the beginning and never move across workers during the execution of the algorithm. On the other hand, the $h_j$'s are split randomly into $p$ partitions at the beginning, and their ownership changes as the algorithm progresses. At each point in time an $h_j$ variable resides in one and only one worker, and it moves to another worker after it is processed, independently of the other item variables. Hence these are nomadic variables².
Processing an item variable $h_j$ at the $q$-th worker entails executing the SGD updates (9) and (10) on the ratings in the set $\Omega_j^{(q)}$. Note that these updates only require access to $h_j$ and $w_i$ for $i \in I_q$; since the $I_q$'s are disjoint, each $w_i$ variable is accessed by only one worker. This is why communication of the $w_i$ variables is not necessary. On the other hand, $h_j$ is updated only by the worker that currently owns it, so there is no need for a lock; this is the popular owner-computes rule in parallel computing. See Figure 2.

We now formally define the NOMAD algorithm (see Algorithm 1 for detailed pseudo-code). Each worker $q$ maintains its own concurrent queue, queue[q], which contains a list of items it has to process. Each element of the list consists of the index of an item $j$ ($1 \le j \le n$) and the corresponding $k$-dimensional parameter vector $h_j$; this pair is denoted as $(j, h_j)$. Each worker $q$ pops a $(j, h_j)$ pair from its own queue, queue[q], and runs stochastic gradient descent updates on $\Omega_j^{(q)}$, which is the set of ratings of item $j$ locally stored in worker $q$ (lines 16 to 21). This changes the values of $w_i$ for $i \in I_q$ and of $h_j$. After all the updates on item $j$ are done, a uniformly random worker $q'$ is sampled (line 22) and the updated $(j, h_j)$ pair is pushed into the queue of that worker $q'$ (line 23). Note that this is the only time a worker communicates with another worker. Also note that the nature of this communication is asynchronous and non-blocking. Furthermore, as long as there are items in the queue, the computations are completely asynchronous and decentralized. Moreover, all workers are symmetric; that is, there is no designated master or slave.

3.2 Complexity Analysis
First, we consider the case when the problem is distributed across $p$ workers and study how the space and time complexity behaves as a function of $p$.

¹ An alternative strategy is to split the users such that each set has approximately the same number of ratings.
² Due to symmetry in the formulation of the matrix completion problem, one can also make the $w_i$'s nomadic and partition the $h_j$'s. Since usually the number of users is much larger than the number of items, this leads to more communication, and therefore we make the $h_j$ variables nomadic.


[Figure: four snapshots of the ratings matrix split across 4 workers, showing the rows and active columns each worker owns.]
(a) Initial assignment of W and H. Each worker works only on the diagonal active area in the beginning.
(b) After a worker finishes processing column j, it sends the corresponding item parameter h_j to another worker. Here h_2 is sent from worker 1 to worker 4.
(c) Upon receipt, the column is processed by the new worker. Here worker 4 can now process column 2 since it owns the column.
(d) During the execution of the algorithm, the ownership of the item parameters h_j changes.
Figure 2: Illustration of the NOMAD algorithm.

Algorithm 1: the basic NOMAD algorithm
1:  λ: regularization parameter
2:  {s_t}: step size sequence
3:  // initialize parameters
4:  w_il ~ UniformReal(0, 1/√k) for 1 ≤ i ≤ m, 1 ≤ l ≤ k
5:  h_jl ~ UniformReal(0, 1/√k) for 1 ≤ j ≤ n, 1 ≤ l ≤ k
6:  // initialize queues
7:  for j ∈ {1, 2, ..., n} do
8:    q ~ UniformDiscrete{1, 2, ..., p}
9:    queue[q].push((j, h_j))
10: end for
11: // start p workers
12: Parallel Foreach q ∈ {1, 2, ..., p}
13:   while stop signal is not yet received do
14:     if queue[q] not empty then
15:       (j, h_j) ← queue[q].pop()
16:       for (i, j) ∈ Ω_j^(q) do
17:         // SGD update
18:         t ← number of updates on (i, j)
19:         w_i ← w_i − s_t · [(⟨w_i, h_j⟩ − A_ij) h_j + λ w_i]
20:         h_j ← h_j − s_t · [(⟨w_i, h_j⟩ − A_ij) w_i + λ h_j]
21:       end for
22:       q′ ~ UniformDiscrete{1, 2, ..., p}
23:       queue[q′].push((j, h_j))
24:     end if
25:   end while
26: Parallel End

Each worker has to store a $1/p$ fraction of the $m$ user parameters and approximately a $1/p$ fraction of the $n$ item parameters. Furthermore, each worker also stores approximately a $1/p$ fraction of the $|\Omega|$ ratings. Since storing a row of $W$ or $H$ requires $O(k)$ space, the space complexity per worker is $O((mk + nk + |\Omega|)/p)$. As for time complexity, we find it useful to make the following assumptions: performing the SGD updates in lines 16 to 21 takes $a \cdot k$ time, and communicating a $(j, h_j)$ pair to another worker takes $c \cdot k$ time, where $a$ and $c$ are hardware-dependent constants. On average, each $(j, h_j)$ pair contains $O(|\Omega|/(np))$ non-zero entries. Therefore, when a $(j, h_j)$ pair is popped from queue[q] in line 15 of Algorithm 1, on average it takes $a \cdot |\Omega| k/(np)$ time to process the pair. Since computation and communication can be done in parallel, as long as $a \cdot |\Omega| k/(np)$ is higher than $c \cdot k$, a worker thread is always busy and NOMAD scales linearly.
Suppose that $|\Omega|$ is fixed but the number of workers $p$ increases; that is, we take a fixed size dataset and distribute it across $p$ workers. As expected, for a large enough value of $p$ (which is determined by the hardware-dependent constants $a$ and $c$), the cost of communication will overwhelm the cost of processing an item, thus leading to a slowdown.
On the other hand, suppose the work per worker is fixed; that is, $|\Omega|$ increases and the number of workers $p$ increases proportionally. The average time $a \cdot |\Omega| k/(np)$ to process an item remains constant, and NOMAD scales linearly.
Finally, we discuss the communication complexity of NOMAD. For this discussion we focus on a single item parameter $h_j$, which consists of $O(k)$ numbers. In order to be processed by all the $p$ workers once, it needs to be communicated $p$ times. This requires $O(kp)$ communication per item. There are $n$ items, and if we make the simplifying assumption that during the execution of NOMAD each item is processed a constant $c$ number of times by each processor, then the total communication complexity is $O(nkp)$.

3.3 Dynamic Load Balancing
As different workers have a different number of ratings per item, the speed at which a worker processes a set of ratings $\Omega_j^{(q)}$ for an item $j$ also varies among workers. Furthermore, in the distributed memory setting, different workers might process updates at different rates due to differences in hardware and system load. NOMAD can handle this by dynamically balancing the workload of workers: in line 22 of Algorithm 1, instead of sampling the recipient of a message uniformly at random, we can preferentially select a worker which has fewer items in its queue to process. To do this, a payload carrying information about the size of queue[q] is added to the messages that the workers send each other. The overhead of passing the payload information is just a single integer per message. This scheme allows us to dynamically load balance, and ensures that a slower worker will receive a smaller amount of work compared to others.
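The paper does not pin down the exact selection rule, so the sketch below shows one plausible choice for the recipient in line 22: sample a worker with probability inversely proportional to its last reported queue length (the queue-size estimates are assumed to arrive as message payloads).

```python
import random

def pick_recipient(queue_size_estimates):
    """Load-balanced replacement for line 22 of Algorithm 1.

    queue_size_estimates[q] is the latest queue length reported by worker q.
    """
    # Weight each worker inversely to its backlog (+1 avoids division by zero).
    weights = [1.0 / (s + 1) for s in queue_size_estimates]
    return random.choices(range(len(weights)), weights=weights, k=1)[0]
```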

3.4 Hybrid Architecture
In a hybrid architecture we have multiple threads on a single machine, as well as multiple machines distributed across the network. In this case we make two improvements to the basic algorithm. First, in order to amortize the communication costs, we reserve two additional threads per machine for sending and receiving $(j, h_j)$ pairs over the network. Intra-machine communication is much cheaper than machine-to-machine communication, since the former does not involve


a network hop. Therefore, whenever a machine receives a $(j, h_j)$ pair, it circulates the pair among all of its threads before sending the pair over the network. This is done by uniformly sampling a random permutation whose size equals the number of worker threads, and sending the item variable to each thread according to this permutation. Circulating a variable more than once was found not to improve convergence, and hence is not used in our algorithm.

3.5 Implementation Details
Multi-threaded MPI was used for inter-machine communication. Instead of communicating single $(j, h_j)$ pairs, we follow the strategy of [23] and accumulate a fixed number of pairs (e.g., 100) before transmitting them over the network.
NOMAD can be implemented with lock-free data structures, since the only interaction between threads is via operations on the queue. We used the concurrent queue provided by Intel Thread Building Blocks (TBB) [3]. Although technically not lock-free, the TBB concurrent queue nevertheless scales almost linearly with the number of threads.
Since there is very minimal sharing of memory across threads in NOMAD, by carefully aligning the memory assignments of each thread with cache lines, we can exploit memory locality and avoid cache ping-pong. This results in near-linear scaling in the multi-threaded setting.

4. RELATED WORK
4.1 Map-Reduce and Friends
Since many machine learning algorithms are iterative in nature, a popular strategy to distribute them across multiple machines is to use bulk synchronization after every iteration. Typically, one partitions the data into chunks that are distributed to the workers at the beginning. A master communicates the current parameters, which are used to perform computations on the slaves. The slaves return the solutions during the bulk synchronization step, which are used by the master to update the parameters. The popularity of this strategy is partly due to the widespread availability of Hadoop [1], an open source implementation of the MapReduce framework [9].

All three optimization schemes for matrix completion, namely ALS, CCD++, and SGD, can be parallelized using a bulk synchronization strategy. This is relatively simple for ALS [27] and CCD++ [26], but a bit more involved for SGD [12, 18]. Suppose $p$ machines are available. Then the Distributed Stochastic Gradient Descent (DSGD) algorithm of Gemulla et al. [12] partitions the indices of users $\{1, 2, \ldots, m\}$ into mutually exclusive sets $I_1, I_2, \ldots, I_p$ and the indices of items into $J_1, J_2, \ldots, J_p$. Now define
\[
\Omega^{(q)} := \left\{ (i, j) \in \Omega : i \in I_q,\; j \in J_q \right\}, \quad 1 \le q \le p,
\]
and suppose that each machine runs the SGD updates (9) and (10) independently, but machine $q$ samples $(i, j)$ pairs only from $\Omega^{(q)}$. By construction the $\Omega^{(q)}$'s are disjoint, and hence these updates can be run in parallel. A similar observation was also made by Recht and Re [18]. A bulk synchronization step redistributes the sets $J_1, J_2, \ldots, J_p$ and the corresponding rows of $H$, which in turn changes the $\Omega^{(q)}$ processed by each machine, and the iteration proceeds (see Figure 3).
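For comparison with NOMAD, the following sketch simulates one DSGD epoch serially; blocks[q][r] holds the ratings with i in I_q and j in J_r (our own data layout), and within each sub-epoch the p selected blocks could be processed in parallel because they touch disjoint rows of W and H.

```python
def dsgd_epoch(blocks, W, H, lam, step, p):
    """One DSGD epoch, simulated serially.

    In each of the p sub-epochs, worker q processes block (I_q, J_{(q+s) mod p});
    these p blocks share no users or items, so they are mutually independent.
    """
    for s in range(p):                       # one bulk-synchronous sub-epoch
        for q in range(p):                   # independent blocks of this sub-epoch
            r = (q + s) % p
            for i, j, a_ij in blocks[q][r]:
                err = W[i] @ H[j] - a_ij
                grad_w = err * H[j] + lam * W[i]
                grad_h = err * W[i] + lam * H[j]
                W[i] -= step * grad_w
                H[j] -= step * grad_h
```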

Unfortunately, bulk synchronization based algorithms have two major drawbacks. First, the communication and computation steps are done in sequence.

[Figure: a 4 × 4 block grid of the ratings matrix showing the diagonal active blocks processed by each of 4 workers before and after an exchange of item blocks.]
Figure 3: Illustration of the DSGD algorithm with 4 workers. Initially, W and H are partitioned as shown on the left. Each worker runs SGD on its active area as indicated. After each worker completes processing the data points in its own active area, the columns of the item parameters H^T are exchanged randomly, and the active area changes. This process is repeated for each iteration.

What this means is that when the CPU is busy, the network is idle, and vice versa. The second issue is that they suffer from what is widely known as the curse of the last reducer [4, 24]; in other words, all machines have to wait for the slowest machine to finish before proceeding to the next iteration. Zhuang et al. [28] report that DSGD suffers from this problem even in the shared memory setting.

DSGD++ is an algorithm proposed by Teflioudi et al. [25] to address the first issue discussed above. Instead of using $p$ partitions, DSGD++ uses $2p$ partitions. While the $p$ workers are processing $p$ partitions, the other $p$ partitions are sent over the network. This keeps both the network and the CPU busy simultaneously. However, DSGD++ also suffers from the curse of the last reducer.

Another attempt to alleviate the problems of bulk synchronization in the shared memory setting is the FPSGD algorithm of Zhuang et al. [28]: given $p$ threads, FPSGD partitions the parameters into more than $p$ sets and uses a task manager thread to distribute the partitions. When a thread finishes updating one partition, it requests another partition from the task manager. It is unclear how to extend this idea to the distributed memory setting.

In NOMAD we sidestep all the drawbacks of bulk synchronization. Like DSGD++, we also simultaneously keep the network and the CPU busy. On the other hand, like FPSGD, we effectively load balance between the threads. To understand why NOMAD enjoys both these benefits, it is instructive to contrast the data partitioning schemes underlying DSGD, DSGD++, FPSGD, and NOMAD (see Figure 4). Given $p$ workers, DSGD divides the rating matrix $A$ into $p \times p$ blocks. DSGD++ improves upon DSGD by further dividing each block into $1 \times 2$ sub-blocks (Figure 4 (a) and (b)). On the other hand, FPSGD splits $A$ into $p' \times p'$ blocks with $p' > p$ (Figure 4 (c)), while NOMAD uses $p \times n$ blocks (Figure 4 (d)). In terms of communication there is no difference between the various partitioning schemes: all of them require $O(nkp)$ communication for each item to be processed a constant $c$ number of times. However, having smaller blocks means that NOMAD has much more flexibility in assigning blocks to processors, and hence a better ability to exploit parallelism.


[Figure: four panels showing how the rating matrix is partitioned into blocks by (a) DSGD, (b) DSGD++, (c) FPSGD, and (d) NOMAD.]
Figure 4: Comparison of data partitioning schemes between algorithms. Example active area of stochastic gradient sampling is marked as gray.

Because NOMAD operates at the level of individual item parameters $h_j$, it can dynamically load balance by assigning fewer columns to a slower worker. A pleasant side effect of such a fine-grained partitioning, coupled with the lock-free nature of the updates, is that one does not require sophisticated scheduling algorithms to achieve good performance. Consequently, NOMAD outperforms DSGD, DSGD++, and FPSGD.

4.2 Asynchronous Algorithms
There is growing interest in designing machine learning algorithms that do not perform bulk synchronization. See, for instance, the randomized (block) coordinate descent methods of Richtarik and Takac [20] and the Hogwild algorithm of Recht et al. [19]. A relatively new approach to asynchronous parallelism is to use a so-called parameter server. A parameter server is either a single machine or a distributed set of machines which caches the current values of the parameters. Workers store local copies of the parameters and perform updates on them, and periodically synchronize their local copies with the parameter server. The parameter server receives updates from all workers, aggregates them, and communicates them back to the workers. The earliest work on a parameter server that we are aware of is due to Smola and Narayanamurthy [23], who propose using a parameter server for collapsed Gibbs sampling in Latent Dirichlet Allocation. PowerGraph [13], upon which the latest version of the GraphLab toolkit is based, is also essentially based on the idea of a parameter server. However, the difference in the case of PowerGraph is that the responsibility for the parameters is distributed across multiple machines, but at the added expense of synchronizing the copies.

Very roughly speaking, the asynchronously parallel version of the ALS algorithm in GraphLab works as follows: the $w_i$ and $h_j$ variables are distributed across multiple machines, and whenever $w_i$ is being updated with equation (3), the values of the $h_j$'s for $j \in \Omega_i$ are retrieved across the network and read-locked until the update is finished. GraphLab provides functionality such as network communication and a distributed locking mechanism to implement this. However, frequently acquiring read-locks over the network can be expensive. In particular, a popular user who has rated many items will require read-locks on a large number of items, and this will lead to a vast amount of communication and delays in updates on those items. GraphLab provides a complex job scheduler which attempts to minimize this cost, but then the efficiency of parallelization depends on the difficulty of the scheduling problem and the effectiveness of the scheduler.

In our empirical evaluation, NOMAD performs significantly better than GraphLab. The reasons are not hard to see. First, because of the lock-free nature of NOMAD, we completely avoid acquiring expensive network locks. Second, we use SGD, which allows us to exploit finer-grained parallelism as compared to ALS and also leads to faster convergence. In fact, the GraphLab framework is not well suited for SGD (personal communication with the developers of GraphLab). Finally, because of the finer-grained data partitioning scheme used in NOMAD, unlike GraphLab, whose performance heavily depends on the underlying scheduling algorithms, we do not require a complicated scheduling mechanism.

4.3 Numerical Linear Algebra
The concepts of asynchronous and non-blocking updates have also been studied in numerical linear algebra. To avoid the load balancing problem and to reduce processor idle time, asynchronous numerical methods were first proposed over four decades ago by Chazan and Miranker [8]. Given an operator $H : \mathbb{R}^m \rightarrow \mathbb{R}^m$, to find the fixed-point solution $x^*$ such that $H(x^*) = x^*$, a standard Gauss-Seidel-type procedure performs the update $x_i = (H(x))_i$ sequentially (or randomly). Using the asynchronous procedure, each computational node asynchronously conducts updates on each variable (or a subset), $x_i^{\text{new}} = (H(x))_i$, and then overwrites $x_i$ in common memory by $x_i^{\text{new}}$. Theory and applications of this asynchronous method have been widely studied (see the literature review of Frommer and Szyld [11] and the seminal textbook by Bertsekas and Tsitsiklis [6]). The concept of this asynchronous fixed-point update is very closely related to the Hogwild algorithm of Recht et al. [19] or the so-called Asynchronous SGD (ASGD) method proposed by Teflioudi et al. [25]. Unfortunately, such algorithms are non-serializable; that is, there may not exist an equivalent update ordering in a serial implementation. In contrast, our NOMAD algorithm is not only asynchronous but also serializable, and therefore achieves faster convergence in practice.

On the other hand, non-blocking communication has also been proposed to accelerate iterative solvers in a distributed setting. For example, Hoefler et al. [14] presented a distributed conjugate gradient implementation with non-blocking collective MPI operations for solving linear systems. However, this algorithm still requires synchronization at each CG iteration, so it is very different from our NOMAD algorithm.

4.4 Discussion
We remark that, among the algorithms we have discussed so far, NOMAD is the only distributed-memory algorithm which is both asynchronous and lock-free. Other parallel versions of SGD, such as DSGD and DSGD++, are lock-free but not fully asynchronous; therefore the cost of synchronization increases as the number of machines grows [28]. On the other hand, the GraphLab implementation of ALS [17] is asynchronous but not lock-free, and therefore depends on


a complex job scheduler to reduce the side effects of using locks.

5. EXPERIMENTS
In this section we evaluate the empirical performance of NOMAD with extensive experiments. For the distributed memory experiments we compare NOMAD with DSGD [12], DSGD++ [25], and CCD++ [26]. We also compare against GraphLab, but the quality of the results produced by GraphLab is significantly worse than that of the other methods, and therefore the plots for this experiment are delegated to Appendix F. For the shared memory experiments we pit NOMAD against FPSGD [28] (which is shown to outperform DSGD in single machine experiments) as well as CCD++. Our experiments are designed to answer the following questions:
• How does NOMAD scale with the number of cores on a single machine? (Section 5.2)
• How does NOMAD scale as a fixed size dataset is distributed across multiple machines? (Section 5.3)
• How does NOMAD perform on a commodity hardware cluster? (Section 5.4)
• How does NOMAD scale when both the size of the data as well as the number of machines grow? (Section 5.5)

Since the objective function (1) is non-convex, different optimizers will converge to different solutions. Factors which affect the quality of the final solution include: 1) the initialization strategy, 2) the sequence in which the ratings are accessed, and 3) the step size decay schedule. It is clearly not feasible to consider the combinatorial effect of all these factors on each algorithm. However, we believe that the overall trend of our results is not affected by these factors.

5.1 Experimental Setup
Publicly available code for FPSGD³ and CCD++⁴ was used in our experiments. For DSGD and DSGD++, which we had to implement ourselves because the code is not publicly available, we closely followed the recommendations of Gemulla et al. [12] and Teflioudi et al. [25], and in some cases made improvements based on our experience. For a fair comparison, all competing algorithms were tuned for optimal performance on our hardware. The code and scripts required for reproducing the experiments are readily available for download from https://sites.google.com/site/hyokunyun/software. Parameters used in our experiments are summarized in Table 1.

Table 1: Dimensionality parameter k, regularization parameter λ (1), and step-size schedule parameters α, β (11).

  Name         k    λ     α        β
  Netflix      100  0.05  0.012    0.05
  Yahoo Music  100  1.00  0.00075  0.01
  Hugewiki     100  0.01  0.001    0

For all experiments except the ones in Section 5.5, we work with three benchmark datasets, namely Netflix, Yahoo Music, and Hugewiki (see Table 2 for more details).

³ http://www.csie.ntu.edu.tw/~cjlin/libmf
⁴ http://www.cs.utexas.edu/~rofuyu/libpmf

Table 2: Dataset details.

  Name              Rows        Columns  Non-zeros
  Netflix [5]       2,649,429   17,770   99,072,112
  Yahoo Music [10]  1,999,990   624,961  252,800,275
  Hugewiki [2]      50,082,603  39,780   2,736,496,604

The same training and test dataset partition is used consistently for all algorithms in every experiment. Since our goal is to compare optimization algorithms, we do very minimal parameter tuning. For instance, we used the same regularization parameter λ for each dataset as reported by Yu et al. [26] and shown in Table 1; we study the effect of the regularization parameter on the convergence of NOMAD in Appendix A. By default we use k = 100 for the dimension of the latent space; we study how the dimension of the latent space affects the convergence of NOMAD in Appendix B. All algorithms were initialized with the same initial parameters: we set each entry of $W$ and $H$ by independently sampling a uniformly random variable in the range $(0, 1/\sqrt{k})$ [26, 28].

We compare solvers in terms of Root Mean Square Error (RMSE) on the test set, which is defined as
\[
\sqrt{\frac{\sum_{(i,j) \in \Omega^{\text{test}}} \left(A_{ij} - \langle w_i, h_j \rangle\right)^2}{\left|\Omega^{\text{test}}\right|}},
\]
where $\Omega^{\text{test}}$ denotes the ratings in the test set.
All experiments except the ones reported in Section 5.4

are run using the Stampede cluster at the University of Texas, a Linux cluster where each node is outfitted with 2 Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi coprocessor (MIC architecture). For single-machine experiments (Section 5.2) we used nodes in the largemem queue, which are equipped with 1TB of RAM and 32 cores. For all other experiments we used nodes in the normal queue, which are equipped with 32 GB of RAM and 16 cores (only 4 out of the 16 cores were used for computation). Inter-machine communication on this system is handled by MVAPICH2.

For the commodity hardware experiments in Section 5.4, we used m1.xlarge instances of Amazon Web Services, which are equipped with 15GB of RAM and four cores. We utilized all four cores in each machine: NOMAD and DSGD++ use two cores for computation and two cores for network communication, while DSGD and CCD++ use all four cores for both computation and communication. Inter-machine communication on this system is handled by MPICH2.

Since FPSGD uses single precision arithmetic, the experiments in Section 5.2 are performed using single precision arithmetic, while all other experiments use double precision arithmetic. All algorithms are compiled with the Intel C++ compiler, with the exception of the experiments in Section 5.4, where we used gcc, the only compiler toolchain available on the commodity hardware cluster. For ready reference, exceptions to the experimental settings specific to each section are summarized in Table 3.

The convergence speed of stochastic gradient descent methods depends on the choice of the step size schedule. The schedule we used for NOMAD is

\[
s_t = \frac{\alpha}{1 + \beta \cdot t^{1.5}}, \qquad (11)
\]

where t is the number of SGD updates that were performed on a particular user-item pair (i, j).


Table 3: Exceptions to each experiment.

Section 5.2:  • run on largemem queue (32 cores, 1TB RAM)
              • single precision floating point used
Section 5.4:  • run on m1.xlarge (4 cores, 15GB RAM)
              • compiled with gcc
              • MPICH2 for MPI implementation
Section 5.5:  • synthetic datasets

DSGD and DSGD++, on the other hand, use an alternative strategy called bold-driver [12]: here, the step size is adapted by monitoring the change of the objective function.
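For reference, a minimal C++ sketch of the schedule (11) used by NOMAD (illustrative only; α and β are the per-dataset values from Table 1, and t is the number of updates already performed on the pair):

#include <cmath>

// Illustrative sketch: step size for the t-th SGD update on a given (i, j) pair, cf. (11).
double step_size(double alpha, double beta, long t) {
  return alpha / (1.0 + beta * std::pow(static_cast<double>(t), 1.5));
}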

5.2 Scaling in Number of Cores

For the first experiment, we fixed the number of cores to 30 and compared the performance of NOMAD with FPSGD⁵ and CCD++ (Figure 5). On Netflix (left), NOMAD not only converges to a slightly better quality solution (RMSE 0.914 vs. 0.916 for the others) but is also able to reduce the RMSE rapidly right from the beginning. On Yahoo Music (middle), NOMAD converges to a slightly worse solution than FPSGD (RMSE 21.894 vs. 21.853), but as in the case of Netflix, the initial convergence is more rapid. On Hugewiki the difference is smaller, but NOMAD still outperforms. The initial speed of CCD++ on Hugewiki is comparable to NOMAD, but the quality of its solution starts to deteriorate in the middle. Note that the performance of CCD++ here is better than what was reported in Zhuang et al. [28], since they used double-precision floating point arithmetic for CCD++. In other experiments (not reported here) we varied the number of cores and found that the relative difference in performance between NOMAD, FPSGD, and CCD++ is very similar to that observed in Figure 5.

For the second experiment, we varied the number of cores from 4 to 30 and plot the scaling behavior of NOMAD (Figures 6 and 7). Figure 6 (left) shows how test RMSE changes as a function of the number of updates on Yahoo Music. Interestingly, as we increased the number of cores, the test RMSE decreased faster. We believe this is because when we increase the number of cores, the rating matrix A is partitioned into smaller blocks; recall that we split A into p × n blocks, where p is the number of parallel workers. Therefore, the communication between workers becomes more frequent, and each SGD update is based on fresher information (see also Section 3.2 for mathematical analysis). This effect was more strongly observed on Yahoo Music than on the other datasets, since Yahoo Music has a much larger number of items (624,961 vs. 17,770 for Netflix and 39,780 for Hugewiki), and therefore a larger amount of communication is needed to circulate the new information to all workers. Results for the other datasets are provided in Figure 18 in Appendix D.

On the other hand, to assess the efficiency of computation, we define average throughput as the average number of ratings processed per core per second, and plot it for each dataset in Figure 6 (right) while varying the number of cores. If NOMAD exhibits linear scaling in terms of the speed at which it processes ratings, the average throughput should remain constant.⁶

⁵ Since the current implementation of FPSGD in LibMF only reports CPU execution time, we divide this by the number of threads and use it as a proxy for wall clock time.

Figure 6: Left: Test RMSE of NOMAD as a function of the number of updates on Yahoo Music when the number of cores is varied. Right: Number of updates of NOMAD per core per second as a function of the number of cores.

On Netflix, the average throughput indeed remains almost constant as the number of cores changes. On Yahoo Music and Hugewiki, the throughput decreases to about 50% as the number of cores is increased to 30; we believe this is mainly due to cache locality effects.

Now we study how much speed-up NOMAD can achieve by increasing the number of cores. In Figure 7, we set the y-axis to be test RMSE and the x-axis to be the total CPU time expended, which is given by the number of seconds elapsed multiplied by the number of cores. We plot the convergence curves for cores = 4, 8, 16, and 30. If the curves overlap, this shows that we achieve linear speed-up as we increase the number of cores. This is indeed the case for Netflix and Hugewiki. In the case of Yahoo Music, we observe that the speed of convergence increases as the number of cores increases; this, we believe, is again due to the decrease in the block size, which leads to faster convergence.

5.3 Scaling as a Fixed Dataset is Distributed Across Workers

In this subsection we use 4 computation threads per machine. For the first experiment, we fix the number of machines to 32 (64 for Hugewiki) and compare the performance of NOMAD with DSGD, DSGD++, and CCD++ (Figure 8). On Netflix and Hugewiki, NOMAD converges much faster than its competitors: not only is the initial convergence faster, but it also discovers a better quality solution. On Yahoo Music, all four methods perform similarly. This is because the cost of network communication relative to the size of the data is much higher for Yahoo Music: while Netflix and Hugewiki have 5,575 and 68,635 non-zero ratings per item respectively, Yahoo Music has only 404 ratings per item. Therefore, when Yahoo Music is divided equally across 32 machines, each item has only about 10 ratings on average per machine. Hence the cost of sending and receiving the item parameter vector h_j for one item j across the network is higher than that of executing SGD updates on the ratings of the item locally stored within the machine, Ω_j^{(q)}. As a consequence, the cost of network communication dominates the overall execution time of all algorithms, and little difference in convergence speed is found between them.

⁶ Note that since we use single-precision floating point arithmetic in this section to match the implementation of FPSGD, the throughput of NOMAD is about 50% higher than in the other experiments.


Figure 5: Comparison of NOMAD, FPSGD, and CCD++ on a single machine with 30 computation cores.

Figure 7: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of cores) when the number of cores is varied.

For the second experiment, we varied the number of machines from 1 to 32 and plot the scaling behavior of NOMAD (Figures 10 and 9). Figure 10 (left) shows how test RMSE decreases as a function of the number of updates on Yahoo Music. Again, if NOMAD scales linearly, the average throughput has to remain constant; here, we observe an improvement in convergence speed when 8 or more machines are used. This is again the effect of smaller block sizes, which was discussed in Section 5.2. On Netflix a similar effect was present but less significant; on Hugewiki we did not see any notable difference between configurations (see Figure 19 in Appendix D).

In Figure 10 (right), we plot the average throughput (the number of updates per machine per core per second) as a function of the number of machines. On Yahoo Music, the average throughput goes down as we increase the number of machines because, as mentioned above, each item has a small number of ratings. On Hugewiki we observe almost linear scaling, and on Netflix the average throughput even improves as we increase the number of machines; we believe this is because of cache locality effects. As we partition users into smaller and smaller blocks, the probability of a cache miss on the user parameters w_i within a block decreases, and on Netflix this makes a meaningful difference: indeed, there are only 480,189 users in Netflix who have at least one rating. When these are divided equally across 32 machines, each machine contains only 11,722 active users on average. Therefore the w_i variables take only 11MB of memory, which is smaller than the size of the L3 cache (20MB) of the machine we used, and this leads to an increase in the number of updates per machine per core per second.

Now we study how much speed-up NOMAD can achieve by increasing the number of machines. In Figure 9, we set the y-axis to be test RMSE and the x-axis to be the number of seconds elapsed multiplied by the total number of cores used in the configuration. Again, all lines would coincide with each other if NOMAD showed linear scaling. On Netflix, with 2 and 4 machines we observe mild slowdown, but with more than 4 machines NOMAD exhibits super-linear scaling. On Yahoo Music we observe super-linear scaling with respect to the speed of a single machine in all configurations, but the highest speedup is seen with 16 machines. On Hugewiki, linear scaling is observed in every configuration.

5.4 Scaling on Commodity Hardware

In this subsection we analyze the scaling behavior of NOMAD on commodity hardware. Using Amazon Web Services (AWS), we set up a computing cluster that consists of 32 machines; each machine is of type m1.xlarge and equipped with a quad-core Intel Xeon E5430 CPU and 15GB of RAM. Network bandwidth among these machines is reported to be approximately 1Gb/s.⁷

Since NOMAD and DSGD++ dedicate two threads to network communication on each machine, only two cores are available for computation.⁸

⁷ http://epamcloud.blogspot.com/2013/03/testing-amazon-ec2-network-speed.html
⁸ Since network communication is not computation-intensive, for DSGD++ we used four computation threads instead of two and got better results; thus we report results with four computation threads for DSGD++.


Figure 8: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a HPC cluster.

Figure 9: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a HPC cluster when the number of machines is varied.

In contrast, bulk synchronization algorithms such as DSGD and CCD++, which separate computation and communication, can utilize all four cores for computation. In spite of this disadvantage, Figure 11 shows that NOMAD outperforms all other algorithms in this setting as well. In this plot we fixed the number of machines to 32; on Netflix and Hugewiki, NOMAD converges more rapidly to a better solution. Recall that on Yahoo Music all four algorithms performed very similarly on the HPC cluster in Section 5.3. However, on commodity hardware NOMAD outperforms the other algorithms. This shows that the efficiency of network communication plays a very important role in commodity hardware clusters, where the communication is relatively slow. On Hugewiki, however, the number of columns is very small compared to the number of ratings, and thus network communication plays a smaller role in this dataset than in the others. Therefore, the initial convergence of DSGD is a bit faster than that of NOMAD, as it uses all four cores for computation while NOMAD uses only two. Still, the overall convergence speed is similar, and NOMAD finds a better quality solution.

As in Section 5.3, we increased the number of machines from 1 to 32 and studied the scaling behavior of NOMAD. The overall trend was identical to what we observed in Figures 10 and 9; due to page constraints, the plots for this experiment can be found in Appendix C.

5.5 Scaling as both Dataset Size and Number of Machines Grow

In previous sections (Section 5.3 and Section 5.4) we studied the scalability of algorithms by partitioning a fixed amount of data across an increasing number of machines.

Figure 10: Results on the HPC cluster when the number of machines is varied. Left: Test RMSE of NOMAD as a function of the number of updates on Netflix and Yahoo Music. Right: Number of updates of NOMAD per machine per core per second as a function of the number of machines.

In real-world applications of collaborative filtering, however, the size of the data should grow over time as new users are added to the system. Therefore, to match the increased amount of data with an equivalent amount of physical memory and computational power, the number of machines should increase as well. The aim of this section is to compare the scaling behavior of NOMAD and that of other algorithms in this realistic scenario.

To simulate such a situation, we generated synthetic datasets whose characteristics resemble real data: the number of ratings for each user and each item is sampled from the corresponding empirical distribution of the Netflix data. As we increase the number of machines from 4 to 32, we fixed the number of items to be the same as that of Netflix (17,770) and increased the number of users to be proportional to the number of machines (480,189 times the number of machines⁹).


Figure 11: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a commodity hardware cluster.

Therefore, the expected number of ratings in each dataset is proportional to the number of machines (99,072,112 times the number of machines) as well.

Conditioned on the number of ratings for each user and item, the nonzero locations are sampled uniformly at random. Ground-truth user parameters w_i's and item parameters h_j's are generated from a 100-dimensional standard isotropic Gaussian distribution, and for each rating A_ij, Gaussian noise with mean zero and standard deviation 0.1 is added to the "true" rating ⟨w_i, h_j⟩.
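A minimal C++ sketch of this generation step for a single rating is given below (illustrative only; the sampling of per-user and per-item rating counts from the empirical Netflix distributions and the choice of nonzero locations are omitted):

#include <cstddef>
#include <random>
#include <vector>

// Illustrative sketch: draw a ground-truth parameter vector from the
// k-dimensional standard isotropic Gaussian.
std::vector<double> draw_parameter(std::size_t k, std::mt19937 &rng) {
  std::normal_distribution<double> standard_normal(0.0, 1.0);
  std::vector<double> v(k);
  for (auto &x : v) x = standard_normal(rng);
  return v;
}

// Illustrative sketch: synthetic rating A_ij = <w_i, h_j> + noise, noise ~ N(0, 0.1^2).
double synthetic_rating(const std::vector<double> &w_i,
                        const std::vector<double> &h_j,
                        std::mt19937 &rng) {
  std::normal_distribution<double> noise(0.0, 0.1);
  double rating = 0.0;
  for (std::size_t l = 0; l < w_i.size(); ++l) rating += w_i[l] * h_j[l];
  return rating + noise(rng);
}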

Figure 12 shows that the comparative advantage of NOMAD over DSGD, DSGD++, and CCD++ increases as we grow the scale of the problem. NOMAD clearly outperforms the other methods in all configurations. DSGD++ is very competitive at small scale, but as the size of the problem grows, NOMAD shows better scaling behavior.

6. CONCLUSION AND FUTURE WORK

From our experimental study we conclude that:

• On a single machine, NOMAD shows near-linear scaling up to 30 threads.

• When a fixed size dataset is distributed across multiple machines, NOMAD shows near-linear scaling up to 32 machines.

• Both in shared-memory and distributed-memory settings, NOMAD exhibits superior performance against state-of-the-art competitors; on a commodity hardware cluster, the comparative advantage is more conspicuous.

• When both the size of the data and the number of machines grow, the scaling behavior of NOMAD is much nicer than that of its competitors.

Although we only discussed the matrix completion problem in this paper, it is worth noting that the idea of NOMAD is more widely applicable. Specifically, the ideas discussed in this paper can be easily adapted as long as the objective function can be written as

\[
f(W, H) = \sum_{(i,j) \in \Omega} f_{ij}(\mathbf{w}_i, \mathbf{h}_j).
\]

As part of our ongoing work, we are investigating ways to rewrite Support Vector Machines (SVMs) and binary logistic regression as saddle point problems, which have the above structure.

⁹ 480,189 is the number of users in Netflix who have at least one rating.

Inference in Latent Dirichlet Allocation (LDA) using a collapsed Gibbs sampler has a similar structure to the stochastic gradient descent updates for matrix factorization. An additional complication in LDA is that the variables need to be normalized. We are also investigating how the NOMAD framework can be used for LDA.

7. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their constructive comments. We thank the Texas Advanced Computing Center at the University of Texas and the Research Computing group at Purdue University for providing infrastructure and timely support for our experiments. Computing experiments on commodity hardware were made possible by an AWS in Education Machine Learning Research Grant Award. This material is partially based upon work supported by the National Science Foundation under grant nos. IIS-1219015 and CCF-1117055.

References

[1] Apache Hadoop, 2009. http://hadoop.apache.org/core.

[2] GraphLab datasets, 2013. http://graphlab.org/downloads/datasets.

[3] Intel Thread Building Blocks, 2013. https://www.threadingbuildingblocks.org.

[4] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.

[5] R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations, 9(2):75–79, 2007.

[6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

[7] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, pages 351–368, 2011.

[8] D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2:199–222, 1969.


Figure 12: Comparison of algorithms when both the dataset size and the number of machines grow. Left: 4 machines; middle: 16 machines; right: 32 machines.

[9] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[10] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo music dataset and KDD-Cup'11. Journal of Machine Learning Research - Proceedings Track, 18:8–18, 2012.

[11] A. Frommer and D. B. Szyld. On asynchronous iterations. Journal of Computational and Applied Mathematics, 123:201–216, 2000.

[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.

[13] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, pages 17–30, 2012.

[14] T. Hoefler, P. Gottschling, W. Rehm, and A. Lumsdaine. Optimizing a conjugate gradient solver with non-blocking operators. Parallel Computing, 2007.

[15] C. J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the conference on Knowledge Discovery and Data Mining, pages 1064–1072, August 2011.

[16] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences. Springer, New York, 1978.

[17] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the International Conference on Very Large Data Bases, pages 716–727, 2012.

[18] B. Recht and C. Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, June 2013.

[19] B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the conference on Advances in Neural Information Processing Systems, pages 693–701, 2011.

[20] P. Richtárik and M. Takáč. Distributed coordinate descent method for learning with big data, 2013. URL http://arxiv.org/abs/1310.2059.

[21] H. E. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[22] S. Shalev-Schwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the International Conference on Machine Learning, pages 928–935, 2008.

[23] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Proceedings of the International Conference on Very Large Data Bases, pages 703–710, 2010.

[24] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the International Conference on the World Wide Web, pages 607–614, 2011.

[25] C. Teflioudi, F. Makari, and R. Gemulla. Distributed matrix completion. In Proceedings of the International Conference on Data Mining, pages 655–664, 2012.

[26] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774, 2012.

[27] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proceedings of the conference on Algorithmic Aspects in Information and Management, pages 337–348, 2008.

[28] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proceedings of the ACM conference on Recommender systems, pages 249–256, 2013.


APPENDIX
A. EFFECT OF THE REGULARIZATION PARAMETER

In this subsection we study the convergence behavior of NOMAD as we change the regularization parameter λ (Figure 13). Note that on the Netflix data (left), for non-optimal choices of the regularization parameter the test RMSE increases from the initial solution, as the model overfits or underfits the training data. While NOMAD reliably converges in all cases on Netflix, the convergence is notably faster with higher values of λ; this is expected, because regularization smooths the objective function and makes the optimization problem easier to solve. On the other datasets, the speed of convergence was not very sensitive to the selection of the regularization parameter.

B. EFFECT OF THE LATENT DIMENSION

In this subsection we study the convergence behavior of NOMAD as we change the dimensionality parameter k (Figure 14). In general, the convergence is faster for smaller values of k, as the computational cost of the SGD updates (9) and (10) is linear in k. On the other hand, the model gets richer with higher values of k: as its parameter space expands, it becomes capable of picking up weaker signals in the data, with the risk of overfitting. This is observed in Figure 14 with Netflix (left) and Yahoo Music (right). On Hugewiki, however, small values of k were sufficient to fit the training data, and the test RMSE suffers from overfitting with higher values of k. Nonetheless, NOMAD reliably converged in all cases.

C. SCALING ON COMMODITY HARDWARE

In this section we augment Section 5.4 by providing the actual plots of the experiment. We increase the number of machines from 1 to 32 and plot how the convergence of NOMAD is affected by the number of machines. As in Section 5.4, we used m1.xlarge machines from Amazon Web Services (AWS), each of which has a quad-core Intel Xeon E5430 CPU and 15GB of RAM.

The overall pattern is identical to what was found in Figures 19, 10, and 9 of Section 5.3. Figure 15 shows how the test RMSE decreases as a function of the number of updates; as in Figure 19, the speed of convergence is faster with a larger number of machines, as updated information is exchanged more frequently. Figure 16 shows the number of updates performed per second on each computation core of each machine: NOMAD exhibits linear scaling on Netflix and Hugewiki, but slows down on Yahoo Music due to the extreme sparsity of the data. Figure 17 compares the convergence speed of different settings when the same amount of computational power is given to each; on every dataset we observe linear to super-linear scaling up to 32 machines.

D. TEST RMSE AS A FUNCTION OF THE NUMBER OF UPDATES IN HPC CLUSTER

In this section we plot the test RMSE as a function of the number of updates in the HPC cluster, which was not included in the main text due to page constraints. Figure 18 shows single-machine multi-threaded experiments, and Figure 19 shows multi-machine distributed memory experiments.

E. COMPARISON OF ALGORITHMS FOR DIFFERENT VALUES OF THE REGULARIZATION PARAMETER

In this section we augment the experiments in Section 5.3 by comparing the performance of NOMAD, CCD++, and DSGD for different values of the regularization parameter λ. Figure 20 shows the results of the experiment. As NOMAD and DSGD are both stochastic gradient descent methods, they behave similarly to each other when the regularization parameter is changed. On the other hand, CCD++, which decreases the objective function more greedily, behaves differently.

For small values of λ, CCD++ seems to overfit the model: due to its greedy strategy, it generally converges to a worse solution than the others. For high values of λ, however, the strategy of CCD++ is advantageous, and it shows rapid initial convergence. Note that in all cases NOMAD is competitive as compared to both CCD++ and DSGD.

F. COMPARISON WITH GRAPHLAB

PowerGraph 2.2 [17], the latest version of GraphLab,¹⁰ was used in our experiments. Since the Intel compiler is unable to compile GraphLab, we compiled it with gcc. The rest of the experimental setting is the same as in Section 5.1.

GraphLab provides a number of optimization algorithms for matrix completion. However, we found that the Stochastic Gradient Descent (SGD) implementation of GraphLab does not converge; according to the GraphLab developers, this is because the abstraction currently provided by GraphLab is not suitable for SGD.¹¹ Therefore we use the Alternating Least Squares (ALS) algorithm for minimizing the objective function (1).

Although each machine in our HPC cluster is equipped with 32 GB of RAM and we distribute the work across 32 machines in multi-machine experiments, we had to tune the nfibers parameter to avoid out-of-memory problems. In spite of tuning many different parameters, we were unable to run GraphLab on the Hugewiki data in any setting. For the other datasets we tried both the synchronous and asynchronous engines of GraphLab and report the best of the two settings.

Figure 21 shows results of single-machine multi-threaded experiments, while Figure 22 and Figure 23 show multi-machine experiments on the HPC cluster and the commodity cluster, respectively. Clearly, NOMAD not only converges orders of magnitude faster than GraphLab in every setting, but the quality of its solution is also significantly better.

Compared to the multi-machine setting with a large number of machines (32) but a small number of cores (4) per machine, GraphLab converges faster when it is run in a single-machine setting with a large number of cores (30). We conjecture that this is because locking and unlocking a variable over a network is expensive. In contrast, since NOMAD is a lock-free algorithm, it scales well in both settings.

Although GraphLab biassgd is based on a model different from (1), for completeness we provide comparisons with it on a commodity hardware cluster.

¹⁰ Available for download from https://github.com/graphlab-code/graphlab.
¹¹ GraphLab provides a different algorithm named biassgd. However, it is based on a model different from (1).


Figure 13: Convergence behavior of NOMAD when the regularization parameter λ is varied.

Figure 14: Convergence behavior of NOMAD when the latent dimension k is varied.

Figure 15: Test RMSE of NOMAD as a function of the number of updates on a commodity hardware cluster when the number of machines is varied.

Figure 16: Number of updates of NOMAD per machine per core per second as a function of the number of machines on a commodity hardware cluster.


Figure 17: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a commodity hardware cluster when the number of machines is varied.

Figure 18: Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied.

Figure 19: Test RMSE of NOMAD as a function of the number of updates on a HPC cluster when the number of machines is varied.


Figure 20: Comparison of NOMAD, DSGD, and CCD++ on a HPC cluster when the regularization parameter λ is varied. The value of λ increases from top to bottom.


Unfortunately, GraphLab biassgd crashed when we ran it on more than 16 machines, so we had to run it on only 16 machines and assumed that GraphLab would scale linearly up to 32 machines in order to generate the plots in Figure 23. Again, NOMAD was orders of magnitude faster than GraphLab and converges to a better solution.


Figure 21: Comparison of NOMAD and GraphLab on a single machine with 30 computation cores.

Figure 22: Comparison of NOMAD and GraphLab on a HPC cluster.

Figure 23: Comparison of NOMAD and GraphLab on a commodity hardware cluster.


Page 3: NOMAD: Non-locking, stOchastic Multi-machine algorithm ...We develop an e cient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchas-tic Multi-machine

usersitems

(a)

usersitems

(b)

Figure 1 Illustration of updates used in matrix comple-tion Three algorithms are shown here (a) alternating leastsquares and coordinate descent (b) stochastic gradient de-scent Black indicates that the value of the node is beingupdated gray indicates that the value of the node is beingread White nodes are neither being read nor updated

For the matrix completion problem note that for a fixed(i j) pair the gradient of (1) can be written as

nablawiJ(WH) = (Aij minus 〈wihj〉) hj + λwi and

nablahjJ(WH) = (Aij minus 〈wihj〉) wi + λhj

Therefore the SGD updates require sampling a random in-dex (it jt) uniformly from the set of nonzero indicies Ω andperforming the update

wit larr wit minus st middot [(Aitjt minuswithjt)hjt + λwit ] and (9)

hjt larr hjt minus st middot [(Aitjt minuswithjt)wjt + λhjt ] (10)

3 NOMADIn NOMAD we use an optimization scheme based on

SGD In order to justify this choice we find it instructive tofirst understand the updates performed by ALS coordinatedescent and SGD on a bipartite graph which is constructedas follows the i-th user node corresponds to wi the j-thitem node corresponds to hj and an edge (i j) indicatesthat user i has rated item j (see Figure 1) Both the ALSupdate (3) and coordinate descent update (6) for wi requireus to access the values of hj for all j isin Ωi This is shownin Figure 1 (a) where the black node corresponds to wiwhile the gray nodes correspond to hj for j isin Ωi On theother hand the SGD update to wi (9) only requires us toretrieve the value of hj for a single random j isin Ωi (Figure 1(b)) What this means is that in contrast to ALS or CCDmultiple SGD updates can be carried out simultaneously inparallel without interfering with each other Put anotherway SGD has higher potential for finer-grained parallelismthan other approaches and therefore we use it as our opti-mization scheme in NOMAD

31 DescriptionFor now we will denote each parallel computing unit as a

worker in a shared memory setting a worker is a thread andin a distributed memory architecture a worker is a machineThis abstraction allows us to present NOMAD in a unifiedmanner Of course NOMAD can be used in a hybrid set-ting where there are multiple threads spread across multiplemachines and this will be discussed in Section 34

In NOMAD the users 1 2 m are split into p disjointsets I1 I2 Ip which are of approximately equal size1This induces a partition of the rows of the ratings matrix

A The q-th worker stores n sets of indices Ω(q)j for j isin

1 n which are defined as

Ω(q)j =

(i j) isin Ωj i isin Iq

as well as the corresponding values of A Note that oncethe data is partitioned and distributed to the workers it isnever moved during the execution of the algorithm

Recall that there are two types of parameters in ma-trix completion user parameters wirsquos and item parame-ters hj rsquos In NOMAD wirsquos are partitioned according toI1 I2 Ip that is the q-th worker stores and updateswi for i isin Iq The variables in W are partitioned at thebeginning and never move across workers during the execu-tion of the algorithm On the other hand the hj rsquos are splitrandomly into p partitions at the beginning and their own-ership changes as the algorithm progresses At each pointof time an hj variable resides in one and only worker and itmoves to another worker after it is processed independentof other item variables Hence these are nomadic variables2

Processing an item variable hj at the q-th worker entailsexecuting SGD updates (9) and (10) on the ratings in the

set Ω(q)j Note that these updates only require access to hj

and wi for i isin Iq since Iqrsquos are disjoint each wi variablein the set is accessed by only one worker This is why thecommunication of wi variables is not necessary On theother hand hj is updated only by the worker that currentlyowns it so there is no need for a lock this is the popularowner-computes rule in parallel computing See Figure 2

We now formally define the NOMAD algorithm (see Algo-rithm 1 for detailed pseudo-code) Each worker q maintainsits own concurrent queue queue[q] which contains a list ofitems it has to process Each element of the list consistsof the index of the item j (1 le j le n) and a correspond-ing k-dimensional parameter vector hj this pair is denotedas (jhj) Each worker q pops a (jhj) pair from its ownqueue queue[q] and runs stochastic gradient descent up-

date on Ω(q)j which is the set of ratings on item j locally

stored in worker q (line 16 to 21) This changes values ofwi for i isin Iq and hj After all the updates on item j aredone a uniformly random worker qprime is sampled (line 22) andthe updated (jhj) pair is pushed into the queue of thatworker qprime (line 23) Note that this is the only time where aworker communicates with another worker Also note thatthe nature of this communication is asynchronous and non-blocking Furthermore as long as there are items in thequeue the computations are completely asynchronous anddecentralized Moreover all workers are symmetric that isthere is no designated master or slave

32 Complexity AnalysisFirst we consider the case when the problem is distributed

across p workers and study how the space and time com-plexity behaves as a function of p Each worker has to store

1An alternative strategy is to split the users such that eachset has approximately the same number of ratings2Due to symmetry in the formulation of the matrix com-pletion problem one can also make the wirsquos nomadic andpartition the hj rsquos Since usually the number of users is muchlarger than the number of items this leads to more commu-nication and therefore we make the hj variables nomadic

3

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

xx

xx

(a) Initial assignment of Wand H Each worker worksonly on the diagonal activearea in the beginning

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

xx

xx

(b) After a worker finishesprocessing column j it sendsthe corresponding item pa-rameter hj to another workerHere h2 is sent from worker 1to 4

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

xx

xx

(c) Upon receipt the col-umn is processed by the newworker Here worker 4 cannow process column 2 since itowns the column

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

xx

xx

(d) During the execution ofthe algorithm the ownershipof the item parameters hjchanges

Figure 2 Illustration of the NOMAD algorithm

Algorithm 1 the basic NOMAD algorithm

1 λ regularization parameter2 st step size sequence3 initialize parameters

4 wil sim UniformReal(

0 1radick

)for 1 le i le m 1 le l le k

5 hjl sim UniformReal(

0 1radick

)for 1 le j le n 1 le l le k

6 initialize queues

7 for j isin 1 2 n do8 q sim UniformDiscrete 1 2 p9 queue[q]push((jhj))

10 end for11 start p workers

12 Parallel Foreach q isin 1 2 p13 while stop signal is not yet received do14 if queue[q] not empty then15 (jhj)larr queue[q]pop()

16 for (i j) isin Ω(q)j do

17 SGD update

18 tlarr number of updates on (i j)19 wi larr wi minus st middot [(Aij minuswihj)hj + λwi]20 hj larr hj minus st middot [(Aij minuswihj)wj + λhj ] 21 end for22 qprime sim UniformDiscrete 1 2 p23 queue[qprime]push((jhj))24 end if25 end while26 Parallel End

1p fraction of the m user parameters and approximately1p fraction of the n item parameters Furthermore eachworker also stores approximately 1p fraction of the |Ω| rat-ings Since storing a row of W or H requires O(k) spacethe space complexity per worker is O((mk + nk + |Ω|)p)As for time complexity we find it useful to use the follow-ing assumptions performing the SGD updates in line 16 to21 takes a middot k time and communicating a (jhj) to anotherworker takes c middot k time where a and c are hardware depen-dent constants On the average each (jhj) pair containsO (|Ω| np) non-zero entries Therefore when a (jhj) pairis popped from queue[q] in line 15 of Algorithm 1 on theaverage it takes a middot (|Ω| knp) time to process the pair Sincecomputation and communication can be done in parallel aslong as a middot (|Ω| knp) is higher than c middot k a worker thread isalways busy and NOMAD scales linearly

Suppose that |Ω| is fixed but the number of workers pincreases that is we take a fixed size dataset and distributeit across p workers As expected for a large enough valueof p (which is determined by hardware dependent constantsa and b) the cost of communication will overwhelm the costof processing an item thus leading to slowdown

On the other hand suppose the work per worker is fixedthat is |Ω| increases and the number of workers p increasesproportionally The average time a middot(|Ω| knp) to process anitem remains constant and NOMAD scales linearly

Finally we discuss the communication complexity of NO-MAD For this discussion we focus on a single item param-eter hj which consists of O(k) numbers In order to beprocessed by all the p workers once it needs to be commu-nicated p times This requires O(kp) communication peritem There are n items and if we make a simplifying as-sumption that during the execution of NOMAD each itemis processed a constant c number of times by each processorthen the total communication complexity is O(nkp)

33 Dynamic Load BalancingAs different workers have different number of ratings per

item the speed at which a worker processes a set of rat-

ings Ω(q)j for an item j also varies among workers Further-

more in the distributed memory setting different workersmight process updates at different rates dues to differencesin hardware and system load NOMAD can handle this bydynamically balancing the workload of workers in line 22 ofAlgorithm 1 instead of sampling the recipient of a messageuniformly at random we can preferentially select a workerwhich has fewer items in its queue to process To do this apayload carrying information about the size of the queue[q]is added to the messages that the workers send each otherThe overhead of passing the payload information is just asingle integer per message This scheme allows us to dy-namically load balance and ensures that a slower workerwill receive smaller amount of work compared to others

34 Hybrid ArchitectureIn a hybrid architecture we have multiple threads on a sin-

gle machine as well as multiple machines distributed acrossthe network In this case we make two improvements to thebasic algorithm First in order to amortize the communica-tion costs we reserve two additional threads per machine forsending and receiving (jhj) pairs over the network Intra-machine communication is much cheaper than machine-to-machine communication since the former does not involve

4

a network hop Therefore whenever a machine receives a(jhj) pair it circulates the pair among all of its threadsbefore sending the pair over the network This is done byuniformly sampling a random permutation whose size equalsto the number of worker threads and sending the item vari-able to each thread according to this permutation Circu-lating a variable more than once was found to not improveconvergence and hence is not used in our algorithm

35 Implementation DetailsMulti-threaded MPI was used for inter-machine commu-

nication Instead of communicating single (jhj) pairs wefollow the strategy of [23] and accumulate a fixed number ofpairs (eg 100) before transmitting them over the network

NOMAD can be implemented with lock-free data struc-tures since the only interaction between threads is via oper-ations on the queue We used the concurrent queue providedby Intel Thread Building Blocks (TBB) [3] Although tech-nically not lock-free the TBB concurrent queue neverthelessscales almost linearly with the number of threads

Since there is very minimal sharing of memory acrossthreads in NOMAD by making memory assignments in eachthread carefully aligned with cache lines we can exploit mem-ory locality and avoid cache ping-pong This results in nearlinear scaling for the multi-threaded setting

4 RELATED WORK

41 Map-Reduce and FriendsSince many machine learning algorithms are iterative in

nature a popular strategy to distribute them across multi-ple machines is to use bulk synchronization after every it-eration Typically one partitions the data into chunks thatare distributed to the workers at the beginning A mas-ter communicates the current parameters which are used toperform computations on the slaves The slaves return thesolutions during the bulk synchronization step which areused by the master to update the parameters The pop-ularity of this strategy is partly thanks to the widespreadavailability of Hadoop [1] an open source implementationof the MapReduce framework [9]

All three optimization schemes for matrix completion namelyALS CCD++ and SGD can be parallelized using a bulksynchronization strategy This is relatively simple for ALS[27] and CCD++ [26] but a bit more involved for SGD[12 18] Suppose p machines are available Then the Dis-tributed Stochastic Gradient Descent (DSGD) algorithm ofGemulla et al [12] partitions the indices of users 1 2 minto mutually exclusive sets I1 I2 Ip and the indices ofitems into J1 J2 Jp Now define

Ω(q) = (i j) isin Ω i isin Iq j isin Jq 1 le q le pand suppose that each machine runs SGD updates (9) and(10) independently but machine q samples (i j) pairs only

from Ω(q) By construction Ω(q)rsquos are disjoint and hencethese updates can be run in parallel A similar observationwas also made by Recht and Re [18] A bulk synchronizationstep redistributes the sets J1 J2 Jp and correspondingrows of H which in turn changes the Ω(q) processed by eachmachine and the iteration proceeds (see Figure 3)

Unfortunately bulk synchronization based algorithms havetwo major drawbacks First the communication and com-putation steps are done in sequence What this means is

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

Figure 3 Illustration of DSGD algorithm with 4 workersInitially W and H are partitioned as shown on the left Eachworker runs SGD on its active area as indicated After eachworker completes processing data points in its own activearea the columns of item parameters Hgt are exchangedrandomly and the active area changes This process is re-peated for each iteration

that when the CPU is busy the network is idle and viceversa The second issue is that they suffer from what iswidely known as the the curse of last reducer [4 24] In otherwords all machines have to wait for the slowest machine tofinish before proceeding to the next iteration Zhuang et al[28] report that DSGD suffers from this problem even in theshared memory setting

DSGD++ is an algorithm proposed by Teflioudi et al[25] to address the first issue discussed above Instead ofusing p partitions DSGD++ uses 2p partitions While thep workers are processing p partitions the other p partitionsare sent over the network This keeps both the network andCPU busy simultaneously However DSGD++ also suffersfrom the curse of the last reducer

Another attempt to alleviate the problems of bulk syn-chronization in the shared memory setting is the FPSGDalgorithm of Zhuang et al [28] given p threads FPSGDpartitions the parameters into more than p sets and usesa task manager thread to distribute the partitions Whena thread finishes updating one partition it requests for an-other partition from the task manager It is unclear how toextend this idea to the distributed memory setting

In NOMAD we sidestep all the drawbacks of bulk synchro-nization Like DSGD++ we also simultaneously keep thenetwork and CPU busy On the other hand like FPSGDwe effectively load balance between the threads To under-stand why NOMAD enjoys both these benefits it is instruc-tive to contrast the data partitioning schemes underlyingDSGD DSGD++ FPSGD and NOMAD (see Figure 4)Given p number of workers DSGD divides the rating matrixA into p times p number of blocks DSGD++ improves uponDSGD by further dividing each block to 1 times 2 sub-blocks(Figure 4 (a) and (b)) On the other hand FPSGD splitsA into pprime times pprime blocks with pprime gt p (Figure 4 (c)) while NO-MAD uses ptimes n blocks (Figure 4 (d)) In terms of commu-nication there is no difference between various partitioningschemes all of them require O(nkp) communication for eachitem to be processed a constant c number of times Howeverhaving smaller blocks means that NOMAD has much moreflexibility in assigning blocks to processors and hence betterability to exploit parallelism Because NOMAD operates atthe level of individual item parameters hj it can dynam-ically load balance by assigning fewer columns to a slower

5

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

(a) DSGD

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

(b) DSGD++

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

(c) FPSGD

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

(d) NOMAD

Figure 4 Comparison of data partitioning schemes betweenalgorithms Example active area of stochastic gradient sam-pling is marked as gray

worker. A pleasant side effect of such a fine-grained partitioning, coupled with the lock-free nature of the updates, is that one does not require sophisticated scheduling algorithms to achieve good performance. Consequently, NOMAD outperforms DSGD, DSGD++, and FPSGD.
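To make the p × n partitioning concrete, the following is a minimal, single-process Python sketch of the idea of circulating item columns among workers via per-worker queues. It is not the authors' implementation: the variable names, the fixed step size, the synthetic ratings, and the round-robin simulation of concurrency are all illustrative assumptions.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

m, n, k, p = 40, 12, 5, 4        # users, items, latent dimension, workers
lam, step = 0.05, 0.01           # regularization and (fixed) step size

# Sparse synthetic ratings; worker q is responsible for a contiguous block of rows.
ratings = [(i, j, rng.random()) for i in range(m) for j in range(n) if rng.random() < 0.3]
row_owner = lambda i: i * p // m

W = rng.uniform(0, 1 / np.sqrt(k), (m, k))
H = rng.uniform(0, 1 / np.sqrt(k), (n, k))

# Each worker keeps a queue of the item columns it currently owns.
queues = [deque() for _ in range(p)]
for j in range(n):
    queues[rng.integers(p)].append(j)

for _ in range(200):                       # simulated rounds
    for q in range(p):                     # in NOMAD proper these run concurrently
        if not queues[q]:
            continue
        j = queues[q].popleft()            # worker q temporarily owns column j
        for i, jj, a in ratings:
            if jj == j and row_owner(i) == q:
                err = a - W[i] @ H[j]      # one SGD step on rating (i, j)
                W[i] += step * (err * H[j] - lam * W[i])
                H[j] += step * (err * W[i] - lam * H[j])
        queues[rng.integers(p)].append(j)  # hand the column to a random worker
```

Because each column is owned by exactly one worker at any time, no locks are needed and the updates on any h_j form a single serial sequence; dynamic load balancing amounts to choosing the recipient queue non-uniformly.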

4.2 Asynchronous Algorithms

There is growing interest in designing machine learning algorithms that do not perform bulk synchronization. See, for instance, the randomized (block) coordinate descent methods of Richtarik and Takac [20] and the Hogwild algorithm of Recht et al. [19]. A relatively new approach to asynchronous parallelism is to use a so-called parameter server. A parameter server is either a single machine or a distributed set of machines which caches the current values of the parameters. Workers store local copies of the parameters, perform updates on them, and periodically synchronize their local copies with the parameter server. The parameter server receives updates from all workers, aggregates them, and communicates them back to the workers. The earliest work on a parameter server that we are aware of is due to Smola and Narayanamurthy [23], who propose using a parameter server for collapsed Gibbs sampling in Latent Dirichlet Allocation. PowerGraph [13], upon which the latest version of the GraphLab toolkit is based, is also essentially based on the idea of a parameter server. However, the difference in the case of PowerGraph is that the responsibility for the parameters is distributed across multiple machines, but at the added expense of synchronizing the copies.
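The following toy Python sketch illustrates the generic worker/server interaction described above; it does not model Smola and Narayanamurthy's system, PowerGraph, or any other concrete implementation, and the class, the quadratic toy objective, and all parameter names are assumptions made purely for illustration.

```python
import numpy as np

class ParameterServer:
    """Toy in-process parameter server: caches the current parameters and
    aggregates the deltas pushed by workers (a generic sketch only)."""
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def pull(self):
        return self.params.copy()        # a worker refreshes its local copy

    def push(self, delta):
        self.params += delta             # aggregate a worker's accumulated update

def worker(server, data, steps=100, sync_every=10, lr=0.1):
    local = server.pull()                # possibly stale local copy
    synced = local.copy()
    for t in range(steps):
        x = data[t % len(data)]
        local -= lr * (local - x)        # gradient step on 0.5 * ||local - x||^2
        if (t + 1) % sync_every == 0:    # periodic synchronization
            server.push(local - synced)  # send only the accumulated change
            local = server.pull()
            synced = local.copy()

server = ParameterServer(dim=2)
worker(server, [np.array([1.0, 3.0]), np.array([3.0, 1.0])])
print(server.params)                     # ends up near the average of the data
```

In a real deployment many workers would run this loop concurrently against the same server, which is exactly where the staleness of the local copies, mentioned above, comes from.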

Very roughly speaking, the asynchronously parallel version of the ALS algorithm in GraphLab works as follows: the w_i and h_j variables are distributed across multiple machines, and whenever w_i is being updated with equation (3), the values of the h_j's for j ∈ Ω_i are retrieved across the network and read-locked until the update is finished. GraphLab provides functionality such as network communication and a distributed locking mechanism to implement this. However,

frequently acquiring read-locks over the network can be expensive. In particular, a popular user who has rated many items will require read-locks on a large number of items, and this will lead to a vast amount of communication and delays in updates on those items. GraphLab provides a complex job scheduler which attempts to minimize this cost, but then the efficiency of parallelization depends on the difficulty of the scheduling problem and the effectiveness of the scheduler.

In our empirical evaluation NOMAD performs significantly better than GraphLab. The reasons are not hard to see. First, because of the lock-free nature of NOMAD, we completely avoid acquiring expensive network locks. Second, we use SGD, which allows us to exploit finer grained parallelism as compared to ALS and also leads to faster convergence. In fact, the GraphLab framework is not well suited for SGD (personal communication with the developers of GraphLab). Finally, because of the finer grained data partitioning scheme used in NOMAD, unlike GraphLab, whose performance heavily depends on the underlying scheduling algorithms, we do not require a complicated scheduling mechanism.

4.3 Numerical Linear Algebra

The concepts of asynchronous and non-blocking updates have also been studied in numerical linear algebra. To avoid the load balancing problem and to reduce processor idle time, asynchronous numerical methods were first proposed over four decades ago by Chazan and Miranker [8]. Given an operator H : R^m → R^m, to find the fixed point solution x* such that H(x*) = x*, a standard Gauss-Seidel-type procedure performs the update x_i = (H(x))_i sequentially (or randomly). In the asynchronous procedure, each computational node asynchronously conducts updates on each variable (or a subset), x_i^new = (H(x))_i, and then overwrites x_i in common memory by x_i^new. The theory and applications of this asynchronous method have been widely studied (see the literature review of Frommer and Szyld [11] and the seminal textbook by Bertsekas and Tsitsiklis [6]). The concept of this asynchronous fixed-point update is very closely related to the Hogwild algorithm of Recht et al. [19] and the so-called Asynchronous SGD (ASGD) method proposed by Teflioudi et al. [25]. Unfortunately, such algorithms are non-serializable, that is, there may not exist an equivalent update ordering in a serial implementation. In contrast, our NOMAD algorithm is not only asynchronous but also serializable, and therefore achieves faster convergence in practice.
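For intuition, here is a toy Python sketch (not taken from [8] or [19]) of the asynchronous fixed-point scheme: two threads overwrite coordinates of a shared vector without any locking. The operator H below is an artificial contraction chosen only so that the fixed point is known in closed form.

```python
import threading
import numpy as np

m = 8
b = np.arange(1.0, m + 1.0)
x = np.zeros(m)                       # shared iterate, overwritten without locks

def H(v):
    # A simple contraction with known fixed point x* = 2b: H(v) = 0.5 v + b.
    return 0.5 * v + b

def node(coords, sweeps=2000):
    for _ in range(sweeps):
        for i in coords:
            x[i] = H(x)[i]            # read possibly stale x, overwrite coordinate i

threads = [threading.Thread(target=node, args=(range(q, m, 2),)) for q in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.allclose(x, 2 * b, atol=1e-8))   # True: the chaotic iteration still converges
```

The iteration converges even though the threads read stale values; however, no serial ordering of these interleaved updates is guaranteed to exist, which is exactly the non-serializability discussed above.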

On the other hand, non-blocking communication has also been proposed to accelerate iterative solvers in a distributed setting. For example, Hoefler et al. [14] presented a distributed conjugate gradient implementation with non-blocking collective MPI operations for solving linear systems. However, this algorithm still requires synchronization at each CG iteration, so it is very different from our NOMAD algorithm.

4.4 Discussion

We remark that, among the algorithms we have discussed so far, NOMAD is the only distributed-memory algorithm which is both asynchronous and lock-free. Other parallel versions of SGD, such as DSGD and DSGD++, are lock-free but not fully asynchronous; therefore the cost of synchronization increases as the number of machines grows [28]. On the other hand, the GraphLab implementation of ALS [17] is asynchronous but not lock-free, and therefore depends on


a complex job scheduler to reduce the side effects of using locks.

5. EXPERIMENTS

In this section we evaluate the empirical performance of NOMAD with extensive experiments. For the distributed memory experiments we compare NOMAD with DSGD [12], DSGD++ [25], and CCD++ [26]. We also compare against GraphLab, but the quality of the results produced by GraphLab is significantly worse than that of the other methods, and therefore the plots for this experiment are delegated to Appendix F. For the shared memory experiments we pit NOMAD against FPSGD [28] (which is shown to outperform DSGD in single-machine experiments) as well as CCD++. Our experiments are designed to answer the following:

• How does NOMAD scale with the number of cores on a single machine? (Section 5.2)

• How does NOMAD scale as a fixed size dataset is distributed across multiple machines? (Section 5.3)

• How does NOMAD perform on a commodity hardware cluster? (Section 5.4)

• How does NOMAD scale when both the size of the data as well as the number of machines grow? (Section 5.5)

Since the objective function (1) is non-convex, different optimizers will converge to different solutions. Factors which affect the quality of the final solution include: 1) the initialization strategy, 2) the sequence in which the ratings are accessed, and 3) the step-size decay schedule. It is clearly not feasible to consider the combinatorial effect of all these factors on each algorithm. However, we believe that the overall trend of our results is not affected by these factors.

5.1 Experimental Setup

Publicly available code for FPSGD^3 and CCD++^4 was used in our experiments. For DSGD and DSGD++, which we had to implement ourselves because the code is not publicly available, we closely followed the recommendations of Gemulla et al. [12] and Teflioudi et al. [25], and in some cases made improvements based on our experience. For a fair comparison, all competing algorithms were tuned for optimal performance on our hardware. The code and scripts required for reproducing the experiments are readily available for download from https://sites.google.com/site/hyokunyun/software. Parameters used in our experiments are summarized in Table 1.

Table 1: Dimensionality parameter k, regularization parameter λ (1), and step-size schedule parameters α, β (11).

Name          k     λ      α        β
Netflix       100   0.05   0.012    0.05
Yahoo! Music  100   1.00   0.00075  0.01
Hugewiki      100   0.01   0.001    0

For all experiments except the ones in Section 5.5, we work with three benchmark datasets, namely Netflix, Yahoo! Music, and Hugewiki (see Table 2 for more details).

3 http://www.csie.ntu.edu.tw/~cjlin/libmf
4 http://www.cs.utexas.edu/~rofuyu/libpmf

Table 2: Dataset details.

Name               Rows        Columns   Non-zeros
Netflix [5]        2,649,429   17,770    99,072,112
Yahoo! Music [10]  1,999,990   624,961   252,800,275
Hugewiki [2]       50,082,603  39,780    2,736,496,604

The same training and test dataset partition is used consistently for all algorithms in every experiment. Since our goal is to compare optimization algorithms, we do very minimal parameter tuning. For instance, we used the same regularization parameter λ for each dataset as reported by Yu et al. [26] and shown in Table 1; we study the effect of the regularization parameter on the convergence of NOMAD in Appendix A. By default we use k = 100 for the dimension of the latent space; we study how the dimension of the latent space affects the convergence of NOMAD in Appendix B. All algorithms were initialized with the same initial parameters: we set each entry of W and H by independently sampling a uniform random variable in the range (0, 1/√k) [26, 28].

We compare solvers in terms of the Root Mean Square Error (RMSE) on the test set, which is defined as

    RMSE = sqrt( Σ_{(i,j) ∈ Ω_test} (A_ij − ⟨w_i, h_j⟩)² / |Ω_test| ),

where Ω_test denotes the ratings in the test set.
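A small Python sketch of this evaluation, together with the initialization described above, is given below. The triplet representation of Ω_test, the helper names, and the tiny example values are assumptions made only for illustration.

```python
import numpy as np

def init_factors(m, n, k, rng):
    # Entries drawn i.i.d. from Uniform(0, 1/sqrt(k)), as described above.
    W = rng.uniform(0, 1 / np.sqrt(k), (m, k))
    H = rng.uniform(0, 1 / np.sqrt(k), (n, k))
    return W, H

def test_rmse(W, H, test_set):
    # test_set: iterable of (i, j, A_ij) triplets for the held-out ratings.
    sq_err = [(a - W[i] @ H[j]) ** 2 for i, j, a in test_set]
    return float(np.sqrt(np.mean(sq_err)))

rng = np.random.default_rng(0)
W, H = init_factors(m=5, n=4, k=3, rng=rng)
print(test_rmse(W, H, [(0, 1, 3.0), (2, 3, 1.5), (4, 0, 2.0)]))
```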

All experiments except the ones reported in Section 5.4 are run using the Stampede Cluster at the University of Texas, a Linux cluster where each node is outfitted with two Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi coprocessor (MIC architecture). For single-machine experiments (Section 5.2) we used nodes in the largemem queue, which are equipped with 1TB of RAM and 32 cores. For all other experiments we used nodes in the normal queue, which are equipped with 32 GB of RAM and 16 cores (only 4 out of the 16 cores were used for computation). Inter-machine communication on this system is handled by MVAPICH2.

For the commodity hardware experiments in Section 5.4 we used m1.xlarge instances of Amazon Web Services, which are equipped with 15GB of RAM and four cores. We utilized all four cores in each machine: NOMAD and DSGD++ use two cores for computation and two cores for network communication, while DSGD and CCD++ use all four cores for both computation and communication. Inter-machine communication on this system is handled by MPICH2.

Since FPSGD uses single-precision arithmetic, the experiments in Section 5.2 are performed using single-precision arithmetic, while all other experiments use double-precision arithmetic. All algorithms are compiled with the Intel C++ compiler, with the exception of the experiments in Section 5.4, where we used gcc, which is the only compiler toolchain available on the commodity hardware cluster. For ready reference, exceptions to the experimental settings specific to each section are summarized in Table 3.

The convergence speed of stochastic gradient descent methods depends on the choice of the step-size schedule. The schedule we used for NOMAD is

    s_t = α / (1 + β · t^1.5),    (11)

where t is the number of SGD updates that were performed on a particular user-item pair (i, j).
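As a concrete illustration, the schedule (11) and one plausible per-rating update based on it can be written as follows. This is a sketch consistent with a regularized squared-error objective like (1); the default constants are the Netflix values from Table 1, and the exact form of the updates (9) and (10) given earlier in the paper may differ in its constants.

```python
import numpy as np

def step_size(t, alpha, beta):
    # Equation (11): s_t = alpha / (1 + beta * t^1.5), where t counts the SGD
    # updates already performed on this particular (user, item) pair.
    return alpha / (1.0 + beta * t ** 1.5)

def sgd_step(w_i, h_j, a_ij, t, alpha=0.012, beta=0.05, lam=0.05):
    # One regularized squared-error SGD step using the schedule above; a sketch,
    # not necessarily the exact constants of equations (9) and (10).
    s = step_size(t, alpha, beta)
    err = a_ij - w_i @ h_j
    w_new = w_i + s * (err * h_j - lam * w_i)
    h_new = h_j + s * (err * w_i - lam * h_j)
    return w_new, h_new

w, h = np.full(3, 0.1), np.full(3, 0.1)
for t in range(5):
    w, h = sgd_step(w, h, a_ij=4.0, t=t)
print(w, h)
```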


Table 3: Exceptions to each experiment.

Section       Exception
Section 5.2   • run on largemem queue (32 cores, 1TB RAM)
              • single-precision floating point used
Section 5.4   • run on m1.xlarge (4 cores, 15GB RAM)
              • compiled with gcc
              • MPICH2 for MPI implementation
Section 5.5   • synthetic datasets

DSGD and DSGD++, on the other hand, use an alternative strategy called bold driver [12]: here, the step size is adapted by monitoring the change of the objective function.

5.2 Scaling in Number of Cores

For the first experiment we fixed the number of cores to 30 and compared the performance of NOMAD against FPSGD^5 and CCD++ (Figure 5). On Netflix (left), NOMAD not only converges to a slightly better quality solution (RMSE 0.914 vs. 0.916 for the others) but is also able to reduce the RMSE rapidly right from the beginning. On Yahoo! Music (middle), NOMAD converges to a slightly worse solution than FPSGD (RMSE 21.894 vs. 21.853), but as in the case of Netflix the initial convergence is more rapid. On Hugewiki the difference is smaller, but NOMAD still outperforms the others. The initial speed of CCD++ on Hugewiki is comparable to NOMAD, but the quality of its solution starts to deteriorate in the middle. Note that the performance of CCD++ here is better than what was reported in Zhuang et al. [28], since they used double-precision floating point arithmetic for CCD++. In other experiments (not reported here) we varied the number of cores and found that the relative difference in performance between NOMAD, FPSGD, and CCD++ is very similar to that observed in Figure 5.

For the second experiment we varied the number of cores from 4 to 30 and plot the scaling behavior of NOMAD (Figures 6 and 7). Figure 6 (left) shows how the test RMSE changes as a function of the number of updates on Yahoo! Music. Interestingly, as we increased the number of cores, the test RMSE decreased faster. We believe this is because when we increase the number of cores, the rating matrix A is partitioned into smaller blocks; recall that we split A into p × n blocks, where p is the number of parallel workers. Therefore the communication between workers becomes more frequent, and each SGD update is based on fresher information (see also Section 3.2 for a mathematical analysis). This effect was observed more strongly on Yahoo! Music than on the others, since Yahoo! Music has a much larger number of items (624,961 vs. 17,770 for Netflix and 39,780 for Hugewiki), and therefore more communication is needed to circulate the new information to all workers. Results for the other datasets are provided in Figure 18 in Appendix D.

On the other hand, to assess the efficiency of computation we define the average throughput as the average number of ratings processed per core per second, and plot it for each dataset in Figure 6 (right) while varying the number of cores. If NOMAD exhibits linear scaling in terms of the speed at which it processes ratings, the average throughput should

5 Since the current implementation of FPSGD in LibMF only reports CPU execution time, we divide this by the number of threads and use it as a proxy for wall-clock time.

Figure 6: Left: Test RMSE of NOMAD as a function of the number of updates on Yahoo! Music when the number of cores is varied. Right: Number of updates of NOMAD per core per second as a function of the number of cores.

remain constant.^6 On Netflix the average throughput indeed remains almost constant as the number of cores changes. On Yahoo! Music and Hugewiki the throughput decreases to about 50% as the number of cores is increased to 30. We believe this is mainly due to cache locality effects.

Now we study how much speed-up NOMAD can achieve by increasing the number of cores. In Figure 7 we set the y-axis to be the test RMSE and the x-axis to be the total CPU time expended, which is given by the number of seconds elapsed multiplied by the number of cores. We plot the convergence curves for cores = 4, 8, 16, and 30. If the curves overlap, then this shows that we achieve linear speed-up as we increase the number of cores. This is indeed the case for Netflix and Hugewiki. In the case of Yahoo! Music, we observe that the speed of convergence increases as the number of cores increases. This, we believe, is again due to the decrease in the block size, which leads to faster convergence.

5.3 Scaling as a Fixed Dataset is Distributed Across Workers

In this subsection we use 4 computation threads per machine. For the first experiment we fix the number of machines to 32 (64 for Hugewiki) and compare the performance of NOMAD with DSGD, DSGD++, and CCD++ (Figure 8). On Netflix and Hugewiki, NOMAD converges much faster than its competitors: not only is the initial convergence faster, but it also discovers a better quality solution. On Yahoo! Music all four methods perform similarly. This is because the cost of network communication relative to the size of the data is much higher for Yahoo! Music: while Netflix and Hugewiki have 5,575 and 68,635 non-zero ratings per item respectively, Yahoo! Music has only 404 ratings per item. Therefore, when Yahoo! Music is divided equally across 32 machines, each item has only 10 ratings on average per machine. Hence the cost of sending and receiving the item parameter vector h_j for one item j across the network is higher than that of executing SGD updates on the ratings of the item locally stored within the machine, Ω_j^(q). As a consequence, the cost of network communication dominates the overall execution time of all algorithms, and little difference in convergence speed is found between them.

6 Note that since we use single-precision floating point arithmetic in this section to match the implementation of FPSGD, the throughput of NOMAD is about 50% higher than that in other experiments.


Figure 5: Comparison of NOMAD, FPSGD, and CCD++ on a single machine with 30 computation cores (left: Netflix, middle: Yahoo! Music, right: Hugewiki).

Figure 7: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of cores) when the number of cores is varied (left: Netflix, middle: Yahoo! Music, right: Hugewiki).

For the second experiment we varied the number of machines from 1 to 32 and plot the scaling behavior of NOMAD (Figures 10 and 9). Figure 10 (left) shows how the test RMSE decreases as a function of the number of updates on Yahoo! Music. Again, if NOMAD scales linearly the average throughput has to remain constant; here we observe an improvement in convergence speed when 8 or more machines are used. This is again the effect of smaller block sizes, which was discussed in Section 5.2. On Netflix a similar effect was present but less significant; on Hugewiki we did not see any notable difference between configurations (see Figure 19 in Appendix D).

In Figure 10 (right) we plot the average throughput (the number of updates per machine per core per second) as a function of the number of machines. On Yahoo! Music the average throughput goes down as we increase the number of machines because, as mentioned above, each item has a small number of ratings. On Hugewiki we observe almost linear scaling, and on Netflix the average throughput even improves as we increase the number of machines; we believe this is because of cache locality effects. As we partition users into smaller and smaller blocks, the probability of a cache miss on the user parameters w_i within a block decreases, and on Netflix this makes a meaningful difference: indeed, there are only 480,189 users in Netflix who have at least one rating. When this is equally divided into 32 machines, each machine contains only 11,722 active users on average. Therefore the w_i variables take only 11MB of memory, which is smaller than the size of the L3 cache (20MB) of the machine we used, and this leads to an increase in the number of updates per machine per core per second.

Now we study how much speed-up NOMAD can achieve by increasing the number of machines. In Figure 9 we set the y-axis to be the test RMSE and the x-axis to be the number of seconds elapsed multiplied by the total number of cores used in the configuration. Again, all lines will coincide with each other if NOMAD shows linear scaling. On Netflix, with 2 and 4 machines we observe a mild slowdown, but with more than 4 machines NOMAD exhibits super-linear scaling. On Yahoo! Music we observe super-linear scaling with respect to the speed of a single machine in all configurations, but the highest speedup is seen with 16 machines. On Hugewiki, linear scaling is observed in every configuration.

5.4 Scaling on Commodity Hardware

In this subsection we analyze the scaling behavior of NOMAD on commodity hardware. Using Amazon Web Services (AWS) we set up a computing cluster that consists of 32 machines; each machine is of type m1.xlarge and equipped with a quad-core Intel Xeon E5430 CPU and 15GB of RAM. Network bandwidth among these machines is reported to be approximately 1Gb/s.^7

Since NOMAD and DSGD++ dedicate two threads to network communication on each machine, only two cores are available for computation.^8 In contrast, bulk synchronization algorithms such as DSGD and CCD++, which separate

7 http://epamcloud.blogspot.com/2013/03/testing-amazon-ec2-network-speed.html
8 Since network communication is not computation-intensive, for DSGD++ we used four computation threads instead of two and got better results; thus we report results with four computation threads for DSGD++.


Figure 8: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a HPC cluster (left: Netflix, 32 machines; middle: Yahoo! Music, 32 machines; right: Hugewiki, 64 machines).

Figure 9: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a HPC cluster when the number of machines is varied.

computation and communication, can utilize all four cores for computation. In spite of this disadvantage, Figure 11 shows that NOMAD outperforms all other algorithms in this setting as well. In this plot we fixed the number of machines to 32: on Netflix and Hugewiki, NOMAD converges more rapidly to a better solution. Recall that on Yahoo! Music all four algorithms performed very similarly on the HPC cluster in Section 5.3. However, on commodity hardware NOMAD outperforms the other algorithms. This shows that the efficiency of network communication plays a very important role in commodity hardware clusters, where the communication is relatively slow. On Hugewiki, however, the number of columns is very small compared to the number of ratings, and thus network communication plays a smaller role in this dataset compared to the others. Therefore the initial convergence of DSGD is a bit faster than NOMAD, as it uses all four cores for computation while NOMAD uses only two. Still, the overall convergence speed is similar, and NOMAD finds a better quality solution.

As in Section 5.3, we increased the number of machines from 1 to 32 and studied the scaling behavior of NOMAD. The overall trend was identical to what we observed in Figures 10 and 9; due to page constraints, the plots for this experiment can be found in Appendix C.

5.5 Scaling as both Dataset Size and Number of Machines Grows

In previous sections (Section 5.3 and Section 5.4) we studied the scalability of the algorithms by partitioning a fixed amount of data across an increasing number of machines. In real-world applications of collaborative filtering, however, the

Figure 10: Results on the HPC cluster when the number of machines is varied. Left: Test RMSE of NOMAD as a function of the number of updates on Netflix and Yahoo! Music. Right: Number of updates of NOMAD per machine per core per second as a function of the number of machines.

size of the data should grow over time as new users are added to the system. Therefore, to match the increased amount of data with an equivalent amount of physical memory and computational power, the number of machines should increase as well. The aim of this section is to compare the scaling behavior of NOMAD and that of the other algorithms in this realistic scenario.

To simulate such a situation, we generated synthetic datasets whose characteristics resemble real data: the number of ratings for each user and each item is sampled from the corresponding empirical distribution of the Netflix data. As we increase the number of machines from 4 to 32, we fixed the number of items to be the same as that of Netflix (17,770) and increased the number of users to be proportional to the


Figure 11: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a commodity hardware cluster (left: Netflix, middle: Yahoo! Music, right: Hugewiki; 32 machines).

number of machines (480,189 times the number of machines^9). Therefore, the expected number of ratings in each dataset is proportional to the number of machines (99,072,112 times the number of machines) as well.

9 480,189 is the number of users in Netflix who have at least one rating.

Conditioned on the number of ratings for each user and item, the nonzero locations are sampled uniformly at random. Ground-truth user parameters w_i and item parameters h_j are generated from a 100-dimensional standard isometric Gaussian distribution, and for each rating A_ij, Gaussian noise with mean zero and standard deviation 0.1 is added to the "true" rating ⟨w_i, h_j⟩.
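The following Python sketch illustrates this kind of generator. It is not the exact procedure used for our experiments: the degree lists, the function name, and in particular the way item popularity is matched (only approximately, via weighted sampling) are assumptions made for illustration.

```python
import numpy as np

def synthetic_ratings(user_degrees, item_degrees, k=100, noise_std=0.1, seed=0):
    """Illustrative sketch: per-user rating counts come from a given list,
    item popularity only approximately follows the given item degrees."""
    rng = np.random.default_rng(seed)
    m, n = len(user_degrees), len(item_degrees)
    W = rng.standard_normal((m, k))              # ground-truth user parameters
    H = rng.standard_normal((n, k))              # ground-truth item parameters
    item_p = np.asarray(item_degrees, float)
    item_p /= item_p.sum()
    ratings = []
    for i, d in enumerate(user_degrees):
        cols = rng.choice(n, size=min(d, n), replace=False, p=item_p)
        for j in cols:
            a_ij = W[i] @ H[j] + noise_std * rng.standard_normal()
            ratings.append((i, int(j), a_ij))
    return W, H, ratings

# Tiny usage example with made-up degree lists standing in for the empirical ones.
W, H, R = synthetic_ratings([3, 2, 4], [2, 3, 2, 2, 1], k=8)
print(len(R))
```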

Figure 12 shows that the comparative advantage of NOMAD over DSGD, DSGD++, and CCD++ increases as we grow the scale of the problem. NOMAD clearly outperforms the other methods in all configurations. DSGD++ is very competitive at the small scale, but as the size of the problem grows, NOMAD shows better scaling behavior.

6. CONCLUSION AND FUTURE WORK

From our experimental study we conclude that:

• On a single machine, NOMAD shows near-linear scaling up to 30 threads.

• When a fixed size dataset is distributed across multiple machines, NOMAD shows near-linear scaling up to 32 machines.

• Both in the shared-memory and the distributed-memory setting, NOMAD exhibits superior performance against state-of-the-art competitors; on a commodity hardware cluster the comparative advantage is more conspicuous.

• When both the size of the data and the number of machines grow, the scaling behavior of NOMAD is much nicer than that of its competitors.

Although we only discussed the matrix completion problem in this paper, it is worth noting that the idea of NOMAD is more widely applicable. Specifically, the ideas discussed in this paper can be easily adapted as long as the objective function can be written as

    f(W, H) = Σ_{(i,j) ∈ Ω} f_ij(w_i, h_j).
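For example, matrix completion itself fits this template. How exactly the regularization in (1) is apportioned across ratings is not repeated here, so the weighting below is an assumption made purely for illustration:

```latex
f_{ij}(w_i, h_j) \;=\; \bigl(A_{ij} - \langle w_i, h_j \rangle\bigr)^2
  \;+\; \frac{\lambda}{|\Omega_i|}\, \lVert w_i \rVert^2
  \;+\; \frac{\lambda}{|\bar{\Omega}_j|}\, \lVert h_j \rVert^2 ,
```

where Ω_i is the set of items rated by user i and Ω̄_j (notation assumed here) is the set of users who rated item j; summing f_ij over (i, j) ∈ Ω then recovers a regularized squared-error objective of the same flavor as (1), up to how the λ terms are weighted.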

As part of our ongoing work, we are investigating ways to rewrite Support Vector Machines (SVMs) and binary logistic regression as saddle point problems which have the above structure.

Inference in Latent Dirichlet Allocation (LDA) using a collapsed Gibbs sampler has a structure similar to the stochastic gradient descent updates for matrix factorization. An additional complication in LDA is that the variables need to be normalized. We are also investigating how the NOMAD framework can be used for LDA.

7. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their constructive comments. We thank the Texas Advanced Computing Center at the University of Texas and the Research Computing group at Purdue University for providing infrastructure and timely support for our experiments. Computing experiments on commodity hardware were made possible by an AWS in Education Machine Learning Research Grant Award. This material is partially based upon work supported by the National Science Foundation under grant nos. IIS-1219015 and CCF-1117055.

References

[1] Apache Hadoop, 2009. http://hadoop.apache.org/core.

[2] GraphLab datasets, 2013. http://graphlab.org/downloads/datasets.

[3] Intel Thread Building Blocks, 2013. https://www.threadingbuildingblocks.org.

[4] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.

[5] R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations, 9(2):75–79, 2007.

[6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

[7] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, pages 351–368, 2011.

[8] D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2:199–222, 1969.


Figure 12: Comparison of algorithms when both the dataset size and the number of machines grow. Left: 4 machines, middle: 16 machines, right: 32 machines.

[9] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[10] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! music dataset and KDD-Cup'11. Journal of Machine Learning Research - Proceedings Track, 18:8–18, 2012.

[11] A. Frommer and D. B. Szyld. On asynchronous iterations. Journal of Computational and Applied Mathematics, 123:201–216, 2000.

[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.

[13] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, pages 17–30, 2012.

[14] T. Hoefler, P. Gottschling, W. Rehm, and A. Lumsdaine. Optimizing a conjugate gradient solver with non-blocking operators. Parallel Computing, 2007.

[15] C. J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the conference on Knowledge Discovery and Data Mining, pages 1064–1072, August 2011.

[16] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences. Springer, New York, 1978.

[17] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the International Conference on Very Large Data Bases, pages 716–727, 2012.

[18] B. Recht and C. Re. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, June 2013.

[19] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the conference on Advances in Neural Information Processing Systems, pages 693–701, 2011.

[20] P. Richtarik and M. Takac. Distributed coordinate descent method for learning with big data, 2013. URL http://arxiv.org/abs/1310.2059.

[21] H. E. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[22] S. Shalev-Schwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the International Conference on Machine Learning, pages 928–935, 2008.

[23] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Proceedings of the International Conference on Very Large Data Bases, pages 703–710, 2010.

[24] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the International Conference on the World Wide Web, pages 607–614, 2011.

[25] C. Teflioudi, F. Makari, and R. Gemulla. Distributed matrix completion. In Proceedings of the International Conference on Data Mining, pages 655–664, 2012.

[26] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774, 2012.

[27] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proceedings of the conference on Algorithmic Aspects in Information and Management, pages 337–348, 2008.

[28] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proceedings of the ACM conference on Recommender systems, pages 249–256, 2013.


APPENDIX

A. EFFECT OF THE REGULARIZATION PARAMETER

In this subsection we study the convergence behavior of NOMAD as we change the regularization parameter λ (Figure 13). Note that on the Netflix data (left), for non-optimal choices of the regularization parameter the test RMSE increases from the initial solution as the model overfits or underfits the training data. While NOMAD reliably converges in all cases on Netflix, the convergence is notably faster with higher values of λ; this is expected because regularization smooths the objective function and makes the optimization problem easier to solve. On the other datasets the speed of convergence was not very sensitive to the selection of the regularization parameter.

B. EFFECT OF THE LATENT DIMENSION

In this subsection we study the convergence behavior of NOMAD as we change the dimensionality parameter k (Figure 14). In general, convergence is faster for smaller values of k, as the computational cost of the SGD updates (9) and (10) is linear in k. On the other hand, the model gets richer with higher values of k: as its parameter space expands, it becomes capable of picking up weaker signals in the data, with the risk of overfitting. This is observed in Figure 14 with Netflix (left) and Yahoo! Music (right). On Hugewiki, however, small values of k were sufficient to fit the training data, and the test RMSE suffers from overfitting with higher values of k. Nonetheless, NOMAD reliably converged in all cases.

C. SCALING ON COMMODITY HARDWARE

In this section we augment Section 5.4 by providing the actual plots of the experiment. We increase the number of machines from 1 to 32 and plot how the convergence of NOMAD is affected by the number of machines. As in Section 5.4, we used m1.xlarge machines from Amazon Web Services (AWS), which have a quad-core Intel Xeon E5430 CPU and 15GB of RAM each.

The overall pattern is identical to what was found in Figures 19, 10, and 9 of Section 5.3. Figure 15 shows how the test RMSE decreases as a function of the number of updates. As in Figure 19, the speed of convergence is faster with a larger number of machines, as the updated information is exchanged more frequently. Figure 16 shows the number of updates performed per second in each computation core of each machine; NOMAD exhibits linear scaling on Netflix and Hugewiki but slows down on Yahoo! Music due to the extreme sparsity of the data. Figure 17 compares the convergence speed of different settings when the same amount of computational power is given to each; on every dataset we observe linear to super-linear scaling up to 32 machines.

D. TEST RMSE AS A FUNCTION OF THE NUMBER OF UPDATES IN HPC CLUSTER

In this section we plot the test RMSE as a function of the number of updates on the HPC cluster; these plots were not included in the main text due to page constraints. Figure 18 shows single-machine multi-threaded experiments, and Figure 19 shows multi-machine distributed memory experiments.

E. COMPARISON OF ALGORITHMS FOR DIFFERENT VALUES OF THE REGULARIZATION PARAMETER

In this section we augment the experiments in Section 5.3 by comparing the performance of NOMAD, CCD++, and DSGD for different values of the regularization parameter λ. Figure 20 shows the result of the experiment. As NOMAD and DSGD are both stochastic gradient descent methods, they behave similarly to each other when the regularization parameter is changed. On the other hand, CCD++, which decreases the objective function more greedily, behaves differently.

For small values of λ, CCD++ seems to overfit the model due to its greedy strategy: it generally converges to a worse solution than the others. For high values of λ, however, the strategy of CCD++ is advantageous and it shows rapid initial convergence. Note that in all cases NOMAD is competitive as compared to both CCD++ and DSGD.

F. COMPARISON WITH GRAPHLAB

PowerGraph 2.2 [17], the latest version of GraphLab,^10 was used in our experiments. Since the Intel compiler is unable to compile GraphLab, we compiled it with gcc. The rest of the experimental setting is the same as in Section 5.1.

GraphLab provides a number of optimization algorithms for matrix completion. However, we found that the Stochastic Gradient Descent (SGD) implementation of GraphLab does not converge. According to the GraphLab developers, this is because the abstraction currently provided by GraphLab is not suitable for SGD.^11 Therefore we use the Alternating Least Squares (ALS) algorithm for minimizing the objective function (1).

Although each machine in our HPC cluster is equipped with 32 GB of RAM and we distribute the work across 32 machines in the multi-machine experiments, we had to tune the nfibers parameter to avoid out-of-memory problems. In spite of tuning many different parameters, we were unable to run GraphLab on the Hugewiki data in any setting. For the other datasets we tried both the synchronous and asynchronous engines of GraphLab and report the best of the two settings.

Figure 21 shows results of single-machine multi-threaded experiments, while Figure 22 and Figure 23 show multi-machine experiments on the HPC cluster and the commodity cluster, respectively. Clearly, NOMAD not only converges orders of magnitude faster than GraphLab in every setting, but the quality of the solution is also significantly better.

Compared to the multi-machine setting with a large number of machines (32) but a small number of cores (4) per machine, GraphLab converges faster when it is run in a single-machine setting with a large number of cores (30). We conjecture that this is because locking and unlocking a variable over the network is expensive. In contrast, since NOMAD is a lock-free algorithm, it scales well in both settings.

Although GraphLab biassgd is based on a model different from (1), for completeness we provide comparisons with it

10 Available for download from https://github.com/graphlab-code/graphlab.

11 GraphLab provides a different algorithm named biassgd; however, it is based on a model different from (1).


Figure 13: Convergence behavior of NOMAD when the regularization parameter λ is varied (left: Netflix, middle: Yahoo! Music, right: Hugewiki; machines=8, cores=4, k=100).

Figure 14: Convergence behavior of NOMAD when the latent dimension k is varied (left: Netflix, middle: Yahoo! Music, right: Hugewiki; machines=8, cores=4).

Figure 15: Test RMSE of NOMAD as a function of the number of updates on a commodity hardware cluster when the number of machines is varied.

Figure 16: Number of updates of NOMAD per machine per core per second as a function of the number of machines on a commodity hardware cluster.


Figure 17: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a commodity hardware cluster when the number of machines is varied.

Figure 18: Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied.

Figure 19: Test RMSE of NOMAD as a function of the number of updates on a HPC cluster when the number of machines is varied.


Figure 20: Comparison of NOMAD, DSGD, and CCD++ on a HPC cluster when the regularization parameter λ is varied. The value of λ increases from top to bottom.


on a commodity hardware cluster. Unfortunately, GraphLab biassgd crashed when we ran it on more than 16 machines, so we had to run it on only 16 machines and assumed that GraphLab would scale linearly up to 32 machines in order to generate the plots in Figure 23. Again, NOMAD was orders of magnitude faster than GraphLab and converges to a better solution.


Figure 21: Comparison of NOMAD and GraphLab on a single machine with 30 computation cores (left: Netflix, right: Yahoo! Music).

Figure 22: Comparison of NOMAD and GraphLab on a HPC cluster (left: Netflix, right: Yahoo! Music).

Figure 23: Comparison of NOMAD and GraphLab on a commodity hardware cluster (left: Netflix, right: Yahoo! Music).


Page 4: NOMAD: Non-locking, stOchastic Multi-machine algorithm ...We develop an e cient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchas-tic Multi-machine

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

xx

xx

(a) Initial assignment of Wand H Each worker worksonly on the diagonal activearea in the beginning

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

xx

xx

(b) After a worker finishesprocessing column j it sendsthe corresponding item pa-rameter hj to another workerHere h2 is sent from worker 1to 4

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

xx

xx

(c) Upon receipt the col-umn is processed by the newworker Here worker 4 cannow process column 2 since itowns the column

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

xx

xx

(d) During the execution ofthe algorithm the ownershipof the item parameters hjchanges

Figure 2 Illustration of the NOMAD algorithm

Algorithm 1 the basic NOMAD algorithm

1 λ regularization parameter2 st step size sequence3 initialize parameters

4 wil sim UniformReal(

0 1radick

)for 1 le i le m 1 le l le k

5 hjl sim UniformReal(

0 1radick

)for 1 le j le n 1 le l le k

6 initialize queues

7 for j isin 1 2 n do8 q sim UniformDiscrete 1 2 p9 queue[q]push((jhj))

10 end for11 start p workers

12 Parallel Foreach q isin 1 2 p13 while stop signal is not yet received do14 if queue[q] not empty then15 (jhj)larr queue[q]pop()

16 for (i j) isin Ω(q)j do

17 SGD update

18 tlarr number of updates on (i j)19 wi larr wi minus st middot [(Aij minuswihj)hj + λwi]20 hj larr hj minus st middot [(Aij minuswihj)wj + λhj ] 21 end for22 qprime sim UniformDiscrete 1 2 p23 queue[qprime]push((jhj))24 end if25 end while26 Parallel End

1p fraction of the m user parameters and approximately1p fraction of the n item parameters Furthermore eachworker also stores approximately 1p fraction of the |Ω| rat-ings Since storing a row of W or H requires O(k) spacethe space complexity per worker is O((mk + nk + |Ω|)p)As for time complexity we find it useful to use the follow-ing assumptions performing the SGD updates in line 16 to21 takes a middot k time and communicating a (jhj) to anotherworker takes c middot k time where a and c are hardware depen-dent constants On the average each (jhj) pair containsO (|Ω| np) non-zero entries Therefore when a (jhj) pairis popped from queue[q] in line 15 of Algorithm 1 on theaverage it takes a middot (|Ω| knp) time to process the pair Sincecomputation and communication can be done in parallel aslong as a middot (|Ω| knp) is higher than c middot k a worker thread isalways busy and NOMAD scales linearly

Suppose that |Ω| is fixed but the number of workers pincreases that is we take a fixed size dataset and distributeit across p workers As expected for a large enough valueof p (which is determined by hardware dependent constantsa and b) the cost of communication will overwhelm the costof processing an item thus leading to slowdown

On the other hand suppose the work per worker is fixedthat is |Ω| increases and the number of workers p increasesproportionally The average time a middot(|Ω| knp) to process anitem remains constant and NOMAD scales linearly

Finally we discuss the communication complexity of NO-MAD For this discussion we focus on a single item param-eter hj which consists of O(k) numbers In order to beprocessed by all the p workers once it needs to be commu-nicated p times This requires O(kp) communication peritem There are n items and if we make a simplifying as-sumption that during the execution of NOMAD each itemis processed a constant c number of times by each processorthen the total communication complexity is O(nkp)

33 Dynamic Load BalancingAs different workers have different number of ratings per

item the speed at which a worker processes a set of rat-

ings Ω(q)j for an item j also varies among workers Further-

more in the distributed memory setting different workersmight process updates at different rates dues to differencesin hardware and system load NOMAD can handle this bydynamically balancing the workload of workers in line 22 ofAlgorithm 1 instead of sampling the recipient of a messageuniformly at random we can preferentially select a workerwhich has fewer items in its queue to process To do this apayload carrying information about the size of the queue[q]is added to the messages that the workers send each otherThe overhead of passing the payload information is just asingle integer per message This scheme allows us to dy-namically load balance and ensures that a slower workerwill receive smaller amount of work compared to others

34 Hybrid ArchitectureIn a hybrid architecture we have multiple threads on a sin-

gle machine as well as multiple machines distributed acrossthe network In this case we make two improvements to thebasic algorithm First in order to amortize the communica-tion costs we reserve two additional threads per machine forsending and receiving (jhj) pairs over the network Intra-machine communication is much cheaper than machine-to-machine communication since the former does not involve

4

a network hop Therefore whenever a machine receives a(jhj) pair it circulates the pair among all of its threadsbefore sending the pair over the network This is done byuniformly sampling a random permutation whose size equalsto the number of worker threads and sending the item vari-able to each thread according to this permutation Circu-lating a variable more than once was found to not improveconvergence and hence is not used in our algorithm

35 Implementation DetailsMulti-threaded MPI was used for inter-machine commu-

nication Instead of communicating single (jhj) pairs wefollow the strategy of [23] and accumulate a fixed number ofpairs (eg 100) before transmitting them over the network

NOMAD can be implemented with lock-free data struc-tures since the only interaction between threads is via oper-ations on the queue We used the concurrent queue providedby Intel Thread Building Blocks (TBB) [3] Although tech-nically not lock-free the TBB concurrent queue neverthelessscales almost linearly with the number of threads

Since there is very minimal sharing of memory acrossthreads in NOMAD by making memory assignments in eachthread carefully aligned with cache lines we can exploit mem-ory locality and avoid cache ping-pong This results in nearlinear scaling for the multi-threaded setting

4 RELATED WORK

41 Map-Reduce and FriendsSince many machine learning algorithms are iterative in

nature a popular strategy to distribute them across multi-ple machines is to use bulk synchronization after every it-eration Typically one partitions the data into chunks thatare distributed to the workers at the beginning A mas-ter communicates the current parameters which are used toperform computations on the slaves The slaves return thesolutions during the bulk synchronization step which areused by the master to update the parameters The pop-ularity of this strategy is partly thanks to the widespreadavailability of Hadoop [1] an open source implementationof the MapReduce framework [9]

All three optimization schemes for matrix completion namelyALS CCD++ and SGD can be parallelized using a bulksynchronization strategy This is relatively simple for ALS[27] and CCD++ [26] but a bit more involved for SGD[12 18] Suppose p machines are available Then the Dis-tributed Stochastic Gradient Descent (DSGD) algorithm ofGemulla et al [12] partitions the indices of users 1 2 minto mutually exclusive sets I1 I2 Ip and the indices ofitems into J1 J2 Jp Now define

Ω(q) = (i j) isin Ω i isin Iq j isin Jq 1 le q le pand suppose that each machine runs SGD updates (9) and(10) independently but machine q samples (i j) pairs only

from Ω(q) By construction Ω(q)rsquos are disjoint and hencethese updates can be run in parallel A similar observationwas also made by Recht and Re [18] A bulk synchronizationstep redistributes the sets J1 J2 Jp and correspondingrows of H which in turn changes the Ω(q) processed by eachmachine and the iteration proceeds (see Figure 3)

Unfortunately, bulk synchronization based algorithms have two major drawbacks. First, the communication and computation steps are done in sequence. What this means is

Figure 3: Illustration of the DSGD algorithm with 4 workers. Initially, W and H are partitioned as shown on the left. Each worker runs SGD on its active area, as indicated. After each worker completes processing the data points in its own active area, the columns of the item parameters H^T are exchanged randomly, and the active area changes. This process is repeated for each iteration.

that when the CPU is busy the network is idle, and vice versa. The second issue is that they suffer from what is widely known as the curse of the last reducer [4, 24]; in other words, all machines have to wait for the slowest machine to finish before proceeding to the next iteration. Zhuang et al. [28] report that DSGD suffers from this problem even in the shared memory setting.

DSGD++ is an algorithm proposed by Teflioudi et al. [25] to address the first issue discussed above. Instead of using p partitions, DSGD++ uses 2p partitions. While the p workers are processing p partitions, the other p partitions are sent over the network. This keeps both the network and the CPU busy simultaneously. However, DSGD++ also suffers from the curse of the last reducer.

Another attempt to alleviate the problems of bulk synchronization in the shared memory setting is the FPSGD algorithm of Zhuang et al. [28]: given p threads, FPSGD partitions the parameters into more than p sets and uses a task manager thread to distribute the partitions. When a thread finishes updating one partition, it requests another partition from the task manager. It is unclear how to extend this idea to the distributed memory setting.

In NOMAD we sidestep all the drawbacks of bulk synchronization. Like DSGD++, we simultaneously keep the network and the CPU busy; like FPSGD, we effectively load balance between the threads. To understand why NOMAD enjoys both these benefits, it is instructive to contrast the data partitioning schemes underlying DSGD, DSGD++, FPSGD and NOMAD (see Figure 4). Given p workers, DSGD divides the rating matrix A into p × p blocks. DSGD++ improves upon DSGD by further dividing each block into 1 × 2 sub-blocks (Figure 4 (a) and (b)). On the other hand, FPSGD splits A into p′ × p′ blocks with p′ > p (Figure 4 (c)), while NOMAD uses p × n blocks (Figure 4 (d)). In terms of communication there is no difference between the various partitioning schemes: all of them require O(nkp) communication for each item to be processed a constant number c of times. However, having smaller blocks means that NOMAD has much more flexibility in assigning blocks to processors, and hence a better ability to exploit parallelism. Because NOMAD operates at the level of individual item parameters h_j, it can dynamically load balance by assigning fewer columns to a slower

(a) DSGD    (b) DSGD++    (c) FPSGD    (d) NOMAD

Figure 4: Comparison of data partitioning schemes between algorithms. An example active area of stochastic gradient sampling is marked in gray.

worker. A pleasant side effect of such a fine-grained partitioning, coupled with the lock-free nature of the updates, is that one does not require sophisticated scheduling algorithms to achieve good performance. Consequently, NOMAD outperforms DSGD, DSGD++ and FPSGD.

4.2 Asynchronous Algorithms

There is growing interest in designing machine learning algorithms that do not perform bulk synchronization; see, for instance, the randomized (block) coordinate descent methods of Richtarik and Takac [20] and the Hogwild algorithm of Recht et al. [19]. A relatively new approach to asynchronous parallelism is to use a so-called parameter server. A parameter server is either a single machine or a distributed set of machines which caches the current values of the parameters. Workers store local copies of the parameters and perform updates on them, and periodically synchronize their local copies with the parameter server. The parameter server receives updates from all workers, aggregates them, and communicates them back to the workers. The earliest work on a parameter server that we are aware of is due to Smola and Narayanamurthy [23], who propose using a parameter server for collapsed Gibbs sampling in Latent Dirichlet Allocation. PowerGraph [13], upon which the latest version of the GraphLab toolkit is based, is also essentially built on the idea of a parameter server. The difference in the case of PowerGraph, however, is that the responsibility for the parameters is distributed across multiple machines, but at the added expense of synchronizing the copies.

Very roughly speaking, the asynchronously parallel version of the ALS algorithm in GraphLab works as follows: the w_i and h_j variables are distributed across multiple machines, and whenever w_i is being updated with equation (3), the values of the h_j's for j ∈ Ω_i are retrieved across the network and read-locked until the update is finished. GraphLab provides functionality such as network communication and a distributed locking mechanism to implement this. However,

frequently acquiring read-locks over the network can be expensive. In particular, a popular user who has rated many items will require read locks on a large number of items, and this leads to a vast amount of communication and delays in updates on those items. GraphLab provides a complex job scheduler which attempts to minimize this cost, but then the efficiency of parallelization depends on the difficulty of the scheduling problem and the effectiveness of the scheduler.

In our empirical evaluation, NOMAD performs significantly better than GraphLab. The reasons are not hard to see. First, because of the lock-free nature of NOMAD, we completely avoid acquiring expensive network locks. Second, we use SGD, which allows us to exploit finer-grained parallelism as compared to ALS and also leads to faster convergence. In fact, the GraphLab framework is not well suited for SGD (personal communication with the developers of GraphLab). Finally, because of the finer-grained data partitioning scheme used in NOMAD, we do not require a complicated scheduling mechanism, unlike GraphLab, whose performance depends heavily on the underlying scheduling algorithms.

4.3 Numerical Linear Algebra

The concepts of asynchronous and non-blocking updates have also been studied in numerical linear algebra. To avoid the load balancing problem and to reduce processor idle time, asynchronous numerical methods were first proposed over four decades ago by Chazan and Miranker [8]. Given an operator $H: \mathbb{R}^m \rightarrow \mathbb{R}^m$, to find the fixed-point solution $x^*$ such that $H(x^*) = x^*$, a standard Gauss-Seidel-type procedure performs the update $x_i = (H(x))_i$ sequentially (or randomly). In the asynchronous procedure, each computational node asynchronously conducts updates on each variable (or a subset), $x_i^{\text{new}} = (H(x))_i$, and then overwrites $x_i$ in common memory by $x_i^{\text{new}}$. Theory and applications of this asynchronous method have been widely studied (see the literature review of Frommer and Szyld [11] and the seminal textbook by Bertsekas and Tsitsiklis [6]). This asynchronous fixed-point update is very closely related to the Hogwild algorithm of Recht et al. [19] and the so-called Asynchronous SGD (ASGD) method proposed by Teflioudi et al. [25]. Unfortunately, such algorithms are non-serializable; that is, there may not exist an equivalent update ordering in a serial implementation. In contrast, our NOMAD algorithm is not only asynchronous but also serializable, and therefore achieves faster convergence in practice.

On the other hand, non-blocking communication has also been proposed to accelerate iterative solvers in a distributed setting. For example, Hoefler et al. [14] presented a distributed conjugate gradient implementation with non-blocking collective MPI operations for solving linear systems. However, this algorithm still requires synchronization at each CG iteration, so it is very different from our NOMAD algorithm.

4.4 Discussion

We remark that, among the algorithms we have discussed so far, NOMAD is the only distributed-memory algorithm which is both asynchronous and lock-free. Other parallel versions of SGD, such as DSGD and DSGD++, are lock-free but not fully asynchronous; therefore the cost of synchronization increases as the number of machines grows [28]. On the other hand, the GraphLab implementation of ALS [17] is asynchronous but not lock-free, and therefore depends on


a complex job scheduler to reduce the side effects of using locks.

5. EXPERIMENTS

In this section we evaluate the empirical performance of NOMAD with extensive experiments. For the distributed-memory experiments we compare NOMAD with DSGD [12], DSGD++ [25] and CCD++ [26]. We also compare against GraphLab, but the quality of the results produced by GraphLab is significantly worse than that of the other methods, and therefore the plots for this experiment are relegated to Appendix F. For the shared-memory experiments we pitch NOMAD against FPSGD [28] (which has been shown to outperform DSGD in single-machine experiments) as well as CCD++. Our experiments are designed to answer the following questions:

• How does NOMAD scale with the number of cores on a single machine? (Section 5.2)

• How does NOMAD scale as a fixed-size dataset is distributed across multiple machines? (Section 5.3)

• How does NOMAD perform on a commodity hardware cluster? (Section 5.4)

• How does NOMAD scale when both the size of the data and the number of machines grow? (Section 5.5)

Since the objective function (1) is non-convex, different optimizers will converge to different solutions. Factors which affect the quality of the final solution include 1) the initialization strategy, 2) the sequence in which the ratings are accessed, and 3) the step-size decay schedule. It is clearly not feasible to consider the combinatorial effect of all these factors on each algorithm. However, we believe that the overall trend of our results is not affected by these factors.

5.1 Experimental Setup

Publicly available code for FPSGD³ and CCD++⁴ was used in our experiments. For DSGD and DSGD++, which we had to implement ourselves because the code is not publicly available, we closely followed the recommendations of Gemulla et al. [12] and Teflioudi et al. [25], and in some cases made improvements based on our experience. For a fair comparison, all competing algorithms were tuned for optimal performance on our hardware. The code and scripts required for reproducing the experiments are readily available for download from https://sites.google.com/site/hyokunyun/software. The parameters used in our experiments are summarized in Table 1.

Table 1: Dimensionality parameter k, regularization parameter λ (1), and step-size schedule parameters α, β (11).

Name         k    λ     α        β
Netflix      100  0.05  0.012    0.05
Yahoo Music  100  1.00  0.00075  0.01
Hugewiki     100  0.01  0.001    0

For all experiments except the ones in Section 5.5, we work with three benchmark datasets, namely Netflix, Yahoo Music, and Hugewiki (see Table 2 for more details).

³ http://www.csie.ntu.edu.tw/~cjlin/libmf
⁴ http://www.cs.utexas.edu/~rofuyu/libpmf

Table 2: Dataset details.

Name              Rows        Columns  Non-zeros
Netflix [5]       2,649,429   17,770   99,072,112
Yahoo Music [10]  1,999,990   624,961  252,800,275
Hugewiki [2]      50,082,603  39,780   2,736,496,604

The same training and test dataset partition is used consistently for all algorithms in every experiment. Since our goal is to compare optimization algorithms, we do very minimal parameter tuning. For instance, we used the same regularization parameter λ for each dataset as reported by Yu et al. [26] and shown in Table 1; we study the effect of the regularization parameter on the convergence of NOMAD in Appendix A. By default we use k = 100 for the dimension of the latent space; we study how the dimension of the latent space affects the convergence of NOMAD in Appendix B. All algorithms were initialized with the same initial parameters: we set each entry of W and H by independently sampling a uniform random variable in the range $(0, 1/\sqrt{k})$ [26, 28].
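A minimal sketch of this initialization (our illustration; the factor matrix is stored row-wise here purely for exposition):

// Every entry is drawn independently from Uniform(0, 1/sqrt(k)).
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

std::vector<std::vector<double>> init_factor(std::size_t rows, std::size_t k,
                                             std::mt19937& rng) {
  std::uniform_real_distribution<double> unif(0.0, 1.0 / std::sqrt(static_cast<double>(k)));
  std::vector<std::vector<double>> M(rows, std::vector<double>(k));
  for (auto& row : M)
    for (double& entry : row) entry = unif(rng);
  return M;
}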

We compare solvers in terms of Root Mean Square Error (RMSE) on the test set, which is defined as
$$ \sqrt{ \frac{ \sum_{(i,j) \in \Omega_{\text{test}}} \left( A_{ij} - \langle w_i, h_j \rangle \right)^2 }{ |\Omega_{\text{test}}| } }, $$
where $\Omega_{\text{test}}$ denotes the ratings in the test set.
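A minimal sketch of this evaluation metric (our illustration; W and H are assumed to be stored row-wise):

// `test` holds the (i, j, A_ij) triples of Omega_test.
#include <cmath>
#include <cstddef>
#include <tuple>
#include <vector>

double test_rmse(const std::vector<std::tuple<int, int, double>>& test,
                 const std::vector<std::vector<double>>& W,
                 const std::vector<std::vector<double>>& H) {
  double sum_sq = 0.0;
  for (const auto& t : test) {
    int i = std::get<0>(t), j = std::get<1>(t);
    double pred = 0.0;
    for (std::size_t d = 0; d < W[i].size(); ++d) pred += W[i][d] * H[j][d];
    double err = std::get<2>(t) - pred;  // A_ij - <w_i, h_j>
    sum_sq += err * err;
  }
  return std::sqrt(sum_sq / static_cast<double>(test.size()));
}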

All experiments except the ones reported in Section 5.4 are run on the Stampede Cluster at the University of Texas, a Linux cluster where each node is outfitted with 2 Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi coprocessor (MIC architecture). For single-machine experiments (Section 5.2), we used nodes in the largemem queue, which are equipped with 1TB of RAM and 32 cores. For all other experiments, we used nodes in the normal queue, which are equipped with 32 GB of RAM and 16 cores (only 4 out of the 16 cores were used for computation). Inter-machine communication on this system is handled by MVAPICH2.

For the commodity hardware experiments in Section 5.4, we used m1.xlarge instances of Amazon Web Services, which are equipped with 15GB of RAM and four cores. We utilized all four cores in each machine: NOMAD and DSGD++ use two cores for computation and two cores for network communication, while DSGD and CCD++ use all four cores for both computation and communication. Inter-machine communication on this system is handled by MPICH2.

Since FPSGD uses single-precision arithmetic, the experiments in Section 5.2 are performed using single-precision arithmetic, while all other experiments use double-precision arithmetic. All algorithms are compiled with the Intel C++ compiler, with the exception of the experiments in Section 5.4, where we used gcc, which is the only compiler toolchain available on the commodity hardware cluster. For ready reference, exceptions to the experimental settings specific to each section are summarized in Table 3.

The convergence speed of stochastic gradient descent methods depends on the choice of the step-size schedule. The schedule we used for NOMAD is
$$ s_t = \frac{\alpha}{1 + \beta \cdot t^{1.5}}, \qquad (11) $$
where t is the number of SGD updates that were performed on a particular user-item pair (i, j).
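A one-line sketch of this schedule (our illustration), with α and β taking the per-dataset values from Table 1:

#include <cmath>

// Step size after t updates of a given user-item pair, per equation (11).
double step_size(double alpha, double beta, double t) {
  return alpha / (1.0 + beta * std::pow(t, 1.5));
}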


Table 3: Exceptions to each experiment.

Section 5.2:
• run on largemem queue (32 cores, 1TB RAM)
• single-precision floating point used

Section 5.4:
• run on m1.xlarge (4 cores, 15GB RAM)
• compiled with gcc
• MPICH2 for MPI implementation

Section 5.5:
• synthetic datasets

DSGD and DSGD++, on the other hand, use an alternative strategy called bold-driver [12]: here the step size is adapted by monitoring the change of the objective function.

5.2 Scaling in Number of Cores

For the first experiment we fixed the number of cores to 30 and compared the performance of NOMAD vs. FPSGD⁵ and CCD++ (Figure 5). On Netflix (left), NOMAD not only converges to a slightly better quality solution (RMSE 0.914 vs. 0.916 for the others), but is also able to reduce the RMSE rapidly right from the beginning. On Yahoo Music (middle), NOMAD converges to a slightly worse solution than FPSGD (RMSE 21.894 vs. 21.853), but as in the case of Netflix, the initial convergence is more rapid. On Hugewiki, the difference is smaller, but NOMAD still outperforms the others. The initial speed of CCD++ on Hugewiki is comparable to NOMAD, but the quality of its solution starts to deteriorate in the middle. Note that the performance of CCD++ here is better than what was reported by Zhuang et al. [28], since they used double-precision floating point arithmetic for CCD++. In other experiments (not reported here) we varied the number of cores, and found that the relative differences in performance between NOMAD, FPSGD and CCD++ are very similar to those observed in Figure 5.

For the second experiment we varied the number of cores from 4 to 30 and plot the scaling behavior of NOMAD (Figures 6 and 7). Figure 6 (left) shows how test RMSE changes as a function of the number of updates on Yahoo Music. Interestingly, as we increased the number of cores, the test RMSE decreased faster. We believe this is because, when we increase the number of cores, the rating matrix A is partitioned into smaller blocks; recall that we split A into p × n blocks, where p is the number of parallel workers. Therefore the communication between workers becomes more frequent, and each SGD update is based on fresher information (see also Section 3.2 for mathematical analysis). This effect was more strongly observed on Yahoo Music than on the others, since Yahoo Music has a much larger number of items (624,961 vs. 17,770 for Netflix and 39,780 for Hugewiki), and therefore a larger amount of communication is needed to circulate the new information to all workers. Results for the other datasets are provided in Figure 18 in Appendix D.

On the other hand, to assess the efficiency of computation, we define the average throughput as the average number of ratings processed per core per second, and plot it for each dataset in Figure 6 (right) while varying the number of cores. If NOMAD exhibits linear scaling in terms of the speed at which it processes ratings, the average throughput should remain constant.⁶

⁵ Since the current implementation of FPSGD in LibMF only reports CPU execution time, we divide this by the number of threads and use it as a proxy for wall clock time.

[Figure 6 panels: left: Yahoo (machines=1, λ=1.00, k=100), test RMSE vs. number of updates for cores=4, 8, 16, 30; right: updates per core per second vs. number of cores for Netflix, Yahoo, Hugewiki (machines=1, k=100).]

Figure 6: Left: test RMSE of NOMAD as a function of the number of updates on Yahoo Music when the number of cores is varied. Right: number of updates of NOMAD per core per second as a function of the number of cores.

On Netflix, the average throughput indeed remains almost constant as the number of cores changes. On Yahoo Music and Hugewiki, the throughput decreases to about 50% as the number of cores is increased to 30. We believe this is mainly due to cache locality effects.

Now we study how much speed-up NOMAD can achieve by increasing the number of cores. In Figure 7 we set the y-axis to be test RMSE and the x-axis to be the total CPU time expended, which is given by the number of seconds elapsed multiplied by the number of cores. We plot the convergence curves for cores = 4, 8, 16 and 30. If the curves overlap, this shows that we achieve linear speed-up as we increase the number of cores. This is indeed the case for Netflix and Hugewiki. In the case of Yahoo Music, we observe that the speed of convergence increases as the number of cores increases. This, we believe, is again due to the decrease in block size, which leads to faster convergence.

5.3 Scaling as a Fixed Dataset is Distributed Across Workers

In this subsection we use 4 computation threads per machine. For the first experiment, we fix the number of machines to 32 (64 for Hugewiki) and compare the performance of NOMAD with DSGD, DSGD++ and CCD++ (Figure 8). On Netflix and Hugewiki, NOMAD converges much faster than its competitors: not only is the initial convergence faster, but it also discovers a better quality solution. On Yahoo Music, all four methods perform similarly. This is because the cost of network communication relative to the size of the data is much higher for Yahoo Music: while Netflix and Hugewiki have 5,575 and 68,635 non-zero ratings per item respectively, Yahoo Music has only 404 ratings per item. Therefore, when Yahoo Music is divided equally across 32 machines, each item has only 10 ratings on average per machine. Hence the cost of sending and receiving the item parameter vector h_j for one item j across the network is higher than that of executing SGD updates on the ratings of the item locally stored within the machine, Ω^{(q)}_j. As a consequence, the cost of network communication dominates the overall execution time of all algorithms, and little difference in convergence speed is found between them.

⁶ Note that since we use single-precision floating point arithmetic in this section to match the implementation of FPSGD, the throughput of NOMAD is about 50% higher than in the other experiments.

[Figure 5 panels: Netflix (λ=0.05), Yahoo (λ=1.00), Hugewiki (λ=0.01); machines=1, cores=30, k=100; test RMSE vs. seconds for NOMAD, FPSGD, CCD++.]

Figure 5: Comparison of NOMAD, FPSGD and CCD++ on a single machine with 30 computation cores.

[Figure 7 panels: Netflix (λ=0.05), Yahoo (λ=1.00), Hugewiki (λ=0.01); machines=1, k=100; test RMSE vs. seconds × cores for cores=4, 8, 16, 30.]

Figure 7: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of cores) when the number of cores is varied.

For the second experiment, we varied the number of machines from 1 to 32 and plot the scaling behavior of NOMAD (Figures 10 and 9). Figure 10 (left) shows how test RMSE decreases as a function of the number of updates on Yahoo Music. Again, if NOMAD scales linearly, the average throughput has to remain constant; here we observe an improvement in convergence speed when 8 or more machines are used. This is again the effect of smaller block sizes, which was discussed in Section 5.2. On Netflix a similar effect was present but less significant; on Hugewiki we did not see any notable difference between configurations (see Figure 19 in Appendix D).

In Figure 10 (right) we plot the average throughput (the number of updates per machine per core per second) as a function of the number of machines. On Yahoo Music the average throughput goes down as we increase the number of machines because, as mentioned above, each item has a small number of ratings. On Hugewiki we observe almost linear scaling, and on Netflix the average throughput even improves as we increase the number of machines; we believe this is because of cache locality effects. As we partition users into smaller and smaller blocks, the probability of a cache miss on the user parameters w_i within a block decreases, and on Netflix this makes a meaningful difference: indeed, there are only 480,189 users in Netflix who have at least one rating. When these are equally divided across 32 machines, each machine contains only 11,722 active users on average. Therefore the w_i variables take only 11MB of memory, which is smaller than the L3 cache (20MB) of the machines we used, and this leads to an increase in the number of updates per machine per core per second.

Now we study how much speed-up NOMAD can achieve by increasing the number of machines. In Figure 9 we set the y-axis to be test RMSE and the x-axis to be the number of seconds elapsed multiplied by the total number of cores used in the configuration. Again, all lines will coincide with each other if NOMAD shows linear scaling. On Netflix, with 2 and 4 machines we observe a mild slowdown, but with more than 4 machines NOMAD exhibits super-linear scaling. On Yahoo Music we observe super-linear scaling with respect to the speed of a single machine in all configurations, but the highest speedup is seen with 16 machines. On Hugewiki, linear scaling is observed in every configuration.

5.4 Scaling on Commodity Hardware

In this subsection we analyze the scaling behavior of NOMAD on commodity hardware. Using Amazon Web Services (AWS), we set up a computing cluster that consists of 32 machines; each machine is of type m1.xlarge and is equipped with a quad-core Intel Xeon E5430 CPU and 15GB of RAM. The network bandwidth among these machines is reported to be approximately 1Gb/s.⁷

Since NOMAD and DSGD++ dedicate two threads to network communication on each machine, only two cores are available for computation.⁸ In contrast, bulk synchronization algorithms such as DSGD and CCD++, which separate

⁷ http://epamcloud.blogspot.com/2013/03/testing-amazon-ec2-network-speed.html
⁸ Since network communication is not computation-intensive for DSGD++, we used four computation threads instead of two and got better results; thus we report results with four computation threads for DSGD++.

[Figure 8 panels: Netflix (machines=32, λ=0.05), Yahoo (machines=32, λ=1.00), Hugewiki (machines=64, λ=0.01); cores=4, k=100; test RMSE vs. seconds for NOMAD, DSGD, DSGD++, CCD++.]

Figure 8: Comparison of NOMAD, DSGD, DSGD++ and CCD++ on a HPC cluster.

[Figure 9 panels: Netflix, Yahoo (machines=1-32), Hugewiki (machines=4-64); cores=4, k=100; test RMSE vs. seconds × machines × cores.]

Figure 9: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a HPC cluster when the number of machines is varied.

computation and communication, can utilize all four cores for computation. In spite of this disadvantage, Figure 11 shows that NOMAD outperforms all other algorithms in this setting as well. In this plot we fixed the number of machines to 32; on Netflix and Hugewiki, NOMAD converges more rapidly to a better solution. Recall that on Yahoo Music all four algorithms performed very similarly on the HPC cluster in Section 5.3. However, on commodity hardware NOMAD outperforms the other algorithms. This shows that the efficiency of network communication plays a very important role in commodity hardware clusters, where the communication is relatively slow. On Hugewiki, however, the number of columns is very small compared to the number of ratings, and thus network communication plays a smaller role in this dataset compared to the others. Therefore the initial convergence of DSGD is a bit faster than NOMAD, as it uses all four cores for computation while NOMAD uses only two. Still, the overall convergence speed is similar, and NOMAD finds a better quality solution.

As in Section 5.3, we increased the number of machines from 1 to 32 and studied the scaling behavior of NOMAD. The overall trend was identical to what we observed in Figures 10 and 9; due to page constraints, the plots for this experiment can be found in Appendix C.

5.5 Scaling as both Dataset Size and Number of Machines Grow

In the previous sections (Section 5.3 and Section 5.4), we studied the scalability of the algorithms by partitioning a fixed amount of data across an increasing number of machines. In real-world applications of collaborative filtering, however, the

[Figure 10 panels: left: Yahoo (cores=4, λ=1.00, k=100), test RMSE vs. number of updates for machines=1-32; right: updates per machine per core per second vs. number of machines for Netflix, Yahoo, Hugewiki.]

Figure 10: Results on HPC cluster when the number of machines is varied. Left: test RMSE of NOMAD as a function of the number of updates on Netflix and Yahoo Music. Right: number of updates of NOMAD per machine per core per second as a function of the number of machines.

size of the data grows over time as new users are added to the system. Therefore, to match the increased amount of data with an equivalent amount of physical memory and computational power, the number of machines should increase as well. The aim of this section is to compare the scaling behavior of NOMAD and that of other algorithms in this realistic scenario.

To simulate such a situation, we generated synthetic datasets whose characteristics resemble real data: the number of ratings for each user and each item is sampled from the corresponding empirical distribution of the Netflix data. As we increase the number of machines from 4 to 32, we fix the number of items to be the same as that of Netflix (17,770) and increase the number of users proportionally to the

[Figure 11 panels: Netflix, Yahoo, Hugewiki (machines=32, cores=4, k=100); test RMSE vs. seconds for NOMAD, DSGD, DSGD++, CCD++.]

Figure 11: Comparison of NOMAD, DSGD, DSGD++ and CCD++ on a commodity hardware cluster.

number of machines (480,189 × the number of machines⁹). Therefore the expected number of ratings in each dataset is proportional to the number of machines (99,072,112 × the number of machines) as well.

Conditioned on the number of ratings for each user and item, the nonzero locations are sampled uniformly at random. Ground-truth user parameters w_i and item parameters h_j are generated from a 100-dimensional standard isotropic Gaussian distribution, and for each rating A_ij, Gaussian noise with mean zero and standard deviation 0.1 is added to the "true" rating ⟨w_i, h_j⟩.
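A simplified sketch of this generator (our illustration; the degree-matched sampling of nonzero locations is omitted for brevity):

// Ground-truth factors come from a standard Gaussian; each observed rating
// is the inner product plus N(0, 0.1^2) noise.
#include <cstddef>
#include <random>
#include <vector>

std::vector<double> gaussian_factor(std::size_t k, std::mt19937& rng) {
  std::normal_distribution<double> gauss(0.0, 1.0);
  std::vector<double> v(k);
  for (double& x : v) x = gauss(rng);
  return v;
}

double noisy_rating(const std::vector<double>& w_i, const std::vector<double>& h_j,
                    std::mt19937& rng) {
  std::normal_distribution<double> noise(0.0, 0.1);
  double truth = 0.0;
  for (std::size_t d = 0; d < w_i.size(); ++d) truth += w_i[d] * h_j[d];
  return truth + noise(rng);  // A_ij = <w_i, h_j> + noise
}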

Figure 12 shows that the comparative advantage of NOMAD over DSGD, DSGD++ and CCD++ increases as we grow the scale of the problem. NOMAD clearly outperforms the other methods in all configurations. DSGD++ is very competitive at the small scale, but as the size of the problem grows, NOMAD shows better scaling behavior.

6. CONCLUSION AND FUTURE WORK

From our experimental study we conclude that:

• On a single machine, NOMAD shows near-linear scaling up to 30 threads.

• When a fixed-size dataset is distributed across multiple machines, NOMAD shows near-linear scaling up to 32 machines.

• Both in the shared-memory and the distributed-memory setting, NOMAD exhibits superior performance against state-of-the-art competitors; on a commodity hardware cluster the comparative advantage is even more conspicuous.

• When both the size of the data and the number of machines grow, the scaling behavior of NOMAD is much nicer than that of its competitors.

Although we only discussed the matrix completion problem in this paper, it is worth noting that the idea of NOMAD is more widely applicable. Specifically, the ideas discussed in this paper can be easily adapted as long as the objective function can be written as
$$ f(W, H) = \sum_{(i,j) \in \Omega} f_{ij}(w_i, h_j). $$
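For instance, regularized matrix completion itself fits this template once the regularizer is distributed over the ratings; one standard decomposition (shown only for illustration, and not necessarily with the exact weighting used in (1)) is
$$ f_{ij}(w_i, h_j) = \left( A_{ij} - \langle w_i, h_j \rangle \right)^2 + \lambda \left( \frac{\|w_i\|^2}{|\Omega_i|} + \frac{\|h_j\|^2}{|\bar{\Omega}_j|} \right), $$
where Ω_i denotes the items rated by user i and Ω̄_j the users who rated item j.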

As part of our ongoing work, we are investigating ways to rewrite Support Vector Machines (SVMs) and binary logistic regression as saddle point problems which have the above structure.

⁹ 480,189 is the number of users in Netflix who have at least one rating.

Inference in Latent Dirichlet Allocation (LDA) using a collapsed Gibbs sampler has a similar structure to the stochastic gradient descent updates for matrix factorization. An additional complication in LDA is that the variables need to be normalized. We are also investigating how the NOMAD framework can be used for LDA.

7. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their constructive comments. We thank the Texas Advanced Computing Center at the University of Texas and the Research Computing group at Purdue University for providing infrastructure and timely support for our experiments. Computing experiments on commodity hardware were made possible by an AWS in Education Machine Learning Research Grant Award. This material is partially based upon work supported by the National Science Foundation under grant nos. IIS-1219015 and CCF-1117055.

References

[1] Apache Hadoop, 2009. http://hadoop.apache.org/core.

[2] GraphLab datasets, 2013. http://graphlab.org/downloads/datasets.

[3] Intel Threading Building Blocks, 2013. https://www.threadingbuildingblocks.org.

[4] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.

[5] R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations, 9(2):75–79, 2007.

[6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

[7] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, pages 351–368, 2011.

[8] D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2:199–222, 1969.

[Figure 12 panels: machines=4, 16, 32 (cores=4, k=100, λ=0.01); test RMSE vs. seconds for NOMAD, DSGD, DSGD++, CCD++.]

Figure 12: Comparison of algorithms when both the dataset size and the number of machines grow. Left: 4 machines; middle: 16 machines; right: 32 machines.

[9] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[10] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! Music dataset and KDD-Cup'11. Journal of Machine Learning Research - Proceedings Track, 18:8–18, 2012.

[11] A. Frommer and D. B. Szyld. On asynchronous iterations. Journal of Computational and Applied Mathematics, 123:201–216, 2000.

[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the Conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.

[13] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, pages 17–30, 2012.

[14] T. Hoefler, P. Gottschling, W. Rehm, and A. Lumsdaine. Optimizing a conjugate gradient solver with non-blocking operators. Parallel Computing, 2007.

[15] C. J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the Conference on Knowledge Discovery and Data Mining, pages 1064–1072, August 2011.

[16] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences. Springer, New York, 1978.

[17] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the International Conference on Very Large Data Bases, pages 716–727, 2012.

[18] B. Recht and C. Re. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, June 2013.

[19] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the Conference on Advances in Neural Information Processing Systems, pages 693–701, 2011.

[20] P. Richtarik and M. Takac. Distributed coordinate descent method for learning with big data, 2013. URL http://arxiv.org/abs/1310.2059.

[21] H. E. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[22] S. Shalev-Schwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the International Conference on Machine Learning, pages 928–935, 2008.

[23] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Proceedings of the International Conference on Very Large Data Bases, pages 703–710, 2010.

[24] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the International Conference on the World Wide Web, pages 607–614, 2011.

[25] C. Teflioudi, F. Makari, and R. Gemulla. Distributed matrix completion. In Proceedings of the International Conference on Data Mining, pages 655–664, 2012.

[26] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774, 2012.

[27] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proceedings of the Conference on Algorithmic Aspects in Information and Management, pages 337–348, 2008.

[28] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proceedings of the ACM Conference on Recommender Systems, pages 249–256, 2013.


APPENDIX

A. EFFECT OF THE REGULARIZATION PARAMETER

In this subsection we study the convergence behavior of NOMAD as we change the regularization parameter λ (Figure 13). Note that on the Netflix data (left), for non-optimal choices of the regularization parameter the test RMSE increases from the initial solution, as the model overfits or underfits the training data. While NOMAD reliably converges in all cases, on Netflix the convergence is notably faster with higher values of λ; this is expected, because regularization smooths the objective function and makes the optimization problem easier to solve. On the other datasets the speed of convergence was not very sensitive to the selection of the regularization parameter.

B. EFFECT OF THE LATENT DIMENSION

In this subsection we study the convergence behavior of NOMAD as we change the dimensionality parameter k (Figure 14). In general, convergence is faster for smaller values of k, as the computational cost of the SGD updates (9) and (10) is linear in k. On the other hand, the model gets richer with higher values of k: as its parameter space expands, it becomes capable of picking up weaker signals in the data, with the risk of overfitting. This is observed in Figure 14 with Netflix (left) and Yahoo Music (right). In Hugewiki, however, small values of k were sufficient to fit the training data, and the test RMSE suffers from overfitting with higher values of k. Nonetheless, NOMAD reliably converged in all cases.

C. SCALING ON COMMODITY HARDWARE

In this section we augment Section 5.4 by providing the actual plots of the experiment. We increase the number of machines from 1 to 32 and plot how the convergence of NOMAD is affected by the number of machines. As in Section 5.4, we used m1.xlarge machines from Amazon Web Services (AWS), which have a quad-core Intel Xeon E5430 CPU and 15GB of RAM each.

The overall pattern is identical to what was found in Figures 19, 10 and 9 of Section 5.3. Figure 15 shows how the test RMSE decreases as a function of the number of updates. As in Figure 19, the speed of convergence is faster with a larger number of machines, as updated information is exchanged more frequently. Figure 16 shows the number of updates performed per second on each computation core of each machine: NOMAD exhibits linear scaling on Netflix and Hugewiki, but slows down on Yahoo Music due to the extreme sparsity of the data. Figure 17 compares the convergence speed of different settings when the same amount of computational power is given to each; on every dataset we observe linear to super-linear scaling up to 32 machines.

D. TEST RMSE AS A FUNCTION OF THE NUMBER OF UPDATES ON THE HPC CLUSTER

In this section we plot the test RMSE as a function of the number of updates on the HPC cluster, which was not included in the main text due to page constraints. Figure 18 shows single-machine multi-threaded experiments, and Figure 19 shows multi-machine distributed-memory experiments.

E. COMPARISON OF ALGORITHMS FOR DIFFERENT VALUES OF THE REGULARIZATION PARAMETER

In this section we augment the experiments in Section 5.3 by comparing the performance of NOMAD, CCD++ and DSGD for different values of the regularization parameter λ. Figure 20 shows the results of the experiment. As NOMAD and DSGD are both stochastic gradient descent methods, they behave similarly to each other when the regularization parameter is changed. On the other hand, CCD++, which decreases the objective function more greedily, behaves differently.

For small values of λ, CCD++ seems to overfit due to its greedy strategy; it generally converges to a worse solution than the others. For high values of λ, however, the strategy of CCD++ is advantageous, and it shows rapid initial convergence. Note that in all cases NOMAD is competitive with both CCD++ and DSGD.

F. COMPARISON WITH GRAPHLAB

PowerGraph 2.2 [17], the latest version of GraphLab,¹⁰ was used in our experiments. Since the Intel compiler is unable to compile GraphLab, we compiled it with gcc. The rest of the experimental setting is the same as in Section 5.1.

GraphLab provides a number of optimization algorithms for matrix completion. However, we found that the Stochastic Gradient Descent (SGD) implementation of GraphLab does not converge. According to the GraphLab developers, this is because the abstraction currently provided by GraphLab is not suitable for SGD.¹¹ Therefore we use the Alternating Least Squares (ALS) algorithm for minimizing the objective function (1).

Although each machine in our HPC cluster is equipped with 32 GB of RAM, and we distribute the work across 32 machines in the multi-machine experiments, we had to tune the nfibers parameter to avoid out-of-memory problems. In spite of tuning many different parameters, we were unable to run GraphLab on the Hugewiki data in any setting. For the other datasets we tried both the synchronous and asynchronous engines of GraphLab, and report the better of the two settings.

Figure 21 shows the results of single-machine multi-threaded experiments, while Figure 22 and Figure 23 show multi-machine experiments on the HPC cluster and the commodity cluster, respectively. Clearly, NOMAD not only converges orders of magnitude faster than GraphLab in every setting, but the quality of its solution is also significantly better.

Compared to the multi-machine setting with a large number of machines (32) but a small number of cores (4) per machine, GraphLab converges faster when it is run in a single-machine setting with a large number of cores (30). We conjecture that this is because locking and unlocking a variable over the network is expensive. In contrast, since NOMAD is a lock-free algorithm, it scales well in both settings.

Although GraphLab biassgd is based on a model different from (1), for completeness we provide comparisons with it

¹⁰ Available for download from https://github.com/graphlab-code/graphlab.
¹¹ GraphLab provides a different algorithm named biassgd; however, it is based on a model different from (1).

[Figure 13 panels: Netflix (λ=0.0005, 0.005, 0.05, 0.5), Yahoo (λ=0.25, 0.5, 1, 2), Hugewiki (λ=0.0025, 0.005, 0.01, 0.02); machines=8, cores=4, k=100; test RMSE vs. seconds.]

Figure 13: Convergence behavior of NOMAD when the regularization parameter λ is varied.

[Figure 14 panels: Netflix (λ=0.05), Yahoo (λ=1.00), Hugewiki (λ=0.01); machines=8, cores=4; test RMSE vs. seconds for k=10, 20, 50, 100.]

Figure 14: Convergence behavior of NOMAD when the latent dimension k is varied.

[Figure 15 panels: Netflix, Yahoo (machines=1-32), Hugewiki (machines=8-32); cores=4, k=100; test RMSE vs. number of updates.]

Figure 15: Test RMSE of NOMAD as a function of the number of updates on a commodity hardware cluster when the number of machines is varied.

[Figure 16 panels: Netflix, Yahoo, Hugewiki (cores=4, k=100); updates per machine per core per second vs. number of machines.]

Figure 16: Number of updates of NOMAD per machine per core per second as a function of the number of machines on a commodity hardware cluster.

[Figure 17 panels: Netflix, Yahoo (machines=1-32), Hugewiki (machines=8-32); cores=4, k=100; test RMSE vs. seconds × machines × cores.]

Figure 17: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a commodity hardware cluster when the number of machines is varied.

[Figure 18 panels: Netflix (λ=0.05), Yahoo (λ=1.00), Hugewiki (λ=0.01); machines=1, k=100; test RMSE vs. number of updates for cores=4, 8, 16, 30.]

Figure 18: Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied.

[Figure 19 panels: Netflix, Yahoo (machines=1-32), Hugewiki (machines=4-64); cores=4, k=100; test RMSE vs. number of updates.]

Figure 19: Test RMSE of NOMAD as a function of the number of updates on a HPC cluster when the number of machines is varied.

[Figure 20 panels: five rows of (Netflix machines=32, Yahoo machines=32, Hugewiki machines=64), cores=4, k=100; λ per row: Netflix 0.0125, 0.025, 0.05, 0.1, 0.2; Yahoo 0.25, 0.5, 1, 2, 4; Hugewiki 0.0025, 0.005, 0.01, 0.02, 0.04; test RMSE vs. seconds for NOMAD, DSGD, CCD++.]

Figure 20: Comparison of NOMAD, DSGD and CCD++ on a HPC cluster when the regularization parameter λ is varied. The value of λ increases from top to bottom.


on the commodity hardware cluster. Unfortunately, GraphLab biassgd crashed when we ran it on more than 16 machines, so we had to run it on only 16 machines and assumed that GraphLab will scale linearly up to 32 machines in order to generate the plots in Figure 23. Again, NOMAD was orders of magnitude faster than GraphLab and converges to a better solution.

[Figure 21 panels: Netflix (λ=0.05), Yahoo (λ=1.00); machines=1, cores=30, k=100; test RMSE vs. seconds for NOMAD and GraphLab ALS.]

Figure 21: Comparison of NOMAD and GraphLab on a single machine with 30 computation cores.

[Figure 22 panels: Netflix (λ=0.05), Yahoo (λ=1.00); machines=32, cores=4, k=100; test RMSE vs. seconds for NOMAD and GraphLab ALS.]

Figure 22: Comparison of NOMAD and GraphLab on a HPC cluster.

[Figure 23 panels: Netflix (λ=0.05), Yahoo (λ=1.00); machines=32, cores=4, k=100; test RMSE vs. seconds for NOMAD, GraphLab ALS, GraphLab biassgd.]

Figure 23: Comparison of NOMAD and GraphLab on a commodity hardware cluster.

18

  • Introduction
  • Background
    • Alternating Least Squares
    • Coordinate Descent
    • Stochastic Gradient Descent
      • NOMAD
        • Description
        • Complexity Analysis
        • Dynamic Load Balancing
        • Hybrid Architecture
        • Implementation Details
          • Related Work
            • Map-Reduce and Friends
            • Asynchronous Algorithms
            • Numerical Linear Algebra
            • Discussion
              • Experiments
                • Experimental Setup
                • Scaling in Number of Cores
                • Scaling as a Fixed Dataset is Distributed Across Workers
                • Scaling on Commodity Hardware
                • Scaling as both Dataset Size and Number of Machines Grows
                  • Conclusion and Future Work
                  • Acknowledgements
                  • Effect of the Regularization Parameter
                  • Effect of the Latent Dimension
                  • Scaling on Commodity Hardware
                  • Test RMSE as a function of the number of updates in HPC cluster
                  • Comparison of Algorithms for Different Values of the Regularization Parameter
                  • Comparison with GraphLab
Page 5: NOMAD: Non-locking, stOchastic Multi-machine algorithm ...We develop an e cient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchas-tic Multi-machine

a network hop Therefore whenever a machine receives a(jhj) pair it circulates the pair among all of its threadsbefore sending the pair over the network This is done byuniformly sampling a random permutation whose size equalsto the number of worker threads and sending the item vari-able to each thread according to this permutation Circu-lating a variable more than once was found to not improveconvergence and hence is not used in our algorithm

35 Implementation DetailsMulti-threaded MPI was used for inter-machine commu-

nication Instead of communicating single (jhj) pairs wefollow the strategy of [23] and accumulate a fixed number ofpairs (eg 100) before transmitting them over the network

NOMAD can be implemented with lock-free data struc-tures since the only interaction between threads is via oper-ations on the queue We used the concurrent queue providedby Intel Thread Building Blocks (TBB) [3] Although tech-nically not lock-free the TBB concurrent queue neverthelessscales almost linearly with the number of threads

Since there is very minimal sharing of memory acrossthreads in NOMAD by making memory assignments in eachthread carefully aligned with cache lines we can exploit mem-ory locality and avoid cache ping-pong This results in nearlinear scaling for the multi-threaded setting

4 RELATED WORK

41 Map-Reduce and FriendsSince many machine learning algorithms are iterative in

nature a popular strategy to distribute them across multi-ple machines is to use bulk synchronization after every it-eration Typically one partitions the data into chunks thatare distributed to the workers at the beginning A mas-ter communicates the current parameters which are used toperform computations on the slaves The slaves return thesolutions during the bulk synchronization step which areused by the master to update the parameters The pop-ularity of this strategy is partly thanks to the widespreadavailability of Hadoop [1] an open source implementationof the MapReduce framework [9]

All three optimization schemes for matrix completion namelyALS CCD++ and SGD can be parallelized using a bulksynchronization strategy This is relatively simple for ALS[27] and CCD++ [26] but a bit more involved for SGD[12 18] Suppose p machines are available Then the Dis-tributed Stochastic Gradient Descent (DSGD) algorithm ofGemulla et al [12] partitions the indices of users 1 2 minto mutually exclusive sets I1 I2 Ip and the indices ofitems into J1 J2 Jp Now define

Ω(q) = (i j) isin Ω i isin Iq j isin Jq 1 le q le pand suppose that each machine runs SGD updates (9) and(10) independently but machine q samples (i j) pairs only

from Ω(q) By construction Ω(q)rsquos are disjoint and hencethese updates can be run in parallel A similar observationwas also made by Recht and Re [18] A bulk synchronizationstep redistributes the sets J1 J2 Jp and correspondingrows of H which in turn changes the Ω(q) processed by eachmachine and the iteration proceeds (see Figure 3)

Unfortunately bulk synchronization based algorithms havetwo major drawbacks First the communication and com-putation steps are done in sequence What this means is

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

Figure 3 Illustration of DSGD algorithm with 4 workersInitially W and H are partitioned as shown on the left Eachworker runs SGD on its active area as indicated After eachworker completes processing data points in its own activearea the columns of item parameters Hgt are exchangedrandomly and the active area changes This process is re-peated for each iteration

that when the CPU is busy the network is idle and viceversa The second issue is that they suffer from what iswidely known as the the curse of last reducer [4 24] In otherwords all machines have to wait for the slowest machine tofinish before proceeding to the next iteration Zhuang et al[28] report that DSGD suffers from this problem even in theshared memory setting

DSGD++ is an algorithm proposed by Teflioudi et al[25] to address the first issue discussed above Instead ofusing p partitions DSGD++ uses 2p partitions While thep workers are processing p partitions the other p partitionsare sent over the network This keeps both the network andCPU busy simultaneously However DSGD++ also suffersfrom the curse of the last reducer

Another attempt to alleviate the problems of bulk synchronization in the shared memory setting is the FPSGD algorithm of Zhuang et al. [28]: given p threads, FPSGD partitions the parameters into more than p sets and uses a task manager thread to distribute the partitions. When a thread finishes updating one partition, it requests another partition from the task manager. It is unclear how to extend this idea to the distributed memory setting.

In NOMAD we sidestep all the drawbacks of bulk synchronization. Like DSGD++, we keep the network and the CPU busy simultaneously. Like FPSGD, we effectively load balance between the threads. To understand why NOMAD enjoys both these benefits, it is instructive to contrast the data partitioning schemes underlying DSGD, DSGD++, FPSGD, and NOMAD (see Figure 4). Given p workers, DSGD divides the rating matrix A into $p \times p$ blocks, and DSGD++ improves upon DSGD by further dividing each block into $1 \times 2$ sub-blocks (Figure 4 (a) and (b)). FPSGD, on the other hand, splits A into $p' \times p'$ blocks with $p' > p$ (Figure 4 (c)), while NOMAD uses $p \times n$ blocks (Figure 4 (d)). In terms of communication there is no difference between the various partitioning schemes: all of them require O(nkp) communication for each item to be processed a constant number c of times. However, having smaller blocks means that NOMAD has much more flexibility in assigning blocks to processors, and hence a better ability to exploit parallelism. Because NOMAD operates at the level of individual item parameters $h_j$, it can dynamically load balance by assigning fewer columns to a slower worker.


Figure 4: Comparison of data partitioning schemes between algorithms: (a) DSGD, (b) DSGD++, (c) FPSGD, (d) NOMAD. An example active area of stochastic gradient sampling is marked in gray.

A pleasant side effect of such a fine-grained partitioning, coupled with the lock-free nature of updates, is that one does not require sophisticated scheduling algorithms to achieve good performance. Consequently, NOMAD outperforms DSGD, DSGD++, and FPSGD.
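To make the discussion of $p \times n$ blocks concrete, the following toy sketch illustrates column-wise ownership passing in shared memory (it is not the C++ implementation evaluated in this paper; the queue-based routing to a uniformly random worker and all parameter values are illustrative simplifications). Each worker owns a fixed block of users, and an item column is only ever held by one worker at a time, so no locks are needed; a real implementation could additionally route columns away from overloaded workers to obtain dynamic load balancing.

```python
import numpy as np, threading, queue, random

def column_circulation_toy(A, k=3, p=2, step=0.01, lam=0.05, updates_per_worker=200):
    """Toy illustration of p x n column-wise ownership: each worker owns a block
    of user rows, while item columns (j, h_j) circulate between workers through
    queues, so no variable is ever locked."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k)) / np.sqrt(k)
    row_blocks = np.array_split(np.arange(m), p)
    queues = [queue.Queue() for _ in range(p)]
    for j in range(n):                                    # scatter initial column ownership
        queues[j % p].put((j, rng.random(k) / np.sqrt(k)))

    def worker(q_id):
        for _ in range(updates_per_worker):
            try:
                j, h = queues[q_id].get(timeout=0.1)
            except queue.Empty:
                continue
            for i in row_blocks[q_id]:                    # only ratings owned by this worker
                if A[i, j] != 0:
                    w_old = W[i].copy()
                    err = A[i, j] - w_old @ h
                    W[i] += step * (err * h - lam * w_old)
                    h += step * (err * w_old - lam * h)
            queues[random.randrange(p)].put((j, h))       # pass ownership of column j on

    threads = [threading.Thread(target=worker, args=(q,)) for q in range(p)]
    for t in threads: t.start()
    for t in threads: t.join()
    return W

W = column_circulation_toy((np.random.rand(8, 8) < 0.3) * np.random.rand(8, 8))
```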

4.2 Asynchronous Algorithms

There is growing interest in designing machine learning algorithms that do not perform bulk synchronization. See, for instance, the randomized (block) coordinate descent methods of Richtarik and Takac [20] and the Hogwild algorithm of Recht et al. [19]. A relatively new approach to asynchronous parallelism is to use a so-called parameter server. A parameter server is either a single machine or a distributed set of machines which caches the current values of the parameters. Workers store local copies of the parameters, perform updates on them, and periodically synchronize their local copies with the parameter server. The parameter server receives updates from all workers, aggregates them, and communicates them back to the workers. The earliest work on a parameter server that we are aware of is due to Smola and Narayanamurthy [23], who propose using a parameter server for collapsed Gibbs sampling in Latent Dirichlet Allocation. PowerGraph [13], upon which the latest version of the GraphLab toolkit is based, is also essentially based on the idea of a parameter server. The difference in the case of PowerGraph, however, is that the responsibility for parameters is distributed across multiple machines, at the added expense of synchronizing the copies.

Very roughly speaking, the asynchronously parallel version of the ALS algorithm in GraphLab works as follows: the $w_i$ and $h_j$ variables are distributed across multiple machines, and whenever $w_i$ is being updated with equation (3), the values of the $h_j$'s for $j \in \Omega_i$ are retrieved across the network and read-locked until the update is finished. GraphLab provides functionality such as network communication and a distributed locking mechanism to implement this. However, frequently acquiring read-locks over the network can be expensive. In particular, a popular user who has rated many items will require read locks on a large number of items, and this will lead to a vast amount of communication and delays in updates on those items. GraphLab provides a complex job scheduler which attempts to minimize this cost, but then the efficiency of parallelization depends on the difficulty of the scheduling problem and the effectiveness of the scheduler.

In our empirical evaluation NOMAD performs significantly better than GraphLab. The reasons are not hard to see. First, because of the lock-free nature of NOMAD, we completely avoid acquiring expensive network locks. Second, we use SGD, which allows us to exploit finer grained parallelism compared to ALS and also leads to faster convergence; in fact, the GraphLab framework is not well suited for SGD (personal communication with the developers of GraphLab). Finally, because of the finer grained data partitioning scheme used in NOMAD, we do not require a complicated scheduling mechanism, unlike GraphLab, whose performance heavily depends on the underlying scheduling algorithms.

4.3 Numerical Linear Algebra

The concepts of asynchronous and non-blocking updates have also been studied in numerical linear algebra. To avoid the load balancing problem and to reduce processor idle time, asynchronous numerical methods were first proposed over four decades ago by Chazan and Miranker [8]. Given an operator $H : \mathbb{R}^m \rightarrow \mathbb{R}^m$, to find the fixed point solution $x^*$ such that $H(x^*) = x^*$, a standard Gauss-Seidel-type procedure performs the update $x_i = (H(x))_i$ sequentially (or randomly). Using the asynchronous procedure, each computational node asynchronously conducts updates on each variable (or a subset), $x_i^{\text{new}} = (H(x))_i$, and then overwrites $x_i$ in common memory by $x_i^{\text{new}}$. Theory and applications of this asynchronous method have been widely studied (see the literature review of Frommer and Szyld [11] and the seminal textbook by Bertsekas and Tsitsiklis [6]). The concept of this asynchronous fixed-point update is very closely related to the Hogwild algorithm of Recht et al. [19] and the so-called Asynchronous SGD (ASGD) method proposed by Teflioudi et al. [25]. Unfortunately, such algorithms are non-serializable, that is, there may not exist an equivalent update ordering in a serial implementation. In contrast, our NOMAD algorithm is not only asynchronous but also serializable, and therefore achieves faster convergence in practice.
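As a small, self-contained illustration of this style of update (this is only a sketch in the spirit of chaotic relaxation, not code from [8] or [6]), the following runs an asynchronous Jacobi-type fixed-point iteration for a strictly diagonally dominant linear system: each thread repeatedly overwrites its own coordinates of a shared iterate using whatever values of the other coordinates it happens to see, without any locking.

```python
import numpy as np, threading

def async_fixed_point(A, b, sweeps=200, workers=4):
    """Asynchronous relaxation for A x = b: thread t owns a block of coordinates
    and overwrites x_i = (H(x))_i in shared memory, where H is the Jacobi operator.
    A is assumed strictly diagonally dominant, so the iteration converges."""
    m = A.shape[0]
    x = np.zeros(m)                              # shared iterate in "common memory"
    blocks = np.array_split(np.arange(m), workers)

    def worker(idx):
        for _ in range(sweeps):
            for i in idx:
                # x_i^new = (b_i - sum_{j != i} A_ij x_j) / A_ii, using whatever x is visible
                x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

    threads = [threading.Thread(target=worker, args=(blk,)) for blk in blocks]
    for t in threads: t.start()
    for t in threads: t.join()
    return x

# toy usage: a strictly diagonally dominant system
rng = np.random.default_rng(0)
A = rng.random((20, 20)) + 20 * np.eye(20)
b = rng.random(20)
x = async_fixed_point(A, b)
print(np.linalg.norm(A @ x - b))                 # residual should be small
```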

On the other hand, non-blocking communication has also been proposed to accelerate iterative solvers in a distributed setting. For example, Hoefler et al. [14] presented a distributed conjugate gradient implementation with non-blocking collective MPI operations for solving linear systems. However, this algorithm still requires synchronization at each CG iteration, so it is very different from our NOMAD algorithm.

4.4 Discussion

We remark that, among the algorithms we have discussed so far, NOMAD is the only distributed-memory algorithm which is both asynchronous and lock-free. Other parallel versions of SGD such as DSGD and DSGD++ are lock-free but not fully asynchronous; therefore the cost of synchronization increases as the number of machines grows [28]. On the other hand, the GraphLab implementation of ALS [17] is asynchronous but not lock-free, and therefore depends on a complex job scheduler to reduce the side effects of using locks.

5. EXPERIMENTS

In this section we evaluate the empirical performance of NOMAD with extensive experiments. For the distributed memory experiments we compare NOMAD with DSGD [12], DSGD++ [25], and CCD++ [26]. We also compare against GraphLab, but the quality of the results produced by GraphLab is significantly worse than that of the other methods, and therefore the plots for this experiment are delegated to Appendix F. For the shared memory experiments we pit NOMAD against FPSGD [28] (which is shown to outperform DSGD in single machine experiments) as well as CCD++. Our experiments are designed to answer the following:

• How does NOMAD scale with the number of cores on a single machine? (Section 5.2)

• How does NOMAD scale as a fixed size dataset is distributed across multiple machines? (Section 5.3)

• How does NOMAD perform on a commodity hardware cluster? (Section 5.4)

• How does NOMAD scale when both the size of the data as well as the number of machines grow? (Section 5.5)

Since the objective function (1) is non-convex, different optimizers will converge to different solutions. Factors which affect the quality of the final solution include 1) the initialization strategy, 2) the sequence in which the ratings are accessed, and 3) the step size decay schedule. It is clearly not feasible to consider the combinatorial effect of all these factors on each algorithm. However, we believe that the overall trend of our results is not affected by these factors.

5.1 Experimental Setup

Publicly available code for FPSGD³ and CCD++⁴ was used in our experiments. For DSGD and DSGD++, which we had to implement ourselves because the code is not publicly available, we closely followed the recommendations of Gemulla et al. [12] and Teflioudi et al. [25], and in some cases made improvements based on our experience. For a fair comparison, all competing algorithms were tuned for optimal performance on our hardware. The code and scripts required for reproducing the experiments are available for download from https://sites.google.com/site/hyokunyun/software. Parameters used in our experiments are summarized in Table 1.

Table 1: Dimensionality parameter k, regularization parameter λ (1), and step-size schedule parameters α, β (11).

Name         k    λ     α        β
Netflix      100  0.05  0.012    0.05
Yahoo Music  100  1.00  0.00075  0.01
Hugewiki     100  0.01  0.001    0

For all experiments except the ones in Section 5.5, we work with three benchmark datasets, namely Netflix, Yahoo Music, and Hugewiki (see Table 2 for more details).

³ http://www.csie.ntu.edu.tw/~cjlin/libmf
⁴ http://www.cs.utexas.edu/~rofuyu/libpmf

Table 2: Dataset Details

Name              Rows      Columns  Non-zeros
Netflix [5]       2649429   17770    99072112
Yahoo Music [10]  1999990   624961   252800275
Hugewiki [2]      50082603  39780    2736496604

The same training and test dataset partition is used consistently for all algorithms in every experiment. Since our goal is to compare optimization algorithms, we do very minimal parameter tuning. For instance, we used the same regularization parameter λ for each dataset as reported by Yu et al. [26] and shown in Table 1; we study the effect of the regularization parameter on the convergence of NOMAD in Appendix A. By default we use k = 100 for the dimension of the latent space; we study how the dimension of the latent space affects the convergence of NOMAD in Appendix B. All algorithms were initialized with the same initial parameters: we set each entry of W and H by independently sampling a uniform random variable in the range $(0, 1/\sqrt{k})$ [26, 28].

We compare solvers in terms of the Root Mean Square Error (RMSE) on the test set, which is defined as

$$\sqrt{\frac{\sum_{(i,j) \in \Omega_{\text{test}}} \left(A_{ij} - \langle w_i, h_j \rangle\right)^2}{|\Omega_{\text{test}}|}},$$

where $\Omega_{\text{test}}$ denotes the ratings in the test set.

All experiments except the ones reported in Section 5.4 are run using the Stampede Cluster at the University of Texas, a Linux cluster where each node is outfitted with 2 Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi Coprocessor (MIC Architecture). For single-machine experiments (Section 5.2) we used nodes in the largemem queue, which are equipped with 1TB of RAM and 32 cores. For all other experiments we used nodes in the normal queue, which are equipped with 32 GB of RAM and 16 cores (only 4 out of the 16 cores were used for computation). Inter-machine communication on this system is handled by MVAPICH2.

For the commodity hardware experiments in Section 5.4, we used m1.xlarge instances of Amazon Web Services, which are equipped with 15GB of RAM and four cores. We utilized all four cores in each machine: NOMAD and DSGD++ use two cores for computation and two cores for network communication, while DSGD and CCD++ use all four cores for both computation and communication. Inter-machine communication on this system is handled by MPICH2.

Since FPSGD uses single precision arithmetic, the experiments in Section 5.2 are performed using single precision arithmetic, while all other experiments use double precision arithmetic. All algorithms are compiled with the Intel C++ compiler, with the exception of the experiments in Section 5.4, where we used gcc, the only compiler toolchain available on the commodity hardware cluster. For ready reference, exceptions to the experimental settings specific to each section are summarized in Table 3.

The convergence speed of stochastic gradient descent methods depends on the choice of the step size schedule. The schedule we used for NOMAD is

$$s_t = \frac{\alpha}{1 + \beta \cdot t^{1.5}}, \qquad (11)$$

where t is the number of SGD updates that were performed on a particular user-item pair (i, j).
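As a small worked example of schedule (11) (the α and β values below are the Netflix settings listed in Table 1 and are used purely for illustration):

```python
def step_size(t, alpha, beta):
    """Step size schedule (11): s_t = alpha / (1 + beta * t**1.5),
    where t counts the SGD updates applied to a given (i, j) pair."""
    return alpha / (1.0 + beta * t ** 1.5)

# e.g., with the Netflix settings of Table 1 (alpha = 0.012, beta = 0.05)
print([round(step_size(t, 0.012, 0.05), 6) for t in (0, 1, 10, 100)])
```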


Table 3: Exceptions to each experiment

Section      Exceptions
Section 5.2  • run on largemem queue (32 cores, 1TB RAM)
             • single precision floating point used
Section 5.4  • run on m1.xlarge (4 cores, 15GB RAM)
             • compiled with gcc
             • MPICH2 for MPI implementation
Section 5.5  • synthetic datasets

DSGD and DSGD++, on the other hand, use an alternative strategy called bold-driver [12]: here the step size is adapted by monitoring the change of the objective function.

5.2 Scaling in Number of Cores

For the first experiment we fixed the number of cores to 30 and compared the performance of NOMAD vs. FPSGD⁵ and CCD++ (Figure 5). On Netflix (left), NOMAD not only converges to a slightly better quality solution (RMSE 0.914 vs. 0.916 for the others), but is also able to reduce the RMSE rapidly right from the beginning. On Yahoo Music (middle), NOMAD converges to a slightly worse solution than FPSGD (RMSE 21.894 vs. 21.853), but as in the case of Netflix the initial convergence is more rapid. On Hugewiki the difference is smaller, but NOMAD still outperforms. The initial speed of CCD++ on Hugewiki is comparable to NOMAD, but the quality of its solution starts to deteriorate in the middle. Note that the performance of CCD++ here is better than what was reported in Zhuang et al. [28], since they used double-precision floating point arithmetic for CCD++. In other experiments (not reported here) we varied the number of cores and found that the relative difference in performance between NOMAD, FPSGD, and CCD++ is very similar to that observed in Figure 5.

For the second experiment we varied the number of cores from 4 to 30 and plot the scaling behavior of NOMAD (Figures 6 and 7). Figure 6 (left) shows how test RMSE changes as a function of the number of updates on Yahoo Music. Interestingly, as we increased the number of cores, the test RMSE decreased faster. We believe this is because when we increase the number of cores, the rating matrix A is partitioned into smaller blocks; recall that we split A into $p \times n$ blocks, where p is the number of parallel workers. Therefore the communication between workers becomes more frequent, and each SGD update is based on fresher information (see also Section 3.2 for a mathematical analysis). This effect was observed more strongly on Yahoo Music than on the others, since Yahoo Music has a much larger number of items (624961 vs. 17770 for Netflix and 39780 for Hugewiki) and therefore more communication is needed to circulate the new information to all workers. Results for the other datasets are provided in Figure 18 in Appendix D.

On the other hand, to assess the efficiency of computation, we define average throughput as the average number of ratings processed per core per second, and plot it for each dataset in Figure 6 (right) while varying the number of cores. If NOMAD exhibits linear scaling in terms of the speed at which it processes ratings, the average throughput should remain constant.⁶

⁵ Since the current implementation of FPSGD in LibMF only reports CPU execution time, we divide this by the number of threads and use it as a proxy for wall clock time.

Figure 6: Left: test RMSE of NOMAD as a function of the number of updates on Yahoo Music when the number of cores is varied. Right: number of updates of NOMAD per core per second as a function of the number of cores.

On Netflix, the average throughput indeed remains almost constant as the number of cores changes. On Yahoo Music and Hugewiki, the throughput decreases to about 50% as the number of cores is increased to 30. We believe this is mainly due to cache locality effects.

Now we study how much speed-up NOMAD can achieve by increasing the number of cores. In Figure 7 we set the y-axis to be test RMSE and the x-axis to be the total CPU time expended, which is given by the number of seconds elapsed multiplied by the number of cores. We plot the convergence curves for cores = 4, 8, 16, and 30. If the curves overlap, then this shows that we achieve linear speed-up as we increase the number of cores. This is indeed the case for Netflix and Hugewiki. In the case of Yahoo Music, we observe that the speed of convergence increases as the number of cores increases. This, we believe, is again due to the decrease in the block size, which leads to faster convergence.

5.3 Scaling as a Fixed Dataset is Distributed Across Workers

In this subsection we use 4 computation threads per machine. For the first experiment, we fix the number of machines to 32 (64 for Hugewiki) and compare the performance of NOMAD with DSGD, DSGD++, and CCD++ (Figure 8). On Netflix and Hugewiki, NOMAD converges much faster than its competitors: not only is the initial convergence faster, but it also discovers a better quality solution. On Yahoo Music, all four methods perform similarly. This is because the cost of network communication relative to the size of the data is much higher for Yahoo Music: while Netflix and Hugewiki have 5575 and 68635 non-zero ratings per item respectively, Yahoo Music has only 404 ratings per item. Therefore, when Yahoo Music is divided equally across 32 machines, each item has only 10 ratings on average per machine. Hence the cost of sending and receiving the item parameter vector $h_j$ of an item j across the network is higher than that of executing SGD updates on the ratings of the item locally stored within the machine, $\Omega_j^{(q)}$. As a consequence, the cost of network communication dominates the overall execution time of all algorithms, and little difference in convergence speed is found between them.

⁶ Note that since we use single-precision floating point arithmetic in this section to match the implementation of FPSGD, the throughput of NOMAD is about 50% higher than in the other experiments.


Figure 5: Comparison of NOMAD, FPSGD, and CCD++ on a single machine with 30 computation cores (left: Netflix, middle: Yahoo Music, right: Hugewiki).

Figure 7: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of cores) when the number of cores is varied (left: Netflix, middle: Yahoo Music, right: Hugewiki).

For the second experiment, we varied the number of machines from 1 to 32 and plot the scaling behavior of NOMAD (Figures 10 and 9). Figure 10 (left) shows how test RMSE decreases as a function of the number of updates on Yahoo Music. Again, if NOMAD scales linearly, the average throughput has to remain constant; here we observe an improvement in convergence speed when 8 or more machines are used. This is again the effect of smaller block sizes, which was discussed in Section 5.2. On Netflix a similar effect was present but less significant; on Hugewiki we did not see any notable difference between configurations (see Figure 19 in Appendix D).

In Figure 10 (right) we plot the average throughput (the number of updates per machine per core per second) as a function of the number of machines. On Yahoo Music the average throughput goes down as we increase the number of machines because, as mentioned above, each item has a small number of ratings. On Hugewiki we observe almost linear scaling, and on Netflix the average throughput even improves as we increase the number of machines; we believe this is because of cache locality effects. As we partition users into smaller and smaller blocks, the probability of a cache miss on the user parameters $w_i$'s within a block decreases, and on Netflix this makes a meaningful difference: indeed, there are only 480189 users in Netflix who have at least one rating. When this is equally divided across 32 machines, each machine contains only 11722 active users on average. Therefore the $w_i$ variables take only 11MB of memory, which is smaller than the L3 cache (20MB) of the machine we used, and this leads to an increase in the number of updates per machine per core per second.

Now we study how much speed-up NOMAD can achieve by increasing the number of machines. In Figure 9 we set the y-axis to be test RMSE and the x-axis to be the number of seconds elapsed multiplied by the total number of cores used in the configuration. Again, all lines will coincide with each other if NOMAD shows linear scaling. On Netflix, with 2 and 4 machines we observe mild slowdown, but with more than 4 machines NOMAD exhibits super-linear scaling. On Yahoo Music we observe super-linear scaling with respect to the speed of a single machine in all configurations, but the highest speedup is seen with 16 machines. On Hugewiki, linear scaling is observed in every configuration.

5.4 Scaling on Commodity Hardware

In this subsection we analyze the scaling behavior of NOMAD on commodity hardware. Using Amazon Web Services (AWS), we set up a computing cluster that consists of 32 machines; each machine is of type m1.xlarge and is equipped with a quad-core Intel Xeon E5430 CPU and 15GB of RAM. Network bandwidth among these machines is reported to be approximately 1Gb/s.⁷

Since NOMAD and DSGD++ dedicate two threads to network communication on each machine, only two cores are available for computation.⁸ In contrast, bulk synchronization algorithms such as DSGD and CCD++, which separate

⁷ http://epamcloud.blogspot.com/2013/03/testing-amazon-ec2-network-speed.html
⁸ Since network communication is not computation-intensive for DSGD++, we used four computation threads instead of two and got better results; thus we report results with four computation threads for DSGD++.


Figure 8: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a HPC cluster (left: Netflix, middle: Yahoo Music, right: Hugewiki).

Figure 9: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a HPC cluster when the number of machines is varied (left: Netflix, middle: Yahoo Music, right: Hugewiki).

computation and communication, can utilize all four cores for computation. In spite of this disadvantage, Figure 11 shows that NOMAD outperforms all other algorithms in this setting as well. In this plot we fixed the number of machines to 32; on Netflix and Hugewiki, NOMAD converges more rapidly to a better solution. Recall that on Yahoo Music all four algorithms performed very similarly on the HPC cluster in Section 5.3. However, on commodity hardware NOMAD outperforms the other algorithms. This shows that the efficiency of network communication plays a very important role in commodity hardware clusters, where communication is relatively slow. On Hugewiki, however, the number of columns is very small compared to the number of ratings, and thus network communication plays a smaller role on this dataset than on the others. Therefore the initial convergence of DSGD is a bit faster than NOMAD, as it uses all four cores for computation while NOMAD uses only two. Still, the overall convergence speed is similar, and NOMAD finds a better quality solution.

As in Section 5.3, we increased the number of machines from 1 to 32 and studied the scaling behavior of NOMAD. The overall trend was identical to what we observed in Figures 10 and 9; due to page constraints, the plots for this experiment can be found in Appendix C.

5.5 Scaling as both Dataset Size and Number of Machines Grow

In the previous sections (Section 5.3 and Section 5.4) we studied the scalability of algorithms by partitioning a fixed amount of data across an increasing number of machines. In real-world applications of collaborative filtering, however, the

Figure 10: Results on the HPC cluster when the number of machines is varied. Left: test RMSE of NOMAD as a function of the number of updates on Netflix and Yahoo Music. Right: number of updates of NOMAD per machine per core per second as a function of the number of machines.

size of the data should grow over time as new users are added to the system. Therefore, to match the increased amount of data with an equivalent amount of physical memory and computational power, the number of machines should increase as well. The aim of this section is to compare the scaling behavior of NOMAD and that of the other algorithms in this realistic scenario.

To simulate such a situation, we generated synthetic datasets whose characteristics resemble real data: the number of ratings for each user and each item is sampled from the corresponding empirical distribution of the Netflix data. As we increase the number of machines from 4 to 32, we fixed the number of items to be the same as that of Netflix (17770) and increased the number of users to be proportional to the


Figure 11: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a commodity hardware cluster (left: Netflix, middle: Yahoo Music, right: Hugewiki).

number of machines (480189 times the number of machines⁹). Therefore the expected number of ratings in each dataset is proportional to the number of machines (99072112 times the number of machines) as well.

Conditioned on the number of ratings for each user and item, the nonzero locations are sampled uniformly at random. Ground-truth user parameters $w_i$'s and item parameters $h_j$'s are generated from a 100-dimensional standard isometric Gaussian distribution, and for each rating $A_{ij}$, Gaussian noise with mean zero and standard deviation 0.1 is added to the "true" rating $\langle w_i, h_j \rangle$.
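A simplified sketch of this generator (not the code used for the experiments): it takes the per-user rating counts as given, samples item indices uniformly at random for each user (the actual generator also matches the per-item counts), and perturbs the true ratings with Gaussian noise of standard deviation 0.1. The toy degree sequences in the usage line are made up.

```python
import numpy as np

def synthetic_ratings(user_degrees, n_items, k=100, noise=0.1, seed=0):
    """Generate (i, j, rating) triples: ground-truth factors are standard Gaussian,
    item choices per user are uniform without replacement, noise is N(0, noise^2)."""
    rng = np.random.default_rng(seed)
    m = len(user_degrees)
    W_true = rng.standard_normal((m, k))
    H_true = rng.standard_normal((n_items, k))
    ratings = []
    for i, d in enumerate(user_degrees):
        for j in rng.choice(n_items, size=d, replace=False):
            ratings.append((i, j, W_true[i] @ H_true[j] + noise * rng.standard_normal()))
    return ratings

# toy usage with made-up per-user rating counts
ratings = synthetic_ratings(user_degrees=[3, 2, 4], n_items=5)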

Figure 12 shows that the comparative advantage of NOMAD over DSGD, DSGD++, and CCD++ increases as we grow the scale of the problem. NOMAD clearly outperforms the other methods in all configurations. DSGD++ is very competitive at the small scale, but as the size of the problem grows, NOMAD shows better scaling behavior.

6. CONCLUSION AND FUTURE WORK

From our experimental study we conclude that:

• On a single machine, NOMAD shows near-linear scaling up to 30 threads.

• When a fixed size dataset is distributed across multiple machines, NOMAD shows near-linear scaling up to 32 machines.

• Both in shared-memory and distributed-memory settings, NOMAD exhibits superior performance against state-of-the-art competitors; on a commodity hardware cluster the comparative advantage is even more conspicuous.

• When both the size of the data as well as the number of machines grow, the scaling behavior of NOMAD is much nicer than that of its competitors.

Although we only discussed the matrix completion problem in this paper, it is worth noting that the idea of NOMAD is more widely applicable. Specifically, the ideas discussed in this paper can be easily adapted as long as the objective function can be written as

$$f(W, H) = \sum_{(i,j) \in \Omega} f_{ij}(w_i, h_j).$$
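For such a separable objective, an SGD update on a sampled pair $(i, j) \in \Omega$ touches only $w_i$ and $h_j$, which is exactly the structure that the ownership-passing scheme exploits. Writing $s_t$ for the step size, the generic updates read

$$w_i \leftarrow w_i - s_t \, \nabla_{w_i} f_{ij}(w_i, h_j), \qquad h_j \leftarrow h_j - s_t \, \nabla_{h_j} f_{ij}(w_i, h_j).$$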

As part of our ongoing work, we are investigating ways to rewrite Support Vector Machines (SVMs) and binary logistic regression as saddle point problems which have the above structure.

⁹ 480189 is the number of users in Netflix who have at least one rating.

Inference in Latent Dirichlet Allocation (LDA) using a collapsed Gibbs sampler has a similar structure to the stochastic gradient descent updates for matrix factorization. An additional complication in LDA is that the variables need to be normalized. We are also investigating how the NOMAD framework can be used for LDA.

7. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their constructive comments. We thank the Texas Advanced Computing Center at the University of Texas and the Research Computing group at Purdue University for providing infrastructure and timely support for our experiments. Computing experiments on commodity hardware were made possible by an AWS in Education Machine Learning Research Grant Award. This material is partially based upon work supported by the National Science Foundation under grant nos. IIS-1219015 and CCF-1117055.

References

[1] Apache Hadoop, 2009. http://hadoop.apache.org/core.

[2] GraphLab datasets, 2013. http://graphlab.org/downloads/datasets.

[3] Intel Thread Building Blocks, 2013. https://www.threadingbuildingblocks.org.

[4] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.

[5] R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations, 9(2):75–79, 2007.

[6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

[7] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, pages 351–368, 2011.

[8] D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2:199–222, 1969.


Figure 12: Comparison of algorithms when both the dataset size and the number of machines grow. Left: 4 machines, middle: 16 machines, right: 32 machines.

[9] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[10] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! music dataset and KDD-Cup'11. Journal of Machine Learning Research - Proceedings Track, 18:8–18, 2012.

[11] A. Frommer and D. B. Szyld. On asynchronous iterations. Journal of Computational and Applied Mathematics, 123:201–216, 2000.

[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.

[13] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, pages 17–30, 2012.

[14] T. Hoefler, P. Gottschling, W. Rehm, and A. Lumsdaine. Optimizing a conjugate gradient solver with non-blocking operators. Parallel Computing, 2007.

[15] C.-J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the conference on Knowledge Discovery and Data Mining, pages 1064–1072, August 2011.

[16] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences. Springer, New York, 1978.

[17] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the International Conference on Very Large Data Bases, pages 716–727, 2012.

[18] B. Recht and C. Re. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, June 2013.

[19] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the conference on Advances in Neural Information Processing Systems, pages 693–701, 2011.

[20] P. Richtarik and M. Takac. Distributed coordinate descent method for learning with big data, 2013. URL http://arxiv.org/abs/1310.2059.

[21] H. E. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[22] S. Shalev-Schwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the International Conference on Machine Learning, pages 928–935, 2008.

[23] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Proceedings of the International Conference on Very Large Data Bases, pages 703–710, 2010.

[24] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the International Conference on the World Wide Web, pages 607–614, 2011.

[25] C. Teflioudi, F. Makari, and R. Gemulla. Distributed matrix completion. In Proceedings of the International Conference on Data Mining, pages 655–664, 2012.

[26] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774, 2012.

[27] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proceedings of the conference on Algorithmic Aspects in Information and Management, pages 337–348, 2008.

[28] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proceedings of the ACM conference on Recommender systems, pages 249–256, 2013.


APPENDIX

A. EFFECT OF THE REGULARIZATION PARAMETER

In this subsection we study the convergence behavior of NOMAD as we change the regularization parameter λ (Figure 13). Note that on the Netflix data (left), for non-optimal choices of the regularization parameter the test RMSE increases from the initial solution as the model overfits or underfits the training data. While NOMAD reliably converges in all cases on Netflix, the convergence is notably faster with higher values of λ; this is expected because regularization smooths the objective function and makes the optimization problem easier to solve. On the other datasets the speed of convergence was not very sensitive to the selection of the regularization parameter.

B. EFFECT OF THE LATENT DIMENSION

In this subsection we study the convergence behavior of NOMAD as we change the dimensionality parameter k (Figure 14). In general, the convergence is faster for smaller values of k, as the computational cost of the SGD updates (9) and (10) is linear in k. On the other hand, the model gets richer with higher values of k: as its parameter space expands, it becomes capable of picking up weaker signals in the data, with the risk of overfitting. This is observed in Figure 14 with Netflix (left) and Yahoo Music (middle). On Hugewiki, however, small values of k were sufficient to fit the training data, and test RMSE suffers from overfitting with higher values of k. Nonetheless, NOMAD reliably converged in all cases.

C. SCALING ON COMMODITY HARDWARE

In this section we augment Section 5.4 by providing the actual plots of the experiment. We increase the number of machines from 1 to 32 and plot how the convergence of NOMAD is affected by the number of machines. As in Section 5.4, we used m1.xlarge machines from Amazon Web Services (AWS), each of which has a quad-core Intel Xeon E5430 CPU and 15GB of RAM.

The overall pattern is identical to what was found in Figures 19, 10, and 9 of Section 5.3. Figure 15 shows how the test RMSE decreases as a function of the number of updates. As in Figure 19, the speed of convergence is faster with a larger number of machines, as the updated information is exchanged more frequently. Figure 16 shows the number of updates performed per second on each computation core of each machine. NOMAD exhibits linear scaling on Netflix and Hugewiki, but slows down on Yahoo Music due to the extreme sparsity of the data. Figure 17 compares the convergence speed of different settings when the same amount of computational power is given to each; on every dataset we observe linear to super-linear scaling up to 32 machines.

D. TEST RMSE AS A FUNCTION OF THE NUMBER OF UPDATES IN HPC CLUSTER

In this section we plot the test RMSE as a function of the number of updates on the HPC cluster, which was not included in the main text due to page constraints. Figure 18 shows single-machine multi-threaded experiments, and Figure 19 shows multi-machine distributed memory experiments.

E. COMPARISON OF ALGORITHMS FOR DIFFERENT VALUES OF THE REGULARIZATION PARAMETER

In this section we augment the experiments in Section 5.3 by comparing the performance of NOMAD, CCD++, and DSGD for different values of the regularization parameter λ. Figure 20 shows the results of the experiment. As NOMAD and DSGD are both stochastic gradient descent methods, they behave similarly to each other when the regularization parameter is changed. On the other hand, CCD++, which decreases the objective function more greedily, behaves differently.

For small values of λ, CCD++ seems to overfit the model due to its greedy strategy; it generally converges to a worse solution than the others. For high values of λ, however, the strategy of CCD++ is advantageous and it shows rapid initial convergence. Note that in all cases NOMAD is competitive compared to both CCD++ and DSGD.

F. COMPARISON WITH GRAPHLAB

PowerGraph 2.2 [17], the latest version of GraphLab,¹⁰ was used in our experiments. Since the Intel compiler is unable to compile GraphLab, we compiled it with gcc. The rest of the experimental setting is the same as in Section 5.1.

GraphLab provides a number of optimization algorithms for matrix completion. However, we found that the Stochastic Gradient Descent (SGD) implementation of GraphLab does not converge. According to the GraphLab developers, this is because the abstraction currently provided by GraphLab is not suitable for SGD.¹¹ Therefore, we use the Alternating Least Squares (ALS) algorithm for minimizing the objective function (1).

Although each machine in our HPC cluster is equipped with 32 GB of RAM and we distribute the work across 32 machines in the multi-machine experiments, we had to tune the nfibers parameter to avoid out-of-memory problems. In spite of tuning many different parameters, we were unable to run GraphLab on the Hugewiki data in any setting. For the other datasets we tried both the synchronous and asynchronous engines of GraphLab and report the best of the two settings.

Figure 21 shows the results of single-machine multi-threaded experiments, while Figure 22 and Figure 23 show multi-machine experiments on the HPC cluster and the commodity cluster, respectively. Clearly, NOMAD not only converges orders of magnitude faster than GraphLab in every setting, but the quality of its solution is also significantly better.

Compared to the multi-machine setting with a large number of machines (32) but a small number of cores (4) per machine, GraphLab converges faster when it is run in a single-machine setting with a large number of cores (30). We conjecture that this is because locking and unlocking a variable over a network is expensive. In contrast, since NOMAD is a lock-free algorithm, it scales well in both settings.

Although GraphLab biassgd is based on a model different from (1), for completeness we provide comparisons with it

¹⁰ Available for download from https://github.com/graphlab-code/graphlab.
¹¹ GraphLab provides a different algorithm named biassgd. However, it is based on a model different from (1).


Figure 13: Convergence behavior of NOMAD when the regularization parameter λ is varied (left: Netflix, middle: Yahoo Music, right: Hugewiki).

Figure 14: Convergence behavior of NOMAD when the latent dimension k is varied (left: Netflix, middle: Yahoo Music, right: Hugewiki).

Figure 15: Test RMSE of NOMAD as a function of the number of updates on a commodity hardware cluster when the number of machines is varied (left: Netflix, middle: Yahoo Music, right: Hugewiki).

Figure 16: Number of updates of NOMAD per machine per core per second as a function of the number of machines on a commodity hardware cluster (left: Netflix, middle: Yahoo Music, right: Hugewiki).


Figure 17: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a commodity hardware cluster when the number of machines is varied (left: Netflix, middle: Yahoo Music, right: Hugewiki).

Figure 18: Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied (left: Netflix, middle: Yahoo Music, right: Hugewiki).

Figure 19: Test RMSE of NOMAD as a function of the number of updates on a HPC cluster when the number of machines is varied (left: Netflix, middle: Yahoo Music, right: Hugewiki).


Figure 20: Comparison of NOMAD, DSGD, and CCD++ on a HPC cluster when the regularization parameter λ is varied (columns: Netflix, Yahoo Music, Hugewiki). The value of λ increases from top to bottom.


on the commodity hardware cluster. Unfortunately, GraphLab biassgd crashed when we ran it on more than 16 machines, so we had to run it on only 16 machines and assumed that GraphLab will scale linearly up to 32 machines in order to generate the plots in Figure 23. Again, NOMAD was orders of magnitude faster than GraphLab and converges to a better solution.


Figure 21: Comparison of NOMAD and GraphLab on a single machine with 30 computation cores (left: Netflix, right: Yahoo Music).

Figure 22: Comparison of NOMAD and GraphLab on a HPC cluster (left: Netflix, right: Yahoo Music).

Figure 23: Comparison of NOMAD and GraphLab on a commodity hardware cluster (left: Netflix, right: Yahoo Music).


  • Introduction
  • Background
    • Alternating Least Squares
    • Coordinate Descent
    • Stochastic Gradient Descent
      • NOMAD
        • Description
        • Complexity Analysis
        • Dynamic Load Balancing
        • Hybrid Architecture
        • Implementation Details
          • Related Work
            • Map-Reduce and Friends
            • Asynchronous Algorithms
            • Numerical Linear Algebra
            • Discussion
              • Experiments
                • Experimental Setup
                • Scaling in Number of Cores
                • Scaling as a Fixed Dataset is Distributed Across Workers
                • Scaling on Commodity Hardware
                • Scaling as both Dataset Size and Number of Machines Grows
                  • Conclusion and Future Work
                  • Acknowledgements
                  • Effect of the Regularization Parameter
                  • Effect of the Latent Dimension
                  • Scaling on Commodity Hardware
                  • Test RMSE as a function of the number of updates in HPC cluster
                  • Comparison of Algorithms for Different Values of the Regularization Parameter
                  • Comparison with GraphLab
Page 6: NOMAD: Non-locking, stOchastic Multi-machine algorithm ...We develop an e cient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchas-tic Multi-machine

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

(a) DSGD

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

(b) DSGD++

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

(c) FPSGD

x

x

x

x

xx

x

x

x

x

x

x

xx

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

x

(d) NOMAD

Figure 4 Comparison of data partitioning schemes betweenalgorithms Example active area of stochastic gradient sam-pling is marked as gray

worker A pleasant side effect of such a fine grained parti-tioning coupled with the lock free nature of updates is thatone does not require sophisticated scheduling algorithms toachieve good performance Consequently NOMAD outper-forms DSGD DSGD++ and FPSGD

42 Asynchronous AlgorithmsThere is growing interest in designing machine learning al-

gorithms that do not perform bulk synchronization See forinstance the randomized (block) coordinate descent meth-ods of Richtarik and Takac [20] and the Hogwild algorithmof Recht et al [19] A relatively new approach to asyn-chronous parallelism is to use a so-called parameter serverA parameter server is either a single machine or a distributedset of machines which caches the current values of the pa-rameters Workers store local copies of the parameters andperform updates on them and periodically synchronize theirlocal copies with the parameter server The parameter serverreceives updates from all workers aggregates them andcommunicates them back to the workers The earliest workon a parameter server that we are aware of is due to Smolaand Narayanamurthy [23] who propose using a parameterserver for collapsed Gibbs sampling in Latent Dirichlet Al-location PowerGraph [13] upon which the latest version ofthe GraphLab toolkit is based is also essentially based onthe idea of a parameter server However the difference incase of PowerGraph is that the responsibility of parametersis distributed across multiple machines but at the addedexpense of synchronizing the copies

Very roughly speaking the asynchronously parallel ver-sion of the ALS algorithm in GraphLab works as follows wi

and hj variables are distributed across multiple machinesand whenever wi is being updated with equation (3) thevalues of hj rsquos for j isin Ωi are retrieved across the networkand read-locked until the update is finished GraphLab pro-vides functionality such as network communication and adistributed locking mechanism to implement this However

frequently acquiring read-locks over the network can be ex-pensive In particular a popular user who has rated manyitems will require read locks on a large number of items andthis will lead to vast amount of communication and delays inupdates on those items GraphLab provides a complex jobscheduler which attempts to minimize this cost but then theefficiency of parallelization depends on the difficulty of thescheduling problem and the effectiveness of the scheduler

In our empirical evaluation NOMAD performs significantlybetter than GraphLab The reasons are not hard to seeFirst because of the lock free nature of NOMAD we com-pletely avoid acquiring expensive network locks Secondwe use SGD which allows us to exploit finer grained paral-lelism as compared to ALS and also leads to faster conver-gence In fact the GraphLab framework is not well suitedfor SGD (personal communication with the developers ofGraphLab) Finally because of the finer grained data par-titioning scheme used in NOMAD unlike GraphLab whoseperformance heavily depends on the underlying schedulingalgorithms we do not require a complicated scheduling mech-anism

43 Numerical Linear AlgebraThe concepts of asynchronous and non-blocking updates

have also been studied in numerical linear algebra To avoidthe load balancing problem and to reduce processor idletime asynchronous numerical methods were first proposedover four decades ago by Chazan and Miranker [8] Given anoperator H Rm rarr Rm to find the fixed point solution xlowast

such that H(xlowast) = xlowast a standard Gauss-Seidel-type pro-cedure performs the update xi = (H(x))i sequentially (orrandomly) Using the asynchronous procedure each com-putational node asynchronously conducts updates on eachvariable (or a subset) xnew

i = (H(x))i and then overwritesxi in common memory by xnew

i Theory and applications ofthis asynchronous method have been widely studied (see theliterature review of Frommer and Szyld [11] and the semi-nal textbook by Bertsekas and Tsitsiklis [6]) The conceptof this asynchronous fixed-point update is very closely re-lated to the Hogwild algorithm of Recht et al [19] or theso-called Asynchronous SGD (ASGD) method proposed byTeflioudi et al [25] Unfortunately such algorithms are non-serializable that is there may not exist an equivalent up-date ordering in a serial implementation In contrast ourNOMAD algorithm is not only asynchronous but also serial-izable and therefore achieves faster convergence in practice

On the other hand non-blocking communication has alsobeen proposed to accelerate iterative solvers in a distributedsetting For example Hoefler et al [14] presented a dis-tributed conjugate gradient implementation with non-blockingcollective MPI operations for solving linear systems How-ever this algorithm still requires synchronization at each CGiteration so it is very different from our NOMAD algorithm

44 DiscussionWe remark that among algorithms we have discussed so

far NOMAD is the only distributed-memory algorithm whichis both asynchronous and lock-free Other parallel versionsof SGD such as DSGD and DSGD++ are lock-free but notfully asynchronous therefore the cost of synchronizationincreases as the number of machines grows [28] On theother hand the GraphLab implementation of ALS [17] isasynchronous but not lock-free and therefore depends on

6

a complex job scheduler to reduce the side-effect of usinglocks

5 EXPERIMENTSIn this section we evaluate the empirical performance of

NOMAD with extensive experiments For the distributedmemory experiments we compare NOMAD with DSGD [12]DSGD++ [25] and CCD++ [26] We also compare againstGraphLab but the quality of results produced by GraphLabare significantly worse than the other methods and thereforethe plots for this experiment are delegated to Appendix FFor the shared memory experiments we pitch NOMAD againstFPSGD [28] (which is shown to outperform DSGD in sin-gle machine experiments) as well as CCD++ Our experi-ments are designed to answer the following

• How does NOMAD scale with the number of cores on a single machine? (Section 5.2)

• How does NOMAD scale as a fixed size dataset is distributed across multiple machines? (Section 5.3)

• How does NOMAD perform on a commodity hardware cluster? (Section 5.4)

• How does NOMAD scale when both the size of the data as well as the number of machines grow? (Section 5.5)

Since the objective function (1) is non-convex, different optimizers will converge to different solutions. Factors which affect the quality of the final solution include: 1) the initialization strategy, 2) the sequence in which the ratings are accessed, and 3) the step size decay schedule. It is clearly not feasible to consider the combinatorial effect of all these factors on each algorithm. However, we believe that the overall trend of our results is not affected by these factors.

5.1 Experimental Setup

Publicly available code for FPSGD3 and CCD++4 was used in our experiments. For DSGD and DSGD++, which we had to implement ourselves because the code is not publicly available, we closely followed the recommendations of Gemulla et al. [12] and Teflioudi et al. [25], and in some cases made improvements based on our experience. For a fair comparison, all competing algorithms were tuned for optimal performance on our hardware. The code and scripts required for reproducing the experiments are readily available for download from https://sites.google.com/site/hyokunyun/software. Parameters used in our experiments are summarized in Table 1.

Table 1: Dimensionality parameter k, regularization parameter λ (1), and step-size schedule parameters α, β (11).

Name          k    λ     α        β
Netflix       100  0.05  0.012    0.05
Yahoo! Music  100  1.00  0.00075  0.01
Hugewiki      100  0.01  0.001    0

For all experiments except the ones in Section 5.5, we work with three benchmark datasets, namely Netflix, Yahoo! Music, and Hugewiki (see Table 2 for more details).

3 http://www.csie.ntu.edu.tw/~cjlin/libmf
4 http://www.cs.utexas.edu/~rofuyu/libpmf

Table 2: Dataset details.

Name               Rows        Columns  Non-zeros
Netflix [5]        2,649,429   17,770   99,072,112
Yahoo! Music [10]  1,999,990   624,961  252,800,275
Hugewiki [2]       50,082,603  39,780   2,736,496,604

The same training and test dataset partition is used consistently for all algorithms in every experiment. Since our goal is to compare optimization algorithms, we do very minimal parameter tuning. For instance, we used the same regularization parameter λ for each dataset as reported by Yu et al. [26] and shown in Table 1; we study the effect of the regularization parameter on the convergence of NOMAD in Appendix A. By default we use k = 100 for the dimension of the latent space; we study how the dimension of the latent space affects convergence of NOMAD in Appendix B. All algorithms were initialized with the same initial parameters: we set each entry of W and H by independently sampling a uniformly random variable in the range (0, 1/√k) [26, 28].

We compare solvers in terms of Root Mean Square Error (RMSE) on the test set, which is defined as

\[ \text{RMSE} = \sqrt{ \frac{\sum_{(i,j) \in \Omega_{\text{test}}} \left( A_{ij} - \langle w_i, h_j \rangle \right)^2}{|\Omega_{\text{test}}|} }, \]

where Ω_test denotes the ratings in the test set.
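For concreteness, the two details just described, i.e. initializing every entry of W and H uniformly in (0, 1/√k) and scoring solvers by test RMSE, can be sketched as below. The Rating struct and the flat row-major storage are illustrative assumptions, not the data structures of any of the compared implementations.

// Sketch of the initialization and the evaluation metric used in Section 5.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

struct Rating { std::size_t i, j; double value; };          // (user, item, A_ij)

std::vector<double> init_factor(std::size_t rows, std::size_t k, std::mt19937& gen) {
  std::uniform_real_distribution<double> unif(0.0, 1.0 / std::sqrt(static_cast<double>(k)));
  std::vector<double> f(rows * k);                           // row-major: f[i * k + r]
  for (auto& v : f) v = unif(gen);                           // entries ~ U(0, 1/sqrt(k))
  return f;
}

double test_rmse(const std::vector<Rating>& test, const std::vector<double>& W,
                 const std::vector<double>& H, std::size_t k) {
  double sq = 0.0;
  for (const Rating& r : test) {
    double pred = 0.0;
    for (std::size_t f = 0; f < k; ++f) pred += W[r.i * k + f] * H[r.j * k + f];  // <w_i, h_j>
    sq += (r.value - pred) * (r.value - pred);
  }
  return std::sqrt(sq / static_cast<double>(test.size()));
}

int main() {
  const std::size_t m = 1000, n = 500, k = 100;              // toy sizes, not a benchmark dataset
  std::mt19937 gen(42);
  auto W = init_factor(m, k, gen), H = init_factor(n, k, gen);
  std::vector<Rating> test = {{0, 0, 3.0}, {1, 2, 4.0}};     // toy held-out ratings
  std::printf("test RMSE = %f\n", test_rmse(test, W, H, k));
  return 0;
}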

All experiments except the ones reported in Section 5.4 are run using the Stampede Cluster at the University of Texas, a Linux cluster where each node is outfitted with two Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi Coprocessor (MIC Architecture). For single-machine experiments (Section 5.2) we used nodes in the largemem queue, which are equipped with 1TB of RAM and 32 cores. For all other experiments we used nodes in the normal queue, which are equipped with 32 GB of RAM and 16 cores (only 4 out of the 16 cores were used for computation). Inter-machine communication on this system is handled by MVAPICH2.

For the commodity hardware experiments in Section 5.4, we used m1.xlarge instances of Amazon Web Services, which are equipped with 15GB of RAM and four cores. We utilized all four cores in each machine: NOMAD and DSGD++ use two cores for computation and two cores for network communication, while DSGD and CCD++ use all four cores for both computation and communication. Inter-machine communication on this system is handled by MPICH2.

Since FPSGD uses single precision arithmetic, the experiments in Section 5.2 are performed using single precision arithmetic, while all other experiments use double precision arithmetic. All algorithms are compiled with the Intel C++ compiler, with the exception of the experiments in Section 5.4, where we used gcc, which is the only compiler toolchain available on the commodity hardware cluster. For ready reference, exceptions to the experimental settings specific to each section are summarized in Table 3.

The convergence speed of stochastic gradient descent methods depends on the choice of the step size schedule. The schedule we used for NOMAD is

\[ s_t = \frac{\alpha}{1 + \beta \cdot t^{1.5}}, \qquad (11) \]

where t is the number of SGD updates that were performed on a particular user-item pair (i, j).


Table 3: Exceptions to each experiment.

Section      Exception
Section 5.2  • run on largemem queue (32 cores, 1TB RAM)
             • single precision floating point used
Section 5.4  • run on m1.xlarge (4 cores, 15GB RAM)
             • compiled with gcc
             • MPICH2 for MPI implementation
Section 5.5  • synthetic datasets

DSGD and DSGD++, on the other hand, use an alternative strategy called bold-driver [12]: here the step size is adapted by monitoring the change of the objective function.
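As a concrete illustration, the two step-size policies just described can be sketched as follows. The bold-driver constants (1.05 and 0.5) are illustrative assumptions of one common form of the heuristic, not the exact values used by DSGD or DSGD++.

// Sketch of the step-size policies discussed above.
#include <cmath>
#include <cstdio>

// Schedule (11): s_t = alpha / (1 + beta * t^{1.5}), where t counts the SGD
// updates already applied to the (user, item) pair being updated.
double nomad_step_size(double alpha, double beta, double t) {
  return alpha / (1.0 + beta * std::pow(t, 1.5));
}

// Bold-driver heuristic: adapt a global step size once per epoch from the
// change in the objective (grow slowly while it decreases, shrink sharply
// when it increases).  The constants here are illustrative only.
double bold_driver_step(double step, double prev_obj, double curr_obj) {
  return (curr_obj < prev_obj) ? step * 1.05 : step * 0.5;
}

int main() {
  // Netflix parameters from Table 1: alpha = 0.012, beta = 0.05.
  for (int t = 0; t <= 4; ++t)
    std::printf("t=%d  s_t=%g\n", t, nomad_step_size(0.012, 0.05, t));
  return 0;
}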

5.2 Scaling in Number of Cores

For the first experiment we fixed the number of cores to 30 and compared the performance of NOMAD against FPSGD5 and CCD++ (Figure 5). On Netflix (left), NOMAD not only converges to a slightly better quality solution (RMSE 0.914 vs. 0.916 for the others), but is also able to reduce the RMSE rapidly right from the beginning. On Yahoo! Music (middle), NOMAD converges to a slightly worse solution than FPSGD (RMSE 21.894 vs. 21.853), but as in the case of Netflix the initial convergence is more rapid. On Hugewiki the difference is smaller, but NOMAD still outperforms the others. The initial speed of CCD++ on Hugewiki is comparable to NOMAD, but the quality of its solution starts to deteriorate in the middle. Note that the performance of CCD++ here is better than what was reported in Zhuang et al. [28], since they used double-precision floating point arithmetic for CCD++. In other experiments (not reported here) we varied the number of cores and found that the relative differences in performance between NOMAD, FPSGD, and CCD++ are very similar to those observed in Figure 5.

For the second experiment we varied the number of cores from 4 to 30 and plot the scaling behavior of NOMAD (Figures 6 and 7). Figure 6 (left) shows how test RMSE changes as a function of the number of updates on Yahoo! Music. Interestingly, as we increased the number of cores, the test RMSE decreased faster. We believe this is because when we increase the number of cores, the rating matrix A is partitioned into smaller blocks; recall that we split A into p × n blocks, where p is the number of parallel workers. Therefore the communication between workers becomes more frequent, and each SGD update is based on fresher information (see also Section 3.2 for a mathematical analysis). This effect was more strongly observed on Yahoo! Music than on the others, since Yahoo! Music has a much larger number of items (624,961 vs. 17,770 for Netflix and 39,780 for Hugewiki), and therefore more communication is needed to circulate the new information to all workers. Results for the other datasets are provided in Figure 18 in Appendix D.

On the other hand, to assess the efficiency of computation, we define average throughput as the average number of ratings processed per core per second, and plot it for each dataset in Figure 6 (right) while varying the number of cores. If NOMAD exhibits linear scaling in terms of the speed at which it processes ratings, the average throughput should remain constant.6

5 Since the current implementation of FPSGD in LibMF only reports CPU execution time, we divide this by the number of threads and use it as a proxy for wall clock time.

Figure 6: Left: Test RMSE of NOMAD as a function of the number of updates on Yahoo! Music when the number of cores is varied. Right: Number of updates of NOMAD per core per second as a function of the number of cores (Netflix, Yahoo! Music, and Hugewiki).

On Netflix, the average throughput indeed remains almost constant as the number of cores changes. On Yahoo! Music and Hugewiki, the throughput decreases to about 50% as the number of cores is increased to 30. We believe this is mainly due to cache locality effects.

Now we study how much speed-up NOMAD can achieve by increasing the number of cores. In Figure 7 we set the y-axis to be test RMSE and the x-axis to be the total CPU time expended, which is given by the number of seconds elapsed multiplied by the number of cores. We plot the convergence curves for cores = 4, 8, 16, and 30. If the curves overlap, then this shows that we achieve linear speed-up as we increase the number of cores. This is indeed the case for Netflix and Hugewiki. In the case of Yahoo! Music, we observe that the speed of convergence increases as the number of cores increases. This, we believe, is again due to the decrease in the block size, which leads to faster convergence.

5.3 Scaling as a Fixed Dataset is Distributed Across Workers

In this subsection we use 4 computation threads per machine. For the first experiment we fix the number of machines to 32 (64 for Hugewiki) and compare the performance of NOMAD with DSGD, DSGD++, and CCD++ (Figure 8). On Netflix and Hugewiki, NOMAD converges much faster than its competitors: not only is initial convergence faster, but it also discovers a better quality solution. On Yahoo! Music, all four methods perform similarly. This is because the cost of network communication relative to the size of the data is much higher for Yahoo! Music: while Netflix and Hugewiki have 5,575 and 68,635 non-zero ratings per item respectively, Yahoo! Music has only 404 ratings per item. Therefore, when Yahoo! Music is divided equally across 32 machines, each item has only about 10 ratings on average per machine. Hence the cost of sending and receiving the item parameter vector h_j for one item j across the network is higher than that of executing SGD updates on the ratings of the item locally stored within the machine, Ω_j^(q). As a consequence, the cost of network communication dominates the overall execution time of all algorithms, and little difference in convergence speed is found between them.

6 Note that since we use single-precision floating point arithmetic in this section to match the implementation of FPSGD, the throughput of NOMAD is about 50% higher than in the other experiments.


Figure 5: Comparison of NOMAD, FPSGD, and CCD++ on a single machine with 30 computation cores (panels: Netflix, Yahoo! Music, and Hugewiki; test RMSE vs. seconds).

Figure 7: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of cores) when the number of cores is varied (panels: Netflix, Yahoo! Music, and Hugewiki).

For the second experiment we varied the number of machines from 1 to 32 and plot the scaling behavior of NOMAD (Figures 10 and 9). Figure 10 (left) shows how test RMSE decreases as a function of the number of updates on Yahoo! Music. Again, if NOMAD scales linearly, the average throughput has to remain constant; here we observe an improvement in convergence speed when 8 or more machines are used. This is again the effect of smaller block sizes, which was discussed in Section 5.2. On Netflix a similar effect was present but less significant; on Hugewiki we did not see any notable difference between configurations (see Figure 19 in Appendix D).

In Figure 10 (right) we plot the average throughput (the number of updates per machine per core per second) as a function of the number of machines. On Yahoo! Music the average throughput goes down as we increase the number of machines because, as mentioned above, each item has a small number of ratings. On Hugewiki we observe almost linear scaling, and on Netflix the average throughput even improves as we increase the number of machines; we believe this is because of cache locality effects. As we partition users into smaller and smaller blocks, the probability of a cache miss on the user parameters w_i within a block decreases, and on Netflix this makes a meaningful difference: indeed, there are only 480,189 users in Netflix who have at least one rating. When these are equally divided into 32 machines, each machine contains only 11,722 active users on average. Therefore the w_i variables take only 11MB of memory, which is smaller than the L3 cache (20MB) of the machines we used, and this leads to an increase in the number of updates per machine per core per second.

Now we study how much speed-up NOMAD can achieve by increasing the number of machines. In Figure 9 we set the y-axis to be test RMSE and the x-axis to be the number of seconds elapsed multiplied by the total number of cores used in the configuration. Again, all lines will coincide with each other if NOMAD shows linear scaling. On Netflix, with 2 and 4 machines we observe a mild slowdown, but with more than 4 machines NOMAD exhibits super-linear scaling. On Yahoo! Music we observe super-linear scaling with respect to the speed of a single machine on all configurations, but the highest speedup is seen with 16 machines. On Hugewiki, linear scaling is observed in every configuration.

5.4 Scaling on Commodity Hardware

In this subsection we analyze the scaling behavior of NOMAD on commodity hardware. Using Amazon Web Services (AWS), we set up a computing cluster that consists of 32 machines; each machine is of type m1.xlarge and is equipped with a quad-core Intel Xeon E5430 CPU and 15GB of RAM. Network bandwidth among these machines is reported to be approximately 1Gb/s.7

Since NOMAD and DSGD++ dedicate two threads to network communication on each machine, only two cores are available for computation.8 In contrast, bulk synchronization algorithms such as DSGD and CCD++, which separate

7 http://epamcloud.blogspot.com/2013/03/testing-amazon-ec2-network-speed.html
8 Since network communication is not computation-intensive, for DSGD++ we used four computation threads instead of two and got better results; thus we report results with four computation threads for DSGD++.
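For concreteness, the per-machine thread layout described above for NOMAD and DSGD++ on the four-core m1.xlarge nodes (two SGD threads plus two dedicated communication threads) can be sketched as below; the loop bodies are placeholders, not the actual NOMAD or DSGD++ implementation.

// Sketch of the per-machine thread layout: two compute threads and two
// communication threads sharing the four cores of an m1.xlarge node.
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

std::atomic<bool> running{true};

void do_sgd_updates(int worker_id) {           // placeholder compute loop
  while (running.load()) { /* pop an item column, run SGD on its local ratings */ }
}

void pump_network(int channel_id) {            // placeholder communication loop
  while (running.load()) { /* send and receive item parameter vectors h_j */ }
}

int main() {
  std::vector<std::thread> threads;
  for (int c = 0; c < 2; ++c) threads.emplace_back(do_sgd_updates, c);  // 2 cores for computation
  for (int c = 0; c < 2; ++c) threads.emplace_back(pump_network, c);    // 2 cores for communication
  std::this_thread::sleep_for(std::chrono::milliseconds(10));           // run briefly for illustration
  running.store(false);
  for (auto& t : threads) t.join();
  return 0;
}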


Figure 8: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a HPC cluster (panels: Netflix and Yahoo! Music with 32 machines, Hugewiki with 64 machines; test RMSE vs. seconds).

Figure 9: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a HPC cluster when the number of machines is varied (panels: Netflix, Yahoo! Music, and Hugewiki).

computation and communication, can utilize all four cores for computation. In spite of this disadvantage, Figure 11 shows that NOMAD outperforms all other algorithms in this setting as well. In this plot we fixed the number of machines to 32; on Netflix and Hugewiki, NOMAD converges more rapidly to a better solution. Recall that on Yahoo! Music all four algorithms performed very similarly on the HPC cluster in Section 5.3. However, on commodity hardware NOMAD outperforms the other algorithms. This shows that the efficiency of network communication plays a very important role in commodity hardware clusters, where the communication is relatively slow. On Hugewiki, however, the number of columns is very small compared to the number of ratings, and thus network communication plays a smaller role in this dataset compared to the others. Therefore the initial convergence of DSGD is a bit faster than NOMAD, as it uses all four cores for computation while NOMAD uses only two. Still, the overall convergence speed is similar, and NOMAD finds a better quality solution.

As in Section 5.3, we increased the number of machines from 1 to 32 and studied the scaling behavior of NOMAD. The overall trend was identical to what we observed in Figures 10 and 9; due to page constraints, the plots for this experiment can be found in Appendix C.

5.5 Scaling as both Dataset Size and Number of Machines Grows

In previous sections (Section 5.3 and Section 5.4) we studied the scalability of algorithms by partitioning a fixed amount of data across an increasing number of machines. In real-world applications of collaborative filtering, however, the

Figure 10: Results on the HPC cluster when the number of machines is varied. Left: Test RMSE of NOMAD as a function of the number of updates on Netflix and Yahoo! Music. Right: Number of updates of NOMAD per machine per core per second as a function of the number of machines.

size of the data should grow over time as new users are added to the system. Therefore, to match the increased amount of data with an equivalent amount of physical memory and computational power, the number of machines should increase as well. The aim of this section is to compare the scaling behavior of NOMAD and that of other algorithms in this realistic scenario.

To simulate such a situation, we generated synthetic datasets whose characteristics resemble real data: the number of ratings for each user and each item is sampled from the corresponding empirical distribution of the Netflix data. As we increase the number of machines from 4 to 32, we fix the number of items to be the same as that of Netflix (17,770) and increase the number of users proportionally to the


Figure 11: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a commodity hardware cluster (panels: Netflix, Yahoo! Music, and Hugewiki with 32 machines; test RMSE vs. seconds).

number of machines (480,189 times the number of machines9). Therefore, the expected number of ratings in each dataset is proportional to the number of machines (99,072,112 times the number of machines) as well.

Conditioned on the number of ratings for each user and item, the nonzero locations are sampled uniformly at random. Ground-truth user parameters w_i and item parameters h_j are generated from a 100-dimensional standard isotropic Gaussian distribution, and for each rating A_ij, Gaussian noise with mean zero and standard deviation 0.1 is added to the "true" rating ⟨w_i, h_j⟩.
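A sketch of this generation procedure is given below. The empirical Netflix degree distributions are part of the experimental pipeline rather than of the text, so they are stubbed out with a placeholder sampler; the sketch also conditions only on per-user counts for brevity, whereas the experiment conditions on both user and item counts.

// Sketch of the synthetic data generator described above (toy scale).
#include <random>
#include <vector>

struct Rating { std::size_t i, j; double value; };

int main() {
  const std::size_t k = 100, n_items = 17770, n_users = 1000;   // toy number of users
  std::mt19937 gen(0);
  std::normal_distribution<double> unit(0.0, 1.0), noise(0.0, 0.1);

  // Ground-truth factors: 100-dimensional standard isotropic Gaussians.
  std::vector<double> W(n_users * k), H(n_items * k);
  for (auto& v : W) v = unit(gen);
  for (auto& v : H) v = unit(gen);

  // Placeholder for sampling a user's rating count from the empirical Netflix
  // distribution (here: a geometric stand-in, purely illustrative).
  std::geometric_distribution<int> degree(0.01);
  std::uniform_int_distribution<std::size_t> item(0, n_items - 1);

  std::vector<Rating> ratings;
  for (std::size_t u = 0; u < n_users; ++u) {
    int cnt = degree(gen) + 1;
    for (int r = 0; r < cnt; ++r) {
      std::size_t j = item(gen);                               // uniform nonzero location
      double dot = 0.0;
      for (std::size_t f = 0; f < k; ++f) dot += W[u * k + f] * H[j * k + f];
      ratings.push_back({u, j, dot + noise(gen)});             // "true" rating plus N(0, 0.1^2)
    }
  }
  return ratings.empty() ? 1 : 0;
}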

Figure 12 shows that the comparative advantage of NOMAD over DSGD, DSGD++, and CCD++ increases as we grow the scale of the problem. NOMAD clearly outperforms the other methods in all configurations. DSGD++ is very competitive on the small scale, but as the size of the problem grows, NOMAD shows better scaling behavior.

6. CONCLUSION AND FUTURE WORK

From our experimental study we conclude that:

• On a single machine, NOMAD shows near-linear scaling up to 30 threads.

• When a fixed size dataset is distributed across multiple machines, NOMAD shows near-linear scaling up to 32 machines.

• Both in shared-memory and distributed-memory settings, NOMAD exhibits superior performance against state-of-the-art competitors; on a commodity hardware cluster the comparative advantage is more conspicuous.

• When both the size of the data and the number of machines grow, the scaling behavior of NOMAD is much nicer than that of its competitors.

Although we only discussed the matrix completion problem in this paper, it is worth noting that the idea of NOMAD is more widely applicable. Specifically, the ideas discussed in this paper can be easily adapted as long as the objective function can be written as

\[ f(W, H) = \sum_{(i,j) \in \Omega} f_{ij}(w_i, h_j). \]
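To illustrate what this structure buys, here is a minimal serial sketch of an SGD pass over any objective of this form: each term f_ij touches only w_i and h_j, so an update needs only those two parameter vectors, which is exactly the access pattern NOMAD parallelizes by passing ownership of the h_j variables between workers. The regularized squared loss and the fixed step size below are one concrete, illustrative choice of f_ij, not the only one covered by the framework.

// Serial SGD sketch for a separable objective f(W,H) = sum_{(i,j) in Omega} f_ij(w_i, h_j).
#include <cstddef>
#include <vector>

struct Entry { std::size_t i, j; double a; };   // observed rating A_ij

void sgd_pass(const std::vector<Entry>& omega, std::vector<double>& W,
              std::vector<double>& H, std::size_t k, double lambda, double step) {
  for (const Entry& e : omega) {
    double* wi = &W[e.i * k];
    double* hj = &H[e.j * k];
    double dot = 0.0;
    for (std::size_t f = 0; f < k; ++f) dot += wi[f] * hj[f];
    double err = e.a - dot;
    for (std::size_t f = 0; f < k; ++f) {       // each update reads/writes only w_i and h_j
      double w_old = wi[f];
      wi[f] += step * (err * hj[f] - lambda * w_old);
      hj[f] += step * (err * w_old - lambda * hj[f]);
    }
  }
}

int main() {
  const std::size_t k = 2;
  std::vector<double> W = {0.1, 0.1, 0.1, 0.1};  // 2 toy users
  std::vector<double> H = {0.1, 0.1, 0.1, 0.1};  // 2 toy items
  std::vector<Entry> omega = {{0, 0, 1.0}, {1, 1, 2.0}};
  for (int epoch = 0; epoch < 100; ++epoch) sgd_pass(omega, W, H, k, 0.01, 0.05);
  return 0;
}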

As part of our ongoing work, we are investigating ways to rewrite Support Vector Machines (SVMs) and binary logistic regression as saddle point problems which have the above structure.

9 480,189 is the number of users in Netflix who have at least one rating.

Inference in Latent Dirichlet Allocation (LDA) using a collapsed Gibbs sampler has a similar structure to the stochastic gradient descent updates for matrix factorization. An additional complication in LDA is that the variables need to be normalized. We are also investigating how the NOMAD framework can be used for LDA.

7. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their constructive comments. We thank the Texas Advanced Computing Center at the University of Texas and the Research Computing group at Purdue University for providing infrastructure and timely support for our experiments. Computing experiments on commodity hardware were made possible by an AWS in Education Machine Learning Research Grant Award. This material is partially based upon work supported by the National Science Foundation under grant nos. IIS-1219015 and CCF-1117055.

References

[1] Apache Hadoop, 2009. http://hadoop.apache.org/core.

[2] GraphLab datasets, 2013. http://graphlab.org/downloads/datasets.

[3] Intel Thread Building Blocks, 2013. https://www.threadingbuildingblocks.org.

[4] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.

[5] R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations, 9(2):75–79, 2007.

[6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

[7] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, pages 351–368, 2011.

[8] D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2:199–222, 1969.


Figure 12: Comparison of algorithms when both the dataset size and the number of machines grow. Left: 4 machines, middle: 16 machines, right: 32 machines (cores=4, k=100, λ=0.01; test RMSE vs. seconds).

[9] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[10] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! Music dataset and KDD-Cup'11. Journal of Machine Learning Research - Proceedings Track, 18:8–18, 2012.

[11] A. Frommer and D. B. Szyld. On asynchronous iterations. Journal of Computational and Applied Mathematics, 123:201–216, 2000.

[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.

[13] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, pages 17–30, 2012.

[14] T. Hoefler, P. Gottschling, W. Rehm, and A. Lumsdaine. Optimizing a conjugate gradient solver with non-blocking operators. Parallel Computing, 2007.

[15] C. J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the conference on Knowledge Discovery and Data Mining, pages 1064–1072, August 2011.

[16] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences. Springer, New York, 1978.

[17] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the International Conference on Very Large Data Bases, pages 716–727, 2012.

[18] B. Recht and C. Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, June 2013.

[19] B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the conference on Advances in Neural Information Processing Systems, pages 693–701, 2011.

[20] P. Richtárik and M. Takáč. Distributed coordinate descent method for learning with big data, 2013. URL http://arxiv.org/abs/1310.2059.

[21] H. E. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[22] S. Shalev-Schwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the International Conference on Machine Learning, pages 928–935, 2008.

[23] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Proceedings of the International Conference on Very Large Data Bases, pages 703–710, 2010.

[24] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the International Conference on the World Wide Web, pages 607–614, 2011.

[25] C. Teflioudi, F. Makari, and R. Gemulla. Distributed matrix completion. In Proceedings of the International Conference on Data Mining, pages 655–664, 2012.

[26] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774, 2012.

[27] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proceedings of the conference on Algorithmic Aspects in Information and Management, pages 337–348, 2008.

[28] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proceedings of the ACM conference on Recommender systems, pages 249–256, 2013.


APPENDIX

A. EFFECT OF THE REGULARIZATION PARAMETER

In this subsection we study the convergence behavior of NOMAD as we change the regularization parameter λ (Figure 13). Note that on the Netflix data (left), for non-optimal choices of the regularization parameter the test RMSE increases from the initial solution as the model overfits or underfits the training data. While NOMAD reliably converges in all cases on Netflix, the convergence is notably faster with higher values of λ; this is expected because regularization smooths the objective function and makes the optimization problem easier to solve. On the other datasets, the speed of convergence was not very sensitive to the selection of the regularization parameter.

B. EFFECT OF THE LATENT DIMENSION

In this subsection we study the convergence behavior of NOMAD as we change the dimensionality parameter k (Figure 14). In general, the convergence is faster for smaller values of k, as the computational cost of the SGD updates (9) and (10) is linear in k. On the other hand, the model gets richer with higher values of k: as its parameter space expands, it becomes capable of picking up weaker signals in the data, with the risk of overfitting. This is observed in Figure 14 with Netflix (left) and Yahoo! Music (right). On Hugewiki, however, small values of k were sufficient to fit the training data, and the test RMSE suffers from overfitting with higher values of k. Nonetheless, NOMAD reliably converged in all cases.

C. SCALING ON COMMODITY HARDWARE

In this section we augment Section 5.4 by providing the actual plots of the experiment. We increase the number of machines from 1 to 32 and plot how the convergence of NOMAD is affected by the number of machines. As in Section 5.4, we used m1.xlarge machines from Amazon Web Services (AWS), which have a quad-core Intel Xeon E5430 CPU and 15GB of RAM each.

The overall pattern is identical to what was found in Figures 19, 10, and 9 of Section 5.3. Figure 15 shows how the test RMSE decreases as a function of the number of updates. As in Figure 19, the speed of convergence is faster with a larger number of machines, as updated information is exchanged more frequently. Figure 16 shows the number of updates performed per second on each computation core of each machine; NOMAD exhibits linear scaling on Netflix and Hugewiki, but slows down on Yahoo! Music due to the extreme sparsity of the data. Figure 17 compares the convergence speed of different settings when the same amount of computational power is given to each; on every dataset we observe linear to super-linear scaling up to 32 machines.

D. TEST RMSE AS A FUNCTION OF THE NUMBER OF UPDATES IN THE HPC CLUSTER

In this section we plot the test RMSE as a function of the number of updates in the HPC cluster, which was not included in the main text due to page constraints. Figure 18 shows single-machine multi-threaded experiments, and Figure 19 shows multi-machine distributed memory experiments.

E. COMPARISON OF ALGORITHMS FOR DIFFERENT VALUES OF THE REGULARIZATION PARAMETER

In this section we augment the experiments in Section 5.3 by comparing the performance of NOMAD, CCD++, and DSGD for different values of the regularization parameter λ. Figure 20 shows the result of the experiment. As NOMAD and DSGD are both stochastic gradient descent methods, they behave similarly to each other when the regularization parameter is changed. On the other hand, CCD++, which decreases the objective function more greedily, behaves differently.

For small values of λ, CCD++ seems to overfit the model due to its greedy strategy: it generally converges to a worse solution than the others. For high values of λ, however, the strategy of CCD++ is advantageous and it shows rapid initial convergence. Note that in all cases NOMAD is competitive as compared to both CCD++ and DSGD.

F. COMPARISON WITH GRAPHLAB

PowerGraph 2.2 [17], the latest version of GraphLab10, was used in our experiments. Since the Intel compiler is unable to compile GraphLab, we compiled it with gcc. The rest of the experimental setting is the same as in Section 5.1.

GraphLab provides a number of optimization algorithms for matrix completion. However, we found that the Stochastic Gradient Descent (SGD) implementation of GraphLab does not converge. According to the GraphLab developers, this is because the abstraction currently provided by GraphLab is not suitable for SGD.11 Therefore we use the Alternating Least Squares (ALS) algorithm for minimizing the objective function (1).

Although each machine in our HPC cluster is equipped with 32 GB of RAM and we distribute the work across 32 machines in the multi-machine experiments, we had to tune the nfibers parameter to avoid out-of-memory problems. In spite of tuning many different parameters, we were unable to run GraphLab on the Hugewiki data in any setting. For the other datasets, we tried both the synchronous and asynchronous engines of GraphLab and report the best of the two settings.

Figure 21 shows the results of single-machine multi-threaded experiments, while Figure 22 and Figure 23 show multi-machine experiments on the HPC cluster and the commodity cluster, respectively. Clearly, NOMAD not only converges orders of magnitude faster than GraphLab in every setting, but the quality of the solution is also significantly better.

Compared to the multi-machine setting with a large number of machines (32) but a small number of cores (4) per machine, GraphLab converges faster when it is run in a single-machine setting with a large number of cores (30). We conjecture that this is because locking and unlocking a variable over a network is expensive. In contrast, since NOMAD is a lock-free algorithm, it scales well in both settings.

Although GraphLab biassgd is based on a model different from (1), for completeness we provide comparisons with it

10 Available for download from https://github.com/graphlab-code/graphlab.

11 GraphLab provides a different algorithm named biassgd. However, it is based on a model different from (1).


Figure 13: Convergence behavior of NOMAD when the regularization parameter λ is varied (panels: Netflix, Yahoo! Music, and Hugewiki; machines=8, cores=4, k=100).

Figure 14: Convergence behavior of NOMAD when the latent dimension k is varied (panels: Netflix, Yahoo! Music, and Hugewiki; machines=8, cores=4).

Figure 15: Test RMSE of NOMAD as a function of the number of updates on a commodity hardware cluster when the number of machines is varied (panels: Netflix, Yahoo! Music, and Hugewiki).

Figure 16: Number of updates of NOMAD per machine per core per second as a function of the number of machines on a commodity hardware cluster (panels: Netflix, Yahoo! Music, and Hugewiki).


Figure 17: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on a commodity hardware cluster when the number of machines is varied (panels: Netflix, Yahoo! Music, and Hugewiki).

Figure 18: Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied (panels: Netflix, Yahoo! Music, and Hugewiki; machines=1).

Figure 19: Test RMSE of NOMAD as a function of the number of updates on a HPC cluster when the number of machines is varied (panels: Netflix, Yahoo! Music, and Hugewiki; cores=4).


Figure 20: Comparison of NOMAD, DSGD, and CCD++ on a HPC cluster when the regularization parameter λ is varied; the value of λ increases from top to bottom (columns: Netflix and Yahoo! Music with 32 machines, Hugewiki with 64 machines).


on the commodity hardware cluster. Unfortunately, GraphLab biassgd crashed when we ran it on more than 16 machines, so we had to run it on only 16 machines and assume that GraphLab would scale linearly up to 32 machines in order to generate the plots in Figure 23. Again, NOMAD was orders of magnitude faster than GraphLab and converges to a better solution.


Figure 21: Comparison of NOMAD and GraphLab on a single machine with 30 computation cores (panels: Netflix and Yahoo! Music).

Figure 22: Comparison of NOMAD and GraphLab on a HPC cluster (panels: Netflix and Yahoo! Music; machines=32, cores=4).

Figure 23: Comparison of NOMAD and GraphLab on a commodity hardware cluster (panels: Netflix and Yahoo! Music; machines=32, cores=4).

18

  • Introduction
  • Background
    • Alternating Least Squares
    • Coordinate Descent
    • Stochastic Gradient Descent
      • NOMAD
        • Description
        • Complexity Analysis
        • Dynamic Load Balancing
        • Hybrid Architecture
        • Implementation Details
          • Related Work
            • Map-Reduce and Friends
            • Asynchronous Algorithms
            • Numerical Linear Algebra
            • Discussion
              • Experiments
                • Experimental Setup
                • Scaling in Number of Cores
                • Scaling as a Fixed Dataset is Distributed Across Workers
                • Scaling on Commodity Hardware
                • Scaling as both Dataset Size and Number of Machines Grows
                  • Conclusion and Future Work
                  • Acknowledgements
                  • Effect of the Regularization Parameter
                  • Effect of the Latent Dimension
                  • Scaling on Commodity Hardware
                  • Test RMSE as a function of the number of updates in HPC cluster
                  • Comparison of Algorithms for Different Values of the Regularization Parameter
                  • Comparison with GraphLab
Page 7: NOMAD: Non-locking, stOchastic Multi-machine algorithm ...We develop an e cient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchas-tic Multi-machine

a complex job scheduler to reduce the side-effect of usinglocks

5 EXPERIMENTSIn this section we evaluate the empirical performance of

NOMAD with extensive experiments For the distributedmemory experiments we compare NOMAD with DSGD [12]DSGD++ [25] and CCD++ [26] We also compare againstGraphLab but the quality of results produced by GraphLabare significantly worse than the other methods and thereforethe plots for this experiment are delegated to Appendix FFor the shared memory experiments we pitch NOMAD againstFPSGD [28] (which is shown to outperform DSGD in sin-gle machine experiments) as well as CCD++ Our experi-ments are designed to answer the following

bull How does NOMAD scale with the number of cores ona single machine (Section 52)

bull How does NOMAD scale as a fixed size dataset is dis-tributed across multiple machines (Section 53)

bull How does NOMAD perform on a commodity hardwarecluster (Section 54)

bull How does NOMAD scale when both the size of the dataas well as the number of machines grow (Section 55)

Since the objective function (1) is non-convex differentoptimizers will converge to different solutions Factors whichaffect the quality of the final solution include 1) initializationstrategy 2) the sequence in which the ratings are accessedand 3) the step size decay schedule It is clearly not feasibleto consider the combinatorial effect of all these factors oneach algorithm However we believe that the overall trendof our results is not affected by these factors

51 Experimental SetupPublicly available code for FPSGD3 and CCD++4 was

used in our experiments For DSGD and DSGD++ whichwe had to implement ourselves because the code is not pub-licly available we closely followed the recommendations ofGemulla et al [12] and Teflioudi et al [25] and in somecases made improvements based on our experience For afair comparison all competing algorithms were tuned for op-timal performance on our hardware The code and scriptsrequired for reproducing the experiments are readily avail-able for download from httpssitesgooglecomsite

hyokunyunsoftware Parameters used in our experimentsare summarized in Table 1

Table 1 Dimensionality parameter k regularization param-eter λ (1) and step-size schedule parameters α β (11)

Name k λ α βNetflix 100 005 0012 005Yahoo Music 100 100 000075 001Hugewiki 100 001 0001 0

For all experiments except the ones in Section 55 wewill work with three benchmark datasets namely NetflixYahoo Music and Hugewiki (see Table 2 for more details)

3httpwwwcsientuedutw~cjlinlibmf4httpwwwcsutexasedu~rofuyulibpmf

Table 2 Dataset Details

Name Rows Columns Non-zerosNetflix [5] 2649429 17770 99072112Yahoo Music [10] 1999990 624961 252800275Hugewiki [2] 50082603 39780 2736496604

The same training and test dataset partition is used consis-tently for all algorithms in every experiment Since our goalis to compare optimization algorithms we do very minimalparameter tuning For instance we used the same regu-larization parameter λ for each dataset as reported by Yuet al [26] and shown in Table 1 we study the effect of theregularization parameter on the convergence of NOMAD inAppendix A By default we use k = 100 for the dimension ofthe latent space we study how the dimension of the latentspace affects convergence of NOMAD in Appendix B All al-gorithms were initialized with the same initial parameterswe set each entry of W and H by independently sampling auniformly random variable in the range (0 1radic

k) [26 28]

We compare solvers in terms of Root Mean Square Error(RMSE) on the test set which is defined asradicsum

(ij)isinΩtest (Aij minus 〈wihj〉)2

|Ωtest|

where Ωtest denotes the ratings in the test setAll experiments except the ones reported in Section 54

are run using the Stampede Cluster at University of Texas aLinux cluster where each node is outfitted with 2 Intel XeonE5 (Sandy Bridge) processors and an Intel Xeon Phi Copro-cessor (MIC Architecture) For single-machine experiments(Section 52) we used nodes in the largemem queue whichare equipped with 1TB of RAM and 32 cores For all otherexperiments we used the nodes in the normal queue whichare equipped with 32 GB of RAM and 16 cores (only 4 outof the 16 cores were used for computation) Inter-machinecommunication on this system is handled by MVAPICH2

For the commodity hardware experiments in Section 54we used m1xlarge instances of Amazon Web Services whichare equipped with 15GB of RAM and four cores We utilizedall four cores in each machine NOMAD and DSGD++ usetwo cores for computation and two cores for network com-munication while DSGD and CCD++ use all four cores forboth computation and communication Inter-machine com-munication on this system is handled by MPICH2

Since FPSGD uses single precision arithmetic the ex-periments in Section 52 are performed using single precisionarithmetic while all other experiments use double precisionarithmetic All algorithms are compiled with Intel C++compiler with the exception of experiments in Section 54where we used gcc which is the only compiler toolchain avail-able on the commodity hardware cluster For ready refer-ence exceptions to the experimental settings specific to eachsection are summarized in Table 3

The convergence speed of stochastic gradient descent meth-ods depends on the choice of the step size schedule Theschedule we used for NOMAD is

st =α

1 + β middot t15 (11)

where t is the number of SGD updates that were performedon a particular user-item pair (i j) DSGD and DSGD++

7

Table 3 Exceptions to each experiment

Section ExceptionSection 52 bull run on largemem queue (32 cores 1TB RAM)

bull single precision floating point usedSection 54 bull run on m1xlarge (4 cores 15GB RAM)

bull compiled with gccbull MPICH2 for MPI implementation

Section 55 bull Synthetic datasets

on the other hand use an alternative strategy called bold-driver [12] here the step size is adapted by monitoring thechange of the objective function

52 Scaling in Number of CoresFor the first experiment we fixed the number of cores to 30

and compared the performance of NOMAD vs FPSGD5

and CCD++ (Figure 5) On Netflix (left) NOMAD notonly converges to a slightly better quality solution (RMSE0914 vs 0916 of others) but is also able to reduce theRMSE rapidly right from the beginning On Yahoo Mu-sic (middle) NOMAD converges to a slightly worse solu-tion than FPSGD (RMSE 21894 vs 21853) but as in thecase of Netflix the initial convergence is more rapid OnHugewiki the difference is smaller but NOMAD still out-performs The initial speed of CCD++ on Hugewiki is com-parable to NOMAD but the quality of the solution startsto deteriorate in the middle Note that the performance ofCCD++ here is better than what was reported in Zhuanget al [28] since they used double-precision floating pointarithmetic for CCD++ In other experiments (not reportedhere) we varied the number of cores and found that the rela-tive difference in performance between NOMAD FPSGDand CCD++ are very similar to that observed in Figure 5

For the second experiment we varied the number of coresfrom 4 to 30 and plot the scaling behavior of NOMAD (Fig-ures 6 and 7) Figure 6 (left) shows how test RMSE changesas a function of the number of updates on Yahoo MusicInterestingly as we increased the number of cores the testRMSE decreased faster We believe this is because when weincrease the number of cores the rating matrix A is parti-tioned into smaller blocks recall that we split A into ptimes nblocks where p is the number of parallel workers Thereforethe communication between workers becomes more frequentand each SGD update is based on fresher information (seealso Section 32 for mathematical analysis) This effect wasmore strongly observed on Yahoo Music than others sinceYahoo Music has much larger number of items (624961vs 17770 of Netflix and 39780 of Hugewiki) and thereforemore amount of communication is needed to circulate thenew information to all workers Results for other datasetsare provided in Figure 18 in Appendix D

On the other hand to assess the efficiency of computa-tion we define average throughput as the average numberof ratings processed per core per second and plot it foreach dataset in Figure 6 (right) while varying the numberof cores If NOMAD exhibits linear scaling in terms of thespeed it processes ratings the average throughput should re-

5Since the current implementation of FPSGD in LibMFonly reports CPU execution time we divide this by the num-ber of threads and use this as a proxy for wall clock time

0 05 1 15 2 25 3

middot1010

22

24

26

28

number of updates

test

RMSE

Yahoo machines=1 λ = 100 k = 100

cores=4

cores=8

cores=16

cores=30

5 10 15 20 25 300

1

2

3

4

5

middot106

number of cores

up

date

sp

erco

rep

erse

c

machines=1 λ = 005 k = 100

Netflix

Yahoo

Hugewiki

Figure 6 Left Test RMSE of NOMAD as a function of thenumber of updates on Yahoo Music when the number ofcores is varied Right Number of updates of NOMAD percore per second as a function of the number of cores

main constant6 On Netflix the average throughput indeedremains almost constant as the number of cores changesOn Yahoo Music and Hugewiki the throughput decreasesto about 50 as the number of cores is increased to 30 Webelieve this is mainly due to cache locality effects

Now we study how much speed-up NOMAD can achieveby increasing the number of cores In Figure 7 we set y-axis to be test RMSE and x-axis to be the total CPU timeexpended which is given by the number of seconds elapsedmultiplied by the number of cores We plot the convergencecurves by setting the cores=4 8 16 and 30 If the curvesoverlap then this shows that we achieve linear speed up aswe increase the number of cores This is indeed the casefor Netflix and Hugewiki In the case of Yahoo Music weobserve that the speed of convergence increases as the num-ber of cores increases This we believe is again due to thedecrease in the block size which leads to faster convergence

53 Scaling as a Fixed Dataset is DistributedAcross Workers

In this subsection we use 4 computation threads per ma-chine For the first experiment we fix the number of ma-chines to 32 (64 for hugewiki) and compare the perfor-mance of NOMAD with DSGD DSGD++ and CCD++(Figure 8) On Netflix and Hugewiki NOMAD convergesmuch faster than its competitors not only is initial conver-gence faster but also it discovers a better quality solutionOn Yahoo Music all four methods perform similarly Thisis because the cost of network communication relative tothe size of the data is much higher for Yahoo Music whileNetflix and Hugewiki have 5575 and 68635 non-zero rat-ings per each item respectively Yahoo Music has only 404ratings per item Therefore when Yahoo Music is dividedequally across 32 machines each item has only 10 ratingson average per each machine Hence the cost of sending andreceiving item parameter vector hj for one item j across thenetwork is higher than that of executing SGD updates onthe ratings of the item locally stored within the machine

Ω(q)j As a consequence the cost of network communication

dominates the overall execution time of all algorithms andlittle difference in convergence speed is found between them

6Note that since we use single-precision floating pointarithmetic in this section to match the implementation ofFPSGD the throughput of NOMAD is about 50 higherthan that in other experiments

8

0 100 200 300 400091

092

093

094

095

seconds

test

RM

SE

Netflix machines=1 cores=30 λ = 005 k = 100

NOMAD

FPSGD

CCD++

0 100 200 300 400

22

24

26

seconds

test

RMSE

Yahoo machines=1 cores=30 λ = 100 k = 100

NOMAD

FPSGD

CCD++

0 500 1000 1500 2000 2500 300005

06

07

08

09

1

seconds

test

RMSE

Hugewiki machines=1 cores=30 λ = 001 k = 100

NOMAD

FPSGD

CCD++

Figure 5 Comparison of NOMAD FPSGD and CCD++ on a single-machine with 30 computation cores

0 1000 2000 3000 4000 5000 6000

092

094

096

098

seconds times cores

test

RM

SE

Netflix machines=1 λ = 005 k = 100

cores=4

cores=8

cores=16

cores=30

0 2000 4000 6000 8000

22

24

26

28

seconds times cores

test

RMSE

Yahoo machines=1 λ = 100 k = 100

cores=4

cores=8

cores=16

cores=30

0 05 1 15 2

middot105

05

055

06

065

seconds times cores

test

RMSE

Hugewiki machines=1 λ = 001 k = 100

cores=4

cores=8

cores=16

cores=30

Figure 7 Test RMSE of NOMAD as a function of computation time (time in seconds times the number of cores) when thenumber of cores is varied

For the second experiment we varied the number of ma-chines from 1 to 32 and plot the scaling behavior of NO-MAD (Figures 10 and 9) Figure 10 (left) shows how testRMSE decreases as a function of the number of updates onYahoo Music Again if NOMAD scales linearly the av-erage throughput has to remain constant here we observeimprovement in convergence speed when 8 or more machinesare used This is again the effect of smaller block sizes whichwas discussed in Section 52 On Netflix a similar effect waspresent but was less significant on Hugewiki we did not seeany notable difference between configurations (see Figure 19in Appendix D)

In Figure 10 (right) we plot the average throughput (thenumber of updates per machine per core per second) as afunction of the number of machines On Yahoo Music theaverage throughput goes down as we increase the numberof machines because as mentioned above each item has asmall number of ratings On Hugewiki we observe almostlinear scaling and on Netflix the average throughput evenimproves as we increase the number of machines we believethis is because of cache locality effects As we partition usersinto smaller and smaller blocks the probability of cache misson user parameters wirsquos within the block decrease and onNetflix this makes a meaningful difference indeed there areonly 480189 users in Netflix who have at least one ratingWhen this is equally divided into 32 machines each machinecontains only 11722 active users on average Therefore thewi variables only take 11MB of memory which is smallerthan the size of L3 cache (20MB) of the machine we usedand therefore leads to increase in the number of updates permachine per core per second

Now we study how much speed-up NOMAD can achieveby increasing the number of machines In Figure 9 we sety-axis to be test RMSE and x-axis to be the number ofseconds elapsed multiplied by the total number of cores usedin the configuration Again all lines will coincide with eachother if NOMAD shows linear scaling On Netflix with 2and 4 machines we observe mild slowdown but with morethan 4 machines NOMAD exhibits super-linear scaling OnYahoo Music we observe super-linear scaling with respectto the speed of a single machine on all configurations butthe highest speedup is seen with 16 machines On Hugewikilinear scaling is observed in every configuration

5.4 Scaling on Commodity Hardware

In this subsection we analyze the scaling behavior of NOMAD on commodity hardware. Using Amazon Web Services (AWS), we set up a computing cluster that consists of 32 machines; each machine is of type m1.xlarge and is equipped with a quad-core Intel Xeon E5430 CPU and 15GB of RAM. Network bandwidth among these machines is reported to be approximately 1Gb/s.⁷

Since NOMAD and DSGD++ dedicate two threads to network communication on each machine, only two cores are available for computation.⁸ In contrast, bulk synchronization algorithms such as DSGD and CCD++, which separate

⁷http://epamcloud.blogspot.com/2013/03/testing-amazon-ec2-network-speed.html
⁸Since network communication is not computation-intensive, for DSGD++ we used four computation threads instead of two and got better results; thus we report results with four computation threads for DSGD++.

Figure 8: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a HPC cluster (test RMSE vs. seconds; Netflix and Yahoo! Music with 32 machines, Hugewiki with 64 machines; 4 cores per machine, k = 100).

Figure 9: Test RMSE of NOMAD as a function of computation time (seconds × machines × cores per machine) on a HPC cluster when the number of machines is varied (4 cores per machine, k = 100).

computation and communication, can utilize all four cores for computation. In spite of this disadvantage, Figure 11 shows that NOMAD outperforms all other algorithms in this setting as well. In this plot we fixed the number of machines to 32; on Netflix and Hugewiki, NOMAD converges more rapidly to a better solution. Recall that on Yahoo! Music all four algorithms performed very similarly on the HPC cluster in Section 5.3. However, on commodity hardware NOMAD outperforms the other algorithms. This shows that the efficiency of network communication plays a very important role in commodity hardware clusters, where the communication is relatively slow. On Hugewiki, however, the number of columns is very small compared to the number of ratings, and thus network communication plays a smaller role in this dataset than in the others. Therefore the initial convergence of DSGD is a bit faster than NOMAD, as it uses all four cores for computation while NOMAD uses only two. Still, the overall convergence speed is similar, and NOMAD finds a better quality solution.

As in Section 5.3, we increased the number of machines from 1 to 32 and studied the scaling behavior of NOMAD. The overall trend was identical to what we observed in Figures 10 and 9; due to page constraints, the plots for this experiment can be found in Appendix C.

5.5 Scaling as both Dataset Size and Number of Machines Grows

In previous sections (Section 5.3 and Section 5.4) we studied the scalability of algorithms by partitioning a fixed amount of data into an increasing number of machines. In real-world applications of collaborative filtering, however, the

Figure 10: Results on the HPC cluster when the number of machines is varied. Left: test RMSE of NOMAD as a function of the number of updates on Netflix and Yahoo! Music. Right: number of updates of NOMAD per machine per core per second as a function of the number of machines (4 cores per machine, k = 100).

size of the data should grow over time as new users are added to the system. Therefore, to match the increased amount of data with an equivalent amount of physical memory and computational power, the number of machines should increase as well. The aim of this section is to compare the scaling behavior of NOMAD and that of other algorithms in this realistic scenario.

To simulate such a situation, we generated synthetic datasets whose characteristics resemble real data: the number of ratings for each user and each item is sampled from the corresponding empirical distribution of the Netflix data. As we increase the number of machines from 4 to 32, we fixed the number of items to be the same as that of Netflix (17,770) and increased the number of users proportionally to the

Figure 11: Comparison of NOMAD, DSGD, DSGD++, and CCD++ on a commodity hardware cluster (test RMSE vs. seconds; 32 machines, 4 cores per machine, k = 100).

number of machines (480,189 times the number of machines⁹). Therefore the expected number of ratings in each dataset is proportional to the number of machines (99,072,112 times the number of machines) as well.

Conditioned on the number of ratings for each user and item, the nonzero locations are sampled uniformly at random. Ground-truth user parameters wi's and item parameters hj's are generated from a 100-dimensional standard isotropic Gaussian distribution, and for each rating Aij, Gaussian noise with mean zero and standard deviation 0.1 is added to the "true" rating ⟨wi, hj⟩.
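A rough sketch of this generation procedure is given below. It is our own illustrative reconstruction in NumPy (all function and variable names are hypothetical, not the authors' script), and it approximates the conditioning step: given the sampled per-user rating counts, items are drawn in proportion to their empirical popularity rather than by matching the item degree sequence exactly.

```python
import numpy as np

def synthetic_dataset(num_machines, user_counts, item_counts, k=100,
                      noise_std=0.1, users_per_machine=480189,
                      num_items=17770, seed=0):
    """Sample a synthetic rating matrix resembling Netflix (Section 5.5).

    user_counts, item_counts: empirical per-user and per-item rating counts
    of Netflix, assumed to be available as 1-D integer arrays.
    """
    rng = np.random.default_rng(seed)
    num_users = users_per_machine * num_machines

    # Per-user degrees drawn from the empirical Netflix distribution.
    user_deg = rng.choice(user_counts, size=num_users)
    item_prob = item_counts / item_counts.sum()

    # Ground-truth factors from a standard isotropic Gaussian.
    W = rng.standard_normal((num_users, k))
    H = rng.standard_normal((num_items, k))

    rows, cols, vals = [], [], []
    for i in range(num_users):
        # Approximation: items sampled in proportion to empirical popularity.
        js = rng.choice(num_items, size=user_deg[i], replace=False, p=item_prob)
        for j in js:
            rating = W[i] @ H[j] + rng.normal(0.0, noise_std)
            rows.append(i); cols.append(j); vals.append(rating)
    return np.array(rows), np.array(cols), np.array(vals)
```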

Figure 12 shows that the comparative advantage of NOMAD over DSGD, DSGD++, and CCD++ increases as we grow the scale of the problem. NOMAD clearly outperforms the other methods in all configurations. DSGD++ is very competitive at the small scale, but as the size of the problem grows, NOMAD shows better scaling behavior.

6. CONCLUSION AND FUTURE WORK

From our experimental study we conclude that:

• On a single machine, NOMAD shows near-linear scaling up to 30 threads.

• When a fixed-size dataset is distributed across multiple machines, NOMAD shows near-linear scaling up to 32 machines.

• Both in shared-memory and distributed-memory settings, NOMAD exhibits superior performance against state-of-the-art competitors; in a commodity hardware cluster, the comparative advantage is more conspicuous.

• When both the size of the data and the number of machines grow, the scaling behavior of NOMAD is much nicer than that of its competitors.

Although we only discussed the matrix completion problem in this paper, it is worth noting that the idea of NOMAD is more widely applicable. Specifically, the ideas discussed in this paper can be easily adapted as long as the objective function can be written as

f(W, H) = Σ_(i,j)∈Ω fij(wi, hj).
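For concreteness, the sketch below (our own illustration in NumPy, not code from the paper) writes the regularized squared loss for matrix completion in exactly this form: a sum of per-rating terms fij that touch only wi and hj. The way the L2 penalty is apportioned across ratings is one common convention and is an assumption here, not necessarily identical to objective (1).

```python
import numpy as np

def f_ij(a_ij, w_i, h_j, lam, n_i, m_j):
    """Per-rating loss: squared error plus a share of the regularizer.
    n_i, m_j: number of observed ratings of user i and item j, used to
    spread the L2 penalty over the entries that touch w_i and h_j
    (a common convention, assumed here for illustration)."""
    err = a_ij - w_i @ h_j
    return err ** 2 + lam * (np.dot(w_i, w_i) / n_i + np.dot(h_j, h_j) / m_j)

def objective(ratings, W, H, lam, n, m):
    """f(W, H) = sum over observed (i, j, a_ij) of f_ij(w_i, h_j)."""
    return sum(f_ij(a, W[i], H[j], lam, n[i], m[j]) for i, j, a in ratings)
```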

As part of our ongoing work, we are investigating ways to rewrite Support Vector Machines (SVMs) and binary logistic regression as saddle point problems, which have the above structure.

⁹480,189 is the number of users in Netflix who have at least one rating.

Inference in Latent Dirichlet Allocation (LDA) using a collapsed Gibbs sampler has a structure similar to the stochastic gradient descent updates for matrix factorization. An additional complication in LDA is that the variables need to be normalized. We are also investigating how the NOMAD framework can be used for LDA.

7. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their constructive comments. We thank the Texas Advanced Computing Center at the University of Texas and the Research Computing group at Purdue University for providing infrastructure and timely support for our experiments. Computing experiments on commodity hardware were made possible by an AWS in Education Machine Learning Research Grant Award. This material is partially based upon work supported by the National Science Foundation under grant nos. IIS-1219015 and CCF-1117055.

References

[1] Apache Hadoop, 2009. http://hadoop.apache.org/core.

[2] GraphLab datasets, 2013. http://graphlab.org/downloads/datasets.

[3] Intel Threading Building Blocks, 2013. https://www.threadingbuildingblocks.org.

[4] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.

[5] R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations, 9(2):75–79, 2007.

[6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

[7] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, pages 351–368, 2011.

[8] D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2:199–222, 1969.

Figure 12: Comparison of algorithms when both the dataset size and the number of machines grow (test RMSE vs. seconds; 4 cores per machine, k = 100, λ = 0.01). Left: 4 machines, middle: 16 machines, right: 32 machines.

[9] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[10] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! Music dataset and KDD-Cup'11. Journal of Machine Learning Research - Proceedings Track, 18:8–18, 2012.

[11] A. Frommer and D. B. Szyld. On asynchronous iterations. Journal of Computational and Applied Mathematics, 123:201–216, 2000.

[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.

[13] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, pages 17–30, 2012.

[14] T. Hoefler, P. Gottschling, W. Rehm, and A. Lumsdaine. Optimizing a conjugate gradient solver with non-blocking operators. Parallel Computing, 2007.

[15] C. J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the conference on Knowledge Discovery and Data Mining, pages 1064–1072, August 2011.

[16] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences. Springer, New York, 1978.

[17] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the International Conference on Very Large Data Bases, pages 716–727, 2012.

[18] B. Recht and C. Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, June 2013.

[19] B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the conference on Advances in Neural Information Processing Systems, pages 693–701, 2011.

[20] P. Richtárik and M. Takáč. Distributed coordinate descent method for learning with big data, 2013. URL http://arxiv.org/abs/1310.2059.

[21] H. E. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[22] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the International Conference on Machine Learning, pages 928–935, 2008.

[23] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Proceedings of the International Conference on Very Large Data Bases, pages 703–710, 2010.

[24] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the International Conference on the World Wide Web, pages 607–614, 2011.

[25] C. Teflioudi, F. Makari, and R. Gemulla. Distributed matrix completion. In Proceedings of the International Conference on Data Mining, pages 655–664, 2012.

[26] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774, 2012.

[27] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proceedings of the conference on Algorithmic Aspects in Information and Management, pages 337–348, 2008.

[28] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proceedings of the ACM conference on Recommender Systems, pages 249–256, 2013.

APPENDIX
A. EFFECT OF THE REGULARIZATION PARAMETER

In this subsection we study the convergence behavior of NOMAD as we change the regularization parameter λ (Figure 13). Note that on the Netflix data (left), for non-optimal choices of the regularization parameter the test RMSE increases from the initial solution as the model overfits or underfits the training data. While NOMAD reliably converges in all cases on Netflix, the convergence is notably faster with higher values of λ; this is expected because regularization smooths the objective function and makes the optimization problem easier to solve. On the other datasets the speed of convergence was not very sensitive to the selection of the regularization parameter.

B. EFFECT OF THE LATENT DIMENSION

In this subsection we study the convergence behavior of NOMAD as we change the dimensionality parameter k (Figure 14). In general, convergence is faster for smaller values of k, as the computational cost of the SGD updates (9) and (10) is linear in k. On the other hand, the model gets richer with higher values of k: as its parameter space expands, it becomes capable of picking up weaker signals in the data, with the risk of overfitting. This is observed in Figure 14 with Netflix (left) and Yahoo! Music (right). On Hugewiki, however, small values of k were sufficient to fit the training data, and test RMSE suffers from overfitting with higher values of k. Nonetheless, NOMAD reliably converged in all cases.
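To see why the per-update cost is linear in k, here is a minimal sketch of a single SGD step on one rating, assuming the regularized squared loss of the earlier sketch and a fixed step size; it mirrors the role of updates (9) and (10) but uses our own notation and assumptions, not the paper's exact update rule.

```python
import numpy as np

def sgd_update(a_ij, w_i, h_j, lam, n_i, m_j, step):
    """One stochastic gradient step on f_ij; O(k) time and memory traffic.
    Assumes the regularized squared error loss with the penalty split
    across ratings (an assumption for illustration)."""
    err = a_ij - w_i @ h_j                          # O(k) dot product
    grad_w = -2.0 * err * h_j + 2.0 * lam * w_i / n_i
    grad_h = -2.0 * err * w_i + 2.0 * lam * h_j / m_j
    w_i -= step * grad_w                            # O(k) vector update
    h_j -= step * grad_h                            # O(k) vector update
    return w_i, h_j
```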

C. SCALING ON COMMODITY HARDWARE

In this section we augment Section 5.4 by providing the actual plots of the experiment. We increase the number of machines from 1 to 32 and plot how the convergence of NOMAD is affected by the number of machines. As in Section 5.4, we used m1.xlarge machines from Amazon Web Services (AWS), each of which has a quad-core Intel Xeon E5430 CPU and 15GB of RAM.

The overall pattern is identical to what was found in Figures 19, 10, and 9 of Section 5.3. Figure 15 shows how the test RMSE decreases as a function of the number of updates. As in Figure 19, the speed of convergence is faster with a larger number of machines, as updated information is exchanged more frequently. Figure 16 shows the number of updates performed per second in each computation core of each machine. NOMAD exhibits linear scaling on Netflix and Hugewiki, but slows down on Yahoo! Music due to the extreme sparsity of the data. Figure 17 compares the convergence speed of different settings when the same amount of computational power is given to each; on every dataset we observe linear to super-linear scaling up to 32 machines.

D. TEST RMSE AS A FUNCTION OF THE NUMBER OF UPDATES IN HPC CLUSTER

In this section we plot the test RMSE as a function of the number of updates on the HPC cluster; these plots were not included in the main text due to page constraints. Figure 18 shows single-machine multi-threaded experiments and Figure 19 shows multi-machine distributed-memory experiments.

E. COMPARISON OF ALGORITHMS FOR DIFFERENT VALUES OF THE REGULARIZATION PARAMETER

In this section we augment the experiments in Section 5.3 by comparing the performance of NOMAD, CCD++, and DSGD for different values of the regularization parameter λ. Figure 20 shows the result of the experiment. As NOMAD and DSGD are both stochastic gradient descent methods, they behave similarly to each other when the regularization parameter is changed. On the other hand, CCD++, which decreases the objective function more greedily, behaves differently.

For small values of λ, CCD++ seems to overfit due to its greedy strategy; it generally converges to a worse solution than the others. For high values of λ, however, the strategy of CCD++ is advantageous and it shows rapid initial convergence. Note that in all cases NOMAD is competitive compared to both CCD++ and DSGD.

F. COMPARISON WITH GRAPHLAB

PowerGraph 2.2 [17], the latest version of GraphLab,¹⁰ was used in our experiments. Since the Intel compiler is unable to compile GraphLab, we compiled it with gcc. The rest of the experimental setting is the same as in Section 5.1.

GraphLab provides a number of optimization algorithms for matrix completion. However, we found that the Stochastic Gradient Descent (SGD) implementation of GraphLab does not converge. According to the GraphLab developers, this is because the abstraction currently provided by GraphLab is not suitable for SGD.¹¹ Therefore we use the Alternating Least Squares (ALS) algorithm for minimizing the objective function (1).

Although each machine in our HPC cluster is equipped with 32 GB of RAM and we distribute the work across 32 machines in multi-machine experiments, we had to tune the nfibers parameter to avoid out-of-memory problems. In spite of tuning many different parameters, we were unable to run GraphLab on the Hugewiki data in any setting. For the other datasets we tried both the synchronous and asynchronous engines of GraphLab and report the better of the two settings.

Figure 21 shows results of single-machine multi-threaded experiments, while Figure 22 and Figure 23 show multi-machine experiments on the HPC cluster and the commodity cluster, respectively. Clearly, NOMAD not only converges orders of magnitude faster than GraphLab in every setting, but the quality of the solution is also significantly better.

Compared to the multi-machine setting with a large number of machines (32) but a small number of cores (4) per machine, GraphLab converges faster when it is run in a single-machine setting with a large number of cores (30). We conjecture that this is because locking and unlocking a variable over a network is expensive. In contrast, since NOMAD is a lock-free algorithm, it scales well in both settings.

Although GraphLab biassgd is based on a model different from (1), for completeness we provide comparisons with it

¹⁰Available for download from https://github.com/graphlab-code/graphlab
¹¹GraphLab provides a different algorithm named biassgd. However, it is based on a model different from (1).

Figure 13: Convergence behavior of NOMAD when the regularization parameter λ is varied (test RMSE vs. seconds; 8 machines, 4 cores per machine, k = 100).

Figure 14: Convergence behavior of NOMAD when the latent dimension k is varied (test RMSE vs. seconds; 8 machines, 4 cores per machine; k ∈ {10, 20, 50, 100}).

Figure 15: Test RMSE of NOMAD as a function of the number of updates on a commodity hardware cluster when the number of machines is varied (4 cores per machine, k = 100).

Figure 16: Number of updates of NOMAD per machine per core per second as a function of the number of machines on a commodity hardware cluster (4 cores per machine, k = 100).

Figure 17: Test RMSE of NOMAD as a function of computation time (seconds × machines × cores per machine) on a commodity hardware cluster when the number of machines is varied (4 cores per machine, k = 100).

Figure 18: Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied (single machine; cores ∈ {4, 8, 16, 30}, k = 100).

Figure 19: Test RMSE of NOMAD as a function of the number of updates on a HPC cluster when the number of machines is varied (4 cores per machine, k = 100).

Figure 20: Comparison of NOMAD, DSGD, and CCD++ on a HPC cluster when the regularization parameter λ is varied; the value of λ increases from top to bottom (test RMSE vs. seconds; Netflix and Yahoo! Music with 32 machines, Hugewiki with 64 machines; 4 cores per machine, k = 100).

on the commodity hardware cluster. Unfortunately, GraphLab biassgd crashed when we ran it on more than 16 machines, so we had to run it on only 16 machines and assumed that GraphLab will scale linearly up to 32 machines in order to generate the plots in Figure 23. Again, NOMAD was orders of magnitude faster than GraphLab and converges to a better solution.

Figure 21: Comparison of NOMAD and GraphLab on a single machine with 30 computation cores (test RMSE vs. seconds; Netflix λ = 0.05, Yahoo! Music λ = 100, k = 100).

Figure 22: Comparison of NOMAD and GraphLab on a HPC cluster (test RMSE vs. seconds; 32 machines, 4 cores per machine, k = 100).

Figure 23: Comparison of NOMAD and GraphLab (ALS and biassgd) on a commodity hardware cluster (test RMSE vs. seconds; 32 machines, 4 cores per machine, k = 100).


0 100 200 300 400091

092

093

094

095

seconds

test

RM

SE

Netflix machines=1 cores=30 λ = 005 k = 100

NOMAD

FPSGD

CCD++

0 100 200 300 400

22

24

26

seconds

test

RMSE

Yahoo machines=1 cores=30 λ = 100 k = 100

NOMAD

FPSGD

CCD++

0 500 1000 1500 2000 2500 300005

06

07

08

09

1

seconds

test

RMSE

Hugewiki machines=1 cores=30 λ = 001 k = 100

NOMAD

FPSGD

CCD++

Figure 5 Comparison of NOMAD FPSGD and CCD++ on a single-machine with 30 computation cores

0 1000 2000 3000 4000 5000 6000

092

094

096

098

seconds times cores

test

RM

SE

Netflix machines=1 λ = 005 k = 100

cores=4

cores=8

cores=16

cores=30

0 2000 4000 6000 8000

22

24

26

28

seconds times cores

test

RMSE

Yahoo machines=1 λ = 100 k = 100

cores=4

cores=8

cores=16

cores=30

0 05 1 15 2

middot105

05

055

06

065

seconds times cores

test

RMSE

Hugewiki machines=1 λ = 001 k = 100

cores=4

cores=8

cores=16

cores=30

Figure 7 Test RMSE of NOMAD as a function of computation time (time in seconds times the number of cores) when thenumber of cores is varied

For the second experiment we varied the number of ma-chines from 1 to 32 and plot the scaling behavior of NO-MAD (Figures 10 and 9) Figure 10 (left) shows how testRMSE decreases as a function of the number of updates onYahoo Music Again if NOMAD scales linearly the av-erage throughput has to remain constant here we observeimprovement in convergence speed when 8 or more machinesare used This is again the effect of smaller block sizes whichwas discussed in Section 52 On Netflix a similar effect waspresent but was less significant on Hugewiki we did not seeany notable difference between configurations (see Figure 19in Appendix D)

In Figure 10 (right) we plot the average throughput (thenumber of updates per machine per core per second) as afunction of the number of machines On Yahoo Music theaverage throughput goes down as we increase the numberof machines because as mentioned above each item has asmall number of ratings On Hugewiki we observe almostlinear scaling and on Netflix the average throughput evenimproves as we increase the number of machines we believethis is because of cache locality effects As we partition usersinto smaller and smaller blocks the probability of cache misson user parameters wirsquos within the block decrease and onNetflix this makes a meaningful difference indeed there areonly 480189 users in Netflix who have at least one ratingWhen this is equally divided into 32 machines each machinecontains only 11722 active users on average Therefore thewi variables only take 11MB of memory which is smallerthan the size of L3 cache (20MB) of the machine we usedand therefore leads to increase in the number of updates permachine per core per second

Now we study how much speed-up NOMAD can achieveby increasing the number of machines In Figure 9 we sety-axis to be test RMSE and x-axis to be the number ofseconds elapsed multiplied by the total number of cores usedin the configuration Again all lines will coincide with eachother if NOMAD shows linear scaling On Netflix with 2and 4 machines we observe mild slowdown but with morethan 4 machines NOMAD exhibits super-linear scaling OnYahoo Music we observe super-linear scaling with respectto the speed of a single machine on all configurations butthe highest speedup is seen with 16 machines On Hugewikilinear scaling is observed in every configuration

54 Scaling on Commodity HardwareIn this subsection we want to analyze the scaling behav-

ior of NOMAD on commodity hardware Using AmazonWeb Services (AWS) we set up a computing cluster thatconsists of 32 machines each machine is of type m1xlarge

and equipped with quad-core Intel Xeon E5430 CPU and15GB of RAM Network bandwidth among these machinesis reported to be approximately 1Gbs7

Since NOMAD and DSGD++ dedicates two threads fornetwork communication on each machine only two cores areavailable for computation8 In contrast bulk synchroniza-tion algorithms such as DSGD and CCD++ which separate

7httpepamcloudblogspotcom201303testing-amazon-ec2-network-speedhtml8Since network communication is not computation-intensive for DSGD++ we used four computation threadsinstead of two and got better results thus we report resultswith four computation threads for DSGD++

9

0 20 40 60 80 100 120

092

094

096

098

1

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 005 k = 100

NOMAD

DSGD

DSGD++

CCD++

0 20 40 60 80 100 120

22

24

26

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 100 k = 100

NOMAD

DSGD

DSGD++

CCD++

0 200 400 60005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 001 k = 100

NOMAD

DSGD

DSGD++

CCD++

Figure 8 Comparison of NOMAD DSGD DSGD++ and CCD++ on a HPC cluster

0 1000 2000 3000 4000 5000 6000

092

094

096

098

1

seconds times machines times cores

test

RM

SE

Netflix cores=4 λ = 005 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 2000 4000 6000 8000

22

23

24

25

26

27

seconds times machines times cores

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 05 1 15 2

middot105

05

055

06

065

07

seconds times machines times cores

test

RMSE

Hugewiki cores=4 λ = 001 k = 100

machines=4

machines=8

machines=16

machines=32

machines=64

Figure 9 Test RMSE of NOMAD as a function of computation time (time in seconds times the number of machines times thenumber of cores per each machine) on a HPC cluster when the number of machines is varied

computation and communication can utilize all four cores forcomputation In spite of this disadvantage Figure 11 showsthat NOMAD outperforms all other algorithms in this set-ting as well In this plot we fixed the number of machinesto 32 on Netflix and Hugewiki NOMAD converges morerapidly to a better solution Recall that on Yahoo Music allfour algorithms performed very similarly on a HPC clusterin Section 53 However on commodity hardware NOMADoutperforms the other algorithms This shows that the ef-ficiency of network communication plays a very importantrole in commodity hardware clusters where the communica-tion is relatively slow On Hugewiki however the numberof columns is very small compared to the number of ratingsand thus network communication plays smaller role in thisdataset compared to others Therefore initial convergenceof DSGD is a bit faster than NOMAD as it uses all fourcores on computation while NOMAD uses only two Stillthe overall convergence speed is similar and NOMAD findsa better quality solution

As in Section 53 we increased the number of machinesfrom 1 to 32 and studied the scaling behavior of NOMADThe overall trend was identical to what we observed in Fig-ure 10 and 9 due to page constraints the plots for thisexperiment can be found in the Appendix C

55 Scaling as both Dataset Size and Numberof Machines Grows

In previous sections (Section 53 and Section 54) westudied the scalability of algorithms by partitioning a fixedamount of data into increasing number of machines In real-world applications of collaborative filtering however the

0 05 1 15 2 25 3

middot1010

22

23

24

25

26

27

number of updates

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 5 10 15 20 25 300

1

2

3

4

middot106

number of machines

up

date

sp

erm

ach

ine

per

core

per

sec

Netflix cores=4 λ = 005 k = 100

Netflix

Yahoo

Hugewiki

Figure 10 Results on HPC cluster when the number ofmachines is varied Left Test RMSE of NOMAD as a func-tion of the number of updates on Netflix and Yahoo MusicRight Number of updates of NOMAD per machine per coreper second as a function of the number of machines

size of the data should grow over time as new users areadded to the system Therefore to match the increasedamount of data with equivalent amount of physical memoryand computational power the number of machines shouldincrease as well The aim of this section is to compare thescaling behavior of NOMAD and that of other algorithmsin this realistic scenario

To simulate such a situation we generated synthetic datasetswhose characteristics resemble real data the number of rat-ings for each user and each item is sampled from the corre-sponding empirical distribution of the Netflix data As weincrease the number of machines from 4 to 32 we fixed thenumber of items to be the same to that of Netflix (17770)and increased the number of users to be proportional to the

10

0 100 200 300 400 500

092

094

096

098

1

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 005 k = 100

NOMAD

DSGD

DSGD++

CCD++

0 100 200 300 400 500 600

22

24

26

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 100 k = 100

NOMAD

DSGD

DSGD++

CCD++

0 1000 2000 3000 400005

055

06

065

seconds

test

RMSE

Hugewiki machines=32 cores=4 λ = 100 k = 100

NOMAD

DSGD

DSGD++

CCD++

Figure 11 Comparison of NOMAD DSGD DSGD++ and CCD++ on a commodity hardware cluster

number of machines (480189 times the number of machines9)Therefore the expected number of ratings in each dataset isproportional to the number of machines (99072112 times thenumber of machines) as well

Conditioned on the number of ratings for each user anditem the nonzero locations are sampled uniformly at ran-dom Ground-truth user parameters wirsquos and item param-eters hj rsquos are generated from 100-dimensional standard iso-metric Gaussian distribution and for each rating Aij Gaus-sian noise with mean zero and standard deviation 01 isadded to the ldquotruerdquo rating 〈wihj〉

Figure 12 shows that the comparative advantage of NO-MAD against DSGD DSGD++ and CCD++ increases aswe grow the scale of the problem NOMAD clearly out-performs other methods on all configurations DSGD++ isvery competitive on the small scale but as the size of theproblem grows NOMAD shows better scaling behavior

6 CONCLUSION AND FUTURE WORKFrom our experimental study we conclude that

bull On a single machine NOMAD shows near-linear scal-ing up to 30 threads

bull When a fixed size dataset is distributed across multiplemachines NOMAD shows near-linear scaling up to 32machines

bull Both in shared-memory and distributed-memory set-ting NOMAD exhibits superior performance againststate-of-the-art competitors in commodity hardwarecluster the comparative advantage is more conspicu-ous

bull When both the size of the data as well as the numberof machines grow the scaling behavior of NOMAD ismuch nicer than its competitors

Although we only discussed the matrix completion prob-lem in this paper it is worth noting that the idea of NO-MAD is more widely applicable Specifically ideas discussedin this paper can be easily adapted as long as the objectivefunction can be written as

f(WH) =sumijisinΩ

fij(wihj)

As part of our ongoing work we are investigating ways torewrite Support Vector Machines (SVMs) binary logistic9480189 is the number of users in Netflix who have at leastone rating

regression as a saddle point problem which have the abovestructure

Inference in Latent Dirichlet Allocation (LDA) using a col-lapsed Gibbs sampler has a similar structure as the stochas-tic gradient descent updates for matrix factorization Anadditional complication in LDA is that the variables needto be normalized We are also investigating how the NO-MAD framework can be used for LDA

7 ACKNOWLEDGEMENTSWe thank the anonymous reviewers for their construc-

tive comments We thank the Texas Advanced ComputingCenter at University of Texas and the Research Comput-ing group at Purdue University for providing infrastructureand timely support for our experiments Computing ex-periments on commodity hardware were made possible byan AWS in Education Machine Learning Research GrantAward This material is partially based upon work sup-ported by the National Science Foundation under grant nosIIS-1219015 and CCF-1117055

References[1] Apache Hadoop 2009

httphadoopapacheorgcore

[2] Graphlab datasets 2013httpgraphlaborgdownloadsdatasets

[3] Intel thread building blocks 2013httpswwwthreadingbuildingblocksorg

[4] A Agarwal O Chapelle M Dudık and J LangfordA reliable effective terascale linear learning systemCoRR abs11104198 2011

[5] R M Bell and Y Koren Lessons from the netflix prizechallenge SIGKDD Explorations 9(2)75ndash79 2007

[6] D P Bertsekas and J N Tsitsiklis Parallel and Dis-tributed Computation Numerical Methods Athena Sci-entific 1997

[7] L Bottou and O Bousquet The tradeoffs of large-scalelearning Optimization for Machine Learning pages351ndash368 2011

[8] D Chazan and W Miranker Chaotic relaxation Lin-ear Algebra and its Applications 2199ndash222 1969

11

0 1000 2000 3000 4000 500015

16

17

18

seconds

test

RMSE

machines=4 cores=4 k = 100 λ = 001

NOMAD

DSGD

DSGD++

CCD++

0 02 04 06 08 1

middot104

16

18

2

22

seconds

test

RMSE

machines=16 cores=4 k = 100 λ = 001

NOMAD

DSGD

DSGD++

CCD++

0 02 04 06 08 1

middot104

16

18

2

22

24

seconds

test

RMSE

machines=32 cores=4 k = 100 λ = 001

NOMAD

DSGD

DSGD++

CCD++

Figure 12 Comparison of algorithms when both dataset size and the number of machines grows Left 4 machines middle16 machines right 32 machines

[9] J Dean and S Ghemawat MapReduce simplified dataprocessing on large clusters Communications of theACM 51(1)107ndash113 2008

[10] G Dror N Koenigstein Y Koren and M WeimerThe Yahoo music dataset and KDD-Cuprsquo11 Journalof Machine Learning Research-Proceedings Track 188ndash18 2012

[11] A Frommer and D B Szyld On asynchronous iter-ations Journal of Computational and Applied Mathe-matics 123201ndash216 2000

[12] R Gemulla E Nijkamp P J Haas and Y Sisma-nis Large-scale matrix factorization with distributedstochastic gradient descent In Proceedings of theconference on Knowledge Discovery and Data Miningpages 69ndash77 2011

[13] J E Gonzalez Y Low H Gu D Bickson andC Guestrin Powergraph Distributed graph-parallelcomputation on natural graphs In Proceedings of theUSENIX Symposium on Operating Systems Design andImplementation pages 17ndash30 2012

[14] T Hoefler P Gottschling W Rehm and A Lums-daine Optimizing a conjugate gradient solver with nonblocking operators Parallel Computing 2007

[15] C J Hsieh and I S Dhillon Fast coordinate descentmethods with variable selection for non-negative ma-trix factorization In Proceedings of the conference onKnowledge Discovery and Data Mining pages 1064ndash1072 August 2011

[16] H Kushner and D Clark Stochastic ApproximationMethods for Constrained and Unconstrained Systemsvolume 26 of Applied Mathematical Sciences SpringerNew York 1978

[17] Y Low J Gonzalez A Kyrola D BicksonC Guestrin and J M Hellerstein Distributedgraphlab A framework for machine learning and datamining in the cloud In Proceedings of the InternationalConference on Very Large Data Bases pages 716ndash7272012

[18] B Recht and C Re Parallel stochastic gradient algo-rithms for large-scale matrix completion MathematicalProgramming Computation 5(2)201ndash226 June 2013

[19] B Recht C Re S Wright and F Niu Hogwild Alock-free approach to parallelizing stochastic gradientdescent In Proceedings of the conference on Advancesin Neural Information Processing Systems pages 693ndash701 2011

[20] P Richtarik and M Takac Distributed coordinate de-scent method for learning with big data 2013 URLhttparxivorgabs13102059

[21] H E Robbins and S Monro A stochastic approxi-mation method Annals of Mathematical Statistics 22400ndash407 1951

[22] S Shalev-Schwartz and N Srebro SVM optimizationInverse dependence on training set size In Proceedingsof the International Conference on Machine Learningpages 928ndash935 2008

[23] A J Smola and S Narayanamurthy An architecturefor parallel topic models In Proceedings of the Inter-national Conference on Very Large Data Bases pages703ndash710 2010

[24] S Suri and S Vassilvitskii Counting triangles and thecurse of the last reducer In Proceedings of the Interna-tional Conference on the World Wide Web pages 607ndash614 2011

[25] C Teflioudi F Makari and R Gemulla Distributedmatrix completion In Proceedings of the InternationalConference on Data Mining pages 655ndash664 2012

[26] H-F Yu C-J Hsieh S Si and I S Dhillon Scal-able coordinate descent approaches to parallel matrixfactorization for recommender systems In Proceedingsof the International Conference on Data Mining pages765ndash774 2012

[27] Y Zhou D Wilkinson R Schreiber and R PanLarge-scale parallel collaborative filtering for the netflixprize In Proceedings of the conference on AlgorithmicAspects in Information and Management pages 337ndash348 2008

[28] Y Zhuang W-S Chin Y-C Juan and C-J LinA fast parallel SGD for matrix factorization in sharedmemory systems In Proceedings of the ACM conferenceon Recommender systems pages 249ndash256 2013

12

APPENDIXA EFFECT OF THE REGULARIZATION

PARAMETERIn this subsection we study the convergence behavior of

NOMAD as we change the regularization parameter λ (Fig-ure 13) Note that in Netflix data (left) for non-optimalchoices of the regularization parameter the test RMSE in-creases from the initial solution as the model overfits or un-derfits to the training data While NOMAD reliably con-verges in all cases on Netflix the convergence is notablyfaster with higher values of λ this is expected because reg-ularization smooths the objective function and makes theoptimization problem easier to solve On other datasets thespeed of convergence was not very sensitive to the selectionof the regularization parameter

B EFFECT OF THE LATENT DIMENSIONIn this subsection we study the convergence behavior of

NOMAD as we change the dimensionality parameter k (Fig-ure 14) In general the convergence is faster for smaller val-ues of k as the computational cost of SGD updates (9) and(10) is linear to k On the other hand the model gets richerwith higher values of k as its parameter space expands itbecomes capable of picking up weaker signals in the datawith the risk of overfitting This is observed in Figure 14with Netflix (left) and Yahoo Music (right) In Hugewikihowever small values of k were sufficient to fit the trainingdata and test RMSE suffers from overfitting with highervalues of k Nonetheless NOMAD reliably converged in allcases

C SCALING ON COMMODITY HARDWAREIn this section we augment Section 54 by providing ac-

tual plots of the experiment We increase the number ofmachines from 1 to 32 and plot how the convergence ofNOMAD is affected by the number of machines As in Sec-tion 54 we used m1xlarge machines from Amazon WebServies (AWS) which have quad-core Intel Xeon E5430 CPUand 15GB of RAM per each

The overall pattern is identical to what was found in Fig-ure 19 10 and 9 of Section 53 Figure 15 shows how thetest RMSE decreases as a function of the number of up-dates As in Figure 19 the speed of convergence is fasterwith larger number of machines as the updated informationis more frequently exchanged Figure 16 shows the numberof updates performed per second in each computation coreof each machine NOMAD exhibits linear scaling on Net-flix and Hugewiki but slows down on Yahoo Music due toextreme sparsity of the data Figure 17 compares the con-vergence speed of different settings when the same amountof computational power is given to each on every datasetwe observe linear to super-linear scaling up to 32 machines

D TEST RMSE AS A FUNCTION OF THENUMBER OF UPDATES IN HPC CLUS-TER

In this section we plot the test RMSE as a function of thenumber of updates in HPC cluster which were not includedin the main text due to page constraints Figure 18 showssingle-machine multi-threaded experiments and Figure 19shows multi-machine distributed memory experiments

E COMPARISON OF ALGORITHMS FORDIFFERENT VALUES OF THE REGU-LARIZATION PARAMETER

In this section we augment experiments in Section 53by comparing the performance of NOMAD CCD++ andDSGD on different values of the regularization parameter λFigure 20 shows the result of the experiment As NOMADand DSGD are both stochastic gradient descent methodsthey behave similarly to each other when the regularizationparameter is changed On the other hand CCD++ whichdecreases the objective function more greedily behaves adifferently

For small values of λ CCD++ seems to overfit to themodel due to its greedy strategy it generally converges toa worse solution than others For high values of λ how-ever the strategy of CCD++ is advantageous and it showsrapid initial convergence Note that in all cases NOMAD iscompetitive as compared to both CCD++ as well as DSGD

F COMPARISON WITH GRAPHLABPowerGraph 22 [17] the latest version of GraphLab10 was

used in our experiments Since the Intel compiler is unableto compile GraphLab we compiled it with gcc The rest ofexperimental setting is the same as in Section 51

GraphLab provides a number of optimization algorithmsfor matrix completion However we found that the Stochas-tic Gradient Descent (SGD) implementation of GraphLabdoes not converge According to the GraphLab develop-ers this is because the abstraction currently provided byGraphLab is not suitable for SGD11 Therefore we use theAlternating Least Squares (ALS) algorithm for minimizingthe objective function (1)

Although each machine in our HPC cluster is equippedwith 32 GB of RAM and we distribute the work across32 machines in multi-machine experiments we had to tunenfibers parameter to avoid out of memory problems Inspite of tuning many different parameters we were unable torun GraphLab on the Hugewiki data in any setting For theother datasets we tried both synchronous and asynchronousengines of GraphLab and report the best of the two settings

Figure 21 shows results of single-machine multi-threadedexperiments while Figure 22 and Figure 23 shows multi-machine experiments on the HPC cluster and the commod-ity cluster respectively Clearly NOMAD not only convergesorders of magnitude faster than GraphLab in every settingbut the quality of solution is also significantly better

Compared to the multi-machine setting with large numberof machines (32) but small number of cores (4) per machineGraphLab converges faster when it is run in a single-machinesetting with a large number of cores (30) We conjecturethat this is because locking and unlocking a variable over anetwork is expensive In contrast since NOMAD is a lockfree algorithm it scales well in both settings

Although GraphLab biassgd is based on a model differentfrom (1) for completeness we provide comparisons with it

10Available for download from httpsgithubcomgraphlab-codegraphlab

11 GraphLab provides a different algorithm named biassgd; however, it is based on a model different from (1).


[Figure 13: Convergence behavior of NOMAD when the regularization parameter λ is varied. Panels (test RMSE vs. seconds; machines=8, cores=4, k=100): Netflix with λ ∈ {0.0005, 0.005, 0.05, 0.5}, Yahoo Music with λ ∈ {0.25, 0.5, 1, 2}, Hugewiki with λ ∈ {0.0025, 0.005, 0.01, 0.02}.]

[Figure 14: Convergence behavior of NOMAD when the latent dimension k is varied. Panels (test RMSE vs. seconds; machines=8, cores=4, k ∈ {10, 20, 50, 100}): Netflix (λ = 0.05), Yahoo Music (λ = 1.00), Hugewiki (λ = 0.01).]

[Figure 15: Test RMSE of NOMAD as a function of the number of updates on the commodity hardware cluster when the number of machines is varied. Panels (cores=4, k=100): Netflix (λ = 0.05, machines ∈ {1, 2, 4, 8, 16, 32}), Yahoo Music (λ = 1.00, machines ∈ {1, 2, 4, 8, 16, 32}), Hugewiki (λ = 0.01, machines ∈ {8, 16, 32}).]

[Figure 16: Number of updates of NOMAD per machine per core per second as a function of the number of machines on the commodity hardware cluster. Panels (cores=4, k=100): Netflix, Yahoo Music, Hugewiki.]


[Figure 17: Test RMSE of NOMAD as a function of computation time (seconds × machines × cores) on the commodity hardware cluster when the number of machines is varied. Panels (cores=4, k=100): Netflix (λ = 0.05), Yahoo Music (λ = 1.00), Hugewiki (λ = 0.01); machines ∈ {1, 2, 4, 8, 16, 32} (Hugewiki: {8, 16, 32}).]

[Figure 18: Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied. Panels (machines=1, k=100, cores ∈ {4, 8, 16, 30}): Netflix (λ = 0.05), Yahoo Music (λ = 1.00), Hugewiki (λ = 0.01).]

[Figure 19: Test RMSE of NOMAD as a function of the number of updates on the HPC cluster when the number of machines is varied. Panels (cores=4, k=100): Netflix (λ = 0.05, machines ∈ {1, 2, 4, 8, 16, 32}), Yahoo Music (λ = 1.00, machines ∈ {1, 2, 4, 8, 16, 32}), Hugewiki (λ = 0.01, machines ∈ {4, 8, 16, 32, 64}).]


[Figure 20: Comparison of NOMAD, DSGD, and CCD++ on the HPC cluster when the regularization parameter λ is varied; the value of λ increases from top to bottom. Columns (test RMSE vs. seconds, cores=4, k=100): Netflix (machines=32, λ ∈ {0.0125, 0.025, 0.05, 0.1, 0.2}), Yahoo Music (machines=32, λ ∈ {0.25, 0.5, 1, 2, 4}), Hugewiki (machines=64, λ ∈ {0.0025, 0.005, 0.01, 0.02, 0.04}).]


Unfortunately, GraphLab biassgd crashed when we ran it on more than 16 machines, so we ran it on only 16 machines and assumed that GraphLab would scale linearly up to 32 machines in order to generate the plots in Figure 23. Again, NOMAD was orders of magnitude faster than GraphLab and converged to a better solution.


[Figure 21: Comparison of NOMAD and GraphLab ALS on a single machine with 30 computation cores. Panels (test RMSE vs. seconds; machines=1, cores=30, k=100): Netflix (λ = 0.05), Yahoo Music (λ = 1.00).]

[Figure 22: Comparison of NOMAD and GraphLab ALS on the HPC cluster. Panels (test RMSE vs. seconds; machines=32, cores=4, k=100): Netflix (λ = 0.05), Yahoo Music (λ = 1.00).]

[Figure 23: Comparison of NOMAD, GraphLab ALS, and GraphLab biassgd on the commodity hardware cluster. Panels (test RMSE vs. seconds; machines=32, cores=4, k=100): Netflix (λ = 0.05), Yahoo Music (λ = 1.00).]




APPENDIXA EFFECT OF THE REGULARIZATION

PARAMETERIn this subsection we study the convergence behavior of

NOMAD as we change the regularization parameter λ (Fig-ure 13) Note that in Netflix data (left) for non-optimalchoices of the regularization parameter the test RMSE in-creases from the initial solution as the model overfits or un-derfits to the training data While NOMAD reliably con-verges in all cases on Netflix the convergence is notablyfaster with higher values of λ this is expected because reg-ularization smooths the objective function and makes theoptimization problem easier to solve On other datasets thespeed of convergence was not very sensitive to the selectionof the regularization parameter

B EFFECT OF THE LATENT DIMENSIONIn this subsection we study the convergence behavior of

NOMAD as we change the dimensionality parameter k (Fig-ure 14) In general the convergence is faster for smaller val-ues of k as the computational cost of SGD updates (9) and(10) is linear to k On the other hand the model gets richerwith higher values of k as its parameter space expands itbecomes capable of picking up weaker signals in the datawith the risk of overfitting This is observed in Figure 14with Netflix (left) and Yahoo Music (right) In Hugewikihowever small values of k were sufficient to fit the trainingdata and test RMSE suffers from overfitting with highervalues of k Nonetheless NOMAD reliably converged in allcases

C SCALING ON COMMODITY HARDWAREIn this section we augment Section 54 by providing ac-

tual plots of the experiment We increase the number ofmachines from 1 to 32 and plot how the convergence ofNOMAD is affected by the number of machines As in Sec-tion 54 we used m1xlarge machines from Amazon WebServies (AWS) which have quad-core Intel Xeon E5430 CPUand 15GB of RAM per each

The overall pattern is identical to what was found in Fig-ure 19 10 and 9 of Section 53 Figure 15 shows how thetest RMSE decreases as a function of the number of up-dates As in Figure 19 the speed of convergence is fasterwith larger number of machines as the updated informationis more frequently exchanged Figure 16 shows the numberof updates performed per second in each computation coreof each machine NOMAD exhibits linear scaling on Net-flix and Hugewiki but slows down on Yahoo Music due toextreme sparsity of the data Figure 17 compares the con-vergence speed of different settings when the same amountof computational power is given to each on every datasetwe observe linear to super-linear scaling up to 32 machines

D TEST RMSE AS A FUNCTION OF THENUMBER OF UPDATES IN HPC CLUS-TER

In this section we plot the test RMSE as a function of thenumber of updates in HPC cluster which were not includedin the main text due to page constraints Figure 18 showssingle-machine multi-threaded experiments and Figure 19shows multi-machine distributed memory experiments

E COMPARISON OF ALGORITHMS FORDIFFERENT VALUES OF THE REGU-LARIZATION PARAMETER

In this section we augment experiments in Section 53by comparing the performance of NOMAD CCD++ andDSGD on different values of the regularization parameter λFigure 20 shows the result of the experiment As NOMADand DSGD are both stochastic gradient descent methodsthey behave similarly to each other when the regularizationparameter is changed On the other hand CCD++ whichdecreases the objective function more greedily behaves adifferently

For small values of λ CCD++ seems to overfit to themodel due to its greedy strategy it generally converges toa worse solution than others For high values of λ how-ever the strategy of CCD++ is advantageous and it showsrapid initial convergence Note that in all cases NOMAD iscompetitive as compared to both CCD++ as well as DSGD

F COMPARISON WITH GRAPHLABPowerGraph 22 [17] the latest version of GraphLab10 was

used in our experiments Since the Intel compiler is unableto compile GraphLab we compiled it with gcc The rest ofexperimental setting is the same as in Section 51

GraphLab provides a number of optimization algorithmsfor matrix completion However we found that the Stochas-tic Gradient Descent (SGD) implementation of GraphLabdoes not converge According to the GraphLab develop-ers this is because the abstraction currently provided byGraphLab is not suitable for SGD11 Therefore we use theAlternating Least Squares (ALS) algorithm for minimizingthe objective function (1)

Although each machine in our HPC cluster is equippedwith 32 GB of RAM and we distribute the work across32 machines in multi-machine experiments we had to tunenfibers parameter to avoid out of memory problems Inspite of tuning many different parameters we were unable torun GraphLab on the Hugewiki data in any setting For theother datasets we tried both synchronous and asynchronousengines of GraphLab and report the best of the two settings

Figure 21 shows results of single-machine multi-threadedexperiments while Figure 22 and Figure 23 shows multi-machine experiments on the HPC cluster and the commod-ity cluster respectively Clearly NOMAD not only convergesorders of magnitude faster than GraphLab in every settingbut the quality of solution is also significantly better

Compared to the multi-machine setting with large numberof machines (32) but small number of cores (4) per machineGraphLab converges faster when it is run in a single-machinesetting with a large number of cores (30) We conjecturethat this is because locking and unlocking a variable over anetwork is expensive In contrast since NOMAD is a lockfree algorithm it scales well in both settings

Although GraphLab biassgd is based on a model differentfrom (1) for completeness we provide comparisons with it

10Available for download from httpsgithubcomgraphlab-codegraphlab

11GraphLab provides a different algorithm named biassgdHowever it is based on a model different from (1)

13

0 20 40 60 80 100 120 14009

095

1

105

11

115

seconds

test

RM

SE

Netflix machines=8 cores=4 k = 100

λ = 00005

λ = 0005

λ = 005

λ = 05

0 20 40 60 80 100 120 140

22

24

26

28

30

seconds

test

RMSE

Yahoo machines=8 cores=4 k = 100

λ = 025

λ = 05

λ = 1

λ = 2

0 20 40 60 80 100 120 140

06

07

08

09

1

11

seconds

test

RMSE

Hugewiki machines=8 cores=4 k = 100

λ = 00025

λ = 0005

λ = 001

λ = 002

Figure 13 Convergence behavior of NOMAD when the regularization parameter λ is varied

0 20 40 60 80 100 120 140

092

094

096

098

1

seconds

test

RM

SE

Netflix machines=8 cores=4 λ = 005

k = 10

k = 20

k = 50

k = 100

0 20 40 60 80 100 120 14022

23

24

25

seconds

test

RMSE

Yahoo machines=8 cores=4 λ = 100

k = 10

k = 20

k = 50

k = 100

0 500 1000 1500 200005

06

07

08

seconds

test

RMSE

Hugewiki machines=8 cores=4 λ = 001

k = 10

k = 20

k = 50

k = 100

Figure 14 Convergence behavior of NOMAD when the latent dimension k is varied

0 02 04 06 08 1 12 14

middot1010

092

094

096

098

1

number of updates

test

RM

SE

Netflix cores=4 λ = 005 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 05 1 15 2 25 3

middot1010

22

23

24

25

26

27

number of updates

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 05 1 15 2

middot1011

06

08

1

12

14

number of updates

test

RMSE

Hugewiki cores=4 λ = 001 k = 100

machines=8

machines=16

machines=32

Figure 15 Test RMSE of NOMAD as a function of the number of updates on a commodity hardware cluster when thenumber of machines is varied

0 5 10 15 20 25 30 350

05

1

15

middot106

number of machines

up

date

sp

erm

ach

ine

per

core

per

sec

Netflix cores=4 λ = 005 k = 100

0 5 10 15 20 25 30 350

05

1

middot106

number of machines

updatesper

machineper

core

per

sec

Yahoo cores=4 λ = 100 k = 100

10 15 20 25 300

02

04

06

08

1

12

middot106

number of machines

updatesper

machineper

core

per

sec

Hugewiki cores=4 λ = 100 k = 100

Figure 16 Number of updates of NOMAD per machine per core per second as a function of the number of machines on acommodity hardware cluster

14

0 02 04 06 08 1 12

middot104

092

094

096

098

1

seconds times machines times cores

test

RM

SE

Netflix cores=4 λ = 005 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 05 1 15 2

middot104

22

23

24

25

26

27

seconds times machines times cores

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 02 04 06 08 1

middot105

06

08

1

12

14

seconds times machines times cores

test

RMSE

Hugewiki cores=4 λ = 001 k = 100

machines=8

machinees=16

machines=32

Figure 17 Test RMSE of NOMAD as a function of computation time (time in seconds times the number of machines times thenumber of cores per each machine) on a commodity hardware cluster when the number of machines is varied

0 02 04 06 08 1

middot1010

092

094

096

098

number of updates

test

RM

SE

Netflix machines=1 λ = 005 k = 100

cores=4

cores=8

cores=16

cores=30

0 05 1 15 2 25 3

middot1010

22

24

26

28

number of updates

test

RMSE

Yahoo machines=1 λ = 100 k = 100

cores=4

cores=8

cores=16

cores=30

0 1 2 3 4 5

middot1011

05

055

06

065

number of updates

test

RMSE

Hugewiki machines=1 λ = 001 k = 100

cores=4

cores=8

cores=16

cores=30

Figure 18 Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied

0 02 04 06 08 1 12 14

middot1010

092

094

096

098

1

number of updates

test

RM

SE

Netflix cores=4 λ = 005 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 05 1 15 2 25 3

middot1010

22

23

24

25

26

27

number of updates

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 1 2 3 4 5

middot1011

05

055

06

065

07

number of updates

test

RMSE

Hugewiki cores=4 λ = 001 k = 100

machines=4

machines=8

machines=16

machines=32

machines=64

Figure 19 Test RMSE of NOMAD as a function of the number of updates on a HPC cluster when the number of machinesis varied

15

0 20 40 60 8009

095

1

105

11

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 00125 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80

22

24

26

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 025 k = 100

NOMAD

DSGD

CCD++

0 500 1000 1500 200005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 00025 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 8009

092

094

096

098

1

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 0025 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80

22

24

26

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 05 k = 100

NOMAD

DSGD

CCD++

0 500 1000 1500 200005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 0005 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80 100 120 140

092

094

096

098

1

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 005 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80

22

24

26

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 100 k = 100

NOMAD

DSGD

CCD++

0 500 1000 1500 200005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 001 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80

094

096

098

1

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 01 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 8022

23

24

25

26

27

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 2 k = 100

NOMAD

DSGD

CCD++

0 500 1000 1500 200005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 002 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80098

1

102

104

106

108

11

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 02 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80

24

25

26

27

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 400 k = 100

NOMAD

DSGD

CCD++

0 500 1000 1500 200005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 004 k = 100

NOMAD

DSGD

CCD++

Figure 20 Comparison of NOMAD DSGD and CCD++ on a HPC cluster when the regularization paramter λ is varied Thevalue of λ increases from top to bottom

16

on commodity hardware cluster Unfortunately GraphLabbiassgd crashed when we ran it on more than 16 machinesso we had to run it on only 16 machines and assumed GraphLab

will linearly scale up to 32 machines in order to generateplots in Figure 23 Again NOMAD was orders of magnitudefaster than GraphLab and converges to a better solution

17

0 500 1000 1500 2000 2500 3000

095

1

105

11

seconds

test

RM

SE

Netflix machines=1 cores=30 λ = 005 k = 100

NOMAD

GraphLab ALS

0 1000 2000 3000 4000 5000 6000

22

24

26

28

30

seconds

test

RMSE

Yahoo machines=1 cores=30 λ = 100 k = 100

NOMAD

GraphLab ALS

Figure 21 Comparison of NOMAD and GraphLab on a single machine with 30 computation cores

0 100 200 300 400 500 600

1

15

2

25

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 005 k = 100

NOMAD

GraphLab ALS

0 100 200 300 400 500 600

30

40

50

60

70

80

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 100 k = 100

NOMAD

GraphLab ALS

Figure 22 Comparison of NOMAD and GraphLab on a HPC cluster

0 500 1000 1500 2000 2500

1

12

14

16

18

2

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 005 k = 100

NOMAD

GraphLab ALS

GraphLab biassgd

0 500 1000 1500 2000 2500 3000

25

30

35

40

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 100 k = 100

NOMAD

GraphLab ALS

GrpahLab biassgd

Figure 23 Comparison of NOMAD and GraphLab on a commodity hardware cluster

18

  • Introduction
  • Background
    • Alternating Least Squares
    • Coordinate Descent
    • Stochastic Gradient Descent
      • NOMAD
        • Description
        • Complexity Analysis
        • Dynamic Load Balancing
        • Hybrid Architecture
        • Implementation Details
          • Related Work
            • Map-Reduce and Friends
            • Asynchronous Algorithms
            • Numerical Linear Algebra
            • Discussion
              • Experiments
                • Experimental Setup
                • Scaling in Number of Cores
                • Scaling as a Fixed Dataset is Distributed Across Workers
                • Scaling on Commodity Hardware
                • Scaling as both Dataset Size and Number of Machines Grows
                  • Conclusion and Future Work
                  • Acknowledgements
                  • Effect of the Regularization Parameter
                  • Effect of the Latent Dimension
                  • Scaling on Commodity Hardware
                  • Test RMSE as a function of the number of updates in HPC cluster
                  • Comparison of Algorithms for Different Values of the Regularization Parameter
                  • Comparison with GraphLab
Page 10: NOMAD: Non-locking, stOchastic Multi-machine algorithm ...We develop an e cient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchas-tic Multi-machine

0 20 40 60 80 100 120

092

094

096

098

1

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 005 k = 100

NOMAD

DSGD

DSGD++

CCD++

0 20 40 60 80 100 120

22

24

26

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 100 k = 100

NOMAD

DSGD

DSGD++

CCD++

0 200 400 60005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 001 k = 100

NOMAD

DSGD

DSGD++

CCD++

Figure 8 Comparison of NOMAD DSGD DSGD++ and CCD++ on a HPC cluster

0 1000 2000 3000 4000 5000 6000

092

094

096

098

1

seconds times machines times cores

test

RM

SE

Netflix cores=4 λ = 005 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 2000 4000 6000 8000

22

23

24

25

26

27

seconds times machines times cores

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 05 1 15 2

middot105

05

055

06

065

07

seconds times machines times cores

test

RMSE

Hugewiki cores=4 λ = 001 k = 100

machines=4

machines=8

machines=16

machines=32

machines=64

Figure 9 Test RMSE of NOMAD as a function of computation time (time in seconds times the number of machines times thenumber of cores per each machine) on a HPC cluster when the number of machines is varied

computation and communication can utilize all four cores forcomputation In spite of this disadvantage Figure 11 showsthat NOMAD outperforms all other algorithms in this set-ting as well In this plot we fixed the number of machinesto 32 on Netflix and Hugewiki NOMAD converges morerapidly to a better solution Recall that on Yahoo Music allfour algorithms performed very similarly on a HPC clusterin Section 53 However on commodity hardware NOMADoutperforms the other algorithms This shows that the ef-ficiency of network communication plays a very importantrole in commodity hardware clusters where the communica-tion is relatively slow On Hugewiki however the numberof columns is very small compared to the number of ratingsand thus network communication plays smaller role in thisdataset compared to others Therefore initial convergenceof DSGD is a bit faster than NOMAD as it uses all fourcores on computation while NOMAD uses only two Stillthe overall convergence speed is similar and NOMAD findsa better quality solution

As in Section 53 we increased the number of machinesfrom 1 to 32 and studied the scaling behavior of NOMADThe overall trend was identical to what we observed in Fig-ure 10 and 9 due to page constraints the plots for thisexperiment can be found in the Appendix C

55 Scaling as both Dataset Size and Numberof Machines Grows

In previous sections (Section 53 and Section 54) westudied the scalability of algorithms by partitioning a fixedamount of data into increasing number of machines In real-world applications of collaborative filtering however the

0 05 1 15 2 25 3

middot1010

22

23

24

25

26

27

number of updates

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 5 10 15 20 25 300

1

2

3

4

middot106

number of machines

up

date

sp

erm

ach

ine

per

core

per

sec

Netflix cores=4 λ = 005 k = 100

Netflix

Yahoo

Hugewiki

Figure 10 Results on HPC cluster when the number ofmachines is varied Left Test RMSE of NOMAD as a func-tion of the number of updates on Netflix and Yahoo MusicRight Number of updates of NOMAD per machine per coreper second as a function of the number of machines

size of the data should grow over time as new users areadded to the system Therefore to match the increasedamount of data with equivalent amount of physical memoryand computational power the number of machines shouldincrease as well The aim of this section is to compare thescaling behavior of NOMAD and that of other algorithmsin this realistic scenario

To simulate such a situation we generated synthetic datasetswhose characteristics resemble real data the number of rat-ings for each user and each item is sampled from the corre-sponding empirical distribution of the Netflix data As weincrease the number of machines from 4 to 32 we fixed thenumber of items to be the same to that of Netflix (17770)and increased the number of users to be proportional to the

10

0 100 200 300 400 500

092

094

096

098

1

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 005 k = 100

NOMAD

DSGD

DSGD++

CCD++

0 100 200 300 400 500 600

22

24

26

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 100 k = 100

NOMAD

DSGD

DSGD++

CCD++

0 1000 2000 3000 400005

055

06

065

seconds

test

RMSE

Hugewiki machines=32 cores=4 λ = 100 k = 100

NOMAD

DSGD

DSGD++

CCD++

Figure 11 Comparison of NOMAD DSGD DSGD++ and CCD++ on a commodity hardware cluster

number of machines (480189 times the number of machines9)Therefore the expected number of ratings in each dataset isproportional to the number of machines (99072112 times thenumber of machines) as well

Conditioned on the number of ratings for each user anditem the nonzero locations are sampled uniformly at ran-dom Ground-truth user parameters wirsquos and item param-eters hj rsquos are generated from 100-dimensional standard iso-metric Gaussian distribution and for each rating Aij Gaus-sian noise with mean zero and standard deviation 01 isadded to the ldquotruerdquo rating 〈wihj〉

Figure 12 shows that the comparative advantage of NO-MAD against DSGD DSGD++ and CCD++ increases aswe grow the scale of the problem NOMAD clearly out-performs other methods on all configurations DSGD++ isvery competitive on the small scale but as the size of theproblem grows NOMAD shows better scaling behavior

6 CONCLUSION AND FUTURE WORKFrom our experimental study we conclude that

bull On a single machine NOMAD shows near-linear scal-ing up to 30 threads

bull When a fixed size dataset is distributed across multiplemachines NOMAD shows near-linear scaling up to 32machines

bull Both in shared-memory and distributed-memory set-ting NOMAD exhibits superior performance againststate-of-the-art competitors in commodity hardwarecluster the comparative advantage is more conspicu-ous

bull When both the size of the data as well as the numberof machines grow the scaling behavior of NOMAD ismuch nicer than its competitors

Although we only discussed the matrix completion prob-lem in this paper it is worth noting that the idea of NO-MAD is more widely applicable Specifically ideas discussedin this paper can be easily adapted as long as the objectivefunction can be written as

f(WH) =sumijisinΩ

fij(wihj)

As part of our ongoing work we are investigating ways torewrite Support Vector Machines (SVMs) binary logistic9480189 is the number of users in Netflix who have at leastone rating

regression as a saddle point problem which have the abovestructure

Inference in Latent Dirichlet Allocation (LDA) using a col-lapsed Gibbs sampler has a similar structure as the stochas-tic gradient descent updates for matrix factorization Anadditional complication in LDA is that the variables needto be normalized We are also investigating how the NO-MAD framework can be used for LDA

7 ACKNOWLEDGEMENTSWe thank the anonymous reviewers for their construc-

tive comments We thank the Texas Advanced ComputingCenter at University of Texas and the Research Comput-ing group at Purdue University for providing infrastructureand timely support for our experiments Computing ex-periments on commodity hardware were made possible byan AWS in Education Machine Learning Research GrantAward This material is partially based upon work sup-ported by the National Science Foundation under grant nosIIS-1219015 and CCF-1117055

References[1] Apache Hadoop 2009

httphadoopapacheorgcore

[2] Graphlab datasets 2013httpgraphlaborgdownloadsdatasets

[3] Intel thread building blocks 2013httpswwwthreadingbuildingblocksorg

[4] A Agarwal O Chapelle M Dudık and J LangfordA reliable effective terascale linear learning systemCoRR abs11104198 2011

[5] R M Bell and Y Koren Lessons from the netflix prizechallenge SIGKDD Explorations 9(2)75ndash79 2007

[6] D P Bertsekas and J N Tsitsiklis Parallel and Dis-tributed Computation Numerical Methods Athena Sci-entific 1997

[7] L Bottou and O Bousquet The tradeoffs of large-scalelearning Optimization for Machine Learning pages351ndash368 2011

[8] D Chazan and W Miranker Chaotic relaxation Lin-ear Algebra and its Applications 2199ndash222 1969

11

0 1000 2000 3000 4000 500015

16

17

18

seconds

test

RMSE

machines=4 cores=4 k = 100 λ = 001

NOMAD

DSGD

DSGD++

CCD++

0 02 04 06 08 1

middot104

16

18

2

22

seconds

test

RMSE

machines=16 cores=4 k = 100 λ = 001

NOMAD

DSGD

DSGD++

CCD++

0 02 04 06 08 1

middot104

16

18

2

22

24

seconds

test

RMSE

machines=32 cores=4 k = 100 λ = 001

NOMAD

DSGD

DSGD++

CCD++

Figure 12 Comparison of algorithms when both dataset size and the number of machines grows Left 4 machines middle16 machines right 32 machines

[9] J Dean and S Ghemawat MapReduce simplified dataprocessing on large clusters Communications of theACM 51(1)107ndash113 2008

[10] G Dror N Koenigstein Y Koren and M WeimerThe Yahoo music dataset and KDD-Cuprsquo11 Journalof Machine Learning Research-Proceedings Track 188ndash18 2012

[11] A Frommer and D B Szyld On asynchronous iter-ations Journal of Computational and Applied Mathe-matics 123201ndash216 2000

[12] R Gemulla E Nijkamp P J Haas and Y Sisma-nis Large-scale matrix factorization with distributedstochastic gradient descent In Proceedings of theconference on Knowledge Discovery and Data Miningpages 69ndash77 2011

[13] J E Gonzalez Y Low H Gu D Bickson andC Guestrin Powergraph Distributed graph-parallelcomputation on natural graphs In Proceedings of theUSENIX Symposium on Operating Systems Design andImplementation pages 17ndash30 2012

[14] T Hoefler P Gottschling W Rehm and A Lums-daine Optimizing a conjugate gradient solver with nonblocking operators Parallel Computing 2007

[15] C J Hsieh and I S Dhillon Fast coordinate descentmethods with variable selection for non-negative ma-trix factorization In Proceedings of the conference onKnowledge Discovery and Data Mining pages 1064ndash1072 August 2011

[16] H Kushner and D Clark Stochastic ApproximationMethods for Constrained and Unconstrained Systemsvolume 26 of Applied Mathematical Sciences SpringerNew York 1978

[17] Y Low J Gonzalez A Kyrola D BicksonC Guestrin and J M Hellerstein Distributedgraphlab A framework for machine learning and datamining in the cloud In Proceedings of the InternationalConference on Very Large Data Bases pages 716ndash7272012

[18] B Recht and C Re Parallel stochastic gradient algo-rithms for large-scale matrix completion MathematicalProgramming Computation 5(2)201ndash226 June 2013

[19] B Recht C Re S Wright and F Niu Hogwild Alock-free approach to parallelizing stochastic gradientdescent In Proceedings of the conference on Advancesin Neural Information Processing Systems pages 693ndash701 2011

[20] P Richtarik and M Takac Distributed coordinate de-scent method for learning with big data 2013 URLhttparxivorgabs13102059

[21] H E Robbins and S Monro A stochastic approxi-mation method Annals of Mathematical Statistics 22400ndash407 1951

[22] S Shalev-Schwartz and N Srebro SVM optimizationInverse dependence on training set size In Proceedingsof the International Conference on Machine Learningpages 928ndash935 2008

[23] A J Smola and S Narayanamurthy An architecturefor parallel topic models In Proceedings of the Inter-national Conference on Very Large Data Bases pages703ndash710 2010

[24] S Suri and S Vassilvitskii Counting triangles and thecurse of the last reducer In Proceedings of the Interna-tional Conference on the World Wide Web pages 607ndash614 2011

[25] C Teflioudi F Makari and R Gemulla Distributedmatrix completion In Proceedings of the InternationalConference on Data Mining pages 655ndash664 2012

[26] H-F Yu C-J Hsieh S Si and I S Dhillon Scal-able coordinate descent approaches to parallel matrixfactorization for recommender systems In Proceedingsof the International Conference on Data Mining pages765ndash774 2012

[27] Y Zhou D Wilkinson R Schreiber and R PanLarge-scale parallel collaborative filtering for the netflixprize In Proceedings of the conference on AlgorithmicAspects in Information and Management pages 337ndash348 2008

[28] Y Zhuang W-S Chin Y-C Juan and C-J LinA fast parallel SGD for matrix factorization in sharedmemory systems In Proceedings of the ACM conferenceon Recommender systems pages 249ndash256 2013

12

APPENDIXA EFFECT OF THE REGULARIZATION

PARAMETERIn this subsection we study the convergence behavior of

NOMAD as we change the regularization parameter λ (Fig-ure 13) Note that in Netflix data (left) for non-optimalchoices of the regularization parameter the test RMSE in-creases from the initial solution as the model overfits or un-derfits to the training data While NOMAD reliably con-verges in all cases on Netflix the convergence is notablyfaster with higher values of λ this is expected because reg-ularization smooths the objective function and makes theoptimization problem easier to solve On other datasets thespeed of convergence was not very sensitive to the selectionof the regularization parameter

B EFFECT OF THE LATENT DIMENSIONIn this subsection we study the convergence behavior of

NOMAD as we change the dimensionality parameter k (Fig-ure 14) In general the convergence is faster for smaller val-ues of k as the computational cost of SGD updates (9) and(10) is linear to k On the other hand the model gets richerwith higher values of k as its parameter space expands itbecomes capable of picking up weaker signals in the datawith the risk of overfitting This is observed in Figure 14with Netflix (left) and Yahoo Music (right) In Hugewikihowever small values of k were sufficient to fit the trainingdata and test RMSE suffers from overfitting with highervalues of k Nonetheless NOMAD reliably converged in allcases

C SCALING ON COMMODITY HARDWAREIn this section we augment Section 54 by providing ac-

tual plots of the experiment We increase the number ofmachines from 1 to 32 and plot how the convergence ofNOMAD is affected by the number of machines As in Sec-tion 54 we used m1xlarge machines from Amazon WebServies (AWS) which have quad-core Intel Xeon E5430 CPUand 15GB of RAM per each

The overall pattern is identical to what was found in Fig-ure 19 10 and 9 of Section 53 Figure 15 shows how thetest RMSE decreases as a function of the number of up-dates As in Figure 19 the speed of convergence is fasterwith larger number of machines as the updated informationis more frequently exchanged Figure 16 shows the numberof updates performed per second in each computation coreof each machine NOMAD exhibits linear scaling on Net-flix and Hugewiki but slows down on Yahoo Music due toextreme sparsity of the data Figure 17 compares the con-vergence speed of different settings when the same amountof computational power is given to each on every datasetwe observe linear to super-linear scaling up to 32 machines

D TEST RMSE AS A FUNCTION OF THENUMBER OF UPDATES IN HPC CLUS-TER

In this section we plot the test RMSE as a function of thenumber of updates in HPC cluster which were not includedin the main text due to page constraints Figure 18 showssingle-machine multi-threaded experiments and Figure 19shows multi-machine distributed memory experiments

E COMPARISON OF ALGORITHMS FORDIFFERENT VALUES OF THE REGU-LARIZATION PARAMETER

In this section we augment experiments in Section 53by comparing the performance of NOMAD CCD++ andDSGD on different values of the regularization parameter λFigure 20 shows the result of the experiment As NOMADand DSGD are both stochastic gradient descent methodsthey behave similarly to each other when the regularizationparameter is changed On the other hand CCD++ whichdecreases the objective function more greedily behaves adifferently

For small values of λ, CCD++ seems to overfit due to its greedy strategy, and it generally converges to a worse solution than the others. For high values of λ, however, the greedy strategy of CCD++ is advantageous, and it shows rapid initial convergence. Note that in all cases NOMAD is competitive with both CCD++ and DSGD.

F. COMPARISON WITH GRAPHLAB

PowerGraph 2.2 [17], the latest version of GraphLab,^10 was used in our experiments. Since the Intel compiler is unable to compile GraphLab, we compiled it with gcc. The rest of the experimental setting is the same as in Section 5.1.

GraphLab provides a number of optimization algorithms for matrix completion. However, we found that the Stochastic Gradient Descent (SGD) implementation of GraphLab does not converge; according to the GraphLab developers, this is because the abstraction currently provided by GraphLab is not suitable for SGD.^11 Therefore we use the Alternating Least Squares (ALS) algorithm for minimizing the objective function (1).
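For context, ALS alternates closed-form least-squares solves for the user and item factors; the following is a minimal sketch of the user-side step for the squared-error objective with L2 regularization (a generic ALS update, not GraphLab's implementation; `lam` is a placeholder regularization weight).

```python
import numpy as np

def als_update_user(H, user_ratings, lam):
    """Closed-form update of one user factor w_i:
    solve (H_i^T H_i + lam * I) w_i = H_i^T a_i, where H_i stacks the factors
    of the items rated by this user and a_i holds the corresponding ratings."""
    items = [j for j, _ in user_ratings]
    H_i = H[items]                                # |Omega_i| x k
    a_i = np.array([a for _, a in user_ratings])  # |Omega_i|
    k = H.shape[1]
    return np.linalg.solve(H_i.T @ H_i + lam * np.eye(k), H_i.T @ a_i)
```

The item-side update is symmetric, with the roles of W and H exchanged.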

Although each machine in our HPC cluster is equipped with 32 GB of RAM and we distribute the work across 32 machines in the multi-machine experiments, we had to tune the nfibers parameter to avoid out-of-memory problems. In spite of tuning many different parameters, we were unable to run GraphLab on the Hugewiki data in any setting. For the other datasets we tried both the synchronous and asynchronous engines of GraphLab and report the better of the two settings.

Figure 21 shows results of the single-machine multi-threaded experiments, while Figure 22 and Figure 23 show the multi-machine experiments on the HPC cluster and the commodity cluster, respectively. Clearly, NOMAD not only converges orders of magnitude faster than GraphLab in every setting, but the quality of its solution is also significantly better.

Compared to the multi-machine setting with a large number of machines (32) but a small number of cores (4) per machine, GraphLab converges faster when it is run in a single-machine setting with a large number of cores (30). We conjecture that this is because locking and unlocking a variable over the network is expensive. In contrast, since NOMAD is a lock-free algorithm, it scales well in both settings.

Although GraphLab biassgd is based on a model different from (1), for completeness we provide comparisons with it on the commodity hardware cluster.

^10 Available for download from https://github.com/graphlab-code/graphlab

^11 GraphLab provides a different algorithm named biassgd; however, it is based on a model different from (1).


Figure 13: Convergence behavior of NOMAD when the regularization parameter λ is varied. [Plots of test RMSE vs. seconds, machines=8, cores=4, k=100: Netflix with λ ∈ {0.0005, 0.005, 0.05, 0.5}; Yahoo! Music with λ ∈ {0.25, 0.5, 1, 2}; Hugewiki with λ ∈ {0.0025, 0.005, 0.01, 0.02}.]

Figure 14: Convergence behavior of NOMAD when the latent dimension k is varied. [Plots of test RMSE vs. seconds, machines=8, cores=4, k ∈ {10, 20, 50, 100}: Netflix (λ = 0.05), Yahoo! Music (λ = 1.00), Hugewiki (λ = 0.01).]

Figure 15: Test RMSE of NOMAD as a function of the number of updates on the commodity hardware cluster when the number of machines is varied. [Plots of test RMSE vs. number of updates, cores=4, k=100: Netflix (λ = 0.05) and Yahoo! Music (λ = 1.00) with machines ∈ {1, 2, 4, 8, 16, 32}; Hugewiki (λ = 0.01) with machines ∈ {8, 16, 32}.]

Figure 16: Number of updates of NOMAD per machine, per core, per second as a function of the number of machines on the commodity hardware cluster. [Plots for Netflix, Yahoo! Music, and Hugewiki, cores=4, k=100.]


Figure 17: Test RMSE of NOMAD as a function of computation time (time in seconds × the number of machines × the number of cores per machine) on the commodity hardware cluster when the number of machines is varied. [Plots, cores=4, k=100: Netflix (λ = 0.05) and Yahoo! Music (λ = 1.00) with machines ∈ {1, 2, 4, 8, 16, 32}; Hugewiki (λ = 0.01) with machines ∈ {8, 16, 32}.]

Figure 18: Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied. [Plots, machines=1, k=100, cores ∈ {4, 8, 16, 30}: Netflix (λ = 0.05), Yahoo! Music (λ = 1.00), Hugewiki (λ = 0.01).]

Figure 19: Test RMSE of NOMAD as a function of the number of updates on the HPC cluster when the number of machines is varied. [Plots, cores=4, k=100: Netflix (λ = 0.05) and Yahoo! Music (λ = 1.00) with machines ∈ {1, 2, 4, 8, 16, 32}; Hugewiki (λ = 0.01) with machines ∈ {4, 8, 16, 32, 64}.]


Figure 20: Comparison of NOMAD, DSGD, and CCD++ on the HPC cluster when the regularization parameter λ is varied; the value of λ increases from top to bottom. [Plots of test RMSE vs. seconds, k=100: Netflix (machines=32) with λ ∈ {0.0125, 0.025, 0.05, 0.1, 0.2}; Yahoo! Music (machines=32) with λ ∈ {0.25, 0.5, 1, 2, 4}; Hugewiki (machines=64) with λ ∈ {0.0025, 0.005, 0.01, 0.02, 0.04}.]


Unfortunately, GraphLab biassgd crashed when we ran it on more than 16 machines, so we ran it on only 16 machines and assumed that GraphLab biassgd would scale linearly up to 32 machines in order to generate the plots in Figure 23. Again, NOMAD was orders of magnitude faster than GraphLab and converged to a better solution.


Figure 21: Comparison of NOMAD and GraphLab on a single machine with 30 computation cores. [Plots of test RMSE vs. seconds for NOMAD and GraphLab ALS on Netflix and Yahoo! Music, k=100.]

Figure 22: Comparison of NOMAD and GraphLab on the HPC cluster. [Plots of test RMSE vs. seconds for NOMAD and GraphLab ALS on Netflix and Yahoo! Music, machines=32, cores=4, k=100.]

Figure 23: Comparison of NOMAD and GraphLab on the commodity hardware cluster. [Plots of test RMSE vs. seconds for NOMAD, GraphLab ALS, and GraphLab biassgd on Netflix and Yahoo! Music, machines=32, cores=4, k=100.]


APPENDIXA EFFECT OF THE REGULARIZATION

PARAMETERIn this subsection we study the convergence behavior of

NOMAD as we change the regularization parameter λ (Fig-ure 13) Note that in Netflix data (left) for non-optimalchoices of the regularization parameter the test RMSE in-creases from the initial solution as the model overfits or un-derfits to the training data While NOMAD reliably con-verges in all cases on Netflix the convergence is notablyfaster with higher values of λ this is expected because reg-ularization smooths the objective function and makes theoptimization problem easier to solve On other datasets thespeed of convergence was not very sensitive to the selectionof the regularization parameter

B EFFECT OF THE LATENT DIMENSIONIn this subsection we study the convergence behavior of

NOMAD as we change the dimensionality parameter k (Fig-ure 14) In general the convergence is faster for smaller val-ues of k as the computational cost of SGD updates (9) and(10) is linear to k On the other hand the model gets richerwith higher values of k as its parameter space expands itbecomes capable of picking up weaker signals in the datawith the risk of overfitting This is observed in Figure 14with Netflix (left) and Yahoo Music (right) In Hugewikihowever small values of k were sufficient to fit the trainingdata and test RMSE suffers from overfitting with highervalues of k Nonetheless NOMAD reliably converged in allcases

C SCALING ON COMMODITY HARDWAREIn this section we augment Section 54 by providing ac-

tual plots of the experiment We increase the number ofmachines from 1 to 32 and plot how the convergence ofNOMAD is affected by the number of machines As in Sec-tion 54 we used m1xlarge machines from Amazon WebServies (AWS) which have quad-core Intel Xeon E5430 CPUand 15GB of RAM per each

The overall pattern is identical to what was found in Fig-ure 19 10 and 9 of Section 53 Figure 15 shows how thetest RMSE decreases as a function of the number of up-dates As in Figure 19 the speed of convergence is fasterwith larger number of machines as the updated informationis more frequently exchanged Figure 16 shows the numberof updates performed per second in each computation coreof each machine NOMAD exhibits linear scaling on Net-flix and Hugewiki but slows down on Yahoo Music due toextreme sparsity of the data Figure 17 compares the con-vergence speed of different settings when the same amountof computational power is given to each on every datasetwe observe linear to super-linear scaling up to 32 machines

D TEST RMSE AS A FUNCTION OF THENUMBER OF UPDATES IN HPC CLUS-TER

In this section we plot the test RMSE as a function of thenumber of updates in HPC cluster which were not includedin the main text due to page constraints Figure 18 showssingle-machine multi-threaded experiments and Figure 19shows multi-machine distributed memory experiments

E COMPARISON OF ALGORITHMS FORDIFFERENT VALUES OF THE REGU-LARIZATION PARAMETER

In this section we augment experiments in Section 53by comparing the performance of NOMAD CCD++ andDSGD on different values of the regularization parameter λFigure 20 shows the result of the experiment As NOMADand DSGD are both stochastic gradient descent methodsthey behave similarly to each other when the regularizationparameter is changed On the other hand CCD++ whichdecreases the objective function more greedily behaves adifferently

For small values of λ CCD++ seems to overfit to themodel due to its greedy strategy it generally converges toa worse solution than others For high values of λ how-ever the strategy of CCD++ is advantageous and it showsrapid initial convergence Note that in all cases NOMAD iscompetitive as compared to both CCD++ as well as DSGD

F COMPARISON WITH GRAPHLABPowerGraph 22 [17] the latest version of GraphLab10 was

used in our experiments Since the Intel compiler is unableto compile GraphLab we compiled it with gcc The rest ofexperimental setting is the same as in Section 51

GraphLab provides a number of optimization algorithmsfor matrix completion However we found that the Stochas-tic Gradient Descent (SGD) implementation of GraphLabdoes not converge According to the GraphLab develop-ers this is because the abstraction currently provided byGraphLab is not suitable for SGD11 Therefore we use theAlternating Least Squares (ALS) algorithm for minimizingthe objective function (1)

Although each machine in our HPC cluster is equippedwith 32 GB of RAM and we distribute the work across32 machines in multi-machine experiments we had to tunenfibers parameter to avoid out of memory problems Inspite of tuning many different parameters we were unable torun GraphLab on the Hugewiki data in any setting For theother datasets we tried both synchronous and asynchronousengines of GraphLab and report the best of the two settings

Figure 21 shows results of single-machine multi-threadedexperiments while Figure 22 and Figure 23 shows multi-machine experiments on the HPC cluster and the commod-ity cluster respectively Clearly NOMAD not only convergesorders of magnitude faster than GraphLab in every settingbut the quality of solution is also significantly better

Compared to the multi-machine setting with large numberof machines (32) but small number of cores (4) per machineGraphLab converges faster when it is run in a single-machinesetting with a large number of cores (30) We conjecturethat this is because locking and unlocking a variable over anetwork is expensive In contrast since NOMAD is a lockfree algorithm it scales well in both settings

Although GraphLab biassgd is based on a model differentfrom (1) for completeness we provide comparisons with it

10Available for download from httpsgithubcomgraphlab-codegraphlab

11GraphLab provides a different algorithm named biassgdHowever it is based on a model different from (1)

13

0 20 40 60 80 100 120 14009

095

1

105

11

115

seconds

test

RM

SE

Netflix machines=8 cores=4 k = 100

λ = 00005

λ = 0005

λ = 005

λ = 05

0 20 40 60 80 100 120 140

22

24

26

28

30

seconds

test

RMSE

Yahoo machines=8 cores=4 k = 100

λ = 025

λ = 05

λ = 1

λ = 2

0 20 40 60 80 100 120 140

06

07

08

09

1

11

seconds

test

RMSE

Hugewiki machines=8 cores=4 k = 100

λ = 00025

λ = 0005

λ = 001

λ = 002

Figure 13 Convergence behavior of NOMAD when the regularization parameter λ is varied

0 20 40 60 80 100 120 140

092

094

096

098

1

seconds

test

RM

SE

Netflix machines=8 cores=4 λ = 005

k = 10

k = 20

k = 50

k = 100

0 20 40 60 80 100 120 14022

23

24

25

seconds

test

RMSE

Yahoo machines=8 cores=4 λ = 100

k = 10

k = 20

k = 50

k = 100

0 500 1000 1500 200005

06

07

08

seconds

test

RMSE

Hugewiki machines=8 cores=4 λ = 001

k = 10

k = 20

k = 50

k = 100

Figure 14 Convergence behavior of NOMAD when the latent dimension k is varied

0 02 04 06 08 1 12 14

middot1010

092

094

096

098

1

number of updates

test

RM

SE

Netflix cores=4 λ = 005 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 05 1 15 2 25 3

middot1010

22

23

24

25

26

27

number of updates

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 05 1 15 2

middot1011

06

08

1

12

14

number of updates

test

RMSE

Hugewiki cores=4 λ = 001 k = 100

machines=8

machines=16

machines=32

Figure 15 Test RMSE of NOMAD as a function of the number of updates on a commodity hardware cluster when thenumber of machines is varied

0 5 10 15 20 25 30 350

05

1

15

middot106

number of machines

up

date

sp

erm

ach

ine

per

core

per

sec

Netflix cores=4 λ = 005 k = 100

0 5 10 15 20 25 30 350

05

1

middot106

number of machines

updatesper

machineper

core

per

sec

Yahoo cores=4 λ = 100 k = 100

10 15 20 25 300

02

04

06

08

1

12

middot106

number of machines

updatesper

machineper

core

per

sec

Hugewiki cores=4 λ = 100 k = 100

Figure 16 Number of updates of NOMAD per machine per core per second as a function of the number of machines on acommodity hardware cluster

14

0 02 04 06 08 1 12

middot104

092

094

096

098

1

seconds times machines times cores

test

RM

SE

Netflix cores=4 λ = 005 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 05 1 15 2

middot104

22

23

24

25

26

27

seconds times machines times cores

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 02 04 06 08 1

middot105

06

08

1

12

14

seconds times machines times cores

test

RMSE

Hugewiki cores=4 λ = 001 k = 100

machines=8

machinees=16

machines=32

Figure 17 Test RMSE of NOMAD as a function of computation time (time in seconds times the number of machines times thenumber of cores per each machine) on a commodity hardware cluster when the number of machines is varied

0 02 04 06 08 1

middot1010

092

094

096

098

number of updates

test

RM

SE

Netflix machines=1 λ = 005 k = 100

cores=4

cores=8

cores=16

cores=30

0 05 1 15 2 25 3

middot1010

22

24

26

28

number of updates

test

RMSE

Yahoo machines=1 λ = 100 k = 100

cores=4

cores=8

cores=16

cores=30

0 1 2 3 4 5

middot1011

05

055

06

065

number of updates

test

RMSE

Hugewiki machines=1 λ = 001 k = 100

cores=4

cores=8

cores=16

cores=30

Figure 18 Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied

0 02 04 06 08 1 12 14

middot1010

092

094

096

098

1

number of updates

test

RM

SE

Netflix cores=4 λ = 005 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 05 1 15 2 25 3

middot1010

22

23

24

25

26

27

number of updates

test

RMSE

Yahoo cores=4 λ = 100 k = 100

machines=1

machines=2

machines=4

machines=8

machines=16

machines=32

0 1 2 3 4 5

middot1011

05

055

06

065

07

number of updates

test

RMSE

Hugewiki cores=4 λ = 001 k = 100

machines=4

machines=8

machines=16

machines=32

machines=64

Figure 19 Test RMSE of NOMAD as a function of the number of updates on a HPC cluster when the number of machinesis varied

15

0 20 40 60 8009

095

1

105

11

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 00125 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80

22

24

26

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 025 k = 100

NOMAD

DSGD

CCD++

0 500 1000 1500 200005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 00025 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 8009

092

094

096

098

1

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 0025 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80

22

24

26

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 05 k = 100

NOMAD

DSGD

CCD++

0 500 1000 1500 200005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 0005 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80 100 120 140

092

094

096

098

1

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 005 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80

22

24

26

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 100 k = 100

NOMAD

DSGD

CCD++

0 500 1000 1500 200005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 001 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80

094

096

098

1

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 01 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 8022

23

24

25

26

27

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 2 k = 100

NOMAD

DSGD

CCD++

0 500 1000 1500 200005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 002 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80098

1

102

104

106

108

11

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 02 k = 100

NOMAD

DSGD

CCD++

0 20 40 60 80

24

25

26

27

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 400 k = 100

NOMAD

DSGD

CCD++

0 500 1000 1500 200005

055

06

065

07

seconds

test

RMSE

Hugewiki machines=64 cores=4 λ = 004 k = 100

NOMAD

DSGD

CCD++

Figure 20 Comparison of NOMAD DSGD and CCD++ on a HPC cluster when the regularization paramter λ is varied Thevalue of λ increases from top to bottom

16

on commodity hardware cluster Unfortunately GraphLabbiassgd crashed when we ran it on more than 16 machinesso we had to run it on only 16 machines and assumed GraphLab

will linearly scale up to 32 machines in order to generateplots in Figure 23 Again NOMAD was orders of magnitudefaster than GraphLab and converges to a better solution

17

0 500 1000 1500 2000 2500 3000

095

1

105

11

seconds

test

RM

SE

Netflix machines=1 cores=30 λ = 005 k = 100

NOMAD

GraphLab ALS

0 1000 2000 3000 4000 5000 6000

22

24

26

28

30

seconds

test

RMSE

Yahoo machines=1 cores=30 λ = 100 k = 100

NOMAD

GraphLab ALS

Figure 21 Comparison of NOMAD and GraphLab on a single machine with 30 computation cores

0 100 200 300 400 500 600

1

15

2

25

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 005 k = 100

NOMAD

GraphLab ALS

0 100 200 300 400 500 600

30

40

50

60

70

80

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 100 k = 100

NOMAD

GraphLab ALS

Figure 22 Comparison of NOMAD and GraphLab on a HPC cluster

0 500 1000 1500 2000 2500

1

12

14

16

18

2

seconds

test

RM

SE

Netflix machines=32 cores=4 λ = 005 k = 100

NOMAD

GraphLab ALS

GraphLab biassgd

0 500 1000 1500 2000 2500 3000

25

30

35

40

seconds

test

RMSE

Yahoo machines=32 cores=4 λ = 100 k = 100

NOMAD

GraphLab ALS

GrpahLab biassgd

Figure 23 Comparison of NOMAD and GraphLab on a commodity hardware cluster

18

  • Introduction
  • Background
    • Alternating Least Squares
    • Coordinate Descent
    • Stochastic Gradient Descent
      • NOMAD
        • Description
        • Complexity Analysis
        • Dynamic Load Balancing
        • Hybrid Architecture
        • Implementation Details
          • Related Work
            • Map-Reduce and Friends
            • Asynchronous Algorithms
            • Numerical Linear Algebra
            • Discussion
              • Experiments
                • Experimental Setup
                • Scaling in Number of Cores
                • Scaling as a Fixed Dataset is Distributed Across Workers
                • Scaling on Commodity Hardware
                • Scaling as both Dataset Size and Number of Machines Grows
                  • Conclusion and Future Work
                  • Acknowledgements
                  • Effect of the Regularization Parameter
                  • Effect of the Latent Dimension
                  • Scaling on Commodity Hardware
                  • Test RMSE as a function of the number of updates in HPC cluster
                  • Comparison of Algorithms for Different Values of the Regularization Parameter
                  • Comparison with GraphLab
Page 13: NOMAD: Non-locking, stOchastic Multi-machine algorithm ...We develop an e cient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchas-tic Multi-machine

APPENDIXA EFFECT OF THE REGULARIZATION

PARAMETERIn this subsection we study the convergence behavior of

NOMAD as we change the regularization parameter λ (Fig-ure 13) Note that in Netflix data (left) for non-optimalchoices of the regularization parameter the test RMSE in-creases from the initial solution as the model overfits or un-derfits to the training data While NOMAD reliably con-verges in all cases on Netflix the convergence is notablyfaster with higher values of λ this is expected because reg-ularization smooths the objective function and makes theoptimization problem easier to solve On other datasets thespeed of convergence was not very sensitive to the selectionof the regularization parameter

B EFFECT OF THE LATENT DIMENSIONIn this subsection we study the convergence behavior of

NOMAD as we change the dimensionality parameter k (Fig-ure 14) In general the convergence is faster for smaller val-ues of k as the computational cost of SGD updates (9) and(10) is linear to k On the other hand the model gets richerwith higher values of k as its parameter space expands itbecomes capable of picking up weaker signals in the datawith the risk of overfitting This is observed in Figure 14with Netflix (left) and Yahoo Music (right) In Hugewikihowever small values of k were sufficient to fit the trainingdata and test RMSE suffers from overfitting with highervalues of k Nonetheless NOMAD reliably converged in allcases

C SCALING ON COMMODITY HARDWAREIn this section we augment Section 54 by providing ac-

tual plots of the experiment We increase the number ofmachines from 1 to 32 and plot how the convergence ofNOMAD is affected by the number of machines As in Sec-tion 54 we used m1xlarge machines from Amazon WebServies (AWS) which have quad-core Intel Xeon E5430 CPUand 15GB of RAM per each

The overall pattern is identical to what was found in Fig-ure 19 10 and 9 of Section 53 Figure 15 shows how thetest RMSE decreases as a function of the number of up-dates As in Figure 19 the speed of convergence is fasterwith larger number of machines as the updated informationis more frequently exchanged Figure 16 shows the numberof updates performed per second in each computation coreof each machine NOMAD exhibits linear scaling on Net-flix and Hugewiki but slows down on Yahoo Music due toextreme sparsity of the data Figure 17 compares the con-vergence speed of different settings when the same amountof computational power is given to each on every datasetwe observe linear to super-linear scaling up to 32 machines

D TEST RMSE AS A FUNCTION OF THENUMBER OF UPDATES IN HPC CLUS-TER

In this section we plot the test RMSE as a function of thenumber of updates in HPC cluster which were not includedin the main text due to page constraints Figure 18 showssingle-machine multi-threaded experiments and Figure 19shows multi-machine distributed memory experiments

E. COMPARISON OF ALGORITHMS FOR DIFFERENT VALUES OF THE REGULARIZATION PARAMETER

In this section we augment the experiments in Section 5.3 by comparing the performance of NOMAD, CCD++, and DSGD for different values of the regularization parameter λ. Figure 20 shows the results. Since NOMAD and DSGD are both stochastic gradient descent methods, they behave similarly to each other as the regularization parameter is changed. CCD++, on the other hand, decreases the objective function more greedily and behaves differently.

For small values of λ, CCD++ appears to overfit the model due to its greedy strategy, and it generally converges to a worse solution than the other methods. For high values of λ, however, the greedy strategy of CCD++ is advantageous and it shows rapid initial convergence. Note that in all cases NOMAD is competitive with both CCD++ and DSGD.

F. COMPARISON WITH GRAPHLAB

PowerGraph 2.2 [17], the latest version of GraphLab¹⁰, was used in our experiments. Since the Intel compiler is unable to compile GraphLab, we compiled it with gcc. The rest of the experimental setting is the same as in Section 5.1.

GraphLab provides a number of optimization algorithms for matrix completion. However, we found that the Stochastic Gradient Descent (SGD) implementation of GraphLab does not converge. According to the GraphLab developers, this is because the abstraction currently provided by GraphLab is not suitable for SGD¹¹. Therefore we use the Alternating Least Squares (ALS) algorithm to minimize the objective function (1).

Although each machine in our HPC cluster is equipped with 32 GB of RAM and we distribute the work across 32 machines in the multi-machine experiments, we had to tune the nfibers parameter to avoid out-of-memory problems. In spite of tuning many different parameters, we were unable to run GraphLab on the Hugewiki data in any setting. For the other datasets we tried both the synchronous and asynchronous engines of GraphLab and report the better of the two settings.

Figure 21 shows the results of single-machine multi-threaded experiments, while Figure 22 and Figure 23 show multi-machine experiments on the HPC cluster and the commodity cluster, respectively. Clearly, NOMAD not only converges orders of magnitude faster than GraphLab in every setting, but the quality of its solution is also significantly better.

Compared to the multi-machine setting with a large number of machines (32) but a small number of cores (4) per machine, GraphLab converges faster when run in a single-machine setting with a large number of cores (30). We conjecture that this is because locking and unlocking a variable over the network is expensive; in contrast, since NOMAD is a lock-free algorithm, it scales well in both settings.

Although GraphLab biassgd is based on a model different from (1), for completeness we provide comparisons with it on the commodity hardware cluster. Unfortunately, GraphLab biassgd crashed when we ran it on more than 16 machines, so we had to run it on only 16 machines and assumed that GraphLab would scale linearly up to 32 machines in order to generate the plots in Figure 23. Again, NOMAD was orders of magnitude faster than GraphLab and converged to a better solution.

¹⁰Available for download from https://github.com/graphlab-code/graphlab
¹¹GraphLab provides a different algorithm named biassgd; however, it is based on a model different from (1).

[Figure 13: Convergence behavior of NOMAD when the regularization parameter λ is varied. Three panels (Netflix, Yahoo! Music, Hugewiki; machines=8, cores=4, k=100) plot test RMSE against seconds for four values of λ each.]

[Figure 14: Convergence behavior of NOMAD when the latent dimension k is varied (k = 10, 20, 50, 100). Three panels (Netflix, Yahoo! Music, Hugewiki; machines=8, cores=4) plot test RMSE against seconds.]

[Figure 15: Test RMSE of NOMAD as a function of the number of updates on the commodity hardware cluster when the number of machines is varied. Three panels (Netflix, Yahoo! Music, Hugewiki; cores=4, k=100).]

[Figure 16: Number of updates of NOMAD per machine per core per second as a function of the number of machines on the commodity hardware cluster. Three panels (Netflix, Yahoo! Music, Hugewiki; cores=4, k=100).]

[Figure 17: Test RMSE of NOMAD as a function of total computation time (seconds × machines × cores) on the commodity hardware cluster when the number of machines is varied. Three panels (Netflix, Yahoo! Music, Hugewiki; cores=4, k=100).]

[Figure 18: Test RMSE of NOMAD as a function of the number of updates when the number of cores is varied (cores = 4, 8, 16, 30). Three panels (Netflix, Yahoo! Music, Hugewiki; machines=1, k=100).]

[Figure 19: Test RMSE of NOMAD as a function of the number of updates on the HPC cluster when the number of machines is varied. Three panels (Netflix, Yahoo! Music, Hugewiki; cores=4, k=100).]

[Figure 20: Comparison of NOMAD, DSGD, and CCD++ on the HPC cluster when the regularization parameter λ is varied; the value of λ increases from top to bottom. Each row has three panels (Netflix and Yahoo! Music with machines=32, Hugewiki with machines=64; cores=4, k=100) plotting test RMSE against seconds.]


[Figure 21: Comparison of NOMAD and GraphLab on a single machine with 30 computation cores. Two panels (Netflix, Yahoo! Music; k=100) plot test RMSE against seconds.]

[Figure 22: Comparison of NOMAD and GraphLab on the HPC cluster. Two panels (Netflix, Yahoo! Music; machines=32, cores=4, k=100) plot test RMSE against seconds.]

[Figure 23: Comparison of NOMAD, GraphLab ALS, and GraphLab biassgd on the commodity hardware cluster. Two panels (Netflix, Yahoo! Music; machines=32, cores=4, k=100) plot test RMSE against seconds.]
