Post on 03-Aug-2020
Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com
CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs
Agenda
Why accelerate recommendation (matrix factorization) using GPUs?
What are the challenges?
How does cuMF tackle the challenges?
So what is the result?
Why do we need a fast and scalable recommender system?
Recommendation is pervasive
Drives 80% of Netflix’s watch hours
US digital ads market: $37.3 billion
Need: fast, scalable, economical
ALS (alternating least squares) for collaborative filtering
4
Input: users’ ratings on some items
Output: all missing ratings
How: factorize the rating matrix R into X and Θᵀ (R ≈ XΘᵀ), and minimize the loss on observed ratings:
  min over X, Θ of Σ_{observed r_uv} (r_uv − xuᵀθv)² + λ(Σu ‖xu‖² + Σv ‖θv‖²)
ALS: iteratively solve X with Θ fixed, then Θ with X fixed
Matrix factorization/ALS is versatile
Recommendation: factorize the user × item rating matrix
Word embedding: factorize a word × word co-occurrence matrix
Topic model: factorize a word × document matrix
→ a very important algorithm to accelerate
Challenges of fast and scalable ALS
• ALS needs to solve, for each user u:
  xu = (Σ_{v rated by u} θv θvᵀ + λI)⁻¹ Θᵀ ru
  – the aggregation is an spmm-like operation; the solve uses LU or Cholesky decomposition (cuBLAS)
• Challenge 1: access and aggregate many θv’s — irregular (R is sparse) and memory intensive
• Challenge 2: a single GPU can NOT handle big m, n and nnz
Challenge 1: memory access
Nvidia K40: memory BW 288 GB/sec, compute 4.3 Tflops/sec
Higher flops need higher operational intensity (more flops per byte) → caching!
[Roofline plot: Gflops/s vs. operational intensity (flops/byte). The 288 GB/s bandwidth slope meets the 4.3 Tflops/s ceiling at ≈15 flops/byte; kernels below that ridge are under-utilized (memory-bound), kernels above it fully-utilized (compute-bound).]
Address challenge 1: memory-optimized ALS
• To obtain xu:
1. Reuse each θv for many users
2. Load a portion of Θ into shared memory (smem)
3. Tile and aggregate in registers
Address challenge 2: scale-up ALS on multiple GPUs
Model parallel: each GPU solves a portion of the model
Data parallel: each GPU solves with a portion of the training data, combined with model parallelism
Address challenge 2: parallel reduction
Data parallelism needs cross-GPU reduction
[Figures: one-phase parallel reduction vs. two-phase parallel reduction (intra-socket first, then inter-socket).]
Recap: cuMF tackled two challenges
• ALS needs to solve xu = (Σ_{v rated by u} θv θvᵀ + λI)⁻¹ Θᵀ ru: spmm-like aggregation, then LU or Cholesky decomposition (cuBLAS)
• Challenge 1: access and aggregate many θv’s — irregular (R is sparse) and memory intensive → memory-optimized ALS
• Challenge 2: a single GPU can NOT handle big m, n and nnz → scale up to multiple GPUs with parallel reduction
Connect cuMF to Spark MLlib
Spark applications relying on mllib/ALS need no change
The modified mllib/ALS detects GPUs and offloads computation
Leverage the best of Spark (scale-out) and GPUs (scale-up)
Stack: ALS apps → mllib/ALS → JNI → cuMF
Connect cuMF to Spark MLlib
[Diagram: one or more Power 8 nodes, each with two K40 GPUs; RDD partitions on the CPU feed CUDA kernels on GPU1 and GPU2, with shuffles between nodes.]
RDD on CPU: distribute rating data and shuffle parameters
Solver on GPU: form and solve the least-squares systems
Able to run on multiple nodes, and multiple GPUs per node
Implementation
In C (ca. 10k LOC)
CUDA 7.0/7.5, GCC OpenMP v3.0
Baselines:
– Libmf: SGD on 1 node [RecSys 14]
– NOMAD: SGD on >1 nodes [VLDB 14]
– SparkALS: ALS on Spark
– FactorBird: SGD + parameter server for MF
– Facebook: enhanced Giraph
cuMF performance
X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set
• 1 GPU vs. 30 cores: cuMF slightly faster than libmf and NOMAD
• cuMF scales well on 1, 2 and 4 GPUs
Effectiveness of memory optimization
X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set
• Aggressively using registers: 2x faster
• Using texture: 25%-35% faster
cuMF performance and cost
• cuMF @4 GPUs ≈ NOMAD @64 HPC nodes ≈ 10x NOMAD @32 AWS nodes
• cuMF @4 GPUs ≈ 10x SparkALS @50 nodes, at ≈1% of its cost
cuMF accelerated Spark on Power 8
• cuMF @2 K40s achieves a 6+x speedup in training (193 sec vs. 1.3k sec)
*GUI designed by Amir Sanjar
Conclusion
Why accelerate recommendation (matrix factorization) using GPU?
– Need to be fast, scalable and economical
What are the challenges?
– Memory access, scale to multiple GPUs
How does cuMF tackle the challenges?
– Optimize memory access, parallelism and communication
So what is the result?
– Up to 10x as fast, 100x as cost-efficient
– Use cuMF standalone or with Spark
– GPU can tackle ML problems beyond deep learning!
Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com
http://ibm.biz/wei_tan
Thank you, questions?
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016, http://arxiv.org/abs/1603.03820
Source code available soon.