Wei Tan, IBM T. J. Watson Research Center
CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs
Agenda
Why accelerate recommendation (matrix factorization) using GPUs?
What are the challenges?
How does cuMF tackle the challenges?
So what is the result?
Why do we need a fast and scalable recommender system?
Recommendation is pervasive
Drives 80% of Netflix's watch hours [1]
US digital ads market: $37.3 billion [2]
Need: fast, scalable, economic
ALS (alternating least squares) for collaborative filtering
Input: users' ratings on some items
Output: all missing ratings
How: factorize the rating matrix R into X and Θᵀ (R ≈ XΘᵀ) and minimize the loss on observed ratings:
$$\min_{X,\Theta} \sum_{(u,v):\,r_{uv}\,\text{observed}} \bigl(r_{uv} - x_u^{\top}\theta_v\bigr)^2 + \lambda\Bigl(\sum_u \lVert x_u\rVert^2 + \sum_v \lVert \theta_v\rVert^2\Bigr)$$
ALS: iteratively solve X with Θ fixed, then Θ with X fixed
[Figure: the rating matrix R ≈ X · Θᵀ, with x_u a row of X and θ_v a column of Θᵀ]
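Each alternating step has a closed form. With Θ fixed, every x_u has an independent regularized least-squares solution (and symmetrically for θ_v); the standard updates, consistent with the loss above, are:

$$x_u \leftarrow \Bigl(\sum_{v \in \Omega_u} \theta_v \theta_v^{\top} + \lambda I\Bigr)^{-1} \sum_{v \in \Omega_u} r_{uv}\,\theta_v, \qquad \theta_v \leftarrow \Bigl(\sum_{u \in \Omega_v} x_u x_u^{\top} + \lambda I\Bigr)^{-1} \sum_{u \in \Omega_v} r_{uv}\,x_u$$

where Ω_u is the set of items rated by user u, and Ω_v the set of users who rated item v.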
Matrix factorization/ALS is versatile
The same X·Θᵀ factorization applies across domains:
– Recommendation: factorize the user × item rating matrix
– Word embedding: factorize the word × word co-occurrence matrix
– Topic model: factorize the document × word matrix
A very important algorithm to accelerate
Challenges of fast and scalable ALS
• ALS needs to solve, for every user u:
$$\Bigl(\sum_{v \in \Omega_u} \theta_v \theta_v^{\top} + \lambda I\Bigr)\, x_u = \sum_{v \in \Omega_u} r_{uv}\,\theta_v$$
– forming the right-hand side is an spmm (cublas); the solve is an LU or Cholesky decomposition (cublas)
• Challenge 1: accessing and aggregating many θ_v's is irregular (R is sparse) and memory-intensive
• Challenge 2: a single GPU can NOT handle big m, n and nnz
Challenge 1: memory access
Nvidia K40: memory bandwidth 288 GB/sec, compute 4.3 Tflops
Caching → more flops per byte (higher operational intensity) → higher flops
[Roofline figure: x-axis is operational intensity (flops/byte), y-axis is Gflops/s; at 1 flop/byte a kernel is memory-bound at 288 Gflops/s (under-utilized), and only beyond ~15 flops/byte does it reach the full 4.3 Tflops/s (fully-utilized)]
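The ridge point follows directly from the two hardware numbers: a kernel needs roughly

$$\frac{4.3\ \text{Tflops/s}}{288\ \text{GB/s}} \approx 15\ \text{flops/byte}$$

of operational intensity before the K40 becomes compute-bound rather than memory-bound.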
Address challenge 1: memory-optimized ALS
• To obtain the Hermitian matrix $\sum_{v \in \Omega_u} \theta_v \theta_v^{\top}$:
1. Reuse each θ_v across the many users who rated item v
2. Load a portion of Θ into shared memory (smem)
3. Tile the output and aggregate in registers (see the sketch below)
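A minimal CUDA sketch of ideas 2 and 3, not the actual cuMF kernel: it assumes f = 32, a CSR-encoded R (`row_ptr`, `col_idx`), and item factors Θ stored f × n column-major. One 32×32 thread block computes the Hermitian matrix of one user; each θ_v is staged once in shared memory and reused by all threads, and every thread keeps its output element in a register (cuMF proper assigns each thread a larger register tile). The λI term and the right-hand side are omitted for brevity.

```cuda
#include <cuda_runtime.h>

#define F 32  // latent dimension (illustrative)

// One block per user u: A_u = sum over rated items v of theta_v * theta_v^T.
// Launch as: hermitian_per_user<<<m, dim3(F, F)>>>(row_ptr, col_idx, theta, A);
__global__ void hermitian_per_user(const int   *row_ptr,  // CSR row pointers of R
                                   const int   *col_idx,  // CSR column indices of R
                                   const float *theta,    // f x n item factors, column-major
                                   float       *A)        // m blocks of f x f output
{
    __shared__ float theta_v[F];              // one item factor, shared by the block
    const int u = blockIdx.x;                 // this block's user
    const int i = threadIdx.y, j = threadIdx.x;
    float acc = 0.0f;                         // register accumulator for A_u[i][j]

    for (int p = row_ptr[u]; p < row_ptr[u + 1]; ++p) {
        const int v = col_idx[p];
        if (i == 0) theta_v[j] = theta[v * F + j];  // stage theta_v into smem once
        __syncthreads();
        acc += theta_v[i] * theta_v[j];             // rank-1 update, held in a register
        __syncthreads();
    }
    A[(size_t)u * F * F + i * F + j] = acc;
}
```

Because every thread in the block walks the same item list, each θ_v is read from global memory once and then reused f² times, which is exactly the operational-intensity boost the roofline slide calls for.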
Address challenge 1: memory-optimized ALS
[Figure: the f × f output matrix is tiled across threads; each thread accumulates its tile of θ_v θ_vᵀ in registers]
Address challenge 2: scale-up ALS on multiple GPUs
Model parallel: each GPU solves a portion of the model
[Figure: X and Θ partitioned across GPUs (model parallel)]
Address challenge 2: scale-up ALS on multiple GPUs
Data parallel: each GPU solves with a portion of the training data
[Figure: the rating data partitioned across GPUs (data parallel), combined with model parallelism within each partition]
Address challenge 2: parallel reduction
Data parallelism needs cross-GPU reduction (a host-side sketch follows the figure)
[Figure: one-phase parallel reduction vs. two-phase parallel reduction, which reduces intra-socket first and inter-socket second]
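A minimal host-side sketch of a two-phase reduction, under assumptions not in the slides: 4 GPUs, with GPUs 0–1 on socket 0 and GPUs 2–3 on socket 1, each holding a partial sum `buf[g]` of `len` floats plus a staging buffer `tmp[g]` allocated on that GPU. The `add_into` kernel and `reduce_pair` helper are illustrative names, not cuMF's API.

```cuda
#include <cuda_runtime.h>

// Elementwise accumulate: dst += src.
__global__ void add_into(float *dst, const float *src, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) dst[i] += src[i];
}

// Copy GPU `from`'s partial sum to GPU `to` and accumulate it there.
// `dst` and `tmp` live on GPU `to`; `src` lives on GPU `from`.
static void reduce_pair(int to, int from, float *dst, const float *src,
                        float *tmp, int len) {
    cudaMemcpyPeer(tmp, to, src, from, len * sizeof(float));
    cudaSetDevice(to);
    add_into<<<(len + 255) / 256, 256>>>(dst, tmp, len);
    cudaDeviceSynchronize();
}

void two_phase_reduce(float *buf[4], float *tmp[4], int len) {
    // Phase 1: intra-socket reductions over the fast local links.
    reduce_pair(1, 0, buf[1], buf[0], tmp[1], len);  // socket 0: GPU1 += GPU0
    reduce_pair(3, 2, buf[3], buf[2], tmp[3], len);  // socket 1: GPU3 += GPU2
    // Phase 2: a single inter-socket transfer over the slower link.
    reduce_pair(1, 3, buf[1], buf[3], tmp[1], len);  // GPU1 += GPU3
}
```

The point of the two-phase order is that only one partial sum crosses the socket boundary, instead of three in a naive one-phase reduction onto a single GPU.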
Recap: cuMF tackled two challenges
• ALS needs to solve, for every user u:
$$\Bigl(\sum_{v \in \Omega_u} \theta_v \theta_v^{\top} + \lambda I\Bigr)\, x_u = \sum_{v \in \Omega_u} r_{uv}\,\theta_v$$
– spmm and LU/Cholesky decomposition via cublas
• Challenge 1: accessing and aggregating many θ_v's is irregular (R is sparse) and memory-intensive → memory-optimized ALS
• Challenge 2: a single GPU can NOT handle big m, n and nnz → scale-up across multiple GPUs
Connect cuMF to Spark MLlib
Spark applications relying on mllib/ALS need no change
Modified mllib/ALS detects GPUs and offloads computation
Leverages the best of Spark (scale-out) and GPUs (scale-up)
[Stack: ALS apps → mllib/ALS → cuMF, connected via JNI]
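A hedged sketch of what such a JNI bridge can look like; the class, method, and solver names below are illustrative, not cuMF's actual interface. The JVM side declares a native method, and this host-side C entry point (compiled alongside the CUDA solver) pins the rating arrays and hands them to the GPU.

```cuda
#include <jni.h>

// Illustrative entry point for a hypothetical `@native def factorize(...)`
// declared on the Scala/mllib side (cuMF's real JNI surface may differ).
JNIEXPORT void JNICALL
Java_cumf_AlsSolver_factorize(JNIEnv *env, jobject self,
                              jfloatArray ratings, jintArray rowPtr,
                              jintArray colIdx, jint f, jfloat lambda)
{
    // Pin the JVM arrays so the native solver can read them in place.
    jfloat *r  = (*env)->GetPrimitiveArrayCritical(env, ratings, NULL);
    jint   *rp = (*env)->GetPrimitiveArrayCritical(env, rowPtr,  NULL);
    jint   *ci = (*env)->GetPrimitiveArrayCritical(env, colIdx,  NULL);

    // cumf_als_solve(r, rp, ci, f, lambda);  // hypothetical CUDA solver call

    (*env)->ReleasePrimitiveArrayCritical(env, colIdx,  ci, 0);
    (*env)->ReleasePrimitiveArrayCritical(env, rowPtr,  rp, 0);
    (*env)->ReleasePrimitiveArrayCritical(env, ratings, r,  0);
}
```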
Connect cuMF to Spark MLlib
[Figure: multiple "1 Power 8 node + 2 K40" configurations; on each node, a chain of RDDs on the CPU feeds CUDA kernels on GPU1 and GPU2, with shuffle stages between iterations and across nodes]
RDDs on CPU: distribute the rating data and shuffle parameters
Solver on GPU: form and solve the normal equations
Able to run on multiple nodes, with multiple GPUs per node
Implementation
In C (circa 10k LOC)
CUDA 7.0/7.5, GCC with OpenMP v3.0
Baselines:
– Libmf: SGD on 1 node [RecSys14]
– NOMAD: SGD on >1 node [VLDB14]
– SparkALS: ALS on Spark
– FactorBird: SGD + parameter server for MF
– Facebook: enhanced Giraph
CuMF performance
X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set
• With 1 GPU vs. 30 cores, cuMF is slightly faster than libmf and NOMAD
• cuMF scales well on 1, 2, and 4 GPUs
Effectiveness of memory optimization
X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set
• Aggressively using registers → 2x speedup
• Using texture memory → 25%–35% faster
CuMF performance and cost
• CuMF @4 GPUs ≈ NOMAD @64 HPC nodes
≈ 10x NOMAD @32 AWS nodes
• CuMF @4 GPUs ≈ 10x SparkALS @50 nodes
≈ 1% of its cost
CuMF accelerated Spark on Power 8
• CuMF @2 K40s achieves a 6+x training speedup (1.3k sec → 193 sec)
*GUI designed by Amir Sanjar
Conclusion
Why accelerate recommendation (matrix factorization) using GPU?
– Need to be fast, scalable and economic
What are the challenges?
– Memory access, scale to multiple GPUs
How does cuMF tackle the challenges?
– Optimize memory access, parallelism and communication
So what is the result?
– Up to 10x as fast, 100x as cost-efficient
– Use cuMF standalone or with Spark
– GPU can tackle ML problems beyond deep learning!
Wei Tan, IBM T. J. Watson Research Center
http://ibm.biz/wei_tan
Thank you, questions?
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016, http://arxiv.org/abs/1603.03820
Source code available soon.