Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
![Page 1: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/1.jpg)
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Evan Casey
Taptech - 6/6/2014
![Page 2: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/2.jpg)
Overview
● Apache Spark
  ○ Dataflow model
  ○ Spark vs Hadoop MapReduce
● Recommender Systems
  ○ Similarity-based collaborative filtering
  ○ Distributed implementation on Apache Spark
  ○ Lessons learned
![Page 3: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/3.jpg)
Apache Spark
● Distributed data-processing framework, commonly deployed on top of HDFS
● Use cases:
  ○ Interactive analytics
  ○ Graph algorithms
  ○ Stream processing
  ○ Scalable ML
  ○ Recommendation engines!
![Page 4: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/4.jpg)
Spark vs Hadoop MapReduce
● In-memory dataflow model optimized for multi-stage jobs
● Novel approach to fault tolerance
● Similar programming style to Scalding/Cascading
![Page 5: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/5.jpg)
Programming Model
● Resilient Distributed Dataset (RDD)
  ○ textFile, parallelize
● Parallel Operations
  ○ Map, GroupBy, Filter, Join, etc.
● Optimizations
  ○ Caching, shared variables
● Demo
![Page 6: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/6.jpg)
What are recommendation algorithms?
● Problem:
  ○ “Information overload”
  ○ Diverse user interests
● User-Item Recommendation
  ○ Recommend content for each user based on a larger training set of user interaction histories
![Page 7: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/7.jpg)
Motivation
● Large-scale recommender systems
  ○ Millions of users and items (100M+ ratings)
● Problems:
  ○ Memory-based approach
  ○ Scalability/Efficiency
  ○ User interaction sparsity
![Page 8: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/8.jpg)
Collaborative Filtering
● Similarity-based approach
● Two main variants:
  ○ User-based
  ○ Item-based
[Figure: user-item rating matrix for the users Shawn, Billy, and Mary; known ratings are shown as numbers and unknown ratings as "?"]
![Page 9: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/9.jpg)
User-based Collaborative Filtering
● Step 1: Obtain the user-item matrix, denoted $M_{i,j}$
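As a rough illustration of Step 1, the sketch below groups raw (user, item, rating) triples into a sparse user-item structure. It is plain Python rather than Spark code, and the function name `build_user_item_matrix`, the IDs `u1`/`i1`, and the rating values are all hypothetical, invented for this example.

```python
from collections import defaultdict


def build_user_item_matrix(ratings):
    """Group (user, item, rating) triples into {user: {item: rating}}.

    Missing entries are simply absent, which keeps the matrix sparse.
    """
    matrix = defaultdict(dict)
    for user, item, rating in ratings:
        matrix[user][item] = float(rating)
    return dict(matrix)


# Hypothetical interaction log (values invented for illustration).
ratings = [
    ("u1", "i1", 4), ("u1", "i2", 3),
    ("u2", "i1", 2), ("u2", "i3", 4),
    ("u3", "i2", 5), ("u3", "i3", 1),
]
M = build_user_item_matrix(ratings)
print(M["u1"])
```

A nested dictionary is only one possible representation; on a cluster the same grouping would more likely be expressed as a keyed, distributed collection.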
![Page 10: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/10.jpg)
User-based Collaborative Filtering
● Step 2: Calculate the similarity between each pair of users and compute the top-n nearest neighbors
[Formula annotations: pairwise users, rating vectors]
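The similarity formula on this slide did not survive extraction, so the following is only a sketch of Step 2 under an assumption: it uses cosine similarity over co-rated items, one of the measures named later in the deck. The function names and the sample data are hypothetical, and the code is single-machine Python, not the distributed implementation.

```python
import heapq
import math


def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity over the items both users rated; 0.0 if none overlap."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_neighbors(matrix, user, n):
    """Return the n most similar other users as (similarity, user) pairs."""
    scored = [
        (cosine_similarity(matrix[user], other_ratings), other)
        for other, other_ratings in matrix.items()
        if other != user
    ]
    return heapq.nlargest(n, scored)


# Hypothetical ratings (invented for illustration).
matrix = {
    "u1": {"i1": 4.0, "i2": 3.0},
    "u2": {"i1": 4.0, "i2": 3.0, "i3": 5.0},
    "u3": {"i1": 1.0, "i2": 5.0},
}
neighbors = top_n_neighbors(matrix, "u1", 1)
print(neighbors)
```

Comparing every pair of users is quadratic in the number of users, which is the scalability problem the Motivation slide points at.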
![Page 11: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/11.jpg)
User-based Collaborative Filtering
● Step 3: Compute the weighted average of the neighbors' ratings and find the top-n items with the highest scores
[Formula annotations: recommendation score of item, pairwise user similarities, mean rating, co-rated user rating]
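The scoring formula itself is missing from the transcript. Judging only by the surviving annotations (similarities, mean rating, co-rated rating), it may resemble the common mean-centered weighted average, which is what the sketch below assumes. All names and values are hypothetical, and the similarities are supplied by hand rather than computed.

```python
def mean_rating(user_ratings):
    """Average of all ratings a user has given."""
    return sum(user_ratings.values()) / len(user_ratings)


def predict_score(matrix, similarities, user, item):
    """User's mean rating plus the similarity-weighted average of the
    neighbors' mean-centered ratings for the item."""
    numerator = 0.0
    denominator = 0.0
    for neighbor, sim in similarities.items():
        if item not in matrix[neighbor]:
            continue  # only neighbors who rated the item contribute
        deviation = matrix[neighbor][item] - mean_rating(matrix[neighbor])
        numerator += sim * deviation
        denominator += abs(sim)
    base = mean_rating(matrix[user])
    return base if denominator == 0 else base + numerator / denominator


def recommend(matrix, similarities, user, n):
    """Score items the user has not rated and return the top n as (score, item)."""
    seen = set(matrix[user])
    candidates = {i for nb in similarities for i in matrix[nb]} - seen
    scored = [(predict_score(matrix, similarities, user, i), i) for i in candidates]
    return sorted(scored, reverse=True)[:n]


# Hypothetical ratings and neighbor similarities (invented for illustration).
matrix = {
    "u1": {"i1": 4.0, "i2": 2.0},
    "u2": {"i1": 5.0, "i2": 3.0, "i3": 4.0},
    "u3": {"i1": 2.0, "i3": 4.0},
}
similarities = {"u2": 1.0, "u3": 0.5}
top = recommend(matrix, similarities, "u1", 1)
print(top)
```

Subtracting each neighbor's mean before weighting is one way to keep a generous rater from inflating every prediction.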
![Page 12: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/12.jpg)
Results
[Figures: results on a standalone cluster and on an Amazon EC2 cluster]
![Page 13: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/13.jpg)
Evaluation
![Page 14: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark](https://reader034.fdocuments.in/reader034/viewer/2022042623/54c66eb84a795944538b4623/html5/thumbnails/14.jpg)
Lessons Learned
● Must manually specify number of tasks
  ○ Want 2-4 slices for each CPU in your cluster
● Use broadcast variables for shared data and cache for data that will be reused
● Must account for the “power users”
  ○ Sampling heavy-tailed user-interaction histories
● Need to account for the rating scale of each user!
  ○ Adjusted cosine similarity and Pearson correlation outperform normal cosine similarity
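To make the last point concrete, here is a small sketch of why centering on each user's rating scale can matter. It compares plain cosine similarity with Pearson correlation on two hypothetical users whose preferences are opposite but whose ratings all sit at the high end of the scale. The means are taken over co-rated items, which is one common convention and an assumption here.

```python
import math


def cosine(u, v):
    """Plain cosine similarity over co-rated items."""
    common = sorted(set(u) & set(v))
    dot = sum(u[i] * v[i] for i in common)
    norm = (math.sqrt(sum(u[i] ** 2 for i in common))
            * math.sqrt(sum(v[i] ** 2 for i in common)))
    return dot / norm if norm else 0.0


def pearson(u, v):
    """Pearson correlation over co-rated items (ratings centered per user)."""
    common = sorted(set(u) & set(v))
    if not common:
        return 0.0
    mean_u = sum(u[i] for i in common) / len(common)
    mean_v = sum(v[i] for i in common) / len(common)
    du = [u[i] - mean_u for i in common]
    dv = [v[i] - mean_v for i in common]
    cov = sum(a * b for a, b in zip(du, dv))
    norm = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    return cov / norm if norm else 0.0


# Hypothetical users: both rate high, but they prefer opposite items.
a = {"i1": 4.0, "i2": 5.0}
b = {"i1": 5.0, "i2": 4.0}
print(cosine(a, b), pearson(a, b))
```

Plain cosine reports these two users as nearly identical, while Pearson reports them as opposed, because only the latter removes each user's baseline before comparing.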