Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Evan Casey
Taptech - 6/6/2014

Overview
● Apache Spark
  ○ Dataflow model
  ○ Spark vs Hadoop MapReduce
● Recommender Systems
  ○ Similarity-based collaborative filtering
  ○ Distributed implementation on Apache Spark
  ○ Lessons learned

Apache Spark
● Distributed data-processing framework that can run on top of HDFS
● Use cases:
  ○ Interactive analytics
  ○ Graph algorithms
  ○ Stream processing
  ○ Scalable ML
  ○ Recommendation engines!

Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance
● Similar programming style to Scalding/Cascading

Programming Model
● Resilient Distributed Dataset (RDD)
  ○ textFile, parallelize
● Parallel Operations
  ○ map, groupBy, filter, join, etc.
● Optimizations
  ○ Caching, shared variables
● Demo

What are recommendation algorithms?
● Problem:
  ○ “Information overload”
  ○ Diverse user interests
● User-Item Recommendation
  ○ Recommend content for each user based on a larger training set of user interaction histories

Motivation
● Large-scale recommender systems
  ○ Millions of users and items (100M+ ratings)
● Problems:
  ○ Memory-based approach
  ○ Scalability/Efficiency
  ○ User interaction sparsity

Collaborative Filtering
● Similarity-based approach
● Two main variants:
  ○ User-based
  ○ Item-based

[Figure: user-item rating matrix for the users Shawn, Billy, and Mary; known ratings are shown as numbers and unknown ratings as "?"]

User-based Collaborative Filtering

● Step 1: Obtain the user-item matrix, denoted M_{i,j}
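As a minimal sketch of Step 1, assuming the raw input is a list of (user, item, rating) triples, the user-item matrix can be held as a sparse dictionary of dictionaries. The function name `build_user_item_matrix` and the sample data are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


def build_user_item_matrix(ratings):
    """Group (user, item, rating) triples into a sparse {user: {item: rating}} map."""
    matrix = defaultdict(dict)
    for user, item, rating in ratings:
        matrix[user][item] = float(rating)
    return dict(matrix)


# Hypothetical interaction log; the values are illustrative only.
ratings = [
    ("shawn", "a", 4), ("shawn", "b", 3),
    ("billy", "a", 2), ("billy", "c", 4),
    ("mary", "b", 1), ("mary", "c", 2),
]
M = build_user_item_matrix(ratings)
```

Only observed ratings are stored, which keeps memory proportional to the number of interactions rather than to users times items.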


● Step 2: Calculate the similarity between each pair of users and compute the top-n nearest neighbors

[Equation not recovered; its labels were "pairwise users" and "rating vectors"]
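A sketch of Step 2, assuming plain cosine similarity between sparse rating vectors (the slide's equation did not survive, so the exact measure used there is an assumption). The helper names and the sample matrix are hypothetical.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of two sparse rating vectors given as {item: rating} dicts."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_neighbors(matrix, user, n):
    """Return the n most similar other users as (similarity, user) pairs."""
    scored = (
        (cosine_similarity(matrix[user], vec), other)
        for other, vec in matrix.items()
        if other != user
    )
    return heapq.nlargest(n, scored)


# Hypothetical ratings; u2 rates in the same proportions as u1.
matrix = {
    "u1": {"a": 4.0, "b": 2.0},
    "u2": {"a": 2.0, "b": 1.0},
    "u3": {"b": 5.0, "c": 5.0},
}
neighbors = top_n_neighbors(matrix, "u1", 1)
```

This loop compares one user against all others, which is the quadratic pairwise step that a distributed implementation has to spread across the cluster.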


● Step 3: Compute the weighted average of the ratings by the neighbors and find the top-n items with the highest score

[Equation not recovered; its labels were "recommendation score of item", "pairwise user similarities", "mean rating", and "co-rated user rating"]
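A sketch of Step 3, assuming the common mean-centered weighted average: the user's mean rating plus the similarity-weighted sum of each neighbor's deviation from their own mean, divided by the sum of absolute similarities. This form is consistent with the labels on the slide, but the original equation is lost, so treat it as an assumption. Function names, similarities, and ratings below are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def predict_score(matrix, neighbors, user, item):
    """Mean-centered weighted average over neighbors who rated the item.

    neighbors is a list of (similarity, other_user) pairs.
    """
    num = 0.0
    den = 0.0
    for sim, other in neighbors:
        if item in matrix[other]:
            num += sim * (matrix[other][item] - mean(matrix[other].values()))
            den += abs(sim)
    base = mean(matrix[user].values())
    return base if den == 0 else base + num / den


def recommend(matrix, neighbors, user, n):
    """Score items the user has not rated and return the top n as (score, item)."""
    candidates = {i for _, v in neighbors for i in matrix[v]} - set(matrix[user])
    scored = [(predict_score(matrix, neighbors, user, i), i) for i in candidates]
    return sorted(scored, reverse=True)[:n]


# Hypothetical data: the similarities are given rather than computed.
matrix = {
    "u1": {"a": 4.0, "b": 2.0},
    "u2": {"a": 5.0, "b": 3.0, "c": 4.0},
    "u3": {"a": 1.0, "c": 5.0, "d": 3.0},
}
neighbors = [(0.9, "u2"), (0.1, "u3")]
top_items = recommend(matrix, neighbors, "u1", 2)
```

Subtracting each neighbor's mean keeps a habitually generous or harsh rater from shifting the score on their own.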

Results
[Figures not recovered: results on a standalone cluster and on an Amazon EC2 cluster]

Evaluation

Lessons Learned
● Must manually specify the number of tasks
  ○ Want 2-4 slices for each CPU in your cluster
● Use broadcast variables for shared data and cache for data that will be reused
● Must account for the “power users”
  ○ Sampling heavy-tailed user-interaction histories
● Need to account for the rating scale of each user!
  ○ Adjusted cosine similarity and Pearson correlation outperform normal cosine similarity
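The rating-scale point can be illustrated with a small sketch that compares plain cosine similarity with Pearson correlation over co-rated items. The three users below are hypothetical: `harsh` has the same tastes as `generous` on a lower scale, while `opposite` has reversed preferences on a similar scale.

```python
import math


def cosine(u, v):
    """Plain cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson(u, v):
    """Pearson correlation: cosine after subtracting each user's mean rating."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    mean_u = sum(u[i] for i in common) / len(common)
    mean_v = sum(v[i] for i in common) / len(common)
    cu = {i: u[i] - mean_u for i in common}
    cv = {i: v[i] - mean_v for i in common}
    return cosine(cu, cv)


# Hypothetical users, illustrative values only.
generous = {"a": 5.0, "b": 4.0, "c": 5.0}
harsh = {"a": 2.0, "b": 1.0, "c": 2.0}
opposite = {"a": 4.0, "b": 5.0, "c": 4.0}

cos_same, cos_opp = cosine(generous, harsh), cosine(generous, opposite)
pear_same, pear_opp = pearson(generous, harsh), pearson(generous, opposite)
```

In this example plain cosine scores both pairs above 0.9 and so barely separates them, while Pearson correlation gives roughly +1 for the same-taste pair and roughly -1 for the reversed pair.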
