Movie Recommendation
Team Valak
Manpreet Kaur
Hao Su
Overview
● Topic Review
● Sequential Design
● Parallel Design
● Strong Scaling
● Weak Scaling
● Future Work
● What we have learned
Topic Review
● Implement a movie recommendation system using Collaborative Filtering and Map-Reduce.
● Collaborative Filtering will find other users with similar taste to the target user and make recommendations based on their ratings.
Movie Recommendation System
● Phase 1: Item Partitioning - Items are divided into groups using the minHash technique.
● Phase 2: Intra Similarity Phase - Compute the item similarity for every pair of items t_i and t_j belonging to the same group.
● Phase 3: Inter Similarity Phase - Compute the item similarity for every pair of items t_i and t_j belonging to different groups, if the two items have received a rating from the same user.
● Phase 4: Find Recommendation - Find items similar to the target user's top-rated items, using the similarity matrix constructed in the phases above.
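The slides do not name the similarity measure, so the following is only a minimal sketch of Phase 2 that assumes cosine similarity over item rating vectors. The names `cosine_similarity`, `intra_group_similarity`, and `item_ratings` are hypothetical and chosen for illustration.

```python
from itertools import combinations
from math import sqrt

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two items, each given as {user: rating}.

    Assumption: cosine similarity is used; the slides do not specify the measure.
    """
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0  # no user rated both items
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = sqrt(sum(r * r for r in ratings_b.values()))
    return dot / (norm_a * norm_b)

def intra_group_similarity(group, item_ratings):
    """Phase 2 sketch: similarity for every pair of items in one group."""
    return {
        (a, b): cosine_similarity(item_ratings[a], item_ratings[b])
        for a, b in combinations(sorted(group), 2)
    }

# Hypothetical toy data: item -> {user: rating}
item_ratings = {
    "m1": {"u1": 5, "u2": 3},
    "m2": {"u1": 4, "u2": 3, "u3": 1},
    "m3": {"u3": 5},
}
sims = intra_group_similarity({"m1", "m2"}, item_ratings)
```

Phase 3 could reuse the same pairwise function, restricted to pairs from different groups that share at least one rating user.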
Locality Sensitive Hashing
1. Build an m×n matrix in which an entry is 1 if the shingle for that row appears in the set for that column, and 0 otherwise.
2. Permute the rows of the matrix from step 1 and build a new p×n signature matrix: for each set, record the row number of the first shingle that appears in it under the permutation.
3. Repeat the row permutation MAX_PERMUTATIONS times to fill in the rest of the p×n signature matrix.
4. Choose a band size b (the number of signature rows per band) and compute a hash value for each band of each item.
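The four steps above can be sketched as follows. This is an illustrative version, not the team's code: it assumes each item's "shingles" are the users who rated it and that every item has at least one rating. The names `minhash_signatures` and `lsh_buckets` are hypothetical.

```python
import random
from collections import defaultdict

def minhash_signatures(item_sets, universe, num_permutations=20, seed=0):
    """Return {item: signature} using explicit row permutations (steps 1-3)."""
    rng = random.Random(seed)
    rows = sorted(universe)
    signatures = {item: [] for item in item_sets}
    for _ in range(num_permutations):
        order = rows[:]
        rng.shuffle(order)  # one random permutation of the rows
        position = {row: i for i, row in enumerate(order)}
        for item, members in item_sets.items():
            # Row number of the first member to appear under this permutation.
            signatures[item].append(min(position[m] for m in members))
    return signatures

def lsh_buckets(signatures, band_size):
    """Step 4: group items whose signatures agree on at least one whole band."""
    buckets = defaultdict(set)
    for item, sig in signatures.items():
        for start in range(0, len(sig), band_size):
            band = tuple(sig[start:start + band_size])
            # The dict hashes the (band index, band) key for us.
            buckets[(start // band_size, band)].add(item)
    return buckets

# Hypothetical toy data: item -> set of users who rated it
item_users = {
    "m1": {"u1", "u2", "u3"},
    "m2": {"u1", "u2", "u3"},
    "m3": {"u4"},
}
universe = {"u1", "u2", "u3", "u4"}
sigs = minhash_signatures(item_users, universe, num_permutations=6)
buckets = lsh_buckets(sigs, band_size=2)
```

Items with identical sets always share a bucket, and items with disjoint sets never do; in between, a smaller band size makes it more likely that similar items collide.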
Sequential Algorithm
Sequential Design
● Phase 1: Run the partitioning phase to divide the N items into K buckets.
● Phase 2: Compute the similarity of every pair of items belonging to the same bucket.
● Phase 3: Compute the similarity of every related item pair whose items fall in different buckets.
● Phase 4: Select the K closest neighbors of each of the target user's top-rated items.
● Recommend the top V movies by predicted rating.
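Phase 4 and the final recommendation step might look like the sketch below. The prediction formula is an assumption (a similarity-weighted average of the target user's ratings), since the slides only say "by prediction". The function `recommend` and its parameters are hypothetical.

```python
import heapq
from collections import defaultdict

def recommend(user_ratings, similarity, k=2, top_v=2, top_rated=2):
    """Recommend up to top_v unseen items for one user.

    user_ratings: {item: rating} for the target user.
    similarity:   {item: {other_item: score}} built in Phases 2 and 3.
    Assumption: predicted rating = similarity-weighted average of seed ratings.
    """
    seeds = heapq.nlargest(top_rated, user_ratings, key=user_ratings.get)
    weighted = defaultdict(float)
    weights = defaultdict(float)
    for seed in seeds:
        unseen = [(item, score)
                  for item, score in similarity.get(seed, {}).items()
                  if item not in user_ratings]
        # K closest unseen neighbors of this top-rated item.
        for item, score in heapq.nlargest(k, unseen, key=lambda kv: kv[1]):
            weighted[item] += score * user_ratings[seed]
            weights[item] += score
    predictions = {item: weighted[item] / weights[item]
                   for item in weighted if weights[item] > 0}
    return heapq.nlargest(top_v, predictions.items(), key=lambda kv: kv[1])

# Hypothetical toy data
similarity = {
    "m1": {"m2": 0.9, "m4": 0.6, "m3": 0.4},
    "m2": {"m1": 0.9, "m3": 0.8, "m5": 0.7},
}
user_ratings = {"m1": 5, "m2": 2}
recs = recommend(user_ratings, similarity)
```

Here `recs` is a list of `(item, predicted_rating)` pairs, best first.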
Parallel Algorithm
Parallel Design
● Phase 1: Run the partitioning phase to divide the N items into K buckets. The signature matrix is computed in parallel.
● Phase 2: Compute the similarity of every pair of items belonging to the same bucket. Each bucket is handled by a separate thread.
● Phase 3: Compute the similarity of every related item pair whose items fall in different buckets. Each thread handles different related pairs.
● Phase 4: Select the K closest neighbors of each of the target user's top-rated items. Each top-rated item is handled by a separate thread.
● Recommend the top V movies by predicted rating.
Strong Scaling
Weak Scaling
Bottleneck in Phase 1
● We cannot begin processing before reading the entire file. This is a sequential part, and its cost will increase as the data size increases.
● Build an m×n matrix in which an entry is 1 if the shingle for that row appears in the set for that column, and 0 otherwise.
● Permute the rows of that matrix and build a new p×n signature matrix: for each set, record the row number of the first shingle that appears in it under the permutation. This step can only be done sequentially and will take more time as the data size increases.
● Repeat the row permutation MAX_PERMUTATIONS times to fill in the rest of the p×n signature matrix.
● Choose a band size b (the number of signature rows per band) and compute a hash value for each band of each item.
Bottleneck in Phase 3
● Phase 3 iterates over all users and computes the similarity between any two items that have received a rating from the same user.
● The outer loop over users cannot be parallelized: the same pair of items may have been rated by two different users, so if the similarity for that pair has not yet been computed, two threads may end up updating the same element of the similarity matrix.
● So only the items of a single user can be processed in parallel, which is worst when each user has few ratings (fewer than the number of threads).
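To make the conflict concrete, the sketch below enumerates the Phase 3 pairs user by user and shows that the same pair can be reached through several users. Collecting the pairs in a set first is one possible way around the conflict, offered here only as an assumption-laden illustration and not as what the team implemented. The names `cross_bucket_pairs`, `user_items`, and `bucket_of` are hypothetical.

```python
from itertools import combinations

def cross_bucket_pairs(user_items, bucket_of):
    """Collect each co-rated pair of items from different buckets exactly once.

    user_items: {user: iterable of items the user rated}
    bucket_of:  {item: bucket id assigned in Phase 1}
    """
    pairs = set()
    for items in user_items.values():
        for a, b in combinations(sorted(items), 2):
            if bucket_of[a] != bucket_of[b]:
                pairs.add((a, b))  # the set absorbs pairs seen via other users
    return pairs

# Hypothetical toy data: both users rated m1 and m3.
user_items = {"u1": ["m1", "m3"], "u2": ["m1", "m3", "m2"]}
bucket_of = {"m1": 0, "m2": 0, "m3": 1}
pairs = cross_bucket_pairs(user_items, bucket_of)
```

Once the pairs are unique, each one maps to a distinct entry of the similarity matrix, so they could in principle be split across threads. The enumeration pass itself stays sequential in this sketch.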
Additional Bottlenecks
● The algorithm runs in multiple phases, and the next phase can start only when the previous phase has finished its processing.
Future Work
● Improve the efficiency of the program.
● User - Item implementation.
● Test on datasets other than movies.
What we have learned
● The concept of collaborative filtering and how to implement it.
● Locality sensitive hashing to partition data into groups.
● Data cleaning does help improve performance.
● We also ran the algorithm without LSH, computing the similarity between every pair of items.
  ○ With 803K ratings, the algorithm took more than 8 minutes.
  ○ With our implementation, it took 22 s on 12 cores and 3 minutes on a single core.
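The timings above translate into speedup and parallel efficiency as in this small sketch. It treats "3 minutes" as exactly 180 s, which is an assumption because the slide gives a rounded figure.

```python
def speedup_and_efficiency(t_single, t_parallel, cores):
    """Speedup = T(1) / T(p); efficiency = speedup / p."""
    speedup = t_single / t_parallel
    return speedup, speedup / cores

# 3 minutes on one core (taken as 180 s) versus 22 s on 12 cores.
speedup, efficiency = speedup_and_efficiency(180.0, 22.0, 12)
```

Under that assumption the speedup is roughly 8.2x on 12 cores, an efficiency of about 68%. That is in line with the sequential parts described in the bottleneck slides.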