Movie Recommendation - Computer Scienceark/654/team/5/presentation4.pdf · Movie Recommendation...

Movie Recommendation Team Valak Manpreet Kaur Hao Su

Overview

● Topic Review

● Sequential Design

● Parallel Design

● Strong Scaling

● Weak Scaling

● Future Work

● What we have learned

Topic Review

● Implement a movie recommendation system using Collaborative Filtering and Map-Reduce.

● Collaborative Filtering will find other users with similar taste to the target user and make recommendations based on their ratings.

Movie Recommendation System

● Phase 1: Item Partitioning - Items are divided into groups using the minHash technique.

● Phase 2: Intra Similarity Phase - Compute the item similarity for every pair of items t_i and t_j belonging to the same group.

● Phase 3: Inter Similarity Phase - Compute the item similarity for every pair of items t_i and t_j belonging to different groups, if the two items have received a rating from the same user.

● Phase 4: Find Recommendation - Find the items similar to the target user's top-rated items, using the similarity matrix constructed in the phases above.

Locality Sensitive Hashing

1. Build an m×n matrix in which an entry is 1 if the shingle (row) appears in the set (column), and 0 otherwise.

2. Permute the rows of the matrix from step 1 and, for each set, record in a new p×n signature matrix the row number of the first shingle that appears in that set under the permutation.

3. Repeat the row permutation MAX_PERMUTATIONS times to finish filling in the p×n signature matrix.

4. Choose a band size b (the number of signature rows per band) and compute a hash value for each band of each item.
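The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the names `minhash_signatures` and `lsh_buckets` are hypothetical, each set is given directly as the row indices that hold a 1 (so the m×n matrix is implicit), and the band tuple itself serves as the bucket key in place of an explicit hash value.

```python
import random
from collections import defaultdict

def minhash_signatures(sets, num_rows, num_perms, seed=0):
    """Steps 1-3. `sets` maps an item to the (non-empty) set of row
    indices in 0..num_rows-1 whose matrix entry is 1 for that item."""
    rng = random.Random(seed)
    signatures = {item: [] for item in sets}
    for _ in range(num_perms):
        # perm[r] is the position of original row r after the shuffle.
        perm = list(range(num_rows))
        rng.shuffle(perm)
        for item, rows in sets.items():
            # First row (in permuted order) that contains a 1 for this item.
            signatures[item].append(min(perm[r] for r in rows))
    return signatures

def lsh_buckets(signatures, band_size):
    """Step 4. Split each signature into bands of `band_size` rows and
    group the items whose band at the same position is identical."""
    buckets = defaultdict(set)
    for item, sig in signatures.items():
        for start in range(0, len(sig), band_size):
            band = tuple(sig[start:start + band_size])
            buckets[(start // band_size, band)].add(item)
    return buckets
```

Items with identical sets always receive identical signatures and therefore share every band; items with disjoint sets can never share a minHash value, so they never land in the same bucket.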

Sequential Algorithm

Sequential Design

● Phase 1: Run the partitioning phase to divide the N items into K buckets.

● Phase 2: Compute the similarity of every pair of items belonging to the same bucket.

● Phase 3: Compute the similarity of related item pairs that fall in different buckets.

● Phase 4: Select the closest K neighbors of the top-rated items of the target user.

● Recommend the top V movies by prediction.
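A sketch of Phases 2 and 4 in sequential form follows. The slides do not name the similarity measure or the prediction formula, so cosine similarity and a similarity-weighted score are assumptions here, as are the function names and the `top_n` parameter (how many of the user's top-rated items to expand).

```python
from itertools import combinations
from math import sqrt

def cosine(ratings_a, ratings_b):
    """Cosine similarity of two items, each given as {user: rating}."""
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = sqrt(sum(r * r for r in ratings_b.values()))
    return dot / (norm_a * norm_b)

def intra_bucket_similarity(buckets, item_ratings):
    """Phase 2: similarity of every item pair inside the same bucket."""
    sim = {}
    for members in buckets:
        for a, b in combinations(sorted(members), 2):
            sim[(a, b)] = cosine(item_ratings[a], item_ratings[b])
    return sim

def recommend(user_ratings, sim, top_n, k, v):
    """Phase 4: for each of the user's top_n rated items, take its k
    closest neighbors and return the v best-scoring unseen items."""
    top = sorted(user_ratings, key=user_ratings.get, reverse=True)[:top_n]
    scores = {}
    for liked in top:
        neighbors = []
        for (a, b), s in sim.items():
            if a == liked:
                neighbors.append((s, b))
            elif b == liked:
                neighbors.append((s, a))
        for s, other in sorted(neighbors, reverse=True)[:k]:
            if other not in user_ratings:
                # Assumed scoring rule: similarity weighted by the rating.
                scores[other] = scores.get(other, 0.0) + s * user_ratings[liked]
    return sorted(scores, key=scores.get, reverse=True)[:v]
```

Keys of `sim` are ordered pairs `(a, b)` with `a < b`, so each pair is stored once, matching a symmetric similarity matrix.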

Parallel Algorithm

Parallel Design

● Phase 1: Run the partitioning phase to divide the N items into K buckets. The signature matrix is computed in parallel.

● Phase 2: Compute the similarity of every pair of items belonging to the same bucket. Each bucket is handled by a separate thread.

● Phase 3: Compute the similarity of related item pairs that fall in different buckets. Each thread handles different related pairs.

● Phase 4: Select the closest K neighbors of the top-rated items of the target user. Each top-rated item of the user is handled by a separate thread.

● Recommend the top V movies by prediction.
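The Phase 2 decomposition (one bucket per task) might look like the sketch below. It assumes a thread pool from Python's standard library and a caller-supplied `similarity` function; both are illustrative choices, not the team's code. In CPython the global interpreter lock limits the speedup of pure-Python CPU work, so the sketch shows the task structure rather than the measured performance.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def bucket_pairs(members, similarity):
    """Similarity of every item pair inside one bucket."""
    return {
        (a, b): similarity(a, b)
        for a, b in combinations(sorted(members), 2)
    }

def parallel_intra_similarity(buckets, similarity, workers=4):
    """Phase 2 with one task per bucket. Each task builds its own
    partial result, so worker threads never write to shared state."""
    sim = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda m: bucket_pairs(m, similarity), buckets)
        for partial in partials:
            # Merging happens on the calling thread only.
            sim.update(partial)
    return sim
```

Because every task returns a private dictionary that is merged afterwards, no locking is needed on the similarity matrix in this phase.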

Strong Scaling

Weak Scaling

Bottleneck in Phase 1

● We cannot begin processing before reading the entire file. This is a sequential part, and it grows as the data size increases.

● Build an m×n matrix in which an entry is 1 if the shingle (row) appears in the set (column), and 0 otherwise.

● Permute the rows of the matrix from the previous step and, for each set, record in a new p×n signature matrix the row number of the first shingle that appears in that set under the permutation. This can only be done sequentially and takes more time as the data size increases.

● Repeat the row permutation MAX_PERMUTATIONS times to finish filling in the p×n signature matrix.

● Choose a band size b (the number of signature rows per band) and compute a hash value for each band of each item.

Bottleneck in Phase 3

● Phase 3 iterates over all users and computes the similarity between any two items that have received a rating from the same user.

● The outer loop over users cannot be parallelized. The same pair of items may have been rated by two different users, so if the similarity of that pair has not been computed yet, two threads may end up updating the same element of the similarity matrix.

● So only the items of a single user can be processed in parallel. This is worst when each user has few ratings (fewer than the number of threads).
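The loop structure described above can be sketched as follows: a sequential outer loop over users, with only one user's item pairs dispatched in parallel. The function name, the `bucket_of` mapping (one bucket per item) and the thread pool are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def inter_bucket_similarity(user_items, bucket_of, similarity, workers=4):
    """Phase 3 sketch. `user_items` maps a user to the items they rated;
    `bucket_of` maps an item to its Phase 1 bucket."""
    sim = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Sequential outer loop: users are processed one at a time so
        # that no pair is computed or written by two threads at once.
        for items in user_items.values():
            pending = [
                (a, b)
                for a, b in combinations(sorted(items), 2)
                if bucket_of[a] != bucket_of[b] and (a, b) not in sim
            ]
            # Parallel inner step: the pending pairs of one user are
            # all distinct, so each result has exactly one writer.
            values = pool.map(lambda pair: similarity(*pair), pending)
            sim.update(zip(pending, values))
    return sim
```

When a user has rated only a handful of items, `pending` is shorter than the number of worker threads and most of the pool sits idle, which is the worst case noted in the last bullet.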

Additional Bottlenecks

● The algorithm runs in multiple phases, and the next phase can start only when the previous phase has finished its processing.

Future Work

● Improve the efficiency of the program.

● User - Item implementation.

● Test on datasets other than movies.

What we have learned

● The concept of collaborative filtering and how to implement it.

● Locality sensitive hashing to partition data into groups.

● Data cleaning does help improve performance.

● We ran the algorithm without LSH, computing the similarity between each pair of items.

○ With 803K ratings, the algorithm took more than 8 minutes.

○ With our implementation, it took 22 s on 12 cores and 3 minutes on a single core.
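For reference, the two timings of the LSH-based implementation (22 s on 12 cores, 3 minutes on a single core) translate into speedup and efficiency as sketched below. The single-core time is taken as exactly 180 s, which is an assumption since the slide gives it only to the minute, and the Karp-Flatt serial fraction is an added metric that the slides do not report.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-core time to multi-core time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, cores):
    """Speedup divided by the number of cores (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / cores

def serial_fraction(t_serial, t_parallel, cores):
    """Karp-Flatt estimate of the fraction of the run that is serial."""
    s = speedup(t_serial, t_parallel)
    return (1 / s - 1 / cores) / (1 - 1 / cores)

t_1 = 3 * 60   # single-core run in seconds (reported as 3 minutes)
t_12 = 22      # 12-core run in seconds

print(round(speedup(t_1, t_12), 2))              # 8.18
print(round(efficiency(t_1, t_12, 12), 2))       # 0.68
print(round(serial_fraction(t_1, t_12, 12), 3))  # 0.042
```

A speedup of about 8.2 on 12 cores is consistent with a small but non-zero sequential portion, such as the file reading and phase barriers listed under the bottlenecks.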