Movie Recommendation - Computer Scienceark/654/team/5/presentation4.pdf · Movie Recommendation...

Movie Recommendation

Team Valak

Manpreet Kaur

Hao Su

Overview

● Topic Review

● Sequential Design

● Parallel Design

● Strong Scaling

● Weak Scaling

● Future Work

● What we have learned

Topic Review

● Implement a movie recommendation system using Collaborative Filtering and Map-Reduce.

● Collaborative Filtering will find other users with similar taste to the target user and make recommendations based on their ratings.

Movie Recommendation System

● Phase 1: Item Partitioning - Items are divided into groups using the MinHash technique.

● Phase 2: Intra Similarity Phase - Compute the item similarity for every pair of items t_i and t_j belonging to the same group.

● Phase 3: Inter Similarity Phase - Compute the item similarity for every pair of items t_i and t_j belonging to different groups, provided both items have received a rating from the same user.

● Phase 4: Find Recommendation - For the items the target user rated highest, find similar items using the similarity matrix constructed in the phases above.
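The intra similarity phase can be sketched as follows. The slides do not name the similarity measure, so cosine similarity over per-item rating dictionaries is an assumption here, and the names `cosine_similarity`, `intra_group_similarity`, and `item_ratings` are hypothetical.

```python
from itertools import combinations
from math import sqrt

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity of two items, each given as {user_id: rating}.

    Assumption: the slides do not specify the measure; cosine is one common choice.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def intra_group_similarity(group, item_ratings):
    """Phase 2 sketch: similarity for every pair of items in one group."""
    return {
        (i, j): cosine_similarity(item_ratings[i], item_ratings[j])
        for i, j in combinations(sorted(group), 2)
    }

# Toy data: items A and B share two raters, item C shares none.
item_ratings = {
    "A": {"u1": 5, "u2": 3},
    "B": {"u1": 4, "u2": 3},
    "C": {"u3": 2},
}
sims = intra_group_similarity({"A", "B", "C"}, item_ratings)
```

Pairs are keyed in sorted order so each pair is stored once, which matches the idea of a symmetric similarity matrix.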

Locality Sensitive Hashing

1. Build an m×n matrix where every shingle that appears in a set is marked with a 1, and 0 otherwise.

2. Permute the rows of the matrix from step 1 and build a new p×n signature matrix: for each set, record the row number of the first shingle to appear under that permutation.

3. Repeat the row permutation of the input matrix MAX_PERMUTATIONS times to finish filling in the p×n signature matrix.

4. Choose a band size b (the number of signature rows per band for each item) and compute a hash value for each band.
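The four steps above can be sketched in a few lines. This is a minimal illustration, not the team's implementation: the value of `MAX_PERMUTATIONS`, the function names, and the use of Python's built-in `hash` for the band hash are assumptions. Each item is represented by the set of row indices where its column holds a 1, and every set is assumed to be non-empty.

```python
import random

MAX_PERMUTATIONS = 20  # name from the slides; the value is an assumption

def minhash_signature(item_sets, num_rows, seed=0):
    """Steps 1-3: item_sets maps item -> set of row indices that hold a 1.

    Returns item -> list of MAX_PERMUTATIONS signature values.
    """
    rng = random.Random(seed)
    signatures = {item: [] for item in item_sets}
    for _ in range(MAX_PERMUTATIONS):
        perm = list(range(num_rows))
        rng.shuffle(perm)  # perm[r] is the new position of row r
        for item, rows in item_sets.items():
            # First row holding a 1 under this permutation.
            signatures[item].append(min(perm[r] for r in rows))
    return signatures

def lsh_buckets(signatures, band_size):
    """Step 4: hash each band of band_size signature rows into a bucket."""
    buckets = {}
    for item, sig in signatures.items():
        for start in range(0, len(sig), band_size):
            band = tuple(sig[start:start + band_size])
            key = (start // band_size, hash(band))
            buckets.setdefault(key, set()).add(item)
    return buckets

# Toy data: A and B have identical sets, C is disjoint from both.
item_sets = {"A": {0, 1, 2}, "B": {0, 1, 2}, "C": {3, 4}}
sigs = minhash_signature(item_sets, num_rows=5)
buckets = lsh_buckets(sigs, band_size=5)
```

Items with identical sets always receive identical signatures, so they share every bucket; the more two sets differ, the fewer bands they are likely to agree on.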

Sequential Algorithm

Sequential Design

● Phase 1: Run Partitioning phase to divide the N items into K buckets.

● Phase 2: Compute the similarity of every pair of items belonging to the same bucket.

● Phase 3: Calculate the similarity of related item pairs that fall in different groups.

● Phase 4: Select the closest K neighbors for the top ratings given by the target user.

● Recommend the top V movies by prediction.
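Phase 4 and the final recommendation step might look like the sketch below. The slides do not give the prediction formula, so the similarity-weighted average used here is an assumption, as are the function name `recommend` and the parameters `top_n`, `k`, and `v`.

```python
from collections import defaultdict

def recommend(user_ratings, similarity, top_n=2, k=2, v=2):
    """Sketch of Phase 4 plus the top-V recommendation.

    user_ratings: {item: rating} for the target user.
    similarity:   {(item_a, item_b): score}, each pair stored once.
    Assumption: predictions are similarity-weighted averages of the
    user's own ratings; the slides do not specify the formula.
    """
    neighbors = defaultdict(dict)
    for (a, b), score in similarity.items():
        neighbors[a][b] = score
        neighbors[b][a] = score

    # The target user's highest-rated items.
    top_items = sorted(user_ratings, key=user_ratings.get, reverse=True)[:top_n]

    weighted = defaultdict(float)
    weight_sum = defaultdict(float)
    for item in top_items:
        # K closest neighbors of this top-rated item.
        closest = sorted(neighbors[item].items(),
                         key=lambda pair: pair[1], reverse=True)[:k]
        for other, score in closest:
            if other in user_ratings or score <= 0:
                continue  # skip items already rated or not similar
            weighted[other] += score * user_ratings[item]
            weight_sum[other] += score

    predictions = {i: weighted[i] / weight_sum[i] for i in weighted}
    return sorted(predictions, key=predictions.get, reverse=True)[:v]

user_ratings = {"A": 5, "B": 2}
similarity = {("A", "C"): 0.9, ("A", "D"): 0.1, ("B", "D"): 0.8, ("A", "B"): 0.5}
recommended = recommend(user_ratings, similarity)
```

In this toy data, C is the closest unrated neighbor of the highly rated item A, so it ranks ahead of D, which is only close to the low-rated item B.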

Parallel Algorithm

Parallel Design

● Phase 1: Run the partitioning phase to divide the N items into K buckets. The signature matrix is computed in parallel.

● Phase 2: Compute the similarity of every pair of items belonging to the same bucket. Each bucket is handled by a separate thread.

● Phase 3: Calculate the similarity of related item pairs that fall in different groups. Each thread handles different related pairs.

● Phase 4: Select the closest K neighbors for the top ratings given by the target user. Each top rating of the user is handled by a separate thread.

● Recommend the top V movies by prediction.
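The bucket-per-thread idea in Phase 2 can be sketched with a thread pool. This is an illustration under stated assumptions: Jaccard similarity over the sets of users who rated each item stands in for the unspecified similarity measure, and all function names are hypothetical. Note that in standard CPython builds threads do not speed up CPU-bound work because of the global interpreter lock; swapping in `ProcessPoolExecutor` would be the usual alternative there.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def jaccard(users_a, users_b):
    """Assumed similarity measure: overlap of the users who rated each item."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0

def bucket_similarities(bucket, item_users):
    """All pairwise similarities inside one bucket."""
    return {
        (i, j): jaccard(item_users[i], item_users[j])
        for i, j in combinations(sorted(bucket), 2)
    }

def parallel_intra_similarity(buckets, item_users, num_threads=4):
    """Each bucket is an independent task; results are merged by the caller thread."""
    similarity = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        tasks = pool.map(lambda b: bucket_similarities(b, item_users), buckets)
        for partial in tasks:
            similarity.update(partial)
    return similarity

item_users = {
    "A": {"u1", "u2"},
    "B": {"u1", "u2", "u3"},
    "C": {"u4"},
    "D": {"u4", "u5"},
}
buckets = [{"A", "B"}, {"C", "D"}]
result = parallel_intra_similarity(buckets, item_users)
```

Because each worker returns its own partial dictionary and only the calling thread merges them, no two workers write to the same entry.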

Strong Scaling

Weak Scaling

Bottleneck in Phase 1

● We cannot begin processing before reading the entire file. This sequential part grows as the data size increases.

● Build an m×n matrix where every shingle that appears in a set is marked with a 1, and 0 otherwise.

● Permute the rows of that matrix and build a new p×n signature matrix: for each set, record the row number of the first shingle to appear under that permutation. This step can only be done sequentially and takes more time as the data size increases.

● Repeat the row permutation of the input matrix MAX_PERMUTATIONS times to finish filling in the p×n signature matrix.

● Choose a band size b (the number of signature rows per band for each item) and compute a hash value for each band.

Bottleneck in Phase 3

● Phase 3 iterates over all the users and computes the similarity between any two items that have received a rating from the same user.

● The outer loop over the users cannot be parallelized. The same two items may have been rated by two different users, so if the similarity between those items has not yet been computed, two threads may end up updating the same element in the similarity matrix.

● So only the items of a single user can be processed in parallel. This is worst when each user has few ratings (fewer than the number of threads).
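To illustrate why the user loop is hard to parallelize, the sketch below counts how many users produce each cross-group item pair. Any pair with a count above 1 is an element of the similarity matrix that two user-level threads could both try to write. The names `cross_group_pairs`, `user_items`, and `group_of` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cross_group_pairs(user_items, group_of):
    """Count how many users produce each item pair spanning two groups."""
    counts = Counter()
    for user, items in user_items.items():           # outer loop over users
        for i, j in combinations(sorted(items), 2):  # pairs co-rated by this user
            if group_of[i] != group_of[j]:
                counts[(i, j)] += 1
    return counts

user_items = {"u1": {"A", "C"}, "u2": {"A", "C", "D"}, "u3": {"B"}}
group_of = {"A": 0, "B": 0, "C": 1, "D": 1}

counts = cross_group_pairs(user_items, group_of)
contended = {pair for pair, n in counts.items() if n > 1}
```

Here both u1 and u2 rated A and C, so the pair (A, C) would be reached from two iterations of the outer loop.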

Additional Bottlenecks

● The algorithm runs in multiple phases, and the next phase can start processing only when the previous phase has finished its processing.

Future Work

● Improve the efficiency of the program.

● User - Item implementation.

● Test on datasets other than movies.

What we have learned

● The concept of collaborative filtering and how to implement it.

● Locality sensitive hashing to partition data into groups.

● Data cleaning does help improve performance.

● We also ran the algorithm without LSH, computing the similarity between each pair of items.

○ With 803K ratings, that version took more than 8 minutes.

○ With our implementation, it took 22 s on 12 cores and 3 minutes on a single core.
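The reported timings can be turned into standard scaling figures. The sketch below takes "3 minutes" as exactly 180 s, which is an assumption since the slide value is rounded, so the results are approximate. It computes speedup, parallel efficiency, and the Karp-Flatt experimentally determined serial fraction.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-core time to multi-core time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, cores):
    """Speedup divided by the number of cores."""
    return speedup(t_serial, t_parallel) / cores

def karp_flatt(t_serial, t_parallel, cores):
    """Karp-Flatt metric: (1/S - 1/p) / (1 - 1/p)."""
    s = speedup(t_serial, t_parallel)
    return (1 / s - 1 / cores) / (1 - 1 / cores)

t1 = 3 * 60   # single core: "3 minutes", assumed to be exactly 180 s
t12 = 22      # 12 cores: 22 s

print(f"speedup:         {speedup(t1, t12):.2f}")
print(f"efficiency:      {efficiency(t1, t12, 12):.2f}")
print(f"serial fraction: {karp_flatt(t1, t12, 12):.3f}")
```

Under that assumption the speedup is roughly 8.2 on 12 cores, an efficiency of about 0.68, with an estimated serial fraction of about 4%.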
