
zMovie
CSE 401: Senior Design
Jing Chen, jingchen@seas.upenn.edu
Anant Jatia, [email protected]
Faculty Advisor: Sudipto Guha

zMovie: The Movie Recommendation Engine

I. ABSTRACT

zMovie adds a new dimension to the movie-watching experience by providing real-time, personalized movie recommendations. It takes a collaborative, social-networking approach in which a user’s own tastes are combined with those of the entire community to generate meaningful results. Most existing movie services, such as IMDB (www.imdb.com), do not personalize their recommendations but simply provide an overall rating for each movie. This decreases the value of each recommendation because it does not cater to the individual preferences of the user. Unlike these systems, zMovie’s Recommendation Engine continually analyzes each user’s movie preferences and generates custom recommendations. The overall goal is to ease the movie discovery process.

zMovie is purely a movie recommendation service: it offers a list of movie suggestions based on previous user ratings. It is designed not for searching for movies but for discovering them through our recommendation process. Users rate movies they have seen; this data is then analyzed, and recommendations are returned to the user.

The core of our project, zMovie’s recommendation algorithm, is based on a cluster-smoothed collaborative filtering algorithm [2]. We have refined and tuned the parameters of this algorithm by comparing our predicted ratings against actual ratings using in-sample and out-of-sample techniques, as well as by analyzing live user feedback.


II. RELATED WORK

EXISTING PRODUCTS AND SYSTEMS

Current Social Networking World

Internet social networking sites, which began in 1995 with Classmates.com, have surged in popularity and use through word-of-mouth advertising. Since then, a wide range of virtual communities has formed, serving different purposes and targeting varying niche audiences:

• Professionals: ActiveRain (real estate), Ecademy (business), LinkedIn (business), Orkut (Google)
• Cultural Communities: BlackPlanet.com (African Americans), Cyworld (South Korean), Hyves (Dutch), IRC-Galleria (Finland), iWiW (Hungary), LunarStorm (Sweden), MiGente.com (Latinos)
• Academic: Bebo (schools), Classmates.com (public), Facebook (public), Xuqa (colleges)
• Blogs: Blurty, Livejournal, Xanga, Vox, Windows Live Spaces, Tabulas
• Photos: Flickr, MyPhotoBucket, Webshots, Picasa
• Relationships: Match.com, Multiply
• Music: Last.fm, Mystrands, Bolt, Goldmic, Music Forte, Mymidishare
• Reviews: Tribe, Rateitall, Chowhound, Yelp
• Video: YouTube, Blinkx, Akimbo, Dave.tv, Brightcove
• Bookmarking: del.icio.us, digg, reddit
• Movies: Yahoo! Movies, MovieLens, Flixster, Blockbuster/Netflix

Social Movie Platforms

In particular, we have chosen to explore the movie niche, as this is an area where our project can provide significant improvements over existing products and systems. Traditional movie websites (IMDB, AOL Movies) function by providing global user ratings for the movies in their databases. Movies are categorized by metadata such as genre, era, and director. Users can search for movies, browse lists, and read reviews written by critics or other users. However, most of these services lack any personal recommendation system and have not taken advantage of social-networking communities or crowd wisdom.

Some websites, such as Blockbuster, do provide individualized recommendations based on a user’s ratings but do not include any social networking component. Yahoo! Movies goes further and uses personal ratings to suggest movies currently playing in theaters, on TV, and out on DVD. It also draws upon its vast user base to give lists of similar movie fans, their ratings, and reviews. Other movie sites, like Flixster, take a different approach. Flixster forms web-based communities around movies and suggests movies to watch based on what your friends have rated.

Recommender Systems

Two conventional paradigms applied to recommender systems and user preference prediction are collaborative filtering (CF) and content-based (CB) filtering [1]. Collaborative filtering makes recommendations for a given user by aggregating the rating information of similar users in a historical database. Content-based systems, on the other hand, provide recommendations by comparing representations of the content contained in a given item (e.g. a book, movie, or song) with representations of content that match a given user profile. While CB systems can characterize users more uniquely, CF systems have been more successful because they do not require content to be associated with items and can provide recommendations that are relevant to a user even when the items share no content with the user’s profile [1, 2]. For these reasons, we have chosen to focus primarily on CF systems.

There have been two primary kinds of CF systems (see Appendix A for illustrative diagrams of each recommender system):

1. User-Based CF: In User-Based CF [13], a target user’s choices are compared with those of other users in the database to identify a group of “similar minded” people. Once this group is identified, content highly rated by the group is recommended to the target user. Limitations of this approach include:

• Bias towards what has already been recommended or chosen - frequent recommendation of the most popular items and poor discovery of new items
• “Cold start” problem - items must be chosen by users before a recommendation can be made; inability to recommend new content
• Potentially poor recommendation quality - the system only accounts for a user’s pattern of choices (ratings on items) with no understanding of the underlying content behind the data (attributes of an item)
• Extremely data-intensive - requires a large number of user choices and ratings, often 30 or more, before reasonable recommendations can be made
• Difficult to analyze in real time - especially as the number of users in the database grows

2. Item-Based CF: Item-Based CF [13] examines each item on the target user’s list of chosen/rated items and finds other items in the choice set that seem similar to it. In this case, similarity can be determined from predefined attributes (e.g. movie genre, lead actors, director) and/or by calculating correlations between items. The main advantage of Item-Based CF over User-Based CF is increased scalability: items can be classified or pre-scored based on explicit attributes, making recommendation computation much faster. Limitations of this model include:

• “Cold start” problem - as in User-Based CF
• Often highly inaccurate recommendations - due to unavailability of attributes for each item and disregard for group patterns
• Extremely data-intensive - requires a large number of user choices/ratings before meaningful patterns can be identified
• Cannot be applied across content domains - requires users to choose/rate items in each domain separately
• Difficult to analyze in real time - especially as the number of users in the database grows
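As a rough illustration of the correlation-style variant of Item-Based CF described above, the following Python sketch scores item-to-item similarity over the users who rated both items. The nested-dictionary layout of `ratings`, the function names, and the use of cosine similarity are assumptions made for this example; none of them is taken from the systems discussed in this section.

```python
from math import sqrt

def item_similarity(ratings, item_a, item_b):
    """Cosine similarity between two items over users who rated both.

    ratings: dict mapping user -> dict mapping item -> rating (assumed layout).
    Returns 0.0 when no user has rated both items.
    """
    common = [u for u, r in ratings.items() if item_a in r and item_b in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][item_a] * ratings[u][item_b] for u in common)
    norm_a = sqrt(sum(ratings[u][item_a] ** 2 for u in common))
    norm_b = sqrt(sum(ratings[u][item_b] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def similar_items(ratings, target_item, top_n=3):
    """Rank every other item by its similarity to target_item."""
    items = {i for r in ratings.values() for i in r}
    scored = [(item_similarity(ratings, target_item, i), i)
              for i in items if i != target_item]
    # Highest similarity first; ties broken by item name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:top_n]]
```

Because the item-to-item scores depend only on the stored ratings, they could be computed ahead of time, which is the scalability advantage noted above.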

In addition, there are two main approaches to CF systems [2]:


1. Memory-based CF: Under this approach, similarity between two users is determined by comparing the users’ ratings on a set of items, often the entire database. This suffers from three fundamental problems:

• Data sparsity - most users rate only a small number of items
• First rater problem - unrated items (new or obscure items) cannot be recommended
• Scalability - computational difficulty when dealing with a large number of users and items

Examples: Pearson correlation-based approach [3], vector similarity-based approach [4], extended generalized vector-space model [5]

2. Model-based CF: In this case, users are grouped into a small number of classes based on rating patterns. Users are categorized into one or more predefined classes, and predictions are made based on the ratings those classes would give to particular items. However, this approach also has its drawbacks:

• Time-consuming to build and update
• Unable to cover a diverse user range

Examples: Bayesian network approach [4], clustering [6, 7], aspect models [8], mixture models [11]

Current Implementations

In particular, we have examined existing work related specifically to movie recommendation systems such as Yahoo! Movies, MovieLens, Flixster, and Netflix. We elaborate on two contrasting systems in more depth below:

Yahoo! Movies (2005)

Yahoo! Movies uses ChoiceStream’s algorithm, Attributized Bayesian Choice Modeling (ABCM), which attempts to understand not only what users like but why they like it (a kind of hybrid user-based and item-based CF algorithm). It makes personalized recommendations in three stages [13]:

1. Automatically classifies content according to explicit and implicit attributes deemed to be the most powerful predictors of user preferences

2. Creates a rich, accurate profile of each user’s preference for those attributes

3. Matches users with content based on their preferences

As a result, this algorithm mitigates many of the existing limitations in traditional user-based and item-based CF models. It allows for easier handling of new items, a broad diversity of item recommendations (including new content), and faster learning of individual user preferences. A pre-calculated summary of user tastes enables quick ranking of items in a given choice set (group of items), making the algorithm more lightweight and scalable.

MovieLens

MovieLens is a free service provided by GroupLens Research at the University of Minnesota. It utilizes a traditional CF system which collects the movie preferences of individual users and groups similar users together [14]. By comparing users, it then attempts to make predictions on movies not yet seen by individuals. MovieLens lets anyone sign up online and then displays 10 movies per page for users to rate. It requires at least 15 ratings from each new user in order to generate predictions and give the user access to the full MovieLens service. From there, the user can easily update his or her profile, view the most often rated movies, tag ratings with keywords, view newly released movies, and so on. The interface enables the user to easily see each movie alongside his or her predicted rating.


III. TECHNICAL APPROACH

OUR PROPOSED PROJECT

From the above discussion, it is evident that CF algorithms are continually being improved and revised within both academic institutions and industry, and are broadly applied to many categories, not just movies. Companies, especially web retailers like Amazon or Netflix, are constantly developing more sophisticated recommender systems and are delving deeply into machine learning and artificial intelligence. In fact, in October 2006 Netflix announced that it would give away $1 million to anyone who can write an algorithm that improves its movie recommendation service [15]. This public “Netflix Prize” contest not only reemphasizes the industry demand for better recommender algorithms but also illustrates the trend of more companies seeking “crowd wisdom” to improve their products and service offerings.

Due to our limited resources, zMovie is a far more focused and narrowly scoped project, combining different aspects seen in existing movie recommender systems (Yahoo! Movies and MovieLens). The project essentially assesses and analyzes the results of varying different parameters in a more sophisticated user-based/model-based CF algorithm, as opposed to making comparisons between users across the entire database (memory-based CF). Overall, the implementation process can be separated into two stages:

1. Algorithm development, testing, and improvement: The core of the project lies in evaluating the model-based CF algorithm (e.g. comparing actual vs. predicted user ratings in our database). We have chosen to run this algorithm on existing “clean” datasets provided by the MovieLens project. This allows us to concentrate our efforts on revising the algorithm as opposed to data gathering and parsing.

2. Front-end new user predictions and recommendations: The second stage involves developing a front-end Windows client application, allowing a new user to log in, rate a certain number of movies, and immediately receive (1) a list of movie recommendations and (2) predicted ratings on new unrated movies. Upon completing the first stage, our goal is to optimize the feedback time for the second stage to make the system more scalable, especially as the number of users in the database grows.


SYSTEM ARCHITECTURE

zMovie System Overview

[Figure: zMovie Recommendation System Overview. Components shown: User; Movie; IMDB Database of Movies; User Ratings (rating of 1-5 per movie); Global Users; Global Movie Database; User Profile; Global User-Movie-Rating Table; Core System Engine (User/Model-Based Collaborative Filtering Algorithm Using Clustering); Movie Recommendations; Predicted Ratings vs. Actual Ratings (base data vs. test data); Live User Feedback “Sanity Check”; Analysis of Results; Refine/Retune Algorithm.]

The zMovie system has been built as a Windows application combining C# and Microsoft SQL Server. It has been organized into a three-tier architecture in order to make the system modular:

1. Data Layer
2. zMovie Recommendation Engine Layer
3. User Interface Layer


1. Data Layer

This consists of the back-end database in which all the data is stored (provided by the MovieLens project). We will be using Microsoft SQL Server as our database engine. The database will store the global list of movies available to rate or recommend along with corresponding metadata. It will also store user profiles for individual users. These two will come together in a ratings table where users express their preferences for movies they have seen.

In running our Recommendation Engine, we will be dividing the global database (e.g. MovieLens DataSet 1: 943 users; 1682 items; integer ratings between 1 and 5; total of 100,000 user-item-rating links) into two groups:

• “Base” dataset (e.g. 50% of users, ~50,000 user-item-rating links): This dataset is used to build the Recommendation Engine.
• “Test” dataset (e.g. the rest of the dataset, on different users not in the “base” dataset): User rating predictions are conducted on this dataset, which is used to check the accuracy of the implemented algorithm.
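The user-level split described above can be sketched in a few lines of Python. This is only an illustration: the tuple layout of `ratings`, the function name `split_users`, and the fixed random seed are assumptions of this example rather than details of the zMovie implementation, which is written in C#.

```python
import random

def split_users(ratings, base_fraction=0.5, seed=0):
    """Partition users (not individual ratings) into base and test sets.

    ratings: list of (user_id, item_id, rating) tuples (assumed layout).
    Returns (base_ratings, test_ratings); no user appears in both.
    """
    users = sorted({user for user, _, _ in ratings})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(users)
    cutoff = int(len(users) * base_fraction)
    base_users = set(users[:cutoff])
    base = [row for row in ratings if row[0] in base_users]
    test = [row for row in ratings if row[0] not in base_users]
    return base, test
```

With 943 users and a fraction of 0.5, the cutoff is 471 users in the base set and 472 in the test set. Splitting by user rather than by rating is what guarantees that every test user is new to the engine.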

For a more detailed description of each of the Tables and Fields, refer to Appendix B.

2. zMovie Recommendation Engine Layer

This is where the core algorithm for making recommendations on movies is implemented. The zMovie Recommendation Engine focuses on enhancing the CF technique employed by the Ringo personalized music recommendation system [9] by drawing on a hybrid memory-based and model-based approach through clustering. We chose this algorithm in particular because it addresses the fundamental flaws seen in traditional memory-based CF systems. Comparing users’ ratings on items across the entire user database is both inefficient and expensive, especially as the database of users grows. Instead, users in the “base” dataset are grouped into a smaller set of K unique “clusters” (K groups of similar users). The hypothesis is that users in the same cluster have similar movie preferences and enjoy or dislike the same set of movies. A new user who enters the database is prompted to rate a certain number of movies (say M). Once this is completed, the user is placed into one of the existing clusters based on similarity to the users of that cluster, and recommendations are given based on the items rated most highly by users of that cluster.

(1) Clustering Users: Cluster-Based Collaborative Filtering Algorithm

This algorithm creates K user clusters on the “base” dataset using a K-means algorithm, which works in the following manner:

Input: Integer K, number of desired clusters

1. On the first pass of the “base” dataset, K users are chosen at random to serve as the centroids, or mean points, of K unique clusters.

2. Each remaining user in this “base” dataset is compared to the centroids of the clusters using the following Similarity Function between user u (the user to be clustered) and user u’ (the centroid of a cluster), based on the traditional Pearson correlation-coefficient function [2]:

\[
\mathrm{sim}(u, u') = \frac{\sum_{t \in T(u) \cap T(u')} \left(R_u(t) - \bar{R}_u\right)\left(R_{u'}(t) - \bar{R}_{u'}\right)}{\sqrt{\sum_{t \in T(u) \cap T(u')} \left(R_u(t) - \bar{R}_u\right)^2}\,\sqrt{\sum_{t \in T(u) \cap T(u')} \left(R_{u'}(t) - \bar{R}_{u'}\right)^2}}
\]

where R_u(t) represents the rating of item t by user u, \bar{R}_u represents the average rating for user u, and T(u) represents the set of all items rated by user u. Between any two users, this function compares the deviations of each user’s item ratings from that user’s average rating across all items, using only the items rated by both users. Each user is then “clustered” into the group with which it has the highest similarity. If the similarity function cannot be calculated (e.g. 0 in the denominator), this user becomes the centroid of its own new cluster. For each cluster, similarities between all users within the cluster are tracked and updated as each new user joins the cluster.

3. After all the users have been “clustered” into K groups in this first pass, the centroid (mean point representative user) of each cluster is recalculated. This mean point representative user is determined by being the most similar to all other users within the cluster. In other words, if cluster A has 23 users, for a given user u, the similarities between user u and all other 22 users are determined and then summed. This is done for each user in cluster A, and the user with the greatest sum becomes the centroid of that cluster.

4. Once the new centroids have been determined, the users in the “base” dataset are reclustered using the new centroids. (Repeat step 2 with non-random, pre-selected centroids.)

5. This process of recalculating centroids and reclustering is repeated a certain number of times (say P passes) until the final K clusters are most representative of the given dataset (and distinctly unique from each other).
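The five steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project’s C# code: users are held in a dictionary mapping a user id to an item-to-rating dictionary, the names `pearson`, `assign`, `recenter`, and `cluster_users` are invented for this example, and a user seeds its own cluster only when its similarity to every current centroid is undefined (one reading of step 2).

```python
import random
from math import sqrt

def pearson(ra, rb):
    """Pearson similarity over co-rated items; means use all of a user's ratings.

    ra, rb: dict item -> rating. Returns None when the denominator is zero.
    """
    common = set(ra) & set(rb)
    mean_a = sum(ra.values()) / len(ra)
    mean_b = sum(rb.values()) / len(rb)
    num = sum((ra[t] - mean_a) * (rb[t] - mean_b) for t in common)
    den = (sqrt(sum((ra[t] - mean_a) ** 2 for t in common))
           * sqrt(sum((rb[t] - mean_b) ** 2 for t in common)))
    return num / den if den else None

def assign(users, centroids):
    """Step 2: place each user in the cluster of its most similar centroid."""
    clusters = {c: [c] for c in centroids}
    for u in users:
        if u in clusters:
            continue  # centroids already belong to their own cluster
        sims = [(pearson(users[u], users[c]), c) for c in centroids]
        sims = [(s, c) for s, c in sims if s is not None]
        if sims:
            best = max(sims, key=lambda pair: pair[0])[1]
            clusters[best].append(u)
        else:
            clusters[u] = [u]  # similarity undefined: user seeds a new cluster
    return clusters

def recenter(users, members):
    """Step 3: the new centroid is the member with the largest summed similarity."""
    def total(u):
        return sum(pearson(users[u], users[v]) or 0.0 for v in members if v != u)
    return max(members, key=total)

def cluster_users(users, k, passes, seed=0):
    """Steps 1-5: random initial centroids, then alternate assignment and recentering."""
    rng = random.Random(seed)
    centroids = rng.sample(sorted(users), k)
    clusters = assign(users, centroids)
    for _ in range(passes - 1):
        centroids = [recenter(users, m) for m in clusters.values()]
        clusters = assign(users, centroids)
    return clusters
```

Because the centroid is always an actual user rather than an averaged point, this procedure behaves like a medoid-based variant of K-means. Note that in this sketch a self-seeded cluster can push the number of clusters above K.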

Running time: O(P*N*k), where P = number of passes through the K-means algorithm, N = number of users to be clustered (~471), and k = work done to cluster a given user (a function of the number of items the user has rated relative to another user)

(2) Making a User Rating Prediction

Input: User ua, Set of M rated movies, Item t
Output: Prediction on item t

1. For given user ua and set of M ratings, the user is clustered according to the process outlined in (1) based on the Similarity Function to the “best” Cluster A (out of all K clusters).

2. The prediction on item t for the given user ua is computed over the top N most similar users (called neighbors) within the cluster. It is a weighted average of deviations across user ua’s N most similar neighbors, looking at items rated by both user ua and each neighbor. The resulting value is returned.
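One common way to realize a similarity-weighted average of deviations is the form used in GroupLens-style CF: the active user’s mean rating plus the weighted mean of each neighbor’s deviation from that neighbor’s own mean. The Python sketch below follows that form as an assumption; the exact function used by zMovie is not reproduced here. The name `predict`, the precomputed `(similarity, ratings)` pairs, the restriction to neighbors who have rated the item, the absolute value in the denominator, and the fallback to the user’s mean are all choices made for this example.

```python
def predict(active_ratings, neighbors, item, n=20):
    """Predict the active user's rating of `item` from cluster neighbors.

    active_ratings: dict item -> rating for the active user.
    neighbors: list of (similarity, ratings_dict) pairs for users in the
               active user's cluster (similarities assumed precomputed).
    """
    mean_active = sum(active_ratings.values()) / len(active_ratings)
    # Keep the n most similar neighbors that have actually rated the item.
    rated = [(s, r) for s, r in neighbors if item in r]
    top = sorted(rated, key=lambda pair: pair[0], reverse=True)[:n]
    # Each neighbor contributes its deviation from its own mean rating.
    num = sum(s * (r[item] - sum(r.values()) / len(r)) for s, r in top)
    den = sum(abs(s) for s, _ in top)
    if den == 0:
        return mean_active  # no usable neighbors: fall back to the user's mean
    return mean_active + num / den
```

Working with deviations rather than raw ratings compensates for users who rate systematically high or low.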

The running time is a function of the K clusters, the number of items rated by the active user, and the number of items rated by each user in Cluster A.

(3) Making User Recommendations

Input: User ua, Set of M rated movies
Output: List of top 10 movies to watch and list of top 10 movies to avoid

1. For given user ua and set of M ratings, the user is clustered according to the process outlined in (1) above based on the Similarity Function to the “best” Cluster A (out of all K clusters).


2. Once the optimal Cluster A is determined, the top 10 movies are selected as the top 10 most highly rated items by the users within Cluster A. The bottom 10 movies are selected as the worst 10 rated items by the users within Cluster A.
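A minimal Python sketch of this selection step follows. It assumes that “most highly rated” means highest mean rating among the cluster’s members and that movies the active user has already rated are excluded; both are assumptions of this example, as are the function name `recommend` and the data layout.

```python
from collections import defaultdict

def recommend(cluster_ratings, seen, top_n=10):
    """Return (best, worst) item lists ranked by mean rating within a cluster.

    cluster_ratings: list of dict item -> rating, one dict per cluster member.
    seen: set of items the active user has already rated (excluded here,
          which is an assumption of this sketch).
    """
    totals = defaultdict(lambda: [0.0, 0])  # item -> [sum of ratings, count]
    for ratings in cluster_ratings:
        for item, rating in ratings.items():
            if item not in seen:
                totals[item][0] += rating
                totals[item][1] += 1
    means = {item: s / c for item, (s, c) in totals.items()}
    # Highest mean first; ties broken by item id for determinism.
    ranked = sorted(means, key=lambda item: (-means[item], item))
    return ranked[:top_n], ranked[::-1][:top_n]
```

A plain mean lets an item rated by a single cluster member rank at the very top or bottom, so a minimum rating count per item would be a natural refinement.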

The running time is a function of the K clusters and the number of items rated by all users in Cluster A.

3. User Interface Layer

This is the front-end of the application with which the user interacts. The front-end will have two areas:

1. Test interface: Users will be able to see the performance of the algorithm on the previously collected “test” dataset.

2. Live Recommendations: A new user will be able to create a profile and rate a small set of M movies. The application will then provide real-time recommendations of movies that the user would like.

These three tiers will be able to communicate with each other. The user interface will request recommendations from the zMovie Recommendation Engine, which will pull relevant data from the database. The user interface can also communicate directly with the data layer when performing operations such as creating user profiles.

METHODOLOGY

1. The Dataset

The initial dataset we are using is MovieLens data (http://www.grouplens.org/taxonomy/term/14) that consists of 100,000 ratings (scale of 1-5) for 1682 movies by 943 users. Each user in the dataset has rated at least 20 movies. We first partition the dataset by randomly selecting 50% of the users as our “base” dataset and the remaining users as our “test” dataset. This partitioning process divides the users in the dataset into two disjoint subsets.

2. Building the Engine


To generate the clusters using the algorithm discussed above, we provide the Recommendation Engine with the entire “base” dataset. The Engine randomly selects K users to be the first centroids of the K clusters. It then applies the K-means algorithm to generate the final clusters. These final clusters are stored in the database so that they can be easily referenced. Thus, the formation of the clusters occurs only once.

3. Making Recommendations

When a new user arrives in the system through the User Interface layer, the user first rates an initial set of M movies. The Recommendation Engine then matches this user with one of the previously created and stored clusters. Once a user is assigned to a cluster, the Recommendation Engine provides appropriate movie recommendations from that cluster. Thus, the process of giving recommendations to a new user is relatively quick, as the user is only compared to K clusters.

4. Analyzing the Algorithm

To analyze the accuracy of the algorithm, we test the system on the “test” dataset that we separated earlier. For each user in the “test” dataset, we randomly pick M movies out of the total set of movies that the user has rated and supply those ratings to the Recommendation Engine, as if a new user had arrived in the system. The Recommendation Engine then predicts ratings for all other movies (those originally rated by the user but not yet supplied to our engine). These predicted ratings are compared with the actual ratings to determine the accuracy of the algorithm. The effectiveness of the algorithm is determined by measuring the Mean Absolute Error (MAE):

\[
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| p_i - r_i \right|
\]

where n is the number of predicted ratings, p_i is the i-th predicted rating, and r_i is the corresponding actual rating.
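The evaluation procedure can be sketched in Python as two small helpers: one that withholds all but M of a test user’s ratings, and one that computes the MAE between predicted and actual ratings. The function names, the dictionary layout, and the fixed seed are assumptions of this example.

```python
import random

def hold_out(user_ratings, m, seed=0):
    """Split one test user's ratings into M supplied ratings and the withheld rest.

    user_ratings: dict item -> rating for a single user (assumed layout).
    """
    rng = random.Random(seed)
    items = sorted(user_ratings)
    supplied = set(rng.sample(items, m))
    given = {t: user_ratings[t] for t in items if t in supplied}
    withheld = {t: user_ratings[t] for t in items if t not in supplied}
    return given, withheld

def mean_absolute_error(predicted, actual):
    """MAE over the items present in both dicts (item -> rating)."""
    items = predicted.keys() & actual.keys()
    if not items:
        raise ValueError("no overlapping items to compare")
    return sum(abs(predicted[t] - actual[t]) for t in items) / len(items)
```

In use, the `given` ratings would be passed to the engine as a new user’s profile, and the engine’s predictions for the `withheld` items would be scored with `mean_absolute_error`.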

The lower the MAE, the more accurate the system.

5. Choosing the Parameters

The zMovie Recommendation Engine is dependent on a number of key parameters:

| Parameter | Description | Values |
|---|---|---|
| Global Dataset Split | % in “Base” dataset, % in “Test” dataset | 50-50 split (based on users) |
| K | Number of clusters | K = 10 |
| P | Number of passes under K-means algorithm | P = 7 |
| N | Number of most similar neighbors to active user within cluster to select | N = 20 |
| M | Minimum number of ratings required to make recommendations to a user | M = 10 |
| λ | λ = 0: basic Pearson correlation-coefficient algorithm; λ = 1: basic cluster-based CF algorithm using the average rating of the cluster for similarity computation | λ = 0 |

In order to determine these values, we tested the system by varying one parameter while holding all other variables constant. However, the parameters are not perfectly independent of one another, and the initial random selection of centroids on the first pass introduces additional variation. As a result, there is some noise in the data, with parameter values depending largely on the size of the global dataset. The values provided in the table above represent those that resulted in the lowest MAE.

PRINCIPAL TECHNICAL CHALLENGES

Below we detail some principal technical challenges we encountered during this project and how we resolved each issue:

Data gathering and parsing: How will we obtain clean data that clearly separates user information, item ratings, and the links between users, items, and ratings?

Originally, we had planned on developing a music recommender system by collecting user ratings on Apple iTunes. However, we quickly realized that this meant devoting a large part of our efforts to gathering “clean” data rather than focusing on the core algorithm. After finding the MovieLens datasets and discussing this issue with Professor Guha, we decided to switch our recommender system to movies.

Data storage and management: How do we store data in a manner that increases efficiency? Can the algorithm run entirely in memory, or would we need to write database stored procedures?

• Careful C# class design of the back-end Recommendation Engine - This was crucial in making our recommender system as modular and flexible as possible (detailed in Appendix C). The Microsoft .NET platform and its comprehensive C# API library allowed us to design our algorithm more easily without “reinventing the wheel.”
• Efficient data structure utilization - We use different data structures depending on the purpose: hashtables to perform fast key lookups, lists for sorting, and so on.
• Dynamic updates - When possible, methods for each class are written to perform updates incrementally as opposed to doing a large one-time computation at the very end. For instance, the similarities between users of a given cluster are updated when a new user is added to the cluster; they are not calculated separately. This enables the algorithm to retrieve similarity values quickly.
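The incremental-update idea can be illustrated with a small Python class (the project itself is written in C#; the class name, attribute names, and the pluggable `similarity` callable are all assumptions of this sketch). Each time a user joins, only the new user’s similarities are computed, and every existing member’s running total is adjusted once.

```python
class Cluster:
    """Keeps per-member similarity sums up to date as users are added."""

    def __init__(self, similarity):
        self.similarity = similarity  # callable (user_a, user_b) -> float
        self.sim_sum = {}             # member -> sum of similarities to other members

    def add(self, user):
        """Add a user, touching each existing member's running total once."""
        total = 0.0
        for member in self.sim_sum:
            s = self.similarity(user, member)
            self.sim_sum[member] += s
            total += s
        self.sim_sum[user] = total

    def centroid(self):
        """Member with the greatest summed similarity to the rest of the cluster."""
        return max(self.sim_sum, key=self.sim_sum.get)
```

With the totals maintained this way, choosing the centroid after a pass is a single scan over the members instead of a fresh all-pairs computation.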

Algorithm selection and definition: How can we interpret supplied information and make the “best” recommendation for an individual user? How do we specify the parameters to our algorithm? Using hybrid CF model: After performing research on recommender systems, we chose to

implement a user-based / model-based CF algorithm based on clustering as opposed to making comparisons between users across the entire database (seen in traditional memory-based CF). While this algorithm does have its limitations, we focused on refining the parameter inputs based on the size of the initial “base” dataset.

Simulate “real world” data scarcity:

o 50-50 split: We decided on a 50-50 split in order to evaluate the results under the most realistic conditions. After all, data is scarce in the real world, and recommender systems must make suggestions from very little information (relative to the database) for any given user. The MovieLens dataset actually provides an 80-20 split, but the users in its “base” dataset and “test” dataset overlap and are non-unique. Using that split would have skewed the measured performance of our algorithm in an overly positive direction, so this approach was avoided.

o Focus on smaller dataset: While MovieLens provides a larger database of 1 million ratings, 3900 movies, and 6040 users, we wanted to evaluate recommendation and rating prediction performance when given only a small “base” dataset.
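A 50-50 split in which no user appears in both halves could be produced along the following lines. This is a sketch under the assumption that the split is made by user; the function name `split_by_user`, the seed, and the toy ratings are hypothetical.

```python
import random
from typing import List, Set, Tuple

Rating = Tuple[int, int, int]  # (userID, itemID, rating)


def split_by_user(ratings: List[Rating], base_fraction: float = 0.5,
                  seed: int = 0) -> Tuple[List[Rating], List[Rating]]:
    """Assign whole users to "base" or "test" so the two sets never share a user."""
    user_ids = sorted({u for u, _, _ in ratings})
    rng = random.Random(seed)
    rng.shuffle(user_ids)
    cut = int(len(user_ids) * base_fraction)
    base_users: Set[int] = set(user_ids[:cut])
    base = [r for r in ratings if r[0] in base_users]
    test = [r for r in ratings if r[0] not in base_users]
    return base, test


# Toy data: 4 users with 2 ratings each (values are made up for illustration).
toy = [(u, i, 3 + (u + i) % 3) for u in range(1, 5) for i in (10, 11)]
base, test = split_by_user(toy)
```

Changing `seed` yields a different randomized split, which is one way to repeat an experiment on several independent 50-50 splits.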

Optimization of running time: Can finding similar users within a database be done fast? For a new user, how can we ensure we provide predicted ratings / recommended movies in a reasonably fast manner?

Clustering technique- By clustering users into groups, a great deal of computation time is saved, as comparisons no longer need to be performed across all users in the database.

Performing as much as possible in memory- We minimized expensive database-related operations. For instance, by removing all database writes, we reduced the running time from a few hours to minutes. The slowness of the initial algorithm was mainly due to our inexperience with handling large datasets.
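To give a rough sense of the savings from clustering, the arithmetic below counts user-to-user comparisons with and without clusters. The user count, the helper names, and the assumption of equal-size clusters are illustrative only and are not taken from the project.

```python
def pairwise_comparisons(n_users: int) -> int:
    """Number of distinct user pairs among n_users."""
    return n_users * (n_users - 1) // 2


def clustered_comparisons(n_users: int, k: int) -> int:
    """Pairs compared when users fall into k equal-size clusters
    (assumes n_users is divisible by k)."""
    per_cluster = n_users // k
    return k * pairwise_comparisons(per_cluster)


n_users, k = 1000, 10  # hypothetical sizes
all_pairs = pairwise_comparisons(n_users)            # 499500
within_clusters = clustered_comparisons(n_users, k)  # 49500
```

Under these assumptions the within-cluster work is roughly one tenth of the all-pairs work; real clusters are rarely equal in size, so the actual ratio would differ.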

Assessing quality of algorithm: How do we evaluate how well our algorithm performs? Can this be done in a manner without requiring feedback from a new user?

In-sample and out-of-sample testing- By clearly separating the “base” dataset and the “test” dataset, we could build our Recommendation Engine with the first dataset and make predictions on the second (and later compare them to the actual ratings). This provides an objective method of assessing the quality of the recommender system.

Live user input and feedback- In addition to the above, zMovie also enables a new user to create a profile, rate a certain number of movies, and get recommendations / predictions immediately. To assess this method, however, we will need to record and evaluate each user’s subjective opinions of the system as a whole.
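The comparison of predicted against actual ratings is summarized in this report as mean absolute error (MAE). A minimal sketch of that metric is given here; the function name, the dictionary layout keyed by (userID, itemID), and the sample values are assumptions made for illustration.

```python
from typing import Dict, Tuple

Key = Tuple[int, int]  # (userID, itemID)


def mean_absolute_error(predicted: Dict[Key, float], actual: Dict[Key, int]) -> float:
    """Average |predicted - actual| over the pairs present in both tables."""
    shared = predicted.keys() & actual.keys()
    if not shared:
        raise ValueError("no overlapping (user, item) pairs to compare")
    return sum(abs(predicted[k] - actual[k]) for k in shared) / len(shared)


# Made-up values for illustration only.
actual = {(1, 10): 4, (1, 11): 2, (2, 10): 5}
predicted = {(1, 10): 3.5, (1, 11): 3.0, (2, 10): 5.0}
mae = mean_absolute_error(predicted, actual)  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```

A lower MAE means the predictions sit closer to the ratings users actually gave.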

Division of work and integration of end-to-end system: How do we easily integrate the back-end Recommendation Engine with the front-end user interface? What design choices should we be making?

Three-tier architecture- By having three distinct layers, we could easily divide up coding and other work without having to deal with synchronization and version control issues.

Emphasis on design / modularity- A great deal of our time was devoted to designing the end-to-end system architecture, class structure, and algorithm to allow for maximum flexibility in parameter setting and to account for a variety of different cases.

Keeping respectful coding habits- We commented our code religiously to help each of us better understand the other’s work.


IV. CONCLUSION

zMovie RESULTS: ACTUAL VS. PREDICTED RATINGS

As mentioned above, the parameter values vary depending on the size of the initial “base” dataset and other algorithmic settings, since the parameters are not perfectly independent of one another. Still, trends emerge when all but one of the parameters are held constant. In particular, each parameter has an “optimal” value that leads to the lowest MAE.

[Figure: Effect of Varying K on MAE. x-axis: K (Number of Clusters), values 5, 10, 15, 20, 25; y-axis: Mean Absolute Error (MAE). Holding Constant: N = 20, P = 7, M = 10, λ = 1]

[Figure: Effect of Varying M on MAE. x-axis: M (Number of Movies), values 5, 10, 15, 20, 25; y-axis: Mean Absolute Error (MAE). Holding Constant: K = 10, N = 20, P = 7, λ = 1]

[Figure: Effect of Varying N on MAE. x-axis: N (Number of Top Most Similar Neighbors Within Cluster to Active User), values 5, 10, 15, 20, 25; y-axis: Mean Absolute Error (MAE). Holding Constant: K = 10, P = 7, M = 10, λ = 1]

[Figure: Effect of Varying P on MAE. x-axis: P (Number of Passes under K-Means Algorithm), values 5, 6, 7, 8, 9; y-axis: Mean Absolute Error (MAE). Holding Constant: K = 10, N = 20, M = 10, λ = 1]

[Figure: Effect of Varying Lambda on MAE. x-axis: Lambda, values 0, 0.1, 0.25, 0.5, 0.75, 1; y-axis: Mean Absolute Error (MAE). Holding Constant: K = 10, N = 20, M = 10, P = 7]

Run on one 50-50 split of initial MovieLens dataset.

Results: Most Optimal Values (yielding lowest MAE of 0.3939): K = 10, N = 20, P = 7, M = 10, λ = 0
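The one-parameter-at-a-time experiments summarized in the figures can be organized as a simple sweep. The sketch below is an assumption about how such a sweep might be structured, not the code that produced the reported numbers; `sweep` and `toy_evaluate` are hypothetical names, and `toy_evaluate` returns made-up values in place of a real build-and-test run.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Params = Dict[str, float]


def sweep(evaluate: Callable[[Params], float], defaults: Params,
          name: str, values: Iterable[float]) -> List[Tuple[float, float]]:
    """Vary one parameter, hold the rest at their defaults, return (value, MAE) pairs."""
    results = []
    for v in values:
        params = dict(defaults)  # copy so the defaults are never modified
        params[name] = v
        results.append((v, evaluate(params)))
    return results


def toy_evaluate(params: Params) -> float:
    """Stand-in for a full run on the "base" and "test" datasets (made-up numbers)."""
    return 0.5 + abs(params["K"] - 10) * 0.1


defaults = {"K": 10, "N": 20, "P": 7, "M": 10, "lambda": 1}
k_results = sweep(toy_evaluate, defaults, "K", [5, 10, 15, 20, 25])
best_k, best_mae = min(k_results, key=lambda pair: pair[1])
```

Because only one parameter moves at a time, a sweep like this cannot reveal interactions between parameters, which is consistent with the caveat that the parameters are not perfectly independent.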


Reasons for “optimal” parameter values:

K  Too few clusters results in grouping dissimilar users, whereas too many clusters defeats the purpose of grouping “like” users.

N  There is a fine line when picking the top N most similar neighbors within a cluster to an active user. N should not be too small (not representative of the cluster) or too large (including everyone within the cluster would be meaningless and inefficient).

P  After a certain number of passes of the K-means algorithm, the process of recalculating centroids and reclustering the “base” dataset becomes less and less valuable (lower incremental improvement in the clustering of the dataset).

M  There is a minimum number of movies that a new user must rate in order to receive reasonable predictions / recommendations. An M that is too small makes it harder to assign the new user to an existing cluster.
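The role of P can be seen in a plain K-means loop that runs a fixed number of passes. The sketch below is a generic illustration that assumes dense rating vectors and squared Euclidean distance; the project's own clustering may measure closeness differently (for example with a correlation-based similarity), and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Vector], centroids: List[Vector], passes: int) -> List[int]:
    """Run exactly `passes` assign-then-recompute rounds; return each point's cluster index."""
    centroids = [list(c) for c in centroids]
    assignment = [0] * len(points)
    for _ in range(passes):
        # Assignment step: nearest centroid for every point.
        assignment = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy 2-D points; real inputs would be per-user rating vectors.
points = [(5.0, 5.0), (4.0, 5.0), (1.0, 1.0), (1.0, 2.0)]
labels = k_means(points, centroids=[(5.0, 5.0), (1.0, 1.0)], passes=2)
```

Once the assignments stop changing from one pass to the next, further passes repeat the same work, which matches the diminishing returns described for P.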

Finally, as a sanity check, the above experiments were repeated on four additional randomized 50-50 splits of the MovieLens dataset, all of which yielded a very similar overall MAE on the “test” dataset.

zMovie RESULTS: LIVE USER RECOMMENDATION

Below is an example of the movie recommendations received by a new user (UserID = 867) after rating 10 movies:

UserID: 867
Age: 24
Gender: M
Occupation: Scientist
Zipcode: 92507

10 Rated Movies
No.  Movie Title                         Actual Rating
1    Fried Green Tomatoes (1991)         4
2    Dead Poets Society (1989)           3
3    North by Northwest (1959)           5
4    Taxi Driver (1976)                  5
5    M*A*S*H (1970)                      3
6    2001: A Space Odyssey (1968)        5
7    Fugitive, The (1993)                4
8    Nobody's Fool (1994)                4
9    Schindler's List (1993)             5
10   Terminator 2: Judgment Day (1991)   5

Top 5 Movies to Watch
No.  Movie Title
1    Madame Butterfly (1995)
2    Golden Earrings (1947)
3    My Crazy Life (Mi vida loca) (1993)
4    Losing Chase (1996)
5    Late Bloomers (1996)

Top 5 Movies to Avoid
No.  Movie Title
1    Temptress Moon (Feng Yue) (1996)
2    Getting Even with Dad (1994)
3    Babysitter, The (1995)
4    Gumby: The Movie (1995)
5    Tom and Huck (1995)
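Lists such as “Top 5 Movies to Watch” and “Top 5 Movies to Avoid” can be read off a table of predicted ratings by sorting it. The sketch below assumes a mapping from itemID to predicted rating that contains only movies the user has not yet rated; the function name, the item IDs, and the predicted values are made up for illustration.

```python
from typing import Dict, List, Tuple


def top_and_bottom(predicted: Dict[int, float], count: int = 5) -> Tuple[List[int], List[int]]:
    """Return (itemIDs with the highest predictions, itemIDs with the lowest predictions)."""
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    return ranked[:count], ranked[::-1][:count]


# Hypothetical predictions for six unseen movies.
predicted = {101: 4.8, 102: 1.2, 103: 3.9, 104: 2.5, 105: 4.1, 106: 1.9}
to_watch, to_avoid = top_and_bottom(predicted, count=2)
```

With these made-up values, `to_watch` is `[101, 105]` and `to_avoid` is `[102, 106]`, each ordered from most to least extreme.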

CONCLUDING THOUGHTS

Overall, we have developed a user-based / model-based CF algorithm that clusters users in order to address the scalability and computation issues of traditional memory-based CF systems. Primarily, we refined our algorithm to develop a sense of the parameter values required to achieve the best results (lowest MAE). However, several limitations still exist due to the user-based model aspect of the recommender system (e.g. not accounting for the uniqueness of individual movies). For future work, we would like to investigate the effects of metadata on individual movies (e.g. content-based factors such as genre, actors, year produced) as well as additional social networking effects (e.g. tags on movies supplied by users in the global database, recommendations made by a user’s friends, and so on).


V. REFERENCES

[1] P. Melville, R. J. Mooney, and R. Nagarajan, “Content-Boosted Collaborative Filtering for Improved Recommendations,” Proc. of the 18th National Conference on Artificial Intelligence, 2002.

[2] G. Xue, C. Lin, Q. Yang, et al., “Scalable Collaborative Filtering Using Cluster-based Smoothing,” Proc. of 28th Annual International ACM SIGIR Conference, 2005.

[3] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “GroupLens: An Open Architecture for Collaborative Filtering of Netnews,” Proc. of ACM Conference on Computer Supported Cooperative Work, 1994.

[4] J. S. Breese, D. Heckerman, and C. Kadie, “Empirical Analysis of Predictive Algorithms for Collaborative Filtering,” Proc. of the 14th Conference on Uncertainty in Artificial Intelligence, 1998.

[5] I. M. Soboroff and C. Nicholas, “Collaborative Filtering and the Generalized Vector Space Model,” Proc. of 23rd Annual International ACM SIGIR Conference, 2000.

[6] A. Kohrs and B. Merialdo, “Clustering for Collaborative Filtering Applications,” Proc. of CIMCA, 1999.

[7] L. H. Ungar and D. P. Foster, “Clustering Methods for Collaborative Filtering,” Proc. Workshop on Recommendation Systems at 15th National Conference on Artificial Intelligence.

[8] T. Hofmann and J. Puzicha, “Latent Class Models for Collaborative Filtering,” Proc. of 16th International Joint Conference on Artificial Intelligence, 1999.

[9] U. Shardanand and P. Maes, “Social Information Filtering: Algorithms for Automating ‘Word of Mouth,’” Proc. of SIGCHI Conference on Human Factors in Computing Systems, 1995.

[10] F. Kuo, M. Chiang, M. Shan, and S. Lee, “Emotion-based Music Recommendation by Association Discovery from Film Music,” Proc. of 13th Annual ACM International Conference on Multimedia, 2005.

[11] J. M. Kleinberg and M. Sandler, “Using Mixture Models for Collaborative Filtering,” Proc. of 36th Annual ACM Symposium on Theory of Computing, 2004.

[12] H. Chen and A. L. P. Chen, “A Music Recommendation System Based on Music Data Grouping and User Interests,” Proc. of 10th International Conference on Information and Knowledge Management, 2001.

[13] ChoiceStream. “Attributized Bayesian Choice Modeling.” ChoiceStream Technology Brief. 9 April 2007 <http://www.choicestream.com/pdf/ChoiceStream_TechBrief.pdf>.

[14] GroupLens Research. Department of Computer Science and Engineering at University of Minnesota. 9 April 2007 <http://www.grouplens.org>.

[15] Greene, Kate. “The $1 Million Netflix Challenge.” Technology Review. (6 October 2006). 9 April 2007 <http://www.technologyreview.com/read_article.aspx?ch=specialsections&sc=social&id=17587>.


APPENDIX A. Recommender Algorithm Diagrams

[Figure: User-Based Collaborative Filtering]

[Figure: Item-Based Collaborative Filtering]

[Figure: ChoiceStream’s Attributized Bayesian Choice Modeling (ABCM)]


APPENDIX B. Database Tables

Below is a detailed description of the tables and fields used in our backend database:

Table Name: Items
Description: Global table of all movies in database

Field Name        Type  Description                           Example
movieID           int   Unique primary key assigned to movie  1
movieTitle        text  Title of the movie                    Toy Story (1995)
releaseDate       text  Release date of the movie             01-Jan-1995
videoReleaseDate  text  Video release date of movie
imdbURL           text  IMDB URL of movie                     http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)
unknown           int   Genre classification (0 or 1)         0
action            int   Genre classification (0 or 1)         0
adventure         int   Genre classification (0 or 1)         0
animation         int   Genre classification (0 or 1)         1
children          int   Genre classification (0 or 1)         1
comedy            int   Genre classification (0 or 1)         1
crime             int   Genre classification (0 or 1)         0
documentary       int   Genre classification (0 or 1)         0
drama             int   Genre classification (0 or 1)         0
fantasy           int   Genre classification (0 or 1)         0
filmNoir          int   Genre classification (0 or 1)         0
horror            int   Genre classification (0 or 1)         0
musical           int   Genre classification (0 or 1)         0
mystery           int   Genre classification (0 or 1)         0
romance           int   Genre classification (0 or 1)         0
scifi             int   Genre classification (0 or 1)         0
thriller          int   Genre classification (0 or 1)         0
war               int   Genre classification (0 or 1)         0
western           int   Genre classification (0 or 1)         0

Table Name: Users
Description: Global table of user profiles

Field Name  Type  Description                               Example
userID      int   Unique primary key assigned to each user  1
age         int   Age of user                               24
gender      text  Gender (M or F)                           M
occupation  text  Occupation of user                        technician
zipcode     int   Zipcode of user                           85711
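Because each genre is stored as its own 0/1 column in the Items table, reading a movie's genres means collecting the columns set to 1. The helper below is a small sketch of that lookup; the function name `genres_of` and the dictionary representation of a row are assumptions, while the column names follow the table above.

```python
GENRES = ["unknown", "action", "adventure", "animation", "children", "comedy",
          "crime", "documentary", "drama", "fantasy", "filmNoir", "horror",
          "musical", "mystery", "romance", "scifi", "thriller", "war", "western"]


def genres_of(row: dict) -> list:
    """List the genre columns set to 1 in an Items row (missing columns count as 0)."""
    return [g for g in GENRES if row.get(g, 0) == 1]


# The example row from the Items table, with the zero-valued columns omitted.
toy_story = {"movieID": 1, "movieTitle": "Toy Story (1995)",
             "animation": 1, "children": 1, "comedy": 1}
toy_story_genres = genres_of(toy_story)
```

For the example row this yields `["animation", "children", "comedy"]`.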


Table Name: User-Item-Rating
Description: Global table linking users to items to item ratings

Field Name  Type  Description                                                                         Example
userID      int   Unique id assigned to each user                                                     134
itemID      int   Unique id assigned to each movie                                                    1325
rating      int   Rating by user on given item (integer scale between 1 and 5, 5 being the highest)   4
timestamp   int   Timestamp of when user rated particular item (in Unix seconds since 1/1/1970 UTC)   881250949
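The timestamp column stores Unix seconds, which can be turned into a readable date with the standard library. The sketch below converts the example value from the table; the function name `rating_time` is hypothetical.

```python
from datetime import datetime, timezone


def rating_time(unix_seconds: int) -> datetime:
    """Convert Unix seconds (since 1970-01-01 UTC) to a timezone-aware datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


when = rating_time(881250949)
print(when.isoformat())  # 1997-12-04T15:55:49+00:00
```

Passing `tz=timezone.utc` keeps the result independent of the local time zone of the machine running the code.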


APPENDIX C. Recommendation Engine Class Structure

Global Level Variables

Variable Name     Type       Description
K                 int        Number of clusters
P                 int        Number of passes under K-means algorithm
N                 int        Number of most similar neighbors to active user within cluster to select
M                 int        Minimum number of ratings required to make recommendations to a user
λ                 float      λ = 0: Basic Pearson correlation-coefficient algorithm
                             λ = 1: Basic cluster-based CF algorithm using average rating of cluster for similarity computation
HT_allUsers       Hashtable  Key-value pair linking all userIDs with User object; (int userID, User user) pairing
HT_allItems       Hashtable  Key-value pair linking all itemIDs with Item object; (int itemID, Item item) pairing
HT_allClusters    Hashtable  Key-value pair linking all clusterIDs with Cluster object; (int clusterID, Cluster cluster) pairing
HT_clusterCenter  Hashtable  Key-value pair linking userIDs of all cluster centroids with User object; (int userID, User user) pairing
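For the λ = 0 case, the similarity between two users is the Pearson correlation coefficient of their ratings. The sketch below shows one common variant, with means taken over co-rated items only; whether the project computed the means this way is not stated, so this choice, the function name, and the sample ratings are assumptions.

```python
from math import sqrt
from typing import Dict

Ratings = Dict[int, int]  # itemID -> rating


def pearson(a: Ratings, b: Ratings) -> float:
    """Pearson correlation over co-rated items; returns 0.0 when it is undefined."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0  # a user who rated every shared item identically has no variance
    return cov / sqrt(var_a * var_b)


# Made-up ratings: user_b agrees with user_a in direction, user_c is the reverse.
user_a = {1: 5, 2: 3, 3: 1}
user_b = {1: 4, 2: 3, 3: 2}
user_c = {1: 1, 2: 3, 3: 5}
same_taste = pearson(user_a, user_b)
opposite_taste = pearson(user_a, user_c)
```

Values near 1 indicate users who rank movies alike, values near -1 indicate opposite tastes, and returning 0.0 for the undefined cases is a convention chosen for this sketch.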

Class: Cluster

Variable Name   Type       Description
id              int        Unique id assigned to each cluster (K total clusters)
userCount       int        Total number of users in the cluster
itemCount       int        Total number of items rated by users in the cluster
centerID        int        UserID for centroid of cluster
centerRating    float      Average rating for centroid of cluster
averageRating   float      Average rating of entire cluster
C_Users         Hashtable  Key-value pair linking all userIDs in cluster with User object; (int userID, User user) pairing
C_Items         Hashtable  Key-value pair linking all itemIDs in cluster with Item object; (int itemID, Item item) pairing
C_Likeness      Hashtable  Key-value pair linking two userIDs with their similarity value; (Pair (int userID_1, int userID_2), float similarityValue) pairing
C_MaxLikeness   Hashtable  Key-value pair linking particular userID with totalSimilarityValue (sum of similarity between userID and all other users in cluster); (int userID, float totalSimilarityValue) pairing

Class: User

Variable Name     Type       Description
id                int        Unique id assigned to each user
averageRating     float      Average rating for user
itemsRated        int        Total number of items rated by user
star              int        True if user has had predicted ratings assigned to its items list
clusterID         int        Cluster id for given user
clusterLikeness   float      Similarity value between user and cluster
U_Items           Hashtable  Key-value pair linking all itemIDs rated by user with ratings provided by user for all the items; (int itemID, int itemRating) pairing
U_PredictedItems  Hashtable  Key-value pair linking all itemIDs and corresponding rating prediction for that item; (int itemID, float predictedRating) pairing

Class: Item

Variable Name      Type       Description
id                 int        Unique id assigned to each item
averageItemRating  float      Average rating for given item
timesRated         int        Total number of times item was rated by users in database
I_Users            Hashtable  Key-value pair linking all userIDs (those who rated given item) to User objects; (int userID, User user) pairing
I_ClusterCount     Hashtable  Key-value pair linking all clusterIDs (associated with given item) to the number of times item was rated (for that given clusterID); (int clusterID, int timesRated) pairing