Collaborative Filtering: Common Problems and Solutions

1. Collaborative Filtering Algorithms: Common Problems and Solutions
Vivek A. Ganesan [email protected]
Data Gods Meetup, Santa Clara, CA, May 13, 2013

2. Before we start
Copyright 2013, Vivek A. Ganesan, All rights reserved
o A BIG thank you to our sponsors, Big Data Cloud
  o Meeting space
  o Support
  o Check out their big data training

3. Introduction
o Program outline
  o This is an opt-in program, and it is FREE (as in beer)
  o We do social coding (which means you share your code as open source, Apache v2 license)
  o Program duration = 1 month, weekly sprints
  o Weekly meetup (topical + social coding + Q/A)
  o A weekend hackathon (Sat. afternoon) on alternate weeks (deep technical immersion)
  o Demo at the end of the program

4. Agenda
o CF algorithms: recap
o Problems with CF and solutions
o Update on the project
o Questions?
o Discussion

5. CF: Common sense version
o Out of a large group of users who have rated items:
  o Pick a small subset of users who are similar to you
o Now, for an item that you have not yet rated but your similar users have rated:
  o Figure out an average rating for the item from your similar group of users
  o Weigh it with your rating history and predict a rating

6. CF: Visual
User/Movie                    Sleepless in Seattle   Titanic   Terminator 2
Alice                         5                      5         3
Bob                           1                      3         5
Chandra                       3                      5         4
Dawood                        2                      3         5
Eduardo (you or active user)  2                      4         ?

7. A sample approach
o Compute Eduardo's similarity to all other users
o Pick the three users most similar to Eduardo
o Weigh their ratings for Terminator 2 by their degree of similarity to Eduardo
o Make sure that the predicted rating is within the given scale (0 to 5)
o And predict Eduardo's rating for Terminator 2

8. Step 1: Measuring Similarity
o Start with a distance metric
  o There are several; let's pick Euclidean, for example
  o For n-dimensional space: square root of the sum of squared differences
o Convert it to a similarity score (0 to 1)
  o 1/(1 + Euclidean distance) (adding 1 to avoid division by zero)
  o Approaches 0 for no match, 1 for a perfect match

9. CF: Distances & Similarities
User      Distance to Eduardo   Similarity
Alice     3.16                  0.24
Bob       1.414                 0.414
Chandra   1.414                 0.414
Dawood    1                     0.5
o Pick the top three users most similar to Eduardo: Dawood, Bob and Chandra
o Weigh their ratings for Terminator 2 by their degree of similarity to Eduardo: (0.414 x 5) + (0.414 x 4) + (0.5 x 5) = 6.226
o Oops, too big a rating (0 to 5 scale)!
o Divide by the sum of similarities (0.414 + 0.414 + 0.5)
o Answer: 6.226/1.328 = 4.688 (our prediction)

10. Pick the correct similarity metric
o Pearson's correlation coefficient
  o COV(x,y)/(SD(x)*SD(y))
  o Scale-invariant, i.e. adjusts for rating bias
  o However, can give skewed results for small dimensions (few co-rated items)
  o Solution: use a smoothing function
o Other metrics: cosine similarity, Jaccard, Tanimoto, etc.

11. Cold Start Problem
o First user problem
  o New user does not have any ratings
  o No good way in CF to find similar users based on ratings (no rating history for the new user)
  o A solution: start with popular items or item-based recommendations
o First item problem
  o Use item attributes to recommend similar items

12. Sparse Ratings Problem
o Given large numbers of users and items,
  o Most users would only rate a handful of items
  o So, the number of users who would have rated the same set of items would be quite small
o Throws off the recommendations (small set of users to recommend from)
o A solution: hybrid recommenders, i.e. use both collaborative and content-based approaches
o Also: use model-based approaches

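The worked example from slides 6 to 9 (Euclidean similarity, top three neighbours, similarity-weighted average) can be reproduced with a short Python sketch. The names `euclidean_similarity` and `predict_rating` and the dictionary layout are illustrative assumptions, not something taken from the slides.

```python
import math

# Ratings from slide 6. Each entry holds the ratings for
# (Sleepless in Seattle, Titanic), used to measure similarity,
# and the user's Terminator 2 rating, used to make the prediction.
known = {
    "Alice":   ((5, 5), 3),
    "Bob":     ((1, 3), 5),
    "Chandra": ((3, 5), 4),
    "Dawood":  ((2, 3), 5),
}
eduardo = (2, 4)  # the active user; Terminator 2 rating is unknown


def euclidean_similarity(a, b):
    """Map the Euclidean distance between two rating vectors to (0, 1]."""
    return 1.0 / (1.0 + math.dist(a, b))


def predict_rating(active, others, k=3, scale=(0.0, 5.0)):
    """Similarity-weighted average of the k most similar users' ratings."""
    scored = sorted(
        ((euclidean_similarity(active, ratings), target)
         for ratings, target in others.values()),
        reverse=True,
    )[:k]
    total_similarity = sum(sim for sim, _ in scored)
    weighted_sum = sum(sim * target for sim, target in scored)
    # Dividing by the sum of similarities keeps the result on the rating scale;
    # the clamp is a safety net for other weighting schemes.
    prediction = weighted_sum / total_similarity
    return min(max(prediction, scale[0]), scale[1])


prediction = predict_rating(eduardo, known)
print(round(prediction, 3))  # prints 4.688, matching slide 9
```

Because the weights are normalised by their sum, the prediction is a weighted mean and cannot leave the range of the neighbours' ratings, which is why the clamp never triggers in this example.
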
13. User Quirks Problems
o Power/super users
  o Users who rate an unusually high number of items
o Black sheep users
  o So idiosyncratic that recommendations break down
o Skewed ratings
  o Usually done deliberately (e.g. to boost one's restaurant and/or disparage a competitor)

14. Some Considerations
o K.I.S.S.
  o If you don't understand the approach, don't use it
o Test, test, test
  o Use RMSE to test on existing data
  o Do A/B testing on a live system
o Try hybrid approaches
  o Use a combination of item-based and user-based plus content attributes

15. For this scrum
o Don't worry too much about the problems
  o Goal is to learn Collaborative Filtering
o However,
  o Do implement testing
  o For instance, remove a few ratings from the data set and see how close the system can predict those (use RMSE as a test metric)
  o A/B testing for live systems

16. Questions? Comments?
Thank You!
E-mail: [email protected] : onevivek
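
The hold-out test suggested in slide 15 (remove a few ratings, predict them, score with RMSE) can be sketched as below. The `rmse` helper and the sample numbers are hypothetical; in practice the predictions would come from whatever recommender is under test.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings that were removed from the data set,
# and the system's predictions for those same (user, item) pairs.
held_out = [5, 3, 4, 2]
predicted = [4.5, 3.5, 4.0, 3.0]

score = rmse(held_out, predicted)
print(round(score, 3))  # prints 0.612
```

A lower RMSE means the predictions sit closer to the ratings that were held out; comparing the score across similarity metrics or neighbourhood sizes is one simple way to choose between them.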