Data Mining Algorithms for Recommendation Systems. Zhenglu Yang, University of Tokyo. Some slides are from online materials.

  • Slide 1
  • Data Mining Algorithms for Recommendation Systems Zhenglu Yang University of Tokyo Some slides are from online materials.
  • Slide 2
  • Applications 2
  • Slide 3
  • 3
  • Slide 4
  • 4 Corporate Intranets
  • Slide 5
  • System Inputs. Interaction data (users × items): explicit feedback (ratings, comments); implicit feedback (purchases, browsing). User/item individual data. User side: structural attribute information, personal description, social network. Item side: structural attribute information, textual description/content information, taxonomy of item (category). 5
  • Slide 6
  • Interaction between Users and Items 6 Observed preferences (Purchases, Ratings, page views, bookmarks, etc)
  • Slide 7
  • Profiles of Users and Items 7 User Profile: (1) Attribute: nationality, sex, age, hobby, etc. (2) Text: personal description. (3) Link: social network. Item Profile: (1) Attribute: price, weight, color, brand, etc. (2) Text: product description. (3) Link: taxonomy of item (category).
  • Slide 8
  • All Information about Users and Items 8 Observed preferences (purchases, ratings, page views, bookmarks, etc.). User Profile: (1) Attribute: nationality, sex, age, hobby, etc. (2) Text: personal description. (3) Link: social network. Item Profile: (1) Attribute: price, weight, color, brand, etc. (2) Text: product description. (3) Link: taxonomy of item (category).
  • Slide 9
  • KDD and Data Mining 9 Data mining is a multi-disciplinary field, related to artificial intelligence, statistics, machine learning, KDD, databases, and natural language processing.
  • Slide 10
  • Recommendation Approaches. Collaborative filtering: uses interaction data (the user-item matrix); process: identify similar users and extrapolate from their ratings. Content based strategies: use profiles of users/items (features); process: generate rules/classifiers that are used to classify new items. Hybrid approaches. 10
  • Slide 11
  • A Brief Introduction Collaborative filtering Nearest neighbor based Model based 11
  • Slide 12
  • Recommendation Approaches Collaborative filtering Nearest neighbor based User based Item based Model based 12
  • Slide 13
  • User-based Collaborative Filtering. Idea: people who agreed in the past are likely to agree again. To predict a user's opinion for an item, use the opinions of similar users. Similarity between users is decided by looking at their overlap in opinions for other items.
  • Slide 14
  • User-based CF (Ratings) 14
             Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
    User 1     8       1       7       2       9       8
    User 2     9       8       7       ?       1       2
    User 3     8       9       8       9       3       1
    User 4     2       1       1       2       3       1
    User 5     3       1       2       3       2       2
    User 6     1       2       2       1       1       1
    Ratings range from 1 (bad) to 10 (good).
  • Slide 15
  • Similarity between Users 15
             Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
    User 2     9       8       7       ?       1       2
    User 3     8       9       8       9       3       1
    Only consider items both users have rated. Common similarity measures: cosine similarity, Pearson correlation.
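A small sketch of the user-based similarity computation, using the User 2 and User 3 rows from the table above (cosine similarity and Pearson correlation restricted to co-rated items). The final prediction step, a similarity-weighted average over neighbours for the missing Item 4 rating, is an illustrative assumption and not prescribed by the slide.

```python
import math

# Ratings from the slide; None marks the unknown cell (User 2, Item 4).
ratings = {
    "User 2": [9, 8, 7, None, 1, 2],
    "User 3": [8, 9, 8, 9, 3, 1],
}

def corated(u, v):
    """Pairs of ratings for items that both users have rated."""
    return [(a, b) for a, b in zip(u, v) if a is not None and b is not None]

def cosine(u, v):
    pairs = corated(u, v)
    num = sum(a * b for a, b in pairs)
    den = math.sqrt(sum(a * a for a, _ in pairs)) * math.sqrt(sum(b * b for _, b in pairs))
    return num / den

def pearson(u, v):
    pairs = corated(u, v)
    mu = sum(a for a, _ in pairs) / len(pairs)
    mv = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mu) * (b - mv) for a, b in pairs)
    den = math.sqrt(sum((a - mu) ** 2 for a, _ in pairs)) * math.sqrt(sum((b - mv) ** 2 for _, b in pairs))
    return num / den

sim = cosine(ratings["User 2"], ratings["User 3"])
print(f"cosine(User 2, User 3)  = {sim:.3f}")      # high: the two users largely agree
print(f"pearson(User 2, User 3) = {pearson(ratings['User 2'], ratings['User 3']):.3f}")

# Hypothetical prediction step: fill User 2's Item 4 rating with a
# similarity-weighted average of the neighbours' ratings for Item 4.
neighbours = {"User 3": sim}
pred = sum(s * ratings[n][3] for n, s in neighbours.items()) / sum(neighbours.values())
print(f"predicted rating of User 2 for Item 4 = {pred:.1f}")
```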
  • Slide 16
  • Recommendation Approaches Collaborative filtering Nearest neighbor based User based Item based Model based Content based strategies Hybrid approaches 16
  • Slide 17
  • Item-based Collaborative Filtering Idea: a user is likely to have the same opinion for similar items Similarity between items is decided by looking at how other users have rated them 17
  • Slide 18
  • Example: Item-based CF
             Item 1  Item 2  Item 3  Item 4  Item 5
    User 1     8       1       ?       2       7
    User 2     2       2       5       7       5
    User 3     5       4       7       4       7
    User 4     7       1       7       3       8
    User 5     1       7       4       6       5
    User 6     8       3       8       3       7
  • Slide 19
  • Similarity between Items
             Item 3  Item 4
    User 1     ?       2
    User 2     5       7
    User 3     7       4
    User 4     7       3
    User 5     4       6
    User 6     8       3
    Only consider users who have rated both items. Common similarity measures: cosine similarity, Pearson correlation.
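The same idea applied column-wise; a short sketch computing the cosine similarity between Item 3 and Item 4 over the users who rated both items, using the columns above.

```python
import math

# Item 3 and Item 4 ratings per user, from the table; None is the unknown cell.
item3 = [None, 5, 7, 7, 4, 8]
item4 = [2, 7, 4, 3, 6, 3]

# Keep only users who rated both items, then apply cosine similarity.
pairs = [(a, b) for a, b in zip(item3, item4) if a is not None and b is not None]
num = sum(a * b for a, b in pairs)
den = math.sqrt(sum(a * a for a, _ in pairs)) * math.sqrt(sum(b * b for _, b in pairs))
print(f"cosine(Item 3, Item 4) = {num / den:.3f}")
```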
  • Slide 20
  • Recommendation Approaches Collaborative filtering Nearest neighbor based Model based Matrix factorization (e.g., SVD) Content based strategies Hybrid approaches 20
  • Slide 21
  • Singular Value Decomposition (SVD). A mathematical method applied to many problems. Given any m×n matrix R, find matrices U, Σ, and V such that R = U Σ V^T, where U is m×r and orthonormal, Σ is r×r and diagonal, and V is n×r and orthonormal. Removing the smallest singular values gives a rank-k approximation R_{m,k} with k < r.
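A minimal numpy sketch of the truncation step described above, applied to the earlier rating matrix; the unknown User 2 / Item 4 entry is imputed with a placeholder value, an assumption made only for illustration.

```python
import numpy as np

# Rating matrix from the earlier slide (users x items); the unknown
# (User 2, Item 4) cell is imputed with a placeholder value of 5.
R = np.array([[8, 1, 7, 2, 9, 8],
              [9, 8, 7, 5, 1, 2],
              [8, 9, 8, 9, 3, 1],
              [2, 1, 1, 2, 3, 1],
              [3, 1, 2, 3, 2, 2],
              [1, 2, 2, 1, 1, 1]], dtype=float)

# Full SVD: R = U * diag(s) * Vt, with singular values s in decreasing order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # keep only the k largest singular values
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("rank-%d approximation error: %.3f" % (k, np.linalg.norm(R - R_k)))
```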
  • Hierarchical Agglomerative Clustering. Put every point in a cluster by itself. For i = 1 to N-1 do { let C1 and C2 be the most mergeable pair of clusters; create C1,2 as the parent of C1 and C2 }. Example (1-dimensional objects for simplicity): numerical objects 1, 2, 5, 6, 7. Agglomerative clustering: find the two closest objects and merge; => {1,2}, so we now have {1.5, 5, 6, 7}; => {1,2}, {5,6}, so {1.5, 5.5, 7}; => {1,2}, {{5,6},7}. 90
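A small sketch of this agglomerative procedure on the slide's 1-dimensional objects, merging the pair of clusters with the closest centroids at each step (centroid merging is one choice of "most mergeable"; other linkage rules exist).

```python
# Each cluster is (members, centroid); start with every point in its own cluster.
points = [1, 2, 5, 6, 7]
clusters = [([p], float(p)) for p in points]

while len(clusters) > 1:
    # The most "mergeable" pair here is the pair of clusters with closest centroids.
    i, j = min(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda ij: abs(clusters[ij[0]][1] - clusters[ij[1]][1]),
    )
    members = clusters[i][0] + clusters[j][0]
    merged = (members, sum(members) / len(members))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    print([sorted(c[0]) for c in clusters])

# The first merges are {1,2}, then {5,6}, then {{5,6},7}, matching the slide's trace.
```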
  • Slide 91
  • Recommendation Approaches Collaborative filtering Content based strategies Association Rule Mining Text similarity based Clustering Classification Hybrid approaches 91
  • Slide 92
  • Illustrating Classification Task
  • Slide 93
  • Classification k-Nearest Neighbor (kNN) Decision Tree Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 93
  • Slide 94
  • k-Nearest Neighbor Classification (kNN). kNN does not build a model from the training data. Approach: to classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d; count the number n of training instances in P that belong to class c_j; estimate Pr(c_j | d) as n/k (majority vote). No training is needed, but classification time is linear in the training set size for each test case. k is usually chosen empirically via a validation set or cross-validation by trying a range of k values. The distance function is crucial but depends on the application. 94
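A minimal sketch of the kNN procedure just described, assuming Euclidean distance and a small hypothetical 2-D training set labelled with the Car/Book/Clothes classes used in the next two examples.

```python
import math
from collections import Counter

# Hypothetical training instances: (feature vector, class label).
train = [((1.0, 1.0), "Book"), ((1.2, 0.8), "Book"),
         ((5.0, 5.0), "Car"),  ((5.5, 4.5), "Car"),
         ((9.0, 1.0), "Clothes")]

def knn_classify(x, train, k=3):
    """Majority vote among the k training instances nearest to x (Euclidean distance)."""
    neighbours = sorted(train, key=lambda t: math.dist(t[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9), train, k=1))   # -> Book  (1NN)
print(knn_classify((4.0, 3.5), train, k=3))   # -> Car   (3NN majority vote)
```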
  • Slide 95
  • Example: k=1 (1NN) Car Book Clothes Book 95 which class?
  • Slide 96
  • Example: k=3 (3NN) Car Book Clothes Car 96 which class?
  • Slide 97
  • Discussion. Advantages: nonparametric architecture, simple, powerful, requires no training time. Disadvantages: memory intensive, classification/estimation is slow, sensitive to the choice of k. 97
  • Slide 98
  • Classification k-Nearest Neighbor (kNN) Decision Tree Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 98
  • Slide 99
  • Example of a Decision Tree. Training data with categorical and continuous attributes and a class label; the task is to judge the cheat possibility: Yes/No.
  • Slide 100
  • Example of a Decision Tree. Model: a decision tree built from the training data, with splitting attributes Refund (Yes / No), MarSt (Married vs. Single, Divorced), and TaxInc (< 80K vs. > 80K); the leaf nodes predict YES or NO for the cheat class. Task: judge the cheat possibility: Yes/No.
  • Slide 101
  • Another Example of Decision Tree. A different tree that splits on MarSt first, then Refund and TaxInc (< 80K vs. > 80K), fits the same training data: there could be more than one tree that fits the same data! Task: judge the cheat possibility: Yes/No.
  • Slide 102
  • Decision Tree - Construction Creating Decision Trees Manual - Based on expert knowledge Automated - Based on training data (DM) Two main issues: Issue #1: Which attribute to take for a split? Issue #2: When to stop splitting?
  • Slide 103
  • Classification k-Nearest Neighbor (kNN) Decision Tree CART C4.5 Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 103
  • Slide 104
  • The CART Algorithm. Classification And Regression Trees, developed by Breiman et al. in the early 1980s. Introduced tree-based modeling into the statistical mainstream. A rigorous approach involving cross-validation to select the optimal tree. 104
  • Slide 105
  • Key Idea: Recursive Partitioning. Take all of your data. Consider all possible values of all variables. Select the variable/value (X = t_1) that produces the greatest separation in the target; (X = t_1) is called a split. If X < t_1 then send the data point to the left; otherwise, send it to the right. Now repeat the same process on these two nodes, and you get a tree. Note: CART only uses binary splits. 105
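A compact sketch of this recursive partitioning loop, assuming numeric features given as row vectors and a goodness(left_labels, right_labels) function such as the Φ(s|t) measure defined on the following slides. The helper names are illustrative, not CART's actual implementation (no pruning or cross-validation here).

```python
def grow_tree(rows, labels, goodness, min_size=2):
    """Recursively apply the best binary split (X_j < t) until nodes are small or pure."""
    if len(set(labels)) == 1 or len(rows) < min_size:
        return {"leaf": max(set(labels), key=labels.count)}   # majority-class leaf

    best = None
    for j in range(len(rows[0])):                 # consider every variable...
        for t in sorted({r[j] for r in rows}):    # ...and every observed value as threshold
            left = [i for i, r in enumerate(rows) if r[j] < t]
            right = [i for i, r in enumerate(rows) if r[j] >= t]
            if not left or not right:
                continue
            score = goodness([labels[i] for i in left], [labels[i] for i in right])
            if best is None or score > best[0]:
                best = (score, j, t, left, right)

    if best is None:
        return {"leaf": max(set(labels), key=labels.count)}
    _, j, t, left, right = best
    return {"split": (j, t),
            "left":  grow_tree([rows[i] for i in left],  [labels[i] for i in left],  goodness, min_size),
            "right": grow_tree([rows[i] for i in right], [labels[i] for i in right], goodness, min_size)}
```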
  • Slide 106
  • Key Idea. Let Φ(s|t) be a measure of the goodness of a candidate split s at node t:
    Φ(s|t) = 2 * P_L * P_R * Q(s|t),  where  Q(s|t) = Σ_j | P(j|t_L) - P(j|t_R) |,
    P_L and P_R are the proportions of records at t sent to the left and right child nodes, and P(j|t_L), P(j|t_R) are the proportions of class j records in the left and right child nodes. The optimal split maximizes this Φ(s|t) measure over all possible splits at node t. 106
  • Slide 107
  • Key Idea. Φ(s|t) is large when both of its main components are large:
    1. 2 * P_L * P_R: maximum value if the child nodes are of equal size (same support), e.g., 0.5 × 0.5 = 0.25 versus 0.9 × 0.1 = 0.09.
    2. Q(s|t) = Σ_j | P(j|t_L) - P(j|t_R) |: maximum value if, for each class, the child nodes are completely uniform (pure). The theoretical maximum value of Q(s|t) is k, where k is the number of classes for the target variable. 107
  • Slide 108
  • CART Example 108 Training Set of Records for Classifying Credit Risk
  • Slide 109
  • CART Example Candidate Splits 109. Candidate splits for t = root node; each split sends the records satisfying the condition to the left child node t_L and the remaining records to the right child node t_R:
    1  Savings = low
    2  Savings = medium
    3  Savings = high
    4  Assets = low
    5  Assets = medium
    6  Assets = high
    7-9  Income ≤ t for increasing thresholds t (the largest threshold is $75,000)
    CART is restricted to binary splits.
  • Slide 110
  • CART Primer: Split 1 -> Savings = low (left child if true, right child if false).
    Left: records 2, 5, 7; Right: records 1, 3, 4, 6, 8.
    P_L = 3/8 = 0.375, P_R = 5/8 = 0.625  ->  2 * P_L * P_R = 30/64 = 0.46875
    P(Bad | t_L) = 2/3 = 0.67,  P(Bad | t_R) = 1/5 = 0.2
    P(Good | t_L) = 1/3 = 0.33, P(Good | t_R) = 4/5 = 0.8
    Q(s|t) = |0.67 - 0.2| + |0.8 - 0.33| = 0.934
    Φ(s|t) = 0.46875 * 0.934 = 0.4378  110
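A short sketch that recomputes the split 1 goodness above directly from the class composition of the two children (two Bad and one Good on the left, one Bad and four Good on the right). It could also serve as the goodness function in the recursive-partitioning sketch shown earlier.

```python
def phi_split(left, right):
    """Goodness of split: phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|."""
    n = len(left) + len(right)
    p_l, p_r = len(left) / n, len(right) / n
    classes = set(left) | set(right)
    q = sum(abs(left.count(c) / len(left) - right.count(c) / len(right)) for c in classes)
    return 2 * p_l * p_r * q

# Split 1 (Savings = low): left child has 2 Bad, 1 Good; right child has 1 Bad, 4 Good.
left_labels = ["Bad", "Good", "Bad"]
right_labels = ["Good", "Bad", "Good", "Good", "Good"]
print(round(phi_split(left_labels, right_labels), 4))
# -> 0.4375 (the slide's 0.4378 comes from rounding Q(s|t) up to 0.934)
```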
  • Slide 111
  • CART Example 111. Values of the components of the measure Φ(s|t) for each candidate split on the root node:
    Split  P_L    P_R    P(j|t_L)             P(j|t_R)             2*P_L*P_R  Q(s|t)  Φ(s|t)
    1      0.375  0.625  G: 0.333  B: 0.667   G: 0.8    B: 0.2     0.46875    0.934   0.4378
    2      0.375  0.625  G: 1      B: 0       G: 0.4    B: 0.6     0.46875    1.2     0.5625
    3      0.25   0.75   G: 0.5    B: 0.5     G: 0.667  B: 0.333   0.375      0.334   0.1253
    4      0.25   0.75   G: 0      B: 1       G: 0.833  B: 0.167   0.375      1.667   0.6248
    5      0.5    0.5    G: 0.75   B: 0.25    G: 0.5    B: 0.5     0.5        0.5     0.25
    6      0.25   0.75   G: 1      B: 0       G: 0.5    B: 0.5     0.375      1       0.375
    7      0.375  0.625  G: 0.333  B: 0.667   G: 0.8    B: 0.2     0.46875    0.934   0.4378
    8      0.625  0.375  G: 0.4    B: 0.6     G: 1      B: 0       0.46875    1.2     0.5625
    9      0.875  0.125  G: 0.571  B: 0.429   G: 1      B: 0       0.21875    0.858   0.1877
    For each candidate split, examine the values of the various components of the measure Φ(s|t); split 4 (Assets = low) gives the largest Φ(s|t).
  • Slide 112
  • CART Example 112. CART decision tree after the initial split: the root node (all records) splits on Assets = Low vs. Assets in {Medium, High}; Assets = Low leads to Bad Risk (records 2, 7), and Assets in {Medium, High} leads to Decision Node A (records 1, 3, 4, 5, 6, 8).
  • Slide 113
  • CART Example 113. Values of the components of the measure Φ(s|t) for each candidate split on decision node A (records 1, 3, 4, 5, 6, 8); split 4 (Assets = low) no longer applies at this node:
    Split  P_L    P_R    P(j|t_L)             P(j|t_R)             2*P_L*P_R  Q(s|t)  Φ(s|t)
    1      0.167  0.833  G: 1      B: 0       G: 0.8    B: 0.2     0.2782     0.4     0.1112
    2      0.5    0.5    G: 1      B: 0       G: 0.667  B: 0.333   0.5        0.6666  0.3333
    3      0.333  0.667  G: 0.5    B: 0.5     G: 1      B: 0       0.4444     1       0.4444
    5      0.667  0.333  G: 0.75   B: 0.25    G: 1      B: 0       0.4444     0.5     0.2222
    6      0.333  0.667  G: 1      B: 0       G: 0.75   B: 0.25    0.4444     0.5     0.2222
    7      0.333  0.667  G: 0.5    B: 0.5     G: 1      B: 0       0.4444     1       0.4444
    8      0.5    0.5    G: 0.667  B: 0.333   G: 1      B: 0       0.5        0.6666  0.3333
    9      0.833  0.167  G: 0.8    B: 0.2     G: 1      B: 0       0.2782     0.4     0.1112
  • Slide 114
  • CART Example 114. CART decision tree after the decision node A split: the root node (all records) splits on Assets = Low vs. Assets in {Medium, High}; Assets = Low leads to Bad Risk (records 2, 7); Assets in {Medium, High} leads to Decision Node A (records 1, 3, 4, 5, 6, 8), which splits on Savings = High vs. Savings in {Low, Medium}; Savings = High leads to Decision Node B (records 3, 6), and Savings in {Low, Medium} leads to Good Risk (records 1, 4, 5, 8).
  • Slide 115
  • CART Example 115. CART decision tree, fully grown form: the root node splits on Assets = Low vs. Assets in {Medium, High}; Assets = Low leads to Bad Risk (records 2, 7); Assets in {Medium, High} leads to Decision Node A (records 1, 3, 4, 5, 6, 8), which splits on Savings = High vs. Savings in {Low, Medium}; Savings in {Low, Medium} leads to Good Risk (records 1, 4, 5, 8); Savings = High leads to Decision Node B (records 3, 6), which splits on Assets = Medium, leading to Bad Risk (record 3), vs. Assets = High, leading to Good Risk (record 6).
  • Slide 116
  • Classification k-Nearest Neighbor (kNN) Decision Tree CART C4.5 Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 116
  • Slide 117
  • The C4.5 Algorithm Proposed by Quinlan in 1993 An internal node represents a test on an attribute. A branch represents an outcome of the test, e.g., Color=red. A leaf node represents a class label or class label distribution. At each node, one attribute is chosen to split training examples into distinct classes as much as possible A new case is classified by following a matching path to a leaf node. 117
  • Slide 118
  • The C4.5 Algorithm. Differences between CART and C4.5: unlike CART, the C4.5 algorithm is not restricted to binary splits; it produces a separate branch for each value of a categorical attribute. C4.5's method for measuring node homogeneity is also different from CART's. 118
  • Slide 119
  • The C4.5 Algorithm - Measure. A candidate split S partitions the training data set T into several subsets T_1, T_2, ..., T_k. C4.5 uses the concept of entropy reduction to select the optimal split:
    entropy_reduction(S) = H(T) - H_S(T)
    where the entropy of a set X with class proportions p_i is H(X) = - Σ_i p_i log2(p_i), and H_S(T) = Σ_{i=1..k} (|T_i| / |T|) H(T_i) is the weighted sum of the entropies of the individual subsets T_1, T_2, ..., T_k, with |T_i| / |T| the proportion of records in subset i. C4.5 chooses the optimal split as the one with the greatest entropy reduction. 119
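A minimal sketch of entropy_reduction(S) = H(T) - H_S(T) as defined above; the toy parent and subset label lists at the end are hypothetical, used only to exercise the functions.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = - sum_i p_i * log2(p_i), where p_i is the proportion of class i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_reduction(parent, subsets):
    """H(T) minus the weighted sum of the subset entropies H_S(T)."""
    n = len(parent)
    h_s = sum(len(t_i) / n * entropy(t_i) for t_i in subsets)
    return entropy(parent) - h_s

# Hypothetical split: a 5 Good / 3 Bad parent node separated into two branches.
parent = ["G"] * 5 + ["B"] * 3
subsets = [["G", "G", "G", "G"], ["G", "B", "B", "B"]]
print(round(entropy_reduction(parent, subsets), 3))   # entropy reduction of this candidate split
```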
  • Slide 120
  • Classification k-Nearest Neighbor (kNN) Decision Tree Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 120
  • Slide 121
  • Bayes Rule. Recommender system question: L_i is the class for item i (i.e., that the user likes item i) and A is the set of features associated with item i; estimate p(L_i | A). Bayes' rule: p(L_i | A) = p(A | L_i) p(L_i) / p(A). We can always restate a conditional probability in terms of the reverse condition p(A | L_i) and two prior probabilities, p(L_i) and p(A). Often the reverse condition is easier to know: we can count how often a feature appears in items the user liked (a frequentist assumption). 121
  • Slide 122
  • Naive Bayes. Independence (naïve Bayes assumption): the features a_1, a_2, ..., a_k are independent.
    For the joint probability: p(a_1, a_2, ..., a_k) = p(a_1) p(a_2) ... p(a_k).
    For the conditional probability: p(a_1, a_2, ..., a_k | L_i) = p(a_1 | L_i) p(a_2 | L_i) ... p(a_k | L_i).
    Combined with Bayes' rule: p(L_i | a_1, ..., a_k) is proportional to p(L_i) p(a_1 | L_i) ... p(a_k | L_i). 122
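A minimal sketch of the naïve Bayes classification rule above for categorical features, estimating p(L_i) and p(a_j | L_i) by counting. The toy "genre"/"price" features and the add-one smoothing (to avoid zero probabilities) are illustrative assumptions, not part of the slides.

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_dict, label). Returns counts needed for p(L) and p(a|L)."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)          # (label, feature) -> value counts
    for features, label in examples:
        for f, v in features.items():
            feature_counts[(label, f)][v] += 1
    return class_counts, feature_counts

def classify_nb(features, class_counts, feature_counts):
    """argmax over labels of p(L) * prod_j p(a_j | L), with add-one smoothing."""
    n = sum(class_counts.values())
    best = None
    for label, c in class_counts.items():
        score = c / n
        for f, v in features.items():
            counts = feature_counts[(label, f)]
            score *= (counts[v] + 1) / (sum(counts.values()) + len(counts) + 1)
        if best is None or score > best[0]:
            best = (score, label)
    return best[1]

# Hypothetical toy data: does the user like the item, given two categorical features?
data = [({"genre": "sci-fi", "price": "low"}, "like"),
        ({"genre": "sci-fi", "price": "high"}, "like"),
        ({"genre": "drama",  "price": "high"}, "dislike"),
        ({"genre": "drama",  "price": "low"}, "dislike")]
cc, fc = train_nb(data)
print(classify_nb({"genre": "sci-fi", "price": "high"}, cc, fc))   # -> like
```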
  • Slide 123
  • An Example Compute all probabilities required for classification 123
  • Slide 124
  • An Example. For C = t, we compute p(C = t) multiplied by the product of p(a_j | C = t) over the features; for class C = f, we compute the analogous product. The value for C = t is larger, so C = t is more probable and t is the final class. 124
  • Slide 125
  • Naïve Bayesian Classifier. Advantages: easy to implement; very efficient; good results obtained in many applications. Disadvantage: the class conditional independence assumption, hence loss of accuracy when the assumption is seriously violated (e.g., on highly correlated data sets). 125
  • Slide 126
  • Classification k-Nearest Neighbor (kNN) Decision Tree Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 126
  • Slide 127
  • References for Machine Learning T. Mitchell, Machine Learning, McGraw Hill, 1997 C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Springer, 2001. V. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998. Y. Kodratoff, R. S. Michalski, Machine Learning: An Artificial Intelligence Approach, Volume III, Morgan Kaufmann, 1990 127
  • Slide 128
  • Recommendation Approaches Collaborative filtering Nearest neighbor based Model based Content based strategies Association Rule Mining Text similarity based Clustering Classification Hybrid approaches 128
  • Slide 129
  • The Netflix Prize Slides here are from Yehuda Koren.
  • Slide 130
  • Netflix. Movie rentals by DVD (mail) and online (streaming). 100k movies, 10 million customers. Ships 1.9 million disks to customers each day from 50 warehouses in the US: a complex logistics problem. Employees: 2000, but relatively few in engineering/software, and only a few people working on recommender systems. Moving towards online delivery of content; significant interaction of customers with the Web site. 130
  • Slide 131
  • The $1 Million Question 131
  • Slide 132
  • Million Dollars Awarded September 21st, 2009 132
  • Slide 133
  • 133
  • Slide 134
  • Lessons Learned. Scale is important, e.g., stochastic gradient descent on sparse matrices. Latent factor models work well on this problem; previously they had not been explored for recommender systems. Understanding your data is important, e.g., time effects. Combining models works surprisingly well, but the final 10% improvement can probably be achieved by judiciously combining about 10 models rather than 1000s; this is likely what Netflix will do in practice. 134
  • Slide 135
  • Useful References. Y. Koren, Collaborative filtering with temporal dynamics, ACM SIGKDD Conference, 2009. Y. Koren, R. Bell, and C. Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer, 2009. Y. Koren, Factor in the neighbors: scalable and accurate collaborative filtering, ACM Transactions on Knowledge Discovery from Data, 2010. 135
  • Slide 136
  • Thank you! 136