Data Mining Algorithms for Recommendation Systems. Zhenglu Yang, University of Tokyo. Some slides are from online materials.

  • Slide 1
  • Data Mining Algorithms for Recommendation Systems Zhenglu Yang University of Tokyo Some slides are from online materials.
  • Slide 2
  • Applications 2
  • Slide 3
  • 3
  • Slide 4
  • 4 Corporate Intranets
  • Slide 5
  • System Inputs. Interaction data (users × items): explicit feedback (ratings, comments); implicit feedback (purchases, browsing). User/item individual data. User side: structural attribute information, personal description, social network. Item side: structural attribute information, textual description/content information, taxonomy of item (category). 5
  • Slide 6
  • Interaction between Users and Items 6 Observed preferences (Purchases, Ratings, page views, bookmarks, etc)
  • Slide 7
  • Profiles of Users and Items 7 User Profile: (1) Attribute: nationality, sex, age, hobby, etc. (2) Text: personal description. (3) Link: social network. Item Profile: (1) Attribute: price, weight, color, brand, etc. (2) Text: product description. (3) Link: taxonomy of item (category).
  • Slide 8
  • All Information about Users and Items 8 Observed preferences (purchases, ratings, page views, bookmarks, etc.). User Profile: (1) Attribute: nationality, sex, age, hobby, etc. (2) Text: personal description. (3) Link: social network. Item Profile: (1) Attribute: price, weight, color, brand, etc. (2) Text: product description. (3) Link: taxonomy of item (category).
  • Slide 9
  • KDD and Data Mining 9 Data mining is a multi-disciplinary field, related to artificial intelligence, statistics, machine learning, KDD, databases, and natural language processing.
  • Slide 10
  • Recommendation Approaches. Collaborative filtering: uses interaction data (the user-item matrix); process: identify similar users and extrapolate from their ratings. Content based strategies: use profiles of users/items (features); process: generate rules/classifiers that are used to classify new items. Hybrid approaches. 10
  • Slide 11
  • A Brief Introduction Collaborative filtering Nearest neighbor based Model based 11
  • Slide 12
  • Recommendation Approaches Collaborative filtering Nearest neighbor based User based Item based Model based 12
  • Slide 13
  • User-based Collaborative Filtering. Idea: people who agreed in the past are likely to agree again. To predict a user's opinion for an item, use the opinions of similar users. Similarity between users is decided by looking at their overlap in opinions for other items.
  • Slide 14
  • User-based CF (Ratings) 14
             Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
    User 1     8       1       7       2       9       8
    User 2     9       8       7       ?       1       2
    User 3     8       9       8       9       3       1
    User 4     2       1       1       2       3       1
    User 5     3       1       2       3       2       2
    User 6     1       2       2       1       1       1
    Ratings range from 1 (bad) to 10 (good).
  • Slide 15
  • Similarity between Users 15
             Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
    User 2     9       8       7       ?       1       2
    User 3     8       9       8       9       3       1
    Only consider items both users have rated. Common similarity measures: cosine similarity, Pearson correlation.
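A small sketch of the user-based similarity computation, using the User 2 and User 3 rows from the table above (cosine similarity and Pearson correlation restricted to co-rated items). The final prediction step, a similarity-weighted average over neighbours for the missing Item 4 rating, is an illustrative assumption and not prescribed by the slide.

```python
import math

# Ratings from the slide; None marks the unknown cell (User 2, Item 4).
ratings = {
    "User 2": [9, 8, 7, None, 1, 2],
    "User 3": [8, 9, 8, 9, 3, 1],
}

def corated(u, v):
    """Pairs of ratings for items that both users have rated."""
    return [(a, b) for a, b in zip(u, v) if a is not None and b is not None]

def cosine(u, v):
    pairs = corated(u, v)
    num = sum(a * b for a, b in pairs)
    den = math.sqrt(sum(a * a for a, _ in pairs)) * math.sqrt(sum(b * b for _, b in pairs))
    return num / den

def pearson(u, v):
    pairs = corated(u, v)
    mu = sum(a for a, _ in pairs) / len(pairs)
    mv = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mu) * (b - mv) for a, b in pairs)
    den = math.sqrt(sum((a - mu) ** 2 for a, _ in pairs)) * math.sqrt(sum((b - mv) ** 2 for _, b in pairs))
    return num / den

sim = cosine(ratings["User 2"], ratings["User 3"])
print(f"cosine(User 2, User 3)  = {sim:.3f}")      # high: the two users largely agree
print(f"pearson(User 2, User 3) = {pearson(ratings['User 2'], ratings['User 3']):.3f}")

# Hypothetical prediction step: fill User 2's Item 4 rating with a
# similarity-weighted average of the neighbours' ratings for Item 4.
neighbours = {"User 3": sim}
pred = sum(s * ratings[n][3] for n, s in neighbours.items()) / sum(neighbours.values())
print(f"predicted rating of User 2 for Item 4 = {pred:.1f}")
```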
  • Slide 16
  • Recommendation Approaches Collaborative filtering Nearest neighbor based User based Item based Model based Content based strategies Hybrid approaches 16
  • Slide 17
  • Item-based Collaborative Filtering Idea: a user is likely to have the same opinion for similar items Similarity between items is decided by looking at how other users have rated them 17
  • Slide 18
  • Example: Item-based CF
             Item 1  Item 2  Item 3  Item 4  Item 5
    User 1     8       1       ?       2       7
    User 2     2       2       5       7       5
    User 3     5       4       7       4       7
    User 4     7       1       7       3       8
    User 5     1       7       4       6       5
    User 6     8       3       8       3       7
  • Slide 19
  • Similarity between Items
             Item 3  Item 4
    User 1     ?       2
    User 2     5       7
    User 3     7       4
    User 4     7       3
    User 5     4       6
    User 6     8       3
    Only consider users who have rated both items. Common similarity measures: cosine similarity, Pearson correlation.
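The same idea applied column-wise; a short sketch computing the cosine similarity between Item 3 and Item 4 over the users who rated both items, using the columns above.

```python
import math

# Item 3 and Item 4 ratings per user, from the table; None is the unknown cell.
item3 = [None, 5, 7, 7, 4, 8]
item4 = [2, 7, 4, 3, 6, 3]

# Keep only users who rated both items, then apply cosine similarity.
pairs = [(a, b) for a, b in zip(item3, item4) if a is not None and b is not None]
num = sum(a * b for a, b in pairs)
den = math.sqrt(sum(a * a for a, _ in pairs)) * math.sqrt(sum(b * b for _, b in pairs))
print(f"cosine(Item 3, Item 4) = {num / den:.3f}")
```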
  • Slide 20
  • Recommendation Approaches Collaborative filtering Nearest neighbor based Model based Matrix factorization (e.g., SVD) Content based strategies Hybrid approaches 20
  • Slide 21
  • Singular Value Decomposition (SVD). A mathematical method applied to many problems. Given any m×n matrix R, find matrices U, Σ, and V such that R = U Σ V^T, where U is m×r and orthonormal, Σ is r×r and diagonal, and V is n×r and orthonormal. Removing the smallest singular values gives a rank-k approximation R_{m,k} with k < r.
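A minimal numpy sketch of the truncation step described above, applied to the earlier rating matrix; the unknown User 2 / Item 4 entry is imputed with a placeholder value, an assumption made only for illustration.

```python
import numpy as np

# Rating matrix from the earlier slide (users x items); the unknown
# (User 2, Item 4) cell is imputed with a placeholder value of 5.
R = np.array([[8, 1, 7, 2, 9, 8],
              [9, 8, 7, 5, 1, 2],
              [8, 9, 8, 9, 3, 1],
              [2, 1, 1, 2, 3, 1],
              [3, 1, 2, 3, 2, 2],
              [1, 2, 2, 1, 1, 1]], dtype=float)

# Full SVD: R = U * diag(s) * Vt, with singular values s in decreasing order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # keep only the k largest singular values
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("rank-%d approximation error: %.3f" % (k, np.linalg.norm(R - R_k)))
```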
  • Hierarchical Agglomerative Clustering. Put every point in a cluster by itself. For i = 1 to N-1 do { let C1 and C2 be the most mergeable pair of clusters; create C1,2 as the parent of C1 and C2 }. Example (1-dimensional objects for simplicity): numerical objects 1, 2, 5, 6, 7. Agglomerative clustering: find the two closest objects and merge; => {1,2}, so we now have {1.5, 5, 6, 7}; => {1,2}, {5,6}, so {1.5, 5.5, 7}; => {1,2}, {{5,6},7}. 90
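A small sketch of this agglomerative procedure on the slide's 1-dimensional objects, merging the pair of clusters with the closest centroids at each step (centroid merging is one choice of "most mergeable"; other linkage rules exist).

```python
# Each cluster is (members, centroid); start with every point in its own cluster.
points = [1, 2, 5, 6, 7]
clusters = [([p], float(p)) for p in points]

while len(clusters) > 1:
    # The most "mergeable" pair here is the pair of clusters with closest centroids.
    i, j = min(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda ij: abs(clusters[ij[0]][1] - clusters[ij[1]][1]),
    )
    members = clusters[i][0] + clusters[j][0]
    merged = (members, sum(members) / len(members))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    print([sorted(c[0]) for c in clusters])

# The first merges are {1,2}, then {5,6}, then {{5,6},7}, matching the slide's trace.
```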
  • Slide 91
  • Recommendation Approaches Collaborative filtering Content based strategies Association Rule Mining Text similarity based Clustering Classification Hybrid approaches 91
  • Slide 92
  • Illustrating Classification Task
  • Slide 93
  • Classification k-Nearest Neighbor (kNN) Decision Tree Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 93
  • Slide 94
  • k-Nearest Neighbor Classification (kNN). kNN does not build a model from the training data. Approach: to classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d; count the number n of training instances in P that belong to class c_j; estimate Pr(c_j | d) as n/k (majority vote). No training is needed, but classification time is linear in the training set size for each test case. k is usually chosen empirically via a validation set or cross-validation by trying a range of k values. The distance function is crucial but depends on the application. 94
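A minimal sketch of the kNN procedure just described, assuming Euclidean distance and a small hypothetical 2-D training set labelled with the Car/Book/Clothes classes used in the next two examples.

```python
import math
from collections import Counter

# Hypothetical training instances: (feature vector, class label).
train = [((1.0, 1.0), "Book"), ((1.2, 0.8), "Book"),
         ((5.0, 5.0), "Car"),  ((5.5, 4.5), "Car"),
         ((9.0, 1.0), "Clothes")]

def knn_classify(x, train, k=3):
    """Majority vote among the k training instances nearest to x (Euclidean distance)."""
    neighbours = sorted(train, key=lambda t: math.dist(t[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9), train, k=1))   # -> Book  (1NN)
print(knn_classify((4.0, 3.5), train, k=3))   # -> Car   (3NN majority vote)
```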
  • Slide 95
  • Example: k=1 (1NN) Car Book Clothes Book 95 which class?
  • Slide 96
  • Example: k=3 (3NN) Car Book Clothes Car 96 which class?
  • Slide 97
  • Discussion. Advantages: nonparametric architecture, simple, powerful, requires no training time. Disadvantages: memory intensive, classification/estimation is slow, sensitive to the choice of k. 97
  • Slide 98
  • Classification k-Nearest Neighbor (kNN) Decision Tree Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 98
  • Slide 99
  • Example of a Decision Tree. Training data with categorical and continuous attributes and a class label; the task is to judge the cheat possibility: Yes/No.
  • Slide 100
  • Example of a Decision Tree. Model: a decision tree built from the training data, with splitting attributes Refund (Yes / No), MarSt (Married vs. Single, Divorced), and TaxInc (< 80K vs. > 80K); the leaf nodes predict YES or NO for the cheat class. Task: judge the cheat possibility: Yes/No.
  • Slide 101
  • Another Example of Decision Tree. A different tree that splits on MarSt first, then Refund and TaxInc (< 80K vs. > 80K), fits the same training data: there could be more than one tree that fits the same data! Task: judge the cheat possibility: Yes/No.
  • Slide 102
  • Decision Tree - Construction Creating Decision Trees Manual - Based on expert knowledge Automated - Based on training data (DM) Two main issues: Issue #1: Which attribute to take for a split? Issue #2: When to stop splitting?
  • Slide 103
  • Classification k-Nearest Neighbor (kNN) Decision Tree CART C4.5 Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 103
  • Slide 104
  • The CART Algorithm. Classification And Regression Trees, developed by Breiman et al. in the early 1980s. Introduced tree-based modeling into the statistical mainstream. A rigorous approach involving cross-validation to select the optimal tree. 104
  • Slide 105
  • Key Idea: Recursive Partitioning. Take all of your data. Consider all possible values of all variables. Select the variable/value (X = t_1) that produces the greatest separation in the target; (X = t_1) is called a split. If X < t_1 then send the data point to the left; otherwise, send it to the right. Now repeat the same process on these two nodes, and you get a tree. Note: CART only uses binary splits. 105
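A compact sketch of this recursive partitioning loop, assuming numeric features given as row vectors and a goodness(left_labels, right_labels) function such as the Φ(s|t) measure defined on the following slides. The helper names are illustrative, not CART's actual implementation (no pruning or cross-validation here).

```python
def grow_tree(rows, labels, goodness, min_size=2):
    """Recursively apply the best binary split (X_j < t) until nodes are small or pure."""
    if len(set(labels)) == 1 or len(rows) < min_size:
        return {"leaf": max(set(labels), key=labels.count)}   # majority-class leaf

    best = None
    for j in range(len(rows[0])):                 # consider every variable...
        for t in sorted({r[j] for r in rows}):    # ...and every observed value as threshold
            left = [i for i, r in enumerate(rows) if r[j] < t]
            right = [i for i, r in enumerate(rows) if r[j] >= t]
            if not left or not right:
                continue
            score = goodness([labels[i] for i in left], [labels[i] for i in right])
            if best is None or score > best[0]:
                best = (score, j, t, left, right)

    if best is None:
        return {"leaf": max(set(labels), key=labels.count)}
    _, j, t, left, right = best
    return {"split": (j, t),
            "left":  grow_tree([rows[i] for i in left],  [labels[i] for i in left],  goodness, min_size),
            "right": grow_tree([rows[i] for i in right], [labels[i] for i in right], goodness, min_size)}
```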
  • Slide 106
  • Key Idea. Let Φ(s|t) be a measure of the goodness of a candidate split s at node t:
    Φ(s|t) = 2 * P_L * P_R * Q(s|t),  where  Q(s|t) = Σ_j | P(j|t_L) - P(j|t_R) |,
    P_L and P_R are the proportions of records at t sent to the left and right child nodes, and P(j|t_L), P(j|t_R) are the proportions of class j records in the left and right child nodes. The optimal split maximizes this Φ(s|t) measure over all possible splits at node t. 106
  • Slide 107
  • Key Idea. Φ(s|t) is large when both of its main components are large:
    1. 2 * P_L * P_R: maximum value if the child nodes are of equal size (same support), e.g., 0.5 × 0.5 = 0.25 versus 0.9 × 0.1 = 0.09.
    2. Q(s|t) = Σ_j | P(j|t_L) - P(j|t_R) |: maximum value if, for each class, the child nodes are completely uniform (pure). The theoretical maximum value of Q(s|t) is k, where k is the number of classes for the target variable. 107
  • Slide 108
  • CART Example 108 Training Set of Records for Classifying Credit Risk
  • Slide 109
  • CART Example Candidate Splits 109. Candidate splits for t = root node; each split sends the records satisfying the condition to the left child node t_L and the remaining records to the right child node t_R:
    1  Savings = low
    2  Savings = medium
    3  Savings = high
    4  Assets = low
    5  Assets = medium
    6  Assets = high
    7-9  Income ≤ t for increasing thresholds t (the largest threshold is $75,000)
    CART is restricted to binary splits.
  • Slide 110
  • CART Primer: Split 1 -> Savings = low (left child if true, right child if false).
    Left: records 2, 5, 7; Right: records 1, 3, 4, 6, 8.
    P_L = 3/8 = 0.375, P_R = 5/8 = 0.625  ->  2 * P_L * P_R = 30/64 = 0.46875
    P(Bad | t_L) = 2/3 = 0.67,  P(Bad | t_R) = 1/5 = 0.2
    P(Good | t_L) = 1/3 = 0.33, P(Good | t_R) = 4/5 = 0.8
    Q(s|t) = |0.67 - 0.2| + |0.8 - 0.33| = 0.934
    Φ(s|t) = 0.46875 * 0.934 = 0.4378  110
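A short sketch that recomputes the split 1 goodness above directly from the class composition of the two children (two Bad and one Good on the left, one Bad and four Good on the right). It could also serve as the goodness function in the recursive-partitioning sketch shown earlier.

```python
def phi_split(left, right):
    """Goodness of split: phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|."""
    n = len(left) + len(right)
    p_l, p_r = len(left) / n, len(right) / n
    classes = set(left) | set(right)
    q = sum(abs(left.count(c) / len(left) - right.count(c) / len(right)) for c in classes)
    return 2 * p_l * p_r * q

# Split 1 (Savings = low): left child has 2 Bad, 1 Good; right child has 1 Bad, 4 Good.
left_labels = ["Bad", "Good", "Bad"]
right_labels = ["Good", "Bad", "Good", "Good", "Good"]
print(round(phi_split(left_labels, right_labels), 4))
# -> 0.4375 (the slide's 0.4378 comes from rounding Q(s|t) up to 0.934)
```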
  • Slide 111
  • CART Example 111. Values of the components of the measure Φ(s|t) for each candidate split on the root node:
    Split  P_L    P_R    P(j|t_L)             P(j|t_R)             2*P_L*P_R  Q(s|t)  Φ(s|t)
    1      0.375  0.625  G: 0.333  B: 0.667   G: 0.8    B: 0.2     0.46875    0.934   0.4378
    2      0.375  0.625  G: 1      B: 0       G: 0.4    B: 0.6     0.46875    1.2     0.5625
    3      0.25   0.75   G: 0.5    B: 0.5     G: 0.667  B: 0.333   0.375      0.334   0.1253
    4      0.25   0.75   G: 0      B: 1       G: 0.833  B: 0.167   0.375      1.667   0.6248
    5      0.5    0.5    G: 0.75   B: 0.25    G: 0.5    B: 0.5     0.5        0.5     0.25
    6      0.25   0.75   G: 1      B: 0       G: 0.5    B: 0.5     0.375      1       0.375
    7      0.375  0.625  G: 0.333  B: 0.667   G: 0.8    B: 0.2     0.46875    0.934   0.4378
    8      0.625  0.375  G: 0.4    B: 0.6     G: 1      B: 0       0.46875    1.2     0.5625
    9      0.875  0.125  G: 0.571  B: 0.429   G: 1      B: 0       0.21875    0.858   0.1877
    For each candidate split, examine the values of the various components of the measure Φ(s|t); split 4 (Assets = low) gives the largest Φ(s|t).
  • Slide 112
  • CART Example 112. CART decision tree after the initial split: the root node (all records) splits on Assets = Low vs. Assets in {Medium, High}; Assets = Low leads to Bad Risk (records 2, 7), and Assets in {Medium, High} leads to Decision Node A (records 1, 3, 4, 5, 6, 8).
  • Slide 113
  • CART Example 113. Values of the components of the measure Φ(s|t) for each candidate split on decision node A (records 1, 3, 4, 5, 6, 8); split 4 (Assets = low) no longer applies at this node:
    Split  P_L    P_R    P(j|t_L)             P(j|t_R)             2*P_L*P_R  Q(s|t)  Φ(s|t)
    1      0.167  0.833  G: 1      B: 0       G: 0.8    B: 0.2     0.2782     0.4     0.1112
    2      0.5    0.5    G: 1      B: 0       G: 0.667  B: 0.333   0.5        0.6666  0.3333
    3      0.333  0.667  G: 0.5    B: 0.5     G: 1      B: 0       0.4444     1       0.4444
    5      0.667  0.333  G: 0.75   B: 0.25    G: 1      B: 0       0.4444     0.5     0.2222
    6      0.333  0.667  G: 1      B: 0       G: 0.75   B: 0.25    0.4444     0.5     0.2222
    7      0.333  0.667  G: 0.5    B: 0.5     G: 1      B: 0       0.4444     1       0.4444
    8      0.5    0.5    G: 0.667  B: 0.333   G: 1      B: 0       0.5        0.6666  0.3333
    9      0.833  0.167  G: 0.8    B: 0.2     G: 1      B: 0       0.2782     0.4     0.1112
  • Slide 114
  • CART Example 114. CART decision tree after the decision node A split: the root node (all records) splits on Assets = Low vs. Assets in {Medium, High}; Assets = Low leads to Bad Risk (records 2, 7); Assets in {Medium, High} leads to Decision Node A (records 1, 3, 4, 5, 6, 8), which splits on Savings = High vs. Savings in {Low, Medium}; Savings = High leads to Decision Node B (records 3, 6), and Savings in {Low, Medium} leads to Good Risk (records 1, 4, 5, 8).
  • Slide 115
  • CART Example 115. CART decision tree, fully grown form: the root node splits on Assets = Low vs. Assets in {Medium, High}; Assets = Low leads to Bad Risk (records 2, 7); Assets in {Medium, High} leads to Decision Node A (records 1, 3, 4, 5, 6, 8), which splits on Savings = High vs. Savings in {Low, Medium}; Savings in {Low, Medium} leads to Good Risk (records 1, 4, 5, 8); Savings = High leads to Decision Node B (records 3, 6), which splits on Assets = Medium, leading to Bad Risk (record 3), vs. Assets = High, leading to Good Risk (record 6).
  • Slide 116
  • Classification k-Nearest Neighbor (kNN) Decision Tree CART C4.5 Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 116
  • Slide 117
  • The C4.5 Algorithm Proposed by Quinlan in 1993 An internal node represents a test on an attribute. A branch represents an outcome of the test, e.g., Color=red. A leaf node represents a class label or class label distribution. At each node, one attribute is chosen to split training examples into distinct classes as much as possible A new case is classified by following a matching path to a leaf node. 117
  • Slide 118
  • The C4.5 Algorithm. Differences between CART and C4.5: unlike CART, the C4.5 algorithm is not restricted to binary splits; it produces a separate branch for each value of a categorical attribute. C4.5's method for measuring node homogeneity is also different from CART's. 118
  • Slide 119
  • The C4.5 Algorithm - Measure. A candidate split S partitions the training data set T into several subsets T_1, T_2, ..., T_k. C4.5 uses the concept of entropy reduction to select the optimal split:
    entropy_reduction(S) = H(T) - H_S(T)
    where the entropy of a set X with class proportions p_i is H(X) = - Σ_i p_i log2(p_i), and H_S(T) = Σ_{i=1..k} (|T_i| / |T|) H(T_i) is the weighted sum of the entropies of the individual subsets T_1, T_2, ..., T_k, with |T_i| / |T| the proportion of records in subset i. C4.5 chooses the optimal split as the one with the greatest entropy reduction. 119
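A minimal sketch of entropy_reduction(S) = H(T) - H_S(T) as defined above; the toy parent and subset label lists at the end are hypothetical, used only to exercise the functions.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = - sum_i p_i * log2(p_i), where p_i is the proportion of class i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_reduction(parent, subsets):
    """H(T) minus the weighted sum of the subset entropies H_S(T)."""
    n = len(parent)
    h_s = sum(len(t_i) / n * entropy(t_i) for t_i in subsets)
    return entropy(parent) - h_s

# Hypothetical split: a 5 Good / 3 Bad parent node separated into two branches.
parent = ["G"] * 5 + ["B"] * 3
subsets = [["G", "G", "G", "G"], ["G", "B", "B", "B"]]
print(round(entropy_reduction(parent, subsets), 3))   # entropy reduction of this candidate split
```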
  • Slide 120
  • Classification k-Nearest Neighbor (kNN) Decision Tree Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 120
  • Slide 121
  • Bayes Rule. Recommender system question: L_i is the class for item i (i.e., that the user likes item i) and A is the set of features associated with item i; estimate p(L_i | A). Bayes' rule: p(L_i | A) = p(A | L_i) p(L_i) / p(A). We can always restate a conditional probability in terms of the reverse condition p(A | L_i) and two prior probabilities, p(L_i) and p(A). Often the reverse condition is easier to know: we can count how often a feature appears in items the user liked (a frequentist assumption). 121
  • Slide 122
  • Naive Bayes. Independence (naïve Bayes assumption): the features a_1, a_2, ..., a_k are independent.
    For the joint probability: p(a_1, a_2, ..., a_k) = p(a_1) p(a_2) ... p(a_k).
    For the conditional probability: p(a_1, a_2, ..., a_k | L_i) = p(a_1 | L_i) p(a_2 | L_i) ... p(a_k | L_i).
    Combined with Bayes' rule: p(L_i | a_1, ..., a_k) is proportional to p(L_i) p(a_1 | L_i) ... p(a_k | L_i). 122
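A minimal sketch of the naïve Bayes classification rule above for categorical features, estimating p(L_i) and p(a_j | L_i) by counting. The toy "genre"/"price" features and the add-one smoothing (to avoid zero probabilities) are illustrative assumptions, not part of the slides.

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_dict, label). Returns counts needed for p(L) and p(a|L)."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)          # (label, feature) -> value counts
    for features, label in examples:
        for f, v in features.items():
            feature_counts[(label, f)][v] += 1
    return class_counts, feature_counts

def classify_nb(features, class_counts, feature_counts):
    """argmax over labels of p(L) * prod_j p(a_j | L), with add-one smoothing."""
    n = sum(class_counts.values())
    best = None
    for label, c in class_counts.items():
        score = c / n
        for f, v in features.items():
            counts = feature_counts[(label, f)]
            score *= (counts[v] + 1) / (sum(counts.values()) + len(counts) + 1)
        if best is None or score > best[0]:
            best = (score, label)
    return best[1]

# Hypothetical toy data: does the user like the item, given two categorical features?
data = [({"genre": "sci-fi", "price": "low"}, "like"),
        ({"genre": "sci-fi", "price": "high"}, "like"),
        ({"genre": "drama",  "price": "high"}, "dislike"),
        ({"genre": "drama",  "price": "low"}, "dislike")]
cc, fc = train_nb(data)
print(classify_nb({"genre": "sci-fi", "price": "high"}, cc, fc))   # -> like
```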
  • Slide 123
  • An Example Compute all probabilities required for classification 123
  • Slide 124
  • An Example. For C = t, we compute p(C = t) multiplied by the product of p(a_j | C = t) over the features; for class C = f, we compute the analogous product. The value for C = t is larger, so C = t is more probable and t is the final class. 124
  • Slide 125
  • Naïve Bayesian Classifier. Advantages: easy to implement; very efficient; good results obtained in many applications. Disadvantage: the class conditional independence assumption, hence loss of accuracy when the assumption is seriously violated (e.g., on highly correlated data sets). 125
  • Slide 126
  • Classification k-Nearest Neighbor (kNN) Decision Tree Naïve Bayesian Artificial Neural Network Support Vector Machine Ensemble methods 126
  • Slide 127
  • References for Machine Learning T. Mitchell, Machine Learning, McGraw Hill, 1997 C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Springer, 2001. V. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998. Y. Kodratoff, R. S. Michalski, Machine Learning: An Artificial Intelligence Approach, Volume III, Morgan Kaufmann, 1990 127
  • Slide 128
  • Recommendation Approaches Collaborative filtering Nearest neighbor based Model based Content based strategies Association Rule Mining Text similarity based Clustering Classification Hybrid approaches 128
  • Slide 129
  • The Netflix Prize Slides here are from Yehuda Koren.
  • Slide 130
  • Netflix. Movie rentals by DVD (mail) and online (streaming). 100k movies, 10 million customers. Ships 1.9 million disks to customers each day from 50 warehouses in the US: a complex logistics problem. Employees: 2000, but relatively few in engineering/software, and only a few people working on recommender systems. Moving towards online delivery of content; significant interaction of customers with the Web site. 130
  • Slide 131
  • The $1 Million Question 131
  • Slide 132
  • Million Dollars Awarded September 21st, 2009 132
  • Slide 133
  • 133
  • Slide 134
  • Lessons Learned. Scale is important, e.g., stochastic gradient descent on sparse matrices. Latent factor models work well on this problem; previously they had not been explored for recommender systems. Understanding your data is important, e.g., time effects. Combining models works surprisingly well, but the final 10% improvement can probably be achieved by judiciously combining about 10 models rather than 1000s; this is likely what Netflix will do in practice. 134
  • Slide 135
  • Useful References. Y. Koren, Collaborative filtering with temporal dynamics, ACM SIGKDD Conference, 2009. Y. Koren, R. Bell, and C. Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer, 2009. Y. Koren, Factor in the neighbors: scalable and accurate collaborative filtering, ACM Transactions on Knowledge Discovery from Data, 2010. 135
  • Slide 136
  • Thank you! 136