Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene...


Collaborative Filtering

- Rajashree


Apache Mahout

• Started in 2008 as a subproject of Apache’s Lucene project

• Mahout absorbed the Taste open source collaborative filtering project


Apache Mahout

• A mahout is an elephant trainer/driver/keeper, hence… (Hadoop elephant) + Machine Learning = Mahout


What is Apache Mahout?

• Machine learning
• For large data
• Based on Hadoop
• But can work on a non-Hadoop cluster
• Scalable
• Licensed by Apache

Mahout is an open source, scalable machine learning library from Apache, written in Java.


Mahout in the Apache Software Foundation

Lucene: information retrieval software library


Mahout in the Apache Software Foundation

Hadoop: framework for distributed storage and programming based on MapReduce


Mahout in the Apache Software Foundation

Taste: collaborative filtering framework


Why Mahout?

• Mahout provides a rich set of components from which you can construct a customized recommender system from a selection of algorithms.

• Mahout is designed for performance, scalability and flexibility.


Why do we need a scalable framework?


General Architecture

Three-tier architecture (Application, Algorithms and Shared Libraries)


General Architecture

Data Storage and Shared Libraries


General Architecture

Business Logic


General Architecture

External Applications invoking Mahout APIs


Use cases

• Currently Mahout supports mainly four use cases:
  – Recommendation: takes users' behavior and from that tries to find items users might like.
  – Clustering: takes e.g. text documents and groups them into groups of topically related documents.
  – Classification: learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
  – Frequent itemset mining: takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.


In this presentation we will focus on

Recommendation


Recommendation

Mahout implements a Collaborative Filtering framework
  – Popularized by Amazon and others
  – Uses historical data (ratings, clicks, and purchases) to provide recommendations
    • User-based: Recommend items by finding similar users. This is often harder to scale because of the dynamic nature of users.
    • Item-based: Calculate similarity between items and make recommendations. Items usually don't change much, so this often can be computed offline.
    • Slope-One: A very fast and simple item-based recommendation approach applicable when users have given ratings (and not just boolean preferences).


Collaborative Filtering

• Recommend people and products
  – User-User
    • User likes X, you might too
  – Item-Item
    • People who bought X also bought Y
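The item-item idea ("people who bought X also bought Y") can be illustrated with a minimal plain-Java sketch that counts how often other items share a basket with a given item. It does not use Mahout; the class name `CoOccurrence`, the method `boughtTogether`, and the sample baskets are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class CoOccurrence {

    // For the given item, counts in how many baskets each other item also appears.
    static Map<String, Integer> boughtTogether(List<Set<String>> baskets, String item) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Set<String> basket : baskets) {
            if (!basket.contains(item)) {
                continue; // this basket says nothing about the item
            }
            for (String other : basket) {
                if (!other.equals(item)) {
                    counts.merge(other, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical purchase history: three baskets
        List<Set<String>> baskets = new ArrayList<Set<String>>();
        baskets.add(new HashSet<String>(Arrays.asList("X", "Y")));
        baskets.add(new HashSet<String>(Arrays.asList("X", "Y", "Z")));
        baskets.add(new HashSet<String>(Arrays.asList("Z")));

        System.out.println(boughtTogether(baskets, "X")); // prints {Y=2, Z=1}
    }
}
```

Ranking the counts from highest to lowest gives a crude "also bought" list; real item-based recommenders normalize these counts with a similarity measure.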


Example

Amazon.com


Recommendation - Architecture

Inceptive Idea:

A Java/J2EE application invokes a Mahout Recommender whose DataModel is based on a set of User Preferences that are built from a physical DataStore


Recommendation - Architecture

Physical Data

Data model

Recommender

External Application


Recommendation in Mahout

• Input: raw data (user preferences)
• Output: estimated preferences

• Step 1
  – Mapping raw data into a Mahout-compliant DataModel
• Step 2
  – Tuning recommender components
    • Similarity measure, neighborhood, etc.


Recommendation Components

Mahout implements interfaces to these key abstractions:
  – DataModel
    • Methods for mapping raw data to a Mahout-compliant form
  – UserSimilarity
    • Methods to calculate the degree of correlation between two users
  – ItemSimilarity
    • Methods to calculate the degree of correlation between two items
  – UserNeighborhood
    • Methods to define the concept of ‘neighborhood’
  – Recommender
    • Methods to implement the recommendation step itself


Components: DataModel

A DataModel is the interface through which the recommender reads user preferences.

Which sources can it draw from?
  – Database
    • MySQLJDBCDataModel
  – External Files
    • FileDataModel
  – Generic (preferences fed directly through Java code)
    • GenericDataModel


Components: DataModel

Basic object: Preference
  – A Preference is a triple (user, item, score)

Two implementations
  – GenericUserPreferenceArray
    • Stores the numerical preference values as well.
  – BooleanUserPreferenceArray
    • Skips the numerical preference values (boolean preferences only).
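To make the (user, item, score) triple concrete, here is a minimal plain-Java sketch that parses one comma-separated line into such a triple. It is not Mahout's `Preference` interface; the class `PreferenceLine` and its fields are hypothetical. A missing score is treated as a boolean preference, loosely mirroring the generic versus boolean distinction above.

```java
class PreferenceLine {
    final long userId;
    final long itemId;
    final Float score; // null models a boolean preference (no numerical value)

    PreferenceLine(long userId, long itemId, Float score) {
        this.userId = userId;
        this.itemId = itemId;
        this.score = score;
    }

    // Parses "user,item" or "user,item,score" into a triple.
    static PreferenceLine parse(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected user,item[,score]: " + line);
        }
        long user = Long.parseLong(parts[0].trim());
        long item = Long.parseLong(parts[1].trim());
        Float score = parts.length > 2 ? Float.valueOf(parts[2].trim()) : null;
        return new PreferenceLine(user, item, score);
    }

    boolean isBoolean() {
        return score == null;
    }

    public static void main(String[] args) {
        PreferenceLine rated = parse("1,101,5.0"); // user 1 rated item 101 with 5.0
        PreferenceLine seen = parse("2,102");      // user 2 interacted with item 102
        System.out.println(rated.userId + "," + rated.itemId + "," + rated.score);
        System.out.println(seen.isBoolean());
    }
}
```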


Components: UserSimilarity

• UserSimilarity defines a notion of similarity between two Users.
  – (respectively) ItemSimilarity defines a notion of similarity between two Items.

• Which definitions of similarity are available?
  – Pearson Correlation
  – Spearman Correlation
  – Euclidean Distance
  – Tanimoto Coefficient
  – LogLikelihood Similarity
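As an illustration of one measure from the list, the following is a minimal plain-Java sketch of the Pearson correlation between two users, computed over the items both have rated. It is not Mahout's `PearsonCorrelationSimilarity`; the class `PearsonSketch` and the sample ratings are hypothetical, and the two arrays are assumed to be aligned by item.

```java
class PearsonSketch {

    // Pearson correlation of two equally long rating vectors (aligned by item).
    // Returns NaN when one user rated every common item identically (zero variance).
    static double pearson(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("Vectors must be non-empty and of equal length");
        }
        int n = a.length;
        double meanA = 0.0;
        double meanB = 0.0;
        for (int i = 0; i < n; i++) {
            meanA += a[i];
            meanB += b[i];
        }
        meanA /= n;
        meanB /= n;

        double cov = 0.0;
        double varA = 0.0;
        double varB = 0.0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0.0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        double[] user1 = {1.0, 2.0, 3.0};
        double[] user2 = {2.0, 4.0, 6.0}; // same taste, different rating scale
        System.out.println(pearson(user1, user2)); // prints 1.0
    }
}
```

Because the measure subtracts each user's mean, two users who rank items the same way score close to 1 even if one of them rates everything higher.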


Example: TanimotoDistance
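A minimal plain-Java sketch of the Tanimoto coefficient, assuming the usual set-based definition: the number of items two users have in common divided by the number of items either of them has. The distance is commonly taken as one minus the coefficient. The class `TanimotoSketch` and the item IDs are hypothetical; this is not Mahout's `TanimotoCoefficientSimilarity`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class TanimotoSketch {

    // |A ∩ B| / |A ∪ B| over the sets of items each user has expressed a preference for.
    static double coefficient(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return Double.NaN; // undefined when neither user has any items
        }
        Set<Long> intersection = new HashSet<Long>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Distance form, assuming distance = 1 - coefficient.
    static double distance(Set<Long> a, Set<Long> b) {
        return 1.0 - coefficient(a, b);
    }

    public static void main(String[] args) {
        Set<Long> user1 = new HashSet<Long>(Arrays.asList(1L, 2L, 3L));
        Set<Long> user2 = new HashSet<Long>(Arrays.asList(2L, 3L, 4L));
        System.out.println(coefficient(user1, user2)); // 2 shared / 4 total, prints 0.5
    }
}
```

The measure ignores rating values entirely, which is why it suits boolean preference data.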


Components: UserNeighborhood

• UserNeighborhood defines which users count as the ‘neighbors’ of a given user, based on a UserSimilarity.

• Which definitions of neighborhood are available?
  – Nearest N users
    • The first N users with the highest similarity are labeled as ‘neighbors’
  – Thresholds
    • Users whose similarity is above a threshold are labeled as ‘neighbors’
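The two neighborhood definitions above can be sketched in plain Java as follows. The input is assumed to be a map from each candidate user ID to that user's similarity with the target user; the class `NeighborhoodSketch`, its method names, and the sample values are hypothetical and do not reproduce Mahout's own neighborhood classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NeighborhoodSketch {

    // Nearest N: the N users with the highest similarity to the target user.
    static List<Long> nearestN(Map<Long, Double> similarities, int n) {
        List<Map.Entry<Long, Double>> entries =
                new ArrayList<Map.Entry<Long, Double>>(similarities.entrySet());
        // Sort by similarity, highest first
        entries.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        List<Long> neighbors = new ArrayList<Long>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            neighbors.add(entries.get(i).getKey());
        }
        return neighbors;
    }

    // Threshold: every user whose similarity is above the threshold, ordered by user ID.
    static List<Long> aboveThreshold(Map<Long, Double> similarities, double threshold) {
        List<Long> neighbors = new ArrayList<Long>();
        for (Map.Entry<Long, Double> e : similarities.entrySet()) {
            if (e.getValue() > threshold) {
                neighbors.add(e.getKey());
            }
        }
        Collections.sort(neighbors);
        return neighbors;
    }

    public static void main(String[] args) {
        // Hypothetical similarities of users 2, 3 and 4 to the target user
        Map<Long, Double> sims = new HashMap<Long, Double>();
        sims.put(2L, 0.9);
        sims.put(3L, 0.1);
        sims.put(4L, 0.5);
        System.out.println(nearestN(sims, 2));         // prints [2, 4]
        System.out.println(aboveThreshold(sims, 0.4)); // prints [2, 4]
    }
}
```

Nearest N always yields a fixed-size neighborhood, even if some neighbors are only weakly similar; a threshold yields only good matches but may return very few users, or none.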


Components: Recommender

• Given a DataModel, a definition of similarity between users (items) and a definition of neighborhood, a recommender produces as output an estimation of relevance for each unseen item

• Which recommendation algorithms are implemented?
  – User-based CF
  – Item-based CF
  – SVD-based CF
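One common way to estimate the relevance of an unseen item in user-based CF is a similarity-weighted average of the neighbors' ratings. The plain-Java sketch below shows that idea only; it assumes positive similarities, the class `UserBasedEstimate` and the sample numbers are hypothetical, and Mahout's own recommenders may differ in the details.

```java
class UserBasedEstimate {

    // Estimates the target user's preference for one item as the
    // similarity-weighted average of the neighbors' ratings for that item.
    static double estimate(double[] neighborSimilarities, double[] neighborRatings) {
        if (neighborSimilarities.length != neighborRatings.length) {
            throw new IllegalArgumentException("One rating per neighbor is required");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < neighborRatings.length; i++) {
            weightedSum += neighborSimilarities[i] * neighborRatings[i];
            totalWeight += neighborSimilarities[i];
        }
        // No usable neighbors: no estimate can be made
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        double[] similarities = {0.5, 1.0}; // two neighbors
        double[] ratings = {4.0, 1.0};      // their ratings for the unseen item
        // (0.5 * 4.0 + 1.0 * 1.0) / (0.5 + 1.0)
        System.out.println(estimate(similarities, ratings)); // prints 2.0
    }
}
```

The more similar neighbor pulls the estimate toward its own rating, which is the behavior the similarity and neighborhood components are tuned to control.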


Download Mahout

• Download
  – The latest Mahout release is 0.9
  – Available at: http://apache.fastbull.org/mahout/0.9/mahout-distribution-0.9.zip
  – Extract all the libraries and include them in a new NetBeans (Eclipse) project

• Requirement: Java 1.6.x or greater.
• Hadoop is not mandatory!


Exercise 1: First Recommender

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.*;
import org.apache.mahout.cf.taste.impl.neighborhood.*;
import org.apache.mahout.cf.taste.impl.recommender.*;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.*;
import org.apache.mahout.cf.taste.neighborhood.*;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.*;

class RecommenderIntro {

    private RecommenderIntro() { }

    public static void main(String[] args) throws Exception {
        // Load the user preferences from a file
        DataModel model = new FileDataModel(new File("data/mydata.dat"));
        // Similarity between users: Pearson correlation
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Neighborhood: the 2 users most similar to the target user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        // User-based recommender built from the three components above
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Ask for 1 recommendation for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 1);

        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }
    }
}


Exercise 1: First Recommender (FileDataModel)

• The class is placed in the package dnslab.recommender.
• The FileDataModel line of the program loads the user preferences from the file data/mydata.dat.


Exercise 1: First Recommender (Neighbors)

• The NearestNUserNeighborhood(2, similarity, model) line of the program selects the 2 most similar users as neighbors.


Exercise 1: First Recommender (Top-1 Recommendation for User 1)

• The recommender.recommend(1, 1) call returns the top-1 recommendation for user 1.



Output


Exercise 2: MovieLens Recommender

• Download the MovieLens dataset (100k)
  – http://files.grouplens.org/datasets/movielens/ml-100k.zip

• Similarity calculations with a bigger dataset
• Next: now we can run the recommendation framework against a state-of-the-art dataset


MovieLens Dataset

u.data
• The full u data set: 100,000 ratings by 943 users on 1,682 items.
• Each user has rated at least 20 movies.
• Users and items are numbered consecutively from 1.
• The data is randomly ordered.
• This is a tab-separated list of: user id | item id | rating | timestamp


Data Processing

• Its format is not Mahout-compliant: keep the first three columns (user id, item id, rating) and convert the tabs to commas

cat u.data | cut -f1,2,3 | tr "\t" "," > u.data1
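For readers without a Unix shell, the same conversion can be sketched in plain Java. The class `TabToCsv` and its method `convertLine` are hypothetical names; the input and output paths are taken from the command line, and the sample line in the comment is only an example of the tab-separated layout described for u.data.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class TabToCsv {

    // Keeps the first three tab-separated fields (user id, item id, rating)
    // and joins them with commas, dropping the timestamp.
    static String convertLine(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 tab-separated fields: " + line);
        }
        return fields[0] + "," + fields[1] + "," + fields[2];
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            // No files given: show the conversion on a sample line
            System.out.println(convertLine("196\t242\t3\t881250949")); // prints 196,242,3
            return;
        }
        // Usage: java TabToCsv u.data u.data1
        List<String> converted = new ArrayList<String>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            if (!line.trim().isEmpty()) {
                converted.add(convertLine(line));
            }
        }
        Files.write(Paths.get(args[1]), converted);
    }
}
```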


package dnslab.recommender;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommend {

    public static void main(String[] args) {
        try {
            // Load the converted MovieLens ratings (user,item,rating)
            DataModel dm = new FileDataModel(new File("data/u.data1"));
            // Item-item similarity based on the Tanimoto coefficient
            ItemSimilarity sim = new TanimotoCoefficientSimilarity(dm);
            GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(dm, sim);

            // Print the 5 most similar items for the first 10 items only
            int processed = 0;
            for (LongPrimitiveIterator items = dm.getItemIDs(); items.hasNext() && processed < 10; processed++) {
                long itemId = items.nextLong();
                List<RecommendedItem> recommendations = recommender.mostSimilarItems(itemId, 5);
                for (RecommendedItem recommendation : recommendations) {
                    // Output columns: item, similar item, similarity value
                    System.out.println(itemId + "," + recommendation.getItemID() + "," + recommendation.getValue());
                }
            }
        } catch (IOException e) {
            System.out.println("There was an error.");
            e.printStackTrace();
        } catch (TasteException e) {
            System.out.println("There was a Taste Exception");
            e.printStackTrace();
        }
    }
}


Output

Each output line lists: item, similar item, preference similarity


Thank you