Mahout part1

Post on 15-Jan-2015

Part one of a presentation about the Apache Mahout system. It is based on http://my.safaribooksonline.com/9781935182689/

Transcript of Mahout part1

Mahout in Action, Part 1

Yasmine M. Gaber

28 February 2013

Agenda

Meet Apache Mahout

Part 1: Recommendation

Part 2: Clustering

Part 3: Classification

Meet Apache Mahout

It is an open source machine learning library from Apache

It is scalable

It is a Java library

It can be used with Hadoop to deal with large-scale data

Famous Engines

Recommender engines:
Amazon.com
Netflix
Dating sites like Líbímseti
Social networking sites like Facebook

Clustering engines:
Google News
Search engines like Clusty

Classification engines:
Spam emails
Google’s Picasa
Optical character recognition software
Apple’s Genius feature in iTunes

Recommendations

Recommender Input

A preference consists of a user ID, an item ID, and a value expressing the user’s preference for the item

The input is a .csv file with one preference per line
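As a rough illustration of that input format, the sketch below parses lines of the form userID,itemID,value into an in-memory map. It is plain Java, not Mahout code: the class name PreferenceCsvSketch and the nested-map layout are assumptions made for this example only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): parses "userID,itemID,value" lines. */
final class PreferenceCsvSketch {

    /** Returns userID -> (itemID -> preference value). */
    static Map<Long, Map<Long, Float>> parse(List<String> lines) {
        Map<Long, Map<Long, Float>> prefs = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split(",");
            long userId = Long.parseLong(fields[0].trim());
            long itemId = Long.parseLong(fields[1].trim());
            float value = Float.parseFloat(fields[2].trim());
            // Create the per-user map on first sight of the user, then store the preference.
            prefs.computeIfAbsent(userId, k -> new HashMap<>()).put(itemId, value);
        }
        return prefs;
    }
}
```

For example, parsing the two lines 1,101,5.0 and 2,101,2.5 yields one entry per user, each holding a single preference for item 101.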

Create Recommender

Recommender Evaluation

Average absolute difference vs. root-mean-square
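The two scores can be compared with a small plain-Java sketch. The class and method names below are hypothetical and this is not Mahout's evaluator; it only shows how the same estimation errors produce the two numbers.

```java
/** Hypothetical helper (not part of Mahout): two ways to score estimated preferences. */
final class EvaluationScoreSketch {

    /** Mean of |actual - estimated|; every error counts in proportion to its size. */
    static double averageAbsoluteDifference(double[] actual, double[] estimated) {
        double sum = 0.0;
        for (int k = 0; k < actual.length; k++) {
            sum += Math.abs(actual[k] - estimated[k]);
        }
        return sum / actual.length;
    }

    /** Square root of the mean squared error; large errors are penalized more heavily. */
    static double rootMeanSquare(double[] actual, double[] estimated) {
        double sumOfSquares = 0.0;
        for (int k = 0; k < actual.length; k++) {
            double diff = actual[k] - estimated[k];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }
}
```

With actual values 3.0, 5.0, 4.0 and estimates 3.5, 2.0, 5.0, the average absolute difference is 1.5 while the root-mean-square score is about 1.85, because the single large miss (3.0) dominates the squared sum. In both cases lower is better.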

Mahout RecommenderEvaluator

Precision and Recall
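For a list of top recommendations, precision is the fraction of recommended items that are relevant, and recall is the fraction of relevant items that were recommended. A minimal plain-Java sketch, assuming the recommended list has no duplicates and that the set of "relevant" items is already known (the class name is hypothetical and this is not Mahout's evaluator):

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Mahout): precision and recall for a top-N list. */
final class PrecisionRecallSketch {

    /** Fraction of the recommended items that are relevant. */
    static double precision(List<Long> recommended, Set<Long> relevant) {
        if (recommended.isEmpty()) {
            return 0.0;
        }
        return (double) hits(recommended, relevant) / recommended.size();
    }

    /** Fraction of the relevant items that appear among the recommendations. */
    static double recall(List<Long> recommended, Set<Long> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) hits(recommended, relevant) / relevant.size();
    }

    /** Counts recommended items that are also relevant. */
    private static int hits(List<Long> recommended, Set<Long> relevant) {
        int count = 0;
        for (Long item : recommended) {
            if (relevant.contains(item)) {
                count++;
            }
        }
        return count;
    }
}
```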

RecommenderIRStatsEvaluator

Representing Recommender Data

Preference object: new GenericPreference(123, 456, 3.0f) (user ID 123, item ID 456, preference value 3.0)

Preference Array

FastByIDMap and FastIDSet

In-memory DataModels

GenericDataModel

File-based data

Refreshable components

Database-based data

Coping without preference values

User-based Recommender

The algorithm

for every item i that u has no preference for yet
  for every other user v that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
return the top items, ranked by weighted average
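The loop above can be sketched in plain Java as follows. This is not Mahout's GenericUserBasedRecommender: the class name UserBasedSketch, the nested-map data layout, the pluggable similarity function, and the choice to skip undefined or non-positive similarities are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical plain-Java sketch of the user-based loop (not Mahout code). */
final class UserBasedSketch {

    /** prefs maps userID -> (itemID -> preference value); returns the top item IDs for u. */
    static List<Long> recommend(
            long u,
            Map<Long, Map<Long, Double>> prefs,
            ToDoubleBiFunction<Map<Long, Double>, Map<Long, Double>> similarity,
            int howMany) {
        Map<Long, Double> mine = prefs.get(u);

        // Candidate items: anything some user has rated that u has not.
        Set<Long> candidates = new HashSet<>();
        for (Map<Long, Double> other : prefs.values()) {
            candidates.addAll(other.keySet());
        }
        candidates.removeAll(mine.keySet());

        Map<Long, Double> estimates = new HashMap<>();
        for (Long i : candidates) {
            double weightedSum = 0.0;
            double weightTotal = 0.0;
            for (Map.Entry<Long, Map<Long, Double>> e : prefs.entrySet()) {
                Double pref = e.getValue().get(i);
                if (e.getKey().longValue() == u || pref == null) {
                    continue; // only other users v that have a preference for i
                }
                double s = similarity.applyAsDouble(mine, e.getValue());
                if (Double.isNaN(s) || s <= 0.0) {
                    continue; // assumption: ignore undefined or non-positive similarity
                }
                weightedSum += s * pref;
                weightTotal += s;
            }
            if (weightTotal > 0.0) {
                estimates.put(i, weightedSum / weightTotal); // similarity-weighted average
            }
        }

        // Rank candidates by estimated preference, highest first.
        List<Long> ranked = new ArrayList<>(estimates.keySet());
        ranked.sort((x, y) -> Double.compare(estimates.get(y), estimates.get(x)));
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }
}
```

Note that this sketch compares u against every other user, exactly as the pseudocode does; restricting the inner loop to a neighborhood of similar users is what keeps the real computation tractable.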

Recommender Components

Data model, implemented via DataModel

User-user similarity metric, implemented via UserSimilarity

User neighborhood definition, implemented via UserNeighborhood

Recommender engine, implemented via a Recommender (here, GenericUserBasedRecommender)

GenericUserBasedRecommender

User Neighborhoods

Fixed-size neighborhoods

Threshold-based neighborhood
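The difference between the two neighborhood definitions can be shown with a short plain-Java sketch. It assumes the similarities between the target user and every other user have already been computed and stored in a map; the class and method names are hypothetical, not Mahout's UserNeighborhood implementations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch (not Mahout code): two ways to pick a user's neighborhood. */
final class NeighborhoodSketch {

    /** Fixed-size neighborhood: the n users with the highest similarity. */
    static List<Long> nearestN(Map<Long, Double> similarityByUser, int n) {
        List<Long> users = new ArrayList<>(similarityByUser.keySet());
        users.sort((x, y) -> Double.compare(similarityByUser.get(y), similarityByUser.get(x)));
        return new ArrayList<>(users.subList(0, Math.min(n, users.size())));
    }

    /** Threshold-based neighborhood: every user whose similarity is at least the threshold. */
    static Set<Long> atLeast(Map<Long, Double> similarityByUser, double threshold) {
        Set<Long> result = new TreeSet<>();
        for (Map.Entry<Long, Double> e : similarityByUser.entrySet()) {
            if (e.getValue() >= threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

A fixed-size neighborhood always returns up to n users even if some are only weakly similar, while a threshold-based one may return many users or none, depending on how the similarities are distributed.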

Similarity metrics

Pearson correlation–based similarity

It is a number between –1 and 1 that measures the tendency of two series of numbers, paired up one-to-one, to move together

Problems:

It doesn’t take into account the number of items on which two users’ preferences overlap, which is probably a weakness in the context of recommender engines

If two users overlap on only one item, no correlation can be computed because of how the computation is defined
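Both problems are visible in a direct computation. The plain-Java sketch below computes the Pearson correlation over the items two users have both rated; the class name and the use of NaN to signal "undefined" are assumptions for this example, not Mahout's PearsonCorrelationSimilarity.

```java
import java.util.Map;

/** Hypothetical sketch (not Mahout code): Pearson correlation over co-rated items. */
final class PearsonSketch {

    /** Each map is itemID -> preference value; returns NaN when the correlation is undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0.0, sumB = 0.0, sumAA = 0.0, sumBB = 0.0, sumAB = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // only items that both users have rated
            }
            double x = e.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN; // a single shared item leaves nothing to correlate
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        if (varianceA <= 0.0 || varianceB <= 0.0) {
            return Double.NaN; // one series is constant, so the ratio is undefined
        }
        return covariance / Math.sqrt(varianceA * varianceB);
    }
}
```

Two users who agree on only two shared items can score a perfect 1.0 here, the same as two users who agree on fifty, which is the overlap weakness noted above.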

Euclidean distance similarity: 1 / (1 + Euclidean distance)

Cosine measure similarity: a value between –1 and 1

Tanimoto coefficient similarity: the ratio of the size of the intersection to the size of the union of two users’ preferred items
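The three metrics above can be written out in a few lines of plain Java. The sketch follows the formulas as stated on the slide; computing the distance and the cosine only over items both users have rated, and returning NaN when there is nothing to compare, are assumptions of this example. The class name is hypothetical and the code is not Mahout's implementation.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch (not Mahout code) of three similarity metrics. */
final class OtherSimilaritiesSketch {

    /** 1 / (1 + d), where d is the Euclidean distance over co-rated items. */
    static double euclidean(Map<Long, Double> a, Map<Long, Double> b) {
        int shared = 0;
        double sumOfSquares = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                double diff = e.getValue() - other;
                sumOfSquares += diff * diff;
                shared++;
            }
        }
        return shared == 0 ? Double.NaN : 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    /** Cosine of the angle between the two preference vectors over co-rated items. */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
                normA += e.getValue() * e.getValue();
                normB += other * other;
            }
        }
        if (normA == 0.0 || normB == 0.0) {
            return Double.NaN;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Size of the intersection divided by size of the union; preference values are ignored. */
    static double tanimoto(Set<Long> itemsA, Set<Long> itemsB) {
        Set<Long> intersection = new HashSet<>(itemsA);
        intersection.retainAll(itemsB);
        int unionSize = itemsA.size() + itemsB.size() - intersection.size();
        return unionSize == 0 ? Double.NaN : (double) intersection.size() / unionSize;
    }
}
```

Because the Tanimoto coefficient looks only at which items each user has expressed a preference for, it remains usable when no preference values are available.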

Item-based recommendation

The algorithm

for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
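A plain-Java sketch of the item-based loop is shown below. It needs only the target user's own preferences plus an item-item similarity function. The class name ItemBasedSketch, the function-typed similarity parameter, and the rule of skipping undefined or non-positive similarities are assumptions for illustration; this is not Mahout's GenericItemBasedRecommender.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical plain-Java sketch of the item-based loop (not Mahout code). */
final class ItemBasedSketch {

    /** userPrefs maps itemID -> u's preference; allItems is every known item ID. */
    static List<Long> recommend(
            Map<Long, Double> userPrefs,
            Set<Long> allItems,
            ToDoubleBiFunction<Long, Long> itemSimilarity,
            int howMany) {
        Map<Long, Double> estimates = new HashMap<>();
        for (Long i : allItems) {
            if (userPrefs.containsKey(i)) {
                continue; // only items u has no preference for yet
            }
            double weightedSum = 0.0;
            double weightTotal = 0.0;
            for (Map.Entry<Long, Double> e : userPrefs.entrySet()) {
                double s = itemSimilarity.applyAsDouble(i, e.getKey());
                if (Double.isNaN(s) || s <= 0.0) {
                    continue; // assumption: ignore undefined or non-positive similarity
                }
                weightedSum += s * e.getValue();
                weightTotal += s;
            }
            if (weightTotal > 0.0) {
                estimates.put(i, weightedSum / weightTotal); // similarity-weighted average
            }
        }

        // Rank candidates by estimated preference, highest first.
        List<Long> ranked = new ArrayList<>(estimates.keySet());
        ranked.sort((x, y) -> Double.compare(estimates.get(y), estimates.get(x)));
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }
}
```

Compared with the user-based loop, the inner loop here runs over the items u has rated rather than over other users, so its cost depends on how many preferences u has, not on how many users exist.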

GenericItemBasedRecommender

Slope-one recommender

The algorithm

for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages
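The estimation step of the slope-one loop can be sketched in plain Java as follows. The sketch recomputes each average difference on demand and gives every difference equal weight, which is a simplification chosen for readability; the final ranking step is left out. The class and method names are hypothetical and this is not Mahout's slope-one implementation.

```java
import java.util.Map;

/** Hypothetical plain-Java sketch of slope-one estimation (not Mahout code). */
final class SlopeOneSketch {

    /** Average of (pref for i) - (pref for j) over users who rated both; NaN if there are none. */
    static double averageDiff(Map<Long, Map<Long, Double>> prefs, long i, long j) {
        double sum = 0.0;
        int count = 0;
        for (Map<Long, Double> p : prefs.values()) {
            Double prefI = p.get(i);
            Double prefJ = p.get(j);
            if (prefI != null && prefJ != null) {
                sum += prefI - prefJ;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    /** Estimates u's preference for item i; prefs maps userID -> (itemID -> value). */
    static double estimate(Map<Long, Map<Long, Double>> prefs, long u, long i) {
        double total = 0.0;
        int count = 0;
        for (Map.Entry<Long, Double> e : prefs.get(u).entrySet()) {
            double diff = averageDiff(prefs, i, e.getKey());
            if (Double.isNaN(diff)) {
                continue; // no user has rated both items, so j says nothing about i
            }
            total += e.getValue() + diff; // u's preference for j, shifted by the average diff
            count++;
        }
        return count == 0 ? Double.NaN : total / count;
    }
}
```

In practice the item-to-item average differences would be computed once and stored, since recomputing them for every estimate scans all users each time.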

Taking a Recommender to Production

User-based recommenders

Thank You

Contact at:

Email: Yasmine.Gaber@espace.com.eg

Twitter: Twitter.com/yasmine_mohamed