Impersonal Recommendation system on top of Hadoop

Kostiantyn Kudriavtsev April 2014 Creating Impersonal Recommendation system in the BigData era

Page 1: Impersonal Recommendation system on top of Hadoop

Kostiantyn Kudriavtsev April 2014 Creating Impersonal Recommendation system in the BigData era

Page 2: Impersonal Recommendation system on top of Hadoop

Agenda

1. Recommendation system overview

2. Different approaches to build recommendation system

3. Impersonal recommendation system in theory

4. Impersonal recommendation system in practice

Page 3: Impersonal Recommendation system on top of Hadoop

Recommendation system

The goal of a recommendation system is to predict the degree to which a user will like or dislike a set of items, such as goods or services.

Recommendation systems have become extremely common in recent years and are applied in a variety of fields. The most popular ones are goods, movies, music, news, books, research articles, search queries, social tags, restaurants, financial services, life insurance and people (social networks and online dating).

Page 4: Impersonal Recommendation system on top of Hadoop

Examples of using recommenders

Amazon uses a recommendation system that suggests goods based on the user’s previous experience and on frequently bought goods; it is credited with increasing sales by 35%

Netflix suggests movies based on the behaviour of similar users and the user’s previous ratings (result: 2 of 3 movies are watched after a recommendation)

Page 5: Impersonal Recommendation system on top of Hadoop

Examples of using recommenders

Pandora Radio suggests music based on the user’s previous experience

In 2012, Target predicted a woman’s pregnancy before a medical test, based on changes in her shopping behaviour

Page 6: Impersonal Recommendation system on top of Hadoop

Possible approaches

There are several totally different approaches to building a recommendation system:

❖ Collaborative filtering – based on user interactions (likes, views, purchases); extremely popular in online services, shops, etc.

❖ Knowledge-based – recommendations are inferred from knowledge about items and user needs; commonly used for impersonal recommendations

❖ Content-based – suggestions come from similarity between items; commonly used to suggest text articles and songs

❖ Hybrid – combines the other approaches

Page 7: Impersonal Recommendation system on top of Hadoop

Collaborative filtering

❖ Also known as social-filtering systems, these aggregate data about customers’ preferences or purchasing habits, then give recommendations based on similarity between users or similarity in overall behaviour patterns.

❖ For example, Netflix uses a tuned collaborative filtering algorithm to suggest movies. If user U1 likes movie M1, and user U2 likes movies M1 and M2, then movie M2 will be recommended to user U1
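
The U1/U2 example can be sketched in a few lines of Python. This is only a toy user-based variant, not Netflix's algorithm: the `likes` layout, the `jaccard` similarity and the `recommend` helper are assumptions chosen for illustration.

```python
from collections import Counter

# Toy "likes" data from the slide's example; the dict layout is an assumption.
likes = {
    "U1": {"M1"},
    "U2": {"M1", "M2"},
}

def jaccard(a, b):
    """Similarity of two users: shared liked items / all liked items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, likes):
    """Score items the user has not seen by the similarity of users who liked them."""
    scores = Counter()
    for other, items in likes.items():
        if other == user:
            continue
        sim = jaccard(likes[user], items)
        if sim == 0:
            continue  # users with nothing in common contribute nothing
        for item in items - likes[user]:
            scores[item] += sim
    return [item for item, _ in scores.most_common()]

print(recommend("U1", likes))  # prints ['M2']
```

U2 gets an empty list here, because U1 has no item that U2 has not already liked.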

Page 8: Impersonal Recommendation system on top of Hadoop

Collaborative filtering

❖ The users’ behaviour history (views, clicks, purchases) is required to implement a collaborative filtering recommender. The idea is to find users with similar preferences and give them recommendations based on those similar users’ preferences.

❖ In practice, this approach requires access to users’ profiles and the capability (both technical and legal) to save each action. After that, analysis may be run to get a list of preferences for each user.

❖ There is a cold start problem: it is not possible to give recommendations to a new user, because no similar users are known yet

Page 9: Impersonal Recommendation system on top of Hadoop

Knowledge based recommenders

❖ Suggest products/services based on inferences about a user’s preferences and needs. There are several different types of these systems: some of them use prebuilt/already known rules, the others build these rules dynamically.

❖ Unlike collaborative filtering, this approach doesn’t require user profiles. Recommendations may be given based on predefined or dynamically created rules.

❖ This approach may be used not only for online applications, but also for offline use cases such as retail

Page 10: Impersonal Recommendation system on top of Hadoop

Knowledge based recommenders

For example, there is a recommender built by Yhat that suggests a new sort of beer to try based on knowledge about beer (e.g. a user who likes a light lager with known aroma, palate, etc. will like a similar beer XXX).

http://jeweell.com/ct/food/1133467-beer.html

Page 11: Impersonal Recommendation system on top of Hadoop

Content-based recommenders

❖ Content-based recommenders are based on machine learning research (particularly clustering and classification). They are commonly used by news aggregators to suggest new stories the user might like to read and to cluster them into different groups.

❖ For example, Google News recommendations for the article:

Page 12: Impersonal Recommendation system on top of Hadoop

Hybrid approach

❖ Combines the previously described methods to achieve the best performance.

❖ There is a well-known cold start problem: the algorithm has no data from which to give recommendations for a new user/product. It can be solved by using different approaches for new and for well-known users or products. For example, goods might be recommended by collaborative filtering, while a knowledge-based recommender is used for new users/products (for which we don’t have history yet)

Page 13: Impersonal Recommendation system on top of Hadoop

Which approach to choose?

❖ A thorough analysis is required to choose the correct approach for each use case.

❖ Several approaches may be used to solve the same problem, and the correct one is not easy to choose: many factors influence the performance of recommendations, and different goals may result in different approaches.

Page 14: Impersonal Recommendation system on top of Hadoop

Which approach to choose?

Let’s imagine a user living in Lviv with some café preferences. He is making a short weekend trip to London. What could be recommended to him in London?

All previously mentioned approaches are applicable to answer this question:

•Collaborative filtering

•Knowledge-based

•Content-based

Page 15: Impersonal Recommendation system on top of Hadoop

Impersonal recommender

The idea of an impersonal recommender is to give recommendations not for a particular user, but in general.

For instance, it may be used in retail to find complementary goods. Some cases are really not obvious: Wal-Mart discovered that diapers are sold together with expensive beer on Friday evenings. Placing them together led to geometric growth in sales.

Page 16: Impersonal Recommendation system on top of Hadoop

Impersonal recommender

Applicable in different areas:

• retail, by employees to increase revenue/sales

• e-commerce, as a low-budget approach to making recommendations on a web-site

• interactive navigator-kiosks

http://smartcity.prom.ua/g2763766-interaktivnyj-sensornyj-kiosk

Page 17: Impersonal Recommendation system on top of Hadoop

Data Science way of getting things done

http://www.tomatosphere.org/teacher-resources/teachers-guide/principal-investigation/scientific-method.cfm

Page 18: Impersonal Recommendation system on top of Hadoop

The problem

There is a history of customers’ actions:

{a1, a2, a3}

{a2, a3, a5, a6}

{a4, a2}

{a1, a5, a6, a3}

{a3, a5, a2}

What should we suggest to a customer who has already committed {a2, a5} (let’s assume that order doesn’t matter)?

Page 19: Impersonal Recommendation system on top of Hadoop

Naive approach: frequent item-sets

Affinity analysis is used to build frequent item-sets, which are widely used in Market Basket Analysis.

Several algorithms were created to perform affinity analysis (Apriori, FP-Growth)

Unfortunately, this is not enough on its own. Frequent item-sets don’t filter out already-purchased goods and “cannibals” (goods that compete with each other for the same purchase).
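
To make the limitation concrete, here is a minimal sketch that counts item-sets over the five action sets from the problem slide. It uses brute-force enumeration rather than Apriori or FP-Growth, and the `frequent_itemsets` helper and its `min_count` / `max_size` parameters are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# The five action sets from the problem slide.
transactions = [
    {"a1", "a2", "a3"},
    {"a2", "a3", "a5", "a6"},
    {"a4", "a2"},
    {"a1", "a5", "a6", "a3"},
    {"a3", "a5", "a2"},
]

def frequent_itemsets(transactions, min_count=2, max_size=3):
    """Count every item-set up to max_size; keep those seen at least min_count times."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_count}

freq = frequent_itemsets(transactions)
for itemset, count in sorted(freq.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), count)
```

The result contains {a2, a5} itself, which the customer has already committed, and it says nothing about direction: an item-set such as {a2, a3, a5} does not by itself tell us which item to suggest given the others.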

Page 20: Impersonal Recommendation system on top of Hadoop

Next step: association rules

Association rules are actively used in Market Basket Analysis and may be effectively used for creating recommendations.

A general association rule looks like:

A => B,

meaning that a purchase of A usually leads to a purchase of B (the rule is user-independent).

A rule has several statistical characteristics (support, confidence, lift) that show the strength of the rule and may be used for high-quality recommendations
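
The slides do not spell these characteristics out; the usual textbook definitions are given here as a reference, where T (the set of all recorded action sets) is notation introduced for this sketch:

```latex
\begin{align*}
\operatorname{supp}(X) &= \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|} \\
\operatorname{conf}(A \Rightarrow B) &= \frac{\operatorname{supp}(A \cup B)}{\operatorname{supp}(A)} \\
\operatorname{lift}(A \Rightarrow B) &= \frac{\operatorname{conf}(A \Rightarrow B)}{\operatorname{supp}(B)}
  = \frac{\operatorname{supp}(A \cup B)}{\operatorname{supp}(A)\,\operatorname{supp}(B)}
\end{align*}
```

For the five action sets from the problem slide and the rule {a2, a5} => {a3}: supp({a2, a3, a5}) = 2/5, supp({a2, a5}) = 2/5 and supp({a3}) = 4/5, so the confidence is 1 and the lift is 1 / (4/5) = 1.25.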

Page 21: Impersonal Recommendation system on top of Hadoop

Rules for recommendation

It’s not enough to build rules; they must also be interpreted correctly afterwards. The most important properties of each rule are support (shows how important/frequent the rule is), confidence (how much you can rely on the rule) and lift.

Lift is derived from Bayes’ theorem and shows positive/negative correlation between the head and the tail of a rule:

head => tail

Thresholds for all these parameters must be chosen for each particular case. In general,

• lift < 1 means negative correlation (the rule works, but has a negative effect)

• lift ~ 1 means no correlation (the rule doesn’t work)

• lift > 1 means positive correlation
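
Applied to the problem slide, these measures can be computed directly. The sketch below treats the customer's set {a2, a5} as the head, scores every single-item tail, and keeps rules with lift > 1. The helper names and the exact-fraction arithmetic are choices made for this example, not something prescribed by the slides.

```python
from fractions import Fraction

# The five action sets from the problem slide.
transactions = [
    {"a1", "a2", "a3"},
    {"a2", "a3", "a5", "a6"},
    {"a4", "a2"},
    {"a1", "a5", "a6", "a3"},
    {"a3", "a5", "a2"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return Fraction(sum(itemset <= t for t in transactions), len(transactions))

def rule_stats(head, tail, transactions):
    """Return (support, confidence, lift) of the rule head => tail."""
    supp_both = support(head | tail, transactions)
    supp_head = support(head, transactions)
    supp_tail = support(tail, transactions)
    confidence = supp_both / supp_head if supp_head else Fraction(0)
    lift = confidence / supp_tail if supp_tail else Fraction(0)
    return supp_both, confidence, lift

def recommend(basket, transactions, min_lift=1):
    """Rank items outside the basket by lift, then confidence."""
    candidates = set().union(*transactions) - basket
    scored = []
    for item in sorted(candidates):
        supp, conf, lift = rule_stats(basket, {item}, transactions)
        if supp > 0 and lift > min_lift:
            scored.append((item, supp, conf, lift))
    return sorted(scored, key=lambda r: (-r[3], -r[2], r[0]))

for item, supp, conf, lift in recommend({"a2", "a5"}, transactions):
    print(item, float(supp), float(conf), float(lift))
```

On this tiny data set a3 and a6 both reach a lift of 1.25, but a3 ranks first because its confidence is 1.0 against 0.5 for a6; a1 and a4 never occur together with {a2, a5} and are dropped.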

Page 22: Impersonal Recommendation system on top of Hadoop

Online recommender evaluation

Of course, a recommender does not end with generating rules.

The remaining task is to evaluate the quality of the generated rules. This makes it possible not only to compare different models (using A/B testing), but also to use the clickstream to improve the rules.

http://www.sitedoublers.com/blog/multivariate-test-victorias-secret
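
One common way to compare two rule sets in an A/B test is a two-proportion z-test on click-through rates. The slides do not name a specific test, so this is a minimal sketch under that assumption, and the counts in the usage line are made-up illustrative numbers.

```python
import math

def two_proportion_z(clicks_a, shown_a, clicks_b, shown_b):
    """Return (z, two-sided p-value) for the difference in click-through rate B - A."""
    p_a = clicks_a / shown_a
    p_b = clicks_b / shown_b
    pooled = (clicks_a + clicks_b) / (shown_a + shown_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all: nothing to distinguish
    z = (p_b - p_a) / se
    # Two-sided p-value of the standard normal, expressed via the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: model A got 120 clicks in 2000 impressions, model B got 165.
z, p = two_proportion_z(120, 2000, 165, 2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests that the difference between the two models is unlikely to be noise; the threshold to act on is a business decision.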

Page 23: Impersonal Recommendation system on top of Hadoop

Online recommender improvement

❖ Users’ reactions to recommendations may be used to improve the quality of the recommender. For example, it’s possible to save successful/ignored recommendations and use this information to improve newly generated recommendations.

❖ This is quite important, because user preferences are not stable and change over time.
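
One possible way to fold such feedback back into ranking is to keep decayed counters per rule and weight the rule's lift by a smoothed acceptance rate. Everything in this sketch (the `RuleFeedback` class, the decay factor and the prior) is a hypothetical design, not the scheme used in the talk.

```python
from dataclasses import dataclass

@dataclass
class RuleFeedback:
    lift: float
    shown: float = 0.0      # decayed number of times the rule was shown
    accepted: float = 0.0   # decayed number of times it was followed

    def record(self, was_accepted, decay=0.99):
        # Decay old evidence so that recent behaviour weighs more.
        self.shown = self.shown * decay + 1.0
        self.accepted = self.accepted * decay + (1.0 if was_accepted else 0.0)

    def score(self, prior_rate=0.1, prior_weight=10.0):
        # Smoothed acceptance rate: stays near prior_rate while data is scarce.
        rate = (self.accepted + prior_rate * prior_weight) / (self.shown + prior_weight)
        return self.lift * rate

popular = RuleFeedback(lift=1.25)
ignored = RuleFeedback(lift=1.25)
for _ in range(20):
    popular.record(True)
    ignored.record(False)
print(popular.score() > ignored.score())  # prints True
```

The decay is what addresses changing preferences: a rule that stops being followed gradually loses its score even if it did well in the past.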

Page 24: Impersonal Recommendation system on top of Hadoop

Technology stack

❖ There are a lot of already implemented solutions for building association rules, made by Oracle, SAS, Microsoft, etc.

❖ However, in the new world of unstructured/semi-structured data and growing data volumes, this is not enough. The quality of a recommender depends on the amount of data used to train it: the more data is available for analysis, the better.

❖ The trend in EDW (enterprise data warehouse) is to adopt Hadoop as the main storage and processing system

❖ Here comes a Hadoop-centric solution…

Page 25: Impersonal Recommendation system on top of Hadoop

Overall architecture

Page 26: Impersonal Recommendation system on top of Hadoop

Apache Hadoop

Hadoop is designed to store and process petabytes of data, which makes it a strong foundation for building a recommendation system.

Hadoop provides a wide range of tools for efficient data processing, as well as a specialised library for data science needs (Mahout), e.g. for building a recommender

Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.

Page 27: Impersonal Recommendation system on top of Hadoop

ElasticSearch

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. It provides scalable search, has near real-time search, and supports multitenancy.

Elasticsearch is used by GitHub, Foursquare, Etsy, SoundCloud, Xing and Wikimedia and can handle several TB of data.

Elasticsearch will be used for storing rules and serving requests
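
The slides do not show how rules are stored, so the following is only a guess at one workable layout: each rule becomes a JSON document, and a request looks up rules whose head is fully contained in the customer's current set. The field names (`head`, `head_size`, `tail`, `lift`, ...) and both helpers are hypothetical. The `terms_set` query used here exists only in Elasticsearch releases newer than this talk and expects `head` to be a keyword field. The code builds the request bodies only and does not contact a cluster.

```python
import json

def rule_document(head, tail, support, confidence, lift):
    """Hypothetical JSON document for one association rule head => tail."""
    return {
        "head": sorted(head),
        "head_size": len(head),  # consumed by the terms_set query below
        "tail": sorted(tail),
        "support": support,
        "confidence": confidence,
        "lift": lift,
    }

def rules_query(basket, size=5):
    """Search body: rules whose whole head is in the basket and whose tail is new."""
    items = sorted(basket)
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{
                    "terms_set": {
                        "head": {
                            "terms": items,
                            # every head item must be among the basket items
                            "minimum_should_match_field": "head_size",
                        }
                    }
                }],
                # skip rules that would suggest something already in the basket
                "must_not": [{"terms": {"tail": items}}],
            }
        },
        "sort": [{"lift": {"order": "desc"}}],
    }

doc = rule_document({"a2", "a5"}, {"a3"}, support=0.4, confidence=1.0, lift=1.25)
print(json.dumps(doc))
print(json.dumps(rules_query({"a2", "a5"})))
```

Keeping the precomputed statistics in each document means a request needs only one sorted query, with no computation at serving time.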

Page 28: Impersonal Recommendation system on top of Hadoop

Architecture

Page 29: Impersonal Recommendation system on top of Hadoop

Questions?

Thank you for your attention