Building a Recommendation Engine Using Diverse Features by Divyanshu Vats (Spark Summit)
Joint work with:
- Max Sperlich (Sailthru)
- David Glueck (Sailthru)
- Alex Gaudio (Alluvium)
- Jeremy Stanley (Instacart)
users × items matrix: Buy?

Matrix Factorization:
model = ALS.trainImplicit(data)
model.predict(user_id, item_id)
Other Methods:
- Item-Item Similarity
- Content Based
- Nearest neighbor based methods
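The matrix-factorization call above learns low-rank user and item vectors whose dot products approximate the interaction matrix. As a rough pure-Python illustration of that idea (a plain SGD sketch on a made-up toy matrix, not Spark's actual ALS algorithm):

```python
import random

random.seed(0)

# Toy implicit-feedback matrix: 1 = user interacted with item, 0 = no signal.
R = [[1, 1, 0, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 1]]
n_users, n_items, rank = len(R), len(R[0]), 2

# Random low-rank factors: one vector per user and per item.
U = [[random.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
V = [[random.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]

def predict(u, i):
    # Score = dot product of user and item factor vectors.
    return sum(U[u][k] * V[i][k] for k in range(rank))

# SGD on squared error over all cells (ALS instead alternates
# closed-form least-squares solves for U and V).
lr, reg = 0.1, 0.01
for _ in range(500):
    for u in range(n_users):
        for i in range(n_items):
            err = R[u][i] - predict(u, i)
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)

# Predicted scores now approximate the observed interactions.
final_err = sum((R[u][i] - predict(u, i)) ** 2
                for u in range(n_users) for i in range(n_items))
print(round(final_err, 3))
```

Recommending for a user then amounts to ranking unseen items by `predict(u, i)`, which mirrors what `model.predict(user_id, item_id)` returns.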
Open source, scalable implementations
Easily incorporate additional features?
Additional Features?
- item features: title, image, description
- user features: location, browser, # clicks
- user-item features: purchases, item views
Goal: Framework to easily integrate user, item, and user-item interactions for meaningful recommendations
[Diagram: features mapped to a prediction / response]
Can use a combination of collaborative filtering algorithms!
[Diagram: user features, item features, user-item interactions, and user-item features]
So far… users × items matrix: Buy?
- Easily integrate user, item, and user-item features
- Recommendation → Binary classification
- Next: Implementation and use of Spark!
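Casting recommendation as binary classification means concatenating user, item, and user-item features into one vector per (user, item) pair and predicting Buy? as 0/1. A minimal sketch of that framing with a hand-rolled logistic regression (all feature names, values, and labels below are illustrative, not from the talk):

```python
import math

# Illustrative feature tables (all names/values made up).
user_feats = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}       # e.g. location, browser
item_feats = {"i1": [1.0], "i2": [0.0]}                  # e.g. has_image
pair_feats = {("u1", "i1"): [3.0], ("u1", "i2"): [0.0],  # e.g. past item views
              ("u2", "i1"): [0.0], ("u2", "i2"): [2.0]}
labels = {("u1", "i1"): 1, ("u1", "i2"): 0,              # Buy? as 0/1
          ("u2", "i1"): 0, ("u2", "i2"): 1}

def features(u, i):
    # Concatenate user, item, and user-item features into one vector.
    return user_feats[u] + item_feats[i] + pair_feats[(u, i)]

dim = len(features("u1", "i1"))
w = [0.0] * dim

def p_buy(u, i):
    z = sum(wk * xk for wk, xk in zip(w, features(u, i)))
    return 1.0 / (1.0 + math.exp(-z))

# Plain gradient descent on the logistic loss.
for _ in range(2000):
    for (u, i), y in labels.items():
        err = y - p_buy(u, i)
        for k, xk in enumerate(features(u, i)):
            w[k] += 0.1 * err * xk
```

Once recommendation is framed this way, any scalable binary classifier can be swapped in, which is what makes it easy to mix in diverse features.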
Pipeline:
- Structured tables (e.g. url views, purchases)
- Multiple filtering tables
- Extract features: user, item, user-item features
- Item-item similarity
- Model building, validation, reporting
- Recommendations
Know Thy Partitions!

data.select('user_id', 'url')
    .groupBy('user_id')
    .count()
    .filter('count > 10')

Inefficient! Instead, partition by user_id and do the work inside each partition:

data.select('user_id', 'url')
    .mapPartitions(filter_url)
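The point of the slide: once the data is partitioned by user_id, the per-user count-and-filter can run entirely inside each partition via mapPartitions, avoiding a shuffle. A rough pure-Python analogue of that per-partition logic (the filter_url function, the threshold, and the toy data are assumptions for illustration):

```python
from collections import Counter

def filter_url(partition):
    # Runs once per partition: count url views per user locally and
    # keep users with more than 2 views. Correct only because all of
    # a user's rows live in the same partition.
    counts = Counter(user for user, url in partition)
    return [user for user, n in counts.items() if n > 2]

# Data pre-partitioned by user_id (each user's rows in one chunk).
partitions = [
    [("u1", "/a"), ("u1", "/b"), ("u1", "/c")],
    [("u2", "/a"), ("u3", "/a"), ("u3", "/b")],
]

# mapPartitions analogue: apply filter_url to each chunk, then flatten.
active_users = [u for part in partitions for u in filter_url(part)]
print(active_users)  # ['u1']
```

With groupBy, the same counts would require shuffling every row across the cluster first; with a user_id partitioning already in place, each partition answers for its own users.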
Break up application into small chunks!

users = data.select('user_id', 'url')
            .mapPartitions(filter_url)
            .collect()

BigData.filter(lambda x: x.user_id in users)
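The pattern above computes a small user list, collect()s it to the driver, and uses it to filter the much larger dataset. A pure-Python analogue (the data is made up); converting the collected list to a set keeps the membership test fast:

```python
# Small result computed first and "collected" locally.
users = ["u1", "u7"]
user_set = set(users)  # set membership is O(1) vs O(n) for a list

# The big dataset is then filtered against the small collected result.
big_data = [("u1", "itemA"), ("u2", "itemB"), ("u7", "itemC")]
filtered = [row for row in big_data if row[0] in user_set]
print(filtered)  # [('u1', 'itemA'), ('u7', 'itemC')]
```

In Spark the same effect is usually achieved by broadcasting the small collected set to the executors rather than shipping it inside the closure row by row.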
Mesos + Scheduler + Docker + Spark

[Dependency graph of applications A, B, C, D, E, F]

- Carefully define applications and state a dependency graph
- Manage the graph using: github.com/sailthru/stolos
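Running applications A–F in dependency order amounts to a topological sort of the stated graph; stolos manages this (plus retries and scheduling) in production. As a toy sketch of just the ordering step (the edges below are illustrative, not the talk's actual graph):

```python
from graphlib import TopologicalSorter

# Illustrative dependency graph: each app maps to the apps it depends on.
deps = {
    "B": {"A"},              # B runs after A
    "C": {"B"}, "D": {"B"},  # C-F all depend on B
    "E": {"B"}, "F": {"B"},
}

# static_order() yields apps in an order that respects every dependency.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Independent apps (here C, D, E, F) can run in parallel once their shared dependency has finished, which is what the Mesos + scheduler setup exploits.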
Mesos + Scheduler + Docker + Spark
A
B
C D E F
- Carefully define applications and state a dependency graph
- Manage graph using:github.com/sailthru/stolos
Item-Item Similarities

users × items matrix: P(Buy) = 0.002

RowMatrix.columnSimilarities → item-item similarities

Filter and score items for each user
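The similarity step here is cosine similarity between item columns of the user × item matrix, which is what MLlib's RowMatrix similarity computation produces at scale. A hand-rolled version of the same calculation on a made-up toy matrix:

```python
import math

# Rows = users, columns = items (toy implicit-feedback matrix).
R = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

def column(j):
    return [row[j] for row in R]

def cosine(a, b):
    # Cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

n_items = len(R[0])
sims = {(i, j): cosine(column(i), column(j))
        for i in range(n_items) for j in range(i + 1, n_items)}
print(sims)
```

For each user, candidate items can then be filtered and scored by their similarity to items the user already bought.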
To Summarize…

users × items matrix: Buy?

- Recommendations using diverse features
- Carefully designed Spark apps connected using a dependency graph
- Future work: more Spark apps + Spark-Mesos cluster manager

THANK YOU.
email: [email protected]