Recommender Systems with Apache Spark's ALS Function
Building a Recommender System in PySpark
Will Johnson
- Uline
- DePaul
LearnByMarketing.com
AGENDA
- RecSys
  * Basics
  * MF
  * Evaluation
  * Advanced
- PySpark
  * Basics
  * ALS
User Based Collaborative Filtering
[Figure: example user-by-item rating grid with ratings between 1.0 and 5.0; the second build of the slide adds the value 3.8 to the grid]
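A minimal sketch of the user-based idea, assuming cosine similarity over co-rated items and a similarity-weighted average over the k most similar users. The users, items, ratings, and function names below are hypothetical and are not the values from the slide.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}. Not the slide's data.
ratings = {
    "ann": {"m1": 4.5, "m2": 4.0, "m3": 5.0},
    "bob": {"m1": 4.0, "m2": 4.5, "m3": 4.5, "m4": 4.0},
    "cat": {"m1": 1.0, "m2": 2.0, "m3": 1.5, "m4": 2.0},
}

def cosine(a, b):
    """Cosine similarity computed over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(user, item, k=1):
    """Similarity-weighted average of the k nearest users' ratings for item."""
    candidates = [
        (cosine(ratings[user], theirs), theirs[item])
        for other, theirs in ratings.items()
        if other != user and item in theirs
    ]
    # Keep the k most similar users who actually rated the item.
    neighbors = sorted(candidates, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        return None
    return sum(sim * r for sim, r in neighbors) / total_sim

# "ann" has not rated "m4"; her nearest neighbor here is "bob".
estimate = predict("ann", "m4", k=1)
```

With k=1 the estimate is simply the nearest neighbor's rating. Larger k blends several neighbors, and mean-centering each user's ratings before computing similarity is a common refinement that this sketch leaves out.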
Item Based Collaborative Filtering
Matrix Factorization
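As a rough illustration of how alternating least squares factorizes a rating matrix, the sketch below uses a single latent factor (rank 1), so each update reduces to a one-dimensional ridge regression with a closed-form solution. The ratings, the regularization value, and all names are assumptions made for this example; this is not Spark's implementation.

```python
# Hypothetical observed ratings: (user, item) -> rating.
observed = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 2): 2.0,
    (2, 1): 2.0, (2, 2): 1.0,
}
n_users, n_items = 3, 3
reg = 0.1  # L2 regularization strength (assumed value)

def squared_error(u, v):
    """Sum of squared errors over the observed ratings only."""
    return sum((r - u[i] * v[j]) ** 2 for (i, j), r in observed.items())

def als_rank1(iterations=10):
    u = [1.0] * n_users   # one latent factor per user
    v = [1.0] * n_items   # one latent factor per item
    history = [squared_error(u, v)]
    for _ in range(iterations):
        # Hold item factors fixed; solve each user's factor in closed form.
        for i in range(n_users):
            num = sum(r * v[j] for (a, j), r in observed.items() if a == i)
            den = sum(v[j] ** 2 for (a, j), r in observed.items() if a == i)
            u[i] = num / (den + reg)
        # Hold user factors fixed; solve each item's factor in closed form.
        for j in range(n_items):
            num = sum(r * u[i] for (i, b), r in observed.items() if b == j)
            den = sum(u[i] ** 2 for (i, b), r in observed.items() if b == j)
            v[j] = num / (den + reg)
        history.append(squared_error(u, v))
    return u, v, history

u, v, history = als_rank1()
# Estimate for a pair that was never observed: user 0, item 2.
missing_estimate = u[0] * v[2]
```

With a higher rank, each closed-form step becomes a small linear system per user or item instead of a scalar division, but the alternation between the two sides is the same.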
Evaluation
RMSE = \sqrt{\frac{1}{n} \sum (\text{Predicted} - \text{Actual})^2}
Precision = \frac{|hits_u|}{|RecoSet_u|}
Recall = \frac{|hits_u|}{|TestSet_u|}
Expert Review: Novelty, Context
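The three metrics above can be sketched in plain Python as follows. The function names and the sample inputs are hypothetical; precision and recall are computed for a single user u from that user's recommended set and held-out test set.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    pairs = list(pairs)
    return sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def precision_recall(recommended, held_out):
    """Per-user precision and recall from two collections of item ids."""
    reco_set, test_set = set(recommended), set(held_out)
    hits = reco_set & test_set
    precision = len(hits) / len(reco_set) if reco_set else 0.0
    recall = len(hits) / len(test_set) if test_set else 0.0
    return precision, recall

# Hypothetical inputs, for illustration only.
example_rmse = rmse([(4.0, 3.0), (2.0, 2.0), (1.0, 3.0)])
precision, recall = precision_recall([10, 20, 30, 40], [20, 40, 50])
```

RMSE rewards accurate rating values, while precision and recall only ask whether the recommended items overlap with what the user actually interacted with, so the two families of metrics can disagree about which model is better.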
CRISP-DM
Data Understanding
movielens = sc.textFile("../in/ml-100k/u.data")
Data Understanding
movielens.first()   # u'196\t242\t3\t881250949'
movielens.count()   # 100,000
Data Understanding
clean_data = movielens.map(lambda x: x.split('\t'))
rate = clean_data.map(lambda y: int(y[2]))
rate.mean()   # 3.529863
users = clean_data.map(lambda y: int(y[0]))
users.distinct().count()   # 943
clean_data.map(lambda y: int(y[1])).distinct().count()   # 1,682
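To make the map and distinct steps concrete without a Spark context, here is a plain-Python sketch of the same parsing logic applied to a few lines in the u.data layout (user, item, rating, timestamp, tab-separated). Only the first sample line appears in the slides; the other two are illustrative placeholders.

```python
# Sample lines in the u.data layout: user \t item \t rating \t timestamp.
lines = [
    "196\t242\t3\t881250949",  # from the slides
    "196\t100\t5\t0",          # placeholder
    "7\t242\t4\t0",            # placeholder
]

# Equivalent of movielens.map(lambda x: x.split('\t'))
rows = [line.split("\t") for line in lines]

# Equivalent of clean_data.map(lambda y: int(y[2])) followed by .mean()
rating_values = [int(row[2]) for row in rows]
mean_rating = sum(rating_values) / len(rating_values)

# Equivalent of .distinct().count() on the user and item columns
distinct_users = len({int(row[0]) for row in rows})
distinct_items = len({int(row[1]) for row in rows})
```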
Data Preparation
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
mls = movielens.map(lambda l: l.split('\t'))
ratings = mls.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
Rating(user=196, product=242, rating=3.0)
Data Preparation
train, test = ratings.randomSplit([0.7,0.3],7856)
train.count()
70,005
test.count()
29,995
train.cache()
test.cache()
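The counts above (70,005 and 29,995) are close to, but not exactly, a 70/30 split because the split assigns each record at random rather than cutting at a fixed index. The sketch below mimics that behavior in plain Python with a seeded generator. The function name is hypothetical, and this is an approximation of the idea, not Spark's sampler.

```python
import random

def random_split(items, weights, seed):
    """Assign each item to one split with probability proportional to weights."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for idx, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder.
            if draw < bound or idx == len(bounds) - 1:
                splits[idx].append(item)
                break
    return splits

train_ids, test_ids = random_split(range(1000), [0.7, 0.3], seed=7856)
```

Fixing the seed makes the split reproducible across runs, which matters when comparing models trained with different settings.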
Modeling
rank = 5 # Latent Factors to be made
numIterations = 10 # Times to repeat process
# Create the model on the training data
model = ALS.train(train, rank, numIterations)
Modeling / Evaluation
model.userFeatures()
model.productFeatures()
Modeling / Evaluation
# For Product X, Find N Users to Sell To
model.recommendUsers(242, 100)
# For User Y, Find N Products to Promote
model.recommendProducts(196, 10)
# Predict Single Product for Single User
model.predict(196, 242)
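Conceptually, a matrix factorization model scores a (user, product) pair as the dot product of the two latent feature vectors, and a top-N recommendation ranks products by that score. The sketch below shows this with made-up rank-2 vectors; in practice the vectors would come from model.userFeatures() and model.productFeatures(), and the helper names here are hypothetical.

```python
# Hypothetical rank-2 latent factors, keyed by id.
user_features = {196: [1.2, 0.8]}
product_features = {
    242: [2.0, 1.5],
    302: [1.0, 0.5],
    51: [0.5, 3.0],
}

def predict(user, product):
    """Predicted rating as the dot product of user and product factors."""
    return sum(
        a * b for a, b in zip(user_features[user], product_features[product])
    )

def recommend_products(user, n):
    """Ids of the n products with the highest predicted rating for user."""
    scored = [(predict(user, p), p) for p in product_features]
    return [p for _, p in sorted(scored, reverse=True)[:n]]

score = predict(196, 242)
top_two = recommend_products(196, 2)
```

Recommending users for a product is the mirror image: hold the product vector fixed and rank users by the same dot product.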
Modeling / Evaluation
# Predict Multi Users and Multi Products
# Pre-Processing
pred_input = train.map(lambda x: (x[0], x[1]))   # e.g. (196, 242)
# Lots of Predictions
pred = model.predictAll(pred_input)   # Returns Ratings(user, item, prediction)
# e.g. Rating(user=894, product=1560, rating=3.845)
Evaluation
User Item Actual Pred
196 242 3.0 3.91
186 302 3.0 3.29
22 377 1.0 1.09
244 51 2.0 3.66
298 474 4.0 4.11
TRAINING RMSE: 0.763
Evaluation
# Organize the data to make (user, product) the key
true_reorg = train.map(lambda x: ((x[0], x[1]), x[2]))   # e.g. ((196, 242), 3.0)
pred_reorg = pred.map(lambda x: ((x[0], x[1]), x[2]))
# Do the actual join
true_pred = true_reorg.join(pred_reorg)   # e.g. ((582, 1014), (4.0, 3.397))
from math import sqrt
MSE = true_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
RMSE = sqrt(MSE)   # Results in 0.7629908117414474
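The key-then-join pattern can be checked by hand on a small sample. The sketch below applies the same logic, using plain dictionaries in place of RDDs, to the five (user, item, actual, predicted) rows shown in the table on the earlier Evaluation slide. Because it covers only five rows, its result is not expected to match the full training RMSE.

```python
from math import sqrt

# Actual and predicted ratings keyed by (user, product),
# taken from the five-row table on the Evaluation slide.
actual = {
    (196, 242): 3.0, (186, 302): 3.0, (22, 377): 1.0,
    (244, 51): 2.0, (298, 474): 4.0,
}
predicted = {
    (196, 242): 3.91, (186, 302): 3.29, (22, 377): 1.09,
    (244, 51): 3.66, (298, 474): 4.11,
}

# Inner join on the key: only pairs present on both sides survive.
joined = {key: (actual[key], predicted[key]) for key in actual if key in predicted}

# Mean of squared differences, then the square root.
mse = sum((a - p) ** 2 for a, p in joined.values()) / len(joined)
rmse = sqrt(mse)
```

Because the join is an inner join, any (user, product) pair missing from either side silently drops out of the average, so it is worth comparing the joined count with the input count.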
Evaluation
test_input = test.map(lambda x: (x[0], x[1]))
pred_test = model.predictAll(test_input)
test_reorg = test.map(lambda x: ((x[0], x[1]), x[2]))
pred_reorg = pred_test.map(lambda x: ((x[0], x[1]), x[2]))
test_pred = test_reorg.join(pred_reorg)
test_MSE = test_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
test_RMSE = sqrt(test_MSE)
TEST RMSE: 1.0145
CRISP-DM
RECAP
RecSys are Nearest Neighbors or MF Based
ALS is Implemented in Spark
RECAP
rank = 5; numIterations = 10
# Create the model on the training data
model = ALS.train(train, rank, numIterations)
# Lots of Predictions
pred = model.predictAll(pred_input)
# Examine Model Features
model.productFeatures()
# Save your model!
model.save(sc, "../out/ml-model")
Questions?
LearnByMarketing.com