Recommender Systems with Apache Spark's ALS Function

Building a Recommender System in PySpark

Will Johnson - Uline - DePaul

LearnByMarketing.com

AGENDA
- RecSys
  * Basics
  * MF
  * Evaluation
  * Advanced
- PySpark
  * Basics
  * ALS

User Based Collaborative Filtering

[Figure: example grid of user ratings on a 1.0 to 5.0 scale, shown on two slides; the second slide adds a value of 3.8 to the grid.]
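As a rough illustration of the idea behind user-based collaborative filtering, the sketch below predicts a missing rating as a similarity-weighted average of other users' ratings for the same item. The toy ratings, the user names, and the choice of cosine similarity over co-rated items are assumptions made for illustration; they are not taken from the slides.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "ann": {"A": 4.5, "B": 4.0, "C": 5.0},
    "bob": {"A": 4.5, "B": 3.0, "C": 4.0, "D": 4.0},
    "cat": {"A": 1.0, "B": 2.0, "C": 1.5, "D": 2.0},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict_user_based(ratings, user, item):
    """Similarity-weighted average of the neighbours' ratings for `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    if weight_total == 0:
        return None  # no neighbour has rated the item
    return weighted_sum / weight_total

# "ann" has not rated item "D"; estimate it from "bob" and "cat"
prediction = predict_user_based(ratings, "ann", "D")
```

Because "ann" rates more like "bob" than like "cat", the estimate is pulled toward bob's rating. In practice ratings are often mean-centered per user first, since cosine similarity on raw positive ratings separates users only weakly.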

Item Based Collaborative Filtering

Matrix Factorization
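A minimal sketch of the alternating least squares idea, assuming a single latent factor (rank 1) so that each update has a simple closed form. The toy ratings, the regularization value, and the function names are hypothetical; Spark's ALS handles general rank and distributed data, which this sketch does not attempt.

```python
# Hypothetical toy data: (user, item, rating) triples
observed = [
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.0), (1, 2, 2.0),
    (2, 1, 2.0), (2, 2, 1.0),
]
n_users, n_items = 3, 3
reg = 0.1  # assumed regularization strength

def squared_error(u, v):
    """Sum of squared errors over the observed ratings."""
    return sum((r - u[i] * v[j]) ** 2 for i, j, r in observed)

def als_rank1(n_iters=10):
    """Alternate closed-form updates of user and item factors (rank 1)."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(n_iters):
        # Hold item factors fixed, solve for each user factor
        for i in range(n_users):
            num = sum(r * v[j] for ui, j, r in observed if ui == i)
            den = reg + sum(v[j] ** 2 for ui, j, r in observed if ui == i)
            u[i] = num / den
        # Hold user factors fixed, solve for each item factor
        for j in range(n_items):
            num = sum(r * u[i] for i, vj, r in observed if vj == j)
            den = reg + sum(u[i] ** 2 for i, vj, r in observed if vj == j)
            v[j] = num / den
    return u, v

u, v = als_rank1()
# The predicted rating for an unobserved pair is the product of its factors
predicted_0_2 = u[0] * v[2]
```

Each half-step is an ordinary regularized least squares problem, which is why the method alternates: with one side held fixed, the other side can be solved exactly.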

Evaluation

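$$\text{RMSE} = \sqrt{\frac{1}{n} \sum (\text{Predicted} - \text{Actual})^2}$$

$$\text{Precision}_u = \frac{|\text{hits}_u|}{|\text{RecoSet}_u|}$$

$$\text{Recall}_u = \frac{|\text{hits}_u|}{|\text{TestSet}_u|}$$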
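The metrics above can be sketched in plain Python as follows. The function names and the example values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    n = len(actual)
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def precision_recall(reco_set, test_set):
    """Precision and recall for one user's recommendations."""
    hits = set(reco_set) & set(test_set)
    precision = len(hits) / len(reco_set) if reco_set else 0.0
    recall = len(hits) / len(test_set) if test_set else 0.0
    return precision, recall

# Hypothetical values for illustration
example_rmse = rmse([4.0, 3.0, 1.0], [3.0, 3.0, 1.0])
precision, recall = precision_recall(
    reco_set=[242, 302, 377, 51],  # items recommended to the user
    test_set=[242, 51, 474],       # items the user actually rated in the test set
)
```

Here two of the four recommended items appear in the test set, so precision is 2/4 and recall is 2/3.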

Expert Review: Novelty, Context

CRISP-DM

Data Understanding

movielens = sc.textFile("../in/ml-100k/u.data")

Data Understanding

movielens.first()  # u'196\t242\t3\t881250949'

movielens.count()  # 100,000
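Each record in u.data is a tab-separated line of user id, item id, rating, and timestamp. As a small sketch of how one line could be turned into typed fields, the helper below parses the sample record shown on the slide. The names `RatingRecord` and `parse_rating_line` are hypothetical and do not appear in the deck.

```python
from collections import namedtuple

# Hypothetical record type; field order follows the MovieLens u.data layout
RatingRecord = namedtuple("RatingRecord", ["user", "item", "rating", "timestamp"])

def parse_rating_line(line):
    """Split one tab-separated u.data line into typed fields."""
    user, item, rating, timestamp = line.split("\t")
    return RatingRecord(int(user), int(item), float(rating), int(timestamp))

# The sample record returned by movielens.first()
record = parse_rating_line("196\t242\t3\t881250949")
```

A plain function like this could also be passed to an RDD's map in place of a lambda, which tends to be easier to test in isolation.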

Data Understanding

clean_data = movielens.map(lambda x: x.split('\t'))

rate = clean_data.map(lambda y: int(y[2]))
rate.mean()  # 3.529863

users = clean_data.map(lambda y: int(y[0]))
users.distinct().count()  # 943

clean_data.map(lambda y: int(y[1])).distinct().count()  # 1,682

Data Preparation

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

mls = movielens.map(lambda l: l.split('\t'))
ratings = mls.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))

Rating(user=196, product=242, rating=3.0)

Data Preparation

train, test = ratings.randomSplit([0.7, 0.3], 7856)

train.count()  # 70,005
test.count()  # 29,995

train.cache()
test.cache()

Modeling

rank = 5 # Latent Factors to be made

numIterations = 10 # Times to repeat process

# Create the model on the training data
model = ALS.train(train, rank, numIterations)

Modeling / Evaluation

model.userFeatures()

model.productFeatures()


# For Product X, find N users to sell to
model.recommendUsers(242, 100)

# For User Y, find N products to promote
model.recommendProducts(196, 10)

# Predict a single product for a single user
model.predict(196, 242)

# Predict multiple users and multiple products
# Pre-processing: keep only the (user, product) pairs, e.g. (196, 242)
pred_input = train.map(lambda x: (x[0], x[1]))

# Lots of predictions
# Returns Rating(user, item, prediction), e.g. Rating(user=894, product=1560, rating=3.845)
pred = model.predictAll(pred_input)

Evaluation

User  Item  Actual  Pred
196   242   3.0     3.91
186   302   3.0     3.29
22    377   1.0     1.09
244   51    2.0     3.66
298   474   4.0     4.11

TRAINING RMSE: 0.763

Evaluation

# Organize the data to make (user, product) the key, e.g. ((196, 242), 3.0)
true_reorg = train.map(lambda x: ((x[0], x[1]), x[2]))
pred_reorg = pred.map(lambda x: ((x[0], x[1]), x[2]))

# Do the actual join, e.g. ((582, 1014), (4.0, 3.397))
true_pred = true_reorg.join(pred_reorg)

from math import sqrt
MSE = true_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
RMSE = sqrt(MSE)  # Results in 0.7629908117414474
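The keyed join can be mimicked in plain Python with dictionaries, which may help when checking the logic without a Spark context. This sketch uses only the five sample rows from the Actual/Pred table in these slides, so its result is not the full training RMSE; the `inner_join` helper is hypothetical.

```python
from math import sqrt

# (user, product) -> rating, taken from the five sample rows in the slides
actual = {
    (196, 242): 3.0, (186, 302): 3.0, (22, 377): 1.0,
    (244, 51): 2.0, (298, 474): 4.0,
}
predicted = {
    (196, 242): 3.91, (186, 302): 3.29, (22, 377): 1.09,
    (244, 51): 3.66, (298, 474): 4.11,
}

def inner_join(left, right):
    """Keep only keys present in both dicts, pairing their values."""
    return {key: (left[key], right[key]) for key in left if key in right}

# Each joined value is an (actual, predicted) pair, like the RDD join output
joined = inner_join(actual, predicted)
mse = sum((a - p) ** 2 for a, p in joined.values()) / len(joined)
rmse = sqrt(mse)
```

As with the RDD join, any key missing from either side is dropped, so the error is averaged only over pairs that have both an actual and a predicted rating.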

Evaluation

test_input = test.map(lambda x: (x[0], x[1]))
pred_test = model.predictAll(test_input)

test_reorg = test.map(lambda x: ((x[0], x[1]), x[2]))
pred_reorg = pred_test.map(lambda x: ((x[0], x[1]), x[2]))
test_pred = test_reorg.join(pred_reorg)

test_MSE = test_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
test_RMSE = sqrt(test_MSE)

TEST RMSE: 1.0145

CRISP-DM

RECAP

RecSys are Nearest Neighbors or MF Based

ALS is Implemented in Spark

RECAP

rank = 5; numIterations = 10

# Create the model on the training data
model = ALS.train(train, rank, numIterations)

# Lots of predictions
pred = model.predictAll(pred_input)

# Examine model features
model.productFeatures()

# Save your model!
model.save(sc, "../out/ml-model")

Questions?
