Universität Potsdam, Institut für Informatik
Lehrstuhl Maschinelles Lernen
Intelligent Data Analysis II
Recommendation
Tobias Scheffer
Recommendation Engines
Recommendation of products, music, contacts, …
Based on user features, item features, and past transactions: sales, reviews, clicks, …
User-specific recommendations, no global ranking of items.
Feedback loop: the choice of recommendations influences the available transaction and click data.
Netflix Prize
Data analysis challenge, 2006-2009.
Netflix made rating data available: 500,000 users, 18,000 movies, 100 million ratings.
Challenge: predict ratings that were held back for evaluation; improve by 10% over Netflix's own recommendations.
Award: $1 million.
Problem Setting
Users U = {1, …, n}
Items X = {1, …, n′}
Ratings D = {(u_1, x_1, y_1), …, (u_m, x_m, y_m)}
Rating space y_i ∈ Υ
  E.g., Υ = {−1, +1}, Υ = {★, …, ★★★★★}
Loss function ℓ(y_i, ŷ_i)
  E.g., missing a good movie is bad, but watching a terrible movie is worse.
Find rating model f_θ: (u, x) ↦ y.
Problem Setting: Matrix Notation
Users U = {1, …, n}
Items X = {1, …, n′}
Ratings as an incomplete matrix (rows = users, columns = items):

  Y = ( y_11  y_12   ·   )
      ( y_21   ·    y_23 )
      (  ·     ·    y_33 )

Rating space y_i ∈ Υ
  E.g., Υ = {−1, +1}, Υ = {★, …, ★★★★★}
Loss function ℓ(y_i, ŷ_i)
Problem Setting
Model f_θ(u, x)
Find model parameters that minimize the risk:
  θ* = argmin_θ ∫∫∫ ℓ(y, f_θ(u, x)) p(u, x, y) dx du dy
As usual, p(u, x, y) is unknown → minimize the regularized empirical risk:
  θ* = argmin_θ Σ_{i=1}^m ℓ(y_i, f_θ(u_i, x_i)) + λ Ω(θ)
Content-Based Recommendation
Idea: a user may like movies that are similar to other movies which they like.
Requirement: attributes of items, e.g.,
  tags,
  genre,
  actors,
  director,
  …
Content-Based Recommendation
Feature space for items.
E.g., Φ = (comedy, action, year, dir_tarantino, dir_cameron)^T
  φ(avatar) = (0, 1, 2009, 0, 1)^T
Content-Based Recommendation
Users U = {1, …, n}
Items X = {1, …, n′}
Ratings D = {(u_1, x_1, y_1), …, (u_m, x_m, y_m)}
Rating space y_i ∈ Υ
  E.g., Υ = {−1, +1}, Υ = {★, …, ★★★★★}
Loss function ℓ(y_i, ŷ_i)
  E.g., missing a good movie is bad, but watching a terrible movie is worse.
Feature function for items: φ: x ↦ ℝ^d
Find rating model f_θ: (u, x) ↦ y.
Independent Learning Problems for Users
Minimize the regularized empirical risk:
  θ* = argmin_θ Σ_{i=1}^m ℓ(y_i, f_θ(u_i, x_i)) + λ Ω(θ)
One model per user:
  f_{θ_u}: x ↦ Υ
One learning problem per user:
  θ_u* = argmin_{θ_u} Σ_{i: u_i = u} ℓ(y_i, f_{θ_u}(x_i)) + λ Ω(θ_u)
Independent Learning Problems for Users
One learning problem per user:
  ∀u: θ_u* = argmin_{θ_u} Σ_{i: u_i = u} ℓ(y_i, f_{θ_u}(x_i)) + λ Ω(θ_u)
Use any model class and learning mechanism; e.g., a linear model
  f_{θ_u}(x_i) = φ(x_i)^T θ_u
Logistic loss + ℓ2 regularization: logistic regression.
Hinge loss + ℓ2 regularization: SVM.
Squared loss + ℓ2 regularization: ridge regression.
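For the squared-loss case, each user's independent model has the closed-form ridge solution θ_u = (X^T X + λI)^{−1} X^T y over that user's rated items. A minimal sketch of this per-user setup; the function names and data layout are illustrative, not from the slides:

```python
import numpy as np

def fit_user_models(ratings, features, lam=1.0):
    """One independent ridge-regression model per user.

    ratings:  list of (user, item, rating) triples
    features: dict item -> feature vector phi(x)
    Returns a dict user -> parameter vector theta_u.
    """
    by_user = {}
    for u, x, y in ratings:
        by_user.setdefault(u, []).append((features[x], y))
    models = {}
    for u, pairs in by_user.items():
        X = np.array([phi for phi, _ in pairs])   # one row per rated item
        y = np.array([r for _, r in pairs])
        d = X.shape[1]
        # closed-form ridge solution: (X^T X + lam I)^-1 X^T y
        models[u] = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return models

def predict(models, features, u, x):
    """Predicted rating of item x for user u."""
    return float(features[x] @ models[u])
```

Note that each `models[u]` is fit only on user u's own ratings, which is exactly the weakness discussed on the next slide.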
Independent Learning Problems for Users
Obvious disadvantages of independent problems:
  Commonalities of users are not exploited,
  A user does not benefit from the ratings given by other users,
  Poor recommendations for users who gave few ratings.
Rather use a joint prediction model:
  Recommendations for each user should benefit from other users' ratings.
Independent Learning Problems
[Figure: parameter vectors θ_1, θ_2 of independent prediction models for users, each with its own regularizer.]
Joint Learning Problem
[Figure: parameter vectors θ_1, θ_2 of the users' prediction models, now tied together by a joint regularizer.]
Joint Learning Problem
Standard ℓ2 regularization follows from the assumption that the model parameters are governed by a normal distribution with mean vector zero.
Instead, assume that there is a non-zero population mean vector.
Joint Learning Problem
[Figure: graphical model of the hierarchical prior: user parameters θ_1, θ_2, …, θ_n are drawn around a population mean θ̄, which is drawn from a prior with mean 0 and covariance Σ.]
Joint Learning Problem
Population mean vector:
  θ̄ ~ N(0, (1/σ̄) I)
User-specific mean vector:
  θ_u ~ N(θ̄, (1/σ) I)
Substitution: θ_u = θ̄ + θ_u′; now θ̄ and θ_u′ have mean vector zero.
Negative log-prior = regularizer:
  Ω(θ̄ + θ_u′) = σ̄ ‖θ̄‖² + σ ‖θ_u′‖²
Joint Learning Problem
Joint optimization problem:
  min_{θ_1′, …, θ_n′, θ̄}  Σ_u Σ_{i: u_i = u} ℓ(y_i, f_{θ̄ + θ_u′}(x_i)) + σ Σ_u Ω(θ_u′) + σ̄ Ω(θ̄)
with θ_u = θ̄ + θ_u′; σ controls the coupling strength, σ̄ the global regularization.
The parameters θ_u′ are independent, θ̄ is shared. Hence, the θ_u are coupled.
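For squared loss and linear models f_θ(x) = φ(x)^T θ, this coupled problem can be solved by block coordinate descent: with θ̄ fixed, each θ_u′ is an independent ridge problem on the residuals; with the θ_u′ fixed, θ̄ is one pooled ridge problem. A sketch under these assumptions (the alternating scheme is one standard way to solve it, not the slides' own algorithm):

```python
import numpy as np

def fit_joint(ratings_per_user, sigma=1.0, sigma_bar=0.1, iters=20):
    """ratings_per_user: dict u -> (Phi_u, y_u), Phi_u of shape (m_u, d)."""
    d = next(iter(ratings_per_user.values()))[0].shape[1]
    theta_bar = np.zeros(d)
    theta_prime = {u: np.zeros(d) for u in ratings_per_user}
    for _ in range(iters):
        # theta_bar fixed: per-user ridge on the residuals y_u - Phi_u theta_bar
        for u, (Phi, y) in ratings_per_user.items():
            r = y - Phi @ theta_bar
            theta_prime[u] = np.linalg.solve(
                Phi.T @ Phi + sigma * np.eye(d), Phi.T @ r)
        # theta_prime fixed: one pooled ridge problem for the shared mean
        A = sigma_bar * np.eye(d)
        b = np.zeros(d)
        for u, (Phi, y) in ratings_per_user.items():
            r = y - Phi @ theta_prime[u]
            A += Phi.T @ Phi
            b += Phi.T @ r
        theta_bar = np.linalg.solve(A, b)
    return theta_bar, theta_prime
```

The prediction for user u uses the coupled parameter vector θ̄ + θ_u′, so users with few ratings fall back towards the population mean.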
Discussion
Each user benefits from other users' ratings.
Does not take into account that users have different tastes:
  Two sci-fi fans may have similar preferences, but a horror-movie fan and a romantic-comedy fan do not.
Idea: look at the ratings to determine how similar users are.
Collaborative Filtering
Idea: people like items that are liked by people who have similar preferences.
People who give similar ratings to items probably have similar preferences.
This is independent of item features.
Collaborative Filtering
Users U = {1, …, n}
Items X = {1, …, n′}
Ratings D = {(u_1, x_1, y_1), …, (u_m, x_m, y_m)}
Rating space y_i ∈ Υ
  E.g., Υ = {−1, +1}, Υ = {★, …, ★★★★★}
Loss function ℓ(y_i, ŷ_i)
Find rating model f_θ: (u, x) ↦ y.
Collaborative Filtering by Nearest Neighbor
Define a distance function on users:
  d(u, u′)
Predicted rating:
  f(u, x) = (1/k) Σ_{k nearest neighbors u_i of u} y_{u_i, x}
The predicted rating is the average rating of the k nearest neighbors in terms of d(u, u′).
No learning involved.
Performance hinges on d(u, u′).
Collaborative Filtering by Nearest Neighbor
Define a distance function on users:
  d(u, u′) = √( (1/n′) Σ_{x=1}^{n′} (y_{u′,x} − y_{u,x})² )
Euclidean distance between the ratings for all items.
Skip items that have not been rated by both users.
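A direct implementation of this distance, with missing ratings encoded as NaN. Averaging over only the jointly rated items is one reading of the 1/n′ normalization combined with the skip rule (an assumption, since the slide does not spell out the normalization for partially rated users):

```python
import numpy as np

def user_distance(y_u, y_v):
    """Root of the mean squared rating difference over jointly rated items.

    y_u, y_v: rating vectors over all items, with np.nan for missing ratings.
    """
    both = ~np.isnan(y_u) & ~np.isnan(y_v)   # items rated by both users
    if not both.any():
        return np.inf                        # no jointly rated items
    diff = y_u[both] - y_v[both]
    return float(np.sqrt(np.mean(diff ** 2)))
```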
Extensions
Normalize ratings (subtract the user's mean rating, divide by the user's standard deviation).
Weight the influence of neighbors by the inverse of their distance:
  f(u, x) = ( Σ_{k nearest neighbors u_i of u} (1/d(u, u_i)) y_{u_i, x} ) / ( Σ_{k nearest neighbors u_i of u} 1/d(u, u_i) )
Weight the influence of neighbors by the number of jointly rated items.
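The inverse-distance weighting can be sketched as follows. Applied to the example on the following slides (two neighbors at distances 2.9 and 1 with ratings 5 and 3), it yields roughly 3.5:

```python
import numpy as np

def predict_rating(distances, neighbor_ratings):
    """Distance-weighted average rating over the k nearest neighbors.

    distances:        d(u, u_i) for the k nearest neighbors of u
    neighbor_ratings: their ratings y_{u_i, x} for the target item x
    """
    w = 1.0 / np.asarray(distances, dtype=float)   # weight = inverse distance
    y = np.asarray(neighbor_ratings, dtype=float)
    return float(w @ y / w.sum())
```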
Collaborative Filtering: Example
          Matrix  Zombiland  Titanic  Death Proof
  Alice      4        ?         5          4
  Bob        5        5         1          ·
  Carol      5        3         ·          4

How much would Alice enjoy Zombiland?
Collaborative Filtering: Example
          Matrix  Zombiland  Titanic  Death Proof
  Alice      4        ?         5          4
  Bob        5        5         1          ·
  Carol      5        3         ·          4

d(u, u′) = √( (1/n′) Σ_{x=1}^{n′} (y_{u′,x} − y_{u,x})² ) over jointly rated items
d(A, B) = ?
d(A, C) = ?
d(B, C) = ?
Collaborative Filtering: Example
          Matrix  Zombiland  Titanic  Death Proof
  Alice      4        ?         5          4
  Bob        5        5         1          ·
  Carol      5        3         ·          4

d(u, u′) = √( (1/n′) Σ_{x=1}^{n′} (y_{u′,x} − y_{u,x})² ) over jointly rated items
d(A, B) = 2.9
d(A, C) = 1
d(B, C) = 1.4
Collaborative Filtering: Example
          Matrix  Zombiland  Titanic  Death Proof
  Alice      4        ?         5          4
  Bob        5        5         1          ·
  Carol      5        3         ·          4

f(A, Z) = ( Σ_{2 nearest neighbors u_i of A} (1/d(A, u_i)) y_{u_i, Z} ) / ( Σ_{2 nearest neighbors u_i of A} 1/d(A, u_i) ) = ?
Collaborative Filtering: Example
          Matrix  Zombiland  Titanic  Death Proof
  Alice      4        ?         5          4
  Bob        5        5         1          ·
  Carol      5        3         ·          4

f(A, Z) = ( (1/2.9)·5 + (1/1)·3 ) / ( 1/2.9 + 1/1 ) ≈ 3.5
Collaborative Filtering: Discussion
k-nearest-neighbor and similar methods are called memory-based approaches.
There are no model parameters; no optimization criterion is being optimized.
Each prediction requires an iteration over all training instances, which is impractical!
Better to train a model by minimizing an appropriate loss function over a space of model parameters, then use the model to make predictions quickly.
Latent Features
Idea: instead of an ad-hoc definition of the distance between users, learn features that actually represent preferences.
If, for every user u, we had a feature vector ψ_u that describes their preferences,
then we could learn parameters φ_x for item x such that φ_x^T ψ_u quantifies how much u enjoys x.
Latent Features
Or, turned around:
If, for every item x, we had a feature vector φ_x that characterizes its properties,
we could learn parameters ψ_u such that ψ_u^T φ_x quantifies how much u enjoys x.
In practice, some user attributes ψ_u and item attributes φ_x are usually available, but they are insufficient to understand u's preferences and x's relevant properties.
Latent Features
Idea: construct user attributes ψ_u and item attributes φ_x such that the ratings in the training data can be predicted accurately.
Decision function: f_{Ψ,Φ}(u, x) = ψ_u^T φ_x
The prediction is the product of user preferences and item properties.
Model parameters:
  Matrix Ψ of user features ψ_u for all users,
  Matrix Φ of item features φ_x for all items.
Latent Features
Optimization criterion:
  Ψ*, Φ* = argmin_{Ψ,Φ} Σ_{u,x} ℓ(y_{u,x}, f_{Ψ,Φ}(u, x)) + λ ( Σ_u ‖ψ_u‖² + Σ_x ‖φ_x‖² )
The feature vectors of all users and all items are regularized.
Latent Features
Both the item and the user features are the solution of an optimization problem.
The number of features k has to be set.
The meaning of the features is not pre-determined.
  Sometimes they turn out to be interpretable.
Matrix Factorization
Decision function:
  f_{Ψ,Φ}(u, x) = ψ_u^T φ_x
In matrix notation:
  Ŷ_{Ψ,Φ} = Ψ Φ^T
Matrix elements (Ψ is the n × k matrix of user features, Φ the n′ × k matrix of item features):

  ( ŷ_11  …  ŷ_1n′ )   ( ψ_11  …  ψ_1k )( φ_11  …  φ_n′1 )
  (    ⋱           ) = (    ⋱          )(    ⋱           )
  ( ŷ_n1  …  ŷ_nn′ )   ( ψ_n1  …  ψ_nk )( φ_1k  …  φ_n′k )
Matrix Factorization
Decision function in matrix notation: Ŷ = Ψ Φ^T.
[Figure: the predicted rating of Matrix for Alice is the inner product of the latent features of Alice (a row of Ψ) and the latent features of Matrix (a column of Φ^T).]
Matrix Factorization
Decision function in matrix notation: Ŷ = Ψ Φ^T.
[Figure: rows of Ψ hold the latent features of each user (e.g., Alice, Carol); columns of Φ^T hold the latent features of each item (e.g., Matrix, Death Proof).]
Matrix Factorization
Optimization criterion:
  Ψ*, Φ* = argmin_{Ψ,Φ} Σ_{u,x} ℓ(y_{u,x}, f_{Ψ,Φ}(u, x)) + λ ( ‖Ψ‖² + ‖Φ‖² )
The criterion is not convex:
  For instance, multiplying all feature vectors by −1 gives an equally good solution:
  f_{Ψ,Φ}(u, x) = ψ_u^T φ_x = (−ψ_u)^T (−φ_x)
Limiting the number of latent features to k restricts the rank of the matrix Ŷ.
Matrix Factorization
Optimization criterion:
  Ψ*, Φ* = argmin_{Ψ,Φ} Σ_{u,x} ℓ(y_{u,x}, f_{Ψ,Φ}(u, x)) + λ ( ‖Ψ‖² + ‖Φ‖² )
Optimization by
  stochastic gradient descent or
  alternating least squares.
Matrix Factorization by Stochastic Gradient Descent
Iterate through the ratings y_{u,x} in the training sample:
  ψ_u ← ψ_u − α ∂ℓ(y_{u,x}, f_{Ψ,Φ}(u, x)) / ∂ψ_u
  φ_x ← φ_x − α ∂ℓ(y_{u,x}, f_{Ψ,Φ}(u, x)) / ∂φ_x
until convergence.
Requires a differentiable loss function; e.g., squared loss, …
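For squared loss ℓ(y, f) = (f − y)², the gradients above reduce to err·φ_x and err·ψ_u plus the regularization terms. A minimal sketch; the hyperparameter values are illustrative:

```python
import numpy as np

def factorize_sgd(ratings, n_users, n_items, k=2, lam=1e-3, alpha=0.02,
                  epochs=2000, seed=0):
    """SGD for f(u, x) = psi_u . phi_x with squared loss + l2 regularization.

    ratings: list of (u, x, y) triples with integer user and item ids.
    """
    rng = np.random.default_rng(seed)
    Psi = 0.1 * rng.standard_normal((n_users, k))   # user features psi_u
    Phi = 0.1 * rng.standard_normal((n_items, k))   # item features phi_x
    for _ in range(epochs):
        for u, x, y in ratings:
            err = Psi[u] @ Phi[x] - y
            # gradient steps (constant factors folded into alpha);
            # snapshot psi_u so both updates use the pre-step value
            psi_u = Psi[u].copy()
            Psi[u] -= alpha * (err * Phi[x] + lam * Psi[u])
            Phi[x] -= alpha * (err * psi_u + lam * Phi[x])
    return Psi, Phi
```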
Matrix Factorization by Alternating Least Squares
For squared loss and parallel architectures.
Alternate between two optimization processes:
  Keep Φ fixed, optimize ψ_u in parallel for all u.
  Keep Ψ fixed, optimize φ_x in parallel for all x.
Matrix Factorization by Alternating Least Squares
For squared loss and parallel architectures.
Alternate between two optimization processes:
  Keep Φ fixed, optimize ψ_u in parallel for all u.
  Keep Ψ fixed, optimize φ_x in parallel for all x.
Optimization criterion for Ψ:
  ψ_u* = argmin_{ψ_u} ‖y_u − ŷ_u‖² + λ ‖ψ_u‖²
       = argmin_{ψ_u} ‖y_u − ψ_u^T Φ^T‖² + λ ‖ψ_u‖²
where (ŷ_u1 … ŷ_un′) = (ψ_u1 … ψ_uk) Φ^T.
Matrix Factorization by Alternating Least Squares
For squared loss and parallel architectures.
Alternate between two optimization processes:
  Keep Φ fixed, optimize ψ_u in parallel for all u.
  Keep Ψ fixed, optimize φ_x in parallel for all x.
Optimization criterion for Ψ:
  ψ_u* = argmin_{ψ_u} ‖y_u − ψ_u^T Φ^T‖² + λ ‖ψ_u‖²
Closed-form ridge solution:
  ψ_u* = (Φ^T Φ + λI)^{−1} Φ^T y_u
Matrix Factorization by Alternating Least Squares
For squared loss and parallel architectures.
Alternate between two optimization processes:
  Keep Φ fixed, optimize ψ_u in parallel for all u.
  Keep Ψ fixed, optimize φ_x in parallel for all x.
Optimization criterion for Ψ:
  ψ_u* = argmin_{ψ_u} ‖y_u − ψ_u^T Φ^T‖² + λ ‖ψ_u‖²
  ψ_u* = (Φ^T Φ + λI)^{−1} Φ^T y_u
Optimization criterion for Φ:
  φ_x* = argmin_{φ_x} ‖y_x − φ_x^T Ψ^T‖² + λ ‖φ_x‖²
  φ_x* = (Ψ^T Ψ + λI)^{−1} Ψ^T y_x
Matrix Factorization by Alternating Least Squares
For squared loss and parallel architectures.
Initialize Ψ, Φ randomly.
Repeat until convergence:
  Keep Φ fixed; for all u in parallel: ψ_u = (Φ^T Φ + λI)^{−1} Φ^T y_u
  Keep Ψ fixed; for all x in parallel: φ_x = (Ψ^T Ψ + λI)^{−1} Ψ^T y_x
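The loop above can be sketched with NumPy, restricting each ridge solve to the entries that are actually observed (the `mask` encoding of the incomplete matrix is an assumption, not from the slides):

```python
import numpy as np

def factorize_als(Y, mask, k=2, lam=0.1, iters=30, seed=0):
    """Alternating least squares on an incomplete rating matrix.

    Y:    (n_users, n_items) rating matrix; values where mask == 0 are ignored
    mask: 1 where a rating is observed, 0 where it is missing
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    Psi = 0.1 * rng.standard_normal((n, k))
    Phi = 0.1 * rng.standard_normal((m, k))
    for _ in range(iters):
        # keep Phi fixed: ridge solution for each user's psi_u
        for u in range(n):
            obs = mask[u].astype(bool)
            F = Phi[obs]                       # features of items u has rated
            Psi[u] = np.linalg.solve(F.T @ F + lam * np.eye(k), F.T @ Y[u, obs])
        # keep Psi fixed: ridge solution for each item's phi_x
        for x in range(m):
            obs = mask[:, x].astype(bool)
            F = Psi[obs]                       # features of users who rated x
            Phi[x] = np.linalg.solve(F.T @ F + lam * np.eye(k), F.T @ Y[obs, x])
    return Psi, Phi
```

Both inner loops are embarrassingly parallel, which is the point of ALS on parallel architectures.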
Extensions: Biases
Some users just give optimistic or pessimistic ratings; some items are hyped.
Decision function:
  f_{Ψ,Φ,b}(u, x) = b_u + b_x + ψ_u^T φ_x
Optimization criterion:
  Ψ*, Φ*, b* = argmin_{Ψ,Φ,b} Σ_{u,x} ℓ(y_{u,x}, f_{Ψ,Φ,b}(u, x)) + λ ( ‖Ψ‖² + ‖Φ‖² + Σ_u b_u² + Σ_x b_x² )
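The biased decision function changes the SGD inner loop only slightly: each bias gets its own gradient step. A sketch, again for squared loss with illustrative hyperparameters:

```python
import numpy as np

def factorize_biased_sgd(ratings, n_users, n_items, k=2, lam=1e-3,
                         alpha=0.02, epochs=2000, seed=0):
    """SGD for f(u, x) = b_u + b_x + psi_u . phi_x with squared loss."""
    rng = np.random.default_rng(seed)
    Psi = 0.1 * rng.standard_normal((n_users, k))
    Phi = 0.1 * rng.standard_normal((n_items, k))
    b_u = np.zeros(n_users)                  # user biases
    b_x = np.zeros(n_items)                  # item biases
    for _ in range(epochs):
        for u, x, y in ratings:
            err = b_u[u] + b_x[x] + Psi[u] @ Phi[x] - y
            b_u[u] -= alpha * (err + lam * b_u[u])
            b_x[x] -= alpha * (err + lam * b_x[x])
            psi_u = Psi[u].copy()            # pre-step value for both updates
            Psi[u] -= alpha * (err * Phi[x] + lam * Psi[u])
            Phi[x] -= alpha * (err * psi_u + lam * Phi[x])
    return b_u, b_x, Psi, Phi
```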
Extensions: Explicit Features
Often, explicit user and item features are available.
Concatenate the vectors ψ_u and φ_x; explicit features are fixed, latent features are free parameters.
Extensions: Temporal Dynamics
How much a user likes an item depends on the point in time when the rating takes place.
  f_{Ψ,Φ,b,t}(u, x) = b_u(t) + b_x(t) + ψ_u(t)^T φ_x
Summary
Purely content-based recommendation: users don't benefit from other users' ratings.
Collaborative filtering by nearest neighbors: fixed definition of the similarity of users. No model parameters, no learning; has to iterate over the data to make a recommendation.
Latent factor models, matrix factorization: user preferences and item properties are free parameters, optimized to minimize the discrepancy between inferred and actual ratings.