Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 1
CSE 6240: Web Search and Text Mining. Spring 2020
Recommendation Systems: Part II
Prof. Srijan Kumar, http://cc.gatech.edu/~srijan
Announcements
• Project:
– Final report rubric: released
– Final presentation: details forthcoming
Recommendation Systems
• Content-based
• Collaborative Filtering
• Latent Factor Models
• Case Study: Netflix Challenge
• Deep Recommender Systems

Slide reference: Mining Massive Datasets, http://mmds.org/
Latent Factor Models
• These models learn latent factors to represent users and items from the rating matrix
– Latent factors are not directly observable
– They are derived from the data
• Recall: network embeddings
• Methods:
– Singular Value Decomposition (SVD)
– Principal Component Analysis (PCA)
– Eigendecomposition
Latent Factors: Example
• Embedding axes are a type of latent factor
• In a user-movie rating matrix, movie latent factors can represent axes such as:
– Comedy vs. drama
– Degree of action
– Appropriateness for children
• User latent factors measure a user's affinity towards the corresponding movie factors
Latent Factors: Example
[Figure: movies plotted on two latent factors. Factor 1 runs from "Geared towards females" to "Geared towards males"; Factor 2 from "Serious" to "Funny". Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]
SVD
• SVD decomposes an input matrix into a product of factor matrices:
– A = U S V^T
– A: input data matrix (m × n)
– U: left singular vectors
– S: diagonal matrix of singular values
– V: right singular vectors
[Figure: A (m × n) ≈ U · S · V^T]
SVD
• SVD gives the minimum reconstruction error (sum of squared errors, SSE):

min_{U,S,V} Σ_{ij∈A} (A_ij − [U S V^T]_ij)²

• SSE and RMSE are monotonically related:
– RMSE = √(SSE / c), where c is the number of entries ⇒ SVD also minimizes RMSE
• Complication: the sum in the SVD error term is over all entries, but our rating matrix R has missing entries
– Solution: a missing rating is interpreted as a zero rating
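The decomposition and its rank-k truncation can be sketched with NumPy; the toy rating matrix below is hypothetical, and missing entries are set to zero as the slide suggests:

```python
import numpy as np

# Toy rating matrix: rows = users, cols = items; 0.0 = missing rating
# (missing entries treated as zero-ratings, per the slide).
R = np.array([
    [4.0, 0.0, 5.0, 5.0],
    [0.0, 3.0, 0.0, 4.0],
    [5.0, 4.0, 2.0, 0.0],
])

# Full (thin) SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Rank-k truncation keeps only the k largest singular values
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reconstruction error (SSE) over all entries
sse = np.sum((R - R_k) ** 2)
```

By the Eckart-Young theorem, no other rank-2 matrix achieves a smaller SSE against R, which is why SVD "gives minimum reconstruction error."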
SVD on Rating Matrix
• "SVD" on rating data: R ≈ Q · P^T
• Each row of Q represents an item
• Each column of P^T represents a user
[Figure: rating matrix R (items × users) factored as R ≈ Q · P^T, with Q (items × factors) and P^T (factors × users) filled with example values.]
Ratings as Products of Factors
• How do we estimate the missing rating of user x for item i?

r̂_xi = q_i · p_x = Σ_f q_if · p_xf

where q_i = row i of Q and p_x = column x of P^T
[Figure: the missing entry "?" in R is estimated from the corresponding row of Q and column of P^T.]
• Example: for the highlighted missing entry, r̂_xi = q_i · p_x = 2.4
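The dot-product prediction can be written out directly; the factor values below are hypothetical (f = 3 latent factors):

```python
import numpy as np

# Hypothetical factor vectors for item i (row i of Q)
# and user x (column x of P^T), with f = 3 latent factors.
q_i = np.array([-0.5, 0.6, 0.5])
p_x = np.array([1.1, -0.2, 0.3])

# Predicted rating: r̂_xi = q_i · p_x = sum over f of q_if * p_xf
r_hat = q_i @ p_x
```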
Latent Factor Models: Example
[Figure: the movie scatter plot in two latent dimensions (Factor 1: female- vs. male-oriented; Factor 2: serious vs. funny). Movies plotted in two dimensions; the dimensions have meaning.]
Latent Factor Models
[Figure: users fall in the same two-factor space as the movies, showing their preferences.]
SVD: Problems
• SVD minimizes SSE on the training data
– We want a large k (number of factors) to capture all the signal
– But the error on test data begins to rise for k > 2
• This is a classic example of overfitting:
– With too much freedom (too many free parameters), the model starts fitting noise
– The model fits the training data too well and thus does not generalize to unseen test data
Preventing Overfitting
• To combat overfitting we introduce regularization:
– Allow a rich model where there is sufficient data
– Shrink aggressively where data are scarce

min_{P,Q} Σ_{(x,i)∈training} (r_xi − q_i · p_x)²  +  λ₁ Σ_x ‖p_x‖²  +  λ₂ Σ_i ‖q_i‖²

The first term is the "error"; the λ terms penalize "length". λ₁, λ₂ are user-set regularization parameters.
• Note: we do not care about the "raw" value of the objective function, but about the P, Q that achieve its minimum
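The regularized objective can be sketched as a plain function of P and Q; the observed ratings and factor values below are hypothetical toys:

```python
import numpy as np

# Hypothetical observed ratings: {(user x, item i): r_xi}
R_obs = {(0, 0): 4.0, (0, 1): 5.0, (1, 1): 3.0}
rng = np.random.default_rng(0)
P = rng.normal(size=(2, 2))   # user factor vectors p_x (rows)
Q = rng.normal(size=(2, 2))   # item factor vectors q_i (rows)

def objective(lam1, lam2):
    """Fit term over observed entries plus the two penalties:
    lambda_1 * sum_x ||p_x||^2 + lambda_2 * sum_i ||q_i||^2."""
    fit = sum((r - Q[i] @ P[x]) ** 2 for (x, i), r in R_obs.items())
    return fit + lam1 * np.sum(P ** 2) + lam2 * np.sum(Q ** 2)
```

Setting λ₁ = λ₂ = 0 recovers the unregularized SSE; larger λ values trade fit for shorter factor vectors.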
The Effect of Regularization
[Figure: the movie scatter plot in two latent dimensions (serious vs. funny; female- vs. male-oriented), now also including The Color Purple, illustrating how regularization shrinks the factor estimates of sparsely rated movies.]
Modeling Biases and Interactions
• μ = overall mean rating
• b_x = bias of user x
• b_i = bias of movie i
• Baseline predictor (μ + b_x + b_i):
– Separates users and movies
– Benefits from insights into users' behavior
– Among the main practical contributions of the competition
• User-movie interaction (q_i · p_x):
– Characterizes the matching between users and movies
– Attracts most research in the field
– Benefits from algorithmic and mathematical innovations
Baseline Predictor
• We have expectations about the rating by user x of movie i, even without estimating x's attitude towards movies like i:
– The rating scale of user x
– The values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
– The (recent) popularity of movie i
– Selection bias, related to the number of ratings the user gave on the same day ("frequency")
Putting It All Together

r̂_xi = μ + b_x + b_i + q_i · p_x

where μ is the overall mean rating, b_x the bias for user x, b_i the bias for movie i, and q_i · p_x the user-movie interaction.
• Example:
– Mean rating: μ = 3.7
– You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
– Star Wars gets a mean rating 0.5 higher than the average movie: b_i = +0.5
– Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
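The slide's worked example, ignoring the interaction term, is just a sum of the baseline components:

```python
mu = 3.7    # overall mean rating
b_x = -1.0  # user bias: a critical reviewer, 1 star below the mean
b_i = 0.5   # movie bias: Star Wars, 0.5 stars above the average movie

# Baseline predictor only (the q_i . p_x interaction term is omitted here)
r_hat = mu + b_x + b_i
```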
Fitting the New Model
• Solve:

min_{Q,P} Σ_{(x,i)∈R} (r_xi − (μ + b_x + b_i + q_i · p_x))²
          + λ₁ Σ_i ‖q_i‖² + λ₂ Σ_x ‖p_x‖² + λ₃ Σ_x b_x² + λ₄ Σ_i b_i²

The first term is the goodness of fit; the λ terms are the regularization. λ is selected via grid search on a validation set.
• Use stochastic gradient descent to find the parameters
– Note: both the biases b_x, b_i and the interaction factors q_i, p_x are treated as parameters (we estimate them)
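A minimal stochastic gradient descent sketch for this objective follows; the ratings, learning rate, and regularization strength are hypothetical, and real systems tune them on a validation set:

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, k = 5, 4, 2
# Hypothetical (user x, item i, rating) triples
ratings = [(0, 1, 4.0), (1, 2, 3.0), (2, 0, 5.0), (3, 3, 2.0)]

mu = np.mean([r for _, _, r in ratings])   # overall mean rating
b_u = np.zeros(n_users)                    # user biases b_x
b_i = np.zeros(n_items)                    # item biases b_i
P = 0.1 * rng.normal(size=(n_users, k))    # user factors p_x
Q = 0.1 * rng.normal(size=(n_items, k))    # item factors q_i
eta, lam = 0.05, 0.02                      # learning rate, regularization

for _ in range(200):                       # epochs over the training set
    for x, i, r in ratings:
        # Prediction error for this (x, i) pair
        e = r - (mu + b_u[x] + b_i[i] + Q[i] @ P[x])
        # Gradient steps: move with the error, shrink by the regularizer
        b_u[x] += eta * (e - lam * b_u[x])
        b_i[i] += eta * (e - lam * b_i[i])
        P[x], Q[i] = (P[x] + eta * (e * Q[i] - lam * P[x]),
                      Q[i] + eta * (e * P[x] - lam * Q[i]))

train_sse = sum((r - (mu + b_u[x] + b_i[i] + Q[i] @ P[x])) ** 2
                for x, i, r in ratings)
```

After training, the squared error on the observed ratings is far below its starting value, showing the biases and factors absorbing the per-user and per-item deviations.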
Recommendation Systems
• Content-based
• Collaborative Filtering
• Latent Factor Models
• Case Study: Netflix Challenge
• Deep Recommender Systems
Case Study: Netflix Prize
Competition
• Task: reduce RMSE
• 2,700+ teams
• $1 million prize for a 10% improvement over Netflix's system
[Figure: RMSE ladder: Global average 1.1296; User average 1.0651; Movie average 1.0533; Netflix 0.9514; Basic collaborative filtering 0.94; Collaborative filtering++ 0.91; Latent factors 0.90; Latent factors + Biases 0.89; Grand Prize 0.8563]
Case Study: The Netflix Prize
• Training data
– 100 million ratings, 480,000 users, 17,770 movies
– 6 years of data: 2000-2005
• Test data
– Last few ratings of each user (2.8 million)
– Evaluation criterion: root-mean-square error

RMSE = √( (1/|R|) Σ_{(i,x)∈R} (r̂_xi − r_xi)² )

– Netflix's system RMSE: 0.9514
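The evaluation metric is a one-liner; the predictions and ground-truth ratings below are toy values:

```python
import numpy as np

def rmse(preds, truth):
    """RMSE = sqrt( (1/|R|) * sum over (i, x) in R of (r̂_xi - r_xi)^2 )."""
    preds, truth = np.asarray(preds, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((preds - truth) ** 2)))
```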
BellKor Recommender System
• The winner of the Netflix Challenge!
• Multi-scale modeling of the data: combine top-level, "regional" modeling with a refined, local view:
– Global: overall deviations of users/movies
– Factorization: addresses "regional" effects
– Collaborative filtering: extracts local patterns
Modeling Local & Global Effects
• Global:
– Mean movie rating: 3.7 stars
– The Sixth Sense is 0.5 stars above average
– Joe rates 0.2 stars below average
⇒ Baseline estimation: Joe will rate The Sixth Sense 4 stars
• Local neighborhood (CF/NN):
– Joe didn't like the related movie Signs
⇒ Final estimate: Joe will rate The Sixth Sense 3.8 stars
Interpolation Weights
• So far: r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)
– Weights w_ij are derived based on their role; no use of an arbitrary similarity measure (w_ij ≠ s_ij)
– Explicitly account for interrelationships among the neighboring movies
• Next: latent factor models
– Extract "regional" correlations
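The neighborhood formula above can be sketched directly; the baseline value, weights, and neighbor ratings are hypothetical toys:

```python
# r̂_xi = b_xi + sum over j in N(i; x) of w_ij * (r_xj - b_xj)
def neighborhood_predict(b_xi, neighbors):
    """neighbors: iterable of (w_ij, r_xj, b_xj) for movies j in N(i; x)."""
    return b_xi + sum(w * (r - b) for w, r, b in neighbors)

# Baseline 3.5; two neighbors rated 4.0 and 2.0 against baselines of 3.0
pred = neighborhood_predict(3.5, [(0.2, 4.0, 3.0), (0.1, 2.0, 3.0)])
```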
Temporal Biases of Users
• Sudden rise in the average movie rating (early 2004)
– Improvements in Netflix
– GUI improvements
– Meaning of rating changed
• Movie age
– Users prefer new movies without any specific reason
– Older movies are just inherently better than newer ones
Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09
Temporal Biases & Factors
• Original model: r_xi = μ + b_x + b_i + q_i · p_x
• Add time dependence to the biases: r_xi = μ + b_x(t) + b_i(t) + q_i · p_x
– Make the parameters b_x and b_i depend on time
– (1) Parameterize the time dependence by linear trends
– (2) Each bin corresponds to 10 consecutive weeks
• Add temporal dependence to the factors
– p_x(t) = user preference vector on day t
Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09
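The binned variant of the time-dependent bias can be sketched as follows; the per-bin offsets are hypothetical toy values:

```python
# Hypothetical sketch of a binned time-dependent movie bias,
# b_i(t) = b_i + b_{i,Bin(t)}, with bins of 10 consecutive weeks.
WEEKS_PER_BIN = 10

def time_bin(week):
    """Map a week index to its 10-week bin index."""
    return week // WEEKS_PER_BIN

def movie_bias(b_i_static, bin_offsets, week):
    """Static movie bias plus the offset of the bin containing `week`."""
    return b_i_static + bin_offsets[time_bin(week)]

offsets = [0.1, -0.2]  # toy per-bin offsets for one movie
```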
Adding Temporal Effects
[Figure: RMSE (0.875 to 0.92) vs. millions of parameters (1 to 10,000, log scale) for: CF (no time bias), basic latent factors, CF (time bias), latent factors w/ biases, + linear time factors, + per-day user biases, + CF. RMSE drops as temporal effects are added.]
Case Study: The Netflix Prize
[Figure: RMSE ladder: Global average 1.1296; User average 1.0651; Movie average 1.0533; Netflix 0.9514; Basic collaborative filtering 0.94; Collaborative filtering++ 0.91; Latent factors 0.90; Latent factors + Biases 0.89; Latent factors + Biases + Time 0.876; Grand Prize 0.8563]
• New update: added time
• But still no prize!
• What to do next?
Standing on June 26th, 2009
• The June 26th submission triggers a 30-day "last call"
The Last 30 Days
• Ensemble team formed
– A group of other teams on the leaderboard forms a new team
– Relies on combining their models
– Quickly also gets a qualifying score over 10%
• BellKor
– Continues to get small improvements in their scores
– Realizes they are in direct competition with Ensemble
• Strategy
– Both teams carefully monitor the leaderboard
– The only sure way to check for improvement is to submit a set of predictions
• This alerts the other team of your latest score
24 Hours from the Deadline
• Submissions limited to 1 a day
– Only 1 final submission could be made in the last 24h
• 24 hours before the deadline...
– A BellKor team member in Austria notices that Ensemble posts a score slightly better than BellKor's
• Frantic last 24 hours for both teams
– Much computer time spent on final optimization
– Carefully calibrated to end about an hour before the deadline
• Final submissions
– BellKor submits a little early (on purpose), 40 minutes before the deadline
– Ensemble submits their final entry 20 minutes later
– ...and everyone waits...
$1M Awarded September 21st, 2009
Recommendation Systems
• Content-based
• Collaborative Filtering
• Latent Factor Models
• Case Study: Netflix Challenge
• Deep Recommender Systems
Deep Recommender Systems
• How can deep learning advance recommendation systems?
• A simple way for content-based models: use CNNs and LSTMs to generate image and text features of items
Deep Recommender Systems
• But how can DL be used for the tasks and methods at the core of recommendation systems?
– For collaborative filtering?
– For latent factor models?
– For temporal dynamics?
– For some new techniques?
Why Deep Learning Techniques
Pros:
• Captures non-linearity well
• Non-manual representation learning
• Efficient sequence modeling
• Somewhat flexible and easy to retrain
Cons:
• Lack of interpretability
• Large data requirements
• Extensive hyper-parameter tuning
Applicable DL Techniques
Deep learning methods:
• MLPs and autoencoders
• CNNs
• RNNs
• Adversarial networks
• Attention models
• Deep reinforcement learning
How can we use these methods to improve recommender systems?
Several Methods
• Neural Collaborative Filtering
• Recurrent Recommender Systems
• LatentCross
• Dynamic User Model: JODIE
Neural Collaborative Filtering
• A neural extension of traditional recommender systems
• Input: rating matrix, plus (optional) user profile and item features
– If user/item features are unavailable, we can use one-hot vectors
• Output: user and item embeddings
• Traditional matrix factorization is a special case of NCF
• Reference: Neural Collaborative Filtering, He et al., WWW 2017
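A minimal NCF-style forward pass can be sketched in NumPy. This is an illustrative sketch, not He et al.'s exact architecture: the sizes, random embeddings, and single hidden layer are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, h = 4, 5, 8, 16

U = rng.normal(size=(n_users, d))       # user embedding matrix U
I = rng.normal(size=(n_items, d))       # item embedding matrix I
W1 = rng.normal(size=(2 * d, h))        # hidden layer weights
b1 = np.zeros(h)
w2 = rng.normal(size=h)                 # output layer -> scalar score

def predict(x, i):
    """Score for user x and item i: MLP over concatenated embeddings."""
    z = np.concatenate([U[x], I[i]])
    hidden = np.maximum(0.0, z @ W1 + b1)       # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(hidden @ w2))) # sigmoid -> (0, 1)

score = predict(0, 2)
```

Replacing the MLP with an element-wise product of U[x] and I[i] followed by a fixed sum recovers plain matrix factorization, which is the sense in which MF is a special case of NCF.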
NCF Setup
• User feature vector
• Item feature vector
• User embedding matrix: U
• Item embedding matrix: I
• Neural network: f
• Neural network parameters: 𝛩
• Predicted rating: r̂_ui = f(u, i; U, I, 𝛩)
NCF Model Architecture
• Multiple fully connected layers form the neural CF layers
• The output is a predicted rating score r̂_ui
• The real rating score is r_ui
NCF Model: Loss Function
• Train on the difference between the predicted rating and the real rating
• Use negative sampling to reduce the number of negative data points
• Loss: cross-entropy loss
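The cross-entropy loss over positives and sampled negatives can be sketched as follows; the labels and predicted scores below are hypothetical:

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy; eps-clipping avoids log(0)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))

# Observed interactions get label 1; sampled unobserved pairs get label 0.
y_true = np.array([1.0, 1.0, 0.0, 0.0])
y_pred = np.array([0.9, 0.8, 0.2, 0.1])
loss = bce(y_true, y_pred)
```

Negative sampling keeps the label-0 set small: instead of treating every unobserved user-item pair as a negative, only a few are sampled per positive.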