Fast ALS-Based Matrix Factorization for Recommender Systems
David Zibriczky
LAWA Workpackage Meeting
16th January, 2013
Problem setting
Item Recommendation
• Classical item recommendation problem (cf. the Netflix Prize)
• Explicit feedback (ratings)

[Figure: one user's ratings for The Matrix, The Matrix 2, Twilight and The Matrix 3: one known rating (5), the rest unknown (?)]
Collaborative Filtering (Explicit)
• Classical item recommendation problem (cf. the Netflix Prize)
• Explicit feedback (ratings)
• Collaborative Filtering: estimates are based on the ratings of other users

[Figure: user-item rating matrix over the same movies, with known ratings (4s and 5s) from other users and unknown entries (?)]
Collaborative Filtering (Implicit)
• Items are not only movies (live content, products, holidays, …)
• Implicit feedback (buy, view, …)
• Less information about preferences

[Figure: user-item matrix over Item1-Item4 with implicit events and unknown entries (?)]
Industrial motivation
• Keeping the response time low
• User models must stay up to date; adaptation should be fast
• Items may change rapidly, so training time can become a bottleneck for live performance
• Increasing amount of data per customer → increasing training time
• Limited resources
Model
Preference Matrix
• Matrix representation
• Implicit feedback: assume positive preference, value = 1
• How to estimate the unknown preferences?
• Sorting items by estimated preference gives the item recommendation
R Item1 Item2 Item3 Item4
User1 1 ? ? ?
User2 ? ? 1 ?
User3 1 1 ? ?
User4 ? 1 ? 1
Matrix Factorization

R ≈ P Q^T,   r̂_ui = p_u^T q_i

R (N×M): preference matrix
P (N×K): user feature matrix
Q (M×K): item feature matrix
N: #users, M: #items, K: #features, with K ≪ N and K ≪ M
p_u := (P_u)^T (the u-th row of P as a column vector), q_i := (Q_i)^T (the i-th row of Q as a column vector)

[Figure: r̂_ui is the dot product of the row p_u^T of P and the column q_i of Q^T]
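To make the notation concrete, here is a minimal numpy sketch (toy values, chosen for illustration) of how P and Q produce an estimated preference for every user-item pair and a per-user item ranking:

```python
import numpy as np

# Toy dimensions: N=4 users, M=4 items, K=2 latent features (K << N, M in practice)
P = np.array([[-0.2,  0.6],
              [ 0.6,  0.4],
              [ 0.7,  0.2],
              [ 0.5, -0.2]])          # user feature matrix, N x K
Q = np.array([[ 0.1,  0.6],
              [-0.4,  0.7],
              [ 0.8, -0.7],
              [ 0.6, -0.2]])          # item feature matrix, M x K

R_hat = P @ Q.T                       # estimated preferences: R_hat[u, i] = p_u . q_i

# Recommendation for one user: sort items by estimated preference
u = 0
ranking = np.argsort(-R_hat[u])       # item indices, best first
```

Sorting the estimated row is exactly the "sorting items by estimation" step of the item recommendation.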
Objective Function
Preference Matrix
• Zero value for unknown preferences (zero examples); in practice there are many 0s and few 1s

R      Item1  Item2  Item3  Item4
User1    1      0      0      0
User2    0      0      1      0
User3    1      1      0      0
User4    0      1      0      1

• c_ui: confidence of a known feedback (constant, or a function of the context of the event)
• Zero examples are less important than positive ones, but still matter
Confidence Matrix

R      Item1  Item2  Item3  Item4
User1    1      0      0      0
User2    0      0      1      0
User3    1      1      0      0
User4    0      1      0      1

C      Item1  Item2  Item3  Item4
User1   c11     1      1      1
User2    1      1     c23     1
User3   c31    c32     1      1
User4    1     c42     1     c44
• Objective function: Weighted Sum of Squared Errors (with R and C as above)

f(P, Q) = WSSE = Σ_{(u,i)} c_ui (r_ui − r̂_ui)²

• Find the P and Q that minimize f
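Spelled out in code, the objective is a confidence-weighted squared loss over every cell, zero examples included (a sketch with an assumed confidence scheme: c_ui = 10 for positives, 1 for zero examples):

```python
import numpy as np

R = np.array([[1., 0, 0, 0],
              [0., 0, 1, 0],
              [1., 1, 0, 0],
              [0., 1, 0, 1]])          # implicit preference matrix (0 = zero example)
C = np.where(R > 0, 10.0, 1.0)         # assumed confidences: higher weight on known feedback

def wsse(P, Q, R, C):
    """Weighted sum of squared errors over ALL cells, zero examples included."""
    E = R - P @ Q.T                    # residuals r_ui - r_hat_ui
    return float(np.sum(C * E**2))

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(4, 2))
Q = rng.normal(scale=0.1, size=(4, 2))
loss = wsse(P, Q, R, C)
```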
Optimizer
Optimizer – Alternating Least Squares
• Fix Q and solve for each user vector, then fix P and solve for each item vector (Ridge Regression):
  p_u = (Q^T C^u Q)^{-1} Q^T C^u r_u   (r_u: the u-th row of R; C^u: diagonal matrix of the confidences c_u1, …, c_uM)
  q_i = (P^T C^i P)^{-1} P^T C^i r_i   (r_i: the i-th column of R; C^i defined analogously)

Example snapshot during training (R as above):
P
 -0.2  0.6
  0.6  0.4
  0.7  0.2
  0.5 -0.2
Q^T
  0.1 -0.4  0.8  0.6
  0.6  0.7 -0.7 -0.2
Alternating the two steps, Q^T is re-solved first (e.g. to 0.3 -0.3 0.7 0.7 / 0.7 0.8 -0.5 -0.1), then P (e.g. to -0.2 0.7 / 0.6 0.5 / 0.8 0.2 / 0.6 -0.2), and so on until convergence.
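One user step of the ALS update can be sketched directly from the formula (toy values; no regularization term, matching the slide):

```python
import numpy as np

def als_user_step(Q, c_u, r_u):
    """One ALS user update: p_u = (Q^T C^u Q)^{-1} Q^T C^u r_u,
    where C^u = diag(c_u).  (Regularization omitted, as on the slide.)"""
    A = Q.T @ (c_u[:, None] * Q)       # Q^T C^u Q, a K x K matrix
    b = Q.T @ (c_u * r_u)              # Q^T C^u r_u, a K-vector
    return np.linalg.solve(A, b)

# Hypothetical example values
Q = np.array([[0.1, 0.6], [-0.4, 0.7], [0.8, -0.7], [0.6, -0.2]])  # item factors, M x K
r_u = np.array([1.0, 0, 0, 0])         # the user's row of R
c_u = np.array([10.0, 1, 1, 1])        # assumed confidences
p_u = als_user_step(Q, c_u, r_u)
```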
Optimizer – Alternating Least Squares
• Complexity of the naive solution: O(IK²NM + IK³(N + M))
  (E: number of examples, I: number of iterations)
• Improvement (Hu, Koren, Volinsky):
  Ridge Regression: p_u = (Q^T C^u Q)^{-1} Q^T C^u r_u
  Q^T C^u Q = Q^T Q + Q^T (C^u − I) Q = COV_Q^0 + COV_Q^+; the O(IK²NM) term is the costly one
  COV_Q^0 is user independent; it needs to be computed only once, at the start of each iteration
  Computing COV_Q^+ takes only #P(u)^+ steps
    #P(u)^+: number of positive examples of user u
  Complexity: O(IK²E + IK³(N + M)) = O(IK²(E + K(N + M)))
  Codename: IALS
• Complexity issues on large datasets:
  If K is low, O(IK²E) is dominant
  If K is high, O(IK³(N + M)) is dominant
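The speed-up rests on splitting Q^T C^u Q into a shared part and a small per-user correction. A sketch with hypothetical confidences (in the real IALS, COV_Q^0 is computed once per sweep rather than inside each call):

```python
import numpy as np

def ials_user_step(Q, pos_items, pos_conf):
    """User update with the Hu-Koren-Volinsky trick:
    Q^T C^u Q = Q^T Q + sum over positives of (c_ui - 1) q_i q_i^T = COV_Q^0 + COV_Q^+,
    so per user only the #P(u)+ positive items are touched."""
    COV0 = Q.T @ Q                       # user-independent part (shared within a sweep)
    A = COV0.copy()
    b = np.zeros(Q.shape[1])
    for i, c in zip(pos_items, pos_conf):
        q = Q[i]
        A += (c - 1.0) * np.outer(q, q)  # COV_Q^+ correction for one positive item
        b += c * q                       # Q^T C^u r_u, since r_ui = 1 on positives, 0 elsewhere
    return np.linalg.solve(A, b)

# Hypothetical example: one positive item (item 0) with confidence 10
Q = np.array([[0.1, 0.6], [-0.4, 0.7], [0.8, -0.7], [0.6, -0.2]])
p_u = ials_user_step(Q, pos_items=[0], pos_conf=[10.0])
```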
Problem: Complexity
Ridge Regression with Coordinate Descent
• Initialize p_u with zero values

Example (User1's row of R; K = 3):
R      Item1  Item2  Item3  Item4
User1    1      0      0      0
Q^T
  0.9 -0.4  0.8  0.6
  0.6  0.7 -0.7 -0.2
 -0.1 -0.4 -0.1  0.6
p_u = (0, 0, 0)
• Residual vector: e_u = r_u − p_u Q^T
• Optimize only one feature of p_u at a time:
  p_uk = (Σ_{i=1}^{M} c_ui q_ik e_ui) / (Σ_{i=1}^{M} c_ui q_ik q_ik) = SQE / SQQ
• After changing p_uk, update the residuals: e_ui ← e_ui − Δp_uk · q_ik
• Apply more iterations

First coordinate update: p_u = (0.51, 0, 0)
Updating the remaining coordinates and then iterating again, p_u evolves:
(0.51, 0, 0) → (0.51, 0.10, 0) → (0.51, 0.10, 0.08) → (0.47, 0.10, 0.08) → (0.46, 0.11, 0.07) → …
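The per-user coordinate descent loop can be sketched as follows (toy values; no regularization, matching the slide):

```python
import numpy as np

def rr_cd_user(Q, c_u, r_u, n_sweeps=3):
    """Ridge Regression for one user's p_u via Coordinate Descent:
    optimize one feature p_uk at a time as p_uk = SQE / SQQ,
    then update the residuals e_ui (regularization omitted, as on the slide)."""
    M, K = Q.shape
    p_u = np.zeros(K)                  # initialize with zero values
    e_u = r_u.astype(float).copy()     # residuals r_ui - r_hat_ui (p_u = 0, so e_u = r_u)
    for _ in range(n_sweeps):          # "apply more iterations"
        for k in range(K):
            q_k = Q[:, k]
            sqe = np.sum(c_u * q_k * (e_u + p_u[k] * q_k))  # restore p_uk's own contribution
            sqq = np.sum(c_u * q_k * q_k)
            new = sqe / sqq
            e_u -= (new - p_u[k]) * q_k                     # e_ui <- e_ui - delta_p_uk * q_ik
            p_u[k] = new
    return p_u

# Hypothetical example: User1's row, confidence 10 on the single positive item
Q = np.array([[0.9, 0.6, -0.1], [-0.4, 0.7, -0.4], [0.8, -0.7, -0.1], [0.6, -0.2, 0.6]])
c_u = np.array([10.0, 1, 1, 1])
r_u = np.array([1.0, 0, 0, 0])
p_u = rr_cd_user(Q, c_u, r_u)
```

With enough sweeps the coordinate updates converge to the exact Ridge Regression solution.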
Optimizer – Coordinate Descent
• Apply Ridge Regression with Coordinate Descent alternately to the rows of P and the rows of Q

Example snapshot (R as above; the first coordinate of p_1 has just been updated):
Q^T
  0.1  0.4  1.1  0.6
  0.6  0.7  1.5  1.0
P
  0.3  0
  0    0
  0    0
  0    0
The coordinates of P are updated one by one across all users (reaching, e.g., 0.3 -0.1 / 0.1 -0.5 / -0.4 0.2 / 0.5 -0.4), then Q^T is recomputed the same way, coordinate by coordinate (reaching, e.g., 0.1 0.4 -0.1 0.2 / 0.6 0.7 0.8 0.5), and the alternation continues until convergence.
Optimizer – Coordinate Descent
• Complexity of the naive solution: O(IKNM)
• Ridge Regression with CD computes the features directly from the examples, so the covariance-precomputation trick cannot be applied here.
Optimizer – Coordinate Descent Improvement
• Synthetic examples (Pilászy, Zibriczky, Tikk)
• Solution of Ridge Regression with CD: p_uk = (Σ_{i=1}^{M} c_ui q_ik e_ui) / (Σ_{i=1}^{M} c_ui q_ik q_ik) = SQE / SQQ
• Precompute the statistics for a user who watched nothing (SQE^0 and SQQ^0)
• The solution is then computed incrementally: p_uk = SQE / SQQ = (SQE^0 + SQE^+) / (SQQ^0 + SQQ^+)   (M + #P(u)^+ steps)
• Eigenvalue decomposition: Q^T Q = S Λ S^T = (S Λ^{1/2})(Λ^{1/2} S^T) = G^T G, with G = Λ^{1/2} S^T
• The zero examples are compressed into K synthetic examples: Q (M×K) → G (K×K)
• SGG^0 = SQQ^0, but needs only K steps to compute: p_uk = (SGE^0 + SQE^+) / (SGG^0 + SQQ^+)   (K + #P(u)^+ steps)
• SGE^0 is computed the same way as SQE^0, but in K steps only
• Complexity: O(IK(E + KM + KN)) = O(IK(E + K(M + N)))
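The core algebraic trick, compressing the M zero examples into K synthetic ones, can be checked numerically (random toy Q; this shows only the identity Q^T Q = G^T G, not the full IALS1 update):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 1000, 8
Q = rng.normal(size=(M, K))            # item feature matrix (hypothetical values)

# Eigenvalue decomposition of the K x K Gram matrix: Q^T Q = S Lambda S^T
lam, S = np.linalg.eigh(Q.T @ Q)       # Q^T Q is PSD, so all eigenvalues are >= 0
G = np.diag(np.sqrt(lam)) @ S.T        # K synthetic examples with G^T G = Q^T Q

# The K rows of G stand in for the M zero examples: any statistic of the form
# sum_i q_ik q_il over all items (e.g. SQQ^0) is recovered from G in K steps.
```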
Optimizer – Coordinate Descent
• Complexity of the naive solution: O(IKNM)
• Ridge Regression with CD computes the features directly from the examples; covariance precomputation cannot be applied, but
• Synthetic examples can
• Codename: IALS1
• Complexity reduction (IALS → IALS1): O(IK²(E + K(N + M))) → O(IK(E + K(M + N)))
• IALS1 requires a higher K for the same accuracy as IALS.
Optimizer – Coordinate Descent
…does it work in practice?
Comparison
• Average Rank Position (ARP) on a subset of a proprietary implicit feedback dataset; lower is better.
• IALS1 offers better time-accuracy tradeoffs, especially when K is large.

           IALS                IALS1
K      ARP     time (s)    ARP     time (s)
5      0.1903      153     0.1898      112
10     0.1578      254     0.1588      134
20     0.1427      644     0.1432      209
50     0.1334     2862     0.1344      525
100    0.1314    11441     0.1325     1361
250    0.1311    92944     0.1312     6651
500    N/A         N/A     0.1282    24697
1000   N/A         N/A     0.1242   104611

[Figure: ARP vs. training time (s, log scale) for IALS and IALS1]
Conclusion
• Explicit feedback is rarely provided, or not at all.
• Implicit feedback is more general.
• Alternating Least Squares has complexity issues.
• They can be solved efficiently by using approximation and synthetic examples.
• IALS1 offers better time-accuracy tradeoffs, especially when K is large.
• IALS is an approximation algorithm too, so why not make it even more approximate?
Other algorithms
Model – Tensor Factorization
• Users have different preferences during the day
• Split the events into time periods, e.g. period 1: 06:00-14:00, period 2: 14:00-22:00, period 3: 22:00-06:00
• This gives one sparse preference matrix per period, R1, R2, R3, over the same users and items

[Figure: the sparse preference matrices R1, R2, R3 built up period by period]
Model – Tensor Factorization

r̂_uit = Σ_k p_uk q_ik t_tk,   R = P ∘ Q ∘ T

R (N×M×L): preference tensor
P (N×K): user feature matrix
Q (M×K): item feature matrix
T (L×K): time feature matrix
N: #users, M: #items, L: #time periods, K: #features

[Figure: r_uit expressed from row u of P, row i of Q and row t of T]
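The three-way model can be sketched with random toy factors; `np.einsum` evaluates r̂_uit = Σ_k p_uk q_ik t_tk for every cell at once:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, L, K = 5, 6, 3, 2
P = rng.normal(size=(N, K))            # user features
Q = rng.normal(size=(M, K))            # item features
T = rng.normal(size=(L, K))            # time-period features

def predict(u, i, t):
    """r_hat_uit = sum_k p_uk * q_ik * t_tk."""
    return float(np.sum(P[u] * Q[i] * T[t]))

# Equivalently, the whole estimated tensor at once: R_hat = P o Q o T
R_hat = np.einsum('uk,ik,tk->uit', P, Q, T)
```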
Comparison – ITALS vs. IALS
• Datasets: Netflix (rating 5), IPTV provider VOD rentals, grocery purchases
• Evaluation metrics: Recall@20, Precision-Recall@20
• Number of features: 20

Test case (Recall@20)   IALS    ITALS
Netflix Probe           0.087   0.097
Netflix Time Split      0.054   0.071
IPTV VOD 1 day          0.063   0.112
IPTV VOD 1 week         0.055   0.100
Grocery                 0.065   0.103
Objective Function – Ranking-based objective function
• Ranking-based objective function approach:
  (r_ui − r_uj): difference of preference between items i and j
  (r̂_ui − r̂_uj): estimated difference of preference between items i and j
  s_j: importance of item j in the objective function
• Model: Matrix Factorization
• Optimizer: Alternating Least Squares
• Name: RankALS

f(θ) = Σ_{u∈U} Σ_{i∈I} c_ui Σ_{j∈I} s_j [(r_ui − r_uj) − (r̂_ui − r̂_uj)]²
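Written out naively, the ranking objective looks like this (toy confidences and uniform item importances s_j = 1 are assumptions; RankALS itself reorders the sums so this O(NM²) cost is never paid):

```python
import numpy as np

def ranking_objective(R, R_hat, C, s):
    """f(theta) = sum_u sum_i c_ui sum_j s_j [(r_ui - r_uj) - (r_hat_ui - r_hat_uj)]^2,
    spelled out naively in O(N M^2)."""
    total = 0.0
    N, M = R.shape
    for u in range(N):
        for i in range(M):
            diff = (R[u, i] - R[u]) - (R_hat[u, i] - R_hat[u])  # vector over j
            total += C[u, i] * float(np.sum(s * diff ** 2))
    return total

# Hypothetical example on the 4x4 implicit matrix from the earlier slides
R = np.array([[1.0, 0, 0, 0], [0, 0, 1, 0], [1, 1, 0, 0], [0, 1, 0, 1]])
C = np.where(R > 0, 10.0, 1.0)         # assumed confidences
s = np.ones(4)                         # uniform item importance (assumed)
R_hat = np.full((4, 4), 0.5)           # a constant predictor ranks nothing correctly
loss = ranking_objective(R, R_hat, C, s)
```

A constant predictor leaves all estimated pairwise differences at zero, so only the true differences contribute to the loss.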
Comparison – RankALS vs. IALS

[Figures: results comparing RankALS and IALS]
Related Publications
• Alternating Least Squares with Coordinate Descent:
  I. Pilászy, D. Zibriczky, D. Tikk: Fast ALS-based matrix factorization for explicit and implicit feedback datasets. RecSys 2010
• Tensor Factorization:
  B. Hidasi, D. Tikk: Fast ALS-based tensor factorization for context-aware recommendation from implicit feedback. ECML PKDD 2012
• Personalized Ranking:
  G. Takács, D. Tikk: Alternating least squares for personalized ranking. RecSys 2012
• IPTV Case Study:
  D. Zibriczky, B. Hidasi, Z. Petres, D. Tikk: Personalized recommendation of linear content on interactive TV platforms: beating the cold start and noisy implicit user feedback. TVMMP @ UMAP 2012