Tutorial bpocf



Page 1: Tutorial bpocf

Collaborative Filtering with Binary, Positive-only Data

Tutorial @ ECML PKDD, September 2015, Porto

Koen Verstrepen, Kanishka Bhaduri, Bart Goethals

Page 2: Tutorial bpocf

Agenda
•  Introduction
•  Algorithms
•  Netflix

Page 3: Tutorial bpocf

Agenda
•  Introduction
•  Algorithms
•  Netflix

Page 4: Tutorial bpocf

Binary, Positive-Only Data


Page 5: Tutorial bpocf

Collaborative Filtering


Page 6: Tutorial bpocf

Movies


Page 7: Tutorial bpocf

Music


Page 8: Tutorial bpocf

Social Networks


Page 9: Tutorial bpocf

Tagging / Annotation


Paris

New York

Porto

Statue of Liberty

Eiffel Tower
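
Tag assignments like the ones above are themselves binary, positive-only data: a tag has either been applied to an object, or it is simply absent, and absence never means a confirmed "no". A minimal sketch in Python of how such annotation data can be held (the landmark-to-city pairing is assumed from the examples on this slide; Porto is left untagged):

# Tagging / annotation as binary, positive-only data:
# an (object, tag) pair is either observed or missing, never an explicit "no".
annotations = {
    ("New York", "Statue of Liberty"),
    ("Paris", "Eiffel Tower"),
}

def has_tag(obj, tag):
    # Absence only means "not observed", not "does not apply".
    return (obj, tag) in annotations

print(has_tag("Paris", "Eiffel Tower"))   # True
print(has_tag("Porto", "Eiffel Tower"))   # False: missing, not necessarily false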

Page 10: Tutorial bpocf

Also Explicit Feedback


Page 11: Tutorial bpocf

Matrix Representation


(1 = known preference, . = unknown)
1 . 1 . 1
. 1 . . .
1 . . 1 .
. 1 . . 1


R

Page 12: Tutorial bpocf

Unknown = 0: no negative information


(every unknown cell filled in as 0)
1 0 1 0 1
0 1 0 0 0
1 0 0 1 0
0 1 0 0 1


R
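To make this representation concrete, here is a minimal sketch (Python with NumPy/SciPy, not part of the original slides) of how such a binary, positive-only matrix R can be built from a log of (user, item) interactions. The pairs below reproduce the 4 x 5 example above; every unobserved cell simply stays 0 and carries no negative information.

import numpy as np
from scipy.sparse import csr_matrix

# Illustrative interaction log: each pair means "user u showed a preference for item i".
interactions = [(0, 0), (0, 2), (0, 4),
                (1, 1),
                (2, 0), (2, 3),
                (3, 1), (3, 4)]

n_users, n_items = 4, 5
rows, cols = zip(*interactions)
data = np.ones(len(interactions), dtype=np.int8)

# Binary, positive-only matrix: 1 = known preference, 0 = unknown (not a dislike).
R = csr_matrix((data, (rows, cols)), shape=(n_users, n_items))
print(R.toarray())

A sparse representation is the natural choice here, because in realistic datasets almost all cells are unknown.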

Page 13: Tutorial bpocf

Different Data

•  Ratings (e.g. movies, music, …)
•  Graded relevance, Positive-Only (e.g. minutes watched, times clicked, times listened, money spent, visits/week, …)
•  Binary, Positive-Only (e.g. seen, bought, watched, clicked, …)

[Example matrices on the slide: a 4 x 5 ratings matrix with entries 1–5; the corresponding graded-relevance, positive-only matrix, which keeps only the positive entries (the 4s and 5s); and the corresponding binary, positive-only matrix, with an X for every known preference.]
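Rating data and graded-relevance data are often reduced to this binary, positive-only form before applying the algorithms discussed later. Below is a minimal sketch (Python; the matrix values and the "4 stars or more" threshold are illustrative assumptions, not prescribed by the tutorial).

import numpy as np

# Illustrative 4 x 5 ratings matrix, loosely based on the example above (0 = no rating).
ratings = np.array([[1, 0, 5, 0, 4],
                    [0, 3, 3, 0, 0],
                    [4, 0, 0, 2, 2],
                    [5, 5, 0, 0, 1]])

# One common reduction: keep only clearly positive feedback (here: ratings of 4 or more);
# everything else becomes "unknown", not "disliked".
R = (ratings >= 4).astype(np.int8)
print(R)

The same idea applies to graded-relevance signals such as minutes watched or times clicked: choose a threshold above which the signal counts as a known preference.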

Page 14: Tutorial bpocf

Sparse: 10 in 10 000
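One way to read this figure (an illustrative interpretation, not spelled out on the slide): if a typical user has on the order of 10 known preferences in a catalogue of 10 000 items, then only 0.1% of the cells of R are filled.

known_per_user, n_items = 10, 10_000
density = known_per_user / n_items
print(f"density = {density:.1%}")  # prints: density = 0.1%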

Page 15: Tutorial bpocf

Agenda
•  Introduction
•  Algorithms
   –  Elegant example
   –  Models
   –  Deviation functions
   –  Difference with rating-based algorithms
   –  Parameter inference

•  Netflix

Page 16: Tutorial bpocf

Agenda
•  Introduction
•  Algorithms
   –  Elegant example
   –  Models
   –  Deviation functions
   –  Difference with rating-based algorithms
   –  Parameter inference

•  Netflix

Page 17: Tutorial bpocf

pLSA: An elegant example

[Hofmann 2004]

Page 18: Tutorial bpocf

pLSA: probabilistic Latent Semantic Analysis
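For the pLSA slides that follow, the notation below is assumed here (a standard reading for this setting, not a definition taken from the slides):

u \in U \;\text{(users)}, \qquad i \in I \;\text{(items)}, \qquad R \in \{0,1\}^{|U| \times |I|}, \qquad d \in \{1,\dots,D\} \;\text{(latent interests)}.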


Page 19: Tutorial bpocf

pLSA: latent interests
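pLSA explains a user's preferences through D latent interests: the probability that user u prefers item i mixes the item distributions of the interests u cares about. In Hofmann's aspect-model formulation [Hofmann 2004]:

P(i \mid u) \;=\; \sum_{d=1}^{D} P(d \mid u)\, P(i \mid d)

Both factors are probability tables: P(d | u) captures how much user u cares about each latent interest, and P(i | d) captures which items each interest makes likely.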


Page 20: Tutorial bpocf

pLSA generative model

Notation: users u ∈ U, items i ∈ I, binary feedback matrix R, and latent dimensions d = 1, …, D; the generative model involves the probabilities p(u | i), p(d | u), and p(i | d).
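To make the generative model concrete, here is a minimal sketch of how a fitted pLSA model scores items for a user, p(i | u) = ∑_{d=1}^{D} p(d | u) · p(i | d), assuming the factors p(d | u) and p(i | d) have already been fit (e.g., with EM); the array names and toy sizes below are illustrative placeholders only.

import numpy as np

# Toy sizes; in practice these come from the data (|U| users, |I| items, D latent dimensions).
n_users, n_items, D = 4, 6, 2
rng = np.random.default_rng(0)

# Illustrative placeholders for factors that would normally be learned (e.g., with EM).
# Each row is normalized so it is a proper probability distribution.
p_d_given_u = rng.random((n_users, D))
p_d_given_u /= p_d_given_u.sum(axis=1, keepdims=True)    # p(d | u): rows sum to 1 over d
p_i_given_d = rng.random((D, n_items))
p_i_given_d /= p_i_given_d.sum(axis=1, keepdims=True)    # p(i | d): rows sum to 1 over i

# pLSA score for every user-item pair: p(i | u) = sum_d p(d | u) * p(i | d).
p_i_given_u = p_d_given_u @ p_i_given_d                  # shape (n_users, n_items)

# Top-N recommendation on binary, positive-only data: rank unseen items by p(i | u).
R = (rng.random((n_users, n_items)) < 0.3).astype(int)   # toy binary feedback matrix
scores = np.where(R == 1, -np.inf, p_i_given_u)          # never re-recommend known positives
top_n = np.argsort(-scores, axis=1)[:, :3]               # top-3 item indices per user
print(top_n)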

Page 21: Tutorial bpocf

pLSA probabilistic weights

The pLSA weights are proper probabilities: p(d | u) ≥ 0 and p(i | d) ≥ 0, with ∑_{d=1}^{D} p(d | u) = 1 and ∑_{i ∈ I} p(i | d) = 1.

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONx

UIRDd = 1

d = D...uip(u | i)p(d | u)

p(d | u) � 0

p(i | d) � 0

DPd=1

p(d | u) = 1

Pi2I p(i | d) = 1

REFERENCESF. Aiolli. 2013. Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets. In RecSys.

273–280.Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In RecSys.

293–296.S.S. Anand and B. Mobasher. 2006. Contextual Recommendation. In WebMine. 142–160.C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n

recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on

top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems.39–46.

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONx

UIRDd = 1

d = D...uip(u | i)p(d | u)

p(d | u) � 0

p(i | d) � 0

DPd=1

p(d | u) = 1

Pi2I

p(i | d) = 1

REFERENCESF. Aiolli. 2013. Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets. In RecSys.

273–280.Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In RecSys.

293–296.S.S. Anand and B. Mobasher. 2006. Contextual Recommendation. In WebMine. 142–160.C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n

recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on

top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems.39–46.

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.
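As a quick illustration of these constraints, here is a minimal sketch (assuming NumPy and randomly initialized parameters; the array names and sizes are illustrative, not from the tutorial) that builds row-stochastic matrices for $p(d \mid u)$ and $p(i \mid d)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, D = 1000, 500, 20          # illustrative sizes for |U|, |I| and D

# p(d|u): one distribution over the D latent dimensions per user (rows sum to 1).
p_d_u = rng.random((n_users, D))
p_d_u /= p_d_u.sum(axis=1, keepdims=True)

# p(i|d): one distribution over the items per latent dimension (rows sum to 1).
p_i_d = rng.random((D, n_items))
p_i_d /= p_i_d.sum(axis=1, keepdims=True)

# Both simplex constraints from the slide hold by construction.
assert np.allclose(p_d_u.sum(axis=1), 1.0)
assert np.allclose(p_i_d.sum(axis=1), 1.0)
```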

Page 22: Tutorial bpocf

pLSA computing the like-probability

The like-probability of item $i$ for user $u$ is a mixture over the $D$ latent dimensions:

$$p(i \mid u) = \sum_{d=1}^{D} p(i \mid d) \cdot p(d \mid u).$$

The parameters are chosen to maximize the log-likelihood of the known preferences:

$$\max \sum_{R_{ui}=1} \log p(i \mid u).$$
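A minimal sketch of this computation, assuming the NumPy parameter matrices from the previous sketch (function and variable names are illustrative, not part of the tutorial): the full matrix of like-probabilities is a single matrix product, and recommending amounts to ranking a user's unseen items by $p(i \mid u)$.

```python
import numpy as np

def like_probabilities(p_d_u, p_i_d):
    """p_d_u: |U| x D matrix of p(d|u); p_i_d: D x |I| matrix of p(i|d).
    Returns the |U| x |I| matrix whose entry [u, i] equals p(i|u)."""
    return p_d_u @ p_i_d

def log_likelihood(p_i_u, R):
    """The training objective: sum of log p(i|u) over all known preferences R_ui = 1."""
    users, items = np.nonzero(R)
    return np.log(np.maximum(p_i_u[users, items], 1e-12)).sum()

def recommend_top_n(p_i_u, R, u, N=10):
    """Rank the items user u has not yet preferred by their like-probability."""
    scores = p_i_u[u].copy()
    scores[R[u] == 1] = -np.inf              # exclude items u already preferred
    return np.argsort(-scores)[:N]
```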

Page 23: Tutorial bpocf

pLSA computing the weights


(tempered) Expectation-Maximization (EM)
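A minimal sketch of how such an EM loop for the weights could look under the like-probability model above. The update equations follow the standard pLSA EM scheme; the tempering exponent `beta`, the function name, and all defaults are assumptions, not taken from the tutorial.

```python
import numpy as np

def plsa_tempered_em(R, D=20, n_iter=50, beta=1.0, seed=0):
    """R: |U| x |I| binary, positive-only matrix. Returns p(d|u) and p(i|d)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    p_d_u = rng.random((n_users, D)); p_d_u /= p_d_u.sum(axis=1, keepdims=True)
    p_i_d = rng.random((D, n_items)); p_i_d /= p_i_d.sum(axis=1, keepdims=True)
    users, items = np.nonzero(R)                      # only the known preferences
    for _ in range(n_iter):
        # E-step: posterior q(d|u,i) proportional to p(d|u) * p(i|d)^beta
        # for every observed (u, i); beta = 1 gives plain EM.
        q = p_d_u[users, :] * (p_i_d[:, items].T ** beta)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate p(d|u) from the posteriors of u's observed items.
        new_p_d_u = np.zeros_like(p_d_u)
        np.add.at(new_p_d_u, users, q)
        p_d_u = new_p_d_u / np.maximum(new_p_d_u.sum(axis=1, keepdims=True), 1e-12)
        # M-step: re-estimate p(i|d) from the posteriors of the users who preferred i.
        acc = np.zeros((n_items, D))
        np.add.at(acc, items, q)
        p_i_d = acc.T / np.maximum(acc.T.sum(axis=1, keepdims=True), 1e-12)
    return p_d_u, p_i_d
```

With `beta` slightly below 1 the E-step posteriors are smoothed, which is the usual motivation for tempering; `beta = 1` recovers plain EM.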

In matrix notation, the resulting score matrix factorizes into two factor matrices,

$$S_{ui} = S^{(1)}_{u*} \cdot S^{(2)}_{*i}, \qquad S = S^{(1)} S^{(2)},$$

and, more generally, into a sum of products of factor matrices,

$$S = \left(S^{(1,1)} \cdots S^{(1,F_1)}\right) + \cdots + \left(S^{(T,1)} \cdots S^{(T,F_T)}\right).$$

For pLSA, the training objective remains

$$\max \sum_{R_{ui}=1} \log p(i \mid u).$$
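A minimal sketch of this view (array and function names are assumptions): the pLSA like-probabilities are one instance of the two-factor form $S = S^{(1)} S^{(2)}$, and the more general composition is a sum of chained matrix products.

```python
import numpy as np
from functools import reduce

def plsa_as_factorization(p_d_u, p_i_d):
    """pLSA as S = S1 @ S2, with S1[u, d] = p(d|u) and S2[d, i] = p(i|d)."""
    S1, S2 = p_d_u, p_i_d
    return S1 @ S2                                    # S[u, i] plays the role of S_ui

def composite_scores(factor_chains):
    """factor_chains: a list of T lists of matrices; the t-th inner list
    [S^(t,1), ..., S^(t,F_t)] is multiplied left to right, and the T products are summed."""
    return sum(reduce(np.matmul, chain) for chain in factor_chains)
```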


Page 24: Tutorial bpocf

pLSA → General

Page 25: Tutorial bpocf

pLSA recap


[Graphical model: user $u$, latent dimension $d \in \{1, \dots, D\}$, item $i$; symbols $\mathcal{U}$, $\mathcal{I}$, $R$, $D$; parameters $p(d \mid u)$ and $p(i \mid d)$.]

$p(d \mid u) \geq 0, \qquad p(i \mid d) \geq 0$

$\sum_{d=1}^{D} p(d \mid u) = 1, \qquad \sum_{i \in \mathcal{I}} p(i \mid d) = 1$

$p(i \mid u) = \sum_{d=1}^{D} p(i \mid d) \cdot p(d \mid u)$

$\max \sum_{R_{ui}=1} \log p(i \mid u)$
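The earlier "(tempered) Expectation-Maximization" slide points at how these parameters are usually fit: alternate a posterior (E) step over the latent dimensions $d$ with a re-estimation (M) step of $p(d \mid u)$ and $p(i \mid d)$. Below is a minimal, hypothetical NumPy sketch of that EM loop on a binary, positive-only matrix $R$; the `beta` exponent is an assumed tempering knob (beta = 1 gives plain EM), and all names are illustrative rather than the tutorial's own code.

```python
import numpy as np

def plsa_em(R, D=20, n_iters=50, beta=1.0, seed=0):
    """Fit p(d|u) and p(i|d) by (tempered) EM on a binary matrix R (|U| x |I|).

    Sketch only: every R[u, i] == 1 entry is treated as an observed (u, i) pair,
    and beta tempers the E-step posteriors (beta = 1 is plain EM).
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape

    # Random, normalised initialisation of the two parameter tables.
    p_d_given_u = rng.random((n_users, D))
    p_d_given_u /= p_d_given_u.sum(axis=1, keepdims=True)
    p_i_given_d = rng.random((D, n_items))
    p_i_given_d /= p_i_given_d.sum(axis=1, keepdims=True)

    users, items = np.nonzero(R)  # all observed positive (u, i) pairs

    for _ in range(n_iters):
        # E-step: q(d | u, i) proportional to (p(i|d) p(d|u))^beta per observed pair.
        joint = (p_d_given_u[users] * p_i_given_d[:, items].T) ** beta  # (nnz, D)
        q = joint / joint.sum(axis=1, keepdims=True)

        # M-step: p(d|u) proportional to the sum of q(d|u,i) over u's observed items.
        user_counts = np.zeros((n_users, D))
        np.add.at(user_counts, users, q)
        p_d_given_u = user_counts / user_counts.sum(axis=1, keepdims=True).clip(min=1e-12)

        # M-step: p(i|d) proportional to the sum of q(d|u,i) over users, per topic d.
        item_counts = np.zeros((n_items, D))
        np.add.at(item_counts, items, q)
        p_i_given_d = (item_counts / item_counts.sum(axis=0, keepdims=True).clip(min=1e-12)).T

    return p_d_given_u @ p_i_given_d  # p(i|u) for every (u, i): the score matrix


# Toy usage on a random binary matrix.
R = (np.random.default_rng(1).random((30, 40)) < 0.1).astype(int)
S = plsa_em(R, D=5, n_iters=20)
print(S.shape)  # (30, 40); each row sums to ~1
```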


Page 26: Tutorial bpocf

pLSA recap

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONU

IR

REFERENCESF. Aiolli. 2013. Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets. In RecSys.

273–280.Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In RecSys.

293–296.S.S. Anand and B. Mobasher. 2006. Contextual Recommendation. In WebMine. 142–160.C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n

recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on

top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems.39–46.

M. Deshpande and G. Karypis. 2004. Item-Based Top-N Recommendation Algorithms. TOIS 22, 1 (2004),143–177.

C. Desrosiers and G. Karypis. 2011. A Comprehensive Survey of Neighborhood-based RecommendationMethods. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P.B. Kantor (Eds.).Springer, Boston, MA.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linearmodels via coordinate descent. Journal of statistical software 33, 1 (2010), 1.

E. Gaussier and C. Goutte. 2005. Relation between PLSA and NMF and implications. In SIGIR. 601–602.T. Hofmann. 1999. Probabilistic Latent Semantic Indexing. In SIGIR. 50–57.Thomas Hofmann. 2004. Latent Semantic Models for Collaborative Filtering. ACM Trans. Inf. Syst. 22, 1

(2004), 89–115.F. Hoppner. 2005. Association Rules. In The Data Mining and Knowledge Discovery Handbook, O. Mainmon

and L. Rokach (Eds.). Springer, New York, NY.Y. Hu, Y. Koren, and C. Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM.

263–272.D. Jannach, M. Zanker, A. Felfernig, and G. Frierich. 2011. Recommender Systems: An Introduction. Cam-

bridge University Press, New York, NY.

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONU

IR

REFERENCESF. Aiolli. 2013. Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets. In RecSys.

273–280.Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In RecSys.

293–296.S.S. Anand and B. Mobasher. 2006. Contextual Recommendation. In WebMine. 142–160.C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n

recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on

top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems.39–46.

M. Deshpande and G. Karypis. 2004. Item-Based Top-N Recommendation Algorithms. TOIS 22, 1 (2004),143–177.

C. Desrosiers and G. Karypis. 2011. A Comprehensive Survey of Neighborhood-based RecommendationMethods. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P.B. Kantor (Eds.).Springer, Boston, MA.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linearmodels via coordinate descent. Journal of statistical software 33, 1 (2010), 1.

E. Gaussier and C. Goutte. 2005. Relation between PLSA and NMF and implications. In SIGIR. 601–602.T. Hofmann. 1999. Probabilistic Latent Semantic Indexing. In SIGIR. 50–57.Thomas Hofmann. 2004. Latent Semantic Models for Collaborative Filtering. ACM Trans. Inf. Syst. 22, 1

(2004), 89–115.F. Hoppner. 2005. Association Rules. In The Data Mining and Knowledge Discovery Handbook, O. Mainmon

and L. Rokach (Eds.). Springer, New York, NY.Y. Hu, Y. Koren, and C. Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM.

263–272.D. Jannach, M. Zanker, A. Felfernig, and G. Frierich. 2011. Recommender Systems: An Introduction. Cam-

bridge University Press, New York, NY.

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONx

UIRD

REFERENCESF. Aiolli. 2013. Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets. In RecSys.

273–280.Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In RecSys.

293–296.S.S. Anand and B. Mobasher. 2006. Contextual Recommendation. In WebMine. 142–160.C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n

recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on

top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems.39–46.

M. Deshpande and G. Karypis. 2004. Item-Based Top-N Recommendation Algorithms. TOIS 22, 1 (2004),143–177.

C. Desrosiers and G. Karypis. 2011. A Comprehensive Survey of Neighborhood-based RecommendationMethods. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P.B. Kantor (Eds.).Springer, Boston, MA.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linearmodels via coordinate descent. Journal of statistical software 33, 1 (2010), 1.

E. Gaussier and C. Goutte. 2005. Relation between PLSA and NMF and implications. In SIGIR. 601–602.T. Hofmann. 1999. Probabilistic Latent Semantic Indexing. In SIGIR. 50–57.Thomas Hofmann. 2004. Latent Semantic Models for Collaborative Filtering. ACM Trans. Inf. Syst. 22, 1

(2004), 89–115.F. Hoppner. 2005. Association Rules. In The Data Mining and Knowledge Discovery Handbook, O. Mainmon

and L. Rokach (Eds.). Springer, New York, NY.Y. Hu, Y. Koren, and C. Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM.

263–272.

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONx

UIRDd = 1

d = D

REFERENCESF. Aiolli. 2013. Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets. In RecSys.

273–280.Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In RecSys.

293–296.S.S. Anand and B. Mobasher. 2006. Contextual Recommendation. In WebMine. 142–160.C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n

recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on

top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems.39–46.

M. Deshpande and G. Karypis. 2004. Item-Based Top-N Recommendation Algorithms. TOIS 22, 1 (2004),143–177.

C. Desrosiers and G. Karypis. 2011. A Comprehensive Survey of Neighborhood-based RecommendationMethods. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P.B. Kantor (Eds.).Springer, Boston, MA.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linearmodels via coordinate descent. Journal of statistical software 33, 1 (2010), 1.

E. Gaussier and C. Goutte. 2005. Relation between PLSA and NMF and implications. In SIGIR. 601–602.T. Hofmann. 1999. Probabilistic Latent Semantic Indexing. In SIGIR. 50–57.Thomas Hofmann. 2004. Latent Semantic Models for Collaborative Filtering. ACM Trans. Inf. Syst. 22, 1

(2004), 89–115.F. Hoppner. 2005. Association Rules. In The Data Mining and Knowledge Discovery Handbook, O. Mainmon

and L. Rokach (Eds.). Springer, New York, NY.

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONx

UIRDd = 1

d = D

REFERENCESF. Aiolli. 2013. Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets. In RecSys.

273–280.Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In RecSys.

293–296.S.S. Anand and B. Mobasher. 2006. Contextual Recommendation. In WebMine. 142–160.C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n

recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on

top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems.39–46.

M. Deshpande and G. Karypis. 2004. Item-Based Top-N Recommendation Algorithms. TOIS 22, 1 (2004),143–177.

C. Desrosiers and G. Karypis. 2011. A Comprehensive Survey of Neighborhood-based RecommendationMethods. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P.B. Kantor (Eds.).Springer, Boston, MA.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linearmodels via coordinate descent. Journal of statistical software 33, 1 (2010), 1.

E. Gaussier and C. Goutte. 2005. Relation between PLSA and NMF and implications. In SIGIR. 601–602.T. Hofmann. 1999. Probabilistic Latent Semantic Indexing. In SIGIR. 50–57.Thomas Hofmann. 2004. Latent Semantic Models for Collaborative Filtering. ACM Trans. Inf. Syst. 22, 1

(2004), 89–115.F. Hoppner. 2005. Association Rules. In The Data Mining and Knowledge Discovery Handbook, O. Mainmon

and L. Rokach (Eds.). Springer, New York, NY.

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONx

UIRDd = 1

d = D...

REFERENCESF. Aiolli. 2013. Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets. In RecSys.

273–280.Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In RecSys.

293–296.S.S. Anand and B. Mobasher. 2006. Contextual Recommendation. In WebMine. 142–160.C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n

recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on

top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems.39–46.

M. Deshpande and G. Karypis. 2004. Item-Based Top-N Recommendation Algorithms. TOIS 22, 1 (2004),143–177.

C. Desrosiers and G. Karypis. 2011. A Comprehensive Survey of Neighborhood-based RecommendationMethods. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P.B. Kantor (Eds.).Springer, Boston, MA.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linearmodels via coordinate descent. Journal of statistical software 33, 1 (2010), 1.

E. Gaussier and C. Goutte. 2005. Relation between PLSA and NMF and implications. In SIGIR. 601–602.T. Hofmann. 1999. Probabilistic Latent Semantic Indexing. In SIGIR. 50–57.Thomas Hofmann. 2004. Latent Semantic Models for Collaborative Filtering. ACM Trans. Inf. Syst. 22, 1

(2004), 89–115.

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONx

UIRDd = 1

d = D...

REFERENCESF. Aiolli. 2013. Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets. In RecSys.

273–280.Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In RecSys.

293–296.S.S. Anand and B. Mobasher. 2006. Contextual Recommendation. In WebMine. 142–160.C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n

recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on

top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems.39–46.

M. Deshpande and G. Karypis. 2004. Item-Based Top-N Recommendation Algorithms. TOIS 22, 1 (2004),143–177.

C. Desrosiers and G. Karypis. 2011. A Comprehensive Survey of Neighborhood-based RecommendationMethods. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P.B. Kantor (Eds.).Springer, Boston, MA.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linearmodels via coordinate descent. Journal of statistical software 33, 1 (2010), 1.

E. Gaussier and C. Goutte. 2005. Relation between PLSA and NMF and implications. In SIGIR. 601–602.T. Hofmann. 1999. Probabilistic Latent Semantic Indexing. In SIGIR. 50–57.Thomas Hofmann. 2004. Latent Semantic Models for Collaborative Filtering. ACM Trans. Inf. Syst. 22, 1

(2004), 89–115.

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONx

UIRDd = 1

d = D...


\nabla D(S, R) = \nabla \sum_{u \in U} \sum_{i \in I} D_{ui}(S, R) = \sum_{u \in U} \sum_{i \in I} \nabla D_{ui}(S, R)

\nabla D(S, R) = \nabla \sum_{u \in U} \sum_{\substack{i \in I \\ R_{ui} = 1}} \sum_{j \in I} D_{uij}(S, R) = \sum_{u \in U} \sum_{\substack{i \in I \\ R_{ui} = 1}} \sum_{j \in I} \nabla D_{uij}(S, R)

= \int (\,\cdot\,) \cdot p(\,\cdot \mid \cdot\,) \cdot d(\,\cdot\,)

D(S, R) = D_{KL}\big(Q(S) \,\|\, p(S \mid R)\big)

\ldots

\max \text{ for every } (u, i)

\max \log p(S \mid R)

\max \log \prod_{u \in U} \prod_{i \in I} S_{ui}^{\alpha R_{ui}} (1 - S_{ui})

\log \prod_{u \in U} \prod_{i \in I} S_{ui}^{\alpha R_{ui}} (1 - S_{ui})

\sum_{u \in U} \sum_{i \in I} \alpha R_{ui} \log S_{ui} + \log(1 - S_{ui}) + \lambda \left( \|S^{(1)}\|_F^2 + \|S^{(2)}\|_F^2 \right)

2^{|u|} \text{ models/user}, \quad 2^{|u|}

S = S^{(1)} * S^{(2)} + S^{(3)} + S^{(4)} S^{(5)} S^{(6)}

p(d_1 \mid u), \; p(d_2 \mid u), \; \ldots, \; p(d_D \mid u)
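The weighted, regularized log-likelihood above (the alpha-weighted terms plus Frobenius-norm penalties on the two factor matrices) can be spelled out numerically. A minimal sketch follows (NumPy, hypothetical names); it assumes S is the element-wise sigmoid of the factorization S^(1) S^(2), which is one possible reading of the notation rather than the draft's definitive model, and it subtracts the penalty, the usual sign convention when the expression is maximized.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical toy setup: binary, positive-only matrix R and two factor matrices.
rng = np.random.default_rng(0)
n_users, n_items, D = 5, 8, 3
R = rng.integers(0, 2, size=(n_users, n_items)).astype(float)

S1 = 0.1 * rng.standard_normal((n_users, D))   # S^(1): user factors
S2 = 0.1 * rng.standard_normal((D, n_items))   # S^(2): item factors
alpha, lam = 10.0, 0.01

# S_ui in (0, 1), here modelled as the sigmoid of the factorization S^(1) S^(2).
S = sigmoid(S1 @ S2)

# alpha * R_ui * log S_ui + log(1 - S_ui), summed over all (u, i), with the
# Frobenius-norm penalty on both factor matrices subtracted.
objective = np.sum(alpha * R * np.log(S) + np.log(1.0 - S)) \
            - lam * (np.linalg.norm(S1) ** 2 + np.linalg.norm(S2) ** 2)
print(objective)
```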


p(i \mid d_1), \; p(i \mid d_2), \; \ldots, \; p(i \mid d_D)

REFERENCES
Fabio Aiolli. 2013. Efficient top-n recommendation for very large scale binary rated datasets. In Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 273–280.
Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 293–296.
Sarabjot Singh Anand and Bamshad Mobasher. 2007. Contextual Recommendation. Springer.
C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.
Evangelia Christakopoulou and George Karypis. 2014. HOSLIM: Higher-order sparse linear method for top-n recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.
Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 39–46.
Mukund Deshpande and George Karypis. 2004. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 143–177.
Christian Desrosiers and George Karypis. 2011. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook. Springer, 107–144.
Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1 (2010), 1.
Eric Gaussier and Cyril Goutte. 2005. Relation between PLSA and NMF and implications. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 601–602.
Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 50–57.
Thomas Hofmann. 2004. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 89–115.
Frank Hoppner. 2005. Association Rules. In The Data Mining and Knowledge Discovery Handbook, Oded Maimon and Lior Rokach (Eds.). Springer, New York, NY.
Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE, 263–272.
Dietmar Jannach, Markus Zanker, Alexander Felfernig, and Gerhard Friedrich. 2010. Recommender Systems: An Introduction. Cambridge University Press.
Santosh Kabbur and George Karypis. 2014. NLMF: Nonlinear matrix factorization methods for top-N recommender systems. In Proceedings of the 2014 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, 167–174.
Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: Factored item similarity models for top-n recommender systems. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 659–667.
Noam Koenigstein, Nir Nice, Ulrich Paquet, and Nir Schleyen. 2012. The Xbox recommender system. In Proceedings of the Sixth ACM Conference on Recommender Systems. ACM, 281–284.
Yehuda Koren and Robert Bell. 2011. Advances in collaborative filtering. In Recommender Systems Handbook. Springer, 145–186.
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.


Page 27: Tutorial bpocf

pLSA recap

U, I, R, D; latent dimensions d = 1, \ldots, D; user u, item i; parameters p(d \mid u) and p(i \mid d).

p(i \mid u) = \sum_{d=1}^{D} p(i \mid d) \cdot p(d \mid u)

\max \sum_{R_{ui} = 1} \log p(i \mid u)
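As a companion to the recap, a minimal EM sketch for fitting these parameters on binary, positive-only data (NumPy, hypothetical function and variable names; tempered variants and other refinements from the literature are left out): the E-step computes the responsibility of each latent dimension d for an observed (u, i) pair, and the M-step re-estimates p(d | u) and p(i | d) from those responsibilities.

```python
import numpy as np

def plsa_em(R, D=10, n_iter=50, seed=0, eps=1e-12):
    """One possible EM loop for pLSA on a binary, positive-only matrix R (|U| x |I|)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    p_d_given_u = rng.dirichlet(np.ones(D), size=n_users)   # (|U|, D)
    p_i_given_d = rng.dirichlet(np.ones(n_items), size=D)   # (D, |I|)

    for _ in range(n_iter):
        # E-step: responsibilities q(d | u, i), normalized over d.
        joint = p_d_given_u[:, :, None] * p_i_given_d[None, :, :]   # (|U|, D, |I|)
        q = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: re-estimate the parameters from the observed pairs only (R_ui = 1).
        weighted = q * R[:, None, :]                                 # zero out unobserved pairs
        p_d_given_u = weighted.sum(axis=2)                           # (|U|, D)
        p_d_given_u /= p_d_given_u.sum(axis=1, keepdims=True) + eps
        p_i_given_d = weighted.sum(axis=0)                           # (D, |I|)
        p_i_given_d /= p_i_given_d.sum(axis=1, keepdims=True) + eps

    return p_d_given_u, p_i_given_d

# Usage on a toy matrix: scores for ranking are p(i | u) = p(d | u) @ p(i | d).
R = (np.random.default_rng(1).random((20, 30)) < 0.2).astype(float)
P_du, P_id = plsa_em(R, D=5)
scores = P_du @ P_id
```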


ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

Binary, Positive-Only Collaborative Filtering: A Theoretical and Experimental Comparison of the State Of The Art1:35

The gradient of the objective $\mathcal{D}(S,R)$ decomposes term-wise, either over user–item pairs or over (user, item, item) triples:

$\nabla \mathcal{D}(S,R) \;=\; \nabla \sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}} \mathcal{D}_{ui}(S,R) \;=\; \sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}} \nabla \mathcal{D}_{ui}(S,R)$

$\nabla \mathcal{D}(S,R) \;=\; \nabla \sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I},\, R_{ui}=1} \sum_{j \in \mathcal{I}} \mathcal{D}_{uij}(S,R) \;=\; \sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I},\, R_{ui}=1} \sum_{j \in \mathcal{I}} \nabla \mathcal{D}_{uij}(S,R)$

$=\; \int (\cdot)\; p(\cdot \mid \cdot)\; d(\cdot)$  [the integrand's symbols did not survive extraction]

$\mathcal{D}(S,R) \;=\; D_{\mathrm{KL}}\big(Q(S) \,\|\, p(S \mid R)\big)$

max for every $(u, i)$

$\max\; \log p(S \mid R)$

$\max\; \log \prod_{u \in \mathcal{U}} \prod_{i \in \mathcal{I}} S_{ui}^{\alpha R_{ui}} (1 - S_{ui})$

$\sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}} \alpha R_{ui} \log S_{ui} + \log(1 - S_{ui}) \;+\; \lambda \big( \lVert S^{(1)} \rVert_F^2 + \lVert S^{(2)} \rVert_F^2 \big)$

$2^{|u|}$ models/user

$S = S^{(1)} \cdot S^{(2)} + S^{(3)} + S^{(4)} S^{(5)} S^{(6)}$

$p(d_1 \mid u),\; p(d_2 \mid u),\; \dots,\; p(d_D \mid u)$  [row labels of the user–dimension factor in the accompanying figure]


$p(i \mid d_1),\; p(i \mid d_2),\; \dots,\; p(i \mid d_D)$  [column labels of the dimension–item factor, from the same figure]
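To make the regularized log-likelihood above concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the score matrix is factored as $S = \mathrm{sigmoid}(PQ)$ so that every $S_{ui} \in (0,1)$, treats the $\lambda$ term as an L2 penalty subtracted under maximization, and uses plain full-batch gradient ascent; the names P and Q, the data, $\alpha$, $\lambda$ and the learning rate are all illustrative choices.

```python
import numpy as np

# Sketch of the objective  sum_{u,i} [ alpha*R_ui*log(S_ui) + log(1 - S_ui) ]  with an
# L2 penalty on the two factors. Everything below (shapes, S = sigmoid(P @ Q), alpha,
# lam, lr) is an assumption made for illustration, not taken from the excerpt.

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 5
alpha, lam, lr = 10.0, 0.01, 0.05

R = (rng.random((n_users, n_items)) < 0.05).astype(float)   # binary, positive-only matrix
P = 0.1 * rng.standard_normal((n_users, k))                  # plays the role of S^(1)
Q = 0.1 * rng.standard_normal((k, n_items))                  # plays the role of S^(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    S = sigmoid(P @ Q)
    dS = alpha * R / S - 1.0 / (1.0 - S)      # d/dS of alpha*R*log(S) + log(1 - S)
    dZ = dS * S * (1.0 - S)                   # chain rule through the sigmoid
    gradP = dZ @ Q.T - 2.0 * lam * P
    gradQ = P.T @ dZ - 2.0 * lam * Q
    P += lr * gradP                           # full-batch gradient ascent step
    Q += lr * gradQ

S = sigmoid(P @ Q)
objective = (alpha * R * np.log(S) + np.log(1.0 - S)).sum() - lam * ((P**2).sum() + (Q**2).sum())
print(f"objective after training: {objective:.2f}")
```

Sampling individual $(u, i)$ terms (or $(u, i, j)$ triples) instead of evaluating the full sums gives the stochastic variant suggested by the term-wise gradient decomposition above.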



Page 28: Tutorial bpocf

pLSA matrix factorization notation


9. SYMBOLS FOR PRESENTATIONx

UIRDd = 1

d = D...uip(u | i)p(d | u)

p(d | u) � 0

p(i | d) � 0

DPd=1

p(d | u) = 1

Pi2I

p(i | d) = 1

p(i|u) =

DX

d=1

p(i|d) · p(d|u)

max

PRui=1

log p(i | u)

|U| ⇥ |I||U| ⇥ DD ⇥ |I||U||I|D

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONx

UIRDd = 1

d = D...uip(u | i)p(d | u)

p(d | u) � 0

p(i | d) � 0

DPd=1

p(d | u) = 1

Pi2I

p(i | d) = 1

p(i|u) =

DX

d=1

p(i|d) · p(d|u)

max

PRui=1

log p(i | u)

|U| ⇥ |I||U| ⇥ DD ⇥ |I||U||I|D

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:30 K. Verstrepen et al.

— Convince the reader ranking is more important than RMSE or MSE.— data splits (leave-one-out, 5 fold, ...)— Pradel et al. :ranking with non-random missing ratings: influence of popularity and

positivity on evaluation metrics— Marlin et al. :Collaaborative prediction and ranking with non-random missing data— Marlin et al. :collaborative filtering and the missing at random assumption— Steck: Training and testing of recommender systems on data missing not at random— We should emphasise how choosing hyperparameters is often done in a way that

causes leakage.

7.2. online— Who: Kanishka?— Convince the reader this is much better than offline, how to do it etc.

8. EXPERIMENTAL EVALUATION— Who: ?— THE offline comparison of OCCF algorithms. Many datasets, many algorithms, many

evaluation measures, multiple data split methods, sufficiently randomized.— also empirically evaluate the explanations extracted.

9. SYMBOLS FOR PRESENTATIONx

UIRDd = 1

d = D...uip(u | i)p(d | u)

p(d | u) � 0

p(i | d) � 0

DPd=1

p(d | u) = 1

Pi2I

p(i | d) = 1

p(i|u) =

DX

d=1

p(i|d) · p(d|u)

max

PRui=1

log p(i | u)

|U| ⇥ |I||U| ⇥ DD ⇥ |I||U||I|D

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

Page 29: Tutorial bpocf

pLSA matrix factorization notation

The pLSA model can be written in matrix factorization notation. Collect the probabilities p(d | u) in a |U| × D matrix S^{(1)} and the probabilities p(i | d) in a D × |I| matrix S^{(2)}; the |U| × |I| matrix S of scores p(i | u) is then their product:

\[
S_{ui} = S^{(1)}_{u*} \cdot S^{(2)}_{*i}, \qquad S = S^{(1)} S^{(2)}
\]
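To make the correspondence explicit, the short sketch below (made-up names and toy dimensions, assuming NumPy) stores p(d | u) and p(i | d) as row-stochastic matrices S1 and S2 and checks that their product has the stated |U| × |I| shape, that each row of the product is again a distribution over items, and that a single entry equals the row–column dot product.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, D = 4, 6, 2                    # |U|, |I|, D (toy values)

S1 = rng.dirichlet(np.ones(D), size=n_users)     # |U| x D, row u holds p(d|u)
S2 = rng.dirichlet(np.ones(n_items), size=D)     # D x |I|, row d holds p(i|d)

S = S1 @ S2                                      # |U| x |I|, S_ui = p(i|u)

assert S.shape == (n_users, n_items)
# Because both factors are row-stochastic, every row of S also sums to 1.
assert np.allclose(S.sum(axis=1), 1.0)

# A single score S_ui is the dot product of row u of S1 and column i of S2.
u, i = 2, 5
assert np.isclose(S[u, i], S1[u, :] @ S2[:, i])
```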


Page 30: Tutorial bpocf

Scores = Matrix Factorization

In general, a matrix factorization model computes the score of user u for item i as the dot product of row u of a |U| × D factor matrix S^{(1)} and column i of a D × |I| factor matrix S^{(2)}, so that the full |U| × |I| score matrix is

\[
S_{ui} = S^{(1)}_{u*} \cdot S^{(2)}_{*i}, \qquad S = S^{(1)} S^{(2)}
\]

Page 31: Tutorial bpocf

Deviation Function

$S_{ui} = S^{(1)}_{u*} \cdot S^{(2)}_{*i}, \qquad S = S^{(1)} S^{(2)}$

$S = \left(S^{(1,1)} \cdots S^{(1,F_1)}\right) + \cdots + \left(S^{(T,1)} \cdots S^{(T,F_T)}\right)$

$\max \sum_{R_{ui}=1} \log p(i \mid u) \;\Leftrightarrow\; \max \sum_{R_{ui}=1} \log S_{ui} \;\Leftrightarrow\; \min \, -\sum_{R_{ui}=1} \log S_{ui}$

$D(S,R) = -\sum_{R_{ui}=1} \log S_{ui}, \qquad \min \, D(S,R)$
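To make the deviation function concrete, here is a minimal sketch (not code from the tutorial) that evaluates $D(S,R) = -\sum_{R_{ui}=1} \log S_{ui}$ for a toy factorization $S = S^{(1)} S^{(2)}$; the variable names and the random toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary, positive-only feedback matrix R (|U| x |I|).
R = (rng.random((5, 8)) < 0.3).astype(float)

# A factorization model S = S1 @ S2 with a small number of latent dimensions.
D_dim = 3
S1 = rng.random((5, D_dim))            # user factors, |U| x D
S2 = rng.random((D_dim, 8))            # item factors, D x |I|
S1 /= S1.sum(axis=1, keepdims=True)    # row-normalize so every score lies in (0, 1]
S2 /= S2.sum(axis=1, keepdims=True)

S = S1 @ S2                            # user-item score matrix

def deviation(S, R):
    """D(S, R) = -sum over the observed (u, i) pairs of log S_ui."""
    return -np.sum(np.log(S[R == 1]))

print(deviation(S, R))
```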

Page 32: Tutorial bpocf

Summary: 2 Basic Building Blocks

Factorization Model

Deviation Function

Page 33: Tutorial bpocf

Agenda •  Introduction •  Algorithms – Elegant example – Models – Deviation functions – Parameter inference

•  Netflix

Page 34: Tutorial bpocf

Tour of The Models

Page 35: Tutorial bpocf

pLSA soft clustering interpretation

user-item scores

user-cluster affinity

item-cluster affinity

mixed clusters

[Hofmann 2004] [Hu et al. 2008]

[Pan et al. 2008] [Sindhwani et al. 2010]

[Yao et al. 2014] [Pan and Scholz 2009]

[Rendle et al. 2009] [Shi et al. 2012]

[Takàcs and Tikk 2012]

Page 36: Tutorial bpocf

pLSA soft clustering interpretation

pLSA model with $D$ soft (mixed) clusters, $d = 1, \ldots, D$:

$p(i \mid u) = \sum_{d=1}^{D} p(i \mid d) \cdot p(d \mid u)$

$p(d \mid u) \geq 0, \quad p(i \mid d) \geq 0, \quad \sum_{d=1}^{D} p(d \mid u) = 1, \quad \sum_{i \in I} p(i \mid d) = 1$

$\max \sum_{R_{ui}=1} \log p(i \mid u)$

The $|U| \times |I|$ score matrix is the product of a $|U| \times D$ user-cluster affinity matrix and a $D \times |I|$ item-cluster affinity matrix (the toy example on the slide uses $D = 4$; its example matrix entries are omitted here).

user-item scores

user-cluster affinity

item-cluster affinity
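A minimal sketch of how the pLSA score matrix on this slide is assembled from its two affinity matrices; the EM training step is omitted and all numbers are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
num_users, num_items, D = 6, 10, 4     # D latent (soft) clusters

# p(d|u): each user's distribution over clusters -> |U| x D, rows sum to 1.
p_d_given_u = rng.random((num_users, D))
p_d_given_u /= p_d_given_u.sum(axis=1, keepdims=True)

# p(i|d): each cluster's distribution over items -> D x |I|, rows sum to 1.
p_i_given_d = rng.random((D, num_items))
p_i_given_d /= p_i_given_d.sum(axis=1, keepdims=True)

# Score matrix: S_ui = p(i|u) = sum_d p(i|d) * p(d|u).
S = p_d_given_u @ p_i_given_d
assert np.allclose(S.sum(axis=1), 1.0)  # each row is a distribution over items
```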

Page 37: Tutorial bpocf

Hard Clustering

user-item scores

user-uCluster membership

item-iCluster membership

item probabilities

uCluster-iCluster similarity

[Hofmann 2004] [Hofmann 1999]

[Ungar and Foster 1998]
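One possible reading of the four building blocks on this slide, as a rough sketch only; the actual parameterizations in [Hofmann 2004; Ungar and Foster 1998] differ in their details, and every matrix here is a made-up placeholder.

```python
import numpy as np

rng = np.random.default_rng(2)
num_users, num_items, UC, IC = 6, 10, 3, 4   # hard user clusters / item clusters

# One-hot (hard) memberships: each user in one user cluster, each item in one item cluster.
user_cluster = np.eye(UC)[rng.integers(0, UC, num_users)]   # |U| x UC
item_cluster = np.eye(IC)[rng.integers(0, IC, num_items)]   # |I| x IC

# Affinity between user clusters and item clusters, and per-item probabilities.
cluster_affinity = rng.random((UC, IC))                     # UC x IC
item_prob = rng.random(num_items)                           # relative item popularity

# Combine the four building blocks: S_ui = affinity(cluster(u), cluster(i)) * p(i).
S = (user_cluster @ cluster_affinity @ item_cluster.T) * item_prob
```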

Page 38: Tutorial bpocf

Item Similarity dense

user-item scores original rating matrix item-item similarity

[Rendle et al. 2009] [Aiolli 2013]

Page 39: Tutorial bpocf

Item Similarity sparse

user-item scores item-item similarity original rating matrix

[Deshpande and Karypis 2004] [Sigurbjörnsson and Van Zwol 2008]

[Ning and Karypis 2011]
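A rough sketch of a sparse item-similarity scorer in the spirit of item-based kNN: cosine similarities between item columns, pruned to the $k$ strongest neighbours per item. Note that SLIM [Ning and Karypis 2011] learns the sparse similarity matrix instead of computing cosines; the cosine choice, $k$, and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
R = (rng.random((8, 12)) < 0.3).astype(float)   # toy binary rating matrix
k = 3                                           # neighbors kept per item

# Cosine similarity between the item columns of R.
norms = np.linalg.norm(R, axis=0) + 1e-12
sim = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)

# Sparsify: keep only the k largest similarities in each row.
for i in range(sim.shape[0]):
    keep = np.argsort(sim[i])[-k:]
    mask = np.zeros_like(sim[i], dtype=bool)
    mask[keep] = True
    sim[i, ~mask] = 0.0

# user-item scores = original rating matrix times (sparse) item-item similarity
S = R @ sim
```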

Page 40: Tutorial bpocf

User Similarity sparse

user-item scores column normalized original rating matrix

(row normalized) user-user similarity

[Sarwar et al. 2000]
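A rough user-based counterpart, sketching the slide's recipe of a (row-normalized) user-user similarity applied to a column-normalized rating matrix; the cosine similarity and the normalization details are assumptions, not necessarily those of [Sarwar et al. 2000].

```python
import numpy as np

rng = np.random.default_rng(4)
R = (rng.random((8, 12)) < 0.3).astype(float)   # toy binary rating matrix

# Column-normalize R so popular items do not dominate (one possible choice).
R_norm = R / (R.sum(axis=0, keepdims=True) + 1e-12)

# Cosine similarity between users, zero self-similarity, then row-normalize.
norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-12
sim = (R @ R.T) / (norms * norms.T)
np.fill_diagonal(sim, 0.0)
sim /= sim.sum(axis=1, keepdims=True) + 1e-12

# user-item scores = (row-normalized) user-user similarity times column-normalized R
S = sim @ R_norm
```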

Page 41: Tutorial bpocf

User Similarity dense

user-item scores column normalized original rating matrix

(row normalized) user-user similarity

[Aiolli 2014] [Aiolli 2013]

Page 42: Tutorial bpocf

User+Item Similarity

[Verstrepen and Goethals 2014]

Page 43: Tutorial bpocf

Factored Item Similarity symmetrical

user-item scores original rating matrix Identical item profiles


item clusters Item-cluster affinity Similarity by dotproduct

[Weston et al. 2013b]
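A minimal sketch of the symmetric factored variant: the dense item-item similarity is replaced by dot products of one shared set of low-dimensional item profiles. Dimensions and data are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
num_users, num_items, D = 8, 12, 4
R = (rng.random((num_users, num_items)) < 0.3).astype(float)

W = rng.standard_normal((num_items, D))   # one item-profile matrix, used on both sides

# similarity by dot product of identical item profiles
item_sim = W @ W.T

# user-item scores = original rating matrix times factored item-item similarity
S = R @ item_sim                          # equivalently (R @ W) @ W.T, which is cheaper
```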

Page 44: Tutorial bpocf

Factored Item Similarity asymmetrical + bias

user-item scores

original rating matrix row normalized

Item profile if known preference

Item profile if candidate item biases user biases

[Kabbur et al. 2013]
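A rough FISM-style scoring sketch assembled from the slide's building blocks (two item-profile matrices, user and item biases, a normalized user history). It follows the general form of [Kabbur et al. 2013] but is only an illustration with random, unfitted parameters.

```python
import numpy as np

rng = np.random.default_rng(6)
num_users, num_items, D = 8, 12, 4
R = (rng.random((num_users, num_items)) < 0.3).astype(float)

P = rng.standard_normal((num_items, D))   # item profiles when acting as known preferences
Q = rng.standard_normal((num_items, D))   # item profiles when acting as candidate items
b_u = rng.standard_normal(num_users)      # user biases
b_i = rng.standard_normal(num_items)      # item biases
alpha = 0.5                               # normalization exponent

def score(u, i):
    history = np.flatnonzero(R[u])
    history = history[history != i]       # exclude the candidate item itself
    if history.size == 0:
        return b_u[u] + b_i[i]
    user_profile = P[history].sum(axis=0) / (history.size ** alpha)
    return b_u[u] + b_i[i] + user_profile @ Q[i]

S = np.array([[score(u, i) for i in range(num_items)] for u in range(num_users)])
```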

Page 45: Tutorial bpocf

Higher Order Item Similarity inner product

user-item scores extended rating matrix Itemset-item similarity

selected higher order itemsets [Christakopoulou and Karypis 2014]

[Deshpande and Karypis 2004] [Menezes et al. 2010] [van Leeuwen and Puspitaningrum 2012] [Lin et al. 2002]
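One way to read this slide, sketched with toy data: extend the rating matrix with columns for selected higher-order itemsets (a user "has" an itemset when she has all of its items) and take an inner product with an itemset-to-item similarity matrix. The itemset selection rule and the random similarities are assumptions.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
num_users, num_items = 8, 6
R = (rng.random((num_users, num_items)) < 0.4).astype(float)

# Select some higher-order itemsets: here, all pairs owned by at least two users.
candidate_sets = [list(p) for p in combinations(range(num_items), 2)]
itemsets = [s for s in candidate_sets if R[:, s].all(axis=1).sum() >= 2]

# Extend R with one column per selected itemset (1 if the user has the whole itemset).
if itemsets:
    extra = np.column_stack([R[:, s].all(axis=1).astype(float) for s in itemsets])
    R_ext = np.hstack([R, extra])
else:
    R_ext = R

# Itemset-to-item similarity (random placeholder; in practice learned or rule-based).
sim = rng.random((R_ext.shape[1], num_items))

# user-item scores = extended rating matrix times itemset-item similarity (inner product)
S = R_ext @ sim
```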

Page 46: Tutorial bpocf

Higher Order Item Similarity max product

[figure: toy example of max-product scoring, where the user-item score is the maximum over the itemset contributions rather than their sum; example matrix entries omitted]

[Sarwar et al. 2001] [Mobasher et al. 2001]

Page 47: Tutorial bpocf

Higher Order User Similarity inner product

user-item scores user-userset similarity extended rating matrix

selected higher order usersets

[Lin et al. 2002]

Page 48: Tutorial bpocf

Best of few user models non-linearity by max

[Weston et al. 2013a]

max for every $(u, i)$

~ 3 models/user
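A minimal sketch of the "best of a few user models" idea: each user keeps $T$ latent vectors and the score for $(u, i)$ is the maximum of their dot products with the item vector. $T$, the dimensions, and the random factors are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
num_users, num_items, D, T = 6, 10, 4, 3     # T ~ 3 interest vectors per user

U = rng.standard_normal((num_users, T, D))   # T latent vectors per user
V = rng.standard_normal((num_items, D))      # one latent vector per item

# S_ui = max over the user's T models of the usual dot-product score.
S = np.einsum('utd,id->uti', U, V).max(axis=1)
```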

Page 49: Tutorial bpocf

Best of all user models efficient max out of

[Verstrepen and Goethals 2015]

$2^{|u|}$ models/user, with the max for every $(u, i)$ computed efficiently

Page 50: Tutorial bpocf

Combination item vectors can be shared

[Kabbur and Karypis 2014]


Page 51: Tutorial bpocf

Sigmoid link function for probabilistic frameworks

[Johnson 2014]
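A minimal scoring sketch for a sigmoid link in the style of logistic MF [Johnson 2014]: the dot product of user and item factors plus biases is squashed through a sigmoid so the score can be read as a probability. The parameters here are random, not fitted.

```python
import numpy as np

rng = np.random.default_rng(9)
num_users, num_items, D = 6, 10, 4

X = rng.standard_normal((num_users, D))   # user factors
Y = rng.standard_normal((num_items, D))   # item factors
b_u = rng.standard_normal(num_users)      # user biases
b_i = rng.standard_normal(num_items)      # item biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# S_ui = sigma(x_u . y_i + b_u + b_i): scores now live in (0, 1).
S = sigmoid(X @ Y.T + b_u[:, None] + b_i[None, :])
```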


Page 52: Tutorial bpocf

The gradient of a deviation function that decomposes into per-pair (or per-triple) terms is the sum of the per-term gradients:

\nabla D(S,R) = \nabla \sum_{u \in U} \sum_{i \in I} D_{ui}(S,R) = \sum_{u \in U} \sum_{i \in I} \nabla D_{ui}(S,R)

\nabla D(S,R) = \nabla \sum_{u \in U} \sum_{i \in I, R_{ui}=1} \sum_{j \in I} D_{uij}(S,R) = \sum_{u \in U} \sum_{i \in I, R_{ui}=1} \sum_{j \in I} \nabla D_{uij}(S,R)

Instead of a point estimate of the model parameters \theta, the score can also be computed as an expectation over their posterior:

S_{ui} = \int S_{ui}(\theta) \cdot p(\theta \mid R) \, d\theta


Pdf over parameters instead of point estimation

[Koenigstein et al. 2012] [Paquet and Koenigstein 2013]
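A toy illustration of what "pdf over parameters instead of point estimation" means in practice: if a Bayesian inference procedure (as in the cited Xbox work) produces samples of the factor matrices from their posterior, the recommendation scores are averaged over those samples instead of being computed from a single point estimate. This is only a minimal sketch, with random draws standing in for real posterior samples; the function name is mine.

    import numpy as np

    def expected_scores(factor_samples):
        """Average the score matrix S = S1 @ S2 over posterior draws.

        factor_samples: iterable of (S1, S2) pairs, one pair per draw,
        with S1 of shape (|U|, d) and S2 of shape (d, |I|)."""
        total, count = None, 0
        for S1, S2 in factor_samples:
            S = S1 @ S2                       # score matrix for this draw
            total = S if total is None else total + S
            count += 1
        return total / count                  # Monte-Carlo estimate of E[S | R]

    # Toy usage: random "posterior draws" stand in for real Bayesian inference.
    rng = np.random.default_rng(0)
    draws = [(rng.random((5, 2)), rng.random((2, 8))) for _ in range(100)]
    S_expected = expected_scores(draws)
    print(S_expected.shape)                   # (5, 8)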

Page 53: Tutorial bpocf

Summary: 2 Basic Building Blocks

Factorization Model

Deviation Function

Page 54: Tutorial bpocf

Summary: 2 Basic Building Blocks

Factorization Model

Deviation Function

a.k.a. What do we minimize in order to find the parameters in the factor matrices?

Page 55: Tutorial bpocf

Agenda •  Introduction •  Algorithms – Elegant example – Models – Deviation functions – Difference with rating-based algorithms – Parameter inference

•  Netflix

Page 56: Tutorial bpocf

Tour of Deviation Functions

Page 57: Tutorial bpocf

Local Minima depending on initialisation

S_{ui} = S^{(1)}_{u*} \cdot S^{(2)}_{*i}

S = S^{(1)} S^{(2)}

S = \left( S^{(1,1)} \cdots S^{(1,F_1)} \right) + \cdots + \left( S^{(T,1)} \cdots S^{(T,F_T)} \right)

\max \sum_{R_{ui}=1} \log p(i|u)

\max \sum_{R_{ui}=1} \log S_{ui}

\min -\sum_{R_{ui}=1} \log S_{ui}

D(S,R) = -\sum_{R_{ui}=1} \log S_{ui}

\min D(S,R)


\sum_{i \in I} \sum_{j \in I} \left( \mathrm{sim}(j,i) \cdot |KNN(j) \cap \{i\}| - S^{(2)}_{ji} \right)^2

\sum_{u \in U} \sum_{v \in U} \left( \mathrm{sim}(u,v) \cdot |KNN(u) \cap \{v\}| - S^{(2)}_{uv} \right)^2

S^{(2)}_{ji} = \mathrm{sim}(j,i) \cdot |KNN(j) \cap \{i\}| \quad \text{for all } i,j \in I

S^{(2)}_{uv} = \mathrm{sim}(u,v) \cdot |KNN(u) \cap \{v\}| \quad \text{for all } u,v \in U

S^{(3)}_{uv} = \mathrm{sim}(u,v) \cdot |KNN(u) \cap \{v\}|

\sum_{i \in I} \sum_{j \in I} \left( \mathrm{sim}(j,i) \cdot |KNN(j) \cap \{i\}| - S^{(2)}_{ji} \right)^2 + \sum_{u \in U} \sum_{v \in U} \left( \mathrm{sim}(u,v) \cdot |KNN(u) \cap \{v\}| - S^{(3)}_{uv} \right)^2

every row S^{(1)}_{u\cdot} and every column S^{(2)}_{\cdot i} the same unit vector

O(|U| \times |I|)

O(d^3(|U| + |I|) + d^2 |R|)

(S^{(1,1)}, \ldots, S^{(T,F)})

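To make the nearest-neighbour factorization above concrete, the following minimal sketch builds S^{(2)}_{ji} = sim(j,i) · |KNN(j) ∩ {i}| from a binary matrix R with cosine similarity, and scores users with S = R S^{(2)}. It is an illustration of the reconstructed formulas, not the implementation evaluated in the survey; the function name is mine.

    import numpy as np

    def item_knn_model(R, k):
        """S2[j, i] = cosine(j, i) if i is among the k most similar items to j
        (and i != j), else 0. R is a binary |U| x |I| array."""
        R = np.asarray(R, dtype=float)
        norms = np.linalg.norm(R, axis=0)
        norms[norms == 0] = 1.0                       # avoid division by zero
        sim = (R.T @ R) / np.outer(norms, norms)      # item-item cosine similarity
        np.fill_diagonal(sim, 0.0)                    # an item is not its own neighbour
        S2 = np.zeros_like(sim)
        for j in range(sim.shape[0]):
            knn = np.argsort(sim[j])[::-1][:k]        # k most similar items to j
            S2[j, knn] = sim[j, knn]
        return S2

    R = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 1, 1]])
    S2 = item_knn_model(R, k=2)
    S = R @ S2                                        # user-item scores S = S(1) S(2) with S(1) = R
    print(np.round(S, 2))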

Page 58: Tutorial bpocf

Max Likelihood: high scores for known preferences


[Figure: the factorization S = S^{(1)} S^{(2)}, shown for inner dimensions d = 2, 3, 4 and for D = |I|.]

S_{ui} = S^{(1)}_{u*} \cdot S^{(2)}_{*i}

S = S^{(1)} S^{(2)}

S = \left( S^{(1,1)} \cdots S^{(1,F_1)} \right) + \cdots + \left( S^{(T,1)} \cdots S^{(T,F_T)} \right)

\max \sum_{R_{ui}=1} \log p(i|u)

\max \sum_{R_{ui}=1} \log S_{ui}

\min -\sum_{R_{ui}=1} \log S_{ui}

D(S,R) = -\sum_{R_{ui}=1} \log S_{ui}

\min D(S,R)

S^{(1)}_{ud} \geq 0, \qquad S^{(2)}_{di} \geq 0, \qquad \sum_{d=1}^{D} S^{(1)}_{ud} = 1, \qquad \sum_{i \in I} S^{(2)}_{di} = 1




[Hofmann 2004] [Hofmann 1999]
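A minimal sketch of the maximum-likelihood deviation function above: with S^{(1)} row-stochastic over the latent dimensions and S^{(2)} row-stochastic over the items, every S_{ui} is a probability p(i|u) and the deviation is the negative log-likelihood of the known preferences. This only evaluates the deviation; it is not Hofmann's EM fitting procedure, and the function name is mine.

    import numpy as np

    def ml_deviation(R, S1, S2, eps=1e-12):
        """D(S, R) = - sum over (u, i) with R_ui = 1 of log S_ui, where S = S1 @ S2.

        S1 (|U| x D) has rows summing to 1, S2 (D x |I|) has rows summing to 1,
        so every S_ui is a valid probability p(i | u)."""
        S = S1 @ S2
        return -np.sum(np.log(S[R == 1] + eps))       # eps guards against log(0)

    rng = np.random.default_rng(1)
    U, I, D = 4, 6, 3
    S1 = rng.random((U, D)); S1 /= S1.sum(axis=1, keepdims=True)   # rows sum to 1
    S2 = rng.random((D, I)); S2 /= S2.sum(axis=1, keepdims=True)   # rows sum to 1
    R = (rng.random((U, I)) < 0.3).astype(int)
    print(ml_deviation(R, S1, S2))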

Page 59: Tutorial bpocf

Reconstruction

\sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} \left( P_{ui} (1 - S_{ui})^2 + (1 - P_{ui}) (0 - S_{ui})^2 \right) + \lambda \|S^{(1)}\|_F + \lambda \|S^{(2)}\|_F - \alpha \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) H(P_{ui}) \qquad (57)

(1 - 0)^2 = 1 = (1 - 2)^2 \qquad (58)

w(j) = \sum_{u \in U} R_{uj}

\sum_{u \in U} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \left( (R_{ui} - R_{uj}) - (S_{ui} - S_{uj}) \right)^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|^2_F

\sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|^2_F

\sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|^2_F + \|S^{(t,f)}\|_1


Page 60: Tutorial bpocf

Reconstruction

\sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|^2_F

'Ridge' regularization [Kabbur et al. 2013] [Kabbur and Karypis 2014]
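Evaluating the ridge-regularized reconstruction deviation highlighted here is straightforward. The sketch below does so for a plain two-factor model, a simplification of the general sum over all factor matrices S^{(t,f)}, and not the training code of the cited papers; the function name is mine.

    import numpy as np

    def ridge_deviation(R, S1, S2, lam):
        """sum_{u,i} (R_ui - S_ui)^2 + lam * (||S1||_F^2 + ||S2||_F^2), with S = S1 @ S2."""
        S = S1 @ S2
        squared_error = np.sum((R - S) ** 2)
        regularization = lam * (np.sum(S1 ** 2) + np.sum(S2 ** 2))
        return squared_error + regularization

    rng = np.random.default_rng(2)
    R = (rng.random((5, 7)) < 0.25).astype(float)
    S1, S2 = rng.random((5, 3)), rng.random((3, 7))
    print(ridge_deviation(R, S1, S2, lam=0.1))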

Page 61: Tutorial bpocf

Reconstruction

Elastic net regularization

\sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|^2_F + \|S^{(t,f)}\|_1

'Ridge' regularization

[Ning and Karypis 2011] [Christakopoulou and Karypis 2014]

[Kabbur et al. 2013] [Kabbur and Karypis 2014]
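For the elastic-net-regularized reconstruction of the SLIM family, each column of the item-item factor can be fitted as a separate regression of that item's column of R on the other columns. Below is a minimal sketch using scikit-learn's ElasticNet (which uses the coordinate descent of Friedman et al. [2010]); the alpha and l1_ratio values are placeholders, the non-negativity and zero-diagonal constraints are enforced by hand, and the helper name slim_like is hypothetical.

    import numpy as np
    from sklearn.linear_model import ElasticNet

    def slim_like(R, alpha=0.1, l1_ratio=0.5):
        """Fit S2 column by column: column i of R regressed on R with item i
        zeroed out, under an elastic net penalty and non-negative coefficients."""
        R = np.asarray(R, dtype=float)
        n_items = R.shape[1]
        S2 = np.zeros((n_items, n_items))
        for i in range(n_items):
            X = R.copy()
            X[:, i] = 0.0                                    # exclude the target item itself
            model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                               positive=True, fit_intercept=False, max_iter=1000)
            model.fit(X, R[:, i])
            S2[:, i] = model.coef_
            S2[i, i] = 0.0                                   # no self-similarity
        return S2

    R = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 1, 1],
                  [1, 1, 1, 0]], dtype=float)
    S2 = slim_like(R)
    S = R @ S2                                               # scores from the learned item-item factor
    print(np.round(S, 2))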

Page 62: Tutorial bpocf

Reconstruction between AMAU and AMAN

Ungar and Foster [Ungar and Foster 1998] proposed a similar hard clustering method, but remain vague about the details of their method.

4.1.2. Reconstruction Based Deviation Functions. Next, there is a group of algorithms inspired by SVD-based matrix factorization algorithms for rating prediction problems [Koren and Bell 2011]. They start from the 2-factor factorization that describes the aspect model (Eq. 3) but strip the parameters of all their statistical meaning. Instead, S is postulated to be an approximate, factorized reconstruction of R. A straightforward approach is to find S^{(1)} and S^{(2)} such that they minimize the squared reconstruction error between S and R. A deviation function that reflects this line of thought is

D(S,R) = \sum_{u \in U} \sum_{i \in I} R_{ui} (R_{ui} - S_{ui})^2.

This approach clearly makes the AMAU assumption. Making the AMAN assumption, on the other hand, all missing values are interpreted as an absence of preference and the deviation function becomes

D(S,R) = \sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2.

On the one hand, the AMAU assumption is too careful because the vast majority of the unknowns are negatives. On the other hand, the AMAN assumption is too incautious because we are actually searching for the preferences among the unknowns. Therefore both Hu et al. [Hu et al. 2008] and Pan et al. [Pan et al. 2008] simultaneously proposed a middle way between AMAU and AMAN:

D(S,R) = \sum_{u \in U} \sum_{i \in I} W_{ui} (R_{ui} - S_{ui})^2, \qquad (11)

in which W \in \mathbb{R}^{n \times m} assigns a weight to every value in R. The higher W_{ui}, the higher the confidence about R_{ui}. There is a high confidence about the ones being preferences and a lower confidence about the zeros being dislikes. To formalize this intuition, Hu et al. [Hu et al. 2008] give two potential definitions of W_{ui}:

W_{ui} = 1 + \beta R_{ui}, \qquad (12)
W_{ui} = 1 + \alpha \log\left( 1 + R_{ui} / \epsilon \right), \qquad (13)

with \alpha, \beta, \epsilon hyperparameters. From the above definitions, it is clear that this method is not limited to binary data, but works on positive-only data in general. We, however, are only interested in its usefulness for binary, positive-only data. Alternatively, Pan et al. [Pan et al. 2008] propose W_{ui} = 1 if R_{ui} = 1 and give three possibilities for the case when R_{ui} = 0:

W_{ui} = \delta, \qquad (14)
W_{ui} = \alpha \sum_{j \in I} R_{uj}, \qquad (15)
W_{ui} = \alpha (n - c(i)), \qquad (16)

with \delta \in [0,1] a uniform hyperparameter and \alpha the hyperparameter such that W_{ui} \leq 1 for all pairs (u,i) for which R_{ui} = 0. In the first case, all missing preferences get the same weight. In the second case, a missing preference is more negative if the user already has many preferences. In the third case, a missing preference is less negative if the item is popular (one could argue that this is counterintuitive).

AMAN

Page 63: Tutorial bpocf

Reconstruction between AMAU and AMAN


AMAU

AMAN

Page 64: Tutorial bpocf

Reconstruction between AMAU and AMAN


AMAU

AMAN

Middle Way
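The weighting schemes of Equations 12 to 16 are easy to compare side by side. A minimal numpy sketch follows; the hyperparameter values (beta, alpha, epsilon, delta) are arbitrary placeholders, the function names are mine, and the default alpha in weights_pan is just one way to keep the weights on the zeros at or below 1, as the text requires.

    import numpy as np

    def weights_hu(R, alpha=40.0, beta=40.0, eps=1e-8):
        """Hu et al. [2008]: W = 1 + beta * R, or W = 1 + alpha * log(1 + R / eps)."""
        w_linear = 1.0 + beta * R
        w_log = 1.0 + alpha * np.log1p(R / eps)
        return w_linear, w_log

    def weights_pan(R, delta=0.5, alpha=None):
        """Pan et al. [2008]: W = 1 on the ones; on the zeros either a uniform delta,
        alpha * (number of preferences of the user), or alpha * (n - popularity of the item)."""
        n_users = R.shape[0]
        if alpha is None:
            # one simple choice that keeps all weights on the zeros <= 1
            alpha = 1.0 / max(n_users, R.sum(axis=1).max())
        uniform = np.where(R == 1, 1.0, delta)
        by_user = np.where(R == 1, 1.0, alpha * R.sum(axis=1, keepdims=True))
        by_item = np.where(R == 1, 1.0, alpha * (n_users - R.sum(axis=0, keepdims=True)))
        return uniform, by_user, by_item

    R = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 1, 1]], dtype=float)
    print(weights_pan(R)[2])   # item-popularity-based weights on the zeros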

Page 65: Tutorial bpocf

Reconstruction: choosing W


Middle Way

W_{ui} = \begin{cases} 1 & \text{if } R_{ui} = 0 \\ \alpha & \text{if } R_{ui} = 1 \end{cases}

Page 66: Tutorial bpocf

Reconstruction: regularization

By stripping the matrix factorization of its statistical meaning, also the constraints in Equations 6 to 9 disappear. Simply minimizing Equation 11, however, results in factor matrices that are overfitted on the training data. Therefore both Hu et al. and Pan et al. propose to minimize a regularized version

D(S,R) = \sum_{u \in U} \sum_{i \in I} W_{ui} (R_{ui} - S_{ui})^2 + \lambda \left( \|S^{(1)}\|_F + \|S^{(2)}\|_F \right), \qquad (17)

with \lambda \in \mathbb{R}^{+} the regularization hyperparameter and \|\cdot\|_F the Frobenius norm. The large domain of \lambda can make it hard to find a good value. Additionally, Pan et al. also propose the alternate regularization:

D(S,R) = \sum_{u \in U} \sum_{i \in I} W_{ui} \left( (R_{ui} - S_{ui})^2 + \lambda \left( \|S^{(1)}_{u*}\|_F + \|S^{(2)}_{*i}\|_F \right) \right). \qquad (18)

Since the deviation function is defined over all user-item pairs, a direct optimization method such as stochastic gradient descent (SGD), which is frequently used for finding matrix factorizations in rating prediction problems, seems unfeasible in this case [Hu et al. 2008]. Therefore both Hu et al. and Pan et al. propose an alternating least squares (ALS) method for minimizing the deviation function. Additionally, Pan et al. propose an alternative bagging method for solving the regularized minimization problem that is more scalable.

Switch Sindhwani and Yao because Yao is a simpler form and can be solved with ALS. If I do this, I must be careful with chronology because Sindhwani was before Yao.

Sindhwani et al. propose a more complex weighting scheme [Sindhwani et al. 2010]. Whereas the previous methods computed the weights of the user-item pairs, W, before the optimization procedure, Sindhwani et al. consider these weights as model parameters and compute them simultaneously with all other parameters during the optimization procedure. Furthermore, they introduce a new set of parameters P that indicates for every missing value the probability that it is one. Their deviation function is defined as

D(S,R) = \sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} \left( P_{ui} (1 - S_{ui})^2 + (1 - P_{ui}) (0 - S_{ui})^2 \right) + \lambda \|S^{(1)}\|_F + \lambda \|S^{(2)}\|_F - \alpha \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) H(P_{ui}), \qquad (19)

with P_{ui} the probability that the missing value corresponding to the user-item pair (u,i) is actually a 1, W_{ui} the confidence of the value R_{ui}, and \alpha, \lambda, \gamma user-defined hyperparameters. Furthermore, Sindhwani et al. define the constraint

\frac{1}{|U||I| - |R|} \sum_{u \in U} \sum_{i \in I} P_{ui} = \gamma,

i.e., that the average probability that a missing value is actually one must be equal to the user-defined hyperparameter \gamma. Additionally, they simplify W as the one-dimensional matrix factorization

W_{ui} = V_u V_i.

The first term of the deviation function in Equation 19 gives the squared reconstruction error on the ones and the second term gives the squared reconstruction error on the zeros.

Squared reconstruction error term
Regularization term

Regularization hyperparameter

[Hu et al. 2008] [Pan et al. 2008]

[Pan and Scholz 2009]
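The ALS approach mentioned above has a closed-form half-step: with the item factors fixed, each user row of S^{(1)} minimizes a weighted, regularized least-squares problem. The sketch below is a dense toy version of that user half-step, regularized with the squared Frobenius norm (the form that yields the closed form); it is not the efficient implicit-feedback implementation of Hu et al., the function name is mine, and the item half-step is symmetric.

    import numpy as np

    def als_user_step(R, W, S2, lam):
        """For fixed item factors S2 (d x |I|), solve for each user row of S1:
        S1[u] = argmin_x sum_i W_ui (R_ui - x . S2[:, i])^2 + lam * ||x||^2."""
        d = S2.shape[0]
        S1 = np.zeros((R.shape[0], d))
        for u in range(R.shape[0]):
            Wu = np.diag(W[u])                         # per-item confidence of user u
            A = S2 @ Wu @ S2.T + lam * np.eye(d)       # d x d normal equations
            b = S2 @ Wu @ R[u]
            S1[u] = np.linalg.solve(A, b)
        return S1

    rng = np.random.default_rng(3)
    R = (rng.random((6, 9)) < 0.3).astype(float)
    W = 1.0 + 40.0 * R                                 # confidence weights, W = 1 + beta * R
    S2 = rng.random((3, 9))
    S1 = als_user_step(R, W, S2, lam=0.1)
    print(S1.shape)                                    # (6, 3)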

Page 67: Tutorial bpocf

Reconstruction: more complex weighting


Page 68: Tutorial bpocf

Reconstruction rewritten

D(S,R) = \sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} \left( P_{ui} (1 - S_{ui})^2 + (1 - P_{ui}) (0 - S_{ui})^2 \right) + \lambda \|S^{(1)}\|_F + \lambda \|S^{(2)}\|_F - \alpha \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) H(P_{ui}) \qquad (54)

\sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} (0 - S_{ui})^2 + \lambda \|S^{(1)}\|_F + \lambda \|S^{(2)}\|_F


Page 69: Tutorial bpocf

Reconstruction rewritten


Page 70: Tutorial bpocf

\sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} (0 - S_{ui})^2 + \lambda \|S^{(1)}\|_F + \lambda \|S^{(2)}\|_F


Reconstruction: guess unknown = 0
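A minimal sketch of how a weighted reconstruction objective of this kind is typically minimized with alternating least squares, in the spirit of Hu et al. [Hu et al. 2008] and Pan et al. [Pan et al. 2008]. The NumPy code, the confidence choice W = 1 + 10 R in the toy run, and all names are illustrative assumptions of this sketch, not a reference implementation.

import numpy as np

def wals_user_step(R, W, S2, lam):
    # One alternating-least-squares half-step over the users for the weighted
    # objective sum_{u,i} W_ui (R_ui - S_ui)^2 + lam ||S1||_F^2, with S = S1 @ S2.T.
    n_users, k = R.shape[0], S2.shape[1]
    S1 = np.zeros((n_users, k))
    for u in range(n_users):
        Wu = W[u]                                             # confidences of user u
        A = (S2 * Wu[:, None]).T @ S2 + lam * np.eye(k)       # S2^T diag(Wu) S2 + lam I
        b = (S2 * Wu[:, None]).T @ R[u]                       # S2^T diag(Wu) R_u
        S1[u] = np.linalg.solve(A, b)                         # weighted ridge solution
    return S1

# toy run: binary R, observed ones weighted more heavily than the zeros
R = np.array([[1., 0., 1.], [0., 1., 0.]])
W = 1.0 + 10.0 * R
S2 = np.random.default_rng(0).normal(size=(3, 2))
S1 = wals_user_step(R, W, S2, lam=0.1)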

Page 71: Tutorial bpocf

\sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} \left( p (1 - S_{ui})^2 + (1 - p) (0 - S_{ui})^2 \right) + \lambda ||S^{(1)}||_F + \lambda ||S^{(2)}||_F


Reconstruction: unknown can also be 1

[Yao et al. 2014]

Page 72: Tutorial bpocf

Reconstruction: fewer assumptions, more parameters

\sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} \left( P_{ui} (1 - S_{ui})^2 + (1 - P_{ui}) (0 - S_{ui})^2 \right) + \lambda ||S^{(1)}||_F + \lambda ||S^{(2)}||_F


Page 73: Tutorial bpocf

Reconstruction: more regularization

\sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} \left( P_{ui} (1 - S_{ui})^2 + (1 - P_{ui}) (0 - S_{ui})^2 \right) + \lambda ||S^{(1)}||_F + \lambda ||S^{(2)}||_F - \alpha \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) H(P_{ui})   (55)


Page 74: Tutorial bpocf

\sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} \left( P_{ui} (1 - S_{ui})^2 + (1 - P_{ui}) (0 - S_{ui})^2 \right) + \lambda ||S^{(1)}||_F + \lambda ||S^{(2)}||_F - \alpha \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) H(P_{ui})   (56)


Reconstruction: more (flexible) parameters

[Sindhwani et al. 2010]
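To make an objective of this form concrete, here is a small NumPy sketch that only evaluates Equation 56 for given factor matrices; it does not implement the authors' optimization procedure, and the function name, the epsilon safeguard, and the default constants are assumptions of the sketch.

import numpy as np

def deviation_imputed(R, W, P, S1, S2, lam=0.1, alpha=0.01):
    # Evaluate the deviation above for S = S1 @ S2.T: weighted square loss on the ones,
    # imputed square loss on the unknowns, Frobenius regularization, minus alpha times
    # the entropy of P on the unknowns.
    S = S1 @ S2.T
    eps = 1e-12                                              # numerical safeguard (assumed)
    H = -(P * np.log(P + eps) + (1.0 - P) * np.log(1.0 - P + eps))
    pos = R * W * (1.0 - S) ** 2
    neg = (1.0 - R) * W * (P * (1.0 - S) ** 2 + (1.0 - P) * (0.0 - S) ** 2)
    reg = lam * (np.linalg.norm(S1) + np.linalg.norm(S2))
    return pos.sum() + neg.sum() + reg - alpha * ((1.0 - R) * H).sum()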

Page 75: Tutorial bpocf

Reconstruction: conceptual flaw

dure [Hofmann 1999]. Ungar and Foster [Ungar and Foster 1998] proposed a similar hard clustering method, but remain vague about the details of their method.

4.1.2. Reconstruction Based Deviation Functions. Next, there is a group of algorithms inspired by SVD-based matrix factorization algorithms for rating prediction problems [Koren and Bell 2011]. They start from the 2-factor factorization that describes the aspect model (Eq. 3) but strip the parameters of all their statistical meaning. Instead, S is postulated to be an approximate, factorized reconstruction of R. A straightforward approach is to find S^{(1)} and S^{(2)} such that they minimize the squared reconstruction error between S and R. A deviation function that reflects this line of thought is

D(S, R) = \sum_{u \in U} \sum_{i \in I} R_{ui} (R_{ui} - S_{ui})^2.

This approach clearly makes the AMAU assumption. Making the AMAN assumption, on the other hand, all missing values are interpreted as an absence of preference and the deviation function becomes

D(S, R) = \sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2.

On the one hand, the AMAU assumption is too careful because the vast majority of the unknowns are negatives. On the other hand, the AMAN assumption is too incautious because we are actually searching for the preferences among the unknowns. Therefore both Hu et al. [Hu et al. 2008] and Pan et al. [Pan et al. 2008] simultaneously proposed a middle way between AMAU and AMAN:

D(S, R) = \sum_{u \in U} \sum_{i \in I} W_{ui} (R_{ui} - S_{ui})^2,   (11)

in which W \in \mathbb{R}^{n \times m} assigns a weight to every value in R. The higher W_{ui}, the higher the confidence about R_{ui}. There is a high confidence about the ones being preferences and a lower confidence about the zeros being dislikes. To formalize this intuition, Hu et al. [Hu et al. 2008] give two potential definitions of W_{ui}:

W_{ui} = 1 + \beta R_{ui},   (12)
W_{ui} = 1 + \alpha \log(1 + R_{ui} / \epsilon),   (13)

with \alpha, \beta, \epsilon hyperparameters. From the above definitions, it is clear that this method is not limited to binary data, but works on positive-only data in general. We, however, are only interested in its usefulness for binary, positive-only data. Alternatively, Pan et al. [Pan et al. 2008] propose W_{ui} = 1 if R_{ui} = 1 and give three possibilities for the case when R_{ui} = 0:

W_{ui} = \delta,   (14)
W_{ui} = \alpha \sum_{j \in I} R_{uj},   (15)
W_{ui} = \alpha (n - c(i)),   (16)

with \delta \in [0, 1] a uniform hyperparameter and \alpha the hyperparameter such that W_{ui} \leq 1 for all pairs (u, i) for which R_{ui} = 0. In the first case, all missing preferences get the same weight. In the second case, a missing preference is more negative if the user already has many preferences. In the third case, a missing preference is less negative if the item is popular¹.

¹ One could argue that this is counterintuitive.
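As a rough illustration of the weighting schemes in Equations 12-16, the following NumPy sketch builds W from a binary R. The function names, default values, and the reading of n as the number of users are assumptions of this sketch, not part of the original text.

import numpy as np

def confidence_hu(R, alpha=40.0, beta=1.0, eps=1e-8):
    # Confidence-style weights in the spirit of Equations 12-13;
    # on binary R both variants simply upweight the observed ones.
    w_linear = 1.0 + beta * R
    w_log = 1.0 + alpha * np.log(1.0 + R / eps)
    return w_linear, w_log

def weights_pan(R, scheme="uniform", delta=0.5, alpha=None):
    # Weights in the spirit of Equations 14-16: W_ui = 1 where R_ui = 1,
    # and one of three choices where R_ui = 0.
    n_users = R.shape[0]
    zeros = (R == 0)
    if scheme == "uniform":                        # Eq. 14: a single weight delta
        return np.where(zeros, delta, 1.0)
    if scheme == "user-activity":                  # Eq. 15: scaled by the user's number of preferences
        act = R.sum(axis=1, keepdims=True)
        alpha = alpha or 1.0 / max(act.max(), 1.0)
        return np.where(zeros, alpha * act, 1.0)
    # Eq. 16: scaled by n - c(i), with c(i) the item's popularity (n read as |U| here)
    c = R.sum(axis=0, keepdims=True)
    alpha = alpha or 1.0 / max(n_users, 1)
    return np.where(zeros, alpha * (n_users - c), 1.0)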


(1 - 0)^2 = 1 = (1 - 2)^2   (57)
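A two-line check of the identity above: for an observed preference (R_ui = 1), the square loss cannot tell an overshoot apart from a complete miss.

for s_ui in (0.0, 2.0):
    print(s_ui, (1.0 - s_ui) ** 2)   # both predictions incur a square loss of 1.0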


Page 76: Tutorial bpocf

Log likelihood: similar idea

[C. Johnson 2014]

\nabla D(S, R) = \nabla \sum_{u \in U} \sum_{i \in I} D_{ui}(S, R) = \sum_{u \in U} \sum_{i \in I} \nabla D_{ui}(S, R)

\nabla D(S, R) = \nabla \sum_{u \in U} \sum_{\substack{i \in I \\ R_{ui}=1}} \sum_{j \in I} D_{uij}(S, R) = \sum_{u \in U} \sum_{\substack{i \in I \\ R_{ui}=1}} \sum_{j \in I} \nabla D_{uij}(S, R)

= \int \delta(\,\cdot\,) \cdot p(\,\cdot \mid \cdot\,) \, d(\,\cdot\,)

D(S, R) = D_{KL}(Q(S) \,\|\, p(S \mid R))

. . .

max for every (u, i)

\max \log p(S \mid R)

\max \log \prod_{u \in U} \prod_{i \in I} S_{ui}^{\alpha R_{ui}} (1 - S_{ui})

- \log \prod_{u \in U} \prod_{i \in I} S_{ui}^{\alpha R_{ui}} (1 - S_{ui})

- \sum_{u \in U} \sum_{i \in I} \left( \alpha R_{ui} \log S_{ui} + \log(1 - S_{ui}) \right) + \lambda \left( ||S^{(1)}||_F^2 + ||S^{(2)}||_F^2 \right)


Page 77: Tutorial bpocf

Log likelihood: similar idea

[C. Johnson 2014]


Zero-mean, spherical Gaussian priors
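A small NumPy sketch of the regularized negative log-likelihood above. Reading S_ui as the sigmoid of the factor product, as in logistic matrix factorization [C. Johnson 2014], is an assumption of this sketch, as are the function name and the default constants.

import numpy as np

def logistic_mf_deviation(R, S1, S2, alpha=1.0, lam=0.1):
    # - sum_{u,i} [ alpha R_ui log S_ui + log(1 - S_ui) ] + lam (||S1||_F^2 + ||S2||_F^2),
    # with S_ui = sigmoid(S1_u . S2_i) so that S_ui stays in (0, 1).
    S = 1.0 / (1.0 + np.exp(-(S1 @ S2.T)))
    eps = 1e-12                                   # numerical safeguard (assumed)
    ll = alpha * R * np.log(S + eps) + np.log(1.0 - S + eps)
    return -ll.sum() + lam * (np.sum(S1 ** 2) + np.sum(S2 ** 2))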

Page 78: Tutorial bpocf

Maximum Margin: not all preferences equally preferred

the zeros. The third term gives the regularization error and the fourth and final term, -\alpha \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) H(P_{ui}), gives the entropy of the probabilities P_{ui} and is introduced for smoothing the deviation function. If \alpha is big, the entropy term dominates and all P_{ui} are chosen to maximize the entropy, i.e. P_{ui} = 1/2 for every (u, i). As \alpha is reduced, the entropy of the optimal P decreases and P becomes less uniform. A conceptual inconsistency of Equation 19 is that although the recommendation score used is given by S_{ui} (= S^{(1)}_{u*} S^{(2)}_{*i}), also P_{ui} could be used. Hence, there exist two parameters for the same concept, which is ambiguous at least.

To compute a local minimum of the deviation function in Equation 19, Sindhwani et al. propose a custom non-convex optimization procedure. A serious limit on the scalability is that P is a dense matrix. Therefore they enforce sparseness on P by randomly choosing a small number of user-item pairs (u, i) for which P_{ui} can be bigger than zero. This random aspect weakens the conceptual argument behind the definition of Equation 19.

Yao et al. propose a reformulation of Equation 19 without the entropy smoothing term, i.e. with \alpha = 0, and with two further hyperparameters set equal [Yao et al. 2014]. Furthermore, two constraints on the parameters are different. Firstly, they choose a uniform weight for all missing feedback (Equation 14) instead of including W as a parameter in the optimization problem. Similarly, they also uniformly choose P_{ui} = p, with the global imputation value p a hyperparameter of the method. Not surprisingly, they also propose a different algorithm for minimizing Equation 19.

4.1.3. Maximum Margin Based Deviation Functions. Notice that R is a binary matrix and that for the above algorithms S is a real-valued matrix. Therefore, the interpretation of S as a pure reconstruction of R is fundamentally flawed. This fundamental flaw has important practical consequences: if R_{ui} = 1, the square loss is 1 for both S_{ui} = 0 and S_{ui} = 2. However, S_{ui} = 2 is a much better prediction than S_{ui} = 0. Put differently, the reconstruction based deviation functions (implicitly) assume that all preferences are equally strong, which is an important simplification of reality.

A deviation function that does not suffer from this flaw was proposed by Pan and Scholz [Pan and Scholz 2009], who applied the idea of Maximum Margin Matrix Factorization (MMMF) by Srebro et al. [Srebro et al. 2004] to binary, positive-only collaborative filtering. They construct the matrix \tilde{R} as

\tilde{R}_{ui} = 1 if R_{ui} = 1,
\tilde{R}_{ui} = -1 if R_{ui} = 0,

and define the deviation function as

D(S, \tilde{R}) = \sum_{u \in U} \sum_{i \in I} W_{ui} \, h(\tilde{R}_{ui} \cdot S_{ui}) + \lambda ||S||_\Sigma,   (20)

with ||\cdot||_\Sigma the trace norm, \lambda a regularization hyperparameter, h(\tilde{R}_{ui} \cdot S_{ui}) a smooth hinge loss given by Figure 3 [Rennie and Srebro 2005], and W given by one of the Equations 14-16.

The deviation function incorporates the confidence about the training data by means of W, and the missing knowledge about the degree of preference by means of the hinge loss h(\tilde{R}_{ui} \cdot S_{ui}). Since the degree of preference is considered unknown, a value \tilde{R}_{ui} \cdot S_{ui} \geq 1 is not penalized.

Minimizing Equation 20 can be done by means of the conjugate gradients method by Rennie and Srebro [Rennie and Srebro 2005]. Alternatively, Pan and Scholz [Pan


[Pan and Scholz 2009]
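The following sketch spells out the smooth hinge of Rennie and Srebro [Rennie and Srebro 2005] and a weighted deviation in the spirit of Equation 20. Replacing the trace-norm term by a simple Frobenius penalty is a simplification made only for this sketch; all names are illustrative.

import numpy as np

def smooth_hinge(z):
    # Smooth hinge h(z): 0.5 - z for z <= 0, (1 - z)^2 / 2 for 0 < z < 1, 0 for z >= 1.
    return np.where(z >= 1.0, 0.0,
                    np.where(z <= 0.0, 0.5 - z, 0.5 * (1.0 - z) ** 2))

def max_margin_deviation(R, W, S, lam=0.1):
    # Weighted smooth-hinge deviation in the spirit of Equation 20; the trace-norm
    # term is replaced here by a Frobenius penalty to keep the sketch self-contained.
    R_tilde = np.where(R == 1, 1.0, -1.0)         # map {1, 0} to {+1, -1}
    return (W * smooth_hinge(R_tilde * S)).sum() + lam * np.linalg.norm(S)

# an overshoot on an observed preference costs nothing, unlike under the square loss
print(smooth_hinge(np.array([0.0, 2.0])))         # -> [0.5, 0.0]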

Page 79: Tutorial bpocf

Maximum Margin: not all preferences equally preferred

1:8 K. Verstrepen et al.

the zeros. The third term gives the regularization error and the fourth and final term,�↵

Pu2U

Pi2I(1 � Rui)H (Pui), gives the entropy of the probabilities Pui and is in-

troduced for smoothing the deviation function. If ↵ is big, the entropy term dominatesand all Pui are found to minimize entropy, i.e. Pui = � for every (u, i). As ↵ is reduced,the entropy of the optimal P increases and P becomes less uniform. A conceptual in-consistency of Equation 19 is that although the recommendation score used is givenby Sui(= S(1)

u⇤ S(2)⇤i ), also Pui could be used. Hence, there exist two parameters for the

same concept, which is ambiguous at least.To compute a local minimum of of the deviation function in Equation 19, Sindhwani

et al. propose a custom non-convex optimization procedure. A serious limit on the scal-ability is that P is a dense matrix. Therefore they enforce sparseness on P by randomlychoosing a small number of user-item pairs (u, i) for which Pui can be bigger than zero.This random aspect weakens the conceptual argument behind the definition of Equa-tion 19.

Yao et al. propose a reformulation of Equation 19 without the entropy smoothingterm, i.e. with ↵ = 0, and with � = � [Yao et al. 2014]. Furthermore, two constraintson the parameters are different. Firstly, they choose a uniform weight for all missingfeedback (Equation 14) instead of including W as a parameter in the optimizationproblem. Similarly, they also uniformly choose Pui = p with the global imputationvalue p a hyperparameter of the method. Not surprisingly, they also propose a differentalgorithm for minimizing Equation 19.

4.1.3. Maximum Margin Based Deviation Functions. Notice that R is a binary matrix andthat for the above algorithms S is a real valued matrix. Therefore, the interpretationof S as the pure reconstruction of R is fundamentally flawed. This fundamental flawhas important practical consequences: If Rui = 1, the square loss is 1 for both Sui = 0

and Sui = 2. However, Sui = 2 is a much better prediction than Sui = 0. Put differently,the reconstruction based deviation functions (implicitly) assume that all preferencesare equally strong, which is an important simplification of reality.

A deviation function that does not suffer from this flaw was proposed by Pan andScholz [Pan and Scholz 2009], who applied the idea of Maximum Margin Matrix Fac-torization (MMMF) by Srebro et al. [Srebro et al. 2004] to binary, positive-only collab-orative filtering. They construct the matrix ˜R as

⇢˜Rui = 1 if Rui = 1

˜Rui = �1 if Rui = 0,

and define the deviation funtion as

D⇣S, ˜R

⌘=

X

u2U

X

i2IWuih

⇣˜Rui · Sui

⌘+ �||S||⌃, (20)

with ||.||⌃ the trace norm, � a regularization hyperparameter, h⇣

˜Rui · Sui

⌘a smooth

hinge loss given by Figure 3 [Rennie and Srebro 2005] and W given by one of theEquations 14-16.

The deviation function incorporates the confidence about the training data by meansof W and the missing knowledge about the degree of preference by means of the hingeloss h

⇣˜Rui · Sui

⌘. Since the degree of preference is considered unknown, a value ˜Rui �

1 is not penalized.Minimizing Equation 20 can be done by means of the conjugate gradients method

by Rennie and Srebro [Rennie and Srebro 2005]. Alternatively, Pan and Scholz [Pan

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

[Pan and Scholz 2009]

Page 80: Tutorial bpocf

Maximum Margin: not all preferences equally preferred

1:8 K. Verstrepen et al.

the zeros. The third term gives the regularization error and the fourth and final term,�↵

Pu2U

Pi2I(1 � Rui)H (Pui), gives the entropy of the probabilities Pui and is in-

troduced for smoothing the deviation function. If ↵ is big, the entropy term dominatesand all Pui are found to minimize entropy, i.e. Pui = � for every (u, i). As ↵ is reduced,the entropy of the optimal P increases and P becomes less uniform. A conceptual in-consistency of Equation 19 is that although the recommendation score used is givenby Sui(= S(1)

u⇤ S(2)⇤i ), also Pui could be used. Hence, there exist two parameters for the

same concept, which is ambiguous at least.To compute a local minimum of of the deviation function in Equation 19, Sindhwani

et al. propose a custom non-convex optimization procedure. A serious limit on the scal-ability is that P is a dense matrix. Therefore they enforce sparseness on P by randomlychoosing a small number of user-item pairs (u, i) for which Pui can be bigger than zero.This random aspect weakens the conceptual argument behind the definition of Equa-tion 19.

Yao et al. propose a reformulation of Equation 19 without the entropy smoothingterm, i.e. with ↵ = 0, and with � = � [Yao et al. 2014]. Furthermore, two constraintson the parameters are different. Firstly, they choose a uniform weight for all missingfeedback (Equation 14) instead of including W as a parameter in the optimizationproblem. Similarly, they also uniformly choose Pui = p with the global imputationvalue p a hyperparameter of the method. Not surprisingly, they also propose a differentalgorithm for minimizing Equation 19.

4.1.3. Maximum Margin Based Deviation Functions. Notice that R is a binary matrix andthat for the above algorithms S is a real valued matrix. Therefore, the interpretationof S as the pure reconstruction of R is fundamentally flawed. This fundamental flawhas important practical consequences: If Rui = 1, the square loss is 1 for both Sui = 0

and Sui = 2. However, Sui = 2 is a much better prediction than Sui = 0. Put differently,the reconstruction based deviation functions (implicitly) assume that all preferencesare equally strong, which is an important simplification of reality.

A deviation function that does not suffer from this flaw was proposed by Pan andScholz [Pan and Scholz 2009], who applied the idea of Maximum Margin Matrix Fac-torization (MMMF) by Srebro et al. [Srebro et al. 2004] to binary, positive-only collab-orative filtering. They construct the matrix ˜R as

⇢˜Rui = 1 if Rui = 1

˜Rui = �1 if Rui = 0,

and define the deviation funtion as

D⇣S, ˜R

⌘=

X

u2U

X

i2IWuih

⇣˜Rui · Sui

⌘+ �||S||⌃, (20)

with ||.||⌃ the trace norm, � a regularization hyperparameter, h⇣

˜Rui · Sui

⌘a smooth

hinge loss given by Figure 3 [Rennie and Srebro 2005] and W given by one of theEquations 14-16.

The deviation function incorporates the confidence about the training data by meansof W and the missing knowledge about the degree of preference by means of the hingeloss h

⇣˜Rui · Sui

⌘. Since the degree of preference is considered unknown, a value ˜Rui �

1 is not penalized.Minimizing Equation 20 can be done by means of the conjugate gradients method

by Rennie and Srebro [Rennie and Srebro 2005]. Alternatively, Pan and Scholz [Pan

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

[Pan and Scholz 2009]


Figure: the aspect model's item distributions p(i|d_1), p(i|d_2), ..., p(i|d_D), annotated with the two predictions S_ui = 0 and S_ui = 2 that the square loss cannot distinguish when R_ui = 1.

Page 81: Tutorial bpocf


Fig. 3. Shown are the loss function values h(z) (left) and the gradients dh(z)/dz (right) for the Hinge and Smooth Hinge. Note that the gradients are identical outside the region z \in (0, 1) [Rennie and Srebro 2005].

and Scholz 2009] propose a bagging method for better scalability. Remark that both methods find different solutions to the minimization problem.

Besides the hinge loss, also the exponential and the binomial negative log-likelihood loss functions exhibit a similar behavior [Bishop 2006]:

l_{exp}(\tilde{R}_{ui}, S_{ui}) = \exp(-\tilde{R}_{ui} S_{ui}),
l_{ll}(\tilde{R}_{ui}, S_{ui}) = \log(1 + \exp(-2 \tilde{R}_{ui} S_{ui})).

However, to the best of our knowledge they have not yet been used for one-class collaborative filtering.

4.1.4. Ranking Based Deviation Functions. The scores computed by recommender systems are often used to personally rank all items for every user. Therefore, Rendle et al. [Rendle et al. 2009] argued that it is natural to directly optimize the ranking. More specifically, they aim to maximize the area under the ROC curve (AUC), which is given by

AUC = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|u| \cdot (|I| - |u|)} \sum_{R_{ui}>0} \sum_{R_{uj}=0} \delta(S_{ui} > S_{uj}),

with \delta(true) = 1 and \delta(false) = 0. If the AUC is higher, the pairwise rankings induced by the model S are more in line with the observed data R. However, because \delta(S_{ui} > S_{uj}) is non-differentiable, their deviation function is a differentiable approximation of the negative AUC from which constant factors have been removed and to which a regularization term has been added:

D(S, \tilde{R}) = \sum_{u \in U} \sum_{R_{ui}>0} \sum_{R_{uj}=0} \log \sigma(S_{uj} - S_{ui}) - \lambda_1 ||S^{(1)}||_F^2 - \lambda_2 ||S^{(2)}||_F^2,

with \sigma(\cdot) the sigmoid function and \lambda_1, \lambda_2 regularization constants, which are hyperparameters of the method. Notice that this deviation function considers all missing feedback equally negative, i.e. it corresponds to the AMAN assumption.

However, very often, only the N highest ranked items are shown to users. Therefore, Shi et al. [Shi et al. 2012] propose to maximize the mean reciprocal rank (MRR) instead of the AUC. The MRR is defined as

MRR = \frac{1}{|U|} \sum_{u \in U} r_>\!\left( \max_{R_{ui}=1} S_{ui} \;\middle|\; S_{u*} \right)^{-1},


AUC: directly optimize the ranking

[Rendle et al. 2009]
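A sketch of one stochastic gradient step on the smooth pairwise surrogate, following the usual BPR update of Rendle et al. [Rendle et al. 2009]: ascend log sigma(S_ui - S_uj) with an l2 penalty. The sampling scheme, learning rate, and sign conventions are the standard ones and are assumptions of this sketch rather than a transcription of the display above.

import numpy as np

def bpr_sgd_step(R, S1, S2, rng, lr=0.05, lam=0.01):
    # Sample a user u, a preferred item i (R_ui = 1) and an unobserved item j (R_uj = 0),
    # then take one ascent step on log sigmoid(S_ui - S_uj) minus lam times the l2 penalty.
    u = rng.integers(R.shape[0])
    pos, neg = np.flatnonzero(R[u] == 1), np.flatnonzero(R[u] == 0)
    if len(pos) == 0 or len(neg) == 0:
        return
    i, j = rng.choice(pos), rng.choice(neg)
    x = S1[u] @ (S2[i] - S2[j])
    g = 1.0 / (1.0 + np.exp(x))                   # derivative of log sigmoid(x)
    w = S1[u].copy()
    S1[u] += lr * (g * (S2[i] - S2[j]) - lam * w)
    S2[i] += lr * (g * w - lam * S2[i])
    S2[j] += lr * (-g * w - lam * S2[j])

# usage: repeatedly call bpr_sgd_step(R, S1, S2, np.random.default_rng(0)) over many samples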

Page 82: Tutorial bpocf

AUC: directly optimize the ranking

[Rendle et al. 2009]

\sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} \left( P_{ui} (1 - S_{ui})^2 + (1 - P_{ui}) (0 - S_{ui})^2 \right) + \lambda ||S^{(1)}||_F + \lambda ||S^{(2)}||_F - \alpha \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) H(P_{ui})   (57)

(1 - 0)^2 = 1 = (1 - 2)^2   (58)

w(j) = \sum_{u \in U} R_{uj}

\sum_{u \in U} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \left( (R_{ui} - R_{uj}) - (S_{ui} - S_{uj}) \right)^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} ||S^{(t,f)}||_F^2

\sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} ||S^{(t,f)}||_F^2

\sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \left( \lambda_{tf} ||S^{(t,f)}||_F^2 + ||S^{(t,f)}||_1 \right)

\max \; \frac{1}{|u| \cdot (|I| - |u|)} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \frac{S_{ui} - S_{uj}}{2}

\max \min_{\alpha_{u*}} \; \sum_{R_{ui}=1} \sum_{R_{uj}=0} \alpha_{ui} \alpha_{uj} (S_{ui} - S_{uj})

\sum_{u \in U} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \delta(S_{ui} > S_{uj})

\sum_{u \in U} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \delta(S_{uj} + 1 \geq S_{ui})

r_>(S_{uj} \mid \{S_{uk} \mid R_{uk} = 0\})   (59)

AUC = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|u| \cdot (|I| - |u|)} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \delta(S_{ui} > S_{uj})


Page 83: Tutorial bpocf

AUC non-differentiable

[Rendle et al. 2009]


Page 84: Tutorial bpocf

AUC smooth approximation


[Rendle et al. 2009]


Page 85: Tutorial bpocf

Pairwise Ranking 2 similar to AUC


$$D\big(S, \tilde{R}\big) = \sum_{u \in \mathcal{U}} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \big( (R_{ui} - R_{uj}) - (S_{ui} - S_{uj}) \big)^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|_F^2$$


[Kabbur et al. 2013]

Page 86: Tutorial bpocf

Pairwise Ranking 3 no regularization, also 1 to 1


in which r_>(a | B) gives the rank of a among all numbers in B when ordered in descending order. Unfortunately, the non-smoothness of r_>() and max makes the direct optimization of MRR unfeasible. Hence, Shi et al. derive a smoothed version of MRR. Although this smoothed version is differentiable, it could still be practically intractable to optimize. Therefore, they propose to optimize a lower bound instead. After also adding regularization terms, their final deviation function is given by

$$D\big(S, \tilde{R}\big) = -\sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}} R_{ui} \Big( \log \sigma(S_{ui}) + \sum_{j \in \mathcal{I}} \log\big(1 - R_{uj}\,\sigma(S_{uj} - S_{ui})\big) \Big) + \lambda \Big( \|S^{(1)}\|_F^2 + \|S^{(2)}\|_F^2 \Big), \quad (21)$$

with λ a regularization constant and σ() the sigmoid function. Notice that this deviation function de facto ignores all missing feedback, i.e. it corresponds to the AMAU assumption.
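A dense-loop sketch of the deviation function in Equation (21) for a factorized model S = S^(1) S^(2); the function name and the toy structure are ours, not the implementation of Shi et al.:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smoothed_mrr_deviation(R, S1, S2, lam=0.01):
    # Equation (21): for every known preference (u, i), reward a high sigma(S_ui)
    # and penalize other known preferences j of user u whose score exceeds S_ui;
    # missing feedback (R_uj = 0) drops out of the inner sum.
    S = S1 @ S2
    dev = 0.0
    for u in range(R.shape[0]):
        for i in np.where(R[u] == 1)[0]:
            inner = np.log(sigmoid(S[u, i]))
            inner += np.sum(np.log(1.0 - R[u] * sigmoid(S[u] - S[u, i])))
            dev -= inner
    return dev + lam * (np.sum(S1 ** 2) + np.sum(S2 ** 2))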

Yet another ranking based deviation function was proposed by Takacs and Tikk [Takacs and Tikk 2012]:

$$D\big(S, \tilde{R}\big) = \sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}} R_{ui} \sum_{j \in \mathcal{I}} w(j) \big( (S_{ui} - S_{uj}) - (R_{ui} - R_{uj}) \big)^2,$$

with w() a user-defined item weighting function. The simplest choice is w(j) = 1 for all j. An alternative proposed by Takacs and Tikk is $w(j) = \sum_{u \in \mathcal{U}} R_{uj}$. This deviation function has some resemblance to the one of Rendle et al. in Section 4.1.4. However, a squared loss is used instead of the log-loss of the sigmoid. Furthermore, this deviation function also minimizes the score-difference between all known preferences, which the deviation function in Section 4.1.4 does not do. Finally, it is remarkable that Takacs and Tikk explicitly do not add a regularization term, whereas most other authors find that the regularization term is important for their model's performance.
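A sketch of this deviation function with both weighting choices; the function name is ours and the loops are kept dense for readability:

import numpy as np

def takacs_tikk_deviation(R, S, w=None):
    # Ranking deviation of Takacs and Tikk: for every known preference (u, i) and
    # every item j, the squared difference between the score gap S_ui - S_uj and
    # the feedback gap R_ui - R_uj, weighted by w(j). No regularization is added.
    if w is None:
        w = np.ones(R.shape[1])  # the simplest choice, w(j) = 1 for all j
    dev = 0.0
    for u in range(R.shape[0]):
        for i in np.where(R[u] == 1)[0]:
            gaps = (S[u, i] - S[u]) - (R[u, i] - R[u])
            dev += np.sum(w * gaps ** 2)
    return dev

# Popularity-based alternative proposed by Takacs and Tikk: w(j) = sum_u R_uj
# takacs_tikk_deviation(R, S, w=R.sum(axis=0))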

4.1.5. Posterior Probability Deviation Functions. At this point, we have almost finished discussing the first group of algorithms, those for which all factor matrices are a priori unknown, and we hope it is becoming clear that the vast majority of algorithms fits nicely in our framework that models recommendation scores as matrix factorizations found by minimizing a deviation function. However, since we have chosen to fit our framework tightly around the majority of the algorithms, there are a few algorithms that do not fit completely within it. Fortunately, these outlier algorithms are rare and the framework allows us to show exactly how they differ from the majority of algorithms for BPOCF.

A first outlier algorithm, by Koeningstein et al. [Koeningstein et al. 2012], computes the eventual recommendation scores S as the expected value of the stochastic recommendation scores Ŝ:

$$S = E\big[\hat{S} \mid R\big], \quad (22)$$

which can also be written as

$$S \approx \int p\big(\hat{S} \mid R\big) \cdot \hat{S} \cdot d\hat{S},$$

with p(Ŝ | R) the posterior probability density function of the stochastic recommendation scores given the data. In the spirit of our framework, we define the deviation
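Equation (22) can be read as a Monte Carlo average over posterior draws of Ŝ. The sketch below only illustrates that reading; sample_posterior is a hypothetical stand-in for a sampler of p(Ŝ | R) and is not the inference procedure of Koeningstein et al.:

import numpy as np

def expected_scores(sample_posterior, R, n_samples=100):
    # Monte Carlo reading of Equation (22): average score matrices drawn from the
    # posterior p(S_hat | R). sample_posterior is a hypothetical user-supplied
    # callable that returns one draw of S_hat given the data R.
    draws = [sample_posterior(R) for _ in range(n_samples)]
    return np.mean(np.stack(draws), axis=0)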



[Takàcs and Tikk 2012]

Page 87: Tutorial bpocf


MRR focus on top of the ranking

[Shi et al. 2012]

Page 88: Tutorial bpocf


MRR non-differentiable

[Shi et al. 2012]

Page 89: Tutorial bpocf


MRR differentiable approximation, computationally feasible

[Shi et al. 2012]

Page 90: Tutorial bpocf


MRR known preferences score high

promote


[Shi et al. 2012]

Page 91: Tutorial bpocf


MRR push down other known preferences

in which r>(a | B) gives the rank of a among all numbers in B when ordered in de-scending order. Unfortunately, the non-smoothness of r>() and max makes the directoptimization of MRR unfeasible. Hence, Shi et al. derive a smoothed version of MRR.Although this smoothed version differentiable, it could still be practically intractableto optimize it. Therefore, they propose to optimize a lower bound instead. After alsoadding regularization terms, their final deviation function is given by

$$ D\bigl(S, \tilde{R}\bigr) = -\sum_{u \in U} \sum_{i \in I} R_{ui} \Bigl( \log \sigma(S_{ui}) + \sum_{j \in I} \log\bigl(1 - R_{uj}\,\sigma(S_{uj} - S_{ui})\bigr) \Bigr) + \lambda \bigl( \|S^{(1)}\|_F^2 + \|S^{(2)}\|_F^2 \bigr), \tag{21} $$

with λ a regularization constant and σ() the sigmoid function. Notice that this deviation function de facto ignores all missing feedback, i.e. it corresponds to the AMAU assumption.
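For reference, a small NumPy sketch (names assumed) that evaluates the plain MRR of a score matrix against binary feedback, following the definition above:

```python
import numpy as np

def mean_reciprocal_rank(S, R):
    """MRR of score matrix S (users x items) w.r.t. binary feedback R:
    reciprocal rank of the best-scored known preference of every user."""
    mrr = 0.0
    for u in range(R.shape[0]):
        known = np.flatnonzero(R[u] == 1)
        if known.size == 0:
            continue
        best = S[u, known].max()
        rank = 1 + np.sum(S[u] > best)   # rank of `best` among all of u's scores
        mrr += 1.0 / rank
    return mrr / R.shape[0]
```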

Yet another ranking based deviation function was proposed by Takacs and Tikk [Takacs and Tikk 2012]:

$$ D\bigl(S, \tilde{R}\bigr) = \sum_{u \in U} \sum_{i \in I} R_{ui} \sum_{j \in I} w(j) \bigl( (S_{ui} - S_{uj}) - (R_{ui} - R_{uj}) \bigr)^2, $$

with w() a user-defined item weighting function. The simplest choice is w(j) = 1 for all j. An alternative proposed by Takacs and Tikk is w(j) = Σ_{u∈U} R_uj. This deviation function has some resemblance with the ranking based deviation function of Rendle et al. above (Section 4.1.4). However, a squared loss is used instead of the log-loss of the sigmoid. Furthermore, this deviation function also minimizes the score-difference between all known preferences, which the deviation function of Rendle et al. does not do. Finally, it is remarkable that Takacs and Tikk explicitly do not add a regularization term, whereas most other authors find that the regularization term is important for their model's performance.
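A direct, dense evaluation of this squared ranking deviation could look like the following NumPy sketch (ours, with assumed names); `w` defaults to the constant weighting, with item popularity `R.sum(axis=0)` being the alternative mentioned above.

```python
import numpy as np

def takacs_tikk_deviation(S, R, w=None):
    """Squared ranking deviation above: for every known preference (u, i),
    compare the score difference to every item j with the feedback difference,
    weighted by the per-item weights w(j)."""
    n_users, n_items = R.shape
    if w is None:
        w = np.ones(n_items)
    dev = 0.0
    for u in range(n_users):
        for i in np.flatnonzero(R[u] == 1):
            diff = (S[u, i] - S[u]) - (R[u, i] - R[u])
            dev += np.sum(w * diff ** 2)
    return dev
```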

4.1.5. Posterior Probability Deviation Functions. At this point, we have almost finished discussing the first group of algorithms, those for which all factor matrices are a priori unknown, and we hope it is becoming clear that the vast majority of algorithms nicely fits in our framework that models recommendation scores as matrix factorizations found by minimizing a deviation function. However, since we have chosen to tightly fit our framework around the majority of the algorithms, there are a few algorithms that do not fit completely within it. Fortunately, these outlier algorithms are rare, and the framework allows us to show exactly how they differ from the majority of algorithms for BPOCF.

A first outlier algorithm, by Koeningstein et al. [Koeningstein et al. 2012], computes the eventual recommendation scores S as the expected value of the stochastic recommendation scores Ŝ:

$$ S = \mathbb{E}\bigl[\hat{S} \mid R\bigr], \tag{22} $$

which can also be written as

$$ S = \int p\bigl(\hat{S} \mid R\bigr) \cdot \hat{S} \cdot d\hat{S}, $$

with p(Ŝ | R) the posterior probability density function of the stochastic recommendation scores given the data. In the spirit of our framework, we define the deviation
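One generic way to approximate such an expectation is Monte Carlo averaging over draws from an approximate posterior. The sketch below is only an illustration; `sample_posterior` is an assumed placeholder for such a sampler, not part of the referenced method.

```python
import numpy as np

def expected_scores(sample_posterior, n_samples=100):
    """Monte Carlo estimate of S = E[S_hat | R]: average score matrices drawn
    from an (approximate) posterior sampler `sample_posterior()`."""
    acc = None
    for _ in range(n_samples):
        draw = sample_posterior()
        acc = draw.copy() if acc is None else acc + draw
    return acc / n_samples
```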


promote

scatter

[Shi et al. 2012]

Page 92: Tutorial bpocf

Fast Maximum Margin Matrix Factorization [Rennie and Srebro 2005]

The thresholds θ_r can be learned from the data. Furthermore, a different set of thresholds can be learned for each user, allowing users to "use ratings differently" and alleviating the need to normalize the data. The problem can then be written as:

$$ \text{minimize} \quad \|X\|_\Sigma + C \sum_{ij \in S} \sum_{r=1}^{R-1} h\bigl(T^r_{ij}(\theta_{ir} - X_{ij})\bigr) \tag{4} $$

where the variables optimized over are the matrix X and the thresholds θ. In other work, we find that such a formulation is highly effective for rating prediction (Rennie & Srebro, 2005).

Although the problem was formulated here as a single optimization problem with a combined objective, ‖X‖_Σ + C · error, it should really be viewed as a dual-objective problem of balancing between low trace-norm and low error. Considering the entire set of attainable (‖X‖_Σ, error) pairs, the true object of interest is the exterior "front" of this set, i.e. the set of matrices X for which it is not possible to reduce one of the two objectives without increasing the other. This "front" can be found by varying the value of C from zero (hard-margin) to infinity (no norm regularization).

All optimization problems discussed in this section can be written as semi-definite programs (Srebro et al., 2005).

3. Optimization Methods

We describe here a local search heuristic for the problem (4). Instead of searching over X, we search over pairs of matrices (U, V), as well as sets of thresholds θ, and attempt to minimize the objective:

$$ J(U, V, \theta) := \tfrac{1}{2}\bigl(\|U\|_{Fro}^2 + \|V\|_{Fro}^2\bigr) + C \sum_{r=1}^{R-1} \sum_{ij \in S} h\bigl(T^r_{ij}(\theta_{ir} - U_i V_j^\top)\bigr). \tag{5} $$

For any U, V we have ‖UV^⊤‖_Σ ≤ ½(‖U‖²_Fro + ‖V‖²_Fro), and so J(U, V, θ) upper bounds the minimization objective of (4), where X = UV^⊤. Furthermore, for any X, and in particular the X minimizing (4), some factorization X = UV^⊤ achieves ‖X‖_Σ = ½(‖U‖²_Fro + ‖V‖²_Fro). The minimization problem (4) is therefore equivalent to:

$$ \text{minimize} \quad J(U, V, \theta). \tag{6} $$

The advantage of considering (6) instead of (4) is that ‖X‖_Σ is a complicated non-differentiable function for which it is not easy to find the subdifferential. Finding good descent directions for (4) is not easy.

Figure 1. Shown are the loss function values (left) and gradients (right) for the Hinge and Smooth Hinge. Note that the gradients are identical outside the region z ∈ (0, 1).

On the other hand, the objective J(U, V, θ) is fairly simple. Ignoring for the moment the non-differentiability of h(z) = (1 − z)_+ at one, the gradient of J(U, V, θ) is easy to compute. The partial derivative with respect to each element of U is:

$$ \frac{\partial J}{\partial U_{ia}} = U_{ia} - C \sum_{r=1}^{R-1} \sum_{j \mid ij \in S} T^r_{ij}\, h'\bigl(T^r_{ij}(\theta_{ir} - U_i V_j^\top)\bigr) V_{ja} \tag{7} $$

The partial derivative with respect to V_ja is analogous. The partial derivative with respect to θ_ir is

$$ \frac{\partial J}{\partial \theta_{ir}} = C \sum_{j \mid ij \in S} T^r_{ij}\, h'\bigl(T^r_{ij}(\theta_{ir} - U_i V_j^\top)\bigr). \tag{8} $$

With the gradient in hand, we can turn to gradient descent methods for locally optimizing J(U, V, θ). The disadvantage of considering (6) instead of (4) is that although the minimization objective in (4) is a convex function of X, θ, the objective J(U, V, θ) is not a convex function of U, V. This is potentially bothersome, and might inhibit convergence to the global minimum.
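As an illustration of Equation (7), the following NumPy sketch (ours; shapes and names are assumptions) accumulates the gradient of J with respect to one row of U, using the Smooth Hinge gradient from earlier:

```python
import numpy as np

def grad_U_row(U, V, theta, T, i, js, C):
    """Gradient of J w.r.t. row U[i] (Eq. 7), summed over the observed columns
    `js` of row i. T[r, j] holds the +/-1 targets T^r_{ij}; U is (n x d),
    V is (d x m) and theta[i, r] are the per-row thresholds."""
    def smooth_hinge_grad(z):
        return np.where(z >= 1.0, 0.0, np.where(z <= 0.0, -1.0, z - 1.0))

    grad = U[i].copy()                      # derivative of the Frobenius term
    for r in range(theta.shape[1]):
        z = T[r, js] * (theta[i, r] - U[i] @ V[:, js])
        grad -= C * (V[:, js] * (T[r, js] * smooth_hinge_grad(z))).sum(axis=1)
    return grad
```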


MRR corresponds to AMAU assumption


promote

scatter AMAU

[Shi et al. 2012]

Page 93: Tutorial bpocf

$$ \sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} \Bigl( P_{ui} (1 - S_{ui})^2 + (1 - P_{ui}) (0 - S_{ui})^2 \Bigr) + \lambda \|S^{(1)}\|_F + \lambda \|S^{(2)}\|_F - \alpha \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) H(P_{ui}) \tag{57} $$

$$ (1 - 0)^2 = 1 = (1 - 2)^2 \tag{58} $$

$$ w(j) = \sum_{u \in U} R_{uj} $$

$$ \sum_{u \in U} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \bigl( (R_{ui} - R_{uj}) - (S_{ui} - S_{uj}) \bigr)^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|_F^2 $$

$$ \sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|_F^2 $$

$$ \sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \Bigl( \lambda_{tf} \|S^{(t,f)}\|_F^2 + \|S^{(t,f)}\|_1 \Bigr) $$

$$ \max \; \frac{1}{|u| \cdot (|I| - |u|)} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \frac{S_{ui} - S_{uj}}{2} $$

$$ \max \; \min_{\alpha_{u*}} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \alpha_{ui}\, \alpha_{uj}\, (S_{ui} - S_{uj}) $$

$$ \sum_{u \in U} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \delta(S_{ui} > S_{uj}) $$

$$ \sum_{u \in U} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \frac{\delta(S_{uj} + 1 - S_{ui})}{r_>(S_{uj} \mid \{S_{uk} \mid R_{uk} = 0\})} \tag{59} $$

$$ \mathrm{AUC} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|u| \cdot (|I| - |u|)} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \delta(S_{ui} > S_{uj}) $$


kth-Order Statistic basis = AUC

[Weston et al. 2013]

Page 94: Tutorial bpocf

kth-Order Statistic strip normalization


[Weston et al. 2013]

Page 95: Tutorial bpocf

kth-Order Statistic focus on highly ranked negatives


[Weston et al. 2013]

Page 96: Tutorial bpocf

kth-Order Statistic weight known preferences by rank


[Weston et al. 2013]

Page 97: Tutorial bpocf

kth-Order Statistic non-differentiable


[Weston et al. 2013]

Page 98: Tutorial bpocf

kth-Order Statistic hinge loss & sampling approximations


Because this function is non-differentiable, Weston et al. propose the differentiable approximation

$$ D\bigl(S, \tilde{R}\bigr) = \sum_{u \in U} \sum_{R_{ui}=1} w\!\left(\frac{r_>(S_{ui} \mid \{S_{ui} \mid R_{ui}=1\})}{|u|}\right) \sum_{R_{uj}=0} \frac{\max(0,\, 1 + S_{uj} - S_{ui})}{N^{-1}\, |\{ j \in I \mid R_{uj} = 0 \}|}, \tag{31} $$

in which they replaced the indicator function by the hinge-loss and approximated the rank with N^{-1} |{j ∈ I | R_uj = 0}|, in which N is the number of items k that were randomly sampled until S_uk + 1 > S_ui² (a sketch of this sampling-based rank approximation is given below). Furthermore, Weston et al. use the simple weighting function

$$ w\!\left(\frac{r_>(S_{ui} \mid \{S_{ui} \mid R_{ui}=1\})}{|u|}\right) = \begin{cases} 1 & \text{if } r_>(S_{ui} \mid S \subseteq \{S_{ui} \mid R_{ui}=1\},\ |S| = K) = k, \\ 0 & \text{otherwise}, \end{cases} $$

i.e. from the set S of K randomly sampled known preferences, ordered by their predicted score, only the item at rank k is selected to contribute to the training error. When k is set low, the top of the ranking will be optimized at the cost of a worse mean rank. When k is set higher, the mean rank will be optimized at the cost of e.g. recall at 1 or MRR. The regularization is not done by adding a regularization term but by forcing the norm of the factor matrices to be below a maximum. Alternatively, Weston et al. also propose a simplified version

$$ D\bigl(S, \tilde{R}\bigr) = \sum_{u \in U} \sum_{R_{ui}=1} w\!\left(\frac{r_>(S_{ui} \mid \{S_{ui} \mid R_{ui}=1\})}{|u|}\right) \sum_{R_{uj}=0} \max(0,\, 1 + S_{uj} - S_{ui}). \tag{32} $$

To the best of our knowledge, nobody proposed a reconstruction based deviation function for this model yet.
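The sampling-based rank approximation referenced above can be illustrated with the following NumPy sketch (ours; the names and the return convention are assumptions): missing items are drawn until one violates the margin, and |{j | R_uj = 0}| / N then approximates the rank.

```python
import numpy as np

def sampled_rank_margin(S_u, R_u, i, rng=np.random):
    """Draw missing items until one violates the margin S_uk + 1 > S_ui;
    approximate the rank of i by |missing| / N (N = number of draws) and
    return it together with the hinge loss of the violating item."""
    missing = np.flatnonzero(R_u == 0)
    for n_draws in range(1, len(missing) + 1):
        k = rng.choice(missing)
        if S_u[k] + 1.0 > S_u[i]:
            return len(missing) / n_draws, max(0.0, 1.0 + S_u[k] - S_u[i])
    return 1.0, 0.0   # no violating item found: margin satisfied for all draws
```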

4.4. Group 4: Three Factor Matrices, Two Factor Matrices A Priori Unknown, Bias Terms

Kabbur et al. [Kabbur et al. 2013] propose FISM (factored item similarity matrix factorization), a fine-tuned version of the 3-factor matrix factorization:

$$ S_{ui} = U_u + I_i + \bigl(|u|^{-\alpha} R\bigr) S^{(2)} S^{(3)}, \tag{33} $$

with U the user-bias vector, I the item-bias vector and α a hyperparameter between 0 and 1. Besides the 3-factor matrix factorization, also the introduction of the user- and item-biases for bpo collaborative filtering sets this model apart. Notice that for computing top-N recommendations for a user u, the user-bias U_u is not important.

However, when trained, the above model results in trivial solutions for S^{(2)} and S^{(3)} that correspond to an item being similar to itself and dissimilar to all other items. This can be more easily understood by rewriting Equation 33 as

$$ S_{ui} = U_u + I_i + |u|^{-\alpha} \sum_{R_{uj}=1} S^{(2)}_{j*} S^{(3)}_{*i}. $$

Now, to avoid these trivial solutions for S^{(2)} and S^{(3)}, Kabbur et al. further enhance the model to:

$$ S_{ui} = U_u + I_i + (|u| - R_{ui})^{-\alpha} \sum_{R_{uj}=1} \bigl(1 - \delta(j = i)\bigr) \cdot S^{(2)}_{j*} S^{(3)}_{*i}, $$

²Weston et al. [Weston et al. 2011] provide a justification for this approximation.
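A minimal sketch (ours; array names, shapes and the guard against an empty sum are assumptions) of the enhanced FISM score above, with item i excluded from its own aggregation:

```python
import numpy as np

def fism_score(u, i, R, U_bias, I_bias, S2, S3, alpha):
    """S_ui = U_u + I_i + (|u| - R_ui)^(-alpha) * sum_{j != i, R_uj = 1} S2[j,:] . S3[:,i].
    S2 has one row per item, S3 one column per item."""
    known = np.flatnonzero(R[u] == 1)
    others = known[known != i]
    norm = max(len(known) - R[u, i], 1) ** (-alpha)   # guard against an empty sum base
    agg = sum(S2[j] @ S3[:, i] for j in others)
    return U_bias[u] + I_bias[i] + norm * agg
```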


pairs that are difficult to rank correctly. Therefore, Aiolli proposes to replace the uniform weighting with a weighting scheme that minimizes the total margin. Specifically, he proposes to solve, for every user u, the joint optimization problem

$$ S^{(1)}_{u*} = \arg\max_{S^{(1)}_{u*}} \min_{\alpha_{u*}} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \alpha_{ui}\,\alpha_{uj}\,(S_{ui} - S_{uj}), $$

where for every user u it holds that Σ_{R_ui=1} α_ui = 1 and Σ_{R_ui=0} α_ui = 1. To avoid overfitting of α, he adds two regularization terms:

$$ S^{(1)}_{u*} = \arg\max_{S^{(1)}_{u*}} \min_{\alpha_{u*}} \left( \sum_{R_{ui}=1} \sum_{R_{uj}=0} \alpha_{ui}\,\alpha_{uj}\,(S_{ui} - S_{uj}) + \lambda_p \sum_{R_{ui}=1} \alpha_{ui}^2 + \lambda_n \sum_{R_{ui}=0} \alpha_{ui}^2 \right), $$

with λ_p, λ_n regularization hyperparameters. S^{(1)} is regularized by means of the row-normalization constraint. Solving the above maximization for every user is equivalent to minimizing the deviation function

$$ D\bigl(S, \tilde{R}\bigr) = \sum_{u \in U} \left( \max_{\alpha_{u*}} \left( \sum_{R_{ui}=1} \sum_{R_{uj}=0} \alpha_{ui}\,\alpha_{uj}\,(S_{uj} - S_{ui}) - \lambda_p \sum_{R_{ui}=1} \alpha_{ui}^2 - \lambda_n \sum_{R_{ui}=0} \alpha_{ui}^2 \right) \right). \tag{29} $$

Notice that this approach corresponds to the AMAN assumption.

4.3. Group 3: Three Factor Matrices, One Factor Matrix A Priori Unknown

There are also algorithms which model S with 3 factor matrices:

$$ S = S^{(1)} S^{(2)} S^{(3)}. $$

To the best of our knowledge, they all follow the special case

$$ S = R\, S^{(2)} S^{(3)}. $$

In this case, the users are represented by |I|-dimensional binary vectors, the items are represented by f-dimensional real vectors, and the similarity between two items i and j is computed by the inner product S^{(2)}_{i*} S^{(3)}_{*j}, which means that S^{(2)} S^{(3)} represents the item-similarity matrix. Weston et al. [Weston et al. 2013] adopt a version of this model with a symmetric item-similarity matrix, which is imposed by setting S^{(3)} = S^{(2)T}.

On the one hand, the deviation functions in Equation 21 and Section 4.1.4 try to minimize the mean rank of the known preferences. On the other hand, the deviation function in Equation 22 tries to push one known preference as high as possible to the top of the item-ranking. Weston et al. [Weston et al. 2013] propose to minimize a trade-off between the above two extremes:

$$ \sum_{u \in U} \sum_{R_{ui}=1} w\!\left(\frac{r_>(S_{ui} \mid \{S_{ui} \mid R_{ui}=1\})}{|u|}\right) \sum_{R_{uj}=0} \frac{\delta(S_{uj} + 1 - S_{ui})}{r_>(S_{uj} \mid \{S_{uk} \mid R_{uk}=0\})}, \tag{30} $$

with w() a function that weights the importance of the known preference as a function of its predicted rank among all known preferences. This weighting function is user-defined and determines the trade-off between the two extremes, i.e. minimizing the mean rank of the known preferences and minimizing the maximal rank of the known preferences. Because this function is non-differentiable, Weston et al. propose the differentiable approximation in Equation 31.
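The trade-off deviation of Equation 30 can be evaluated directly for a small dense problem; the sketch below is ours and reads δ(S_uj + 1 − S_ui) as the indicator that the margin is violated, which is an assumption about the notation.

```python
import numpy as np

def weston_tradeoff_deviation(S, R, w):
    """Eq. (30): every known preference i of user u is weighted by w() of its
    relative rank among u's known preferences; every margin-violating missing
    item j contributes 1 / (its rank among the missing items)."""
    dev = 0.0
    for u in range(R.shape[0]):
        pos, neg = np.flatnonzero(R[u] == 1), np.flatnonzero(R[u] == 0)
        if len(pos) == 0 or len(neg) == 0:
            continue
        neg_rank = 1 + np.argsort(np.argsort(-S[u, neg]))   # 1-based, descending
        pos_rank = 1 + np.argsort(np.argsort(-S[u, pos]))
        for p, i in enumerate(pos):
            violating = S[u, neg] + 1.0 > S[u, i]
            dev += w(pos_rank[p] / len(pos)) * np.sum(violating / neg_rank)
    return dev
```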



[Slide graphic: of the K sampled known preferences at positions 1 … k … K, only the one at rank k is selected (0 0 0 1 0 0); missing items 1 … N are sampled until the first margin violation (false, false, false, true).]

[Weston et al. 2013]

Page 99: Tutorial bpocf

KL-divergence approximation of posterior pdf

$$ \nabla D(S, R) = \nabla \sum_{u \in U} \sum_{i \in I} D_{ui}(S, R) = \sum_{u \in U} \sum_{i \in I} \nabla D_{ui}(S, R) $$

$$ \nabla D(S, R) = \nabla \sum_{u \in U} \sum_{\substack{i \in I \\ R_{ui}=1}} \sum_{j \in I} D_{uij}(S, R) = \sum_{u \in U} \sum_{\substack{i \in I \\ R_{ui}=1}} \sum_{j \in I} \nabla D_{uij}(S, R) $$

$$ = \int (\;)\cdot p(\;\mid\;)\cdot d(\;) $$

$$ D(S, R) = D_{KL}\bigl(Q(S) \,\|\, p(S \mid R)\bigr) $$
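The first two identities say that the deviation, and hence its gradient, decomposes into per-entry or per-triple terms, which is what makes stochastic gradient descent applicable. A minimal sketch of one such step (ours; `grad_dev_ui` is an assumed callable returning the gradient of a single term):

```python
import numpy as np

def sgd_step(params, grad_dev_ui, pairs, lr=0.01, rng=np.random):
    """One stochastic gradient step: since D(S, R) = sum_{u,i} D_ui(S, R),
    the gradient of a uniformly sampled term is an unbiased (scaled)
    estimate of the full gradient."""
    u, i = pairs[rng.randint(len(pairs))]
    return params - lr * grad_dev_ui(params, u, i)
```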


Approximation of

[Koeningstein et al. 2012] [Paquet and Koeningstein 2013]

Page 100: Tutorial bpocf

Local Minima converge to local minimum

$$ S_{ui} = S^{(1)}_{u*} \cdot S^{(2)}_{*i} $$

$$ S = S^{(1)} S^{(2)} $$

$$ S = \bigl(S^{(1,1)} \cdots S^{(1,F_1)}\bigr) + \cdots + \bigl(S^{(T,1)} \cdots S^{(T,F_T)}\bigr) $$

$$ \max \sum_{R_{ui}=1} \log p(i \mid u) $$

$$ \max \sum_{R_{ui}=1} \log S_{ui} $$

$$ \min \; -\sum_{R_{ui}=1} \log S_{ui} $$

$$ D(S, R) = -\sum_{R_{ui}=1} \log S_{ui} $$

$$ \min \; D(S, R) $$
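The last two lines are the negative log-likelihood deviation minimized by the probabilistic (pLSA-style) models; a one-line NumPy sketch (ours, assuming the entries S_ui are the probabilities p(i | u)):

```python
import numpy as np

def loglik_deviation(S, R, eps=1e-12):
    """D(S, R) = - sum over observed (u, i) of log S_ui."""
    return -np.sum(np.log(S[R == 1] + eps))
```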


$$ \sum_{i \in I} \sum_{j \in I} \Bigl( \mathrm{sim}(j, i) \cdot |KNN(j) \cap \{i\}| - S^{(2)}_{ji} \Bigr)^2 $$

$$ \sum_{u \in U} \sum_{v \in U} \Bigl( \mathrm{sim}(u, v) \cdot |KNN(u) \cap \{v\}| - S^{(2)}_{uv} \Bigr)^2 $$

$$ S^{(2)}_{ji} = \mathrm{sim}(j, i) \cdot |KNN(j) \cap \{i\}| \qquad \text{for all } i, j \in I $$

$$ S^{(2)}_{uv} = \mathrm{sim}(u, v) \cdot |KNN(u) \cap \{v\}| \qquad \text{for all } u, v \in U $$

$$ S^{(3)}_{uv} = \mathrm{sim}(u, v) \cdot |KNN(u) \cap \{v\}| $$

$$ \sum_{i \in I} \sum_{j \in I} \Bigl( \mathrm{sim}(j, i) \cdot |KNN(j) \cap \{i\}| - S^{(2)}_{ji} \Bigr)^2 + \sum_{u \in U} \sum_{v \in U} \Bigl( \mathrm{sim}(u, v) \cdot |KNN(u) \cap \{v\}| - S^{(3)}_{uv} \Bigr)^2 $$

every row S^{(1)}_{u·} and every column S^{(2)}_{·i} the same unit vector

$$ O(|U| \times |I|) \qquad O\bigl(d^3(|U| + |I|) + d^2 |R|\bigr) $$

$$ \bigl(S^{(1,1)}, \ldots, S^{(T,F)}\bigr) $$
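A sketch of filling the pruned item-similarity factor S^{(2)}_{ji} = sim(j, i) · |KNN(j) ∩ {i}|; cosine similarity on the binary item columns of R is an assumed choice for sim(), and the function name is ours.

```python
import numpy as np

def knn_item_similarity(R, k, sim=None):
    """S2[j, i] = sim(j, i) if i is among the k most similar items to j, else 0."""
    if sim is None:
        norms = np.linalg.norm(R, axis=0) + 1e-12
        sim = (R.T @ R) / np.outer(norms, norms)   # item-item cosine similarity
    S2 = np.zeros_like(sim)
    for j in range(sim.shape[0]):
        s = sim[j].copy()
        s[j] = -np.inf                              # exclude the item itself
        top = np.argpartition(-s, k)[:k]            # indices of the k nearest items
        S2[j, top] = sim[j, top]
    return S2
```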


Page 101: Tutorial bpocf

Convex unique minimum


Convex Optimization Algorithm


Page 102: Tutorial bpocf

Max-Min-Margin AUC as average margin

Binary, Positive-Only Collaborative Filtering: A Theoretical and Experimental Comparison of the State Of The Art1:9Fast Maximum Margin Matrix Factorization

The thresholds �r can be learned from the data. Further-more, a different set of thresholds can be learned for eachuser, allowing users to “use ratings differently” and allevi-ates the need to normalize the data. The problem can thenbe written as:

minimize �X�⌃ + CX

ij2S

R�1X

r=1

h(T rij(�ir � Xij)) (4)

where the variables optimized over are the matrix X andthe thresholds �. In other work, we find that such a for-mulation is highly effective for rating prediction (Rennie &Srebro, 2005).

Although the problem was formulated here as a single op-timization problem with a combined objective, �X�⌃ +

C · error, it should really be viewed as a dual-objectiveproblem of balancing between low trace-norm and low er-ror. Considering the entire set of attainable (�X�⌃ , error)pairs, the true object of interest is the exterior “front” ofthis set, i.e. the set of matricesX for which it is not possi-ble to reduce one of the two objectives without increasingthe other. This “front” can be found by varying the valueof C from zero (hard-margin) to infinity (no norm regular-ization).

All optimization problems discussed in this section can bewritten as semi-definite programs (Srebro et al., 2005).

3. Optimization MethodsWe describe here a local search heursitic for the problem(4). Instead of searching over X , we search over pairs ofmatrices (U, V ), as well as sets of thresholds �, and attemptto minimize the objective:

J(U, V, �).=

1

2

(�U�2Fro + �V �2

Fro)

+ C

R�1X

r=1

X

ij2S

h⇣T r

ij(�ir � UiV�j )

⌘. (5)

For any U, V we have �UV �⌃ 12 (�U�2

Fro + �V �2Fro) and

so J(U, V, �) upper bounds the minimization objective of(4), where X = UV �. Furthermore, for anyX , and in par-ticular theX minimizing (4), some factorizationX = UV �

achieves �X�⌃ =

12 (�U�2

Fro + �V �2Fro). The minimization

problem (4) is therefore equivalent to:

minimize J(U, V, �). (6)

The advantage of considering (6) instead of (4) is that�X�⌃ is a complicated non-differentiable function forwhich it is not easy to find the subdifrential. Finding gooddescent directions for (4) is not easy. On the other hand, the

0

0.5

1

1.5

2

-0.5 0 0.5 1 1.5

Loss

z

HingeSmooth Hinge

-1.5

-1

-0.5

0

0.5

-0.5 0 0.5 1 1.5

Der

ivat

ive

of L

oss

z

HingeSmooth Hinge

Figure 1. Shown are the loss function values (left) and gradients(right) for the Hinge and Smooth Hinge. Note that the gradientsare identical outside the region z � (0, 1).

objective J(U, V, �) is fairly simple. Ignoring for the mo-ment the non-differentiability of h(z) = (1 � z)+ at one,the gradient of J(U, V, �) is easy to compute. The partialderivative with respect to each element of U is:

@J

@Uia= Uia � C

R�1X

r=1

X

j|ij2S

Tij(k)h�⇣T r

ij(�ir � UiV�j )

⌘Vja

(7)

The partial derivative with respect to Vja is analogous. Thepartial derivative with respect to �ik is

@J

@�ir= C

X

j|ij2S

T rijh

�⇣T r

ij(�ir � UiV�j )

⌘. (8)

With the gradient in-hand, we can turn to gradient descentmethods for localy optimizing J(U, V, �). The disadvan-tage of considering (6) instead of (4) is that although theminimization objective in (4) is a convex function of X, �,the objective J(U, V, �) is not a convex function of U, V .This is potentially bothersome, and might inhibit conver-gence to the global minimum.

3.1. Smooth Hinge

In the previous discussion, we ignored the non-differentiability of the Hinge loss function h(z) at z = 1.In order to give us a smooth optimization surface, we usean alternative to the Hinge loss, which we refer to as theSmooth Hinge. Figure 1 shows the Hinge and SmoothHinge loss functions. The Smooth Hinge shares manyproperties with the Hinge, but is much easier to optimizedirectly via gradient descent methods. Like the Hinge, theSmooth Hinge is not sensitive to outliers, and does notcontinuously “reward” the model for increasing the outputvalue for an example. This contrasts with other smooth lossfunctions, such as the truncated quadratic (which is sensi-tive to outliers) and the Logistic (which “rewards” largeoutput values). We use the Smooth Hinge and the corre-sponding objective for our experiments in Section 4.

Fig. 3. Shown are the loss function values $h(z)$ (left) and the gradients $dh(z)/dz$ (right) for the Hinge and Smooth Hinge. Note that the gradients are identical outside the region $z \in (0, 1)$ [Rennie and Srebro 2005].

Pan and Scholz [Pan and Scholz 2009] propose a bagging method for better scalability. Remark that both methods find different solutions to the minimization problem.

Besides the hinge loss, also the exponential and the binomial negative log-likelihood loss functions exhibit a similar behavior [Bishop 2006]:

$$l_{\exp}(\tilde{R}_{ui}, S_{ui}) = \exp(-\tilde{R}_{ui} S_{ui}),$$
$$l_{ll}(\tilde{R}_{ui}, S_{ui}) = \log\left(1 + \exp(-2 \tilde{R}_{ui} S_{ui})\right).$$

However, to the best of our knowledge, they have not yet been used for one-class collaborative filtering.
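Both losses are element-wise and straightforward to evaluate; a minimal NumPy illustration (not code from the survey):

```python
import numpy as np

def l_exp(R_tilde, S):
    # exponential loss: exp(-R~_ui * S_ui), element-wise
    return np.exp(-R_tilde * S)

def l_ll(R_tilde, S):
    # binomial negative log-likelihood: log(1 + exp(-2 * R~_ui * S_ui))
    return np.log1p(np.exp(-2.0 * R_tilde * S))
```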

4.1.4. Ranking Based Deviation Functions. The scores computed by recommender systems are often used to personally rank all items for every user. Therefore, Rendle et al. [Rendle et al. 2009] argued that it is natural to directly optimize the ranking. More specifically, they aim to maximize the area under the ROC curve (AUC), which is given by:

$$\mathrm{AUC} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|u| \cdot (|I| - |u|)} \sum_{R_{ui} > 0} \sum_{R_{uj} = 0} \delta(S_{ui} > S_{uj}),$$

with $\delta(\text{true}) = 1$ and $\delta(\text{false}) = 0$. If the AUC is higher, the pairwise rankings induced by the model S are more in line with the observed data R. However, because $\delta(S_{ui} > S_{uj})$ is non-differentiable, their deviation function is a differentiable approximation of the negative AUC from which constant factors have been removed and to which a regularization term has been added:

$$D\!\left(S, \tilde{R}\right) = \sum_{u \in U} \sum_{R_{ui} > 0} \sum_{R_{uj} = 0} \log \sigma(S_{uj} - S_{ui}) + \lambda_1 \|S^{(1)}\|_F^2 + \lambda_2 \|S^{(2)}\|_F^2,$$

with $\sigma(\cdot)$ the sigmoid function and $\lambda_1, \lambda_2$ regularization constants, which are hyperparameters of the method. Notice that this deviation function considers all missing feedback equally negative, i.e. it corresponds to the AMAN assumption.

However, very often only the N highest ranked items are shown to users. Therefore, Shi et al. [Shi et al. 2012] propose to optimize the mean reciprocal rank (MRR) instead of the AUC. The MRR is defined as

$$\mathrm{MRR} = \frac{1}{|U|} \sum_{u \in U} r_>\!\left(\max_{R_{ui}=1} S_{ui} \;\Big|\; S_{u*}\right)^{-1},$$

with $r_>(a \mid S_{u*})$ the rank of $a$ when the scores $S_{u*}$ are sorted in descending order.
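Both ranking measures can be computed exactly on small data by following the definitions above; the toy matrices in this sketch are purely illustrative.

```python
import numpy as np

def auc(R, S):
    # average over users of the fraction of (preferred, absent) pairs ranked correctly
    vals = []
    for u in range(R.shape[0]):
        pos, neg = S[u, R[u] == 1], S[u, R[u] == 0]
        if len(pos) and len(neg):
            vals.append(np.mean(pos[:, None] > neg[None, :]))
    return float(np.mean(vals))

def mrr(R, S):
    # reciprocal rank of the best-scored known preference within the full ranking S_u*
    vals = []
    for u in range(R.shape[0]):
        if R[u].sum():
            best = S[u, R[u] == 1].max()
            vals.append(1.0 / (1 + np.sum(S[u] > best)))   # r_>(best | S_u*)
    return float(np.mean(vals))

R = np.array([[1, 0, 1, 0, 0],
              [0, 1, 0, 0, 1]])
S = np.random.rand(2, 5)
print(auc(R, S), mrr(R, S))
```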


[Aiolli 2014]

Page 103: Tutorial bpocf

Max-Min-Margin AUC as average margin


[Aiolli 2014]

Page 104: Tutorial bpocf

4.2.1. Reconstruction Based Deviation Functions.

— Bell and Koren ord-rec? Is it only for ratings? Check.

Ning and Karypis [Ning and Karypis 2011] choose $S^{(1)} = R$ and propose a standard reconstruction based deviation function:

$$D\!\left(S, \tilde{R}\right) = \sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \lambda_F \|S^{(2)}\|_F^2 + \lambda_1 \|S^{(2)}\|_1, \quad (28)$$

with $\|\cdot\|_1$ the entry-wise $l_1$-norm, $\|\cdot\|_F$ the Frobenius norm, and $\lambda_1, \lambda_F$ the corresponding regularization constants, which are hyperparameters of the method. The role of the $l_1$-norm is to introduce sparsity. The role of the Frobenius norm is to prevent overfitting. Their combined use is called elastic net regularization, which is known to implicitly group correlated items. Furthermore, Ning and Karypis impose the constraints

$$S^{(2)} \ge 0, \qquad \mathrm{diag}(S^{(2)}) = 0.$$

The first constraint expresses non-negativity of the item-similarities. The second constraint avoids trivial solutions to the minimization of the deviation function in which every item would recommend itself. Notice that $S^{(2)}$ is not required to be symmetric. The sparsity induced by the $l_1$-norm regularization lowers the memory required for $S^{(2)}$ and speeds up the dot-product computation $S_{ui} = R_{u*} \cdot S^{(2)}_{*i}$. Further performance gains can be achieved by applying feature selection techniques. Ning and Karypis selected features by imposing $S^{(2)}_{ij} = 0$ if $i$ is not among the 100 items most similar to $j$, as measured by $\cos(i, j)$. Their experiments show that this way of working significantly reduced runtimes, while only slightly reducing accuracy.
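One common way to fit such an elastic-net-regularized, non-negative, zero-diagonal item-similarity matrix is to solve one regression per item column; the sketch below uses scikit-learn's ElasticNet as a stand-in solver, with illustrative hyperparameter values. It is a sketch of the general recipe, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_slim(R, l1_reg=0.001, l2_reg=0.0001):
    """Fit an item-item weight matrix W (playing the role of S^(2)) with
    R ~= R W, W >= 0 and diag(W) = 0, one elastic-net regression per item."""
    n_items = R.shape[1]
    W = np.zeros((n_items, n_items))
    alpha = l1_reg + l2_reg                       # sklearn's combined penalty strength
    model = ElasticNet(alpha=alpha, l1_ratio=l1_reg / alpha,
                       positive=True, fit_intercept=False)
    for j in range(n_items):
        X = R.copy()
        X[:, j] = 0.0                             # zeroing column j enforces W_jj = 0
        model.fit(X, R[:, j])
        W[:, j] = model.coef_
    return W

R = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1]], dtype=float)
W = fit_slim(R)
S = R @ W                                         # S_ui = R_u* . S^(2)_*i
```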

4.2.2. Ranking Based Deviation Functions. Also when $S^{(1)} = R$, it is possible to use ranking-based deviation functions. Rendle et al. propose to use exactly the same deviation function as in Equation 21 to optimize the AUC [Rendle et al. 2009]. The only difference is that for computing S, $RS^{(2)}$ is used instead of $S^{(1)}S^{(2)}$, i.e. only the second factor matrix is unknown. Because $S^{(2)}$ can be interpreted as an item-similarity matrix, they call this method BPR-kNN.

Aiolli [Aiolli 2014], on the other hand, chooses the user-based alternative with $S^{(2)} = \bar{R}$, with $\bar{R}$ the column-normalized version of R and $S^{(1)}$ row-normalized. Consequently, it holds that $-1 \le S_{ui} \le 1$, since $S_{ui} = S^{(1)}_{u*} \bar{R}_{*i}$ with $\|S^{(1)}_{u*}\| \le 1$ and $\|\bar{R}_{*i}\| \le 1$. For every individual user $u \in U$, he starts from $\mathrm{AUC}_u$, the AUC for u:

$$\mathrm{AUC}_u = \frac{1}{|u| \cdot (|I| - |u|)} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \delta(S_{ui} > S_{uj}).$$

Next, he proposes a lower bound on $\mathrm{AUC}_u$:

$$\mathrm{AUC}_u \ge \frac{1}{|u| \cdot (|I| - |u|)} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \frac{S_{ui} - S_{uj}}{2},$$

and interprets it as a weighted sum of margins $\frac{S_{ui} - S_{uj}}{2}$ between any known preference and any absent feedback, in which every margin gets the same weight $\frac{1}{|u| \cdot (|I| - |u|)}$. Hence, maximizing this lower bound on the AUC corresponds to maximizing the sum of margins between any known preference and any absent feedback, in which every margin has the same weight. A problem with maximizing this sum is that very high margins on pairs that are easily ranked correctly can hide poor (negative) margins on pairs that are difficult to rank correctly.
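Because the scores are bounded in $[-1, 1]$, the average margin can never exceed $\mathrm{AUC}_u$; this is easy to check numerically (the random scores below are only an illustration).

```python
import numpy as np

def auc_u(S_u, pos, neg):
    return np.mean(S_u[pos][:, None] > S_u[neg][None, :])

def average_margin(S_u, pos, neg):
    # uniform-weight lower bound: mean of (S_ui - S_uj) / 2 over all pairs
    return np.mean((S_u[pos][:, None] - S_u[neg][None, :]) / 2.0)

rng = np.random.default_rng(0)
S_u = rng.uniform(-1.0, 1.0, size=20)   # scores bounded in [-1, 1]
pos = np.arange(5)                       # items with R_ui = 1
neg = np.arange(5, 20)                   # items with R_uj = 0
assert average_margin(S_u, pos, neg) <= auc_u(S_u, pos, neg)
```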

Max-Min-Margin AUC as average margin


[Aiolli 2014]

Page 105: Tutorial bpocf

Max-Min-Margin AUC as average margin


[Aiolli 2014]

Page 106: Tutorial bpocf

Max-Min-Margin average → min total

$$\sum_{u \in U} \sum_{i \in I} R_{ui} W_{ui} (1 - S_{ui})^2 + \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) W_{ui} \left( P_{ui} (1 - S_{ui})^2 + (1 - P_{ui}) (0 - S_{ui})^2 \right) + \lambda \|S^{(1)}\|_F + \lambda \|S^{(2)}\|_F - \alpha \sum_{u \in U} \sum_{i \in I} (1 - R_{ui}) H(P_{ui}) \quad (57)$$

$$(1 - 0)^2 = 1 = (1 - 2)^2 \quad (58)$$

$$w(j) = \sum_{u \in U} R_{uj}$$

$$\sum_{u \in U} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \left( (R_{ui} - R_{uj}) - (S_{ui} - S_{uj}) \right)^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|_F^2$$

$$\sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|_F^2$$

$$\sum_{u \in U} \sum_{i \in I} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|_F^2 + \|S^{(t,f)}\|_1$$

$$\max \; \frac{1}{|u| \cdot (|I| - |u|)} \sum_{R_{ui}=1} \sum_{R_{uj}=0} \frac{S_{ui} - S_{uj}}{2}$$

$$\max \; \min_{\alpha_{u*}} \; \sum_{R_{ui}=1} \sum_{R_{uj}=0} \alpha_{ui} \alpha_{uj} (S_{ui} - S_{uj})$$
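Before any regularization on α is added (see the following slides), the inner minimization puts all weight on the hardest pair, so the max-min objective reduces to the single worst margin, $\min_i S_{ui} - \max_j S_{uj}$; the sketch below contrasts that with the uniform average margin of the previous slides (the toy scores are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
S_pos = rng.uniform(-1, 1, size=6)     # scores of one user's known preferences
S_neg = rng.uniform(-1, 1, size=30)    # scores of items without feedback

# uniform weights: the average margin maximized on the previous slides
average_margin = np.mean((S_pos[:, None] - S_neg[None, :]) / 2.0)

# adversarial weights without regularization: all mass on the hardest pair
worst_margin = S_pos.min() - S_neg.max()

print(average_margin, worst_margin)
```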

[Aiolli 2014]

Page 107: Tutorial bpocf

Max-Min-Margin average → min total


[Aiolli 2014]

Page 108: Tutorial bpocf

Max-Min-Margin add regularization


Therefore, Aiolli proposes to replace the uniform weighting with a weighting scheme that minimizes the total margin. Specifically, he proposes to solve, for every user u, the joint optimization problem

$$S^{(1)}_{u*} = \arg\max_{S^{(1)}_{u*}} \; \min_{\alpha_{u*}} \; \sum_{R_{ui}=1} \sum_{R_{uj}=0} \alpha_{ui} \alpha_{uj} (S_{ui} - S_{uj}),$$

where for every user u it holds that $\sum_{R_{ui}=1} \alpha_{ui} = 1$ and $\sum_{R_{ui}=0} \alpha_{ui} = 1$. To avoid overfitting of $\alpha$, he adds two regularization terms:

$$S^{(1)}_{u*} = \arg\max_{S^{(1)}_{u*}} \; \min_{\alpha_{u*}} \left( \sum_{R_{ui}=1} \sum_{R_{uj}=0} \alpha_{ui} \alpha_{uj} (S_{ui} - S_{uj}) + \lambda_p \sum_{R_{ui}=1} \alpha_{ui}^2 + \lambda_n \sum_{R_{ui}=0} \alpha_{ui}^2 \right),$$

with $\lambda_p, \lambda_n$ regularization hyperparameters. $S^{(1)}$ is regularized by means of the row-normalization constraint. Solving the above maximization for every user is equivalent to minimizing the deviation function

$$D\!\left(S, \tilde{R}\right) = \sum_{u \in U} \max_{\alpha_{u*}} \left( \sum_{R_{ui}=1} \sum_{R_{uj}=0} \alpha_{ui} \alpha_{uj} (S_{uj} - S_{ui}) - \lambda_p \sum_{R_{ui}=1} \alpha_{ui}^2 - \lambda_n \sum_{R_{ui}=0} \alpha_{ui}^2 \right). \quad (29)$$

Notice that this approach corresponds to the AMAN assumption.
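For a fixed score vector of one user, the regularized inner minimization over α is a quadratic problem over two probability simplices and can be approximated with projected gradient descent; the sketch below is one such approximation (the Euclidean simplex projection and the step size are standard choices assumed here, not prescribed by the survey).

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto {x : x >= 0, sum(x) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1.0), 0.0)

def inner_min(S_pos, S_neg, lam_p, lam_n, steps=500, lr=0.05):
    """Approximate min over (alpha_p, alpha_n) on two probability simplices of
       sum_ij a_p[i] a_n[j] (S_pos[i] - S_neg[j]) + lam_p ||a_p||^2 + lam_n ||a_n||^2."""
    M = S_pos[:, None] - S_neg[None, :]          # margin matrix S_ui - S_uj
    a_p = np.full(len(S_pos), 1.0 / len(S_pos))  # start from the uniform weighting
    a_n = np.full(len(S_neg), 1.0 / len(S_neg))
    for _ in range(steps):
        g_p = M @ a_n + 2.0 * lam_p * a_p        # gradient w.r.t. weights on positives
        g_n = M.T @ a_p + 2.0 * lam_n * a_n      # gradient w.r.t. weights on negatives
        a_p = project_simplex(a_p - lr * g_p)
        a_n = project_simplex(a_n - lr * g_n)
    return a_p @ M @ a_n + lam_p * a_p @ a_p + lam_n * a_n @ a_n
```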

4.3. Group 3: Three Factor Matrices, One Factor Matrix A Priori Unknown

There are also algorithms which model S with three factor matrices:

$$S = S^{(1)} S^{(2)} S^{(3)}.$$

To the best of our knowledge, they all follow the special case

$$S = R S^{(2)} S^{(3)}.$$

In this case, the users are represented by $|I|$-dimensional binary vectors, the items are represented by $f$-dimensional real vectors, and the similarity between two items $i$ and $j$ is computed by the inner product $S^{(2)}_{i*} S^{(3)}_{*j}$, which means that $S^{(2)} S^{(3)}$ represents the item-similarity matrix. Weston et al. [Weston et al. 2013] adopt a version of this model with a symmetric item-similarity matrix, which is imposed by setting $S^{(3)} = S^{(2)T}$.

On the one hand, the deviation functions in Equation 21 and Section 4.1.4 try to minimize the mean rank of the known preferences. On the other hand, the deviation function in Equation 22 tries to push one known preference as high as possible to the top of the item-ranking. Weston et al. [Weston et al. 2013] propose to minimize a trade-off between these two extremes:

$$\sum_{u \in U} \sum_{R_{ui}=1} w\!\left( \frac{r_>\!\big(S_{ui} \mid \{S_{ui} \mid R_{ui}=1\}\big)}{|u|} \right) \sum_{R_{uj}=0} \frac{\delta(S_{uj} + 1 > S_{ui})}{r_>\!\big(S_{uj} \mid \{S_{uk} \mid R_{uk}=0\}\big)}, \quad (30)$$

with $w(\cdot)$ a function that weights the importance of a known preference as a function of its predicted rank among all known preferences. This weighting function is user-defined and determines the trade-off between the two extremes, i.e. minimizing the mean rank of the known preferences and minimizing the maximal rank of the known preferences. Because this function is non-differentiable, Weston et al. propose a differentiable approximation.
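On a small dense score matrix, the trade-off loss (30) can be evaluated directly by following the definition; in the sketch below, the identity function stands in for the user-defined weighting $w(\cdot)$, and the data are illustrative.

```python
import numpy as np

def rank_weighted_loss(R, S, w=lambda x: x):
    """Direct evaluation of Eq. (30); w(.) stands in for the user-defined weighting."""
    total = 0.0
    for u in range(R.shape[0]):
        pos = np.where(R[u] == 1)[0]
        neg = np.where(R[u] == 0)[0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        S_pos, S_neg = S[u, pos], S[u, neg]
        # r_>(S_uj | negatives): descending rank of each negative among the negatives
        neg_rank = 1 + np.sum(S_neg[:, None] < S_neg[None, :], axis=1)
        for s_ui in S_pos:
            rank_pos = 1 + np.sum(S_pos > s_ui)        # r_>(S_ui | known preferences)
            violating = (S_neg + 1.0 > s_ui)           # delta(S_uj + 1 > S_ui)
            total += w(rank_pos / len(pos)) * np.sum(violating / neg_rank)
    return total
```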

[Aiolli 2014]

Page 109: Tutorial bpocf

Convex unique minimum

$$S_{ui} = S^{(1)}_{u*} \cdot S^{(2)}_{*i}$$

$$S = S^{(1)} S^{(2)}$$

$$S = \left( S^{(1,1)} \cdots S^{(1,F_1)} \right) + \cdots + \left( S^{(T,1)} \cdots S^{(T,F_T)} \right)$$

$$\max \sum_{R_{ui}=1} \log p(i \mid u)$$

$$\max \sum_{R_{ui}=1} \log S_{ui}$$

$$\min \; -\sum_{R_{ui}=1} \log S_{ui}$$

$$D(S, R) = -\sum_{R_{ui}=1} \log S_{ui}$$

$$\min \; D(S, R)$$
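A minimal sketch of evaluating this deviation function for a factorized $S = S^{(1)} S^{(2)}$ whose entries lie in $(0, 1)$; the way the toy factors are normalized below is an illustrative assumption.

```python
import numpy as np

def neg_log_likelihood(R, S):
    # D(S, R) = - sum over known preferences of log S_ui
    return -np.sum(np.log(S[R == 1]))

rng = np.random.default_rng(2)
S1 = rng.random((4, 3)); S1 /= S1.sum(axis=1, keepdims=True)   # rows sum to 1
S2 = rng.random((3, 5)); S2 /= S2.sum(axis=0, keepdims=True)   # columns sum to 1
S = S1 @ S2                                                    # entries lie in (0, 1)
R = (rng.random((4, 5)) < 0.3).astype(int)
print(neg_log_likelihood(R, S))
```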

Analytically computable

$$\sum_{i \in I} \sum_{j \in I} \left( \mathrm{sim}(j, i) \cdot |KNN(j) \cap \{i\}| - S^{(2)}_{ji} \right)^2$$

$$\sum_{u \in U} \sum_{v \in U} \left( \mathrm{sim}(u, v) \cdot |KNN(u) \cap \{v\}| - S^{(2)}_{uv} \right)^2$$

$$S^{(2)}_{ji} = \mathrm{sim}(j, i) \cdot |KNN(j) \cap \{i\}| \quad \text{for all } i, j \in I$$

$$S^{(2)}_{uv} = \mathrm{sim}(u, v) \cdot |KNN(u) \cap \{v\}| \quad \text{for all } u, v \in U$$

$$S^{(3)}_{uv} = \mathrm{sim}(u, v) \cdot |KNN(u) \cap \{v\}|$$

$$\sum_{i \in I} \sum_{j \in I} \left( \mathrm{sim}(j, i) \cdot |KNN(j) \cap \{i\}| - S^{(2)}_{ji} \right)^2 + \sum_{u \in U} \sum_{v \in U} \left( \mathrm{sim}(u, v) \cdot |KNN(u) \cap \{v\}| - S^{(3)}_{uv} \right)^2$$

every row $S^{(1)}_{u\cdot}$ and every column $S^{(2)}_{\cdot i}$ the same unit vector

$$O(|U| \times |I|)$$

$$O\!\left( d^3 (|U| + |I|) + d^2 |R| \right)$$

$$\left( S^{(1,1)}, \ldots, S^{(T,F)} \right)$$
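The item-based variant can be computed directly from the binary matrix: pairwise item similarities, pruned to the k nearest neighbours of each item. Cosine similarity and k = 2 are example choices here; the choice of sim(·, ·) is discussed on the following slides.

```python
import numpy as np

def knn_item_model(R, k=2):
    """S2[j, i] = sim(j, i) * |KNN(j) ∩ {i}|, with cosine similarity as example sim."""
    norms = np.linalg.norm(R, axis=0) + 1e-12
    sim = (R.T @ R) / np.outer(norms, norms)       # cosine similarity between item columns
    np.fill_diagonal(sim, 0.0)                     # an item is not its own neighbour
    S2 = np.zeros_like(sim)
    for j in range(sim.shape[0]):
        knn = np.argsort(sim[j])[::-1][:k]         # the k items most similar to j
        S2[j, knn] = sim[j, knn]                   # keep sim(j, i) only if i in KNN(j)
    return S2

R = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]], dtype=float)
S = R @ knn_item_model(R)                          # scores S_ui = (R S^(2))_ui
```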

Page 110: Tutorial bpocf

Nearest Neighbors user- or item-similarity

[Aiolli 2013] [Deshpande and Karypis 2004] [Sigurbjörnsson and Van Zwol 2008]

[Sarwar et al. 2001] [Mobasher et al. 2001] [Lin et al. 2002]

[Sarwar et al. 2000] [Menezes et al. 2010] [van Leeuwen and Puspitaningrum 2012]

Page 111: Tutorial bpocf

Nearest Neighbors similarity measures


[Aiolli 2013] [Deshpande and Karypis 2004] [Sigurbjörnsson and Van Zwol 2008]

[Sarwar et al. 2001] [Mobasher et al. 2001] [Lin et al. 2002]

[Sarwar et al. 2000] [Menezes et al. 2010] [van Leeuwen and Puspitaningrum 2012]

$$O(|R|)$$

$$O(|R| \times |I|)$$

$$-\eta \cdot \nabla D(S, R)$$

$$\nabla D(S, R) = \nabla \sum_{u \in U} \sum_{\substack{i \in I \\ R_{ui}=1}} D_{ui}(S, R) = \sum_{u \in U} \sum_{\substack{i \in I \\ R_{ui}=1}} \nabla D_{ui}(S, R)$$
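This decomposition of the gradient over the known preferences is what justifies stochastic gradient descent: sample one known preference (u, i) and update with $\nabla D_{ui}$ alone. A generic sketch for a two-factor model, with a squared per-observation deviation chosen purely as an example:

```python
import numpy as np

def sgd(R, d=8, lr=0.05, reg=0.01, epochs=20, seed=0):
    """Stochastic gradient descent over the known preferences, S = S^(1) S^(2)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    S1 = 0.1 * rng.standard_normal((n_users, d))
    S2 = 0.1 * rng.standard_normal((d, n_items))
    known = np.argwhere(R == 1)
    for _ in range(epochs):
        rng.shuffle(known)                            # visit known preferences in random order
        for u, i in known:
            # example per-observation deviation D_ui = (1 - S_ui)^2 plus L2 regularization
            err = 1.0 - S1[u] @ S2[:, i]
            grad_u = -2.0 * err * S2[:, i] + reg * S1[u]
            grad_i = -2.0 * err * S1[u] + reg * S2[:, i]
            S1[u] -= lr * grad_u                      # S <- S - eta * grad D_ui
            S2[:, i] -= lr * grad_i
    return S1, S2
```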

Page 112: Tutorial bpocf

Nearest Neighbors similarity measures

1:34 K. Verstrepen et al.

X

i2I

X

j2I

⇣sim(j, i) · |KNN (j) \ {i}| � S(2)

ji

⌘2

X

u2U

X

v2U

⇣sim(v, u) · |KNN (v) \ {u}| � S(2)

vu

⌘2

REFERENCESF. Aiolli. 2013. Efficient Top-N Recommendation for Very Large Scale Binary Rated Datasets. In RecSys.

273–280.Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In RecSys.

293–296.S.S. Anand and B. Mobasher. 2006. Contextual Recommendation. In WebMine. 142–160.C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n

recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on

top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems.39–46.

M. Deshpande and G. Karypis. 2004. Item-Based Top-N Recommendation Algorithms. TOIS 22, 1 (2004),143–177.

C. Desrosiers and G. Karypis. 2011. A Comprehensive Survey of Neighborhood-based RecommendationMethods. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P.B. Kantor (Eds.).Springer, Boston, MA.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linearmodels via coordinate descent. Journal of statistical software 33, 1 (2010), 1.

E. Gaussier and C. Goutte. 2005. Relation between PLSA and NMF and implications. In SIGIR. 601–602.T. Hofmann. 1999. Probabilistic Latent Semantic Indexing. In SIGIR. 50–57.Thomas Hofmann. 2004. Latent Semantic Models for Collaborative Filtering. ACM Trans. Inf. Syst. 22, 1

(2004), 89–115.F. Hoppner. 2005. Association Rules. In The Data Mining and Knowledge Discovery Handbook, O. Mainmon

and L. Rokach (Eds.). Springer, New York, NY.Y. Hu, Y. Koren, and C. Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM.

263–272.D. Jannach, M. Zanker, A. Felfernig, and G. Frierich. 2011. Recommender Systems: An Introduction. Cam-

bridge University Press, New York, NY.S. Kabbur, X. Ning, and G. Karypis. 2013. FISM: Factored Item Similarity Models for top-N Recommender

Systems. In KDD. 659–667.N. Koeningstein, N. Nice, U. Paquet, and N. Schleyen. 2012. The Xbox Recommender System. In RecSys.

281–284.Y. Koren and R. Bell. 2011. Advances in Collaborative Filtering. In Recommender Systems Handbook,

F. Ricci, L. Rokach, B. Shapira, and P.B. Kantor (Eds.). Springer, Boston, MA.Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender

systems. Computer 8 (2009), 30–37.W. Lin, S.a. Alvarez, and C. Ruiz. Efficient adaptive-support association rule mining for recommender sys-

tems. Data Min. Knowl. Discov. 6, 1 (????), 83–105.H. Ma. 2013. An Experimental Study on Implicit Social Recommendation. In SIGIR. 73–82.G.V. Menezes, J.M. Almeida, F.B. Belem, M.A. Goncalves, A. Lacerda, E. Silva de Moura, G.L. Pappa, A.

Veloso, and N. Ziviani. 2010. Demand Drive Tag Recommendation. In ECML/PKDD. 402–417.B. Mobasher, H. Dai, T. Luo, and M. Nakagawa. 2001. Effective Personalization Based on Association Rule

Discovery from Web Usage Data. In WIDM. 9–15.X. Ning and G. Karypis. 2011. Slim: Sparse Linear Methods for Top-N recommender systems. In ICDM.

497–506.R. Pan and M. Scholz. 2009. Mind the Gaps: Weighting the Unknown in Large-scale One-class Collaborative

Filtering. In KDD. 667–676.

ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015.

1:34 K. Verstrepen et al.

X

i2I

X

j2I

⇣sim(j, i) · |KNN (j) \ {i}| � S(2)

ji

⌘2

X

u2U

X

v2U

⇣sim(v, u) · |KNN (v) \ {u}| � S(2)

vu

⌘2

S(2)ji = sim(j, i) · |KNN (j) \ {i}|


[Aiolli 2013] [Deshpande and Karypis 2004] [Sigurbjörnsson and Van Zwol 2008]

[Sarwar et al. 2001] [Mobasher et al. 2001] [Lin et al. 2002]

[Sarwar et al. 2000] [Menezes et al. 2010] [van Leeuwen and Puspitaningrum 2012]


\[
\sum_{u \in U} \sum_{R_{ui}=1} \sum_{R_{uj}=0}
\frac{\delta\big(S_{uj} + 1 - S_{ui}\big)}
     {r_{>}\big(S_{uj} \mid \{S_{uk} \mid R_{uk}=0\}\big)}
\tag{60}
\]

\[
AUC = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|u| \cdot (|I| - |u|)}
\sum_{R_{ui}=1} \sum_{R_{uj}=0} \delta\big(S_{ui} > S_{uj}\big)
\]

\[
S^{(2)}_{ji} = \mathrm{sim}(j,i)\cdot|KNN(j)\cap\{i\}|,\qquad
S^{(2)}_{uv} = \mathrm{sim}(u,v)\cdot|KNN(u)\cap\{v\}|,\qquad
S^{(3)}_{uv} = \mathrm{sim}(u,v)\cdot|KNN(u)\cap\{v\}|
\]
for all \(i,j \in I\) and all \(u,v \in U\),
\[
\sum_{i \in I}\sum_{j \in I}\Big(\mathrm{sim}(j,i)\cdot|KNN(j)\cap\{i\}| - S^{(2)}_{ji}\Big)^2
\;+\; \sum_{u \in U}\sum_{v \in U}\Big(\mathrm{sim}(u,v)\cdot|KNN(u)\cap\{v\}| - S^{(3)}_{uv}\Big)^2
\]

every row \(S^{(1)}_{u\cdot}\) and every column \(S^{(2)}_{\cdot i}\) the same unit vector

\[
O(|U| \times |I|), \qquad O(|R|), \qquad O(|R| \times |I|), \qquad O\big(d^3(|U|+|I|) + d^2|R|\big)
\]

\[
(S^{(1,1)}, \ldots, S^{(T,F)})
\]

\[
-\eta \cdot \nabla D(S,R)
\]

\[
\nabla D(S,R) = \nabla \sum_{u \in U} \sum_{\substack{i \in I \\ R_{ui}=1}} D_{ui}(S,R)
= \sum_{u \in U} \sum_{\substack{i \in I \\ R_{ui}=1}} \nabla D_{ui}(S,R)
\]
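The AUC formula above can be evaluated directly from a score matrix S. The sketch below is a naive illustration rather than an optimized implementation; it assumes dense numpy arrays and skips users for which the AUC is undefined.

```python
import numpy as np

def auc(S, R):
    """AUC of score matrix S w.r.t. the binary feedback matrix R, per the formula above."""
    n_users, _ = R.shape
    total = 0.0
    for u in range(n_users):
        pos = np.where(R[u] == 1)[0]          # known preferences of u
        neg = np.where(R[u] == 0)[0]          # items without feedback
        if len(pos) == 0 or len(neg) == 0:
            continue                          # AUC undefined for this user; skip
        # count (i, j) pairs with R_ui = 1, R_uj = 0 and S_ui > S_uj
        correct = (S[u, pos][:, None] > S[u, neg][None, :]).sum()
        total += correct / (len(pos) * len(neg))
    return total / n_users

R = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0]], dtype=float)
S = np.array([[0.9, 0.4, 0.5, 0.1],
              [0.2, 0.8, 0.7, 0.3]])
print(auc(S, R))   # 1.0 would mean every known preference outranks every unknown item
```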


Page 113: Tutorial bpocf

Nearest Neighbors unified


[Verstrepen and Goethals 2014]


Page 114: Tutorial bpocf

Agenda •  Introduction •  Algorithms – Elegant example – Models – Deviation functions – Difference with rating-based algorithms – Parameter inference

•  Netflix

Page 115: Tutorial bpocf

Netflix Prize rating data

Page 116: Tutorial bpocf

n-star rating scale n=5

Page 117: Tutorial bpocf

n-star rating scale n=10

Page 118: Tutorial bpocf

n-star rating scale n=1

Page 119: Tutorial bpocf

No negative feedback

?

Page 120: Tutorial bpocf

Pearson Correlation not applicable

4.7. Discussion
A comment on model complexity, the tricks it requires to solve (regularization, smoothing), the quality of the solution (an arbitrary local minimum) and the closeness to the intuitive optimization objective.

5. USABILITY OF RATING BASED ALGORITHMS
Interest in collaborative filtering on binary, positive-only data only recently increased. The majority of the existing collaborative filtering research assumes rating data. In this case, the feedback of user u about item i, i.e. Rui, is an integer between Bl and Bh, with Bl and Bh the most negative and most positive feedback, respectively. The most typical example of rating data was provided in the context of the Netflix Prize, with Bl = 1 and Bh = 5.

Technically, our case of binary, positive-only data is just a special case of rating data with Bl = Bh = 1. However, collaborative filtering algorithms for rating data are in general built on the implicit assumption that Bl < Bh, i.e. that both positive and negative feedback is available. Since this negative feedback is not available in our problem setting, it is not surprising that, in general, algorithms for rating data generate poor or even nonsensical results [Hu et al. 2008; Pan et al. 2008].

k-NN algorithms for rating data, for example, often use the Pearson correlation coefficient as a similarity measure. The Pearson correlation coefficient between users u and v is given by

\[
pcc(u,v) = \frac{\sum_{R_{uj}, R_{vj} > 0} (R_{uj} - \overline{R_u})(R_{vj} - \overline{R_v})}
{\sqrt{\sum_{R_{uj}, R_{vj} > 0} (R_{uj} - \overline{R_u})^2}\;\sqrt{\sum_{R_{uj}, R_{vj} > 0} (R_{vj} - \overline{R_v})^2}},
\]

with \(\overline{R_u}\) and \(\overline{R_v}\) the average rating of u and v respectively. In our setting, with binary, positive-only data, however, Ruj and Rvj are by definition always one. Consequently, \(\overline{R_u}\) and \(\overline{R_v}\) are also always one. Therefore, the Pearson correlation is always zero or undefined (zero divided by zero), making it a useless similarity measure for binary, positive-only data. Even if we hacked it by omitting the mean-centering terms, \(-\overline{R_u}\) and \(-\overline{R_v}\), it would still be useless, since it would always be equal to either one or zero.

Furthermore, when computing the score of user u for item i, user(item)-based k-NN algorithms for rating data typically find the k users (items) that are most similar to u (i) and that have rated i (have been rated by u) [Desrosiers and Karypis 2011; Jannach et al. 2011]. On bpo data, this approach results in the nonsensical result that Sui = 1 for every (u, i)-pair.

Also the matrix factorization methods for rating data are in general not applicable to bpo data. Take for example a basic loss function for matrix factorization on rating data:

\[
\min_{S^{(1)}, S^{(2)}} \sum_{R_{ui} > 0} \Big( R_{ui} - S^{(1)}_{u\cdot} S^{(2)}_{\cdot i} \Big)^2
+ \lambda \Big( \|S^{(1)}_{u\cdot}\|_F^2 + \|S^{(2)}_{\cdot i}\|_F^2 \Big),
\]

which for bpo data simplifies to

\[
\min_{S^{(1)}, S^{(2)}} \sum_{R_{ui} > 0} \Big( 1 - S^{(1)}_{u\cdot} S^{(2)}_{\cdot i} \Big)^2
+ \lambda \Big( \|S^{(1)}_{u\cdot}\|_F^2 + \|S^{(2)}_{\cdot i}\|_F^2 \Big).
\]

The squared error term of this loss function is minimized when the rows and columns of S(1) and S(2) respectively are all the same unit vector. This is obviously a nonsensical solution.
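A quick numerical illustration of the Pearson argument above (a toy sketch, not from the tutorial itself): on binary, positive-only data every co-rated entry equals 1, so the mean-centered terms vanish and the correlation degenerates to 0/0.

```python
import numpy as np

def pearson(u_ratings, v_ratings):
    """Pearson correlation over the items co-rated by both users."""
    mask = (u_ratings > 0) & (v_ratings > 0)
    ru, rv = u_ratings[mask], v_ratings[mask]
    ru_c, rv_c = ru - ru.mean(), rv - rv.mean()
    denom = np.sqrt((ru_c ** 2).sum()) * np.sqrt((rv_c ** 2).sum())
    return (ru_c * rv_c).sum() / denom   # becomes 0 / 0 on binary, positive-only data

# rating data: Pearson is informative
u = np.array([5, 3, 0, 1], dtype=float)
v = np.array([4, 2, 5, 1], dtype=float)
print(pearson(u, v))          # some value in [-1, 1]

# binary, positive-only data: every co-rated entry is 1 -> nan (0 / 0)
u_bpo = np.array([1, 1, 0, 1], dtype=float)
v_bpo = np.array([1, 1, 1, 1], dtype=float)
print(pearson(u_bpo, v_bpo))  # nan: the similarity measure is useless here
```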


Page 121: Tutorial bpocf

Pearson Correlation not applicable


Page 122: Tutorial bpocf

Pearson Correlation not applicable


Page 123: Tutorial bpocf

Different Neighborhood trivial solutions


?
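As noted in the survey excerpt kept under Page 120, selecting neighbours the rating-data way (the k most similar users that have rated i) and averaging their ratings gives Sui = 1 for essentially every pair on binary, positive-only data. A minimal sketch of that degenerate behaviour, with toy data and cosine similarity as assumptions:

```python
import numpy as np

def user_knn_rating_style(R, u, i, k=2):
    """Rating-style user-based k-NN score: average rating of the k users most
    similar to u among those who have rated item i."""
    norms = np.linalg.norm(R, axis=1) + 1e-12
    sims = (R @ R[u]) / (norms * norms[u])      # cosine similarity to user u
    raters = [v for v in range(R.shape[0]) if v != u and R[v, i] > 0]
    if not raters:
        return 0.0
    neighbours = sorted(raters, key=lambda v: sims[v], reverse=True)[:k]
    return np.mean([R[v, i] for v in neighbours])   # every R[v, i] is 1 -> always 1.0

R = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 1, 1, 0]], dtype=float)

scores = [[user_knn_rating_style(R, u, i) for i in range(4)] for u in range(4)]
print(np.array(scores))   # 1.0 wherever at least one other user consumed the item
```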

Page 124: Tutorial bpocf


Matrix Factorization # trivial solutions = inf
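These slides revisit the squared-error argument from the excerpt above: the data term of the rating-style factorization loss is driven to zero by making every user row and every item column the same unit vector, and any rotation of that solution works just as well, so there are infinitely many trivial optima. A small numerical check of this, with toy dimensions chosen for the illustration:

```python
import numpy as np

np.random.seed(0)
n_users, n_items, d = 4, 5, 3
R = (np.random.rand(n_users, n_items) < 0.4).astype(float)  # toy bpo matrix

def squared_error(S1, S2, R):
    """Data term of the rating-style MF loss, summed over observed entries only."""
    S = S1 @ S2
    return ((1.0 - S[R == 1]) ** 2).sum()

# trivial solution: every row of S1 and every column of S2 equal to the same unit vector
e = np.zeros(d)
e[0] = 1.0
S1 = np.tile(e, (n_users, 1))            # all user factors identical
S2 = np.tile(e[:, None], (1, n_items))   # all item factors identical
print(squared_error(S1, S2, R))          # 0.0: perfect fit, but S is all ones -> useless ranking

# any orthogonal rotation Q gives another zero-error solution -> infinitely many of them
Q, _ = np.linalg.qr(np.random.randn(d, d))
print(squared_error(S1 @ Q, Q.T @ S2, R))  # ~0.0 as well
```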

Page 125: Tutorial bpocf


Matrix Factorization # trivial solutions = inf

Page 126: Tutorial bpocf


Matrix Factorization # trivial solutions = inf


Page 127: Tutorial bpocf


Matrix Factorization # trivial solutions = inf



Page 128: Tutorial bpocf

Agenda •  Introduction •  Algorithms – Elegant example – Models – Deviation functions – Difference with rating-based algorithms – Parameter inference

•  Netflix

Page 129: Tutorial bpocf

SGD mostly prohibitive


\[
O(|U| \times |I|)
\]
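Tying the pieces together under stated assumptions: the update rule \(-\eta \cdot \nabla D(S,R)\) shown earlier, applied to a deviation whose sum runs over all (u, i) pairs, touches on the order of |U| × |I| terms per pass, which is what makes plain (stochastic) gradient descent expensive here. The sketch below runs a few full-gradient steps on a squared-error deviation with a uniform weight on the unobserved entries; it is a WMF-flavoured stand-in for illustration, not the tutorial's exact objective.

```python
import numpy as np

np.random.seed(1)
n_users, n_items, d, eta, alpha = 4, 5, 3, 0.05, 0.1
R = (np.random.rand(n_users, n_items) < 0.4).astype(float)
W = np.where(R == 1, 1.0, alpha)           # uniform confidence for unobserved pairs

S1 = 0.1 * np.random.randn(n_users, d)
S2 = 0.1 * np.random.randn(d, n_items)

def deviation(S1, S2):
    E = R - S1 @ S2
    return (W * E ** 2).sum()              # sum over all |U| x |I| terms

for step in range(200):
    E = W * (R - S1 @ S2)                  # |U| x |I| residual matrix at every step
    grad_S1 = -2 * E @ S2.T                # gradient w.r.t. the user factors
    grad_S2 = -2 * S1.T @ E                # gradient w.r.t. the item factors
    S1 -= eta * grad_S1                    # the  -eta * grad D(S, R)  update
    S2 -= eta * grad_S2

print(round(deviation(S1, S2), 4))         # the deviation decreases over the steps
```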

\[
S_{ui} = S^{(1)}_{u\cdot} \cdot S^{(2)}_{\cdot i}, \qquad
S = S^{(1)} S^{(2)}, \qquad
S = \big( S^{(1,1)} \cdots S^{(1,F_1)} \big) + \cdots + \big( S^{(T,1)} \cdots S^{(T,F_T)} \big)
\]

\[
\max \sum_{R_{ui}=1} \log p(i \mid u)
\qquad
\max \sum_{R_{ui}=1} \log S_{ui}
\qquad
\min \; -\!\!\sum_{R_{ui}=1} \log S_{ui}
\]

\[
D(S,R) = -\sum_{R_{ui}=1} \log S_{ui}, \qquad \min D(S,R)
\]
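A minimal illustration of this deviation function on toy numbers: it evaluates D(S,R) = −Σ_{Rui=1} log Sui for a factored score matrix. Mapping the raw scores to per-user probabilities with a row-wise softmax is an assumption made only so that Sui behaves like p(i | u).

```python
import numpy as np

def softmax_rows(X):
    """Turn raw scores into per-user probabilities p(i | u) (rows sum to 1)."""
    Z = np.exp(X - X.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def deviation(S, R):
    """D(S, R) = - sum over known preferences of log S_ui."""
    return -np.log(S[R == 1]).sum()

np.random.seed(2)
n_users, n_items, d = 3, 4, 2
R = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 1]], dtype=float)

S1 = np.random.randn(n_users, d)          # user factors
S2 = np.random.randn(d, n_items)          # item factors
S = softmax_rows(S1 @ S2)                 # S = S(1) S(2), mapped to probabilities
print(round(deviation(S, R), 4))          # lower is better; minimized during training
```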



[Figure: gradient descent path — start, along the way, finish]

Page 130: Tutorial bpocf


(survey p. 1:34)

$$\sum_{i \in I}\sum_{j \in I}\Big(\mathrm{sim}(j,i)\cdot|KNN(j)\cap\{i\}| - S^{(2)}_{ji}\Big)^2 \qquad \sum_{u \in U}\sum_{v \in U}\Big(\mathrm{sim}(u,v)\cdot|KNN(u)\cap\{v\}| - S^{(2)}_{uv}\Big)^2$$

$$S^{(2)}_{ji} = \mathrm{sim}(j,i)\cdot|KNN(j)\cap\{i\}| \;\text{ for all } i,j \in I, \qquad S^{(2)}_{uv} = \mathrm{sim}(u,v)\cdot|KNN(u)\cap\{v\}|, \quad S^{(3)}_{uv} = \mathrm{sim}(u,v)\cdot|KNN(u)\cap\{v\}| \;\text{ for all } u,v \in U$$

$$\sum_{i \in I}\sum_{j \in I}\Big(\mathrm{sim}(j,i)\cdot|KNN(j)\cap\{i\}| - S^{(2)}_{ji}\Big)^2 + \sum_{u \in U}\sum_{v \in U}\Big(\mathrm{sim}(u,v)\cdot|KNN(u)\cap\{v\}| - S^{(3)}_{uv}\Big)^2$$

every row $S^{(1)}_{u\cdot}$ and every column $S^{(2)}_{\cdot i}$ the same unit vector

$$O(|U| \times |I|) \qquad O(|R|) \qquad O(|R| \times |I|) \qquad O\big(d^3(|U|+|I|) + d^2|R|\big)$$

$$\big(S^{(1,1)}, \ldots, S^{(T,F)}\big) \qquad -\eta \cdot \nabla\mathcal{D}(S,R)$$

$$\nabla\mathcal{D}(S,R) = \nabla \sum_{u \in U}\sum_{\substack{i \in I\\ R_{ui}=1}} \mathcal{D}_{ui}(S,R) = \sum_{u \in U}\sum_{\substack{i \in I\\ R_{ui}=1}} \nabla\mathcal{D}_{ui}(S,R)$$

$$\nabla\mathcal{D}(S,R) = \nabla \sum_{u \in U}\sum_{i \in I} \mathcal{D}_{ui}(S,R) = \sum_{u \in U}\sum_{i \in I} \nabla\mathcal{D}_{ui}(S,R)$$

$$\nabla\mathcal{D}(S,R) = \nabla \sum_{u \in U}\sum_{\substack{i \in I\\ R_{ui}=1}}\sum_{j \in I} \mathcal{D}_{uij}(S,R) = \sum_{u \in U}\sum_{\substack{i \in I\\ R_{ui}=1}}\sum_{j \in I} \nabla\mathcal{D}_{uij}(S,R)$$

x1000  

SGD mostly prohibitive
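The decomposition of the gradient into per-term gradients is what SGD exploits: it samples one term D_ui and takes a step -eta * grad D_ui, so the work per epoch grows with the number of terms (|R| versus |U| x |I| versus |R| x |I|). A minimal sketch, assuming a squared-error per-term deviation and the factorization S = S^(1)S^(2) purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, eta = 1000, 2000, 10, 0.01

R = (rng.random((n_users, n_items)) < 0.01).astype(float)    # sparse bpo data
S1 = 0.1 * rng.standard_normal((n_users, d))                 # user factors
S2 = 0.1 * rng.standard_normal((d, n_items))                 # item factors

def sgd_step(u, i):
    """One step on an assumed per-term deviation D_ui = (R_ui - S1[u] @ S2[:, i])^2."""
    err = R[u, i] - S1[u] @ S2[:, i]
    grad_u = -2.0 * err * S2[:, i]
    grad_i = -2.0 * err * S1[u]
    S1[u] -= eta * grad_u
    S2[:, i] -= eta * grad_i

# Deviation functions that sum over the known preferences only: |R| terms per epoch.
known = np.argwhere(R == 1)
for u, i in known[rng.permutation(len(known))]:
    sgd_step(u, i)

# Deviation functions that sum over all user-item pairs would need |U| x |I| updates
# per epoch instead -- here about a hundred times more.
print(len(known), n_users * n_items)

With the 1% density used here, summing over all user-item pairs means roughly a hundred times more updates per epoch than summing over the known preferences only, which is why plain SGD is mostly prohibitive for bpo data.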


[Shi et al. 2012]

Page 131: Tutorial bpocf

ALS if possible


(survey p. 1:27)

…than the global minimum. A possible mitigation is to rerun the minimization procedure with different initializations and choose the result that gives the best local minimum.

From a numerical optimization point of view, most reconstruction-based algorithms for bpo data pose a bigger computational challenge than reconstruction-based algorithms for rating data. The reason is that most deviation functions for rating data only sum over the known ratings and discard the unknown ratings, whereas most deviation functions for bpo data sum over all possible user-item pairs, which are easily 100 times more numerous.

For rating data, stochastic gradient descent (SGD) is generally the numerical optimization algorithm of choice. A (local) minimum of $\mathcal{D}(S,R)$ is found when $\nabla\mathcal{D}(S,R) = 0$, which is the same as $\sum_{(u,i):\,R_{ui}>0} \nabla\mathcal{D}_{ui}(S,R) = 0$. SGD randomly samples training ratings $R_{ui} > 0$ and for each of them updates the parameters $S^{(1)}_{u\cdot}$ and $S^{(2)}_{\cdot i}$ in the direction of the parameters for which $\nabla\mathcal{D}_{ui}(S,R) = 0$. For example, a parameter $S^{(1)}_{xy}$ is updated according to the rule

$$S^{(1)}_{xy} \leftarrow S^{(1)}_{xy} - \eta\,\frac{\partial \mathcal{D}_{ui}(S,R)}{\partial S^{(1)}_{xy}},$$

in which $\eta$ is the learning rate. A lower learning rate is more stable, but also slower. Ratings are sampled with replacement, and every rating is typically used multiple times on average, until a convergence criterion of choice is reached. However, when the summation over the known ratings, $\sum_{(u,i):\,R_{ui}>0}$, is replaced by a summation over all user-item pairs, $\sum_{(u,i)}$, and every rating needs to be considered multiple times (on average), SGD needs to perform approximately 100 times as many updates per iteration, which makes the algorithm less attractive for bpo data.

Therefore, algorithms for bpo data typically use a variant of the alternating least squares (ALS) method if the deviation function allows it [Koren et al. 2009; Hu et al. 2008]. In this respect, deviation functions 17 and 18 are appealing because they can be minimized with a variant of ALS. Take for example the deviation function from equation 17:

$$\mathcal{D}(S,R) = \sum_{u \in U}\sum_{i \in I} W_{ui}\,(R_{ui} - S_{ui})^2 + \lambda\big(\|S^{(1)}\|_F + \|S^{(2)}\|_F\big) = \sum_{u \in U}\sum_{i \in I} W_{ui}\,\big(R_{ui} - S^{(1)}_{u\cdot} S^{(2)}_{\cdot i}\big)^2 + \lambda\big(\|S^{(1)}\|_F + \|S^{(2)}\|_F\big).$$

Like most deviation functions, this one is non-convex in the parameters contained in $S^{(1)}$ and $S^{(2)}$ and therefore has multiple local optima. However, if one temporarily fixes the parameters in $S^{(1)}$, it becomes convex in $S^{(2)}$, and we can analytically find updated values for $S^{(2)}$ that minimize this convex function and are therefore guaranteed to reduce $\mathcal{D}(S,R)$. Subsequently, one can temporarily fix the parameters in $S^{(2)}$ and in the same way compute updated values for $S^{(1)}$ that are also guaranteed to reduce $\mathcal{D}(S,R)$. One can keep alternating between fixing $S^{(1)}$ and $S^{(2)}$ until a convergence criterion of choice is met. Hu et al. [Hu et al. 2008], Pan et al. [Pan et al. 2008] and Pan and Scholz [Pan and Scholz 2009] give detailed descriptions of possible ALS procedures. The description by Hu et al. contains optimizations for the case in which missing preferences are uniformly weighted. Pan and Scholz [Pan and Scholz 2009] describe optimizations that apply to a wider range of optimization schemes. These optimizations outperform their earlier work-around by means of a bagging method [Pan …]


fix – solve, solve – fix, fix – solve, solve – fix, fix – solve, solve – fix, …

[Hu et al. 2008] [Pan et al. 2008]

[Pan and Scholz 2009] [Pilászy et al. 2010]

[Zhou et al. 2008] [Yao et al. 2014]

[Takàcs and Tikk 2012]
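A minimal dense sketch of the alternation described above, assuming Hu et al.-style confidence weights W = 1 + alpha * R and squared-Frobenius regularization (the form for which the per-row solve has a closed form); the sizes, alpha and lambda are arbitrary, and the scaling optimizations from the cited papers are deliberately omitted:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, lam, alpha = 50, 80, 8, 0.1, 40.0

R = (rng.random((n_users, n_items)) < 0.05).astype(float)    # bpo feedback
W = 1.0 + alpha * R                   # assumed confidence weights: heavier on observed pairs
S1 = 0.1 * rng.standard_normal((n_users, d))                 # user factors, rows S1[u]
S2 = 0.1 * rng.standard_normal((d, n_items))                 # item factors, columns S2[:, i]

def solve_rows(F, Rmat, Wmat):
    """For every row r, solve min_x sum_c Wmat[r,c] * (Rmat[r,c] - x @ F[:,c])^2 + lam * ||x||^2."""
    X = np.empty((Rmat.shape[0], F.shape[0]))
    for r in range(Rmat.shape[0]):
        A = (F * Wmat[r]) @ F.T + lam * np.eye(F.shape[0])   # weighted normal equations
        b = (F * Wmat[r]) @ Rmat[r]
        X[r] = np.linalg.solve(A, b)
    return X

for _ in range(10):                        # fix - solve, solve - fix, ...
    S1 = solve_rows(S2, R, W)              # items fixed, user rows updated
    S2 = solve_rows(S1.T, R.T, W.T).T      # users fixed, item columns updated

# Deviation with squared-Frobenius regularization, the form the closed-form row solve assumes.
deviation = np.sum(W * (R - S1 @ S2) ** 2) + lam * (np.sum(S1 ** 2) + np.sum(S2 ** 2))
print(deviation)

Each half-step solves a ridge-regression problem per user (or per item) with the other factor matrix fixed, so the deviation cannot increase from one alternation to the next — the "fix – solve, solve – fix" pattern above.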

Page 132: Tutorial bpocf

SGD with Sampling if necessary

•  uniform pdf
•  uniform pdf + bagging
•  pdf ~ popularity
•  pdf ~ gradient size
•  discard samples until a large gradient is encountered

(a minimal sampling sketch follows the citations below)

[Rendle et al. 2009]

[Pan and Scholz 2009]

[Rendle and Freudenthaler 2014]

[Rendle and Freudenthaler 2014]

[Weston et al. 2013]
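A minimal sketch of SGD with negative sampling in the style of BPR [Rendle et al. 2009], with two of the listed proposal distributions (uniform and popularity-proportional) for the negative item; the pairwise logistic term, the factor sizes and the learning rate are assumptions made for the example:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, eta, lam = 500, 1000, 10, 0.05, 0.01

R = (rng.random((n_users, n_items)) < 0.02).astype(float)    # bpo feedback
S1 = 0.1 * rng.standard_normal((n_users, d))                 # user factors
S2 = 0.1 * rng.standard_normal((n_items, d))                 # item factors

known = np.argwhere(R == 1)
popularity = R.sum(axis=0) + 1.0
pop_pdf = popularity / popularity.sum()                      # pdf ~ popularity

def sample_negative(u, pdf=None):
    """Draw an item without known preference for user u, from a uniform or a given pdf."""
    while True:
        j = rng.choice(n_items, p=pdf)
        if R[u, j] == 0:
            return j

for _ in range(10_000):
    u, i = known[rng.integers(len(known))]    # a known preference, sampled uniformly
    j = sample_negative(u, pdf=pop_pdf)       # negative item; pass pdf=None for uniform sampling
    x = S1[u] @ (S2[i] - S2[j])               # pairwise score difference
    g = 1.0 / (1.0 + np.exp(x))               # gradient weight of the logistic pairwise term
    wu = S1[u].copy()
    S1[u] += eta * (g * (S2[i] - S2[j]) - lam * wu)
    S2[i] += eta * (g * wu - lam * S2[i])
    S2[j] += eta * (-g * wu - lam * S2[j])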

Page 133: Tutorial bpocf

Others

•  expectation maximization
•  cyclic coordinate descent
•  quadratic programming
•  direct computation
•  variational inference

[Hofmann 2004, Hofmann 1999]

[Ning and Karypis 2012] [Christakopoulou and Karypis 2014]

[Aiolli 2014]

[Aiolli 2013] [Deshpande and Karypis 2004]

[Sigurbjörnsson and Van Zwol 2008] [Sarwar et al. 2001]

[Mobasher et al. 2001] [Lin et al. 2002]

[Sarwar et al. 2000] [Menezes et al. 2010]

[van Leeuwen and Puspitaningrum 2012] [Verstrepen and Goethals 2014] [Verstrepen and Goethals 2015]

[Koenigstein et al. 2012]

[Paquet and Koenigstein 2013]

Page 134: Tutorial bpocf

Agenda •  Introduction •  Algorithms •  Netflix

Page 135: Tutorial bpocf

References


(survey p. 1:35)

$$\int \delta(\,\cdot\,) \cdot p(\,\cdot \mid \cdot\,)\; d(\,\cdot\,) \qquad \mathcal{D}(S,R) = D_{KL}\big(Q(S)\,\big\|\,p(S \mid R)\big) \qquad \ldots \qquad \max \text{ for every } (u,i)$$

$$\max\; \log p(S \mid R) \qquad \max\; \log \prod_{u \in U}\prod_{i \in I} S_{ui}^{\,\alpha R_{ui}}\,(1-S_{ui})$$

$$-\log \prod_{u \in U}\prod_{i \in I} S_{ui}^{\,\alpha R_{ui}}\,(1-S_{ui}) \;=\; -\sum_{u \in U}\sum_{i \in I}\Big(\alpha R_{ui}\log S_{ui} + \log(1-S_{ui})\Big)$$

$$-\sum_{u \in U}\sum_{i \in I}\Big(\alpha R_{ui}\log S_{ui} + \log(1-S_{ui})\Big) + \lambda\big(\|S^{(1)}\|_F^2 + \|S^{(2)}\|_F^2\big)$$
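A minimal sketch that evaluates the last deviation function above, assuming S_ui is obtained by squashing a factor product through a sigmoid so that it lies in (0, 1); alpha, lambda and the toy data are arbitrary choices for the example:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, alpha, lam = 30, 40, 5, 20.0, 0.1

R = (rng.random((n_users, n_items)) < 0.1).astype(float)     # toy bpo data
S1 = 0.1 * rng.standard_normal((n_users, d))
S2 = 0.1 * rng.standard_normal((d, n_items))

# Assumed parameterization: squash the factor product so every S_ui lies in (0, 1).
S = 1.0 / (1.0 + np.exp(-(S1 @ S2)))

deviation = (-np.sum(alpha * R * np.log(S) + np.log(1.0 - S))
             + lam * (np.sum(S1 ** 2) + np.sum(S2 ** 2)))
print(deviation)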

Fabio Aiolli. 2013. Efficient top-N recommendation for very large scale binary rated datasets. In Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 273–280.
Fabio Aiolli. 2014. Convex AUC optimization for top-N recommendation with implicit feedback. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 293–296.
Sarabjot Singh Anand and Bamshad Mobasher. 2007. Contextual Recommendation. Springer.
C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY.
Evangelia Christakopoulou and George Karypis. 2014. HOSLIM: Higher-order sparse linear method for top-N recommender systems. In Advances in Knowledge Discovery and Data Mining. Springer, 38–49.
Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-N recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 39–46.
Mukund Deshpande and George Karypis. 2004. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 143–177.
Christian Desrosiers and George Karypis. 2011. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook. Springer, 107–144.
Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1 (2010), 1.
Eric Gaussier and Cyril Goutte. 2005. Relation between PLSA and NMF and implications. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 601–602.


Page 136: Tutorial bpocf

References

Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 50–57.
Thomas Hofmann. 2004. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 89–115.
Frank Hoppner. 2005. Association Rules. In The Data Mining and Knowledge Discovery Handbook, Oded Mainmon and Lior Rokach (Eds.). Springer, New York, NY.
Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 263–272.
Dietmar Jannach, Markus Zanker, Alexander Felfernig, and Gerhard Friedrich. 2010. Recommender Systems: An Introduction. Cambridge University Press.
Santosh Kabbur and George Karypis. 2014. NLMF: Nonlinear matrix factorization methods for top-N recommender systems. In Data Mining Workshop (ICDMW), 2014 IEEE International Conference on. IEEE, 167–174.
Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: Factored item similarity models for top-N recommender systems. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 659–667.
Noam Koenigstein, Nir Nice, Ulrich Paquet, and Nir Schleyen. 2012. The Xbox recommender system. In Proceedings of the Sixth ACM Conference on Recommender Systems. ACM, 281–284.
Yehuda Koren and Robert Bell. 2011. Advances in collaborative filtering. In Recommender Systems Handbook. Springer, 145–186.
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
Weiyang Lin, Sergio A. Alvarez, and Carolina Ruiz. 2002. Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery 6, 1 (2002), 83–105.
Hao Ma. 2013. An experimental study on implicit social recommendation. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 73–82.
Guilherme Vale Menezes, Jussara M. Almeida, Fabiano Belem, Marcos Andre Goncalves, Anisio Lacerda, Edleno Silva De Moura, Gisele L. Pappa, Adriano Veloso, and Nivio Ziviani. 2010. Demand-driven tag recommendation. In Machine Learning and Knowledge Discovery in Databases. Springer, 402–417.
Bamshad Mobasher, Honghua Dai, Tao Luo, and Miki Nakagawa. 2001. Effective personalization based on association rule discovery from web usage data. In Proceedings of the 3rd International Workshop on Web Information and Data Management. ACM, 9–15.
Xia Ning and George Karypis. 2011. SLIM: Sparse linear methods for top-N recommender systems. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 497–506.
Rong Pan and Martin Scholz. 2009. Mind the gaps: Weighting the unknown in large-scale one-class collaborative filtering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 667–676.
Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-class collaborative filtering. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 502–511.
Ulrich Paquet and Noam Koenigstein. 2013. One-class collaborative filtering with random graphs. In Proceedings of the 22nd International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 999–1008.
Istvan Pilaszy, David Zibriczky, and Domonkos Tikk. 2010. Fast ALS-based matrix factorization for explicit and implicit feedback datasets. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 71–78.
Steffen Rendle and Christoph Freudenthaler. 2014. Improving pairwise learning for item recommendation from implicit feedback. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM, 273–282.
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
Jasson D. M. Rennie and Nathan Srebro. 2005. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning. ACM, 713–719.
Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2000. Analysis of recommendation algorithms for e-commerce. In Proceedings of the 2nd ACM Conference on Electronic Commerce. ACM, 158–167.

Page 137: Tutorial bpocf

References

Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web. ACM, 285–295.
Yue Shi, Alexandros Karatzoglou, Linas Baltrunas, Martha Larson, Nuria Oliver, and Alan Hanjalic. 2012. CLiMF: Learning to maximize reciprocal rank with collaborative less-is-more filtering. In Proceedings of the Sixth ACM Conference on Recommender Systems. ACM, 139–146.
Yue Shi, Martha Larson, and Alan Hanjalic. 2014. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR) 47, 1 (2014), 3.
Borkur Sigurbjornsson and Roelof Van Zwol. 2008. Flickr tag recommendation based on collective knowledge. In Proceedings of the 17th International Conference on World Wide Web. ACM, 327–336.
Vikas Sindhwani, Serhat S. Bucak, Jianying Hu, and Aleksandra Mojsilovic. 2010. One-class matrix completion with low-density factorizations. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 1055–1060.
Nathan Srebro, Jason Rennie, and Tommi S. Jaakkola. 2004. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems. 1329–1336.
Gabor Takacs and Domonkos Tikk. 2012. Alternating least squares for personalized ranking. In Proceedings of the Sixth ACM Conference on Recommender Systems. ACM, 83–90.
Lyle H. Ungar and Dean P. Foster. 1998. Clustering methods for collaborative filtering. In AAAI Workshop on Recommendation Systems, Vol. 1. 114–129.
Matthijs van Leeuwen and Diyah Puspitaningrum. 2012. Improving tag recommendation using few associations. In Advances in Intelligent Data Analysis XI. Springer, 184–194.
Koen Verstrepen and Bart Goethals. 2014. Unifying nearest neighbors collaborative filtering. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 177–184.
Koen Verstrepen and Bart Goethals. 2015. Top-N recommendation for shared accounts. In Proceedings of the 9th ACM Conference on Recommender Systems. ACM.
Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, Vol. 11. 2764–2770.
Jason Weston, Ron J. Weiss, and Hector Yee. 2013a. Nonlinear latent factorization by embedding multiple user interests. In Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 65–68.
Jason Weston, Hector Yee, and Ron J. Weiss. 2013b. Learning to rank recommendations with the k-order statistic loss. In Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 245–248.
Yuan Yao, Hanghang Tong, Guo Yan, Feng Xu, Xiang Zhang, Boleslaw K. Szymanski, and Jian Lu. 2014. Dual-regularized one-class collaborative filtering. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 759–768.
Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. 2008. Large-scale parallel collaborative filtering for the Netflix prize. In Algorithmic Aspects in Information and Management. Springer, 337–348.
