Data Mining Techniques
CS 6220 - Section 2 - Spring 2017

Lecture 10: Jan-Willem van de Meent (credit: Andrew Ng, Alex Smola, Yehuda Koren, Stanford CS246)

Project Deadlines

• 3 Feb: Form teams of 2-4 people
• 17 Feb: Submit abstract (1 paragraph)
• 3 Mar: Submit proposals (2 pages)
• 13 Mar: Milestone 1 (exploratory analysis)
• 31 Mar: Milestone 2 (statistical analysis)
• 16 Apr (Sun): Submit reports (10 pages)
• 21 Apr (Fri): Submit peer reviews

Project Reports

• ~10 pages (rough guideline)
• Guidelines for contents:
  • Introduction / Motivation
  • Exploratory analysis (if applicable)
  • Data mining analysis
  • Discussion of results

Project Review

• 2 per person (randomly assigned)
• Reviews should discuss 4 aspects of the report:
  • Clarity (is the writing clear?)
  • Technical merit (are methods valid?)
  • Reproducibility (is it clear how results were obtained?)
  • Discussion (are results interpretable?)

Final Exam

Emphasis on post-midterm topics (but some pre-midterm topics included)

Topic List: http://www.ccs.neu.edu/home/jwvdm/teaching/cs6220/spring2017/final-topics.html

Recommender Systems

The Long Tail

(from: https://www.wired.com/2004/10/tail/)


Problem Setting


• Task: Predict user preferences for unseen items

Content-based Filtering

(figure: "Latent factor models": movies such as The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, and Sense and Sensibility, plus users Gus and Dave, placed on two axes: "geared towards females" vs. "geared towards males", and "serious" vs. "escapist")

Content-based Filtering

(same "latent factor models" figure as above)

Idea: Predict rating using item features on a per-user basis

Content-based Filtering

(same "latent factor models" figure as above)

Idea: Predict rating using user features on a per-item basis

Collaborative Filtering, Part I: Basic neighborhood methods

(figure: user Joe connected to movies #1-#4 through the ratings of similar users)

Idea: Predict rating based on similarity to other users

Problem Setting

• Task: Predict user preferences for unseen items
• Content-based filtering: Model user/item features
• Collaborative filtering: Implicit similarity of users and items

Recommender Systems

• Movie recommendation (Netflix)
• Related product recommendation (Amazon)
• Web page ranking (Google)
• Social recommendation (Facebook)
• Priority inbox & spam filtering (Google)
• Online dating (OkCupid)
• Computational advertising (everyone)

Challenges

• Scalability
  • Millions of objects
  • 100s of millions of users
• Cold start
  • Changing user base
  • Changing inventory
• Imbalanced dataset
  • User activity / item reviews are power-law distributed
  • Ratings are not missing at random

Running Example: Netflix Data

Training data:

  user  movie  date      score
  1     21     5/7/02    1
  1     213    8/2/04    5
  2     345    3/6/01    4
  2     123    5/1/05    4
  2     768    7/15/02   3
  3     76     1/22/01   5
  4     45     8/3/00    4
  5     568    9/10/05   1
  5     342    3/5/03    2
  5     234    12/28/00  2
  6     76     8/11/02   5
  6     56     6/15/03   4

Test data:

  user  movie  date      score
  1     62     1/6/05    ?
  1     96     9/13/04   ?
  2     7      8/18/05   ?
  2     3      11/22/05  ?
  3     47     6/13/02   ?
  3     15     8/12/01   ?
  4     41     9/1/00    ?
  4     28     8/27/05   ?
  5     93     4/4/05    ?
  5     74     7/16/03   ?
  6     69     2/14/04   ?
  6     83     10/3/03   ?

Movie rating data

• Released as part of a $1M competition by Netflix in 2006
• Prize awarded to BellKor in 2009

Running Yardstick: RMSE

$$\mathrm{rmse}(S) = \sqrt{|S|^{-1} \sum_{(i,u) \in S} (\hat{r}_{ui} - r_{ui})^2}$$

(doesn't tell you how to actually do recommendation)
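As a quick illustration, here is a minimal NumPy sketch of this RMSE computation over a held-out set of ratings; the array names are illustrative, not from the lecture:

```python
import numpy as np

def rmse(r_hat, r):
    """Root mean squared error between predicted and observed ratings.

    r_hat, r: 1-D arrays of predicted and true ratings for the same
    (item, user) pairs in the test set S.
    """
    return np.sqrt(np.mean((r_hat - r) ** 2))

# Example: predictions vs. held-out ratings
predicted = np.array([3.8, 2.1, 4.5])
observed  = np.array([4.0, 2.0, 5.0])
print(rmse(predicted, observed))
```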

Content-based Filtering

Item-based Features

Per-user Regression

$$w_u = \arg\min_w \| r_u - X w \|^2$$

Learn a set of regression coefficients for each user
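A minimal sketch of this per-user regression, assuming a feature matrix over the items the user has rated and the vector of that user's ratings (the function names and the small ridge penalty are my choices for illustration):

```python
import numpy as np

def fit_user_weights(X_rated, r_u, lam=1.0):
    """Least-squares fit of one user's weights from item features.

    X_rated: (n_rated, d) features of the items this user has rated.
    r_u:     (n_rated,) the user's ratings of those items.
    lam:     small ridge penalty to keep the solve well conditioned.
    """
    d = X_rated.shape[1]
    return np.linalg.solve(X_rated.T @ X_rated + lam * np.eye(d),
                           X_rated.T @ r_u)

def predict(X_new, w_u):
    """Predicted ratings for unseen items with features X_new."""
    return X_new @ w_u
```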

User Bias and Item Popularity

Bias

(figure: example row for Moonrise Kingdom, rated 4, 5, 4, 4 by four users, with item feature values 0.3 and 0.2)

Problem: Some movies are universally loved / hated, and some users are more picky than others

Solution: Introduce a per-movie and per-user bias
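One standard way to write this baseline, using the same μ, bu, bi notation that appears later in the SGD slides, is:

$$\hat{r}_{ui} = \mu + b_u + b_i$$

where $\mu$ is the overall mean rating, $b_u$ captures how picky user $u$ is, and $b_i$ captures how universally loved or hated movie $i$ is.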

Collaborative Filtering

Neighborhood Based Methods

(same user/movie figure as above)

Users and items form a bipartite graph (edges are ratings)

Neighborhood Based Methods

(user, user) similarity
• Predict rating based on the average from the k-nearest users
• Good if the item base is small
• Good if the item base changes rapidly

(item, item) similarity
• Predict rating based on the average from the k-nearest items
• Good if the user base is small
• Good if the user base changes rapidly

Parzen-Window Style CF

• Define a similarity sij between items
• Find the set εk(i,u) of k-nearest neighbors to i that were rated by user u
• Predict the rating using a weighted average over this set (see the sketch below)
• How should we define sij?
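A minimal sketch of that weighted average, assuming we already have an item-item similarity matrix and a dict of the user's known ratings (all names are illustrative):

```python
import numpy as np

def predict_rating(i, user_ratings, sim, k=10):
    """Neighborhood-based prediction of user u's rating of item i.

    user_ratings: dict {item j: rating r_uj} of items u has rated.
    sim:          2-D array, sim[i, j] = similarity s_ij between items.
    k:            number of nearest rated neighbors to use.
    """
    rated = list(user_ratings.keys())
    # k items rated by u that are most similar to i
    neighbors = sorted(rated, key=lambda j: sim[i, j], reverse=True)[:k]
    weights = np.array([sim[i, j] for j in neighbors])  # assumed positive
    ratings = np.array([user_ratings[j] for j in neighbors])
    return weights @ ratings / weights.sum()
```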

Pearson Correlation Coefficient

$$s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}]\,\mathrm{Std}[r_{uj}]}$$

Estimating item-item similarities

• Common practice: rely on the Pearson correlation coefficient
• Challenge: non-uniform user support of item ratings; each item is rated by a distinct set of users

User ratings for item i:  1 ? ? 5 5 3 ? ? ? 4 2 ? ? ? ? 4 ? 5 4 1 ?
User ratings for item j:  ? ? 4 2 5 ? ? 1 2 5 ? ? 2 ? ? 3 ? ? ? 5 4

• Compute the correlation over the shared support

(item, item) similarity

Empirical estimate of the Pearson correlation coefficient:

$$\hat{\rho}_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}$$

Regularize towards 0 for small support:

$$s_{ij} = \frac{|U(i,j)| - 1}{|U(i,j)| - 1 + \lambda}\,\hat{\rho}_{ij}$$

Regularize towards the baseline for a small neighborhood
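A minimal sketch of this shrunk item-item correlation for two rating vectors with missing values (np.nan marks an unrated entry; using each item's mean over the shared support as the baseline is one simple choice, not necessarily the lecture's):

```python
import numpy as np

def shrunk_similarity(r_i, r_j, lam=100.0):
    """Pearson correlation over shared support, shrunk towards 0.

    r_i, r_j: 1-D arrays of ratings (np.nan where the user did not rate).
    lam:      shrinkage strength for item pairs with little shared support.
    """
    shared = ~np.isnan(r_i) & ~np.isnan(r_j)   # users who rated both items
    n = shared.sum()
    if n < 2:
        return 0.0
    di = r_i[shared] - r_i[shared].mean()      # center on shared support
    dj = r_j[shared] - r_j[shared].mean()
    denom = np.sqrt((di @ di) * (dj @ dj))
    if denom == 0.0:
        return 0.0
    rho = (di @ dj) / denom
    return (n - 1) / (n - 1 + lam) * rho
```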

Similarity for binary labels

Pearson correlation is not meaningful for binary labels (e.g. views, purchases, clicks).

Let m be the total number of users, mi the number of users acting on i, and mij the number of users acting on both i and j.

Jaccard similarity:

$$s_{ij} = \frac{m_{ij}}{\alpha + m_i + m_j - m_{ij}}$$

Observed / expected ratio:

$$s_{ij} = \frac{\text{observed}}{\text{expected}} \approx \frac{m_{ij}}{\alpha + m_i m_j / m}$$
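A minimal sketch of both similarities, assuming a binary user-by-item interaction matrix (1 = the user viewed/purchased/clicked the item); the matrix name and the default alpha are illustrative:

```python
import numpy as np

def binary_similarities(B, alpha=1.0):
    """Jaccard and observed/expected item-item similarities.

    B: (num_users, num_items) binary matrix of views, purchases, or clicks.
    """
    m = B.shape[0]                      # total number of users
    m_i = B.sum(axis=0)                 # users acting on each item
    m_ij = B.T @ B                      # users acting on both i and j
    jaccard = m_ij / (alpha + m_i[:, None] + m_i[None, :] - m_ij)
    obs_exp = m_ij / (alpha + m_i[:, None] * m_i[None, :] / m)
    return jaccard, obs_exp
```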

Matrix Factorization Methods

Matrix Factorization

(same Moonrise Kingdom example as above)

Idea: pose as a (biased) matrix factorization problem

Matrix Factorization

Basic matrix factorization model

(figure: a users-by-items rating matrix approximated as the product of a users-by-factors matrix and a factors-by-items matrix; a rank-3 SVD approximation)

Prediction

Estimate unknown ratings as inner-products of factors

(same rank-3 SVD figure as above, with one unknown entry marked "?")

Prediction

(same figure as above; the unknown entry is predicted as 2.4)
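In the notation used elsewhere in the lecture (wu for a user's factor row, xi for an item's), this prediction rule is just the inner product of the two factor rows, optionally on top of the bias baseline:

$$\hat{r}_{ui} = w_u^\top x_i \qquad \text{or} \qquad \hat{r}_{ui} = \mu + b_u + b_i + w_u^\top x_i$$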

SVD with missing values

Matrix factorization model (same rank-3 figure as above)

Properties:
• SVD isn't defined when entries are unknown ⇒ use specialized methods
• Very powerful model ⇒ can easily overfit, sensitive to regularization
• Probably the most popular model among Netflix contestants
  – 12/11/2006: Simon Funk describes an SVD-based method
  – 12/29/2006: Free implementation at timelydevelopment.com

Pose as a regression problem

Regularize using the Frobenius norm
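Combining "pose as a regression problem" with "regularize using the Frobenius norm", the training objective has the following form (written in my notation, summing only over observed entries):

$$\min_{W,X} \sum_{(u,i)\,:\,r_{ui}\ \text{observed}} \left(r_{ui} - w_u^\top x_i\right)^2 + \lambda\left(\lVert W \rVert_F^2 + \lVert X \rVert_F^2\right)$$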

Alternating Least Squares

Matrix factorization model (same rank-3 figure as above)

Step 1: regress wu given X


Regularization

Remember ridge regression?

L2: closed-form solution

$$w = (X^\top X + \lambda I)^{-1} X^\top y$$

L1: no closed-form solution; use quadratic programming:

$$\text{minimize } \| X w - y \|^2 \quad \text{s.t.} \quad \| w \|_1 \le s$$

(slide credit: Yijun Zhao, Linear Regression)

Alternating Least Squares

Matrix factorization model (same figure as above)

Alternate between the two regressions (see the sketch below):
• Step 1: regress wu given X
• Step 2: regress xi given W
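A minimal ALS sketch under these assumptions: ratings are given as a dense NumPy array with np.nan for missing entries, lam is the ridge penalty, and the array names and fixed iteration count are illustrative rather than taken from the lecture.

```python
import numpy as np

def als(R, k=3, lam=0.1, n_iters=20):
    """Alternating least squares for matrix factorization with missing data.

    R: (num_users, num_items) ratings, np.nan where unobserved.
    Returns W (num_users, k) and X (num_items, k) with R ~= W @ X.T.
    """
    n_users, n_items = R.shape
    observed = ~np.isnan(R)
    W = np.random.randn(n_users, k) * 0.1
    X = np.random.randn(n_items, k) * 0.1
    eye = lam * np.eye(k)
    for _ in range(n_iters):
        # Step 1: regress each w_u given X (ridge regression over items u rated)
        for u in range(n_users):
            idx = observed[u]
            Xu = X[idx]
            W[u] = np.linalg.solve(Xu.T @ Xu + eye, Xu.T @ R[u, idx])
        # Step 2: regress each x_i given W (ridge regression over users who rated i)
        for i in range(n_items):
            idx = observed[:, i]
            Wi = W[idx]
            X[i] = np.linalg.solve(Wi.T @ Wi + eye, Wi.T @ R[idx, i])
    return W, X
```

Each inner solve is exactly the ridge regression recalled on the previous slide, applied once per user and once per item.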

Stochastic Gradient Descent

Matrix factorization model (same figure as above)

• No need for locking
• Multicore updates run asynchronously
  (Recht, Re, Wright, 2012 - Hogwild)
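For a single observed rating r_ui, an SGD step on the regularized objective above takes the following form (a sketch in my notation, with eta the learning rate):

$$e_{ui} = r_{ui} - w_u^\top x_i, \qquad w_u \leftarrow w_u + \eta\,(e_{ui}\,x_i - \lambda\,w_u), \qquad x_i \leftarrow x_i + \eta\,(e_{ui}\,w_u - \lambda\,x_i)$$

Because each rating touches only one row of W and one row of X, updates from different cores rarely collide, which is why the lock-free, asynchronous Hogwild-style scheme works well here.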

Sampling Bias

Ratings are not given at random!

B. Marlin et al., "Collaborative Filtering and the Missing at Random Assumption", UAI 2007

(figure: distribution of ratings in Yahoo! survey answers, Yahoo! music ratings, and Netflix ratings)

Ratings are not given at random

• A powerful source of information: characterize users by which movies they rated, rather than how they rated them
• ⇒ A dense binary representation of the data:

(figure: the users-by-movies rating matrix alongside the corresponding binary users-by-movies matrix indicating which movies each user rated)

Which movies do users rate?

(diagram: the rating data feeds both a matrix factorization of the ratings and a regression on the binary "who rated what" matrix)

Temporal Effects

Changes in user behavior: something happened in early 2004...

(figure: Netflix ratings by date, showing a jump in 2004 when Netflix changed its rating labels)

Are movies getting better with time?

Temporal Effects

Are movies getting better with time?

Solution: Model temporal effects in the bias, not the weights
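"Model temporal effects in the bias, not the weights" can be read as making the baseline terms time dependent, for example (a sketch in my notation):

$$\hat{r}_{ui}(t) = \mu + b_u(t) + b_i(t) + w_u^\top x_i$$

so that drift in the rating scale (the 2004 jump) and effects of movie age are absorbed by $b_u(t)$ and $b_i(t)$ rather than by the latent factors.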

Netflix Prize

Training data
• 100 million ratings, 480,000 users, 17,770 movies
• 6 years of data: 2000-2005

Test data
• Last few ratings of each user (2.8 million)
• Evaluation criterion: Root Mean Square Error (RMSE)

Competition
• 2,700+ teams
• Netflix's own system: RMSE 0.9514
• $1 million prize for a 10% improvement over Netflix

Improvements

(figure: "Factor models: Error vs. #parameters", RMSE (0.875 to 0.91) against millions of parameters (10 to 100,000) for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, and SVD v.4; points are labeled with the number of factors, from 40 up to 1500)

Add biases

Do SGD, but also learn biases μ, bu and bi
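"Do SGD, but also learn biases μ, bu and bi" then amounts to extending the earlier update (again a sketch in my notation):

$$e_{ui} = r_{ui} - (\mu + b_u + b_i + w_u^\top x_i), \qquad b_u \leftarrow b_u + \eta\,(e_{ui} - \lambda\,b_u), \qquad b_i \leftarrow b_i + \eta\,(e_{ui} - \lambda\,b_i)$$

with the same factor updates as before.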

Improvements

Account for the fact that ratings are not missing at random: use the "who rated what" information.

(same RMSE vs. #parameters figure as above)

Improvements

Add temporal effects.

(same RMSE vs. #parameters figure as above)

Improvements

(same RMSE vs. #parameters figure as above, with the temporal-effects model included)

Still pretty far from the 0.8563 grand prize

Winning Solution from BellKor

(slides credit: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

Last 30 days: the June 26th submission triggers the 30-day "last call"

BellKor fends off competitors by a hair

Ratings aren't everything: Netflix then vs. Netflix now

• Only the simpler submodels (SVD, RBMs) were implemented
• Ratings eventually proved to be only weakly informative