Data Mining Techniques
CS 6220 - Section 2 - Spring 2017

Lecture 10: Jan-Willem van de Meent (credit: Andrew Ng, Alex Smola, Yehuda Koren, Stanford CS246)

Project Deadlines

• 3 Feb: Form teams of 2-4 people
• 17 Feb: Submit abstract (1 paragraph)
• 3 Mar: Submit proposals (2 pages)
• 13 Mar: Milestone 1 (exploratory analysis)
• 31 Mar: Milestone 2 (statistical analysis)
• 16 Apr (Sun): Submit reports (10 pages)
• 21 Apr (Fri): Submit peer reviews

Project Reports

• ~10 pages (rough guideline)
• Guidelines for contents:
  • Introduction / Motivation
  • Exploratory analysis (if applicable)
  • Data mining analysis
  • Discussion of results

Project Review

• 2 per person (randomly assigned)
• Reviews should discuss 4 aspects of the report:
  • Clarity (is the writing clear?)
  • Technical merit (are methods valid?)
  • Reproducibility (is it clear how results were obtained?)
  • Discussion (are results interpretable?)

Final Exam

Emphasis on post-midterm topics (but some pre-midterm topics included)

Topic List: http://www.ccs.neu.edu/home/jwvdm/teaching/cs6220/spring2017/final-topics.html

Recommender Systems

The Long Tail

(from: https://www.wired.com/2004/10/tail/)


Problem Setting


• Task: Predict user preferences for unseen items

Content-based Filtering

(figure: "Latent factor models": movies such as The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, and Sense and Sensibility, plus users Gus and Dave, placed on two axes: "geared towards females" vs. "geared towards males", and "serious" vs. "escapist")

Content-based Filtering

(same "latent factor models" figure as above)

Idea: Predict rating using item features on a per-user basis

Content-based Filtering

(same "latent factor models" figure as above)

Idea: Predict rating using user features on a per-item basis

Collaborative Filtering, Part I: Basic neighborhood methods

(figure: user Joe connected to movies #1-#4 through the ratings of similar users)

Idea: Predict rating based on similarity to other users

Problem Setting

• Task: Predict user preferences for unseen items
• Content-based filtering: Model user/item features
• Collaborative filtering: Implicit similarity of users and items

Recommender Systems

• Movie recommendation (Netflix)
• Related product recommendation (Amazon)
• Web page ranking (Google)
• Social recommendation (Facebook)
• Priority inbox & spam filtering (Google)
• Online dating (OkCupid)
• Computational advertising (everyone)

Challenges

• Scalability
  • Millions of objects
  • 100s of millions of users
• Cold start
  • Changing user base
  • Changing inventory
• Imbalanced dataset
  • User activity / item reviews are power-law distributed
  • Ratings are not missing at random

Running Example: Netflix Data

Training data:

  user  movie  date      score
  1     21     5/7/02    1
  1     213    8/2/04    5
  2     345    3/6/01    4
  2     123    5/1/05    4
  2     768    7/15/02   3
  3     76     1/22/01   5
  4     45     8/3/00    4
  5     568    9/10/05   1
  5     342    3/5/03    2
  5     234    12/28/00  2
  6     76     8/11/02   5
  6     56     6/15/03   4

Test data:

  user  movie  date      score
  1     62     1/6/05    ?
  1     96     9/13/04   ?
  2     7      8/18/05   ?
  2     3      11/22/05  ?
  3     47     6/13/02   ?
  3     15     8/12/01   ?
  4     41     9/1/00    ?
  4     28     8/27/05   ?
  5     93     4/4/05    ?
  5     74     7/16/03   ?
  6     69     2/14/04   ?
  6     83     10/3/03   ?

Movie rating data

• Released as part of a $1M competition by Netflix in 2006
• Prize awarded to BellKor in 2009

Running Yardstick: RMSE

$$\mathrm{rmse}(S) = \sqrt{|S|^{-1} \sum_{(i,u) \in S} (\hat{r}_{ui} - r_{ui})^2}$$

(doesn't tell you how to actually do recommendation)
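As a quick illustration, here is a minimal NumPy sketch of this RMSE computation over a held-out set of ratings; the array names are illustrative, not from the lecture:

```python
import numpy as np

def rmse(r_hat, r):
    """Root mean squared error between predicted and observed ratings.

    r_hat, r: 1-D arrays of predicted and true ratings for the same
    (item, user) pairs in the test set S.
    """
    return np.sqrt(np.mean((r_hat - r) ** 2))

# Example: predictions vs. held-out ratings
predicted = np.array([3.8, 2.1, 4.5])
observed  = np.array([4.0, 2.0, 5.0])
print(rmse(predicted, observed))
```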

Content-based Filtering

Item-based Features

Per-user Regression

$$w_u = \arg\min_w \| r_u - X w \|^2$$

Learn a set of regression coefficients for each user
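A minimal sketch of this per-user regression, assuming a feature matrix over the items the user has rated and the vector of that user's ratings (the function names and the small ridge penalty are my choices for illustration):

```python
import numpy as np

def fit_user_weights(X_rated, r_u, lam=1.0):
    """Least-squares fit of one user's weights from item features.

    X_rated: (n_rated, d) features of the items this user has rated.
    r_u:     (n_rated,) the user's ratings of those items.
    lam:     small ridge penalty to keep the solve well conditioned.
    """
    d = X_rated.shape[1]
    return np.linalg.solve(X_rated.T @ X_rated + lam * np.eye(d),
                           X_rated.T @ r_u)

def predict(X_new, w_u):
    """Predicted ratings for unseen items with features X_new."""
    return X_new @ w_u
```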

User Bias and Item Popularity

Bias

(figure: example row for Moonrise Kingdom, rated 4, 5, 4, 4 by four users, with item feature values 0.3 and 0.2)

Problem: Some movies are universally loved / hated, and some users are more picky than others

Solution: Introduce a per-movie and per-user bias
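One standard way to write this baseline, using the same μ, bu, bi notation that appears later in the SGD slides, is:

$$\hat{r}_{ui} = \mu + b_u + b_i$$

where $\mu$ is the overall mean rating, $b_u$ captures how picky user $u$ is, and $b_i$ captures how universally loved or hated movie $i$ is.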

Collaborative Filtering

Neighborhood Based Methods

(same user/movie figure as above)

Users and items form a bipartite graph (edges are ratings)

Neighborhood Based Methods

(user, user) similarity
• Predict rating based on the average from the k-nearest users
• Good if the item base is small
• Good if the item base changes rapidly

(item, item) similarity
• Predict rating based on the average from the k-nearest items
• Good if the user base is small
• Good if the user base changes rapidly

Parzen-Window Style CF

• Define a similarity sij between items
• Find the set εk(i,u) of k-nearest neighbors to i that were rated by user u
• Predict the rating using a weighted average over this set (see the sketch below)
• How should we define sij?
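A minimal sketch of that weighted average, assuming we already have an item-item similarity matrix and a dict of the user's known ratings (all names are illustrative):

```python
import numpy as np

def predict_rating(i, user_ratings, sim, k=10):
    """Neighborhood-based prediction of user u's rating of item i.

    user_ratings: dict {item j: rating r_uj} of items u has rated.
    sim:          2-D array, sim[i, j] = similarity s_ij between items.
    k:            number of nearest rated neighbors to use.
    """
    rated = list(user_ratings.keys())
    # k items rated by u that are most similar to i
    neighbors = sorted(rated, key=lambda j: sim[i, j], reverse=True)[:k]
    weights = np.array([sim[i, j] for j in neighbors])  # assumed positive
    ratings = np.array([user_ratings[j] for j in neighbors])
    return weights @ ratings / weights.sum()
```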

Pearson Correlation Coefficient

$$s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}]\,\mathrm{Std}[r_{uj}]}$$

Estimating item-item similarities

• Common practice: rely on the Pearson correlation coefficient
• Challenge: non-uniform user support of item ratings; each item is rated by a distinct set of users

User ratings for item i:  1 ? ? 5 5 3 ? ? ? 4 2 ? ? ? ? 4 ? 5 4 1 ?
User ratings for item j:  ? ? 4 2 5 ? ? 1 2 5 ? ? 2 ? ? 3 ? ? ? 5 4

• Compute the correlation over the shared support

(item, item) similarity

Empirical estimate of the Pearson correlation coefficient:

$$\hat{\rho}_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}$$

Regularize towards 0 for small support:

$$s_{ij} = \frac{|U(i,j)| - 1}{|U(i,j)| - 1 + \lambda}\,\hat{\rho}_{ij}$$

Regularize towards the baseline for a small neighborhood
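A minimal sketch of this shrunk item-item correlation for two rating vectors with missing values (np.nan marks an unrated entry; using each item's mean over the shared support as the baseline is one simple choice, not necessarily the lecture's):

```python
import numpy as np

def shrunk_similarity(r_i, r_j, lam=100.0):
    """Pearson correlation over shared support, shrunk towards 0.

    r_i, r_j: 1-D arrays of ratings (np.nan where the user did not rate).
    lam:      shrinkage strength for item pairs with little shared support.
    """
    shared = ~np.isnan(r_i) & ~np.isnan(r_j)   # users who rated both items
    n = shared.sum()
    if n < 2:
        return 0.0
    di = r_i[shared] - r_i[shared].mean()      # center on shared support
    dj = r_j[shared] - r_j[shared].mean()
    denom = np.sqrt((di @ di) * (dj @ dj))
    if denom == 0.0:
        return 0.0
    rho = (di @ dj) / denom
    return (n - 1) / (n - 1 + lam) * rho
```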

Similarity for binary labels

Pearson correlation is not meaningful for binary labels (e.g. views, purchases, clicks).

Let m be the total number of users, mi the number of users acting on i, and mij the number of users acting on both i and j.

Jaccard similarity:

$$s_{ij} = \frac{m_{ij}}{\alpha + m_i + m_j - m_{ij}}$$

Observed / expected ratio:

$$s_{ij} = \frac{\text{observed}}{\text{expected}} \approx \frac{m_{ij}}{\alpha + m_i m_j / m}$$
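A minimal sketch of both similarities, assuming a binary user-by-item interaction matrix (1 = the user viewed/purchased/clicked the item); the matrix name and the default alpha are illustrative:

```python
import numpy as np

def binary_similarities(B, alpha=1.0):
    """Jaccard and observed/expected item-item similarities.

    B: (num_users, num_items) binary matrix of views, purchases, or clicks.
    """
    m = B.shape[0]                      # total number of users
    m_i = B.sum(axis=0)                 # users acting on each item
    m_ij = B.T @ B                      # users acting on both i and j
    jaccard = m_ij / (alpha + m_i[:, None] + m_i[None, :] - m_ij)
    obs_exp = m_ij / (alpha + m_i[:, None] * m_i[None, :] / m)
    return jaccard, obs_exp
```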

Matrix Factorization Methods

Matrix Factorization

(same Moonrise Kingdom example as above)

Idea: pose as a (biased) matrix factorization problem

Matrix Factorization

Basic matrix factorization model

(figure: a users-by-items rating matrix approximated as the product of a users-by-factors matrix and a factors-by-items matrix; a rank-3 SVD approximation)

Prediction

Estimate unknown ratings as inner-products of factors

(same rank-3 SVD figure as above, with one unknown entry marked "?")

Prediction

(same figure as above; the unknown entry is predicted as 2.4)
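In the notation used elsewhere in the lecture (wu for a user's factor row, xi for an item's), this prediction rule is just the inner product of the two factor rows, optionally on top of the bias baseline:

$$\hat{r}_{ui} = w_u^\top x_i \qquad \text{or} \qquad \hat{r}_{ui} = \mu + b_u + b_i + w_u^\top x_i$$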

SVD with missing values

Matrix factorization model (same rank-3 figure as above)

Properties:
• SVD isn't defined when entries are unknown ⇒ use specialized methods
• Very powerful model ⇒ can easily overfit, sensitive to regularization
• Probably the most popular model among Netflix contestants
  – 12/11/2006: Simon Funk describes an SVD-based method
  – 12/29/2006: Free implementation at timelydevelopment.com

Pose as a regression problem

Regularize using the Frobenius norm
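Combining "pose as a regression problem" with "regularize using the Frobenius norm", the training objective has the following form (written in my notation, summing only over observed entries):

$$\min_{W,X} \sum_{(u,i)\,:\,r_{ui}\ \text{observed}} \left(r_{ui} - w_u^\top x_i\right)^2 + \lambda\left(\lVert W \rVert_F^2 + \lVert X \rVert_F^2\right)$$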

Alternating Least Squares

Matrix factorization model (same rank-3 figure as above)

Step 1: regress wu given X


Regularization

Remember ridge regression?

L2: closed-form solution

$$w = (X^\top X + \lambda I)^{-1} X^\top y$$

L1: no closed-form solution; use quadratic programming:

$$\text{minimize } \| X w - y \|^2 \quad \text{s.t.} \quad \| w \|_1 \le s$$

(slide credit: Yijun Zhao, Linear Regression)

Alternating Least Squares

Matrix factorization model (same figure as above)

Alternate between the two regressions (see the sketch below):
• Step 1: regress wu given X
• Step 2: regress xi given W
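A minimal ALS sketch under these assumptions: ratings are given as a dense NumPy array with np.nan for missing entries, lam is the ridge penalty, and the array names and fixed iteration count are illustrative rather than taken from the lecture.

```python
import numpy as np

def als(R, k=3, lam=0.1, n_iters=20):
    """Alternating least squares for matrix factorization with missing data.

    R: (num_users, num_items) ratings, np.nan where unobserved.
    Returns W (num_users, k) and X (num_items, k) with R ~= W @ X.T.
    """
    n_users, n_items = R.shape
    observed = ~np.isnan(R)
    W = np.random.randn(n_users, k) * 0.1
    X = np.random.randn(n_items, k) * 0.1
    eye = lam * np.eye(k)
    for _ in range(n_iters):
        # Step 1: regress each w_u given X (ridge regression over items u rated)
        for u in range(n_users):
            idx = observed[u]
            Xu = X[idx]
            W[u] = np.linalg.solve(Xu.T @ Xu + eye, Xu.T @ R[u, idx])
        # Step 2: regress each x_i given W (ridge regression over users who rated i)
        for i in range(n_items):
            idx = observed[:, i]
            Wi = W[idx]
            X[i] = np.linalg.solve(Wi.T @ Wi + eye, Wi.T @ R[idx, i])
    return W, X
```

Each inner solve is exactly the ridge regression recalled on the previous slide, applied once per user and once per item.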

Stochastic Gradient Descent

Matrix factorization model (same figure as above)

• No need for locking
• Multicore updates run asynchronously
  (Recht, Re, Wright, 2012 - Hogwild)
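For a single observed rating r_ui, an SGD step on the regularized objective above takes the following form (a sketch in my notation, with eta the learning rate):

$$e_{ui} = r_{ui} - w_u^\top x_i, \qquad w_u \leftarrow w_u + \eta\,(e_{ui}\,x_i - \lambda\,w_u), \qquad x_i \leftarrow x_i + \eta\,(e_{ui}\,w_u - \lambda\,x_i)$$

Because each rating touches only one row of W and one row of X, updates from different cores rarely collide, which is why the lock-free, asynchronous Hogwild-style scheme works well here.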

Sampling Bias

Ratings are not given at random!

B. Marlin et al., "Collaborative Filtering and the Missing at Random Assumption", UAI 2007

(figure: distribution of ratings in Yahoo! survey answers, Yahoo! music ratings, and Netflix ratings)

Ratings are not given at random

• A powerful source of information: characterize users by which movies they rated, rather than how they rated them
• ⇒ A dense binary representation of the data:

(figure: the users-by-movies rating matrix alongside the corresponding binary users-by-movies matrix indicating which movies each user rated)

Which movies do users rate?

(diagram: the rating data feeds both a matrix factorization of the ratings and a regression on the binary "who rated what" matrix)

Temporal Effects

Changes in user behavior: something happened in early 2004...

(figure: Netflix ratings by date, showing a jump in 2004 when Netflix changed its rating labels)

Are movies getting better with time?

Temporal Effects

Are movies getting better with time?

Solution: Model temporal effects in the bias, not the weights
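"Model temporal effects in the bias, not the weights" can be read as making the baseline terms time dependent, for example (a sketch in my notation):

$$\hat{r}_{ui}(t) = \mu + b_u(t) + b_i(t) + w_u^\top x_i$$

so that drift in the rating scale (the 2004 jump) and effects of movie age are absorbed by $b_u(t)$ and $b_i(t)$ rather than by the latent factors.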

Netflix Prize

Training data
• 100 million ratings, 480,000 users, 17,770 movies
• 6 years of data: 2000-2005

Test data
• Last few ratings of each user (2.8 million)
• Evaluation criterion: Root Mean Square Error (RMSE)

Competition
• 2,700+ teams
• Netflix's own system: RMSE 0.9514
• $1 million prize for a 10% improvement over Netflix

Improvements

(figure: "Factor models: Error vs. #parameters", RMSE (0.875 to 0.91) against millions of parameters (10 to 100,000) for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, and SVD v.4; points are labeled with the number of factors, from 40 up to 1500)

Add biases

Do SGD, but also learn biases μ, bu and bi
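"Do SGD, but also learn biases μ, bu and bi" then amounts to extending the earlier update (again a sketch in my notation):

$$e_{ui} = r_{ui} - (\mu + b_u + b_i + w_u^\top x_i), \qquad b_u \leftarrow b_u + \eta\,(e_{ui} - \lambda\,b_u), \qquad b_i \leftarrow b_i + \eta\,(e_{ui} - \lambda\,b_i)$$

with the same factor updates as before.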

Improvements

Account for the fact that ratings are not missing at random: use the "who rated what" information.

(same RMSE vs. #parameters figure as above)

Improvements

Add temporal effects.

(same RMSE vs. #parameters figure as above)

Improvements

(same RMSE vs. #parameters figure as above, with the temporal-effects model included)

Still pretty far from the 0.8563 grand prize

Winning Solution from BellKor

(slides credit: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

Last 30 days: the June 26th submission triggers the 30-day "last call"

BellKor fends off competitors by a hair

Ratings aren't everything: Netflix then vs. Netflix now

• Only the simpler submodels (SVD, RBMs) were implemented
• Ratings eventually proved to be only weakly informative