Data Mining Techniques
CS 6220 - Section 2 - Spring 2017
Lecture 10
Jan-Willem van de Meent (credit: Andrew Ng, Alex Smola, Yehuda Koren, Stanford CS246)


  • Project Deadlines

    • 3 Feb: Form teams of 2-4 people
    • 17 Feb: Submit abstract (1 paragraph)
    • 3 Mar: Submit proposals (2 pages)
    • 13 Mar: Milestone 1 (exploratory analysis)
    • 31 Mar: Milestone 2 (statistical analysis)
    • 16 Apr (Sun): Submit reports (10 pages)
    • 21 Apr (Fri): Submit peer reviews

  • Project Reports

    • ~10 pages (rough guideline)
    • Guidelines for contents:
      • Introduction / Motivation
      • Exploratory analysis (if applicable)
      • Data mining analysis
      • Discussion of results

  • Project Review

    • 2 per person (randomly assigned)
    • Reviews should discuss 4 aspects of the report:
      • Clarity (is the writing clear?)
      • Technical merit (are the methods valid?)
      • Reproducibility (is it clear how results were obtained?)
      • Discussion (are the results interpretable?)

  • Final Exam

    • Emphasis on post-midterm topics (but some pre-midterm topics included)
    • Topic list: http://www.ccs.neu.edu/home/jwvdm/teaching/cs6220/spring2017/final-topics.html

  • Recommender Systems

  • The Long Tail

    (from: https://www.wired.com/2004/10/tail/)


  • Problem Setting

    • Task: Predict user preferences for unseen items

  • Content-based Filtering

    [Figure: latent factor models — movies placed in a two-dimensional space with axes "geared towards females ↔ geared towards males" and "serious ↔ escapist" (e.g. The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility), along with users Gus and Dave]

    Idea: Predict rating using item features on a per-user basis (or, symmetrically, user features on a per-item basis)

  • Collaborative Filtering
    Part I: Basic neighborhood methods

    [Figure: Joe's ratings linked through similar users to recommended movies #1-#4]

    Idea: Predict rating based on similarity to other users

  • Problem Setting

    • Task: Predict user preferences for unseen items
    • Content-based filtering: Model user/item features
    • Collaborative filtering: Implicit similarity of users and items

  • Recommender Systems

    • Movie recommendation (Netflix)
    • Related product recommendation (Amazon)
    • Web page ranking (Google)
    • Social recommendation (Facebook)
    • Priority inbox & spam filtering (Google)
    • Online dating (OK Cupid)
    • Computational advertising (everyone)

  • Challenges

    • Scalability
      • Millions of objects
      • 100s of millions of users
    • Cold start
      • Changing user base
      • Changing inventory
    • Imbalanced dataset
      • User activity / item reviews power-law distributed
      • Ratings are not missing at random

  • Running Example: Netflix Data

    Training data:

    user  movie  date      score
    1     21     5/7/02    1
    1     213    8/2/04    5
    2     345    3/6/01    4
    2     123    5/1/05    4
    2     768    7/15/02   3
    3     76     1/22/01   5
    4     45     8/3/00    4
    5     568    9/10/05   1
    5     342    3/5/03    2
    5     234    12/28/00  2
    6     76     8/11/02   5
    6     56     6/15/03   4

    Test data:

    user  movie  date      score
    1     62     1/6/05    ?
    1     96     9/13/04   ?
    2     7      8/18/05   ?
    2     3      11/22/05  ?
    3     47     6/13/02   ?
    3     15     8/12/01   ?
    4     41     9/1/00    ?
    4     28     8/27/05   ?
    5     93     4/4/05    ?
    5     74     7/16/03   ?
    6     69     2/14/04   ?
    6     83     10/3/03   ?

    Movie rating data
    • Released as part of $1M competition by Netflix in 2006
    • Prize awarded to BellKor in 2009

  • Running Yardstick: RMSE

    \mathrm{rmse}(S) = \sqrt{|S|^{-1} \sum_{(i,u) \in S} (\hat{r}_{ui} - r_{ui})^2}

    (doesn't tell you how to actually do recommendation)
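The RMSE above is straightforward to compute; a minimal sketch (the `rmse` helper and the toy prediction/rating pairs are illustrative, not from the slides):

```python
import math

def rmse(pairs):
    """Root mean squared error over a set S of (predicted, actual) rating pairs."""
    return math.sqrt(sum((r_hat - r) ** 2 for r_hat, r in pairs) / len(pairs))

# Toy example: three predictions, each off by at most 0.5 stars
print(rmse([(3.5, 4.0), (2.0, 2.0), (4.5, 5.0)]))  # ≈ 0.408
```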

  • Content-based Filtering

  • Item-based Features

  • Per-user Regression

    w_u = \arg\min_w \| r_u - X w \|^2

    Learn a set of regression coefficients for each user
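The per-user regression w_u = argmin_w ‖r_u − Xw‖² is an ordinary least-squares solve per user; the item-feature matrix and ratings below are made-up toy data:

```python
import numpy as np

# Hypothetical item-feature matrix X (rows: items, cols: features)
# and one user's ratings r_u for those items
X = np.array([[1.0, 0.2], [0.5, 0.9], [0.1, 1.0], [0.9, 0.4]])
r_u = np.array([4.0, 2.5, 2.0, 4.5])

# Per-user regression: w_u = argmin_w ||r_u - X w||^2
w_u, *_ = np.linalg.lstsq(X, r_u, rcond=None)

# Predict this user's rating for an unseen item from its features
x_new = np.array([0.8, 0.3])
print(float(x_new @ w_u))
```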

  • User Bias and Item Popularity

  • Bias

    [Table row: e.g. Moonrise Kingdom, ratings 4 5 4 4, feature weights 0.3 0.2]

    Problem: Some movies are universally loved / hated; some users are pickier than others

    Solution: Introduce a per-movie and per-user bias
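One sketch of the per-movie / per-user bias fix: predict r̂_ui = μ + b_u + b_i. The simple mean-deviation estimates below are an illustrative assumption (the SGD slide later in this lecture instead learns μ, b_u and b_i jointly), and the ratings data is made up:

```python
from collections import defaultdict

ratings = [  # (user, item, rating) toy data
    ("alice", "moonrise", 5), ("bob", "moonrise", 4),
    ("alice", "lethal", 3), ("bob", "lethal", 2),
]

mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean

# Per-user and per-item biases as mean deviations from mu
user_dev, item_dev = defaultdict(list), defaultdict(list)
for u, i, r in ratings:
    user_dev[u].append(r - mu)
    item_dev[i].append(r - mu)
b_u = {u: sum(d) / len(d) for u, d in user_dev.items()}
b_i = {i: sum(d) / len(d) for i, d in item_dev.items()}

def baseline(u, i):
    """Baseline prediction: global mean + user bias + item bias."""
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)

print(baseline("alice", "moonrise"))  # → 5.0
```

Unseen users or items fall back to a bias of 0, i.e. the prediction degrades gracefully toward the global mean.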

  • Collaborative Filtering

  • Neighborhood Based Methods
    Part I: Basic neighborhood methods

    [Figure: bipartite user-item graph; Joe's ratings connect him to movies #1-#4]

    Users and items form a bipartite graph (edges are ratings)

  • Neighborhood Based Methods

    (user, user) similarity
    • Predict rating based on average from k-nearest users
    • Good if the item base is small
    • Good if the item base changes rapidly

    (item, item) similarity
    • Predict rating based on average from k-nearest items
    • Good if the user base is small
    • Good if the user base changes rapidly

  • Parzen-Window Style CF

    • Define a similarity s_ij between items
    • Find the set ε_k(i, u) of k-nearest neighbors to i that were rated by user u
    • Predict the rating using a weighted average over this set
    • How should we define s_ij?

  • Pearson Correlation Coefficient

    s_{ij} = \mathrm{Cov}[r_{ui}, r_{uj}] \,/\, (\mathrm{Std}[r_{ui}]\,\mathrm{Std}[r_{uj}])

    Estimating item-item similarities:
    • Common practice: rely on the Pearson correlation coefficient
    • Challenge: non-uniform user support of item ratings; each item is rated by a distinct set of users

    User ratings for item i:  1 ? ? 5 5 3 ? ? ? 4 2 ? ? ? ? 4 ? 5 4 1 ?
    User ratings for item j:  ? ? 4 2 5 ? ? 1 2 5 ? ? 2 ? ? 3 ? ? ? 5 4

    • Compute the correlation over the shared support

  • (item, item) similarity

    Empirical estimate of the Pearson correlation coefficient over the shared support U(i, j):

    \hat{\rho}_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}
                           {\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \,\sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}

    Regularize towards 0 for small support:

    s_{ij} = \frac{|U(i,j)| - 1}{|U(i,j)| - 1 + \lambda}\,\hat{\rho}_{ij}

    Regularize towards the baseline for small neighborhoods
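Putting the two formulas together, assuming the per-user baselines b_ui are given (function and variable names here are illustrative):

```python
import math

def shrunk_pearson(ri, rj, bi, bj, lam=100.0):
    """Pearson correlation over shared support, shrunk toward 0 for small support.

    ri, rj: {user: rating} for items i and j; bi, bj: {user: baseline b_ui}.
    """
    shared = ri.keys() & rj.keys()          # shared support U(i, j)
    n = len(shared)
    if n < 2:
        return 0.0
    num = sum((ri[u] - bi[u]) * (rj[u] - bj[u]) for u in shared)
    den = math.sqrt(sum((ri[u] - bi[u]) ** 2 for u in shared)
                    * sum((rj[u] - bj[u]) ** 2 for u in shared))
    rho = num / den if den > 0 else 0.0
    return (n - 1) / (n - 1 + lam) * rho    # shrinkage factor

# Two users in common, perfectly correlated deviations -> rho = 1,
# but only n = 2 shared ratings, so the similarity is shrunk hard
ri = {"u1": 1.0, "u2": 3.0, "u3": 2.0}
rj = {"u1": 2.0, "u2": 4.0}
bi = {u: 2.0 for u in ri}
bj = {u: 3.0 for u in rj}
print(shrunk_pearson(ri, rj, bi, bj))  # 1/101 ≈ 0.0099
```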

  • Similarity for binary labels

    Pearson correlation is not meaningful for binary labels (e.g. views, purchases, clicks)

    m_i: users acting on i
    m_{ij}: users acting on both i and j
    m: total number of users

    Jaccard similarity:
    s_{ij} = \frac{m_{ij}}{\alpha + m_i + m_j - m_{ij}}

    Observed / expected ratio:
    s_{ij} = \frac{\text{observed}}{\text{expected}} \approx \frac{m_{ij}}{\alpha + m_i m_j / m}

  • Matrix Factorization Methods

  • Matrix Factorization

    Idea: pose as a (biased) matrix factorization problem

    [Figure: users × items rating matrix approximated by the product of a user-factor matrix and an item-factor matrix — a rank-3 SVD approximation]

  • Prediction

    Estimate unknown ratings as inner products of factors

    [Figure: the same rank-3 factorization; an unknown entry ? is filled in as 2.4, the inner product of the corresponding user and item factor vectors]

  • SVD with missing values

    Matrix factorization model. Properties:
    • SVD isn't defined when entries are unknown → use specialized methods
    • Very powerful model: can easily overfit, sensitive to regularization
    • Probably the most popular model among Netflix contestants
      – 12/11/2006: Simon Funk describes an SVD-based method
      – 12/29/2006: Free implementation at timelydevelopment.com

    Pose as a regression problem; regularize using the Frobenius norm
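"Pose as a regression problem, regularize using the Frobenius norm" corresponds to the standard objective (written out here for concreteness, using the w_u, x_i factor notation from the regression and ALS slides):

```latex
\min_{W,\,X} \;\sum_{(u,i)\,\text{observed}} \bigl( r_{ui} - w_u^\top x_i \bigr)^2
  \;+\; \lambda \bigl( \|W\|_F^2 + \|X\|_F^2 \bigr)
```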

  • Alternating Least Squares

    Matrix factorization model. Fixing the item factors X makes each w_u an ordinary ridge regression; fixing the user factors W does the same for each x_i.

    Remember ridge regression? (slide: Yijun Zhao, Linear Regression)
    • L2: closed-form solution
      w = (X^\top X + \lambda I)^{-1} X^\top y
    • L1: no closed-form solution; use quadratic programming:
      minimize \|Xw - y\|^2  s.t.  \|w\|_1 \le s

    Alternate until convergence:
    • regress w_u given X
    • regress x_i given W
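A compact sketch of the alternating scheme, reusing the ridge closed form w = (XᵀX + λI)⁻¹Xᵀy from the recap. Dimensions, hyperparameters, and the toy rating matrix are illustrative assumptions, not a tuned implementation:

```python
import numpy as np

def als(R, mask, k=3, lam=0.1, iters=20, seed=0):
    """Alternating least squares for R ~ W @ X.T on observed entries only.

    R: (n_users, n_items) ratings; mask: boolean matrix of observed entries.
    Each half-step solves a ridge system (A^T A + lam*I) w = A^T y.
    """
    rng = np.random.default_rng(seed)
    n_u, n_i = R.shape
    W = rng.normal(scale=0.1, size=(n_u, k))   # user factors w_u
    X = rng.normal(scale=0.1, size=(n_i, k))   # item factors x_i
    I = lam * np.eye(k)
    for _ in range(iters):
        for u in range(n_u):                   # regress w_u given X
            obs = mask[u]
            A = X[obs]
            W[u] = np.linalg.solve(A.T @ A + I, A.T @ R[u, obs])
        for i in range(n_i):                   # regress x_i given W
            obs = mask[:, i]
            A = W[obs]
            X[i] = np.linalg.solve(A.T @ A + I, A.T @ R[obs, i])
    return W, X

R = np.array([[5.0, 4.0, 0.0], [4.0, 0.0, 1.0], [0.0, 2.0, 5.0]])
mask = R > 0                                   # 0 marks a missing rating here
W, X = als(R, mask)
print(np.round(W @ X.T, 1))                    # observed entries ~ recovered
```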

  • Stochastic Gradient Descent

    Matrix factorization model, fit with one gradient step per observed rating:
    • No need for locking
    • Multicore updates asynchronously (Recht, Ré, Wright, 2012 - Hogwild)
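A Funk-style SGD sketch for the same factorization; the learning rate, regularization constant, and toy ratings are assumptions, not the slides' values:

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=2, lr=0.02, lam=0.05, epochs=500, seed=0):
    """SGD for matrix factorization: one gradient step per observed rating.

    ratings: list of (user_index, item_index, rating) triples.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_users, k))   # user factors
    X = rng.normal(scale=0.1, size=(n_items, k))   # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - W[u] @ X[i]                   # prediction error
            W[u] += lr * (err * X[i] - lam * W[u])  # step on user factors
            X[i] += lr * (err * W[u] - lam * X[i])  # step on item factors
    return W, X

ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
W, X = sgd_mf(ratings, n_users=3, n_items=3)
print(float(W[0] @ X[0]))  # close to the observed rating 5.0
```

Because each update touches only one row of W and one row of X, updates for different ratings rarely collide, which is exactly what the Hogwild observation exploits.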

  • Sampling Bias

  • Ratings are not given at random!

    B. Marlin et al., "Collaborative Filtering and the Missing at Random Assumption", UAI 2007

    [Figure: distribution of ratings in Yahoo! survey answers, Yahoo! music ratings, and Netflix ratings]

  • Ratings are not given at random

    • A powerful source of information: characterize users by which movies they rated, rather than how they rated them
    • A dense binary representation of the data: alongside the sparse rating matrix R (entries r_ui), build a binary matrix B with b_ui = 1 if user u rated movie i

    [Figure: users × movies rating matrix next to its dense 0/1 "who rated what" counterpart; R is modeled by matrix factorization, B by regression]
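Building the dense binary "who rated what" matrix from a ratings matrix is one line; the R below is toy data with 0 marking an unrated entry:

```python
import numpy as np

# Toy ratings matrix; 0 = unrated
R = np.array([[4, 0, 5, 0],
              [0, 3, 0, 1],
              [2, 0, 0, 5]])

# Dense binary matrix: 1 wherever a rating exists, regardless of its value
B = (R > 0).astype(int)
print(B)
```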

  • Temporal Effects

  • Changes in user behavior

    Something happened in early 2004…

    [Figure: Netflix ratings by date; the mean rating jumps in 2004, when Netflix changed its rating labels]

  • Are movies getting better with time?

    Solution: Model temporal effects in the bias, not the weights

  • Netflix Prize

    Training data
    • 100 million ratings, 480,000 users, 17,770 movies
    • 6 years of data: 2000-2005

    Test data
    • Last few ratings of each user (2.8 million)
    • Evaluation criterion: Root Mean Square Error (RMSE)

    Competition
    • 2,700+ teams
    • Netflix's system RMSE: 0.9514
    • $1 million prize for 10% improvement on Netflix

  • Improvements

    [Figure: factor models, RMSE (0.875-0.91) vs. millions of parameters (10-100,000, log scale) for NMF, BiasSVD, SVD++, and SVD v.2-v.4; annotations along each curve give the number of factors (40-1500)]

    • Add biases: do SGD, but also learn the biases μ, b_u and b_i (BiasSVD)
    • Account for the fact that ratings are not missing at random: "who rated what" (SVD++)
    • Add temporal effects (SVD v.2-v.4)

    Still pretty far from the 0.8563 grand prize

  • Winning Solution from BellKor

    J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  • Last 30 days

    June 26th submission triggers 30-day "last call"

  • BellKor fends off competitors by a hair

  • Ratings aren't everything

    Netflix then:
    • Only the simpler submodels (SVD, RBMs) were implemented
    • Ratings eventually proved to be only weakly informative

    [Figure: Netflix interface, then vs. now]