Offline evaluation of recommender systems: all pain and no gain?
Offline Evaluation of Recommender Systems
All pain and no gain?
Mark Levy, Mendeley
About me
Some things I built
Something I'm building
What is a good recommendation?
One that increases the usefulness of your product in the long run¹
1. WARNING: hard to measure directly
What is a good recommendation?
● One that increased your bottom line:
– User bought item after it was recommended
– User clicked ad after it was shown
– User didn't skip track when it was played
– User added document to library...
– User connected with contact...
Why was it good?
● Maybe it was
– Relevant
– Novel
– Familiar
– Serendipitous
– Well explained
● Note: some of these are mutually incompatible
What is a bad recommendation?
(you know one when you see one)
What is a bad recommendation?
● Maybe it was
– Not relevant
– Too obscure
– Too familiar
– I already have it
– I already know that I don't like it
– Badly explained
What's the cost of getting it wrong?
● Depends on your product and your users
– Lost revenue
– Less engaged user
– Angry user
– Amused user
– Confused user
– User defects to a rival product
Hypotheses
Good offline metrics express product goals
Most (really) bad recommendations can be caught by business logic
Issues
● Real business goals concern long-term user behaviour, e.g. Netflix:
“we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to
watch a title and enjoys it enough to come back to the service”
● Usually have to settle for short-term surrogate
● Only some user behaviour is visible
● Same constraints when collecting training data
Least bad solution?
● “Back to the future” aka historical log analysis
● Decide which logged event(s) indicate success
● Be honest about “success”
● Usually care most about precision @ small k
● Recall will discriminate once this plateaus
● Expect to have to do online testing too
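As a concrete illustration (my own sketch, not code from the talk), precision and recall at small k can be computed per user against whichever logged events you have decided count as "success":

```python
def precision_recall_at_k(recommended, successes, k=10):
    """Precision@k and recall@k for a single user.

    recommended: ranked list of item ids we recommended
    successes:   set of item ids the logs mark as a "success"
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in successes)
    precision = hits / k
    recall = hits / len(successes) if successes else 0.0
    return precision, recall

# averaged over users to get the reported figures, e.g.
# scores = [precision_recall_at_k(recs[u], success[u], k=10) for u in users]
```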
Making metrics meaningful
● Building a test framework + data is hard
● Be sure to get best value from your work
● Don't use straw man baselines
● Be realistic – leave the ivory tower
● Make test setups and baselines reproducible
Making metrics meaningful
● Old skool k-NN systems are better than you think
– Input numbers from mining logs
– Temporal “modelling” (e.g. fake users)
– Data pruning (scalability, popularity bias, quality)
– Preprocessing (tf-idf, log/sqrt, …)
– Hand crafted similarity metric
– Hand crafted aggregation formula
– Postprocessing (popularity matching)
– Diversification
– Attention profile
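To show how much engineering such a "simple" system already contains, here is a rough sketch of an item-based k-NN scorer (a generic illustration assuming a scipy sparse user x item count matrix, not the speaker's production code); the preprocessing, similarity and aggregation steps are exactly the places for the hand tuning listed above:

```python
import numpy as np
from scipy.sparse import csr_matrix

def item_knn_scores(interactions, k=50):
    """Item-based k-NN scores from a user x item matrix of interaction counts."""
    # preprocessing: damp heavy users/items (log scaling; tf-idf is another option)
    X = csr_matrix(interactions, dtype=float)
    X.data = np.log1p(X.data)

    # hand-crafted similarity: cosine between item columns
    norms = np.sqrt(X.multiply(X).sum(axis=0)).A.ravel() + 1e-9
    Xn = X.multiply(1.0 / norms)
    sim = (Xn.T @ Xn).toarray()
    np.fill_diagonal(sim, 0.0)

    # keep only the k most similar neighbours of each item
    for i in range(sim.shape[0]):
        sim[i, np.argsort(sim[i])[:-k]] = 0.0

    # hand-crafted aggregation: sum similarities of the items each user already has
    return X @ sim
```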
Making metrics meaningful
● Measure preference honestly
● Predicted items may not be “correct” just because they were consumed once
● Try to capture value
– Earlier recommendation may be better
– Don't need a recommender to suggest items by same artist/author
● Don't neglect side data
– At least use it for evaluation / sanity checking
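As a toy example of the same-artist point (the `item_artist` side-data mapping and the argument names are hypothetical, purely for illustration), one might refuse credit for hits the user could have found without a recommender:

```python
def credited_hits(recommended, test_items, history, item_artist):
    """Count hits that actually deserve credit for one user.

    recommended: ranked list of recommended item ids
    test_items:  held-out items the user went on to consume
    history:     items already in the user's training data
    item_artist: dict mapping item id -> artist/author id (side data)
    """
    known_artists = {item_artist.get(i) for i in history}
    hits = 0
    for item in recommended:
        if item not in test_items:
            continue
        # suggesting more work by an artist the user already has is trivial
        if item_artist.get(item) in known_artists:
            continue
        hits += 1
    return hits
```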
Making metrics meaningful
● Public data isn't enough for reproducibility or fair comparison
● Need to document preprocessing
● Better:
Release your preparation/evaluation code too
What's the cost of poor evaluation?
Poor offline evaluation can lead to years of misdirected research
Ex 1: Reduce playlist skips
● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”
● Use an audio similarity measure to compute transition distances, then solve a travelling salesman problem
● Metric: sum of transition distances (lower is better; sketched below)
● 6 months work to develop solution
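A minimal sketch of that offline metric plus a greedy stand-in for the travelling salesman step (the `distance` function, i.e. the audio similarity measure, is assumed; this is an illustration, not the original system):

```python
def total_transition_distance(order, distance):
    """The offline metric: sum of audio distances between consecutive tracks."""
    return sum(distance(a, b) for a, b in zip(order, order[1:]))

def greedy_reorder(tracks, distance):
    """Crude travelling-salesman heuristic: always hop to the nearest remaining track."""
    order, remaining = [tracks[0]], list(tracks[1:])
    while remaining:
        nxt = min(remaining, key=lambda t: distance(order[-1], t))
        remaining.remove(nxt)
        order.append(nxt)
    return order
```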
Ex 1: Reduce playlist skips
● Result: users skipped more often
● Why?
Ex 1: Reduce playlist skips
● Result: users skipped more often
● When a user skipped a track they didn't like, they were played something else just like it
● Better metric: average position of skipped tracks (from logs; skips further down the playlist are better; sketched below)
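A sketch of the better metric, computed directly from playback logs (the log format here is hypothetical):

```python
def average_skip_position(sessions):
    """Mean playlist position at which skips occur; larger is better.

    sessions: iterable of sessions, each a list of (position, was_skipped) pairs.
    """
    skip_positions = [pos for session in sessions
                      for pos, skipped in session if skipped]
    return sum(skip_positions) / len(skip_positions) if skip_positions else float("nan")
```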
Ex 2: Recommend movies
● Use a corpus of star ratings to improve movie recommendations
● Learn to predict ratings for un-rated movies
● Metric: average RMSE of predictions for a hidden test set (lower is better)
● 2+ years work to develop new algorithms
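For reference, the metric itself is simple to state (a sketch, assuming predictions and the hidden test set are keyed by (user, movie)):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over a hidden test set of star ratings."""
    keys = actual.keys()
    squared_error = sum((predicted[k] - actual[k]) ** 2 for k in keys)
    return math.sqrt(squared_error / len(keys))
```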
Ex 2: Recommend movies
● Result: “best” solutions were never deployed
● Why?
Ex 2: Recommend movies
● Result: “best” solutions were never deployed
● User behaviour correlates with rank, not RMSE
● Side datasets proved an order of magnitude more valuable than algorithm improvements
● Explicit ratings are the exception not the rule
● RMSE still haunts research labs
Can contests help?
● Good:
– Great for consistent evaluation
● Not so good:
– Privacy concerns mean obfuscated data
– No guarantee that metrics are meaningful
– No guarantee that train/test framework is valid
– Small datasets can become overexposed
Ex 3: Yahoo! Music KDD Cup
● Largest music rating dataset ever released
● Realistic “loved songs” classification task
● Data fully obfuscated due to recent lawsuits
Ex 3: Yahoo! Music KDD Cup
● Result: researchers hated it
● Why?
Ex 3: Yahoo! Music KDD Cup
● Result: researchers hated it
● The research frontier focussed on audio content and metadata, which couldn't be joined to the obfuscated ratings
Ex 4: Million Song Challenge
● Large music dataset with rich metadata
● Anonymized listening histories
● Simple item recommendation task
● Reasonable MAP@500 metric
● Aimed to solve shortcomings of KDD Cup
● Only obfuscation was removal of timestamps
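For reference, MAP@k averages the following per-user score over all test users (a generic sketch of the standard definition, not the official challenge code):

```python
def average_precision_at_k(recommended, relevant, k=500):
    """AP@k for one user; MAP@k is the mean of this over all test users."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank           # precision at this rank
    return score / min(len(relevant), k) if relevant else 0.0
```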
Ex 4: Million Song Challenge
● Result: winning entry didn't use side data
● Why?
Ex 4: Million Song Challenge
● Result: winning entry didn't use side data
● No timestamps so test tracks chosen at random
● So “people who listen to A also listen to B”
● Traditional item similarity solves this well
● More honesty about “success” might have shown that contest data was flawed
Ex 5: Yelp RecSys Challenge
● Small business review dataset with side data
● Realistic mix of input data types
● Rating prediction task
● Informal procedure to create train/test sets
Ex 5: Yelp RecSys Challenge
● Result: baseline algorithms high up leaderboard
● Why?
Ex 5: Yelp RecSys Challenge
● Result: baseline algorithms high up leaderboard
● Train/test split was corrupt
● Competition organisers moved fast to fix this
● But that left only one week before the deadline
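For contrast, a leakage-free split is not hard to write down; here is a minimal sketch (my own illustration, not the organisers' procedure) that holds out a fraction of each user's reviews and keeps the two sides disjoint by construction:

```python
import random

def per_user_split(reviews, holdout_fraction=0.2, seed=0):
    """Hold out a fraction of each user's reviews for testing.

    reviews: dict mapping user id -> list of review records.
    Returns (train, test); a record can never appear on both sides.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for user, records in reviews.items():
        records = list(records)
        rng.shuffle(records)
        cut = max(1, int(len(records) * holdout_fraction))
        test[user], train[user] = records[:cut], records[cut:]
    return train, test
```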
Ex 6: MIREX Audio Chord Estimation
● Small dataset of audio tracks
● Task to label with predicted chord symbols
● Human labelled data hard to come by
● Contest hosted by premier forum in field
● Evaluate frame-level prediction accuracy
● Historical glass ceiling around 80%
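The evaluation itself is straightforward (a sketch of frame-level accuracy, assuming one chord symbol per audio frame):

```python
def frame_accuracy(predicted, reference):
    """Fraction of frames whose predicted chord symbol matches the ground truth."""
    if len(predicted) != len(reference):
        raise ValueError("frame sequences must be the same length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)
```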
Ex 6: MIREX Audio Chord Estimation
● Result: 2011 winner ftw
● Why?
Ex 6: MIREX Audio Chord Estimation
● Result: 2011 winner ftw
● Spoof entry relying on known test set
● Protest against inadequate test data
● Other research showed weak generalisation of winning algorithms from same contest
● The next year's results dropped significantly
So why evaluate offline at all?
● Building test framework ensures clear goals
● Avoid wishful thinking if your data is too thin
● Be efficient with precious online testing
– Cut down huge parameter space
– Don't alienate users
● Need to publish
● Pursuing science as well as profit
Online evaluation is tricky too
● No off-the-shelf solution for services
● Many statistical gotchas
● Same mismatch between short-term and long-term success criteria
● Results open to interpretation by management
● Can make incremental improvements look good when radical innovation is needed
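One of the simpler statistical checks, for example, is a two-proportion z-test on conversion counts (a minimal sketch; real A/B analysis has many more gotchas, e.g. peeking and multiple comparisons):

```python
import math

def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """z-statistic for the difference between two conversion rates.

    |z| above roughly 1.96 suggests a real difference at the 95% level;
    it says nothing about long-term value to the user.
    """
    p_a, p_b = successes_a / trials_a, successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    return (p_a - p_b) / se
```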
Ex 7: Article Recommendations
● Recommender for related research articles
● Massive download logs available
● Framework developed based on co-downloads
● Aim to improve on existing search solution
● Management “keen for it to work”
● Several weeks of live A/B testing available
● No offline evaluation
Ex 7: Article Recommendations
● Result: worse than similar title search
● Why?
Ex 7: Article Recommendations
● Result: worse than similar title search
● Inadequate business rules, e.g. often suggesting other articles from the same publication
● Users identified only by organisational IP range, so the value of “big data” was very limited
● Establishing an offline evaluation protocol would have shown these problems in advance
Isn't there software for that?
Rules of the game:
– Model fit metrics (e.g. validation loss) don't count
– Need a transparent “audit trail” of data to support genuine reproducibility
– Just using public datasets doesn't ensure this
Isn't there software for that?
Wish list for reproducible evaluation:
– Integrate with recommender implementations
– Handle data formats and preprocessing
– Handle splitting, cross-validation, side datasets
– Save everything to file
– Work from file inputs so not tied to one framework
– Generate meaningful metrics
– Well documented and easy to use
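As an illustration of the "save everything to file" point (a hypothetical helper, not part of any existing tool), every evaluation run could leave behind a small record tying together the configuration, the exact split files and the scores:

```python
import json
import time

def save_run(path, config, split_files, metrics):
    """Write one evaluation run to disk so it can be audited and reproduced."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "config": config,        # algorithm and preprocessing parameters
        "splits": split_files,   # paths of the exact train/test files used
        "metrics": metrics,      # e.g. {"precision@10": 0.041, "recall@50": 0.12}
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
```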
Isn't there software for that?
Current offerings:
● GraphChi/GraphLab
● Mahout
● LensKit
● MyMediaLite
Isn't there software for that?
Current offerings:
● GraphChi/GraphLab
– Model validation loss, doesn't count
● Mahout
– Only rating prediction accuracy, doesn't count
● LensKit
– Too hard to understand, won't use
Isn't there software for that?
Current offerings:
● MyMediaLite
– Reports meaningful metrics
– Handles cross-validation
– Data splitting not transparent
– No support for pre-processing
– No built in support for standalone evaluation
– API is capable but the current utilities don't meet the wish list
Eating your own dog food
● Built a small framework around new algorithm
● https://github.com/mendeley/mrec
– Reports meaningful metrics
– Handles cross-validation
– Supports simple pre-processing
– Writes everything to file for reproducibility
– Provides API and utility scripts
– Runs standalone evaluations
– Readable Python code
Eating your own dog food
● Some lessons learned
– Usable frameworks are hard to write
– Tradeoff between clarity and scalability
– Should generate explicit validation sets
● Please contribute!
● Or use as inspiration to improve existing tools
Where next?
● Shift evaluation online:
– Contests based around online evaluation
– Realistic but not reproducible
– Could some run continuously?
● Recommender Systems as a commodity:
– Software and services reaching maturity now
– Business users can tune/evaluate themselves
– Is there a way to report results?
Where next?
● Support alternative query paradigms:
– More like this, less like that
– Metrics for dynamic/online recommenders
● Support recommendation with side data:
– LibFM, GenSGD, WARP research @google, …
– Open datasets?
Thanks for listening
@gamboviol
https://github.com/gamboviol
https://github.com/mendeley/mrec