Query Log Mining Yandex Challenge 2011
Nikita Spirin, Shih-Wen Huang, Shuo Yang, Anirudh Ravula
Search logs are used to improve search
• Learn a ranking function
– Users click on meaningful results
• Personalize search based on user history
– Previous user searches reveal users' interests
• Identify spammers
– Bots click on suspicious websites more often
• Tune contextual advertising models
• Recommend and disambiguate queries
– See also “java programming” vs. “java coffee”
Yandex QLM Challenge 2011 goals
• Learn a ranking function
– For a given query provide a list of ordered URLs using the information from the log
• Plan for today
– Task description
– General framework: learning to rank (L2R)
– Features for L2R
– Preferences extraction for L2R
– Ranking algorithms
– Collaborative Filtering and graph-based approaches
– Experiments
– Plans for future improvements
Task description: Input to the challenge
• Query log
– Query action
SessionID TimePassed QUERY QueryID RegionID ListOfURLs
– Click action
SessionID TimePassed CLICK URLID
• Training relevance labels from the set {0,1}
QueryID RegionID URLID RelevanceLabel
• Testing query/region pairs
– The goal is to provide relevant URLs for these new query/region pairs
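The two record layouts above can be read with a small streaming parser. This is a minimal sketch, not the challenge's reference reader: it assumes whitespace-separated fields, integer ids, and the literal action tokens `QUERY` and `CLICK` as written on the slide. The names `QueryAction`, `ClickAction`, and `parse_log` are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple, Union


class QueryAction(NamedTuple):
    session_id: int
    time_passed: int
    query_id: int
    region_id: int
    urls: List[int]  # URL ids in the order they were shown


class ClickAction(NamedTuple):
    session_id: int
    time_passed: int
    url_id: int


def parse_log(lines: Iterable[str]) -> Iterator[Union[QueryAction, ClickAction]]:
    """Lazily yield one record per well-formed log line."""
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated fields
        if len(fields) >= 5 and fields[2] == "QUERY":
            sid, t, _, qid, rid, *urls = fields
            yield QueryAction(int(sid), int(t), int(qid), int(rid),
                              [int(u) for u in urls])
        elif len(fields) == 4 and fields[2] == "CLICK":
            sid, t, _, url = fields
            yield ClickAction(int(sid), int(t), int(url))
        # Assumption: any other line is malformed and is skipped.


sample = ["1 0 QUERY 10 2 100 101 102", "1 5 CLICK 101"]
for record in parse_log(sample):
    print(record)
```

Because `parse_log` is a generator, it can be fed an open file object and never holds more than one line in memory.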
Some real input data
• Snapshot of the real Yandex query log
SessionID Time Action QueryId RegionId URL URL URL
• Training relevance labels from the set {0,1}
QueryId RegionId URL Relevance
Some statistics about the query log
• Unique queries: 30,717,251
• Unique URLs: 117,093,258
• Sessions: 43,977,859
• Total records in the log: 340,796,067
• Assessed query-region-url triples for the total query set (training + test): 71,930
• Log size: 17 GB (doesn't fit into memory)
General Framework: Learning to Rank (L2R)
• Training formalization:
– Given an ordered set of ranks Y = {0,1} (0 < 1)
– Given a set of queries Q = {q_1, ..., q_n}
– A list of documents is associated with each query: D_q = {d_{q,1}, ..., d_{q,n(q)}}
– Factor ranking model:
• X_qd = (f_1(q, d), ..., f_m(q, d)), the feature vector for a q-d pair
• Goal of L2R:
– Learn a ranker: X → Y
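To make the formalization concrete, here is a minimal sketch of a ranker applied to the documents of one query. It assumes a linear scoring function over the feature vector, which is only one possible model family. The names `score` and `rank_documents` and the example feature values are hypothetical.

```python
def score(weights, features):
    """Linear score w · x for one query-document feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def rank_documents(weights, docs):
    """docs maps a document id to its feature vector for a fixed query.
    Returns the document ids ordered from highest to lowest score."""
    return sorted(docs, key=lambda d: score(weights, docs[d]), reverse=True)


# Hypothetical weights and feature vectors, for illustration only.
weights = [1.0, 0.5]
docs = {"a": [0.1, 0.2], "b": [0.9, 0.0], "c": [0.4, 0.8]}
print(rank_documents(weights, docs))
```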
Subtasks of L2R from query logs
• Extract preferences (absolute, pairwise) from a query log using click-through statistics
• Generate features (factors) to make the problem structured
• Learn a ranking algorithm
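As an illustration of click-through statistics that could serve as features, the sketch below counts impressions and clicks per (query, region, URL) triple in a single pass over the log. It assumes that a click belongs to the most recent query of the same session. The tuple-based event layout and the name `click_through_stats` are hypothetical.

```python
from collections import defaultdict


def click_through_stats(events):
    """events is an iterable of tuples:
         ("QUERY", session_id, query_id, region_id, [url_ids])
         ("CLICK", session_id, url_id)
    Returns {(query_id, region_id, url_id): (shows, clicks, ctr)}."""
    shows = defaultdict(int)
    clicks = defaultdict(int)
    last_query = {}  # session_id -> (query_id, region_id) of its latest query
    for event in events:
        if event[0] == "QUERY":
            _, sid, qid, rid, urls = event
            last_query[sid] = (qid, rid)
            for url in urls:
                shows[(qid, rid, url)] += 1
        else:
            _, sid, url = event
            # Assumption: attribute the click to the session's latest query.
            if sid in last_query:
                qid, rid = last_query[sid]
                clicks[(qid, rid, url)] += 1
    return {key: (n, clicks[key], clicks[key] / n) for key, n in shows.items()}


events = [
    ("QUERY", 1, 10, 2, [100, 101]),
    ("CLICK", 1, 101),
    ("QUERY", 2, 10, 2, [100, 101]),
    ("CLICK", 2, 101),
    ("CLICK", 2, 100),
]
for key, value in sorted(click_through_stats(events).items()):
    print(key, value)
```

If the log is grouped by session, `last_query` only needs to track the current session, which keeps memory use low on a log too large to load at once.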
SVM for L2R = RankSVM
• Extract preferences from a query log based on some heuristics
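A rough sketch of the pairwise idea follows. The slide does not say which heuristics were used, so the "clicked URL beats unclicked URLs shown above it" rule here is an assumed example. The trainer is a plain subgradient descent on the pairwise hinge loss with a linear model, not an optimized SVM solver. All names and feature values are hypothetical.

```python
def skip_above_pairs(shown_urls, clicked):
    """Assumed heuristic: a clicked URL is preferred over every unclicked
    URL that was shown above it. Returns (preferred, other) pairs."""
    clicked = set(clicked)
    pairs = []
    for pos, url in enumerate(shown_urls):
        if url in clicked:
            pairs.extend((url, above) for above in shown_urls[:pos]
                         if above not in clicked)
    return pairs


def train_rank_svm(pairs, features, epochs=200, lr=0.1, reg=0.01):
    """Minimize reg/2 * |w|^2 + sum of max(0, 1 - w · (x_pref - x_other))
    by stochastic subgradient descent. Returns the weight vector w."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, other in pairs:
            diff = [a - b for a, b in zip(features[preferred], features[other])]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for i in range(dim):
                # The hinge term contributes only while the margin is violated.
                grad = reg * w[i] - (diff[i] if margin < 1.0 else 0.0)
                w[i] -= lr * grad
    return w


# Hypothetical session: three URLs shown, the third one clicked.
pairs = skip_above_pairs(["u1", "u2", "u3"], clicked=["u3"])
features = {"u1": [1.0, 0.0], "u2": [0.8, 0.1], "u3": [0.2, 0.9]}
w = train_rank_svm(pairs, features)
print(pairs, w)
```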
Boosting for L2R = RankBoost
• Uses each feature as a decision stump
• Builds a linear weighted ensemble model
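The two bullets above can be sketched as a simplified RankBoost loop over preference pairs, using binary threshold stumps on single features as weak rankers. This is a minimal sketch under stated assumptions: the candidate thresholds are supplied by the caller, and the names `rank_boost` and `ensemble_score` and the example data are hypothetical.

```python
import math


def stump(feature, threshold):
    """Weak ranker: 1 if the chosen feature exceeds the threshold, else 0."""
    return lambda x: 1.0 if x[feature] > threshold else 0.0


def rank_boost(pairs, candidates, rounds=10):
    """pairs: list of (x_better, x_worse) feature vectors.
    candidates: list of (feature_index, threshold) stumps to choose from.
    Returns the ensemble as a list of (alpha, feature_index, threshold)."""
    dist = [1.0 / len(pairs)] * len(pairs)  # weight of each preference pair
    model = []
    for _ in range(rounds):
        best = None
        for feat, thr in candidates:
            h = stump(feat, thr)
            # r measures how well h agrees with the weighted preferences.
            r = sum(d * (h(b) - h(w)) for d, (b, w) in zip(dist, pairs))
            if best is None or abs(r) > abs(best[0]):
                best = (r, feat, thr)
        r, feat, thr = best
        if abs(r) < 1e-12:
            break  # no candidate separates the remaining weight
        r = max(min(r, 1.0 - 1e-9), -1.0 + 1e-9)  # keep the log finite
        alpha = 0.5 * math.log((1.0 + r) / (1.0 - r))
        model.append((alpha, feat, thr))
        h = stump(feat, thr)
        # Increase the weight of pairs the chosen stump orders wrongly.
        dist = [d * math.exp(alpha * (h(w) - h(b)))
                for d, (b, w) in zip(dist, pairs)]
        total = sum(dist)
        dist = [d / total for d in dist]
    return model


def ensemble_score(model, x):
    """Linear weighted combination of the selected stumps."""
    return sum(alpha * stump(feat, thr)(x) for alpha, feat, thr in model)


# Hypothetical preference pairs (better, worse) and candidate stumps.
pairs = [([0.9, 0.1], [0.2, 0.3]),
         ([0.7, 0.8], [0.4, 0.2]),
         ([0.3, 0.9], [0.5, 0.1])]
model = rank_boost(pairs, candidates=[(0, 0.45), (1, 0.5)], rounds=3)
print(model)
```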
Ensemble Approach
• Generate multiple models by varying…
– Feature subsets
– Algorithm parameters
– Ranking models
– Model Subsets
– Averaging strategies (weighted, quality-based, etc.)
• Finally average [similar to CombMNZ]
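For reference, a CombMNZ-style fusion can be sketched as follows: each model's scores are min-max normalized, summed per document, and multiplied by the number of models that scored the document. The slide only says the averaging was similar to CombMNZ, so this shows the textbook rule rather than the exact procedure used. The names `min_max` and `comb_mnz` and the example scores are hypothetical.

```python
def min_max(scores):
    """Rescale one model's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: a constant run carries no ranking signal.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_mnz(runs):
    """runs: list of {doc_id: raw_score}, one dict per model.
    Returns {doc_id: (number of models scoring doc) * (sum of normalized scores)}."""
    total, hits = {}, {}
    for run in runs:
        for doc, s in min_max(run).items():
            total[doc] = total.get(doc, 0.0) + s
            hits[doc] = hits.get(doc, 0) + 1
    return {doc: hits[doc] * total[doc] for doc in total}


# Hypothetical raw scores from two models for the same query.
runs = [{"a": 3.0, "b": 1.0, "c": 2.0},
        {"a": 0.5, "b": 0.9, "d": 0.1}]
fused = comb_mnz(runs)
print(sorted(fused, key=fused.get, reverse=True))
```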
Best result so far
0.642436
Future work
• Add more models
– SVMpref (reduction of L2R to classification)
– Direct optimization of AUC
– Experiment with more sophisticated ensemble models (MonoRank, etc.)
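Since direct AUC optimization is listed as future work, a small reference implementation of the metric may help. With labels from {0,1}, AUC is the fraction of (relevant, non-relevant) pairs that the scores order correctly, with ties counted as half. This is a minimal quadratic-time sketch, and the name `auc` is hypothetical.

```python
def auc(labels, scores):
    """Pairwise AUC for binary labels; ties between scores count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```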