Click Log Mining CS598

Query Log Mining Yandex Challenge 2011

Nikita Spirin, Shih-Wen Huang, Shuo Yang, Anirudh Ravula

Search logs are used to improve search

• Learn a ranking function

– Users click on meaningful results

• Personalize search based on users' history

– Previous user searches reveal users' interests

• Identify spammers

– Bots click on suspicious websites more often

• Tune contextual advertising models

• Recommend and disambiguate queries

– See also “java programming” vs. “java coffee”

Yandex QLM Challenge 2011 goals

• Learn a ranking function

– For a given query provide a list of ordered URLs using the information from the log

• Plan for today

– Task description

– General framework: learning to rank (L2R)

– Features for L2R

– Preferences extraction for L2R

– Ranking algorithms

– Collaborative Filtering and graph-based approaches

– Experiments

– Plans for future improvement

Task description: Input to the challenge

• Query log

– Query action

SessionID TimePassed QUERY QueryID RegionID ListOfURLs

– Click action

SessionID TimePassed CLICK URLID

• Training relevance labels from the set {0,1}

QueryID RegionID URLID RelevanceLabel

• Testing query/region pairs

– The goal is to provide relevant URLs for these new query/region pairs
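The two record types above can be read with a small parser. This is a minimal sketch, assuming fields are whitespace-separated and all IDs are integers (the slides do not state either); the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class QueryAction:
    session_id: int
    time_passed: int
    query_id: int
    region_id: int
    url_ids: List[int]


@dataclass
class ClickAction:
    session_id: int
    time_passed: int
    url_id: int


def parse_record(line: str) -> Union[QueryAction, ClickAction]:
    """Parse one log line into a QueryAction or ClickAction.

    Assumes whitespace-separated fields and integer IDs.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"too few fields: {line!r}")
    session_id, time_passed, action = int(fields[0]), int(fields[1]), fields[2]
    if action == "QUERY":
        if len(fields) < 5:
            raise ValueError(f"QUERY record without RegionID: {line!r}")
        # Everything after RegionID is the list of shown URLs, in rank order.
        urls = [int(u) for u in fields[5:]]
        return QueryAction(session_id, time_passed,
                           int(fields[3]), int(fields[4]), urls)
    if action == "CLICK":
        return ClickAction(session_id, time_passed, int(fields[3]))
    raise ValueError(f"unknown action type: {action!r}")
```

For example, `parse_record("10 13 CLICK 71")` yields a `ClickAction` for URL 71 in session 10.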

Some real input data

• Snapshot of the real Yandex query log

SessionID Time Action QueryId RegionId URL URL URL

• Training relevance labels from the set {0,1}

QueryId RegionId URL Relevance

Some statistics about the query log

• Unique queries: 30,717,251

• Unique URLs: 117,093,258

• Sessions: 43,977,859

• Total records in the log: 340,796,067

• Assessed query-region-url triples for the total query set (training + test): 71,930

• Log size: 17 GB (doesn’t fit into memory)
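Because the log does not fit into memory, statistics like these are most naturally computed in one streaming pass. The sketch below is an illustration only, assuming whitespace-separated fields in the record layout shown earlier; `log_statistics` is a hypothetical helper, not the tool used to produce the numbers above.

```python
from typing import Dict, Iterable


def log_statistics(lines: Iterable[str]) -> Dict[str, int]:
    """Count records, sessions, unique queries and unique URLs in one pass.

    `lines` can be an open file object, so only one line is held at a time.
    """
    sessions, queries, urls = set(), set(), set()
    records = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        records += 1
        sessions.add(fields[0])
        if fields[2] == "QUERY":
            queries.add(fields[3])
            urls.update(fields[5:])  # URLs shown for this query
        elif fields[2] == "CLICK":
            urls.add(fields[3])
    return {
        "records": records,
        "sessions": len(sessions),
        "unique_queries": len(queries),
        "unique_urls": len(urls),
    }
```

Note that the sets of distinct IDs must still fit in memory; at the scale reported above, an on-disk or approximate distinct counter may be needed instead.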

General Framework: Learning to Rank (L2R)

• Training formalization:

– Given an ordered set of ranks Y = {0,1} (0 < 1)

– Given a set of queries Q = {q1, . . . , qn}

– A list of documents is associated with each query Dq = {dq1, . . . , dq,n(q)}

– Factor ranking model:

• Xqd = ( f1(q, d), . . . , fm(q, d) ), feature vector for q-d pair

• Goal of L2R:

– Learn a ranker: X → Y
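To make the formalization concrete, here is a minimal sketch of how a learned ranker is applied: each document of a query is scored from its feature vector and the list is sorted by score. The linear scoring function and the example weights are assumptions for illustration; the slides only require some mapping from X to Y.

```python
from typing import Dict, List, Sequence


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear scoring function: dot product of weights and feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def rank_documents(weights: Sequence[float],
                   doc_features: Dict[str, Sequence[float]]) -> List[str]:
    """Return the document IDs of one query, best score first."""
    return sorted(doc_features,
                  key=lambda d: score(weights, doc_features[d]),
                  reverse=True)


# Hypothetical feature vectors X_qd = (f1(q, d), f2(q, d)) for one query.
docs = {"d1": [0.2, 0.1], "d2": [0.9, 0.0], "d3": [0.1, 1.0]}
ranking = rank_documents([1.0, 0.5], docs)
```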

Subtasks of L2R from query logs

• Extract preferences (absolute, pairwise) from a query log using click-through statistics

• Generate features (factors) to give the problem structure

• Learn a ranking algorithm
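As an illustration of the first subtask, the sketch below derives both kinds of preferences from impressions (a shown URL list plus the set of clicked URLs). The absolute signal is a per-URL click-through rate; the pairwise signal uses the well-known "click > skip above" heuristic. Both choices are assumptions for this example, since the slides do not say which heuristics were used.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Impression = Tuple[Sequence[int], Set[int]]  # (shown URLs in rank order, clicked URLs)


def click_through_rate(impressions: Iterable[Impression]) -> Dict[int, float]:
    """Absolute preference: clicks divided by times shown, per URL."""
    shown_count: Counter = Counter()
    click_count: Counter = Counter()
    for shown, clicked in impressions:
        for url in shown:
            shown_count[url] += 1
            if url in clicked:
                click_count[url] += 1
    return {url: click_count[url] / shown_count[url] for url in shown_count}


def skip_above_pairs(shown: Sequence[int], clicked: Set[int]) -> List[Tuple[int, int]]:
    """Pairwise preference: a clicked URL is preferred over every
    unclicked URL that was ranked above it."""
    pairs = []
    for pos, url in enumerate(shown):
        if url in clicked:
            for above in shown[:pos]:
                if above not in clicked:
                    pairs.append((url, above))  # (preferred, less preferred)
    return pairs
```

In practice the statistics would be kept per query-region-URL triple rather than per URL; the per-URL form keeps the sketch short.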

SVM for L2R = RankSVM

• Extract preferences from a query log based on some heuristics
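Once pairwise preferences are available, the RankSVM idea can be sketched as a linear model trained so that the preferred document of each pair scores higher by a margin. The version below is a simplified illustration: it minimizes the L2-regularized pairwise hinge loss with plain subgradient descent instead of a QP solver, and the hyperparameter values are arbitrary assumptions.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (preferred features, other features)


def train_rank_svm(pairs: List[Pair], n_features: int,
                   epochs: int = 200, lr: float = 0.1,
                   reg: float = 0.01) -> List[float]:
    """Learn weights w such that w . (better - worse) >= 1 where possible."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for i in range(n_features):
                grad = reg * w[i]          # gradient of the L2 penalty
                if margin < 1.0:
                    grad -= diff[i]        # subgradient of the hinge loss
                w[i] -= lr * grad
    return w


def svm_score(w: Sequence[float], x: Sequence[float]) -> float:
    """Ranking score of one document under the learned weights."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

Documents of a query are then sorted by `svm_score`, as in any linear ranker.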

Boosting for L2R = RankBoost

• Uses each feature as a decision stump

• Builds a linear weighted ensemble model
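The two bullets above can be illustrated with a compact RankBoost sketch over preference pairs: each round picks the single-feature threshold stump that best separates preferred from non-preferred documents under the current pair weights, then reweights the pairs. The stump form (output 1 if a feature exceeds a threshold), the candidate thresholds, and the number of rounds are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (preferred, other)
Stump = Tuple[int, float, float]                # (feature index, threshold, alpha)


def stump_output(x: Sequence[float], feature: int, threshold: float) -> float:
    return 1.0 if x[feature] > threshold else 0.0


def train_rankboost(pairs: List[Pair], n_rounds: int = 10) -> List[Stump]:
    n_features = len(pairs[0][0])
    dist = [1.0 / len(pairs)] * len(pairs)  # weight distribution over pairs
    # Candidate stumps: every observed value of every feature as a threshold.
    candidates = sorted({(f, x[f]) for pair in pairs for x in pair
                         for f in range(n_features)})
    model: List[Stump] = []
    for _ in range(n_rounds):
        best_r, best_f, best_t = 0.0, None, None
        for f, t in candidates:
            # r measures how well the stump orders the weighted pairs.
            r = sum(d * (stump_output(b, f, t) - stump_output(w, f, t))
                    for d, (b, w) in zip(dist, pairs))
            if abs(r) > abs(best_r):
                best_r, best_f, best_t = r, f, t
        if best_f is None:
            break  # no stump separates any pair
        r = max(min(best_r, 1 - 1e-9), -1 + 1e-9)  # keep the log finite
        alpha = 0.5 * math.log((1 + r) / (1 - r))
        model.append((best_f, best_t, alpha))
        # Down-weight pairs the stump orders correctly, up-weight the rest.
        dist = [d * math.exp(alpha * (stump_output(w, best_f, best_t)
                                      - stump_output(b, best_f, best_t)))
                for d, (b, w) in zip(dist, pairs)]
        total = sum(dist)
        dist = [d / total for d in dist]
    return model


def rankboost_score(model: List[Stump], x: Sequence[float]) -> float:
    """Final ranker: linear weighted ensemble of the selected stumps."""
    return sum(alpha * stump_output(x, f, t) for f, t, alpha in model)
```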

Ensemble Approach

• Generate multiple models by varying…

– Feature subsets

– Algorithm parameters

– Ranking models

– Model Subsets

– Averaging strategies (weighted, quality-based, etc.)

• Finally average [similar to CombMNZ]
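The final averaging step can be sketched with a CombMNZ-style fusion: each model's scores are min-max normalized, summed per document, and multiplied by the number of models that scored the document. This is an illustration of the general technique only; the slides say the actual averaging was "similar to" CombMNZ without giving details, and the function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one model's scores to [0, 1] so models are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}  # model gives no ordering information
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_mnz(model_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Fuse several models: summed normalized score times number of
    models that scored the document."""
    summed: Dict[str, float] = defaultdict(float)
    hits: Dict[str, int] = defaultdict(int)
    for scores in model_scores:
        for doc, s in min_max_normalize(scores).items():
            summed[doc] += s
            hits[doc] += 1
    return {doc: summed[doc] * hits[doc] for doc in summed}


def fused_ranking(model_scores: List[Dict[str, float]]) -> List[str]:
    """Documents ordered by fused score, best first."""
    fused = comb_mnz(model_scores)
    return sorted(fused, key=fused.get, reverse=True)
```

A weighted variant would multiply each model's normalized scores by a per-model weight (for example, its validation quality) before summing.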

Best result so far

0.642436

Future work

• Add more models

– SVMpref (reduction of L2R to classification)

– Direct optimization of AUC

– Experiment with more sophisticated ensemble models (MonoRank, etc.)
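Since direct optimization of AUC is listed above, a short sketch of how AUC is computed for binary relevance labels may help: it is the fraction of (relevant, non-relevant) document pairs that the ranker orders correctly, with ties counted as half. Whether the score reported earlier is an AUC value is not stated in the slides; the function below is only a generic illustration.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for relevant, 0 for non-relevant.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one relevant and one non-relevant item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # pair ordered correctly
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of documents; a sort-based computation is preferable for large lists.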