Behavior-driven clustering of queries into topics
CIKM 2011, Glasgow
Luca Maria Aiello, Debora Donato, Umut Ozertem, Filippo Menczer
USER PROFILING IN SEARCH ENGINES
Granularity levels
Aggregation
27/10/2011
Concise representation
Meaningful semantics
Query
Session
Goal
Mission
Topic
MISSIONS AND TOPICS
A topic is a mental object or cognitive content, i.e., the sum of what can be perceived, discovered or learned about any real or abstract entity.
A search mission is a set of queries that express a complex search need, possibly articulated in smaller goals.
QUERY STREAM DECOMPOSITION
Queries in the same mission
Same topic
Queries in consecutive missions
Different topic
Donato et al.: Do you want to take notes? Identifying research missions in Y! Search Pad. WWW'10
Taxonomies
User behavior and intent
MERGING MISSIONS
TOPIC DETECTOR STATS
• Gradient Boosted Decision Tree (GBDT)
• Aggregation (min, max, avg, std) of 62 query pair features
• AUC 0.95
• 10-fold cross-validation on 500K pairs
Lexical features          Behavioral features
Trigrams/terms cosine     Probability fwd
Common prefix/suffix      Session total click avg
Length difference         Session total time avg
…                         …
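The aggregation step can be sketched as follows. This is a minimal illustration, not the authors' feature pipeline: it computes only three lexical features of the kind listed above for every query pair drawn from two query sets, then summarizes each with min, max, avg and std. The function names (pair_features, aggregate_features) and the exact feature definitions are assumptions made for the example; the behavioral features and the GBDT model itself are not reproduced.

```python
import math
from collections import Counter
from itertools import product
from statistics import mean, pstdev


def trigrams(text):
    """Character trigram counts of a query (padded so short queries yield some)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def common_prefix_len(q1, q2):
    """Number of leading characters shared by the two queries."""
    n = 0
    for c1, c2 in zip(q1, q2):
        if c1 != c2:
            break
        n += 1
    return n


def pair_features(q1, q2):
    """Hypothetical lexical features for one query pair (assumed definitions)."""
    return {
        "trigram_cosine": cosine(trigrams(q1), trigrams(q2)),
        "common_prefix": common_prefix_len(q1, q2),
        "length_diff": abs(len(q1) - len(q2)),
    }


def aggregate_features(set_a, set_b):
    """Aggregate each pair feature over all cross pairs with min, max, avg, std."""
    rows = [pair_features(a, b) for a, b in product(set_a, set_b)]
    if not rows:
        raise ValueError("both query sets must be non-empty")
    out = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        out[f"{name}_min"] = min(values)
        out[f"{name}_max"] = max(values)
        out[f"{name}_avg"] = mean(values)
        out[f"{name}_std"] = pstdev(values)  # population std over the pairs
    return out
```

A vector like this, one per pair of query sets, is the kind of input a classifier such as a GBDT could be trained on to decide whether the two sets share a topic.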
GREEDY AGGLOMERATIVE TOPIC EXTRACTION (GATE)
• Topic detector applied to pairs of query sets
• O(log|M|·|M|^2) (heavily parallelizable)
1. Missions of the same user → supermissions
2. Query sets of different users → higher-level topics
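One plausible reading of the greedy agglomerative procedure is sketched below. The slide gives only the outline, so the details are assumptions: in each round every pair of current query sets is scored, the best-scoring disjoint pairs above a threshold are merged, and the rounds stop when no pair qualifies. The names gate_sketch and term_jaccard and the threshold value are hypothetical, and term_jaccard is only a stand-in for the trained topic detector.

```python
from itertools import combinations


def term_jaccard(set_a, set_b):
    """Stand-in scorer (NOT the GBDT topic detector): Jaccard overlap of query terms."""
    ta = {t for q in set_a for t in q.split()}
    tb = {t for q in set_b for t in q.split()}
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def gate_sketch(query_sets, score, threshold=0.5):
    """Greedy agglomerative merging of query sets (assumed procedure).

    Each round scores all pairs of clusters, then merges the highest-scoring
    pairs first, using every cluster at most once per round.
    """
    clusters = [frozenset(s) for s in query_sets]
    while True:
        scored = sorted(
            ((score(a, b), i, j)
             for (i, a), (j, b) in combinations(enumerate(clusters), 2)),
            reverse=True,
        )
        used, merged = set(), []
        for s, i, j in scored:
            if s < threshold:
                break  # remaining pairs score even lower
            if i in used or j in used:
                continue  # one merge per cluster per round
            merged.append(clusters[i] | clusters[j])
            used.update((i, j))
        if not merged:
            return clusters  # no pair passed the threshold: done
        clusters = merged + [c for k, c in enumerate(clusters) if k not in used]


missions = [
    {"rome hotels"},
    {"hotels rome cheap"},
    {"python tutorial"},
    {"python list tutorial"},
]
topics = gate_sketch(missions, term_jaccard, threshold=0.5)
```

In this sketch the pair scoring inside a round is independent across pairs, which is the part that could be distributed. Running it first on the missions of one user and then on query sets of different users would mirror the two steps listed above.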
EVALUATION
40K users
3 months Y! log
EVALUATION: BASELINE
• OSLOM community detection algorithm
– Weighted undirected graph
– Maximizing local fitness function of clusters
– Automatic hierarchy detection
Lancichinetti et al.: Finding statistically significant communities in networks. PLoS ONE, 2011.
URL cover graph
EVALUATION: QUERY SET COVERAGE
Fraction of queries considered in the clustering phase
URL cover graph connected components size distribution
GATE: 1    OSLOM: 0.2
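To make the graph-based quantities concrete, here is a small sketch under stated assumptions. It assumes the URL cover graph links two queries whenever they share at least one clicked URL, with the number of shared URLs as edge weight, and it assumes a query is "considered" by a graph-based method only if it has at least one neighbor. The slide does not say how the 0.2 figure was obtained, so the coverage definition here is illustrative only; all function names are hypothetical.

```python
from collections import defaultdict


def cover_graph_edges(clicks):
    """clicks: dict query -> set of clicked URLs.

    Returns dict (q1, q2) -> number of shared clicked URLs (assumed weighting).
    """
    by_url = defaultdict(set)
    for query, urls in clicks.items():
        for url in urls:
            by_url[url].add(query)
    weights = defaultdict(int)
    for queries in by_url.values():
        ordered = sorted(queries)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                weights[(a, b)] += 1
    return dict(weights)


def component_sizes(nodes, edges):
    """Sizes of connected components, largest first, via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    sizes = defaultdict(int)
    for n in nodes:
        sizes[find(n)] += 1
    return sorted(sizes.values(), reverse=True)


def coverage(clicks):
    """Fraction of queries with at least one neighbor (assumed definition)."""
    connected = {q for edge in cover_graph_edges(clicks) for q in edge}
    return len(connected) / len(clicks) if clicks else 0.0


clicks = {
    "rome hotels": {"u1", "u2"},
    "hotels in rome": {"u1", "u2"},
    "rome b&b": {"u2"},
    "python tutorial": {"u3"},
    "weather": set(),
}
edges = cover_graph_edges(clicks)
sizes = component_sizes(clicks, edges)
```

Queries with no clicks or with unique clicked URLs end up as isolated nodes in such a graph, which is one way a graph-based baseline can leave part of the query set uncovered.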
EVALUATION: SINGLETON RATIO
Fraction of queries that remain isolated in singleton clusters
GATE: 0.55-0.27    OSLOM: 0.88
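As a small worked illustration of this metric, the sketch below computes the singleton ratio for a clustering given as a list of disjoint query sets. The function name and the input representation are assumptions made for the example.

```python
def singleton_ratio(clusters):
    """Fraction of all queries that sit alone in a cluster of size 1.

    Assumes the clusters are disjoint, so each query is counted once.
    """
    sizes = [len(c) for c in clusters]
    total = sum(sizes)
    return sum(1 for s in sizes if s == 1) / total if total else 0.0


example = [{"rome hotels", "hotels in rome"}, {"python tutorial"}, {"weather"}]
ratio = singleton_ratio(example)
```

Here two of the four queries are isolated, so the ratio is 0.5; lower values mean the method manages to attach more queries to some topic.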
EVALUATION: AGGREGATION ABILITY
Topics aggregated in two consecutive steps or levels
GATE: 500K    OSLOM: 100K
EVALUATION: PURITY vs. COVERAGE
• Coverage
– Number of unique clicked URLs for the query
• Purity
– Average pointwise mutual information of pairs of query-related relevant terms
• Relevant terms are extracted from top clicked results using a predefined dictionary
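The purity measure can be illustrated with a short sketch. Pointwise mutual information is the standard PMI(x, y) = log2(p(x, y) / (p(x) p(y))); everything else here is an assumption: probabilities are estimated from a toy collection of documents represented as term sets, pairs that never co-occur are skipped because their PMI is undefined, and the names pmi and purity are hypothetical. The slide does not specify which corpus the probabilities were estimated on.

```python
import math
from itertools import combinations


def pmi(x, y, docs):
    """PMI of two terms, with probabilities estimated as document frequencies.

    Returns None when the terms never co-occur (PMI would be minus infinity).
    """
    n = len(docs)
    if n == 0:
        return None
    p_x = sum(x in d for d in docs) / n
    p_y = sum(y in d for d in docs) / n
    p_xy = sum(x in d and y in d for d in docs) / n
    if p_xy == 0:
        return None
    return math.log2(p_xy / (p_x * p_y))


def purity(terms, docs):
    """Average PMI over all pairs of relevant terms (assumed handling of gaps)."""
    scores = []
    for a, b in combinations(sorted(set(terms)), 2):
        score = pmi(a, b, docs)
        if score is not None:  # skip pairs with undefined PMI
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


docs = [
    {"rome", "hotel"},
    {"rome", "hotel"},
    {"python", "code"},
    {"python", "code"},
]
```

With these toy documents, "rome" and "hotel" always appear together, so their PMI is positive, while "rome" and "python" never co-occur and contribute nothing to the average.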
USER PROFILING
USER PROFILING FROM TOPICS
[Figure: missions → topic detector → topics → user topical profile (per-topic weights)]
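A user topical profile of this kind can be sketched as a normalized distribution over topics. This is an assumed construction, not the one confirmed by the slides: each mission of the user is taken to be already assigned a topic label, and the profile weight of a topic is the fraction of the user's missions that fall in it. The function name topical_profile and the labels are hypothetical.

```python
from collections import Counter


def topical_profile(mission_topics):
    """Map each topic to the fraction of the user's missions assigned to it.

    mission_topics: one topic label per mission (assumed input format).
    """
    counts = Counter(mission_topics)
    total = sum(counts.values())
    return {topic: c / total for topic, c in counts.items()} if total else {}


profile = topical_profile(["travel", "travel", "coding", "travel"])
```

Topics the user never touched simply do not appear in the dictionary, which corresponds to zero weight in a dense profile vector.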
PROFILES FOR “PREDICTION”
• Sequence of missions of the profiled user vs. sequence of a random one
• Sequence-profile match using topic detector
• Success: 0.65 (0.72 less frequent, 0.55 most frequent)
CONCLUSIONS
• New behavior-driven notion of topics
• Bottom-up topic extraction algorithm
• Favorable comparison with graph-based clustering
• Effective user profiling
• Other baselines
• More accurate predictions
ACKNOWLEDGMENTS
Fil Menczer: Prof. Informatics @ IU, Director CNetS @ IU
Umut Ozertem: Yahoo! Search Sciences, Yahoo! Labs @ Sunnyvale
Emre Velisapaoglu: Yahoo! Search Sciences, Yahoo! Labs @ Sunnyvale
Debora Donato: Yahoo! Search Sciences, Yahoo! Labs @ Sunnyvale