Behavior-driven clustering of queries into topics

CIKM 2011, Glasgow. Behavior-driven clustering of queries into topics. Luca Maria Aiello, Debora Donato, Umut Ozertem, Filippo Menczer.


Page 1: Behavior-driven clustering of queries into topics

CIKM 2011, Glasgow

Behavior-driven clustering of queries into topics

Luca Maria Aiello, Debora Donato, Umut Ozertem, Filippo Menczer

Page 2: Behavior-driven clustering of queries into topics

USER PROFILING IN SEARCH ENGINES

CIKM 2011, 27/10/2011

Granularity levels: Query, Session, Goal, Mission, Topic

Aggregation

Concise representation

Meaningful semantics

Page 3: Behavior-driven clustering of queries into topics

MISSIONS AND TOPICS

A topic is a mental object or cognitive content, i.e., the sum of what can be perceived, discovered or learned about any real or abstract entity.

A search mission can be identified as a set of queries that express a complex search need, possibly articulated in smaller goals.

Page 4: Behavior-driven clustering of queries into topics

QUERY STREAM DECOMPOSITION

Queries in the same mission: same topic

Queries in consecutive missions: different topic

Donato et al.: Do you want to take notes? Identifying research missions in Y! search pad. WWW'10

Taxonomies

User behavior and intent
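
One way to read the decomposition above is as a recipe for labeled query pairs: pairs drawn from one mission count as same-topic, pairs that straddle two consecutive missions count as different-topic. The sketch below assumes the query stream has already been segmented into missions; the function name `label_pairs` and the sample queries are hypothetical and do not come from the talk.

```python
from itertools import combinations, product


def label_pairs(missions):
    """Return a list of (query_a, query_b, label) triples.

    missions is an ordered list of missions, each a list of query strings.
    label is 1 for pairs inside one mission (same topic) and 0 for pairs
    that straddle two consecutive missions (different topic).
    """
    pairs = []
    # Positive examples: every unordered pair within a mission.
    for mission in missions:
        for a, b in combinations(mission, 2):
            pairs.append((a, b, 1))
    # Negative examples: every cross pair between adjacent missions.
    for prev, nxt in zip(missions, missions[1:]):
        for a, b in product(prev, nxt):
            pairs.append((a, b, 0))
    return pairs


if __name__ == "__main__":
    stream = [["paris hotels", "cheap paris hotels"], ["python tutorial"]]
    for triple in label_pairs(stream):
        print(triple)
```

Whether the talk uses exactly these pairs to train the topic detector is not stated on the slide, so treat the labeling rule as an interpretation.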

Page 5: Behavior-driven clustering of queries into topics

MERGING MISSIONS

Page 6: Behavior-driven clustering of queries into topics

TOPIC DETECTOR STATS

• Gradient Boosted Decision Tree (GBDT)
• Aggregation (min, max, avg, std) of 62 query pair features

AUC 0.95, 10X cross validation on 500K pairs

Lexical features          Behavioral features
Trigrams/terms cosine     Probability fwd
Common prefix/suffix      Session total click avg
Length difference         Session total time avg
…                         …
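
The aggregation step on this slide can be illustrated with a small sketch: compute a few pair features for every cross pair of queries from two query sets, then summarize each feature with min, max, avg, and std. Only three of the lexical features named above are shown (trigram cosine, common prefix, length difference); the behavioral features need log data and are left out. All function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def trigrams(text):
    """Character trigram counts of a query, with light padding."""
    padded = f"  {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


def common_prefix_len(a, b):
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pair_features(q1, q2):
    """Three illustrative lexical features for one query pair."""
    return [
        cosine(trigrams(q1), trigrams(q2)),
        float(common_prefix_len(q1, q2)),
        float(abs(len(q1) - len(q2))),
    ]


def set_pair_features(set_a, set_b):
    """Aggregate each pair feature with min, max, avg, std over all cross pairs."""
    rows = [pair_features(a, b) for a in set_a for b in set_b]
    out = []
    for column in zip(*rows):
        out.extend([min(column), max(column), mean(column), pstdev(column)])
    return out
```

The resulting fixed-length vector is the kind of input a classifier such as a GBDT could consume, regardless of how many queries each set contains.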

Page 7: Behavior-driven clustering of queries into topics

GREEDY AGGLOMERATIVE TOPIC EXTRACTION (GATE)

• Topic detector applied to pairs of query sets
• O(log|M|·|M|^2) (heavily parallelizable)

1. Missions of the same user → supermissions
2. Query sets of different users → higher-level topics
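
The slide gives only the outline of GATE, so the following is a generic greedy agglomerative loop under stated assumptions, not the authors' algorithm. It repeatedly merges the pair of query sets that a scoring function rates highest and stops when no pair reaches a threshold. The names `greedy_agglomerate` and `jaccard_on_terms`, the threshold value, and the toy scoring function standing in for the topic detector are all hypothetical.

```python
def greedy_agglomerate(query_sets, same_topic_score, threshold=0.5):
    """Repeatedly merge the highest-scoring pair of query sets.

    same_topic_score(a, b) stands in for the topic detector and returns a
    score in [0, 1]; merging stops when no pair reaches the threshold.
    """
    clusters = [frozenset(s) for s in query_sets]
    while len(clusters) > 1:
        best_score, best_i, best_j = -1.0, None, None
        # Exhaustive scan over all pairs; simple, but not optimized.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = same_topic_score(clusters[i], clusters[j])
                if score > best_score:
                    best_score, best_i, best_j = score, i, j
        if best_score < threshold:
            break
        merged = clusters[best_i] | clusters[best_j]
        clusters = [c for k, c in enumerate(clusters) if k not in (best_i, best_j)]
        clusters.append(merged)
    return clusters


def jaccard_on_terms(a, b):
    """Toy stand-in for the topic detector: term overlap of two query sets."""
    terms_a = {t for q in a for t in q.split()}
    terms_b = {t for q in b for t in q.split()}
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


if __name__ == "__main__":
    missions = [{"paris hotels"}, {"hotels in paris"}, {"python tutorial"}]
    for cluster in greedy_agglomerate(missions, jaccard_on_terms):
        print(sorted(cluster))
```

This naive version rescans every pair after each merge and makes no attempt to reach the complexity quoted on the slide. The same loop could be run first on the missions of one user and then on query sets from different users, matching the two steps listed above.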

Page 8: Behavior-driven clustering of queries into topics

EVALUATION

40K users

3 months Y! log

Page 9: Behavior-driven clustering of queries into topics

EVALUATION: BASELINE

• OSLOM community detection algorithm
  – Weighted undirected graph
  – Maximizing local fitness function of clusters
  – Automatic hierarchy detection

Lancichinetti et al.: Finding statistically significant communities in networks. PLoS ONE, 2011.

2URL cover graph
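
The baseline runs on a URL cover graph. A minimal sketch of building such a graph follows, assuming two queries are linked when they share at least one clicked URL and that the edge weight is the number of shared URLs. Both the weighting and the name `url_cover_graph` are assumptions; the slide only says the graph is weighted and undirected.

```python
from collections import defaultdict
from itertools import combinations


def url_cover_graph(clicks):
    """Build weighted undirected edges from a {query: set_of_clicked_urls} map.

    Returns {(query_a, query_b): weight} with query_a < query_b, where the
    weight is the number of clicked URLs the two queries have in common.
    """
    # Invert the map: which queries led to a click on each URL?
    by_url = defaultdict(set)
    for query, urls in clicks.items():
        for url in urls:
            by_url[url].add(query)
    # Every pair of queries sharing a URL gets one unit of weight for it.
    weights = defaultdict(int)
    for queries in by_url.values():
        for a, b in combinations(sorted(queries), 2):
            weights[(a, b)] += 1
    return dict(weights)


if __name__ == "__main__":
    sample = {
        "q1": {"u1", "u2"},
        "q2": {"u1", "u2"},
        "q3": {"u2"},
        "q4": {"u9"},
    }
    print(url_cover_graph(sample))
```

Queries with no shared clicks, such as `q4` in the sample, get no edges at all.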

Page 10: Behavior-driven clustering of queries into topics

EVALUATION: QUERY SET COVERAGE

Fraction of queries considered in the clustering phase

URL cover graph connected components size distribution

GATE: 1, OSLOM: 0.2
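
A connected components size distribution like the one mentioned on this slide can be computed with a union-find pass over the graph's edges. The sketch below is generic; the function name `component_size_distribution` and the node and edge representation are assumptions for illustration.

```python
from collections import Counter


def component_size_distribution(nodes, edges):
    """Return {component_size: number_of_components} for an undirected graph."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the endpoints of every edge.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Count nodes per root, then count how many roots have each size.
    sizes = Counter(find(n) for n in nodes)
    return dict(Counter(sizes.values()))


if __name__ == "__main__":
    nodes = ["q1", "q2", "q3", "q4"]
    edges = [("q1", "q2"), ("q2", "q3")]
    print(component_size_distribution(nodes, edges))
```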

Page 11: Behavior-driven clustering of queries into topics

EVALUATION: SINGLETON RATIO

Fraction of queries that remain isolated in singleton clusters

GATE: 0.55-0.27, OSLOM: 0.88
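
Both ratio-style metrics in this part of the evaluation, query set coverage and singleton ratio, reduce to simple counts over a clustering. The sketch below assumes the denominator in both cases is the total number of queries in the data set; the slides do not spell this out, and the function names are hypothetical.

```python
def coverage(clusters, total_queries):
    """Fraction of all queries that appear in any cluster."""
    covered = set().union(*clusters) if clusters else set()
    return len(covered) / total_queries


def singleton_ratio(clusters, total_queries):
    """Fraction of all queries that sit alone in a cluster of size one."""
    singles = sum(1 for c in clusters if len(c) == 1)
    return singles / total_queries


if __name__ == "__main__":
    clustering = [{"a", "b"}, {"c"}, {"d"}]
    print(coverage(clustering, 5), singleton_ratio(clustering, 5))
```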

Page 12: Behavior-driven clustering of queries into topics

EVALUATION: AGGREGATION ABILITY

Topics aggregated in two consecutive steps or levels

GATE: 500K, OSLOM: 100K

Page 13: Behavior-driven clustering of queries into topics

EVALUATION: PURITY vs. COVERAGE

• Coverage
  – Number of unique clicked URLs for the query
• Purity
  – Average pointwise mutual information of pairs of query-related relevant terms
• Relevant terms are extracted from top clicked results using a predefined dictionary
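
The purity measure can be sketched as follows, using the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))). The slide does not say where the term probabilities come from, so this sketch assumes they are estimated from co-occurrence in a collection of documents, each given as a set of terms. The names `pmi` and `purity` and the handling of pairs that never co-occur are assumptions.

```python
import math
from itertools import combinations


def pmi(term_a, term_b, documents):
    """Pointwise mutual information of two terms from document co-occurrence."""
    n = len(documents)
    p_a = sum(term_a in d for d in documents) / n
    p_b = sum(term_b in d for d in documents) / n
    p_ab = sum(term_a in d and term_b in d for d in documents) / n
    if p_ab == 0:
        # Assumption: unsmoothed estimate, so a pair that never co-occurs
        # scores minus infinity; real data would likely need smoothing.
        return float("-inf")
    return math.log2(p_ab / (p_a * p_b))


def purity(terms, documents):
    """Average PMI over all unordered pairs of the given terms."""
    pairs = list(combinations(sorted(terms), 2))
    if not pairs:
        return 0.0
    return sum(pmi(a, b, documents) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    docs = [{"hotel", "paris"}, {"hotel", "paris"}, {"python"}, {"python", "code"}]
    print(purity({"hotel", "paris"}, docs))
```

Terms that co-occur more often than chance give a positive score, independent terms give zero, and terms that avoid each other give a negative score.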

Page 14: Behavior-driven clustering of queries into topics

EVALUATION: PURITY vs. COVERAGE

Page 15: Behavior-driven clustering of queries into topics

EVALUATION: PURITY vs. COVERAGE

Page 16: Behavior-driven clustering of queries into topics

USER PROFILING

Page 17: Behavior-driven clustering of queries into topics

USER PROFILING FROM TOPICS

[Figure with labels: Topic detector, Missions, Topics, User topical profile. Values shown: 0.0 0.0 0.0, 0.7, 2.9 3.2 1.9, 0.35 0.41 0.24]

Page 18: Behavior-driven clustering of queries into topics

PROFILES FOR “PREDICTION”

• Sequence of missions of the profiled user vs. sequence of a random one
• Sequence-profile match using topic detector
• Success: 0.65 (0.72 less frequent, 0.55 most frequent)

Page 19: Behavior-driven clustering of queries into topics

CONCLUSIONS

• New behavior-driven notion of topics
• Bottom-up topic extraction algorithm
• Favorable comparison with graph-based clustering
• Effective user profiling

• Other baselines
• More accurate predictions

Page 20: Behavior-driven clustering of queries into topics

ACKNOWLEDGMENTS

Fil Menczer
Prof. Informatics @ IU
Director CNetS @ IU

Umut Ozertem
Yahoo! Search Sciences
Yahoo! Labs @ Sunnyvale

Emre Velisapaoglu
Yahoo! Search Sciences
Yahoo! Labs @ Sunnyvale

Debora Donato
Yahoo! Search Sciences
Yahoo! Labs @ Sunnyvale

Page 21: Behavior-driven clustering of queries into topics
Page 22: Behavior-driven clustering of queries into topics

Taxonomies

User behavior and intent