Personalizing Web Search using Long Term Browsing History Nicolaas Matthijs, Cambridge Filip...

Personalizing Web Search using Long Term Browsing History

Nicolaas Matthijs, Cambridge

Filip Radlinski, Microsoft

In Proceedings of WSDM 2011

Query: “pia workshop”

Relevant result

Outline

Approaches to personalization
The proposed personalization strategy
Evaluation metrics
Results
Conclusions and future work

Approaches to Personalization

Observed user interactions
  Short-term interests
    Sriram et al. [24] and [6]: session data is too sparse to personalize
  Longer-term interests
    [23, 16]: model users by classifying previously visited Web pages
    Joachims [11]: user click-through data to learn a search function
    PClink [7] and Teevan et al. [28]
    Other related approaches: [20, 25, 26]

Representing the user
  Teevan et al. [28]: rich keyword-based representations, no use of web page characteristics

Commercial personalization systems
  Google
  Yahoo!

Slide callouts: rich user profile; promote URLs

Personalization Strategy

User Profile Generation Workflow

Browsing History
Data Extraction: title unigrams, metadata description unigrams, full text unigrams, metadata keywords, extracted terms, noun phrases
  -> User Profile Terms
Filtering: WordNet dictionary filtering, Google N-Gram filtering, or no filtering
  -> User Profile Terms
Weighting: TF weighting, TFxIDF weighting, or BM25 weighting
  -> User Profile Terms and Weights

Also in the workflow: visited URLs + number of visits; previous searches & click-through data
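The extraction and filtering stages of the workflow can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the page fields, the tiny `DICTIONARY` set (a stand-in for a real dictionary filter such as WordNet), and all function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for a dictionary filter such as WordNet;
# a real system would load a full lexicon instead.
DICTIONARY = {"search", "retrieval", "dog", "cat", "research", "robot"}

def unigrams(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def extract_terms(page):
    """Collect unigrams from the text fields of one visited page."""
    terms = []
    for field in ("title", "meta_description", "full_text"):
        terms.extend(unigrams(page.get(field, "")))
    # Metadata keywords are assumed to arrive as a ready-made list.
    terms.extend(k.lower() for k in page.get("meta_keywords", []))
    return terms

def build_profile(history, dictionary=None):
    """Count terms over the browsing history; filter only if a dictionary is given."""
    counts = Counter()
    for page in history:
        for term in extract_terms(page):
            if dictionary is None or term in dictionary:
                counts[term] += 1
    return counts

history = [
    {"title": "Dog and cat search", "full_text": "search retrieval xyzzy"},
    {"title": "Robot research", "meta_keywords": ["Robot", "AI"]},
]
profile = build_profile(history, DICTIONARY)   # dictionary filtering
unfiltered = build_profile(history)            # the "no filtering" branch
```

With the dictionary filter, tokens such as "xyzzy" never reach the profile; the unfiltered variant keeps every extracted token.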

Browsing History

Firefox add-on: AlterEgo

Figure (labels: query, Personalized Search, Browsing History). Example term counts shown: dog 1, cat 10, india 2, mit 4, search 93, amherst 12, vegas 1

Data extraction

Figure (labels: query, Personalized Search). Example groups of extracted terms: "dog cat monkey banana food"; "baby infant child boy girl"; "forest hiking walking gorp"; "csail mit artificial research robot"; "web search retrieval ir hunt"

Term weighting

Figure (labels: query, Personalized Search, User Profile Terms). Profile terms such as "web search retrieval ir hunt" are shown with example weights ranging from 0.2 to 6.0

Term Weighting

TF: term frequency, $w_{TF}(t_i)$
Slide example (the term "cow" is inserted into the example term groups): TF = 2/100 = 0.02

TF-IDF: $w_{TFIDF}(t_i) = \frac{1}{\log(DF_{t_i})} \cdot w_{TF}(t_i)$
Slide example: the TF value 2/100 is combined with the factor 1/log(103/107), and the slide reports 0.08

Figure: the example term groups with "cow" added to several of them, and a grid of example per-term weights (values from 0.001 to 0.8)
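A minimal sketch of the two weighting schemes, assuming TF is the share of a term among all profile terms and TF-IDF divides that share by the log of a background document frequency. The natural logarithm, the function names, and the document-frequency values are assumptions made for illustration; they are not taken from the slides.

```python
import math
from collections import Counter

def tf_weights(terms):
    """Relative frequency of each term in the profile's term list."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def tfidf_weights(tf, doc_freq):
    """Scale each TF weight by 1 / log(DF).

    doc_freq maps a term to a background document frequency; the values
    used below are made up. Terms with a missing DF or DF <= 1 are
    skipped because log(DF) would be undefined or non-positive.
    """
    return {t: w / math.log(doc_freq[t])
            for t, w in tf.items() if doc_freq.get(t, 0) > 1}

# 100 terms in total, of which 2 are "cow" (mirrors the 2/100 example).
terms = ["cow"] * 2 + ["search"] * 8 + ["the"] * 90
tf = tf_weights(terms)
tfidf = tfidf_weights(tf, {"cow": 1e3, "search": 1e5, "the": 1e7})
```

The division by log(DF) shrinks the weight of very common terms more strongly than that of rare ones, which is the point of the IDF factor.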

Term Weighting

Personalized BM25

$w_{pBM25}(t_i) = \log \frac{(r_{t_i} + 0.5)(N - n_{t_i} + 0.5)}{(n_{t_i} + 0.5)(R - r_{t_i} + 0.5)}$

Figure: the "World" is labeled with N and n_i, alongside R and r_i
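As a rough illustration of the personalized BM25 weight, the sketch below evaluates the formula for a single term. The reading of the symbols is an assumption based on the slide's labels: N and n_t as document counts in the "World" collection, R and r_t as the corresponding counts in the user's browsing history. The function name and the example counts are hypothetical.

```python
import math

def pbm25_weight(r_t, R, n_t, N):
    """Personalized BM25 weight of one term.

    r_t: documents in the user's history containing the term (assumed meaning)
    R:   documents in the user's history (assumed meaning)
    n_t: documents in the world collection containing the term
    N:   documents in the world collection
    """
    numerator = (r_t + 0.5) * (N - n_t + 0.5)
    denominator = (n_t + 0.5) * (R - r_t + 0.5)
    return math.log(numerator / denominator)

# A term that is common in the user's history but rare in the world.
personal = pbm25_weight(r_t=30, R=100, n_t=1_000, N=1_000_000)
# A term that is rare in the user's history but common in the world.
generic = pbm25_weight(r_t=1, R=100, n_t=900_000, N=1_000_000)
```

Terms that are over-represented in the user's history relative to the world get positive weights, while terms the user sees less often than the world does get negative ones.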

Re-ranking

Use the user profile to re-rank the top results returned by a search engine

Candidate document vs. snippets
  Snippets are more effective (Teevan et al. [28])
  Allow a straightforward personalization implementation

Scoring methods
  Matching: for each term that occurs in both the snippet and the user profile, its weight is added to the snippet's score
  Unique matching: counts each unique term once
  Language model: a language model for the user profile, where the term weights are used as frequency counts
  PClink: Dou et al. [7]
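The three profile-based scoring methods can be sketched as below. This is an illustrative reading of the slide, not the authors' code: the function names, the whitespace tokenization, and the add-one smoothing in the language model are assumptions.

```python
import math

def matching_score(snippet_terms, profile):
    """Add the profile weight once per occurrence of a shared term."""
    return sum(profile.get(t, 0.0) for t in snippet_terms)

def unique_matching_score(snippet_terms, profile):
    """Like matching, but each distinct term is counted only once."""
    return sum(profile.get(t, 0.0) for t in set(snippet_terms))

def language_model_score(snippet_terms, profile):
    """Log-probability of the snippet under a unigram model of the profile.

    Profile weights act as frequency counts. Add-one smoothing (an
    assumption) reserves probability mass for unseen terms.
    """
    total = sum(profile.values())
    vocab = len(profile) + 1  # profile terms plus one slot for unseen terms
    return sum(math.log((profile.get(t, 0.0) + 1.0) / (total + vocab))
               for t in snippet_terms)

def rerank(snippets, profile, score=matching_score):
    """Sort snippets by descending score; ties keep their original order."""
    return sorted(snippets, key=lambda s: score(s.split(), profile),
                  reverse=True)

profile = {"search": 2.0, "retrieval": 1.5, "dog": 0.5}
snippets = ["dog dog dog food", "web search retrieval", "cooking recipes"]
ranked = rerank(snippets, profile)
```

Because the language model sums log-probabilities, longer snippets accumulate lower scores; a practical system would need to account for snippet length.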

Evaluation Metrics

Relevance judgements
  $NDCG@10 = \frac{1}{Z} \sum_{i=1}^{10} \frac{2^{rel_i} - 1}{\log_2(1 + i)}$

Side-by-side
  Show two alternative rankings side-by-side and ask users to vote for the best

Clickthrough-based
  Look at the query and click logs from a large search engine

Interleaved
  New metric for personalized search
  Combine the results of two search rankings (alternating between results, omitting duplicates)
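The NDCG@10 formula can be checked with a short sketch. It assumes Z is the DCG of the ideal ordering of the same judgements, so that a perfect ranking scores 1, and it assumes graded relevance values such as 0, 1 and 2; the function names are hypothetical.

```python
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum((2 ** rel - 1) / math.log2(1 + i)
               for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k=10):
    """DCG divided by Z, the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([2, 1, 0])    # already in the ideal order
reversed_ = ndcg_at_k([0, 1, 2])  # best result ranked last
```

The exponential gain rewards highly relevant results much more than marginally relevant ones, and the log discount penalizes placing them lower in the list.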

Offline Evaluation

6 participants, 2 months of browsing history
Judge relevance of the top 50 pages returned by Google for 12 queries
  25 general queries (16 from the TREC 2009 Web search track), each participant judges 6
  Most recent 40 search queries, each participant judges 5
Each participant took about 2.5 hours to complete

Offline Evaluation

Personalization strategies. Rel: relative weighting
Profile parameters: Full text through Term weights. Ranking parameters: Snippet scoring, Google rank, URLs visited.

Strategy    Full text  Title  Meta keywords  Meta descr.  Extracted terms  Noun phrases  Term weights  Snippet scoring  Google rank  URLs visited
MaxNDCG     -          Rel    Rel            -            -                Rel           TF-IDF        LM               1/log        v=10
MaxQuer     -          -      -              -            Rel              Rel           TF            LM               1/log        v=10
MaxNoRank   -          -      Rel            -            -                -             TF            LM               -            v=10
MaxBestPar  -          Rel    Rel            -            Rel              -             pBM25         LM               1/log        v=10

MaxNDCG: yields the highest average NDCG
MaxQuer: improves the most queries
MaxNoRank: the method with the highest NDCG that does not take the original Google ranking into account
MaxBestPar: obtained by greedily selecting each parameter sequentially

Offline Evaluation

Offline evaluation performance

Method              Average NDCG     +/=/- Queries
Google              0.502 ± 0.067    -
Teevan et al. [28]  0.518 ± 0.062    44/0/28
PClink              0.533 ± 0.057    13/58/1
MaxNDCG             0.573 ± 0.042    48/1/23
MaxQuer             0.567 ± 0.045    52/2/18
MaxNoRank           0.520 ± 0.060    13/52/7
MaxBestPar          0.566 ± 0.044    45/5/22

MaxNDCG and MaxQuer are both significantly better
Interestingly, MaxNoRank is significantly better than Google and Teevan (may be due to overfitting on small offline data)
PClink improves the fewest queries, but is better than Teevan on average NDCG

Offline Evaluation

Figure: distribution of relevance at rank for Google and MaxNDCG rankings

3600 relevance judgements collected: 9% Very Relevant, 32% Relevant, 58% Non-Relevant
Google: places many Very Relevant results in the Top 5
MaxNDCG: adds more Very Relevant results into the Top 5, and succeeds in adding Very Relevant results between Top 5 and Top 10

Online Evaluation

Large-scale interleaved evaluation, with users performing real day-to-day searches

The first 50 results were requested from Google, and the personalization strategy was picked at random

The Team-Draft interleaving algorithm [18] is used to produce a combined ranking

41 users, 7997 queries, 6033 query impressions, of which 6534 queries and 5335 query impressions received a click
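A minimal sketch of Team-Draft interleaving as it is commonly described: the two rankings take turns "drafting" their best not-yet-picked result, with a coin flip deciding who picks first whenever both teams have contributed equally, and each click is credited to the team that contributed the clicked result. The function names, the `length` parameter, and the fallback when one ranking runs out are assumptions made for illustration.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length=10, rng=None):
    """Interleave two rankings; return the combined list and each result's team."""
    rng = rng or random.Random()
    interleaved, teams = [], {}
    count = {"A": 0, "B": 0}
    sources = {"A": ranking_a, "B": ranking_b}

    def next_unseen(ranking):
        # Highest-ranked result not yet placed in the combined list.
        return next((d for d in ranking if d not in teams), None)

    while len(interleaved) < length:
        if count["A"] != count["B"]:
            # The team with fewer picks goes first.
            order = ["A", "B"] if count["A"] < count["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)  # coin flip
        for team in order:
            doc = next_unseen(sources[team])
            if doc is not None:
                interleaved.append(doc)
                teams[doc] = team
                count[team] += 1
                break
        else:
            break  # both rankings are exhausted
    return interleaved, teams

def credit_clicks(clicked, teams):
    """Count clicks per team; the team with more clicks wins the vote."""
    votes = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in teams:
            votes[teams[doc]] += 1
    return votes

original = ["u1", "u2", "u3", "u4"]   # e.g. the search engine's order
reranked = ["u3", "u1", "u5", "u6"]   # e.g. a personalized order
mixed, teams = team_draft_interleave(original, reranked, length=6,
                                     rng=random.Random(0))
votes = credit_clicks(["u5", "u1", "u6"], teams)
```

Because duplicates are skipped, each result appears once in the combined list, and the random tie-breaking keeps either ranking from systematically getting the better positions.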

Online Evaluation

Results of online interleaving test

Method      Queries  Google Vote  Re-ranked Vote
MaxNDCG     2090     624 (39.5%)  955 (60.5%)
MaxQuer     2273     812 (47.3%)  905 (52.7%)
MaxBestPar  2171     734 (44.8%)  906 (55.2%)

Queries impacted by personalization

Method      Unchanged     Improved     Deteriorated
MaxNDCG     1419 (67.9%)  500 (23.9%)  171 (8.2%)
MaxQuer     1639 (72.1%)  423 (18.6%)  211 (9.3%)
MaxBestPar  1485 (68.4%)  467 (21.5%)  219 (10.1%)
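The vote percentages in the interleaving results appear to be computed over the queries that produced a vote (Google votes plus re-ranked votes), not over all queries for a method. A small sketch to check that reading against the reported counts; the function and variable names are hypothetical.

```python
def vote_shares(google_votes, reranked_votes):
    """Share of votes for each side among queries that produced a vote."""
    total = google_votes + reranked_votes
    return google_votes / total, reranked_votes / total

# Vote counts as reported in the interleaving results table.
results = {
    "MaxNDCG": (624, 955),
    "MaxQuer": (812, 905),
    "MaxBestPar": (734, 906),
}

shares = {name: vote_shares(g, r) for name, (g, r) in results.items()}
for name, (g_share, r_share) in shares.items():
    print(f"{name}: Google {g_share:.1%}, re-ranked {r_share:.1%}")
```

Under this reading, the remaining queries for each method are those where neither ranking received a vote.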

Online Evaluation

Figures: rank differences for deteriorated (light) and improved (dark) queries for MaxNDCG; degree of personalization per rank

For a large majority of deteriorated queries, the clicked result loses only 1 rank
The majority of clicked results that improved a query gain 1 rank
The gains from personalization are on average more than double the losses
MaxNDCG is the most effective personalization method

Conclusions

First large-scale personalized search and online evaluation work

Proposed personalization techniques: significantly outperform default Google and best previous ones

Key to model users: use characteristics and structures of Web pages

Long-term, rich user profile is beneficial


Future Exploration

Parameter extension
  Learning parameter weights
  Using other fields (e.g., headings in HTML) and learning their weights

Incorporating temporal information
  How much browsing history is needed?
  Should the weights of older terms decay?
  How can page visit duration be used?

Making use of more personal data
Using extracted profiles for other purposes

Thank you!

