Search

Transcript of Search

Page 1: Search

April 2006, Moscow

Search

Stephen Robertson, Microsoft Research Cambridge

Page 2: Search

MSR Cambridge

• Andrew Herbert, Director

• Cambridge Laboratory…

• External Research Office – Stephen Emmott

Page 3: Search

MSR Cambridge

• Systems & Networking – Peter Key
  – Operating Systems
  – Networking
  – Distributed Computing

• Machine Learning & Perception – Christopher Bishop
  – Machine Learning
  – Computer Vision
  – Information Retrieval

Page 4: Search

MSR Cambridge

• Programming Principles & Tools – Luca Cardelli
  – Programming Principles & Tools
  – Security

• Computer-Mediated Living – Ken Wood
  – Human Computer Interaction
  – Ubiquitous Computing
  – Sensors and Devices
  – Integrated Systems

Page 5: Search

Search: a bit of history

People sometimes assume that G**gle invented search… but of course this is false

• Library catalogues
• Scientific abstracts
• Printed indexes
• The 1960s to 80s: Boolean search
• Free text queries and ranking – a long gestation
• The web

Page 6: Search

Web search

• The technology
  – Crawling
  – Indexing
  – Ranking
  – Efficiency and effectiveness

• The business
  – Integrity of search
  – UI, speed
  – Ads
  – Ad ranking
  – Payment for clickthrough

Page 7: Search

Other search environments

• Within-site

• Specialist databases

• Enterprise/intranet

• Desktop

Page 8: Search

How search engines work

• Crawl a lot of documents
• Create a vast index
  – Every word in every document
  – Point to where it occurred
  – Allow documents to inherit additional text
    • From the url
    • From anchors in other documents…
  – Index this as well
• Also gather static information
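
As a rough illustration of the kind of structure this slide describes (a toy sketch, not the layout of any real engine), an inverted index maps every term to the documents and positions where it occurs, and text a document inherits from its URL or from anchors in other documents can simply be appended before indexing:

```python
from collections import defaultdict

def build_index(docs, inherited_text=None):
    """Toy inverted index: term -> {doc_id: [positions]}.

    docs: dict of doc_id -> document text.
    inherited_text: optional dict of doc_id -> extra text the document
    inherits (URL words, anchor text from other documents), appended to
    the document's own text before indexing.
    """
    index = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        extra = (inherited_text or {}).get(doc_id, "")
        tokens = (text + " " + extra).lower().split()
        doc_lengths[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index, doc_lengths

docs = {1: "probabilistic models of information retrieval",
        2: "web search engines crawl and index documents"}
index, lengths = build_index(docs, inherited_text={2: "search engine"})
print(index["search"])   # {2: [1, 7]} -- includes the inherited occurrence
```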

Page 9: Search

How search engines work

Given a query:
  – Look up each query word in the index
  – Throw all this information at the ranker

Ranker: a computing engine which calculates a score for each document, and identifies the top n scoring documents

Score depends on a whole variety of features, and may include static information
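
A minimal sketch of that query-time flow, reusing the toy index above; a placeholder score (raw term frequency plus an optional static, query-independent component) stands in for whatever a real ranker computes:

```python
import heapq

def rank(query, index, static_score=None, n=10):
    """Look up each query term, accumulate a score per document,
    add any static (query-independent) score, return the top n."""
    scores = {}
    for term in query.lower().split():
        for doc_id, positions in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + len(positions)  # placeholder feature
    for doc_id in scores:
        scores[doc_id] += (static_score or {}).get(doc_id, 0.0)
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

print(rank("web search", index))   # [(2, 3.0)]
```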

Page 10: Search

A core challenge: ranking

• What features might be useful?
  – Features of the query-document pair
  – Features of the document
  – Maybe features of the query
  – Simple / transformed / compound

• Combining features
  – Formulae
  – Weights and other free parameters
  – Tuning / training / learning
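
The simplest combining formula, shown here purely as an illustration, is a weighted sum of features, with the weights as the free parameters to be tuned or learned (real rankers may combine features in more elaborate ways):

```python
def combined_score(features, weights):
    """Weighted linear combination of ranking features for one
    query-document pair; the weights are free parameters."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Hypothetical feature values and weights for one query-document pair:
features = {"text_match": 12.3, "title_match": 1.0, "static_rank": 0.4}
weights  = {"text_match": 1.0, "title_match": 2.5, "static_rank": 3.0}
print(combined_score(features, weights))   # 16.0
```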

Page 11: Search

Ranking algorithms

• Based on probabilistic models
  – we are trying to predict relevance

• … plus a little linguistic analysis
  – but this is secondary to the statistics

• … plus a great deal of know-how, experience, experiment

• Need:
  – Evidence from all possible sources
  – … combined appropriately
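
One concrete example of a ranking function derived from a probabilistic relevance model is BM25; the sketch below reuses the toy index from earlier and conventional default parameter values (the slides do not prescribe any particular formula):

```python
import math

def bm25(query_terms, doc_id, index, doc_lengths, k1=1.2, b=0.75):
    """BM25 score for one query-document pair. k1 and b are free
    parameters; the values shown are common defaults."""
    N = len(doc_lengths)
    avgdl = sum(doc_lengths.values()) / N
    dl = doc_lengths[doc_id]
    score = 0.0
    for term in query_terms:
        postings = index.get(term, {})
        df = len(postings)                    # documents containing the term
        tf = len(postings.get(doc_id, []))    # occurrences in this document
        if tf == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)   # smoothed IDF
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

print(bm25(["web", "search"], 2, index, lengths))
```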

Page 12: Search

Evaluation

• User queries
• Relevance judgements
  – by humans
  – yes-no or multilevel

• Evaluation measures
  – How to evaluate a ranking?
  – Only the top end matters
  – Various different measures in use

• Public bake-offs
  – TREC etc.
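
Two measures that emphasise the top end of a ranking, shown here as illustrations (the slide does not single out particular measures), are precision at k and average precision, both computed against the human relevance judgements:

```python
def precision_at_k(ranked_doc_ids, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_doc_ids[:k] if d in relevant) / k

def average_precision(ranked_doc_ids, relevant):
    """Mean of the precision values at each rank where a relevant document
    appears; rewards placing relevant documents near the top."""
    hits, total = 0, 0.0
    for rank_pos, d in enumerate(ranked_doc_ids, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank_pos
    return total / len(relevant) if relevant else 0.0

print(average_precision([3, 1, 7, 2], relevant={1, 2}))   # 0.5
```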

Page 13: Search

Using evaluation data for training

• Task: to optimise a set of parameters
  – E.g. weights of features

• Optimisation is potentially very powerful
  – Can make a huge difference to effectiveness

• But there are challenges…
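
A minimal sketch of such tuning, under stated assumptions: we have a set of judged queries, a ranker exposing a single free weight (a hypothetical stand-in for the real parameters), and the average_precision function from the evaluation sketch above. A simple grid search then picks the weight that maximises mean average precision:

```python
def tune_weight(judged_queries, rank_with_weight, candidate_weights):
    """Grid search over one free parameter.

    judged_queries: list of (query, relevant_doc_ids) with human judgements.
    rank_with_weight: hypothetical function (query, weight) -> ranked doc ids.
    Returns the weight giving the best mean average precision.
    """
    best = (None, -1.0)
    for w in candidate_weights:
        mean_ap = sum(average_precision(rank_with_weight(q, w), rel)
                      for q, rel in judged_queries) / len(judged_queries)
        if mean_ap > best[1]:
            best = (w, mean_ap)
    return best
```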

Page 14: Search

Challenge 1: Optimisation methods

• Training is something of a black art
  – Not easy to write recipes for

• Much work currently on optimisation methods
  – Some of it coming from the machine learning community

Page 15: Search

Challenge 2: a tradeoff

• Many features require many parameters
  – From a machine learning point of view, the more the better

• Many parameters means much training
  – Human relevance judgements are expensive

Page 16: Search

Challenge 3: How specific?

• How much does the environment matter?
  – Different features
    • E.g. characteristics of documents, file types, linkage, statistical properties…
  – Different kinds of queries
    • Or different mixes of the same kinds
  – Different factors affecting relevance
  – Access constraints
  – …

Page 17: Search

Challenge 3: How specific?

• And if it does matter…

How to train for the specific environment?
  – Web search: huge training effort
  – Enterprise: some might be feasible
  – Desktop: unlikely
  – Within-site / specialist databases: some might be feasible

Page 18: Search

Looking for alternatives

If training is difficult… some other possibilities:
  – Robustness – parameters with stable optima (probably means fewer features)
  – Training tool-kits (but remember the black art)
  – Auto-training – a system that trains itself on the basis of clickthrough (a long-term prospect)
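
One common way such auto-training is framed (a sketch of the widely used "clicked document preferred over skipped document ranked above it" heuristic, not something detailed in the talk) is to convert clickthrough logs into relative preference pairs that a learning-to-rank method could consume:

```python
def preferences_from_clicks(ranked_doc_ids, clicked):
    """Derive preference pairs from one result page: a clicked document
    is treated as preferred over any unclicked document ranked above it.
    Noisy, but obtained at no judging cost."""
    clicked = set(clicked)
    pairs = []
    for i, doc in enumerate(ranked_doc_ids):
        if doc in clicked:
            for skipped in ranked_doc_ids[:i]:
                if skipped not in clicked:
                    pairs.append((doc, skipped))   # doc preferred over skipped
    return pairs

print(preferences_from_clicks([5, 8, 2, 9], clicked=[2]))   # [(2, 5), (2, 8)]
```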

Page 19: Search

A little about Microsoft

• Web search: MSN takes on Google and Yahoo
  – New search engine is closing the gap
  – Some MSRC input

• Enterprise search: MS Search and SharePoint
  – New version is on its way
  – Much MSRC input

• Desktop: also MS Search

Page 20: Search

Final thoughts

• Search has come a long way since the library card catalogue

• … but it is by no means a done deal
• This is a very active field
  – both academically and commercially

I confidently expect that it will change as much in the next 16 years as it has since 1990