Search

Transcript of Search

Page 1: Search

April 2006, Moscow

Search

Stephen Robertson, Microsoft Research Cambridge

Page 2: Search

MSR Cambridge

• Andrew Herbert, Director

• Cambridge Laboratory…

• External Research Office – Stephen Emmott

Page 3: Search

MSR Cambridge

• Systems & Networking – Peter Key
  – Operating Systems
  – Networking
  – Distributed Computing

• Machine Learning & Perception – Christopher Bishop
  – Machine Learning
  – Computer Vision
  – Information Retrieval

Page 4: Search

MSR Cambridge

• Programming Principles & Tools – Luca Cardelli
  – Programming Principles & Tools
  – Security

• Computer-Mediated Living – Ken Wood
  – Human Computer Interaction
  – Ubiquitous Computing
  – Sensors and Devices
  – Integrated Systems

Page 5: Search

Search: a bit of history

People sometimes assume that G**gle invented search… but of course this is false

• Library catalogues
• Scientific abstracts
• Printed indexes
• The 1960s to 80s: Boolean search
• Free text queries and ranking – a long gestation
• The web

Page 6: Search

Web search

• The technology
  – Crawling
  – Indexing
  – Ranking
  – Efficiency and effectiveness

• The business
  – Integrity of search
  – UI, speed
  – Ads
  – Ad ranking
  – Payment for clickthrough

Page 7: Search

Other search environments

• Within-site

• Specialist databases

• Enterprise/intranet

• Desktop

Page 8: Search

How search engines work

• Crawl a lot of documents
• Create a vast index
  – Every word in every document
  – Point to where it occurred
  – Allow documents to inherit additional text
    • From the url
    • From anchors in other documents…
  – Index this as well
• Also gather static information
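
As a rough illustration of the kind of structure this slide describes (a toy sketch, not the layout of any real engine), an inverted index maps every term to the documents and positions where it occurs, and text a document inherits from its URL or from anchors in other documents can simply be appended before indexing:

```python
from collections import defaultdict

def build_index(docs, inherited_text=None):
    """Toy inverted index: term -> {doc_id: [positions]}.

    docs: dict of doc_id -> document text.
    inherited_text: optional dict of doc_id -> extra text the document
    inherits (URL words, anchor text from other documents), appended to
    the document's own text before indexing.
    """
    index = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        extra = (inherited_text or {}).get(doc_id, "")
        tokens = (text + " " + extra).lower().split()
        doc_lengths[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index, doc_lengths

docs = {1: "probabilistic models of information retrieval",
        2: "web search engines crawl and index documents"}
index, lengths = build_index(docs, inherited_text={2: "search engine"})
print(index["search"])   # {2: [1, 7]} -- includes the inherited occurrence
```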

Page 9: Search

How search engines work

Given a query:
  – Look up each query word in the index
  – Throw all this information at the ranker

Ranker: a computing engine which calculates a score for each document, and identifies the top n scoring documents

Score depends on a whole variety of features, and may include static information
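
A minimal sketch of that query-time flow, reusing the toy index above; a placeholder score (raw term frequency plus an optional static, query-independent component) stands in for whatever a real ranker computes:

```python
import heapq

def rank(query, index, static_score=None, n=10):
    """Look up each query term, accumulate a score per document,
    add any static (query-independent) score, return the top n."""
    scores = {}
    for term in query.lower().split():
        for doc_id, positions in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + len(positions)  # placeholder feature
    for doc_id in scores:
        scores[doc_id] += (static_score or {}).get(doc_id, 0.0)
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

print(rank("web search", index))   # [(2, 3.0)]
```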

Page 10: Search

A core challenge: ranking

• What features might be useful?
  – Features of the query-document pair
  – Features of the document
  – Maybe features of the query
  – Simple / transformed / compound

• Combining features
  – Formulae
  – Weights and other free parameters
  – Tuning / training / learning
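
The simplest combining formula, shown here purely as an illustration, is a weighted sum of features, with the weights as the free parameters to be tuned or learned (real rankers may combine features in more elaborate ways):

```python
def combined_score(features, weights):
    """Weighted linear combination of ranking features for one
    query-document pair; the weights are free parameters."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Hypothetical feature values and weights for one query-document pair:
features = {"text_match": 12.3, "title_match": 1.0, "static_rank": 0.4}
weights  = {"text_match": 1.0, "title_match": 2.5, "static_rank": 3.0}
print(combined_score(features, weights))   # 16.0
```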

Page 11: Search

Ranking algorithms

• Based on probabilistic models
  – we are trying to predict relevance

• … plus a little linguistic analysis
  – but this is secondary to the statistics

• … plus a great deal of know-how, experience, experiment

• Need:
  – Evidence from all possible sources
  – … combined appropriately
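
One concrete example of a ranking function derived from a probabilistic relevance model is BM25; the sketch below reuses the toy index from earlier and conventional default parameter values (the slides do not prescribe any particular formula):

```python
import math

def bm25(query_terms, doc_id, index, doc_lengths, k1=1.2, b=0.75):
    """BM25 score for one query-document pair. k1 and b are free
    parameters; the values shown are common defaults."""
    N = len(doc_lengths)
    avgdl = sum(doc_lengths.values()) / N
    dl = doc_lengths[doc_id]
    score = 0.0
    for term in query_terms:
        postings = index.get(term, {})
        df = len(postings)                    # documents containing the term
        tf = len(postings.get(doc_id, []))    # occurrences in this document
        if tf == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)   # smoothed IDF
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

print(bm25(["web", "search"], 2, index, lengths))
```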

Page 12: Search

Evaluation

• User queries
• Relevance judgements
  – by humans
  – yes-no or multilevel

• Evaluation measures
  – How to evaluate a ranking?
  – Only the top end matters
  – Various different measures in use

• Public bake-offs
  – TREC etc.
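
Two measures that emphasise the top end of a ranking, shown here as illustrations (the slide does not single out particular measures), are precision at k and average precision, both computed against the human relevance judgements:

```python
def precision_at_k(ranked_doc_ids, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_doc_ids[:k] if d in relevant) / k

def average_precision(ranked_doc_ids, relevant):
    """Mean of the precision values at each rank where a relevant document
    appears; rewards placing relevant documents near the top."""
    hits, total = 0, 0.0
    for rank_pos, d in enumerate(ranked_doc_ids, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank_pos
    return total / len(relevant) if relevant else 0.0

print(average_precision([3, 1, 7, 2], relevant={1, 2}))   # 0.5
```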

Page 13: Search

Using evaluation data for training

• Task: to optimise a set of parameters
  – E.g. weights of features

• Optimisation is potentially very powerful
  – Can make a huge difference to effectiveness

• But there are challenges…
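
A minimal sketch of such tuning, under stated assumptions: we have a set of judged queries, a ranker exposing a single free weight (a hypothetical stand-in for the real parameters), and the average_precision function from the evaluation sketch above. A simple grid search then picks the weight that maximises mean average precision:

```python
def tune_weight(judged_queries, rank_with_weight, candidate_weights):
    """Grid search over one free parameter.

    judged_queries: list of (query, relevant_doc_ids) with human judgements.
    rank_with_weight: hypothetical function (query, weight) -> ranked doc ids.
    Returns the weight giving the best mean average precision.
    """
    best = (None, -1.0)
    for w in candidate_weights:
        mean_ap = sum(average_precision(rank_with_weight(q, w), rel)
                      for q, rel in judged_queries) / len(judged_queries)
        if mean_ap > best[1]:
            best = (w, mean_ap)
    return best
```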

Page 14: Search

Challenge 1: Optimisation methods

• Training is something of a black art
  – Not easy to write recipes for

• Much work currently on optimisation methods
  – Some of it coming from the machine learning community

Page 15: Search

Challenge 2: a tradeoff

• Many features require many parameters
  – From a machine learning point of view, the more the better

• Many parameters means much training
  – Human relevance judgements are expensive

Page 16: Search

Challenge 3: How specific?

• How much does the environment matter?
  – Different features
    • E.g. characteristics of documents, file types, linkage, statistical properties…
  – Different kinds of queries
    • Or different mixes of the same kinds
  – Different factors affecting relevance
  – Access constraints
  – …

Page 17: Search

Challenge 3: How specific?

• And if it does matter…

How to train for the specific environment?
  – Web search: huge training effort
  – Enterprise: some might be feasible
  – Desktop: unlikely
  – Within-site / specialist databases: some might be feasible

Page 18: Search

Looking for alternatives

If training is difficult… some other possibilities:
  – Robustness – parameters with stable optima (probably means fewer features)
  – Training tool-kits (but remember the black art)
  – Auto-training – a system that trains itself on the basis of clickthrough (a long-term prospect)
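
One common way such auto-training is framed (a sketch of the widely used "clicked document preferred over skipped document ranked above it" heuristic, not something detailed in the talk) is to convert clickthrough logs into relative preference pairs that a learning-to-rank method could consume:

```python
def preferences_from_clicks(ranked_doc_ids, clicked):
    """Derive preference pairs from one result page: a clicked document
    is treated as preferred over any unclicked document ranked above it.
    Noisy, but obtained at no judging cost."""
    clicked = set(clicked)
    pairs = []
    for i, doc in enumerate(ranked_doc_ids):
        if doc in clicked:
            for skipped in ranked_doc_ids[:i]:
                if skipped not in clicked:
                    pairs.append((doc, skipped))   # doc preferred over skipped
    return pairs

print(preferences_from_clicks([5, 8, 2, 9], clicked=[2]))   # [(2, 5), (2, 8)]
```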

Page 19: Search

A little about Microsoft

• Web search: MSN takes on Google and Yahoo
  – New search engine is closing the gap
  – Some MSRC input

• Enterprise search: MS Search and SharePoint
  – New version is on its way
  – Much MSRC input

• Desktop: also MS Search

Page 20: Search

Final thoughts

• Search has come a long way since the library card catalogue

• … but it is by no means a done deal
• This is a very active field
  – both academically and commercially

I confidently expect that it will change as much in the next 16 years as it has since 1990