Search
Stephen Robertson, Microsoft Research Cambridge
April 2006, Moscow
MSR Cambridge
• Andrew Herbert, Director
• Cambridge Laboratory…
• External Research Office – Stephen Emmott
MSR Cambridge
• Systems & Networking – Peter Key
  – Operating Systems
  – Networking
  – Distributed Computing
• Machine Learning & Perception – Christopher Bishop
  – Machine Learning
  – Computer Vision
  – Information Retrieval
MSR Cambridge
• Programming Principles & Tools – Luca Cardelli
  – Programming Principles & Tools
  – Security
• Computer-Mediated Living – Ken Wood
  – Human Computer Interaction
  – Ubiquitous Computing
  – Sensors and Devices
  – Integrated Systems
Search: a bit of history
People sometimes assume that G**gle invented search… but of course this is false
• Library catalogues
• Scientific abstracts
• Printed indexes
• The 1960s to 80s: Boolean search
• Free text queries and ranking – a long gestation
• The web
Web search
• The technology
  – Crawling
  – Indexing
  – Ranking
  – Efficiency and effectiveness
• The business
  – Integrity of search
  – UI, speed
  – Ads
  – Ad ranking
  – Payment for clickthrough
Other search environments
• Within-site
• Specialist databases
• Enterprise/intranet
• Desktop
How search engines work
• Crawl a lot of documents
• Create a vast index
  – Every word in every document
  – Point to where it occurred
  – Allow documents to inherit additional text
    • From the URL
    • From anchors in other documents…
  – Index this as well
• Also gather static information
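The indexing step above can be sketched in a few lines of Python. This is a toy illustration, not a description of any real engine's layout: the `docs` mapping and the position-level postings are assumptions, and inherited text (URL words, anchor text from other documents) would simply be appended to a document's text before indexing.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: word -> list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Record every word in every document, and where it occurred.
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

docs = {
    1: "search engines crawl and index documents",
    2: "ranking is a core challenge for search",
}
index = build_index(docs)
# index["search"] -> [(1, 0), (2, 6)]
```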
How search engines work
Given a query:
  – Look up each query word in the index
  – Throw all this information at the ranker

Ranker: a computing engine which calculates a score for each document, and identifies the top n scoring documents
Score depends on a whole variety of features, and may include static information
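A minimal sketch of that query path, with a hypothetical word-overlap scorer standing in for the real ranker (which, as the slide says, would combine a whole variety of features, including static ones):

```python
import heapq

# Toy inverted index: word -> doc ids containing it
INDEX = {
    "search": [1, 2, 3],
    "ranking": [2, 3],
    "engine": [1],
}

def score(words, doc_id):
    # Hypothetical scorer: count of query words the document contains.
    return sum(1 for w in words if doc_id in INDEX.get(w, []))

def top_n(query, n=2):
    words = query.lower().split()
    # Look up each query word, pool the candidate documents...
    candidates = {d for w in words for d in INDEX.get(w, [])}
    # ...then hand everything to the "ranker" and keep the top n scores.
    scored = [(score(words, d), d) for d in candidates]
    return heapq.nlargest(n, scored)

# top_n("search ranking") -> [(2, 3), (2, 2)]: docs 2 and 3 match both words
```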
A core challenge: ranking
• What features might be useful?
  – Features of the query–document pair
  – Features of the document
  – Maybe features of the query
  – Simple / transformed / compound
• Combining features
  – Formulae
  – Weights and other free parameters
  – Tuning / training / learning
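A weighted linear combination is the simplest such formula. The feature names and values below are made up for illustration; the weights are exactly the free parameters that tuning or training must set.

```python
def combined_score(features, weights):
    """Combine per-document feature values using tuned weights."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical features for one query-document pair
features = {"term_match": 3.0, "static_rank": 0.4, "url_depth": -2.0}
weights  = {"term_match": 1.0, "static_rank": 5.0, "url_depth": 0.1}
score = combined_score(features, weights)  # 3.0 + 2.0 - 0.2 = 4.8
```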
Ranking algorithms
• Based on probabilistic models
  – we are trying to predict relevance
• … plus a little linguistic analysis
  – but this is secondary to the statistics
• … plus a great deal of know-how, experience, experiment
• Need:
  – Evidence from all possible sources
  – … combined appropriately
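One concrete ranking function in this probabilistic family is BM25 (Okapi), which Robertson co-developed, though the slide does not name it. A sketch of its per-term weight; the parameter values k1 = 1.2 and b = 0.75 are conventional defaults, not prescribed here:

```python
import math

def bm25_weight(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 weight for one term in one document: an idf factor times a
    saturating term-frequency component, normalised by document length."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# All else equal, a rarer term (low df) carries more weight than a common one
rare   = bm25_weight(tf=2, df=10,   doc_len=100, avg_doc_len=100, num_docs=10000)
common = bm25_weight(tf=2, df=5000, doc_len=100, avg_doc_len=100, num_docs=10000)
```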
Evaluation
• User queries
• Relevance judgements
  – by humans
  – yes–no or multilevel
• Evaluation measures
  – How to evaluate a ranking?
  – Only the top end matters
  – Various different measures in use
• Public bake-offs
  – TREC etc.
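Because only the top end of the ranking matters, one of the simplest measures in use is precision at k. A sketch with made-up yes–no judgements:

```python
def precision_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the top-k results judged relevant (yes-no judgements)."""
    return sum(1 for d in ranked_doc_ids[:k] if d in relevant) / k

ranking = [7, 3, 9, 1, 4]   # doc ids in the order the ranker returned them
relevant = {3, 4, 8}        # human "yes" judgements for this query
p3 = precision_at_k(ranking, relevant, 3)  # one of the top 3 is relevant
```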
Using evaluation data for training
• Task: to optimise a set of parameters
  – E.g. weights of features
• Optimisation is potentially very powerful
  – Can make a huge difference to effectiveness
• But there are challenges…
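A minimal picture of that optimisation: sweep a single weight over a grid and keep the value that maximises the evaluation measure on the judged training queries. The quadratic `evaluate` below is purely an illustrative stand-in for "run the ranker, score against the judgements"; in practice it might be mean precision at 10 over the training queries.

```python
def tune(candidates, evaluate):
    """Return the parameter setting that maximises the evaluation measure."""
    return max(candidates, key=evaluate)

# Illustrative stand-in objective: a smooth function peaking at weight = 0.7
evaluate = lambda w: -(w - 0.7) ** 2

best = tune([i / 10 for i in range(11)], evaluate)  # grid 0.0, 0.1, ..., 1.0
# best == 0.7
```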
Challenge 1: Optimisation methods
• Training is something of a black art
  – Not easy to write recipes for
• Much work currently on optimisation methods
  – Some of it coming from the machine learning community
Challenge 2: a tradeoff
• Many features require many parameters
  – From a machine learning point of view, the more the better
• Many parameters means much training
  – Human relevance judgements are expensive
Challenge 3: How specific?
• How much does the environment matter?
  – Different features
    • E.g. characteristics of documents, file types, linkage, statistical properties…
  – Different kinds of queries
    • Or different mixes of the same kinds
  – Different factors affecting relevance
  – Access constraints
  – …
Challenge 3: How specific?
• And if it does matter… how to train for the specific environment?
  – Web search: huge training effort
  – Enterprise: some might be feasible
  – Desktop: unlikely
  – Within-site / specialist databases: some might be feasible
Looking for alternatives
If training is difficult… some other possibilities:
  – Robustness: parameters with stable optima (probably means fewer features)
  – Training tool-kits (but remember the black art)
  – Auto-training: a system that trains itself on the basis of clickthrough (a long-term prospect)
A little about Microsoft
• Web search: MSN takes on Google and Yahoo
  – New search engine is closing the gap
  – Some MSRC input
• Enterprise search: MS Search and SharePoint
  – New version is on its way
  – Much MSRC input
• Desktop: also MS Search
Final thoughts
• Search has come a long way since the library card catalogue
• … but it is by no means a done deal
• This is a very active field
  – both academically and commercially
I confidently expect that it will change as much in the next 16 years as it has since 1990