Query Routing in a Peer-to-Peer Web Search Engine
Speaker: Pavel Serdyukov
Supervisors: Gerhard Weikum, Christian Zimmer, Matthias Bender
International Max Planck Research School for Computer Science
Talk Outline
- Motivation
- Proposed search engine architecture
- Query routing and database selection
- Similarity-based measures. Example: GlOSS
- Document-frequency-based measures. Example: CORI
- Evaluation of the methods
- Proposals
- Conclusion
Problems of present Web search engines
- Size of the indexable Web: the Web is huge, and it is difficult to cover it all
  - timely re-crawls are required
  - technical limits
- The Deep Web
- Monopoly of Google:
  - controls 80% of web search requests
  - paid sites get updated more frequently and get higher ratings
  - sites may be censored by the engine
Make use of Peer-to-Peer technology

Ranking of peer usefulness (richness) per keyword:

  cancer:   Peer 3, Peer 4, Peer 1, Peer 2
  elephant: Peer 1, Peer 2, Peer 4, Peer 3
  computer: Peer 2, Peer 1, Peer 3, Peer 4

The global directory must be shared among the peers!
- Exploit previously unused CPU/memory/disk power
- Provide up-to-date results for small portions of the Web
- Conquer the Deep Web with personalized and specialized web crawlers
[Figure: Chord ring with node IDs 0-7 hosting the distributed global directory]
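Below is a minimal sketch (not from the talk) of how a Chord-style directory could assign a keyword's peer-ranking entry to a node; the ring size and the set of live node IDs are illustrative assumptions based on the figure.

```python
import hashlib

RING_BITS = 3                 # the figure shows an 8-position ring (IDs 0..7)
PEERS = [0, 1, 3, 4, 5, 7]    # assumed live node IDs, sorted

def ring_hash(keyword: str) -> int:
    """Hash a keyword onto the Chord identifier ring."""
    digest = hashlib.sha1(keyword.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)

def successor(key_id: int) -> int:
    """Chord rule: an entry lives on the first peer whose ID is >= the
    key's ID, wrapping around the ring."""
    for peer in PEERS:
        if peer >= key_id:
            return peer
    return PEERS[0]

for kw in ("cancer", "elephant", "computer"):
    key = ring_hash(kw)
    print(f"{kw!r} -> key {key} -> directory entry on peer {successor(key)}")
```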
Query routing
- Goal: find the peers that hold documents relevant to a query
- Known before as the Database Selection Problem
- Not all existing techniques are applicable to P2P
Database Selection Problem
- 1st inference: is this document relevant?
  - Relevance is a subjective user judgment; we can only model it
  - We use only representations of user needs and of documents (keywords, inverted indices)
- 2nd inference: a database can potentially satisfy the query if it
  - has many documents (size-based, naive approach)
  - has many documents containing all query words
  - has a high number of documents with a given similarity to the query
  - has a high summed similarity of its documents to the query
Measuring usefulness
- The number of documents containing all query words is unknown: no full document representations are available, only database summaries (representatives)
- The 3rd inference (usefulness) is built on top of the previous two
- Steps of database selection (a sketch follows below):
  i. Rely on sensible 1st and 2nd inferences
  ii. Choose database representatives for the 3rd inference
  iii. Calculate the usefulness measures
  iv. Choose the most useful databases
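A minimal sketch of steps iii and iv; the function names and data shapes are assumptions, not the talk's implementation:

```python
def select_databases(query, representatives, usefulness, k=5):
    """representatives: db_id -> summary statistics chosen in step ii;
    usefulness: a pluggable measure implementing the 3rd inference.
    Returns the k most useful database IDs (step iv)."""
    scores = {db: usefulness(query, summary)            # step iii
              for db, summary in representatives.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```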
Similarity-based measures
- Definition: usefulness is the sum of the document similarities that exceed a threshold l
- Simplest instance: the summed weight of the query terms across the collection (no assumptions about word co-occurrence, l = 0)
Usefulness(l, q, DB) = \sum_{doc \in DB,\; sim(q, doc) > l} sim(q, doc)
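A direct reading of this definition as code (a sketch; the sim function and the document representation are assumptions):

```python
def similarity_usefulness(l, query, database, sim):
    """Sum the query-document similarities that exceed the threshold l.
    sim(query, doc) -> float; database is an iterable of documents."""
    return sum(s for doc in database if (s := sim(query, doc)) > l)
```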
GlOSS
- High-correlation assumption: sort the n query terms T_i in descending order of their DFs; then
  - DF_n documents contain T_n, T_{n-1}, ..., T_1
  - DF_{n-1} - DF_n documents contain T_{n-1}, T_{n-2}, ..., T_1
  - ...
  - DF_1 - DF_2 documents contain T_1 only
- Use averaged term weights to calculate the document similarities (a sketch follows below)
- For l > 0: l is query dependent and collection dependent, usually because the local IDFs differ. Proposal: use a global measure of term importance
- Usually l is set to 0 in experiments
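The sketch below estimates the summed-similarity usefulness GlOSS-style under the high-correlation assumption; the per-term statistics (DF and average weight) come from the database summary, and the names are assumptions:

```python
def gloss_sum(l, terms, df, avg_w):
    """df[t]: document frequency of term t in the database;
    avg_w[t]: average weight of t over the documents containing it."""
    # Sort by descending DF: documents holding a rarer term are assumed
    # to also hold every more frequent query term.
    terms = sorted(terms, key=lambda t: df[t], reverse=True)
    # sims[i]: estimated similarity of documents containing terms[0..i]
    sims, prefix = [], 0.0
    for t in terms:
        prefix += avg_w[t]
        sims.append(prefix)
    estimate = 0.0
    for i, t in enumerate(terms):
        # documents containing terms[0..i] but not terms[i+1]
        bucket = df[t] - (df[terms[i + 1]] if i + 1 < len(terms) else 0)
        if sims[i] > l:
            estimate += bucket * sims[i]
    return estimate
```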
Problems of similarity-based measures
- Is this inference good?
- A few highly scored documents and many low-scored documents are regarded as equal. Proposal: sum only the first K similarities
- Highly scored documents can be a bad indicator of usefulness:
  - most of the relevant documents have moderate scores
  - highly scored documents can be non-relevant
Document-frequency-based measures
- Do not use term frequencies (actual similarities); exploit document frequencies only
- Exploit a global measure of term importance:
  - average IDF
  - ICF (inverse collection frequency): ICF = \log \frac{|C|}{CF}
- Main assumption: many documents with rare terms
  - have more meaning for the user
  - most likely contain the other query terms
CORI: using TFIDF normalization

Usefulness(q, DB) = \sum_{t \in q} \widetilde{DF}_t \cdot \widetilde{ICF}_t

\widetilde{DF} = 0.4 + 0.6 \cdot \frac{\log(DF + 0.5)}{\log(DF_{MAX} + 1.0)}

\widetilde{ICF} = \frac{\log \frac{|C| + 0.5}{CF}}{\log(|C| + 1.0)}

- DF: document frequency of the query term
- DF_MAX: maximum document frequency among all terms in the collection
- CF: number of collections containing the query term
- |C|: number of collections in the system
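A sketch of this scoring in code. The combination step (averaging 0.4 + 0.6 * ~DF * ~ICF over the query terms) is the standard CORI belief form; it is an assumption here, but it reproduces the numbers in the example a few slides below:

```python
from math import log

def norm_df(df, df_max):
    return 0.4 + 0.6 * log(df + 0.5) / log(df_max + 1.0)

def norm_icf(cf, n_collections):
    return log((n_collections + 0.5) / cf) / log(n_collections + 1.0)

def cori_score(terms, stats, cf, n_collections):
    """stats[t] = (DF, DFMAX) for this database; cf[t] = number of
    collections containing term t."""
    beliefs = [0.4 + 0.6 * norm_df(*stats[t]) * norm_icf(cf[t], n_collections)
               for t in terms]
    return sum(beliefs) / len(beliefs)
```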
CORI issues
- Pure document frequencies make CORI better:
  - the less statistics, the simpler the method
  - smaller variance
  - it better estimates the ranking, not actual database summaries
- No use of document richness
- To be normalized or not to be?
  - Small databases are not necessarily better
  - A collection may specialize well in several topics
Using the usefulness measures: an example
Query: "Information Retrieval". Global statistics: CF(Information) = 120, CF(Retrieval) = 40, |C| = 1000.

Per-peer statistics for "Information":

  Peer    DF   avg_tf   DFmax
  Peer1   20   12        60
  Peer2   60    6       400
  Peer3   20   15        60

Per-peer statistics for "Retrieval":

  Peer    DF   avg_tf   DFmax
  Peer1    5    8        60
  Peer2   10    4       400
  Peer3    5   10        60

Resulting rankings:
  CORI:  Peer3 (0.5681), Peer1 (0.5681), Peer2 (0.5634)
  GlOSS: Peer2 (845), Peer3 (784), Peer1 (627)
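Feeding the tables into the CORI sketch from the earlier slide reproduces the CORI ranking up to rounding (the avg_tf column is not used by CORI):

```python
from math import log

def norm_df(df, df_max):
    return 0.4 + 0.6 * log(df + 0.5) / log(df_max + 1.0)

def norm_icf(cf, n_coll):
    return log((n_coll + 0.5) / cf) / log(n_coll + 1.0)

N_COLL = 1000
CF = {"information": 120, "retrieval": 40}
PEERS = {  # per peer: term -> (DF, DFMAX), taken from the tables above
    "Peer1": {"information": (20, 60),  "retrieval": (5, 60)},
    "Peer2": {"information": (60, 400), "retrieval": (10, 400)},
    "Peer3": {"information": (20, 60),  "retrieval": (5, 60)},
}

for peer, stats in PEERS.items():
    beliefs = [0.4 + 0.6 * norm_df(*stats[t]) * norm_icf(CF[t], N_COLL)
               for t in CF]
    print(peer, round(sum(beliefs) / len(beliefs), 4))
# -> Peer1 0.5681, Peer2 0.5635, Peer3 0.5681
#    (the slide shows 0.5634 for Peer2, i.e. truncated rather than rounded)
```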
Analysis of the experiments
- CORI is the best, but:
  - only when choosing more than 50 out of 236 databases
  - only 10% better when choosing more than 90 databases
- The test collections are strange:
  - documents were separated chronologically or even randomly
  - no topic specificity
  - no actual Web data was used
  - no overlap among the collections
- The experiments are unrealistic, so it is unclear:
  - which method is better
  - whether any method is satisfactory at all
Possible solutions
- Most of the measures can be unified in one framework (a parameterized sketch follows below):

  Usefulness(q, DB) = \sum_{t \in q} average\_TF_t \cdot DF_t \cdot Importance_t

  (GlOSS and CORI are both instances of this form)
- We can play with the framework and try:
  - various normalization schemes
  - different notions of term importance (ICF, local IDF)
  - using statistics of the top documents only
  - changing the power of the factors: DF \cdot ICF^4 is not worse than CORI
  - changing the form of the expression
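A parameterized sketch of this framework (the names and knobs are assumptions); each factor can be switched off or re-weighted to obtain GlOSS-like, CORI-like, or new variants such as DF * ICF^4:

```python
def framework_usefulness(terms, avg_tf, df, importance,
                         use_avg_tf=True, importance_power=1.0):
    """avg_tf, df, importance: per-term statistics of one database.
    GlOSS-like: use_avg_tf=True, importance = local IDF.
    CORI-like: use_avg_tf=False, normalized df, importance = ICF.
    DF * ICF^4 variant: use_avg_tf=False, importance_power=4."""
    score = 0.0
    for t in terms:
        tf_factor = avg_tf[t] if use_avg_tf else 1.0
        score += tf_factor * df[t] * importance[t] ** importance_power
    return score
```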
Conclusion
- What has been done:
  - the measures were analytically evaluated
  - a sensible subset of the measures was chosen
  - the measures were implemented
- What could be done next:
  - carry out new, sensible experiments
  - choose an appropriate usefulness measure
  - experiment with database representatives
  - build our own measure
  - try to exploit collection metadata: bookmarks, authoritative documents, collection descriptions
Thank you for your attention!