IR Techniques For P2P Networks
Information Retrieval Techniques For Peer-To-Peer Networks
Demetrios Zeinalipour-Yazti, Vana Kalogeraki and Dimitrios Gunopulos
Presented By Ranjan Dash
Layout
- Introduction
- P2P Network IR Techniques
- PeerWare Infrastructure and experiments
Introduction
- Major challenge: efficiently searching the content of other peers
- Definition: a large number of peers collaborate dynamically, in an ad hoc manner, and share information in large-scale distributed environments without centralized coordination
- P2P environment characteristics
  - Each peer has a database or collection of documents
  - A query contains a set of keywords
  - A reply message contains pointers to matching documents
- Different from static data environments
  - No central repository
  - Nodes join and leave dynamically, in an ad hoc manner
P2P Network IR Techniques
- Breadth-First Search (BFS)
- Random Breadth-First Search (RBFS)
- Intelligent Search Mechanism (ISM)
- Directed BFS and >RES
- Random-Walker Searches
- Randomized Gossiping
- Local Routing Indices
- Centralized Approaches
- Searching Object Identifiers
- Distributed IR
P2P Network IR Techniques
Breadth-First Search (BFS)
- Widely used in file-sharing systems
- A peer propagates the query to all its neighbors except the sender
- The QueryHit message (number of documents, bandwidth information) travels back along the same path
- Simple, and guarantees a high hit rate
- Poor performance and network utilization
- A low-bandwidth node becomes a bottleneck
- Can be improved with a time-to-live (TTL) limit
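As a rough illustration of flooding with a TTL, the sketch below counts how many peers a query reaches and how many messages it costs. The adjacency-list graph, the peer names and the function `bfs_flood` are assumptions made for this example, not part of any real protocol implementation.

```python
from collections import deque


def bfs_flood(graph, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent)."""
    reached = {source}
    messages = 0
    queue = deque([(source, None, ttl)])  # (peer, sender, remaining hops)
    while queue:
        peer, sender, hops = queue.popleft()
        if hops == 0:
            continue  # TTL expired: this peer does not forward further
        for neighbor in graph[peer]:
            if neighbor == sender:
                continue  # never send the query back to the sender
            messages += 1
            if neighbor not in reached:  # a duplicate query is dropped on arrival
                reached.add(neighbor)
                queue.append((neighbor, peer, hops - 1))
    return reached, messages


# Hypothetical five-peer overlay.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
```

Raising the TTL from 2 to 3 on this toy graph reaches one more peer at the cost of two more messages, which is the trade-off the TTL controls.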
P2P Network IR Techniques
Random Breadth-First Search (RBFS)
- Dramatic improvement over BFS
- A peer forwards the query only to a fraction of its peers, selected at random
- Needs no global knowledge; decisions are local, and therefore fast
- Probabilistic: the query might not reach some large network segments
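The forwarding decision of RBFS can be sketched in a few lines. The function name `rbfs_targets`, the default fraction of 0.5 and the rounding rule are assumptions chosen for illustration.

```python
import math
import random


def rbfs_targets(neighbors, sender, fraction=0.5, rng=random):
    """Pick a random fraction of the neighbors (excluding the sender)."""
    candidates = [n for n in neighbors if n != sender]
    if not candidates:
        return []
    # Assumption: round up and always forward to at least one peer.
    count = max(1, math.ceil(fraction * len(candidates)))
    return rng.sample(candidates, count)


# Usage: a peer with four neighbors received the query from "B".
targets = rbfs_targets(["B", "C", "D", "E"], sender="B", rng=random.Random(0))
```

The decision uses only the peer's own neighbor list, which is why no global knowledge is needed.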
P2P Network IR Techniques
Intelligent Search Mechanism (ISM)
- Quick and efficient; minimizes communication costs
- Propagates the query only to the peers most likely to reply
- Consists of two components that run in each peer
  - Profile mechanism
  - Relevance rank
- Works well when queries exhibit locality
- Always forwards to the same neighbors, which starves new peers
- Solution: add a small random subset of peers to the most relevant set
P2P Network IR Techniques
Profile mechanism
- Builds a profile for each neighboring peer
- Maintains the T most recent queries and QueryHits, together with the number of results
- Uses a least-recently-used (LRU) replacement policy to keep the most recent queries
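A per-neighbor profile with LRU replacement might look like the sketch below. The class `NeighborProfile` and its method names are hypothetical; the capacity plays the role of T.

```python
from collections import OrderedDict


class NeighborProfile:
    """Keeps the T most recent queries a neighbor answered (assumed layout)."""

    def __init__(self, capacity):
        self.capacity = capacity          # T in the slide
        self.entries = OrderedDict()      # query text -> number of results

    def record_hit(self, query, num_results):
        if query in self.entries:
            self.entries.move_to_end(query)   # refresh recency of a known query
        self.entries[query] = num_results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used query


profile = NeighborProfile(capacity=2)
profile.record_hit("a", 1)
profile.record_hit("b", 2)
profile.record_hit("a", 5)   # "a" becomes the most recent entry
profile.record_hit("c", 3)   # evicts "b", the least recently used
```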
P2P Network IR Techniques
Relevance rank
- Ranking of neighbors to decide which ones should receive a query
- Ranking of a peer ‘Pi’ for a query ‘q’
- Qsim is the cosine similarity between two queries
- α = 0: only the number of results returned in the past matters, as in >RES
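The ranking formula itself is not in the slide text. The sketch below assumes that a peer's rank is the sum, over the queries it answered, of Qsim raised to an exponent `alpha` and multiplied by the number of results returned. The whitespace tokenization and the function names are also assumptions.

```python
import math
from collections import Counter


def qsim(query_a, query_b):
    """Cosine similarity between two keyword queries (term-count vectors)."""
    a = Counter(query_a.lower().split())
    b = Counter(query_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def relevance_rank(profile, query, alpha=1.0):
    """Assumed rank: sum of Qsim(past, query)**alpha * results(past)."""
    return sum(qsim(past, query) ** alpha * hits
               for past, hits in profile.items())


# Hypothetical profile: past query -> number of results the peer returned.
profile = {"p2p search": 4, "jazz music": 10}
```

With `alpha=0` every similarity term becomes 1 (Python evaluates `0.0 ** 0` as 1.0), so the rank reduces to the total number of past results, which matches the >RES behavior noted above.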
P2P Network IR Techniques
Directed BFS and >RES
- A peer forwards a query to a subset of its peers, chosen from aggregated statistics
- >RES sends the query to the ‘k’ peers that returned the most results for the last ‘m’ queries
- For ‘k’ = 1, ‘m’ = 10, the BFS turns into a DFS
- Similar to ISM, but simpler
- Does not explore nodes that contain content related to the query
- Performs well because it routes queries to larger network segments
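The >RES selection rule can be sketched as a bounded history per neighbor. The class `ResultsHistory`, its method names and the alphabetical tie-break are assumptions made for this example.

```python
from collections import deque


class ResultsHistory:
    """Tracks result counts for the last m queries sent to each neighbor."""

    def __init__(self, m):
        self.m = m
        self.history = {}  # neighbor -> deque of the last m result counts

    def record(self, neighbor, num_results):
        counts = self.history.setdefault(neighbor, deque(maxlen=self.m))
        counts.append(num_results)  # the oldest count falls off automatically

    def top_k(self, k):
        totals = {n: sum(c) for n, c in self.history.items()}
        # Most results first; ties broken alphabetically (an assumption).
        return sorted(totals, key=lambda n: (-totals[n], n))[:k]


stats = ResultsHistory(m=2)
for neighbor, results in [("B", 5), ("B", 0), ("B", 0),
                          ("C", 1), ("C", 1), ("D", 3)]:
    stats.record(neighbor, results)
```

Here B's early burst of five results has aged out of the two-query window, so D and C are preferred.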
P2P Network IR Techniques
Random-Walker Searches
- Each node randomly forwards a query message, called a walker, to one of its peers
- Can be extended from 1 walker to k walkers
- Resembles RBFS, but the number of messages increases linearly
- Like RBFS, does not use the most relevant content to guide the query
- Adaptive Probabilistic Search (APS) is similar: it uses feedback from previous searches to probabilistically guide future walkers
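A k-walker search can be sketched as k independent random walks. The step limit, the rule that a walker stops at its first match, and the function name `k_walkers` are assumptions for illustration.

```python
import random


def k_walkers(graph, source, has_match, k=2, max_steps=10, rng=random):
    """Launch k random walkers; return (peers with matches, messages sent)."""
    found, messages = set(), 0
    for _ in range(k):
        current = source
        for _ in range(max_steps):
            neighbors = graph[current]
            if not neighbors:
                break  # dead end: the walker cannot move on
            current = rng.choice(neighbors)  # forward to one random peer
            messages += 1
            if has_match(current):
                found.add(current)
                break  # assumption: a walker stops at its first match
    return found, messages


# Hypothetical chain A -> B -> C where only C holds a matching document.
chain = {"A": ["B"], "B": ["C"], "C": []}
found, messages = k_walkers(chain, "A", lambda peer: peer == "C", k=3)
```

The message count is bounded by `k * max_steps`, so it grows linearly with the number of walkers.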
P2P Network IR Techniques
Randomized Gossiping (PlanetP)
- Each node builds its part of a global inverted index and summarizes its local index in a Bloom filter
- Each node propagates its Bloom filter to the rest of the network through gossiping
- Advantages of the Bloom filter
  - Smaller messages
  - Savings in network I/O
- PlanetP has a scalability problem
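To show why a Bloom filter keeps gossip messages small, here is a minimal sketch. It is not PlanetP's implementation; the filter size, the number of hash functions and the SHA-256-based hashing are assumptions.

```python
import hashlib


class BloomFilter:
    """Compact, lossy summary of the terms in a peer's local index."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


summary = BloomFilter()
for term in ["peer", "gossip"]:
    summary.add(term)
```

A peer gossips only the fixed-size bit array instead of its full term list. The price is occasional false positives, never false negatives.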
P2P Network IR Techniques
Local Routing Indices (Arturo Crespo and Hector Garcia-Molina)
- Hybrid technique that uses local indices containing the “direction” toward the documents
- Three techniques
  - Compound routing index (CRI): a node q maintains statistics for each neighbor that indicate how many documents are reachable through that neighbor
  - Hop-count routing index (HRI): a CRI for each of k hops; the storage cost is prohibitive for large k
  - Exponentially aggregated index (ERI): addresses the storage issue of HRI by aggregating the HRI with a cost formula
- Good for topologies where only a few nodes have very large numbers of neighbors (tree, tree with cycles)
- The routing indices are similar to the routing tables deployed in the Bellman–Ford algorithm
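A compound routing index can be pictured as a small table of document counts per neighbor. The per-topic breakdown, the sample numbers and the function `rank_neighbors` below are assumptions for illustration; the slides only state that the index records how many documents are reachable through each neighbor.

```python
def rank_neighbors(routing_index, topic):
    """Order neighbors by how many documents on `topic` they lead to."""
    return sorted(
        routing_index,
        key=lambda n: (-routing_index[n].get(topic, 0), n),
    )


# Hypothetical CRI at one node: neighbor -> {topic: reachable documents}.
routing_index = {
    "B": {"databases": 20, "networks": 0},
    "C": {"databases": 5, "networks": 30},
}
```

A query about networks would be forwarded to C first, and one about databases to B first.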
P2P Network IR Techniques
Centralized Approaches
- Maintain an inverted index over all the documents in the participating hosts’ collections (Google, Yahoo, Napster)
- Each joining peer A uploads an index of all its shared documents to the central repository R
- A querying node B searches A’s documents through R
- B can then communicate with A directly, using an out-of-band protocol such as HTTP
- Kazaa differs from the rest: it uses a set of more powerful peers that act as central repositories
- Simple, robust, short search time, guaranteed to find all results
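The upload-then-search flow through a central repository can be sketched with an in-memory inverted index. The class `CentralRepository`, the whitespace tokenization and the AND semantics for multi-word queries are assumptions.

```python
from collections import defaultdict


class CentralRepository:
    """Repository R: maps each keyword to the (peer, document) pairs holding it."""

    def __init__(self):
        self.index = defaultdict(set)

    def upload(self, peer, documents):
        """A joining peer uploads an index of its shared documents."""
        for name, text in documents.items():
            for word in text.lower().split():
                self.index[word].add((peer, name))

    def search(self, query):
        """Return the pairs matching every keyword of the query."""
        words = query.lower().split()
        if not words:
            return set()
        results = set(self.index.get(words[0], set()))
        for word in words[1:]:
            results &= self.index.get(word, set())
        return results


repo = CentralRepository()
repo.upload("A", {"a.txt": "peer to peer search"})
repo.upload("C", {"c.txt": "web search"})
```

The result tells the querying node which peer to contact; the transfer itself would then happen directly between the peers.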
P2P Network IR Techniques
Searching Object Identifiers
- Distributed file indexing systems: Chord, OceanStore, Content-Addressable Network (CAN), Freenet
- Search efficiently using object identifiers (a hash code of the file name) rather than keywords
- Perform object lookup operations to get the address (an IP address) of the node that stores the object
- Optimize object retrieval by minimizing the number of messages and hops required
- Disadvantage: they can search only for object identifiers and thus can’t capture the relevance of a document
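The idea of locating an object by the hash of its name can be sketched with a simplified identifier ring. This is loosely modeled on consistent hashing and omits the routing structures real systems such as Chord use; the 16-bit ring size and the function names are assumptions.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 16  # assumed identifier space of 2**16 positions


def ring_id(name):
    """Object identifier: a hash code of the file name."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)


def successor(node_ids, key):
    """First node whose identifier is >= key, wrapping around the ring."""
    ring = sorted(node_ids)
    return ring[bisect_left(ring, key) % len(ring)]


def responsible_node(node_ids, object_name):
    return successor(node_ids, ring_id(object_name))


nodes = [10, 200, 3000]
owner = responsible_node(nodes, "song.mp3")
```

Because the lookup key is a hash of the exact name, a query for a slightly different name or for a keyword maps to an unrelated position, which is the relevance limitation noted above.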
P2P Network IR Techniques
Distributed IR
- With distributed databases, the main IR problem is deciding which databases are most likely to contain the most relevant documents
- Good results are achievable for conceptually separated collections
- However, this assumes that the querying party has some statistical knowledge about each database’s contents (word frequencies in documents) and therefore a global view of the system
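Database selection from word-frequency statistics can be illustrated with a deliberately naive score: the sum of the query terms' frequencies in each database. This is not a specific published selection method, and the statistics and names below are made up for the example.

```python
def rank_databases(term_stats, query):
    """Order databases by the summed frequency of the query's words."""
    words = query.lower().split()
    scores = {
        db: sum(stats.get(word, 0) for word in words)
        for db, stats in term_stats.items()
    }
    return sorted(scores, key=lambda db: (-scores[db], db))


# Hypothetical global statistics: database -> {word: frequency}.
term_stats = {
    "medical": {"virus": 40, "network": 2},
    "computing": {"virus": 10, "network": 90},
}
```

The function needs `term_stats` for every database up front, which is exactly the global view that a P2P setting lacks.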
PeerWare Infrastructure and experiments
Evaluation metrics
- Recall rate: the fraction of documents each of the search mechanisms retrieves
- Efficiency: the number of messages needed to find the results
Implemented algorithms
- Only algorithms that require local knowledge when searching for documents
- BFS (the baseline)
- RBFS, >RES (k = 0.5 * d and m = 100, where d is the degree of a node), and ISM
- These three techniques forward query messages to half the neighbors that BFS contacts
- >RES and ISM use previous knowledge to decide which peers to forward the query to
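The two metrics can be computed as below. The sketch assumes that recall is measured against the documents the BFS baseline finds and that efficiency is reported as a message count relative to BFS; the function names and sample values are illustrative.

```python
def recall_rate(retrieved, baseline):
    """Fraction of the baseline's documents that a mechanism retrieved."""
    baseline = set(baseline)
    if not baseline:
        return 0.0
    return len(set(retrieved) & baseline) / len(baseline)


def message_ratio(messages, baseline_messages):
    """Messages used, relative to the baseline (lower is cheaper)."""
    return messages / baseline_messages


# Made-up example: a mechanism finds 3 of the 4 baseline documents.
recall = recall_rate({"d1", "d2", "d3"}, {"d1", "d2", "d3", "d4"})
```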
PeerWare Infrastructure and experiments
- BFS requires almost 2.5 times as many messages as its competitors.
PeerWare Infrastructure and experiments
- ISM found the most documents.
- ISM achieved almost a 90-percent recall rate while using only 38 percent of the messages BFS required.
- ISM improves its knowledge over time.
- Both >RES and ISM started out with a low recall rate (around 40 to 50 percent) because they initially choose their neighbors at random.