D. Zeinalipour et al. (UCR) “A Local Search Mechanism for Peer-to-Peer Networks” Vana...

26
D. Zeinalipour et al. (UCR) “A Local Search Mechanism for Peer-to- Peer Networks” Vana Kalogeraki, Dimitrios Gunopulos & Demetris Zeinalipour (University of California – Riverside) < [email protected] , [email protected] , [email protected] > http://www.cs.ucr.edu/~csyiazti/publications.html CIKM 2002 – Eleventh International Conference on Information and Knowledge Management November 4-9, Mclean VA
  • date post

    20-Dec-2015
  • Category

    Documents

  • view

    222
  • download

    0

Transcript of D. Zeinalipour et al. (UCR) “A Local Search Mechanism for Peer-to-Peer Networks” Vana...

D. Zeinalipour et al. (UCR)

“A Local Search Mechanism for Peer-to-Peer Networks”

Vana Kalogeraki, Dimitrios Gunopulos

& Demetris Zeinalipour (University of California – Riverside)

< [email protected], [email protected], [email protected] >

http://www.cs.ucr.edu/~csyiazti/publications.html

CIKM 2002 – Eleventh International Conference on Information and Knowledge

Management

November 4-9, Mclean VA

D. Zeinalipour et al. (UCR)

Presentation Outline

1. Introduction: Information Retrieval (I.R) in Peer-to-Peer networks.

2. Techniques for Distributed I.R.i. Breadth-First Search.ii. Random Breadth-First Search.iii. Intelligent Search with profiling.

3. Experimental Evaluation.4. Related Work.5. Conclusions & Future Work.

D. Zeinalipour et al. (UCR)

Introduction to Peer-to-Peer• Peer-to-Peer Computing definition:

“Sharing of computer resources and information through direct exchange”

The physical topologyThe virtual P2P topology

• Clients (downloaders) are also servers

• Clients may join or leave the network at any time => highly fault-tolerant but with a cost!

• Searches are done within the virtual network while actual downloads are done offline (with HTTP).

D. Zeinalipour et al. (UCR)

Introduction to Peer-to-Peer• Peer-to-Peer (P2P) systems are increasingly

becoming popular.

• P2P file-sharing systems, such as Gnutella, Napster and Freenet realized a distributed infrastructure for sharing files.

• Traditionally, files were shared using the Client-Server model (e.g. http). Not scalable since they are centralized services.

• P2P uncover new advantages in simplicity of use, robustness, self organization and scalability.

D. Zeinalipour et al. (UCR)

Information Retrieval in P2PProblem:“How to efficiently retrieve Information in P2P systems where each node shares a collection of documents?”

keywords

• Documents consists of keywords.• Resembles Information Retrieval but resources are

distributed now.• Primary Data Structures such as Global Inverted

Indexes can’t be maintained efficiently.

D. Zeinalipour et al. (UCR)

Solutions for P2P Information Retrieval

1) Centralized Approaches• Centralized Indexes • e.g. Napster, SETI@HOME

2) Purely Distributed Approaches• Each node has only local

knowledge.• I.R is done using Brute force

mechanisms• e.g. Gnutella, Fasttrack (Kazaa)

3) Hybrid Approaches• One or more peers have partial

indexes of the contents of others.• e.g. Limewire's Ultrapeers

Centralized Index1) Upload Index2) Query/QueryHit3) Download (offline)

1 2

3

1) Connect2) Query/QueryHit3) Download (offline)

1) Connect2) Intelligent Query/QueryHit3) Download (offline)

1,2

3

1,2

3

D. Zeinalipour et al. (UCR)

Motivation• On 1st June we crawled the Gnutella P2P Network for 5

hours with 17 workstations.• We analyzed 15,153,524 query messages.

• Observation: High locality of specific queries.

• We try to exploit this property for more efficient searches?

D. Zeinalipour et al. (UCR)

Presentation Outline

1. Introduction: Information Retrieval (I.R) in Peer-to-Peer networks.

2. Techniques for Distributed I.R.i. Breadth-First Search.ii. Random Breadth-First Search.iii. Intelligent Search with profiling.

3. Experimental Evaluation.4. Related Work.5. Conclusions & Future Work.

D. Zeinalipour et al. (UCR)

Techniques for Distributed I.R. 1. Breadth-First Search (Gnutella)• Each Query Message is propagated along all outgoing

links of a peer using TTL (time-to-live).• TTL is decremented on each forward until it becomes 0• Technique for I.R in P2P systems such as Gnutella.• Results?

– The physical network comes to its knees– Long Delays for search results.

P2P Network N

Peer q

QUERY1

QUERYHIT2

Peer d

A

D. Zeinalipour et al. (UCR)

Techniques for Distributed I.R. 2. Modified Random BFS• Each Query Message is forwarded to only a fraction of

outgoing links (e.g. ½ of them). • TTL is again decremented on each forward until it

becomes 0.• Results?

– Fewer Messages but possibly less results– This algorithm is probabilistic. – Some segments may become

unreachable

P2P Network N

QUERY1

QUERYHIT2

Peer dPeer q

unreachable

A

B

C

D. Zeinalipour et al. (UCR)

Techniques for Distributed I.R. 3. Intelligent Search Mechanism (ISM)• Idea: Each Query Message is forwarded intelligently

based on what queries a peer answered in the past.• Components of ISM (for each node u)

a) Profile Mechanism, for each neighbor N(u).b) Peer Ranking Mechanism, for ranking peers locally and send a

search query only to the ones that most likely will answer.c) Similarity Function, for finding similar search queries.d) Search Mechanism, for propagating queries based on local

indexes

QUERY1

QUERYHIT2

Peer dPeer q ?

profiles

A

D. Zeinalipour et al. (UCR)

Techniques for Distributed I.R. 3. Intelligent Search Mechanism (ISM)a) Profile mechanism.

– Maintains a list of past queries routed through that host.– Every time a QueryHit is received the table is updated

– The profile manager uses a Least Recently Used policy to keep most recent queries in repository.

– Profiles are kept for neighbors only so the cost for maintaining this cost is O(Td), T is a limiting factor per profile, d is the degree of a node

}Size:T*dQuery GUID Connection timestamp

Elections Bush Clinton G439ID Socket1 100002222

Super Bowl San Diego F549QL --- 100065652

*** *** *** ***

Italy earthquake disaster PN329D Socket5 100022453

D. Zeinalipour et al. (UCR)

Techniques for Distributed I.R. 3. Intelligent Search Mechanism (ISM)b) Peer Ranking Mechanism.

– Before forwarding a Query Message a peer performs an on-the-fly ranking of its peers to determine the best paths.

– We use the Aggregate Similarity of peer Pi to a query q, computed by a peer Pk as:

ExampleAssume host Pk needs to forward a query q=“italy disaster” to two of

its peers {P1, P2, P3}. Pk maintains queries {q1 ,q2,. ,q5} in its profile.

P2 {P1

P3 {

Sim(q, q1) = 0.8Sim(q, q2) = 0.6Sim(q, q3) = 0.5 Sim(q, q4) = 0.4 Sim(q, q5) = 0.4

} => PsimP(P2, q) = 0.61 + 0.51 = 1.1

} => PsimP(P3, q) = 0.41 + 0.31 = 0.7

=> PsimP(P1, q) = 0.81 = 0.8

D. Zeinalipour et al. (UCR)

Techniques for Distributed I.R. 3. Intelligent Search Mechanism (ISM)c) Similarity Function – The cosine similarity.

• Assume that L is a set of all words (in Profile Manager)\e.g. L={elections, bush, clinton, super, bowl, san, diego, … ,italy, earthquake, disaster}

• We define an |L|-dimensional space where each query is a vector. If q=“italy disaster” => q (vector of q) = [0,0,0,…,1,0,1]

• Recall that we have a vector for each qi stored in the Profile Manager ( i.e. qi )

22 ||||*||||

.),cos(),(

iqq

iqqiqqqiqsim

D. Zeinalipour et al. (UCR)

Techniques for Distributed I.R. 3. Intelligent Search Mechanism (ISM)d) Search Mechanism

• Utilizes the Peer Ranking Mechanism to forward Queries to nodes that will potentially contain the info we are looking for

QUERY1

Peer d

Peer q ?

profiles

?

D. Zeinalipour et al. (UCR)

Presentation Outline

1. Introduction: Information Retrieval (I.R) in Peer-to-Peer networks.

2. Techniques for Distributed I.R.i. Breadth-First Search.ii. Random Breadth-First Search.iii. Intelligent Search with profiling.

3. Experimental Evaluation.4. Related Work.5. Conclusions & Future Work.

D. Zeinalipour et al. (UCR)

Experimental Evaluation• We use a decentralized Newspaper application

built on top of the REUTERS dataset (22,531 documents grouped by 84 countries).

• Random Network of 100 peers – Each peer has documents from 3 countries– The average degree of a node is 7 ~= log2100

(connected graph) Data-Peer (e.g. usa)

PDOM-XMLManager

P2P NetworkModule

XML Data Files

XQL

u.k

germany

france

Routing Structures (Profiles)

mexico

argentina

china

india

greece

italy

usa.graph

D. Zeinalipour et al. (UCR)

Experimental Evaluation• We perform 400 sequential queries with a delay of 4 sec.• We compare Doc. Ratio (recall rate) vs. Num. of messages

– BFS (Gnutella Message Flooding) (forward to degree nodes).– Modified BFS (randomly forward to degree/2 nodes).– Intelligent Search Mechanism

(forward to M=3 highest rank nodes + 1 random).

Search-Peer

P2P NetworkModule

u.k

germany

france

mexico

argentina

china

india

greece

italy

queries.txtresults.txt

D. Zeinalipour et al. (UCR)

Experimental Evaluation• We measure Doc. Ratio (recall rate) vs. Num. of

messages with Time-to-Live (TTL)=4 – BFS (Gnutella) uses ~763 messages w/ recall rate 100%– Random BFS(degree/2) uses ~120 (16%) msgs w/ recall rate 42%– Intelligent Search uses ~131 (17%) msgs w/ recall rate ~55%

• Recall Rate improves over time with Intelligent Search since Peer Profiles get more knowledge.

D. Zeinalipour et al. (UCR)

Experimental Evaluation• We again measure Doc. Ratio (recall rate) vs. Num. of

messages by increasing Time-to-Live (TTL) = 5– BFS (TTL=4) uses ~763 messages w/ recall rate 100%– Random BFS(degree/2) uses ~28% msgs w/ recall rate ~72%– Intelligent Search uses ~35% (of BFS msgs) w/ recall rate ~90% !

A large number of peers receive unnecessary messages. We get almost identical recall (90%) with only 35% of msgs

D. Zeinalipour et al. (UCR)

Presentation Outline

1. Introduction: Information Retrieval (I.R) in Peer-to-Peer networks.

2. Techniques for Distributed I.R.i. Breadth-First Search.ii. Random Breadth-First Search.iii. Intelligent Search with profiling.

3. Experimental Evaluation.4. Related Work.5. Conclusions & Future Work.

D. Zeinalipour et al. (UCR)

Related Work

• Improving Search in P2P B.Yang et al. (Stanford)– Iterative Deepening, until Z results are returned – Directed BFS based on aggregate statistics (e.g. num of

results a peer returned, shortest queue, forwarded the most data)

– Local Indexes, each node maintains an index over the data of peers r hops away.

• Routing Indices for P2P Crespo et al. (Stanford)– Compound Indices, each node sends a clustered

summary of its topic to its neighbors. (e.g. 100 databases, 4 theory, 10 OS)

– Might be too costly for Highly dynamic P2P systems.

D. Zeinalipour et al. (UCR)

Related Work

• Freenet (Clark et al.) Search by Identifiers.

uses SHA1 hashes of resources and information is retrieved based on the key closeness in a DFS manner.

• Others such as Chord.

Systems that focus on scalable object location, which becomes feasible by hashing and distributing objects in the P2P system. (Searches are by Identifier).

D. Zeinalipour et al. (UCR)

Conclusions• P2P systems offer several advantages such as

scalability, robustness and simplicity of use.

• Efficient P2P Information Retrieval is not feasible with the current Search Algorithms.

• We propose an Intelligent Search Mechanism that uses local knowledge to improve Information Retrieval in P2P.

• Our mechanism achieves 90% recall rate while using only 35% of the initial messaging.

D. Zeinalipour et al. (UCR)

Future Work• We plan to deploy our middleware infrastructure

on a larger P2P network with more Queries.

• We want to probe different Network Topologies such as ASMap with PowerLaws.

• We want to probe different Peer-Profile maintenance policies at peers.

• Compare the performance of our method with different proposed algorithms (iterative deepening, local indexes, etc).

D. Zeinalipour et al. (UCR)

“A Local Search Mechanism for Peer-to-Peer Networks”

Vana Kalogeraki, Dimitrios Gunopulos

& Demetris Zeinalipour (University of California – Riverside)

< [email protected], [email protected], [email protected] >

http://www.cs.ucr.edu/~csyiazti/publications.html

CIKM 2002 – Eleventh International Conference on Information and Knowledge

Management

November 4-9, Mclean VA