1
Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems
Presented by: Dongmei Jia, Illinois Institute of Technology
April 11, 2008
D. Jia, W. G. Yee, L. T. Nguyen, and O. Frieder. Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems. In Proc. of the 7th IEEE Intl. Conf. on Peer-to-Peer Computing (P2P), Ireland, Sept. 2007.
2
Outline
• Objective
• Problem
• Proposed Approach
• Experimental Results
• Conclusions
3
Objective
• To improve the accuracy of search in P2P file-sharing systems.
– Finding poorly described data.
4
Problem Statement
• Characteristics:
– Binary files (e.g., music files).
– Each replica is described by a descriptor. Descriptors are sparse and vary across peers.
– Queries are conjunctive.
• Problem: poor/sparse descriptions make files hard to match with queries!
5
Approach
• Peers independently search the network for other descriptors of their local files.
– Incorporate the found terms into the local replica's descriptor.
– The search is implemented by "probe" queries (see the sketch below).
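A minimal sketch of this probe-and-merge loop, in Python. The function `send_probe`, the descriptor size limit, and all names here are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of the probe-and-merge loop (illustrative names).
# Assumes send_probe() broadcasts a one-term probe query carrying the
# file's hash key and yields the descriptors of remote replicas.

MAX_DESCRIPTOR_SIZE = 20  # assumed local descriptor size limit

def probe_and_merge(local_descriptor: set[str], file_key: str, send_probe) -> set[str]:
    """Probe the network for other descriptors of a local file and
    fold their terms into the local descriptor."""
    for remote_descriptor in send_probe(file_key):  # each result: a set of terms
        for term in remote_descriptor:
            if len(local_descriptor) >= MAX_DESCRIPTOR_SIZE:
                return local_descriptor  # stop at the size limit
            local_descriptor.add(term)
    return local_descriptor
```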
6
Example
Two descriptors of file F: D1 = {Mozart} on Peer1, D2 = {piano} on Peer2. Query Q = {Mozart, piano}.
• No result is returned for Q: neither descriptor alone satisfies the conjunctive query.
• Peer1 probes the network ("tell me your description of F") and receives D2 = {piano} from Peer2.
• After merging, D1' = {Mozart, piano} matches Q, so F becomes findable (like a fully described replica D3 = {Mozart, piano} on Peer3 would be).
7
How P2P File-Sharing Systems Work
• Peers share a set of files.
• Each replica of a file is identified by a descriptor.
– Every descriptor contains a unique hash key (MD5) identifying the file.
• Queries are routed to all reachable peers.
• Each query result contains the matching replica's descriptor and the identity of the source server (see the sketch below).
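As a rough illustration, a replica descriptor might be modeled as below. The field and function names are hypothetical; only the MD5 key, the term set, and the server identity come from the slide:

```python
# Hypothetical model of a replica descriptor (not the paper's code).
import hashlib
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    key: str                                      # MD5 hash of the file content,
                                                  # identical across all replicas
    terms: set[str] = field(default_factory=set)  # sparse, varies per peer
    server_id: str = ""                           # peer serving this replica

def make_descriptor(content: bytes, terms: set[str], server_id: str) -> Descriptor:
    return Descriptor(key=hashlib.md5(content).hexdigest(),
                      terms=terms, server_id=server_id)
```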
8
Probe Query Design
• Contains one term: the key of a file.
– Matches all replicas of the file reached by the probe query (see the matching sketch below).
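A minimal sketch of probe-query matching, reusing the hypothetical Descriptor type from the previous slide's sketch. A probe query carries only the file key, and a replica matches iff its descriptor holds the same key:

```python
# Probe-query matching sketch (illustrative, builds on the Descriptor above).

def matches_probe(probe_key: str, descriptor: Descriptor) -> bool:
    return descriptor.key == probe_key

def answer_probe(probe_key: str, local_replicas: list[Descriptor]) -> list[Descriptor]:
    """Return the descriptors of all local replicas of the probed file."""
    return [d for d in local_replicas if matches_probe(probe_key, d)]
```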
9
Design Challenges
• When to probe
• What file to probe
• What to do with probe results
• How to control cost
• Do this in a fully distributed way
10
When to Probe?
• When a peer is not busy and under-utilized– Measured by number of responses returned Nr
• When a peer has a high desire to participate– Measured by number of files published Nf
• When the system is active– Measured by number of queries received Nq
11
When to Probe? (Cont’d)
• Triggering mechanism: probe when
T > Nr/(Nf·Nq) + Np,   with T, Nf, Nq > 0
– T: user-defined threshold
– Np: number of probe queries performed
– Nr/(Nf·Nq): number of results returned per shared file per incoming query
• All the metrics are maintained locally by each peer, so the rule is easy to implement (see the sketch below).
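The triggering rule translates directly into code. This sketch uses the slide's symbols and is illustrative only:

```python
# Probe-triggering rule sketch: Nr responses returned, Nf files published,
# Nq queries received, Np probes performed, T a user-set threshold.

def should_probe(Nr: int, Nf: int, Nq: int, Np: int, T: float) -> bool:
    """Probe when T > Nr/(Nf*Nq) + Np; defined only for Nf, Nq > 0."""
    if Nf <= 0 or Nq <= 0:
        return False  # no files shared or no queries seen yet
    return T > Nr / (Nf * Nq) + Np

# An under-utilized peer (few responses per file per query) crosses the
# threshold and probes; Np rises with each probe, damping further probing.
assert should_probe(Nr=2, Nf=10, Nq=50, Np=0, T=1.0)      # 0.004 + 0 < 1.0
assert not should_probe(Nr=2, Nf=10, Nq=50, Np=5, T=1.0)  # 0.004 + 5 > 1.0
```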
12
What File to Probe?
• Goal is to increase the participation level.
• Criteria to choose from:
– File that is least probed (RR).
– File that appears in the fewest or most query responses (LPF or MPF).
– File with the smallest descriptor.
13
What to do with Probe Results?
• Select terms from the result set to add to the local descriptor:
– Random (rand)
– Weighted random (wrand)
– Most frequent (mfreq)
– Least frequent (lfreq)
• Stop when the local descriptor size limit is reached (a sketch of these policies follows).
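A sketch of the four term-copying policies, assuming `results` is the multiset of terms gathered from probe results and `budget` is the remaining room in the local descriptor (illustrative, not the paper's code):

```python
# Term-copying policy sketch: rand, wrand, mfreq, lfreq.
import random
from collections import Counter

def copy_terms(results: list[str], budget: int, policy: str = "wrand") -> list[str]:
    freq = Counter(results)
    pool = list(freq)                     # distinct terms
    if policy == "rand":                  # uniform over distinct terms
        return random.sample(pool, min(budget, len(pool)))
    if policy == "wrand":                 # random, weighted by term frequency
        weights = [freq[t] for t in pool]
        picked = []
        while pool and len(picked) < budget:
            t = random.choices(pool, weights=weights)[0]
            i = pool.index(t)
            pool.pop(i); weights.pop(i)   # sample without replacement
            picked.append(t)
        return picked
    ranked = sorted(pool, key=freq.get, reverse=(policy == "mfreq"))
    return ranked[:budget]                # mfreq / lfreq
```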
14
Experimental Setup
[Table: query length distribution]
[Table: parameters used in the simulation]
15
Metrics
• MRR (mean reciprocal rank) = the average, over queries, of 1/rank of the first correct result.
• Precision = |A ∩ R| / |R|
• Recall = |A ∩ R| / |A|
A: set of replicas of the desired file.
R: result set of the query.
(A sketch of these metrics follows.)
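The formulas above were shown as images; this sketch encodes the standard definitions, assuming `A` is the set of replica identifiers of the desired file and `ranked` is the query's result list in rank order:

```python
# Metric sketches (standard definitions, illustrative names).

def reciprocal_rank(ranked: list[str], A: set[str]) -> float:
    """1/rank of the first correct result; MRR averages this over queries."""
    for i, r in enumerate(ranked, start=1):
        if r in A:
            return 1.0 / i
    return 0.0

def precision(ranked: list[str], A: set[str]) -> float:
    R = set(ranked)
    return len(A & R) / len(R) if R else 0.0

def recall(ranked: list[str], A: set[str]) -> float:
    return len(A & set(ranked)) / len(A) if A else 0.0
```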
16
Data
• TREC wt2g Web track.
– Arbitrary set of 1,000 Web docs from 37 Web domains.
– Preprocessing: stemming; removal of HTML markup and stop words.
– Final data set: 800,000 terms, 37,000 of them unique.
17
Experimental Results – Applying Probe Results to Local Descriptors
[Figure: MRR with Various Term Copying Techniques — bar chart of MRR (0 to 0.5) for the rand, wrand, mfreq, and lfreq term-copying techniques.]
18
Experimental Results - Probe Triggering
• No probing (base case).
• Random:
– Assign each peer a probability of probing.
– 5K probes are issued over the 10K queries.
• T5K:
– Tune T to perform 5K probes over the 10K queries.
19
Experimental Results - Probe Triggering (Cont’d)
[Figure: MRR (0 to 0.5) for the noprobe, random, and T5K probe-triggering techniques.]
MRR: random +20%; T5K +30% over the no-probe base case.
20
Experimental Results - Probe Triggering (Cont’d)
[Figure: MRR (0 to 0.5) vs. query length (1–8) for noprobe, random, and T5K.]
Probing dramatically increases the MRR of longer queries, mitigating the query over-specification problem.
21
Experimental Results - Probe Triggering (Cont’d)
[Figure: Effect of various probing rates on MRR — MRR (0 to 0.5) for noprobe, T2.5K, T5K, T7.5K, and T10K.]
22
Experimental Results - Probe File Selection
• Rand – randomly select a file to probe (base case).
• LPF – least popular first.
– Min query hits; on a tie, min descriptor size.
• MPF – most popular first.
– Max query hits; on a tie, min descriptor size.
• RR-LPF – round-robin LPF.
– Min probes; on a tie, LPF.
• RR-MPF – round-robin MPF.
– Min probes; on a tie, MPF.
A sketch of these policies follows.
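A sketch of these selection policies, assuming each shared file's record tracks its query hits, probe count, and descriptor size (illustrative names, not the paper's code):

```python
# Probe-file-selection policy sketch.
import random
from dataclasses import dataclass

@dataclass
class FileStats:
    key: str
    query_hits: int       # number of query responses this file appeared in
    probes: int           # number of times this file has been probed
    descriptor_size: int

def select_file(files: list[FileStats], policy: str) -> FileStats:
    lpf = lambda f: (f.query_hits, f.descriptor_size)   # least popular first
    mpf = lambda f: (-f.query_hits, f.descriptor_size)  # most popular first
    if policy == "LPF":
        return min(files, key=lpf)
    if policy == "MPF":
        return min(files, key=mpf)
    if policy == "RR-LPF":   # fewest probes; ties broken by LPF
        return min(files, key=lambda f: (f.probes, *lpf(f)))
    if policy == "RR-MPF":   # fewest probes; ties broken by MPF
        return min(files, key=lambda f: (f.probes, *mpf(f)))
    return random.choice(files)  # Rand base case
```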
23
Experimental Results - Probe File Selection (Cont’d)
Compared with the Rand base case, only RR-MPF achieves both better performance (~+10% MRR) and lower cost (~-10%).

           MRR   Cost  Recall  Prec.  Pct. Contained
  Rand      =     =      =       =     =
  LPF       <     <      <       <     >
  MPF       <     >      <       <     <
  RR-LPF    <     <      >       <     >
  RR-MPF    >     <      >       >     >
24
Putting Them Together…
[Figure: MRR (0 to 0.35) for noprobe vs. probe, with probing configured as T5K triggering, RR-MPF file selection, and wrand term copying.]
Probing improves MRR by ~30%.
25
Explanation
• Triggering: tuning T ensures probes are issued by underactive peers.
• File selection:
– RR avoids the same file being probed repeatedly.
– MPF improves a peer's ability to share popular files.
• Term copying:
– wrand selects from the bag of words in proportion to term frequency.
– This lets new queries be matched, with a bias toward more strongly associated terms.
26
How to Control Cost?
• Cost components:
– Probe query results.
– File query results.
• Cost metric: average number of responses per file query.
• Randomly sample each type of result on the server side with probability P (see the sketch below).
• Impact on performance?
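A one-line sketch of the sampling step; `P` is the server-side sampling probability from the slide, everything else is illustrative:

```python
# Server-side response sampling sketch: keep each matching response with
# probability P, independently, trading answer completeness for cost.
import random

def sample_responses(responses: list, P: float) -> list:
    """Keep each response with probability P (0 < P <= 1)."""
    return [r for r in responses if random.random() < P]

# E.g., P = 0.5 halves the expected number of responses per query,
# applied to probe query results and file query results alike.
```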
27
Performance/Cost Analysis
[Figure: total per-file-query cost (number of responses per file query, 0–250) vs. probe query sample rate (1, 0.75, 0.5, 0.25), with one series per file query sample rate (1, 0.75, 0.5, 0.25) plus a no-probe baseline.]
Total Per-file-query Cost for Different File and Probe Query Sampling Rates.
28
Performance/Cost Analysis (Cont’d)
MRR is increased in all sampling settings.
[Figure: MRR (0 to 0.5) vs. probe query sample rate (1, 0.75, 0.5, 0.25), with one series per file query sample rate (1, 0.75, 0.5, 0.25) plus a no-probe baseline.]
29
Performance/Cost Analysis (Cont'd)
[Figure: two panels at probe query sampling rate 0.25 — left (cost): number of responses per file query (0–250); right (performance): MRR (0 to 0.5); one bar per file query sample rate (1, 0.75, 0.5, 0.25).]
Example: sampling can simultaneously reduce cost (-15%) and improve performance (+18%).
30
Conclusions and Future Work
• Probing enriches data descriptions.
– MRR is improved by ~30%.
• Sampling is effective in controlling cost.
– It can reduce cost by 15% and improve performance by 18% at the same time.
• Future work: better ways of controlling cost.
31
• Thank You!
• Any Questions?