1
Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems
Presented by: Dongmei Jia, Illinois Institute of Technology
April 11, 2008
D. Jia, W. G. Yee, L. T. Nguyen, and O. Frieder. Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems. In Proc. of the 7th IEEE Intl. Conf. on Peer-to-Peer Computing (P2P), Ireland, Sept. 2007.
2
Outline
• Objective
• Problem
• Proposed Approach
• Experimental Results
• Conclusions
3
Objective
• To improve the accuracy of search in P2P file-sharing systems.
– Finding poorly described data.
4
Problem Statement
• Characteristics:
– Binary files (e.g., music files).
– Each replica is described by a descriptor. Descriptors are sparse and vary across peers.
– Queries are conjunctive.
• Problem: poor/sparse descriptions make files hard to match with queries!
5
Approach
• Peers independently search the network for other descriptors of their local files.
– Incorporate the found terms into the local replica's descriptor.
– The search is implemented by "probe" queries (see the sketch below).
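A minimal sketch of this probe-and-merge loop, in Python. The function `send_probe`, the descriptor size limit, and all names here are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of the probe-and-merge loop (illustrative names).
# Assumes send_probe() broadcasts a one-term probe query carrying the
# file's hash key and yields the descriptors of remote replicas.

MAX_DESCRIPTOR_SIZE = 20  # assumed local descriptor size limit

def probe_and_merge(local_descriptor: set[str], file_key: str, send_probe) -> set[str]:
    """Probe the network for other descriptors of a local file and
    fold their terms into the local descriptor."""
    for remote_descriptor in send_probe(file_key):  # each result: a set of terms
        for term in remote_descriptor:
            if len(local_descriptor) >= MAX_DESCRIPTOR_SIZE:
                return local_descriptor  # stop at the size limit
            local_descriptor.add(term)
    return local_descriptor
```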
6
Example
Two descriptors of file F: D1 = {Mozart} on Peer1, D2 = {piano} on Peer2. Query Q = {Mozart, piano}.
• No result is returned for Q: neither descriptor alone satisfies the conjunctive query.
• Peer1 probes the network ("tell me your description of F") and receives D2 = {piano} from Peer2.
• After merging, D1' = {Mozart, piano} matches Q, so F becomes findable (like a fully described replica D3 = {Mozart, piano} on Peer3 would be).
7
How P2P File-Sharing Systems Work
• Peers share a set of files.
• Each replica of a file is identified by a descriptor.
– Every descriptor contains a unique hash key (MD5) identifying the file.
• Queries are routed to all reachable peers.
• Each query result contains the matching replica's descriptor and the identity of the source server (see the sketch below).
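As a rough illustration, a replica descriptor might be modeled as below. The field and function names are hypothetical; only the MD5 key, the term set, and the server identity come from the slide:

```python
# Hypothetical model of a replica descriptor (not the paper's code).
import hashlib
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    key: str                                      # MD5 hash of the file content,
                                                  # identical across all replicas
    terms: set[str] = field(default_factory=set)  # sparse, varies per peer
    server_id: str = ""                           # peer serving this replica

def make_descriptor(content: bytes, terms: set[str], server_id: str) -> Descriptor:
    return Descriptor(key=hashlib.md5(content).hexdigest(),
                      terms=terms, server_id=server_id)
```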
8
Probe Query Design
• Contains one term: the key of a file.
– Matches all replicas of the file reached by the probe query (see the matching sketch below).
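A minimal sketch of probe-query matching, reusing the hypothetical Descriptor type from the previous slide's sketch. A probe query carries only the file key, and a replica matches iff its descriptor holds the same key:

```python
# Probe-query matching sketch (illustrative, builds on the Descriptor above).

def matches_probe(probe_key: str, descriptor: Descriptor) -> bool:
    return descriptor.key == probe_key

def answer_probe(probe_key: str, local_replicas: list[Descriptor]) -> list[Descriptor]:
    """Return the descriptors of all local replicas of the probed file."""
    return [d for d in local_replicas if matches_probe(probe_key, d)]
```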
9
Design Challenges
• When to probe
• What file to probe
• What to do with probe results
• How to control cost
• Do this in a fully distributed way
10
When to Probe?
• When a peer is not busy and under-utilized– Measured by number of responses returned Nr
• When a peer has a high desire to participate– Measured by number of files published Nf
• When the system is active– Measured by number of queries received Nq
11
When to Probe? (Cont’d)
• Triggering mechanism: probe when
T > Nr/(Nf·Nq) + Np,   with T, Nf, Nq > 0
– T: user-defined threshold
– Np: number of probe queries performed
– Nr/(Nf·Nq): number of results returned per shared file per incoming query
• All the metrics are maintained locally by each peer, so the rule is easy to implement (see the sketch below).
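The triggering rule translates directly into code. This sketch uses the slide's symbols and is illustrative only:

```python
# Probe-triggering rule sketch: Nr responses returned, Nf files published,
# Nq queries received, Np probes performed, T a user-set threshold.

def should_probe(Nr: int, Nf: int, Nq: int, Np: int, T: float) -> bool:
    """Probe when T > Nr/(Nf*Nq) + Np; defined only for Nf, Nq > 0."""
    if Nf <= 0 or Nq <= 0:
        return False  # no files shared or no queries seen yet
    return T > Nr / (Nf * Nq) + Np

# An under-utilized peer (few responses per file per query) crosses the
# threshold and probes; Np rises with each probe, damping further probing.
assert should_probe(Nr=2, Nf=10, Nq=50, Np=0, T=1.0)      # 0.004 + 0 < 1.0
assert not should_probe(Nr=2, Nf=10, Nq=50, Np=5, T=1.0)  # 0.004 + 5 > 1.0
```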
12
What File to Probe?
• Goal is to increase the participation level.
• Criteria to choose from:
– File that is least probed (RR).
– File that appears in the fewest or most query responses (LPF or MPF).
– File with the smallest descriptor.
13
What to do with Probe Results?
• Select terms from the result set to add to the local descriptor:
– Random (rand)
– Weighted random (wrand)
– Most frequent (mfreq)
– Least frequent (lfreq)
• Stop when the local descriptor size limit is reached (a sketch of these policies follows).
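A sketch of the four term-copying policies, assuming `results` is the multiset of terms gathered from probe results and `budget` is the remaining room in the local descriptor (illustrative, not the paper's code):

```python
# Term-copying policy sketch: rand, wrand, mfreq, lfreq.
import random
from collections import Counter

def copy_terms(results: list[str], budget: int, policy: str = "wrand") -> list[str]:
    freq = Counter(results)
    pool = list(freq)                     # distinct terms
    if policy == "rand":                  # uniform over distinct terms
        return random.sample(pool, min(budget, len(pool)))
    if policy == "wrand":                 # random, weighted by term frequency
        weights = [freq[t] for t in pool]
        picked = []
        while pool and len(picked) < budget:
            t = random.choices(pool, weights=weights)[0]
            i = pool.index(t)
            pool.pop(i); weights.pop(i)   # sample without replacement
            picked.append(t)
        return picked
    ranked = sorted(pool, key=freq.get, reverse=(policy == "mfreq"))
    return ranked[:budget]                # mfreq / lfreq
```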
14
Experimental Setup
[Table: query length distribution]
[Table: parameters used in the simulation]
15
Metrics
• MRR (mean reciprocal rank) = the average, over queries, of 1/rank of the first correct result.
• Precision = |A ∩ R| / |R|
• Recall = |A ∩ R| / |A|
A: set of replicas of the desired file.
R: result set of the query.
(A sketch of these metrics follows.)
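The formulas above were shown as images; this sketch encodes the standard definitions, assuming `A` is the set of replica identifiers of the desired file and `ranked` is the query's result list in rank order:

```python
# Metric sketches (standard definitions, illustrative names).

def reciprocal_rank(ranked: list[str], A: set[str]) -> float:
    """1/rank of the first correct result; MRR averages this over queries."""
    for i, r in enumerate(ranked, start=1):
        if r in A:
            return 1.0 / i
    return 0.0

def precision(ranked: list[str], A: set[str]) -> float:
    R = set(ranked)
    return len(A & R) / len(R) if R else 0.0

def recall(ranked: list[str], A: set[str]) -> float:
    return len(A & set(ranked)) / len(A) if A else 0.0
```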
16
Data
• TREC wt2g Web track.
– Arbitrary set of 1,000 Web docs from 37 Web domains.
– Preprocessing: stemming; removal of HTML markup and stop words.
– Final data set: 800,000 terms, 37,000 of them unique.
17
Experimental Results – Applying Probe Results to Local Descriptors
[Figure: MRR with Various Term Copying Techniques — bar chart of MRR (0 to 0.5) for the rand, wrand, mfreq, and lfreq term-copying techniques.]
18
Experimental Results - Probe Triggering
• No probing (base case).
• Random:
– Assign each peer a probability of probing.
– 5K probes are issued over the 10K queries.
• T5K:
– Tune T to perform 5K probes over the 10K queries.
19
Experimental Results - Probe Triggering (Cont’d)
[Figure: MRR (0 to 0.5) for the noprobe, random, and T5K probe-triggering techniques.]
MRR: random +20%; T5K +30% over the no-probe base case.
20
Experimental Results - Probe Triggering (Cont’d)
[Figure: MRR (0 to 0.5) vs. query length (1–8) for noprobe, random, and T5K.]
Probing dramatically increases the MRR of longer queries, mitigating the query over-specification problem.
21
Experimental Results - Probe Triggering (Cont’d)
[Figure: Effect of various probing rates on MRR — MRR (0 to 0.5) for noprobe, T2.5K, T5K, T7.5K, and T10K.]
22
Experimental Results - Probe File Selection
• Rand – randomly select a file to probe (base case).
• LPF – least popular first.
– Min query hits; on a tie, min descriptor size.
• MPF – most popular first.
– Max query hits; on a tie, min descriptor size.
• RR-LPF – round-robin LPF.
– Min probes; on a tie, LPF.
• RR-MPF – round-robin MPF.
– Min probes; on a tie, MPF.
A sketch of these policies follows.
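A sketch of these selection policies, assuming each shared file's record tracks its query hits, probe count, and descriptor size (illustrative names, not the paper's code):

```python
# Probe-file-selection policy sketch.
import random
from dataclasses import dataclass

@dataclass
class FileStats:
    key: str
    query_hits: int       # number of query responses this file appeared in
    probes: int           # number of times this file has been probed
    descriptor_size: int

def select_file(files: list[FileStats], policy: str) -> FileStats:
    lpf = lambda f: (f.query_hits, f.descriptor_size)   # least popular first
    mpf = lambda f: (-f.query_hits, f.descriptor_size)  # most popular first
    if policy == "LPF":
        return min(files, key=lpf)
    if policy == "MPF":
        return min(files, key=mpf)
    if policy == "RR-LPF":   # fewest probes; ties broken by LPF
        return min(files, key=lambda f: (f.probes, *lpf(f)))
    if policy == "RR-MPF":   # fewest probes; ties broken by MPF
        return min(files, key=lambda f: (f.probes, *mpf(f)))
    return random.choice(files)  # Rand base case
```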
23
Experimental Results - Probe File Selection (Cont’d)
Compared with the Rand base case, only RR-MPF achieves both better performance (~+10% MRR) and lower cost (~-10%).

           MRR   Cost  Recall  Prec.  Pct. Contained
  Rand      =     =      =       =     =
  LPF       <     <      <       <     >
  MPF       <     >      <       <     <
  RR-LPF    <     <      >       <     >
  RR-MPF    >     <      >       >     >
24
Putting Them Together…
[Figure: MRR (0 to 0.35) for noprobe vs. probe, with probing configured as T5K triggering, RR-MPF file selection, and wrand term copying.]
Probing improves MRR by ~30%.
25
Explanation
• Triggering: tuning T ensures probes are issued by underactive peers.
• File selection:
– RR avoids the same file being probed repeatedly.
– MPF improves a peer's ability to share popular files.
• Term copying:
– wrand selects from the bag of words in proportion to term frequency.
– This lets new queries be matched, with a bias toward more strongly associated terms.
26
How to Control Cost?
• Cost components:
– Probe query results.
– File query results.
• Cost metric: average number of responses per file query.
• Randomly sample each type of result on the server side with probability P (see the sketch below).
• Impact on performance?
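A one-line sketch of the sampling step; `P` is the server-side sampling probability from the slide, everything else is illustrative:

```python
# Server-side response sampling sketch: keep each matching response with
# probability P, independently, trading answer completeness for cost.
import random

def sample_responses(responses: list, P: float) -> list:
    """Keep each response with probability P (0 < P <= 1)."""
    return [r for r in responses if random.random() < P]

# E.g., P = 0.5 halves the expected number of responses per query,
# applied to probe query results and file query results alike.
```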
27
Performance/Cost Analysis
[Figure: total per-file-query cost (number of responses per file query, 0–250) vs. probe query sample rate (1, 0.75, 0.5, 0.25), with one series per file query sample rate (1, 0.75, 0.5, 0.25) plus a no-probe baseline.]
Total Per-file-query Cost for Different File and Probe Query Sampling Rates.
28
Performance/Cost Analysis (Cont’d)
MRR is increased in all sampling settings.
[Figure: MRR (0 to 0.5) vs. probe query sample rate (1, 0.75, 0.5, 0.25), with one series per file query sample rate (1, 0.75, 0.5, 0.25) plus a no-probe baseline.]
29
Performance/Cost Analysis (Cont'd)
[Figure: two panels at probe query sampling rate 0.25 — left (cost): number of responses per file query (0–250); right (performance): MRR (0 to 0.5); one bar per file query sample rate (1, 0.75, 0.5, 0.25).]
Example: sampling can simultaneously reduce cost (-15%) and improve performance (+18%).
30
Conclusions and Future Work
• Probing enriches data descriptions.
– MRR is improved by ~30%.
• Sampling is effective in controlling cost.
– It can reduce cost by 15% and improve performance by 18% at the same time.
• Future work: better ways of controlling cost.
31
• Thank You!
• Any Questions?