LSDS-IR’08 1 Cost-Effective Spam Detection in P2P File-Sharing Systems Dongmei Jia Information...
LSDS-IR’08 www.ir.iit.edu 1
Cost-Effective Spam Detection in P2P File-Sharing Systems
Dongmei Jia
Information Retrieval Lab
Illinois Institute of Technology
[email protected]
Goal
• Create cost-effective ways of automatically detecting P2P spam results without actually downloading the files
Introduction
• Spam:
  – Any file that is deliberately misrepresented, or described in a way that manipulates established retrieval and ranking techniques
• Spam is harmful
  – Degrades the user search experience
  – Assists the propagation of viruses in the network
  – Has a significant impact on P2P traffic load
Problem Statement
• Naïve spam detection method
  – Download and manually check
  – Cons:
    • Time- and labor-consuming
    • Wastes bandwidth and storage resources
    • Risks opening malware
• Hence, automatic spam detection is needed!
eMule Example
[Screenshot of eMule search results with callouts: query (number of results), descriptors, group size, file key]
It is hard to detect spam automatically in a query result set!
Types of Spam
• Type 1: Files whose replicas have semantically different descriptors
  – E.g., different song titles for the same key
26NZUBS655CC66COLKMWHUVJGUXRPVUF:
“12 days after christmas.mp3”
“i want you thalia.mp3”
“comon be my girl.mp3”
…
Types of Spam (Cont’d)
• Type 2: Files with long descriptors that contain semantically nonsensical term combinations
  – Single-descriptor problem
  – E.g., a single replica descriptor for key
1200473A4BB17724194C5B9C271F3DC4:
“Aerosmith,Van Halen,Quiet Riot,Kiss, Poison, Acdc, Accept, Def Leappard, Boney M, Megadeth, Metallica, Offspring, Beastie Boys, Run Dmc, Buckcherry, Salty Dog Remix.mp3”
Types of Spam (Cont’d)
• Type 3: Files with descriptors that contain no query terms
  – Ads or warnings about the illegal distribution of copyrighted materials
  – E.g., “Can you afford 0.09 www.BuyLegalMP3.com.mp3”
Types of Spam (Cont’d)
• Type 4: Files that are highly replicated on a single peer
  – Normal users do not create multiple replicas of the same file on a single server
  – Manipulates “group size” ranking
  – E.g., 177 replicas of the file DY2QXX3MYW75SRCWSSUG6GY3FS7N7YC shared on a single peer
Feature-Based Spam Detection
• Basic idea
  – Detect spam results using P2P features that are strongly correlated with spam
    • Vocabulary size of a file’s group descriptor
    • Variance of terms in replica descriptors D of a file group G
      – Jaccard distance: $1 - |D \cap G| / |D \cup G|$
      – Cosine distance: $1 - (V_G \cdot V_D) / (|V_G| \, |V_D|)$
    • Per-host replication degree of a file
      – numRep / numHost
    • …
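The features listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the tokenizer `terms`, the function names, and the choice to build the group descriptor G as the combined terms of all replica descriptors of a file are assumptions made for the example.

```python
import math
from collections import Counter


def terms(descriptor):
    """Lowercase a descriptor and split it into alphanumeric terms (assumed tokenizer)."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in descriptor.lower())
    return cleaned.split()


def group_terms(group):
    """All terms of a file group; assumes G is the concatenation of its replica descriptors."""
    return [t for descriptor in group for t in terms(descriptor)]


def vocabulary_size(group):
    """Number of distinct terms in the group descriptor."""
    return len(set(group_terms(group)))


def jaccard_distance(replica, group):
    """1 - |D ∩ G| / |D ∪ G| over term sets."""
    d, g = set(terms(replica)), set(group_terms(group))
    union = d | g
    return 0.0 if not union else 1.0 - len(d & g) / len(union)


def cosine_distance(replica, group):
    """1 - (V_G · V_D) / (|V_G| |V_D|) over term-frequency vectors."""
    v_d, v_g = Counter(terms(replica)), Counter(group_terms(group))
    dot = sum(count * v_g[t] for t, count in v_d.items())
    norm = math.sqrt(sum(c * c for c in v_d.values())) * math.sqrt(
        sum(c * c for c in v_g.values())
    )
    return 0.0 if norm == 0 else 1.0 - dot / norm


def per_host_replication(num_rep, num_host):
    """Average number of replicas of a file per host sharing it."""
    return num_rep / num_host
```

Applied to the Type 1 example, the three unrelated song titles sharing one key give a large vocabulary and a high Jaccard distance for each replica, whereas a group of identical descriptors gives a distance of zero.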
Probe Query
• Problem:
  – Results have insufficient and biased descriptive information
    • Conjunctive query matching
• Solution:
  – Gather more information about a result from the network
    • Other replica descriptors of the file
    • Statistics of peers who share the file
      – Number of files, number of unique files, peer ID
  – Implementation
    • A probe query contains only a file key, not a “term” query
  – Intuition
    • Probing helps to create a more complete view of a file
    • Ranking is more effective with more adequate file information
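A probe query as described here can be sketched as a lookup by file key that returns every replica descriptor plus the per-peer statistics named on the slide. The in-memory `peers` mapping, the `ProbeResponse` type, and the function name `probe` are hypothetical and stand in for real network messages.

```python
from dataclasses import dataclass


@dataclass
class ProbeResponse:
    peer_id: str
    descriptors: list       # descriptors this peer uses for the probed file key
    num_files: int          # total replicas shared by the peer
    num_unique_files: int   # distinct file keys shared by the peer


def probe(file_key, peers):
    """Answer a probe for file_key.

    peers maps peer_id -> list of (file_key, descriptor) pairs shared by that peer
    (an assumed stand-in for the network). Matching is by key only, not by terms.
    """
    responses = []
    for peer_id, shared in peers.items():
        matching = [descriptor for key, descriptor in shared if key == file_key]
        if matching:
            unique_keys = {key for key, _ in shared}
            responses.append(
                ProbeResponse(peer_id, matching, len(shared), len(unique_keys))
            )
    return responses
```

Because the probe matches on the key alone, it also returns descriptors that share no term with the user's query, which is what gives the client a more complete view of the file than conjunctive matching does.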
Evaluation
• Dataset
  – P2P audio files crawled from the Gnutella network:
    • numRep = 25,137,217; numFile = 9,575,113; numPeer = 226,786
  – 50 most popular queries in the crawled dataset
    • Representative of most users, more likely target for spam
• Metric
  – Number of spam results in the top-N ranked results, especially for a small N
• Effectiveness
  – Improves performance by 9% for top-200 results, by 92.5% for top-20 results
    • Base case: noprobe+numRep
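The metric can be written down directly. The sketch below counts spam in the top N of a ranked list; the `relative_improvement` helper assumes that "improves performance by X%" means a relative reduction in the spam count against the base case, which the slides do not state explicitly.

```python
def spam_in_top_n(ranked_is_spam, n):
    """Count spam results among the first n entries of a ranked result list.

    ranked_is_spam is a list of booleans in rank order (True = spam).
    """
    return sum(1 for is_spam in ranked_is_spam[:n] if is_spam)


def relative_improvement(baseline_spam, method_spam):
    """Fractional reduction in spam count relative to the baseline (assumed definition)."""
    return (baseline_spam - method_spam) / baseline_spam
```

For instance, if the base case averaged 4 spam results in the top N and a method averaged 1, this definition would report a 75% improvement.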
Cost Control
• Tradeoff
  – Performance vs. cost
• Cost
  – Number of responses to regular queries and probe queries
• Problem
  – Network cost is dramatically increased by probing
• How to reduce the cost?
Cost Control Approaches
• Random sampling of probe query results
• Piggy-backing of descriptor data in probe queries
• Limiting the scope of probing
Random Sampling
• Server-side random sampling of probe query results
  – Uses a predefined sampling probability P, 0 ≤ P ≤ 1
  – Predictably reduces the expected number of probe responses to a fraction P of the unsampled number
  – Impact on effectiveness of spam detection?
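A minimal sketch of server-side sampling follows, assuming each candidate probe result is returned independently with probability P. The function name and the explicit `rng` argument are choices made for this example.

```python
import random


def sample_probe_results(results, p, rng):
    """Keep each probe result independently with probability p (0 <= p <= 1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # rng.random() is in [0, 1), so p = 1 keeps everything and p = 0 keeps nothing.
    return [result for result in results if rng.random() < p]
```

With independent sampling the expected number of responses is P times the unsampled number, which is why the cost reduction is predictable. What it does to detection quality has to be measured.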
Experimental Results
[Figure: Avg Num Spam vs. Top N Results, one curve per probe query sampling rate (0.25, 0.5, 0.75, 1) plus noprobe]
[Figure: Avg Total Cost vs. Probe Query Sampling Rate (noprobe, 0.25, 0.5, 0.75, 1)]
• Cost is reduced significantly by sampling fewer probe results
• In all sampling cases, overall performance is still 1.7%-9% better than noprobe
• Performance for top-20 results is 71%-92% better than noprobe
• But the cost is still high: with 25% sampling, cost is ~7 times higher than noprobe
Piggy-backing of Descriptor Data
• Piggy-backing of descriptor data in probe queries
  – New type of probe query
    • File key + descriptor of the result file being probed
  – A server does not respond if its descriptor contains no new term compared with the descriptor in the probe query
    • Limits the number of probe results returned to the client
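The server-side check for a piggy-backed probe can be sketched as a set difference over terms. The tokenizer and the function name `should_respond` are assumptions for illustration; the slides do not specify how terms are extracted.

```python
def terms(descriptor):
    """Lowercase a descriptor and split it into a set of alphanumeric terms (assumed tokenizer)."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in descriptor.lower())
    return set(cleaned.split())


def should_respond(server_descriptor, probe_descriptor):
    """Respond only if the server's descriptor has a term the probe's descriptor lacks."""
    return bool(terms(server_descriptor) - terms(probe_descriptor))
```

A replica whose descriptor only reorders or repeats the probed descriptor's terms stays silent, while a replica with a different title still answers, so Type 1 discrepancies remain visible to the client.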
Experimental Results
[Figure: Avg Num Spam vs. Top N Results, one curve per probe query sampling rate (0.25, 0.5, 0.75, 1) plus noprobe]
[Figure: Avg Total Cost vs. Probe Query Sampling Rate (noprobe, 0.25, 0.5, 0.75, 1)]
• Compared with the original type of probe, total cost is decreased by 35%-39% for all sampling rates
• E.g., the cost with sampling rate 0.25 is ~4 times higher than noprobe
• Compared with the original type of probe, overall performance drops by ~15%
• However, performance for top-20 results is improved by 71%-88% in all sampling cases
Limiting Probing Scope
• Limiting the scope of probing
  – Only probe a few top-ranked (e.g., top-20) regular query results
  – Intuition
    • Users tend to consider downloading a file only from the few top-ranked results
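Limiting the probing scope amounts to ranking the regular query results first and probing only the head of that list. The sketch below assumes the initial ranking is by group size (number of replicas), as in the noprobe+numRep base case; the function name and the `(file_key, group_size)` tuple layout are hypothetical.

```python
def select_for_probing(results, k=20):
    """Return the file keys of the k top-ranked regular query results.

    results is a list of (file_key, group_size) pairs; ranking by descending
    group size is an assumption made for this example.
    """
    ranked = sorted(results, key=lambda item: item[1], reverse=True)
    return [file_key for file_key, _ in ranked[:k]]
```

Only the returned keys would then be probed, so the number of probe queries is capped at k regardless of how many results the regular query returned.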
Experimental Results
[Figure: Avg Num Spam vs. Top N Results (N = 1 to 20), one curve per probe query sampling rate (0.25, 0.5, 0.75, 1) plus noprobe]
[Figure: Avg Total Cost vs. Probe Query Sampling Rate (noprobe, 0.25, 0.5, 0.75, 1)]
• Probing only the top-20 results always performs 22%-56% better than noprobe
• Probing only the top-20 results significantly reduces cost
• E.g., cost with sampling rate 0.25 is only twice that of noprobe
Conclusion
• Feature-based spam detection techniques successfully decrease the amount of spam
  – 9% in top-200 results; 92% in top-20 results
• Cost control methods are effective in reducing network cost
  – The factor by which cost increases over noprobe drops from 7 to 2
  – At the same time, performance is at least 22% better than noprobe for top-20 results
References
• Limewire junk filter. http://wiki.limewire.org/index.php?title=Junk_Filter
• J. Liang, R. Kumar, Y. Xi and K. Ross. Pollution in P2P File Sharing Systems. In INFOCOM’05, May 2005.
• K. Svore, Q. Wu, C.J.C. Burges and A. Raman. Improving Web spam classification using Rank-time features. In Proc. AIRWeb workshop in WWW, 2007.
• Shlomo Hershkop, Salvatore J. Stolfo. Combining Email Models for False Positive Reduction. In Proc. KDD’05, Chicago, Aug. 2005.
• P. A. Chirita, J. Diederich, and W. Nejdl. MailRank: Using ranking for spam detection. In Proc. CIKM’05, Bremen, Germany, 2005.
• Alexandros Ntoulas, Marc Najork, Mark Manasse, Dennis Fetterly. Detecting spam web pages through content analysis. In Proc. of WWW’06.
• Sepandar D. Kamvar, Mario T. Schlosser, and Hector Garcia-Molina. The EigenTrust Algorithm for Reputation Management in P2P Networks. In Proc. of WWW, 2003.
• Gyöngyi, Z., Berkhin, P., Garcia-Molina, H., Pedersen, J. Link spam detection based on mass estimation. In Proc. of the 32nd International Conference on Very Large Data Bases (VLDB), ACM Press (2006), 439-450.
• Limewire. www.limewire.org
• Runfang Zhou and Kai Hwang. Gossip-based Reputation Aggregation for Unstructured Peer-to-Peer Networks. 21st IEEE International Parallel & Distributed Processing Symposium (IPDPS’07), Los Angeles, March 26-30, 2007.
• Kevin Walsh, Emin Gun Sirer. Experience with an Object Reputation System for Peer-to-Peer Filesharing. In 3rd Symposium on Networked Systems Design & Implementation (NSDI), 2006.
• Uichin Lee, Min Choi, Junghoo Cho, Medy Y. Sanadidi, Mario Gerla. Understanding Pollution Dynamics in P2P File Sharing. In Proc. IPTPS’06.
• Questions?
• Contact info:
  – WWW: www.ir.iit.edu
  – Email: [email protected]
Thanks from IIT’s IR Lab!
Related Work
• Email spam detection
  – Hershkop et al., KDD’05
    • Analyze email content and syntax
  – Chirita et al., CIKM’05
    • Construct social networks for email addresses
• Web spam detection
  – Ntoulas et al., WWW’06
    • Analyze content of Web pages
  – Gyöngyi et al., VLDB’06
    • Analyze link structure of Web pages
Related Work (Cont’d)
• P2P spam detection
  – Spam filter in Limewire
    • User-controlled spam learning
  – Liang et al., INFOCOM’05
    • Detect spam using extra info, e.g., official CD length of a media file
  – Kamvar et al., WWW’03
    • Build reputation systems to rank peers
Simulating P2P search
• Built a system to simulate P2P search on the client side
• Simulating query routing
  – A query is sent to 50 randomly chosen peers
  – Repeat until either stop condition is satisfied
    • Condition 1: number of unique results reaches 200
    • Condition 2: number of peers that have received the query reaches 50K
  – Threshold values chosen based on specifications of real-world P2P systems (e.g., Limewire’s Gnutella)
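The routing loop above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' simulator: peers are sampled without replacement, a "unique result" is taken to be a distinct file key, matching is conjunctive over descriptor terms, and the `peers` mapping and all names are hypothetical.

```python
import random


def terms(descriptor):
    """Lowercase a descriptor and split it into a set of alphanumeric terms (assumed tokenizer)."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in descriptor.lower())
    return set(cleaned.split())


def simulate_query(query, peers, rng, batch_size=50, max_results=200, max_peers=50_000):
    """Send the query to random batches of peers until a stop condition holds.

    peers maps peer_id -> list of (file_key, descriptor) pairs.
    Returns (results, contacted): results maps file_key -> matching descriptors.
    """
    query_terms = terms(query)
    unvisited = list(peers)
    rng.shuffle(unvisited)  # random order, each peer contacted at most once
    results = {}
    contacted = 0
    while unvisited and len(results) < max_results and contacted < max_peers:
        batch, unvisited = unvisited[:batch_size], unvisited[batch_size:]
        contacted += len(batch)
        for peer_id in batch:
            for file_key, descriptor in peers[peer_id]:
                # Conjunctive matching: every query term must appear in the descriptor.
                if query_terms <= terms(descriptor):
                    results.setdefault(file_key, []).append(descriptor)
    return results, contacted
```

The stop conditions are checked between batches, so the final result count can exceed `max_results` by whatever the last batch contributed.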
Experimental Results
[Figure: Avg Num Spam vs. Top N Results for noprobe+numRep, noprobe+CosineQD, probe+numRep, probe+Cosine, probe+Jaccard, probe+numUniqueTerms]
• Compared with noprobe+numRep, probe+Cosine improves performance by 9% for top-200 results, by 92.5% for top-20 results
• Compared with noprobe+CosineQD, the improvements are 21.6% and 97.8%, respectively
Experimental Results (Cont’d)
Compare Cosine/Jaccard distance with numUniqueTerms in a fair way by only considering multi-replica files
[Figure: Avg Num Spam vs. Top N Results for probe+Cosine, probe+Jaccard, probe+numUniqueTerms]