Andrew G. West Wikimania `11 – August 5, 2011 Autonomous Detection of Collaborative Link Spam.

Big Idea
Design a framework to detect link spam additions to wikis/Wikipedia, including those employing:
(1) Subtlety: aims at link persistence (status quo)
(2) Vulnerabilities identified in recent literature [1]
And use this tool’s functionality to:
(1) Autonomously undo obvious spam (i.e., a bot)
(2) Feed non-obvious but questionable instances to human patrollers in a streamlined fashion

Outline
• Motivation
• Corpus construction
• Features
• Performance
• Live implementation
• Demonstration

External Link Spam
• Any external link (i.e., not a wikilink) that violates the subjective link policy [2] is external link spam:
  • Issues with the destination URL (the landing site). For example: commercial intent, shock sites, non-notable sources.
  • Issues with presentation. For example: putting a link atop an article.
Example (link placed atop an article):
Wikitext: [[http://www.google.com Visit Google!]]
Rendered: Visit Google!

Research Motivations

Motive: Social Spam
• Not entirely different from link spam in other UGC/collaborative applications
  • Much literature [3]
• But wikis/WP are unique:
  • Not append-only; fewer formatting constraints
  • Massive traffic in a single installation
  • Large topic space: spam relevant to landing site
  • Mitigation entirely community-driven

Motive: Incentive
• Much research on wiki vandalism (link spam is a subset)
  • But most vandalism is “offensive” or “nonsense” [4]
  • Not well incentivized, whereas link spam likely serves poster self-interest (e.g., monetary, lobbying)
  • Thus, more aggressive/optimal attack tactics expected
• In [1], examined the status quo nature of WP link spam:
  • “Nuisance”: order of magnitude less frequent than vandalism. See also [[WP:WPSPAM]].
  • Less sophistication than seen in other spam domains
  • Subtlety: Even spam links follow conventions. Perhaps an attempt to deceive patrollers/watch-listers. It doesn’t work.

Motive: Vulnerabilities
• Novel attack model [1], exploits human latency and appears economically viable:
  • High-traffic page targets
  • Prominent placement
  • Script-driven operation via autonomously attained privileged accounts
  • Distributed
• Other recent concerns:
  • Declining labor force [5]
  • XRumer (blackhat SEO) [6]
[Figure: article traffic statistics (peak views/second; highest avg. views/day), calculated Jan.-Aug. 2010]

Corpus Construction

Corpus: Spam/Ham
• SPAM edits are those that:
  1. Added exactly one link to an HTML document
  2. Made no changes other than the link addition and its immediate supporting “context”
  3. Were rolled back by a privileged editor, where rollback is a tool used only for “blatantly unproductive edits”
• HAM edits are those:
  1. Added by a privileged user
  2. Meeting criteria (1) and (2) above
By relying on implicit actions, we save time and allow privileged users to define spam on a case-by-case basis
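The labeling criteria above can be sketched as a small function. This is only an illustration: the `Edit` record, its field names, and the choice to test rollback before author privilege are assumptions of the sketch, not details given in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edit:
    """Hypothetical summary of one edit; field names are illustrative only."""
    html_links_added: int        # links to HTML documents added by the edit
    only_link_and_context: bool  # nothing changed besides the link and its context
    rolled_back: bool            # undone via rollback by a privileged editor
    author_privileged: bool      # the edit's author holds privileged rights


def label_edit(edit: Edit) -> Optional[str]:
    """Return "SPAM", "HAM", or None when the edit is left out of the corpus."""
    # Criteria (1) and (2) are shared by both classes.
    if edit.html_links_added != 1 or not edit.only_link_and_context:
        return None
    # Criterion (3): a rollback marks the link as spam.
    # Assumption: rollback is checked before author privilege.
    if edit.rolled_back:
        return "SPAM"
    # A link added by a privileged user (and not rolled back) is ham.
    if edit.author_privileged:
        return "HAM"
    return None
```

Edits that match neither class (for example, an unprivileged addition that was never rolled back) simply stay unlabeled in this sketch.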

Corpus: Context (1)
Because the link was the ONLY change made, the privileged user’s decision to roll back that edit speaks DIRECTLY to that link’s inappropriateness.

Corpus: Context (2)

Corpus: Collection
• ≈2 months of data collection in early 2011 (en.wiki)
  • Done in real-time via the STiki framework [7]
  • Also retrieved the landing site for each link
• Be careful of biasing features!
[Figure: corpus filtering funnel]
All links collected: 238,371
Links to HTML doc.: 188,210
  Was rolled back: 2,865, of which only link “context”: 1,095 (SPAM)
  Added by priv. user: 50,108, of which only link “context”: 4,867 (HAM)

LBL    NUM#    PER%
SPAM   1,095   18.4%
HAM    4,867   81.6%
Labels for both classes come from human actions.

Features
55+ features implemented and described in [8]
For brevity, focus on features that: (1) are strong indicators, and (2) have intuitive presentation
Three major feature groups:
(1) Wikipedia-driven
(2) Landing-site driven
(3) Third-party based

Features: Wiki (1)
• Examples of Wikipedia features:
  • URL: Length, sub-domain quantity, TLD (.com, .org, etc.)
  • Article: Length, popularity, age
  • Presentation: Where in article, citation, description length
  • URL History: Addition quantity, diversity of adding users
  • Metadata: Revision comment length, time-of-day
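As a rough illustration of the URL group of features, the sketch below derives length, sub-domain count, and TLD from a URL string using only the Python standard library. The function name, the returned keys, and the rule that everything left of the last two host labels counts as a sub-domain are assumptions made for this example; the actual feature definitions are in [8].

```python
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Compute three simple URL features (an illustrative sketch)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = [label for label in host.split(".") if label]
    return {
        "url_length": len(url),
        # Assumption: labels left of the last two (e.g. "www") are sub-domains.
        "subdomain_count": max(len(labels) - 2, 0),
        # The TLD is taken as the final host label (e.g. "com", "org").
        "tld": labels[-1] if labels else "",
    }
```

Calling `url_features("http://www.example.com/doc.html")` gives one sub-domain and the TLD `com`; multi-part public suffixes such as `.co.uk` would need a suffix list, which this sketch leaves out.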

Features: Wiki (2)
• Long URLs are good URLs:
  • www.example.com vs. www.example.com/doc.html
  • Former more likely to be promotional
• Spam is rarely used in citations
  • Advanced syntax implies advanced editors

Features: Wiki (3)
≈ 85% of spam leaves no “revision comment” vs. < 20% of ham
TLDs with greater admin. control tend to be good. Also correlates well with registration cost

Features: Wiki (4)
An article’s non-citation links “stabilize” with time
(Non-cites tend to have their own section at article bottom)

Features: Site
• We fetch and process the HTML source of the landing site
  • Spam destinations marginally more profane/commercial (SEO?)
• Re-implement features from a study of email spam URLs [9]
  • Opposite results from that work
• TAKEAWAY: Subtlety and link diversity impair this analysis
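One way to picture landing-site processing is a function that strips markup from already-fetched HTML and measures how much of the visible text is commercial vocabulary. This is a minimal sketch under stated assumptions: the word list and the `commercial_ratio` name are hypothetical, and the talk does not say how its profanity/commercial measures are actually computed.

```python
from html.parser import HTMLParser
import re

# Hypothetical word list, for illustration only; the talk does not give one.
COMMERCIAL_TERMS = {"buy", "cheap", "discount", "free", "sale"}


class _TextExtractor(HTMLParser):
    """Collects text found outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def commercial_ratio(html: str) -> float:
    """Fraction of visible words that appear in COMMERCIAL_TERMS."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z']+", " ".join(parser.chunks).lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in COMMERCIAL_TERMS)
    return hits / len(words)
```

The sketch takes the HTML as a string so that fetching, timeouts, and redirects stay out of scope.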

Features: 3rd Party (1)
Two third-party sources queried:
• Google Safe Browsing Project: Lists suspected phishing/malware hosts
• Alexa WIS [10]: Data from web-crawling and toolbar, including traffic data, WHOIS, load times…
Google lists produce NO corpus matches:
• So worthless during learning/training
• But a match is unambiguously bad…
• …so bring into force via statically authored rules
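A statically authored rule of this kind can sit in front of the learned score. The sketch below is only an illustration: it checks a local set of hostnames rather than calling the real Safe Browsing service, and both the function name and the convention that a higher score means "more spam-like" are assumptions.

```python
import math
from urllib.parse import urlsplit


def score_with_static_rule(url: str, model_score: float, blacklisted_hosts: set) -> float:
    """Return the learned score unless a static blacklist rule fires."""
    host = (urlsplit(url).hostname or "").lower()
    # A list match is treated as unambiguously bad, so it overrides the model.
    if host in blacklisted_hosts:
        return math.inf
    return model_score
```

Because the rule never appears in training data, it contributes nothing to the model and acts purely as an override at scoring time.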

Features: 3rd Party (2)
• Backlinks (BLs):
  • #1 weighted feature
  • At median, ham has 850 BLs, spam has 20 BLs (≈40× difference)
  • Intuitive: basis for search-engine rank
• Continent of WHOIS registration
  • Asia especially poor
• Other good, non-intuitive Alexa feats.

Performance

Perform: ADTrees
• Features turned into ML model via ADTree algorithm
  • Human-readable
  • Enumerated features
• In practice…
  • Tree-depth of 15
  • About 35 features
• Evaluation performed using standard 90/10 cross validation.
Example ADTree:
[Figure: root value val = 0.0; decision nodes BACKLINKS > 200, IS_CITE == TRUE, and COMM_LEN > 0; prediction values Y: +0.8, N: -0.4, Y: +0.6, Y: +0.2, N: -0.1, N: -0.6; deeper nodes elided (…)]
if (final_value > 0): HAM
if (final_value < 0): SPAM
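The example tree can be read as a sum of prediction values. The sketch below is one possible reading, not the trained model: it assumes all three tests hang directly off the root, and it assumes the pairing BACKLINKS (Y: +0.8, N: -0.4), IS_CITE (Y: +0.6, N: -0.1), COMM_LEN (Y: +0.2, N: -0.6), which the flattened figure does not make certain. Function and parameter names are hypothetical.

```python
def adtree_score(backlinks: int, is_cite: bool, comment_length: int) -> float:
    """Sum the prediction values reached in the example tree (assumed layout)."""
    value = 0.0  # root prediction value
    value += 0.8 if backlinks > 200 else -0.4       # BACKLINKS > 200
    value += 0.6 if is_cite else -0.1               # IS_CITE == TRUE (pairing assumed)
    value += 0.2 if comment_length > 0 else -0.6    # COMM_LEN > 0 (pairing assumed)
    return value


def classify(final_value: float) -> str:
    """Apply the sign rule from the slide."""
    if final_value > 0:
        return "HAM"
    if final_value < 0:
        return "SPAM"
    # The slide does not say how a value of exactly zero is handled.
    return "UNDECIDED"
```

For instance, a cited link with many backlinks and a revision comment sums to a positive value and is classified as ham, while an uncited link with few backlinks and no comment sums to a negative value and is classified as spam.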

Perform: Piecewise
• Obviously, much better than random predictions (status quo)
• Wikipedia-driven features weighed most helpful
  • But also must consider robustness issues

Perform: All

Live Implementation

Live: Architecture
[Figure: system architecture. Components: Wikipedia and its IRC feed (#enwiki#); the spam-scoring engine (STiki Services: Scoring/ADTree, Queue Maintenance) with inputs from the Wiki-DB, the landing site (<HTML>…</HTML>), and third parties (Alexa, Safe-Browse); a Bot Handler that compares the score to a threshold and issues a REVERT (“If spam, revert and warn”); an Edit Queue ordered from likely vandalism to likely innocent; and STiki Clients that fetch, display, and classify edits.]
• Bringing live is trivial via IRC and implemented ADTree
• But how to apply the scores:
  • Autonomously (i.e., a bot); threshold; approval-pending
  • Prioritized human-review via STiki [7]
    • Priority-queue used in crowd-sourced fashion
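The two ways of applying scores can be sketched together: a threshold decides when the bot reverts, and everything else goes to a priority queue that human reviewers drain from the most suspicious edit downward. This is a minimal sketch, not STiki's implementation; the class and function names are hypothetical, higher scores are assumed to mean "more spam-like", and queuing sub-threshold edits is an assumption.

```python
import heapq
import itertools


class EditQueue:
    """Priority queue of edits awaiting human review (highest score first)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: earlier edits first

    def push(self, edit_id, score):
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), edit_id))

    def pop(self):
        _, _, edit_id = heapq.heappop(self._heap)
        return edit_id

    def __len__(self):
        return len(self._heap)


def dispatch(edit_id, score, threshold, queue):
    """Queue the edit for humans below the threshold; otherwise revert."""
    if score < threshold:
        queue.push(edit_id, score)  # assumption: sub-threshold edits are queued
        return "QUEUED"
    return "REVERT"
```

A high threshold keeps the bot conservative, leaving borderline cases to reviewers who always see the highest-scoring queued edit next.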

Live: Demonstration
Software Demonstration
open-source [7], [[WP:STiki]]

Live: Discussion
• Practical implementation notes
  • Multiple links: Score individually, assign worst
  • Dead links (i.e., HTTP 404): Reported to special page
  • Non-HTML destinations: Omit corresponding features
• Static rules needed to capture novel attack vectors
  • Features are in place: Page placement, article popularity, link histories, diversity quotients, Safe Browsing lists.
  • Pattern matches result in arbitrarily high scores
• Offline vs. online performance
  • Bot performance immediate; should be equal
  • Difficult to quantify decline with prioritized humans (STiki)
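The practical notes on multiple, dead, and non-HTML links can be sketched as one scoring routine. Everything named here is hypothetical: `FetchedLink`, `score_edit`, and the `score_link` callback are illustrative, higher scores are assumed to mean "more spam-like", and skipping the scoring of dead links is an assumption (the talk only says they are reported).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class FetchedLink:
    """Hypothetical record of a link after its landing site was requested."""
    url: str
    status: int        # HTTP status of the landing site
    content_type: str  # e.g. "text/html"


def score_edit(
    links: List[FetchedLink],
    score_link: Callable[[FetchedLink, bool], float],
) -> Tuple[Optional[float], List[str]]:
    """Return (worst link score, dead-link URLs to report) for one edit."""
    dead = [link.url for link in links if link.status == 404]
    scores = []
    for link in links:
        if link.status == 404:
            continue  # assumption: dead links are reported, not scored
        # Landing-site features only make sense for HTML destinations.
        use_site_features = link.content_type.startswith("text/html")
        scores.append(score_link(link, use_site_features))
    # The edit inherits the worst (highest, by assumption) individual score.
    worst = max(scores) if scores else None
    return worst, dead
```

Passing the scorer in as a callback keeps the sketch independent of any particular model.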

Live: Gamesmanship
How would an attacker circumvent our system?
• Content-optimization (need for robust features)
• TOCTTOU attacks (i.e., redirect after inspection)
  • Rechecking on an interval is very expensive
  • But a practical concern; LTA case [[WP:UNID]]
• Crawler redirection
  • Determine our system’s IP; serve better content
  • A more distributed operational base
• Denial-of-service (overloading system with links) + STiki
Solutions to these kinds of problems remain future work

References
[1] A. G. West, J. Chang, K. Venkatasubramanian, O. Sokolsky, and I. Lee. Link spamming Wikipedia for profit. In CEAS ’11, September 2011.
[2] Wikipedia: External links. http://en.wikipedia.org/wiki/Wikipedia:External_links.
[3] P. Heymann, G. Koutrika, and H. Garcia-Molina. Fighting spam on social web sites: A survey of approaches and future challenges. IEEE Internet Comp., 11(6), 2007.
[4] R. Priedhorsky, J. Chen, S. K. Lam, K. Panciera, L. Terveen, and J. Riedl. Creating, destroying, and restoring value in Wikipedia. In GROUP ’07, 2007.
[5] E. Goldman. Wikipedia’s labor squeeze and its consequences. Journal of Telecommunications and High Technology Law, 8, 2009.
[6] Y. Shin, M. Gupta, and S. Myers. The nuts and bolts of a forum spam automator. In LEET ’11: 4th Wkshp. on Large-Scale Exploits and Emergent Threats, 2011.
[7] A. G. West. STiki: A vandalism detection tool for Wikipedia. http://en.wikipedia.org/wiki/Wikipedia:STiki. Software, 2010.
[8] A. G. West, A. Agarwal, P. Baker, B. Exline, and I. Lee. Autonomous Link Spam Detection in Purely Collaborative Environments. In WikiSym ’11.
[9] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW ’06: World Wide Web Conference, 2006.
[10] Alexa web information service. http://aws.amazon.com/awis/.

Backup slides (1)

Backup slides (2)
LEFT: Pipeline of typical Wikipedia link spam detection, including both actors and latency. Actors, spanning prevention and detection: Blacklists, Patrollers, Watchlisters, Readers. Latency, in the same order: Immediate, Seconds, Mins./Days, ∞.
RIGHT: Log-log plot showing average daily article views versus article popularity rank. Observe the power-law distribution. A spammer could reach large percentages of viewership via few articles.

Backup slides (3)
ABOVE: Evaluating time-of-day (TOD) and day-of-week (DOW)
AROUND: Evaluating feature strength. All features here are “Wikipedia-driven”.