DETECTING SPAMMERS AND CONTENT PROMOTERS IN ONLINE VIDEO SOCIAL NETWORKS
Fabrício Benevenuto, Tiago Rodrigues, Virgílio Almeida, Jussara Almeida, and Marcos Gonçalves
Computer Science Department, Federal University of Minas Gerais Belo Horizonte, Brazil
(SIGIR’09)
Speaker : Yi-Ling Tai
Date : 2009/09/28
OUTLINE
Introduction
User test collection
Analyzing user behavior attributes
Detecting spammers and promoters
  Evaluation metrics
  Experimental setup
  Classification
  Reducing the attribute set
Conclusions
INTRODUCTION
YouTube is the most popular online video social network.
It allows users to post a video as a response to a discussion topic.
These features open opportunities for users to introduce polluted content into the system.
Pollution is used to spread advertisements and generate sales, disseminate pornography, and compromise the system's reputation.
Users cannot easily identify the pollution before watching, which also consumes system resources, especially bandwidth.
This paper addresses the issue of detecting video spammers and promoters.
Spammers post an unrelated video as a response to a popular video topic to increase the likelihood of the response being viewed.
Promoters post a large number of responses to boost the rank of a video topic.
Toward this end, the authors:
crawl a large user data set from YouTube
manually classify users as legitimate users, spammers, or promoters
study attributes that distinguish the different types of polluters
use a supervised classification algorithm to detect spammers and promoters
USER TEST COLLECTION
A YouTube video is a responded video or a video topic if it has at least one video response.
A YouTube user is a responsive user if she has posted at least one video response.
A responded user is someone who posted at least one responded video.
Polluter is used to refer to either a spammer or a promoter.
CRAWLING YOUTUBE
User interactions can be represented by a video response user graph G = (X, Y).
X is the union of all users who posted or received video responses.
(x1, x2) is a directed arc in Y if user x1 has responded to a video contributed by user x2.
To build the graph, the authors implemented a crawler following Algorithm 1.
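The graph definition and crawl above can be sketched as follows. This is a minimal illustration, not the paper's actual crawler: the `responses` pairs and seed users are hypothetical, and the real crawler fetches pages from YouTube rather than walking an in-memory structure.

```python
from collections import defaultdict, deque

def build_graph(responses):
    """Build the directed video response user graph G = (X, Y):
    an arc (x1, x2) means user x1 posted a video response to a
    video contributed by user x2."""
    graph = defaultdict(set)
    for responder, responded in responses:
        graph[responder].add(responded)
        graph.setdefault(responded, set())  # receivers belong to X too
    return graph

def bfs_crawl(graph, seeds):
    """Breadth-first traversal from a seed set, mimicking how the
    crawler follows response links outward from the seed owners."""
    visited, queue, order = set(seeds), deque(seeds), []
    while queue:
        user = queue.popleft()
        order.append(user)
        for neighbor in sorted(graph[user]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order
```

In the paper the seed set is the 88 owners of the top-100 most responded videos; here any list of user identifiers works.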
The sampling starts from a set of 88 seeds, consisting of the owners of the top-100 most responded videos of all time.
The crawler follows links gathering information on a number of different attributes.
In total: 264,460 users, 381,616 responded videos, and 701,950 video responses.
BUILDING A TEST COLLECTION
The main goal is to study the patterns and characteristics of each class of users.
The collection should have the following properties:
a significant number of users of all three categories
large amounts of pollution, but not restricted to it
a large number of legitimate users with different behaviors
Random sampling may not achieve these properties.
Three strategies for user selection:
1. Different levels of interaction: users were divided into four groups based on their in- and out-degrees, and 100 users were randomly selected from each group.
2. Aiming at a test collection with polluters: the responses of the top 100 most responded videos were browsed, selecting suspect users.
3. To minimize a possible bias introduced by strategy 2, 300 users who posted video responses to the top 100 most responded videos were randomly selected.
Each selected user was then manually classified.
Three volunteers analyzed all video responses of each user to classify her into one of the three categories.
Volunteers were instructed to favor legitimate users.
ANALYZING USER BEHAVIOR ATTRIBUTES
Three attribute sets were considered.
Video attributes: duration, number of views, number of comments received, rating, number of times selected as favorite, number of honors, and number of external links.
These are computed over three video groups for each user: all videos uploaded by the user, her video responses, and the responded videos she responded to.
This sums up to 42 video attributes for each user.
User attributes: number of friends, number of videos uploaded, number of videos watched, number of videos added as favorite, numbers of video responses posted and received, numbers of subscriptions and subscribers, average time between video uploads, and maximum number of videos uploaded in 24 hours.
Social network attributes:
Clustering coefficient cc(i): the ratio of the number of existing edges between i's neighbors to the maximum possible number.
Betweenness
Reciprocity
Assortativity: the ratio between the node's (in/out-)degree and the average (in/out-)degree of its neighbors.
UserRank
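The clustering coefficient definition can be made concrete with a small sketch. This is a simplified undirected version (the paper works on the directed response graph), assuming adjacency is stored as a dict mapping each node to its set of neighbors:

```python
def clustering_coefficient(adj, i):
    """cc(i): ratio of existing edges among i's neighbors to the
    maximum possible number, k*(k-1)/2 for k neighbors (undirected)."""
    neighbors = list(adj.get(i, ()))
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than 2 neighbors: no possible edges
    possible = k * (k - 1) / 2
    existing = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if neighbors[b] in adj.get(neighbors[a], ())
    )
    return existing / possible
```

For a user u with neighbors {a, b, c} where only a and b are connected to each other, cc(u) = 1/3.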
Two well-known feature selection methods were used to rank the attributes: information gain and chi-squared (χ2).
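As an illustration of one of these ranking criteria, here is a minimal information-gain computation for a discrete attribute. This is a textbook sketch, not the authors' implementation; continuous attributes would first need discretization:

```python
import math

def entropy(labels):
    """Shannon entropy H(Y) of a label list, in bits."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - sum_v P(X = v) * H(Y | X = v).
    Higher gain means the attribute better separates the classes."""
    total = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

An attribute that perfectly splits spammers from legitimate users has gain equal to the class entropy; an uninformative one has gain 0.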
EVALUATION METRICS
The standard information retrieval metrics are used: precision, recall, Micro-F1, and Macro-F1.
Micro-F1: first compute global precision and recall values across all classes, then calculate F1.
Macro-F1: first calculate F1 values for each class in isolation, then average over all classes.
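The two averaging schemes can be computed directly from a confusion matrix (the counts in the test below are made up for illustration, not the paper's results):

```python
def f1_scores(confusion):
    """Micro- and Macro-F1 from confusion[true_class][predicted_class].
    Micro: pool TP/FP/FN over all classes, then one F1.
    Macro: per-class F1, then the unweighted average."""
    classes = list(confusion)
    micro_tp = micro_fp = micro_fn = 0
    per_class_f1 = []
    for c in classes:
        tp = confusion[c][c]
        fn = sum(confusion[c][p] for p in classes if p != c)  # missed c
        fp = sum(confusion[t][c] for t in classes if t != c)  # wrongly c
        micro_tp, micro_fp, micro_fn = micro_tp + tp, micro_fp + fp, micro_fn + fn
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class_f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    p = micro_tp / (micro_tp + micro_fp)
    r = micro_tp / (micro_tp + micro_fn)
    return 2 * p * r / (p + r), sum(per_class_f1) / len(per_class_f1)
```

Note that for single-label multi-class problems, Micro-F1 equals overall accuracy, while Macro-F1 gives the small spammer class equal weight, which is why the two numbers differ in the results below.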
confusion matrix
EXPERIMENTAL SETUP
libSVM, an open-source SVM package, was used. It allows searching for the best classifier parameters using the training data and provides a series of optimizations, including normalization of all numerical attributes.
5-fold cross-validation was used, repeated 5 times with different seeds to shuffle the original data set, producing 25 different results for each test.
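The experimental protocol (5 folds × 5 shuffling seeds = 25 results) can be sketched as split generation. This is an illustrative stand-in, not libSVM's own tooling:

```python
import random

def repeated_kfold_indices(n, k=5, repeats=5):
    """Yield (train, test) index lists for k-fold cross-validation
    repeated `repeats` times, each repeat shuffling the n samples
    with a different seed; yields k * repeats splits in total."""
    for seed in range(repeats):
        indices = list(range(n))
        random.Random(seed).shuffle(indices)
        folds = [indices[i::k] for i in range(k)]  # k near-equal folds
        for i in range(k):
            test = folds[i]
            train = [idx for j, f in enumerate(folds) if j != i for idx in f]
            yield train, test
```

Each of the 25 splits would train one SVM and score it, and the reported metrics would be averaged over all splits.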
TWO CLASSIFICATION STRATEGIES
Flat classification: promoters (P), spammers (S), and legitimate users (L) in a single multi-class step.
Hierarchical strategy: first separate promoters (P) from non-promoters (NP); then split promoters into heavy (HP) and light (LP), and non-promoters into legitimate users (L) and spammers (S).
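The hierarchical strategy amounts to a chain of binary decisions. A sketch with placeholder predicates standing in for the three binary SVM classifiers:

```python
def hierarchical_classify(user, is_promoter, is_heavy, is_spammer):
    """Hierarchical strategy: first promoters vs. non-promoters,
    then heavy vs. light among promoters, and spammers vs.
    legitimate among non-promoters. The three predicates are
    hypothetical stand-ins for trained binary classifiers."""
    if is_promoter(user):
        return "heavy promoter" if is_heavy(user) else "light promoter"
    return "spammer" if is_spammer(user) else "legitimate"
```

Each stage can be tuned independently (e.g., via the J parameter below), which is the flexibility the paper highlights over the flat approach.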
FLAT CLASSIFICATION
confusion matrix obtained
The numbers presented are percentages relative to the total number of users in each class.
The diagonal indicates the recall of each class. No promoter was classified as a legitimate user. The 3.87% of promoters that were misclassified had videos that actually acquired popularity, making them harder to distinguish from spammers.
41.91% of spammers were classified as legitimate users: legitimate users also post their video responses to popular responded videos (a typical behavior of spammers), so the two behaviors overlap.
Micro-F1 = 87.5, with per-class F1 values of 90.8, 63.7, and 92.3.
Macro-F1 = 82.2
HIERARCHICAL CLASSIFICATION
Binary classification with the J parameter: one can give priority to one class (e.g., spammers) over the other (e.g., legitimate users).
Promoters vs. non-promoters: Macro-F1 = 93.44, Micro-F1 = 99.17.
NON-PROMOTERS
The classifier was retrained with the original training data without promoters, with J = 1 as the baseline.
J = 0.1 yields 24% vs. 1%, while J = 3.0 yields 71% vs. 9%.
The best solution depends on the system administrator’s objectives.
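The effect of J can be understood through a cost-sensitive decision rule: weighting spammer errors j times more than legitimate-user errors lowers the score threshold for flagging, trading more caught spammers against more false alarms. This toy thresholding illustrates the tradeoff; it is not libSVM's actual mechanism, and `score` is a hypothetical posterior-like classifier output in [0, 1]:

```python
def decide(score, j):
    """Cost-sensitive rule: flag 'spammer' iff score * j exceeds
    (1 - score), i.e. score > 1 / (1 + j). With j = 1 the threshold
    is the usual 0.5; j > 1 favors catching spammers, j < 1 protects
    legitimate users."""
    threshold = 1.0 / (1.0 + j)
    return "spammer" if score > threshold else "legitimate"
```

A user scoring 0.4 is cleared at j = 1 but flagged at j = 3, mirroring how raising J in the paper catches more spammers at the cost of misclassifying more legitimate users.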
HEAVY AND LIGHT PROMOTERS
Aggressiveness: the maximum number of video responses posted in a 24-hour period.
The k-means clustering algorithm was used to separate promoters into two clusters.
Average aggressiveness: light promoters = 15.78 (CV = 0.63); heavy promoters = 107.54 (CV = 0.61).
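Splitting promoters on the one-dimensional aggressiveness attribute can be sketched with a minimal k-means (k = 2). The initialization and the values in the test are illustrative, not the paper's data:

```python
def kmeans_1d(values, iters=20):
    """Two-cluster k-means on a 1-D attribute (here: max video
    responses in 24h), as used to separate light from heavy
    promoters. Initializes centers at the extremes for simplicity."""
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iters):
        clusters = [[], []]
        for v in values:
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[idx].append(v)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return clusters, centers
```

On a bimodal attribute like aggressiveness, the two recovered centers correspond to the light and heavy promoter averages reported above.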
Binary classification, retrained with the original training data containing only promoters.
REDUCING THE ATTRIBUTE SET
Two scenarios: (1) attributes considered in decreasing order of their position in the χ2 ranking; (2) classification evaluated with subsets of 10 attributes occupying contiguous positions in the ranking.
CONCLUSIONS
An effective solution to detect spammers and promoters in online video social networks.
The flat classification approach provides an alternative to simply considering all users as legitimate.
The hierarchical approach explores different classification tradeoffs and provides more flexibility for the application.
Finally, we can produce significant benefits with only a small subset of less expensive attributes.
Since spammers and promoters will evolve and adapt to anti-pollution strategies, periodic reassessment of the classification process may be necessary.