SPOTTING FAKE RETWEETING ACTIVITY IN TWITTER
Maria Giatsoglou¹, Despoina Chatzakou¹, Neil Shah², Alex Beutel², Christos Faloutsos², Athena Vakali¹
¹ Informatics Department, Aristotle University of Thessaloniki, Greece
² School of Computer Science, Carnegie Mellon University, USA
Content in Twitter
• Great topic diversity
• Varying attention levels (#views, #RTs, #favorites)
04/18/23 2
Retweet Fraud: overview
• Users typically retweet a post because of its quality / interestingness and its author’s influence / popularity
• The number of retweets serves as an indicator of a post’s popularity
Retweet fraud: falsely creating the impression of popularity by artificially generating a high volume of retweets
• Twitter estimates that 14% of user accounts are bots and 5% are spam bots; the problem is probably much bigger
• Such content is vacuous, spammy or malicious, and detracts from the credibility of Twitter content and from users’ experience
Retweet Fraud: dimensions
• Accounts of varying automation level (bots, humans, semi-automated)
• Mixed honest and fake retweets for the same post
• Promiscuous vs. subtle fraudsters: based on the ratio of fraudulent to honest(-like) activity
[Figure: examples. Labels: occasional retweet buyer; professional content / user promoter; honest humans; paid human; bots]
Complex problem with multiple dimensions
• What features tell fake from genuine reactions?
• How do they relate to the targeted problems?
RTSCOPE addresses these issues
Hypotheses and problems addressed
There are distinctive patterns in retweet fraud in terms of
H1. the timing of retweets (use of automation tools)
H2. the accounts that retweet (fraudsters acting in lockstep)
H3. the connectivity of retweeters (bot networks, “camouflage”)
Retweet-thread level problem
Given: the ith tweet of user u; its induced retweet activity (user IDs & timestamps)
Identify: if the activity is organic or not.
User level problem
Given: a user u; a set of tweets of user u; their induced retweet activity
Identify: if u is a spammer.
promiscuous fraudsters
cautious fraudsters
Background
• User u: a given Twitter account (can be honest OR fraudster)
• Tweet tw_{u,i}: the ith post of user u
• Retweet thread: all re-posts of a tweet
[Figure: timeline of user u’s tweets tw_{u,1}, tw_{u,2}, tw_{u,3}, tw_{u,4} posted at times t1, t2, t3, t4; retweet thread R_{u,1} of tw_{u,1} with retweeters Alex, Mary, Peter, Debbie, Tim; “R” network (of R_{u,1}) with nodes Alex, Mary, Peter, Debbie]
Introducing the RTSCOPE approach
• RTSCOPE: series of tests for spotting fraudsters with varying behaviors
Maria Giatsoglou, Despoina Chatzakou, Neil Shah, Christos Faloutsos, and Athena Vakali. Retweeting Activity on Twitter: Signs of Deception. In PAKDD 2015.
Connectivity: TRIANGLES pattern
[Figure: honest “R” network vs. fraudulent “R” network; axis label: degree2]
Connectivity: DEGREES pattern
[Figure: degree distributions of an honest “R” network and a fraudulent “R” network; annotations: power-law, spike at 30]
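The two connectivity patterns rest on per-node degrees and triangle counts. The slides do not spell out how the “R” network is built, so the following is only a minimal sketch that assumes an undirected graph among retweeters given as an edge list; the function name `degrees_and_triangles` and the toy edges are hypothetical.

```python
from itertools import combinations


def degrees_and_triangles(edges):
    """Return (degree, triangles) dicts for an undirected graph given as (u, v) pairs."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    degree = {node: len(adj) for node, adj in neighbors.items()}
    # A triangle at a node is a pair of its neighbors that are also connected.
    triangles = {
        node: sum(1 for a, b in combinations(adj, 2) if b in neighbors[a])
        for node, adj in neighbors.items()
    }
    return degree, triangles


# Toy network (assumed data): a triangle among three retweeters plus one pendant node.
toy_edges = [("Alex", "Mary"), ("Mary", "Peter"), ("Peter", "Alex"), ("Peter", "Debbie")]
degree, triangles = degrees_and_triangles(toy_edges)
```

Plotting the two dictionaries against each other, or a histogram of `degree`, would give the kind of view the slides refer to.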
Activity Summarization: Features
• Temporal & popularity features per retweet thread
  ratio of activated followers: author’s followers who retweeted
  response time: time between the tweet’s posting and first retweet
  lifespan: time between first and last retweet (constrained to 1 month)
  Arr-IQR: inter-quartile range of inter-arrival times for retweets
Log-log pairwise feature scatterplots of retweet threads reveal dense microclusters for fraudsters
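As a rough illustration, the four features above could be computed from one retweet thread as follows. This is a sketch under assumptions: timestamps are plain numbers (e.g. seconds), the author's followers are available as a set, and the function name `rtscope_features` and the input layout are hypothetical. The 1-month lifespan constraint is not applied here.

```python
from statistics import quantiles


def rtscope_features(posted_at, retweets, author_followers):
    """retweets: list of (user_id, timestamp); author_followers: set of user ids."""
    if not retweets:
        raise ValueError("a retweet thread needs at least one retweet")
    times = sorted(t for _, t in retweets)
    retweeters = {u for u, _ in retweets}
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # statistics.quantiles needs at least two data points.
    if len(gaps) >= 2:
        q1, _, q3 = quantiles(gaps, n=4)
    else:
        q1 = q3 = 0.0
    activated = len(retweeters & author_followers)
    return {
        "activated_ratio": activated / len(author_followers) if author_followers else 0.0,
        "response_time": times[0] - posted_at,
        "lifespan": times[-1] - times[0],
        "arr_iqr": q3 - q1,
    }


# Toy thread (assumed data): three followers and one non-follower retweet.
features = rtscope_features(
    posted_at=0,
    retweets=[("a", 5), ("b", 15), ("c", 25), ("x", 35)],
    author_followers={"a", "b", "c", "d"},
)
```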
Activity Summarization: Patterns
ENTHUSIASM: High infection probability for followers of fraudsters
MACHINE-GUN: Fraudsters retweet all at once/with similar time delay
REPETITION: Fake retweet threads form microclusters due to similar response time, Arr-IQR, activated followers ratio
[Figure: feature scatterplots illustrating ENTHUSIASM and MACHINE-GUN; legend: Popular++, Popular+, Popular, Fraudulent users]
Retweeters activation: Disparity
• Given the posts of user ui, what is the distribution of retweets across retweeters?
• Disparity reveals whether retweeting activity spreads homogeneously over retweeters or is skewed towards a few dedicated users.
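Disparity for u_i and a retweet thread size of k, with r_{i,j} the number of retweets made by the jth retweeter:
$$Y(k, i) = \sum_{j=1}^{k} \left( \frac{r_{i,j}}{\sum_{l=1}^{k} r_{i,l}} \right)^{2}$$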
[Figure: user u_i with 100 posts and k = 5 retweeters (Alex, Mary, Peter, Debbie, Tim): r_{i,1} = 100, r_{i,2} = 2, r_{i,3} = 2, r_{i,4} = 1, r_{i,5} = 1]
Disparity: Intuition
[Figure: user u_i with 100 posts and k = 5 retweeters (bot1, bot2, bot3, bot4, bot5): r_{i,1} = r_{i,2} = r_{i,3} = r_{i,4} = r_{i,5} = 100]
Skewed retweet activity:
$$Y(5, i) = \frac{100^2 + 4^2 + 2^2 + 1 + 1}{108^2} \approx 0.85$$
Homogeneous retweet activity:
$$Y(5, i) = \frac{100^2 + 100^2 + 100^2 + 100^2 + 100^2}{500^2} = \frac{1}{5}$$
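The disparity computation can be sketched in a few lines. This assumes disparity is the sum of squared retweet shares over the k retweeters, which is what the worked numbers on this slide suggest; the function name `disparity` and the skewed counts are illustrative only.

```python
def disparity(retweet_counts):
    """Sum of squared shares of retweet activity across retweeters."""
    total = sum(retweet_counts)
    if total == 0:
        raise ValueError("at least one retweet is required")
    return sum((r / total) ** 2 for r in retweet_counts)


# Five retweeters with equal activity: disparity equals 1/k.
homogeneous = disparity([100, 100, 100, 100, 100])
# One dominant retweeter (illustrative counts): disparity approaches 1.
skewed = disparity([100, 4, 2, 1, 1])
```

With k retweeters the value ranges from 1/k (perfectly even activity) to 1 (a single retweeter accounts for everything), so it can be compared across users with the same k.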
FAVORITISM & HOMOGENEITY patterns
Disparity of a Zipf distribution (proof in paper)
FAVORITISM. Participation of honest users in retweets follows a Zipf law.
HOMOGENEITY. Participation of fraudulent users in retweets is homogeneous.
[Figure: disparity plot with regions labeled homogeneity, favoritism, super-skewed favoritism]
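To get a feel for how the two patterns separate, one can compare disparity under homogeneous participation with disparity under Zipf-like participation numerically. This is only an illustration, not the proof from the paper: it assumes participation proportional to 1/rank (Zipf exponent 1, a hypothetical choice) and the sum-of-squared-shares form of disparity.

```python
def zipf_disparity(k, exponent=1.0):
    """Disparity when the jth retweeter's activity is proportional to 1 / j**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, k + 1)]
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)


def homogeneous_disparity(k):
    """Disparity when all k retweeters are equally active."""
    return 1.0 / k


for k in (1, 5, 50, 500):
    print(k, round(zipf_disparity(k), 4), round(homogeneous_disparity(k), 4))
```

For every k above 1 the Zipf value stays above 1/k, which is the gap the FAVORITISM and HOMOGENEITY patterns rely on.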
Findings
• Patterns: we discovered several patterns for spotting retweet fraud
• All tests
  • are content independent
  • can catch more sophisticated fraudsters
  • are language independent
• But: is there a golden number of tests for flagging fraudsters?
Synchronization Fraud
• Group of unnaturally synchronized events/entities
• Collective / group anomaly
• e.g. retweets, Facebook likes, subgraphs, image subregions
[Figure: one tweet got 3K retweets in 10 minutes (Alex, Mary, Peter, Debbie, Tim, … 3000 times). SUSPICIOUS? Not necessarily.]
[Figure: several tweets each retweeted within 10 minutes by the same accounts (Alex, Mary, Peter, Debbie, Tim, John, … 3000 times). SUSPICIOUS? Probably!]
Our goals
• Given: N groups of entities; a representation for each entity in a p-dimensional space;
• Identify: groups of entities abnormally synchronized in some feature subspaces.
G1. Design a general, effective approach for collective anomaly detection
G2. Customize it for Retweet Fraud detection
G3. Find features that help distinguish fraudsters from honest users
Background: Measuring group strangeness
average closeness
Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, Shiqiang Yang. CatchSync: Catching Synchronized Behavior in Large Directed Graphs. KDD 2014.
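The slide names only “average closeness”. A minimal sketch of one way to compute such a score is given below, assuming that closeness between two entities is 1 when they fall into the same cell of a regular grid over the feature space and 0 otherwise, and that self-pairs are counted. The grid, the `cell_size` parameter, and the function name `synchronicity` are assumptions for illustration, not details taken from the slides.

```python
import math
from collections import Counter


def synchronicity(points, cell_size=1.0):
    """Average closeness over all ordered pairs of points (self-pairs included)."""
    if not points:
        raise ValueError("a group needs at least one entity")
    cells = Counter(
        tuple(math.floor(x / cell_size) for x in point) for point in points
    )
    n = len(points)
    # Closeness is 1 for a pair sharing a grid cell, else 0, so a cell
    # holding c points contributes c * c ordered pairs.
    return sum(c * c for c in cells.values()) / (n * n)


# Toy groups (assumed data): four entities in one cell vs. four in separate cells.
clustered = synchronicity([(0.1, 0.2), (0.3, 0.4), (0.5, 0.6), (0.7, 0.8)])
scattered = synchronicity([(0.5, 0.5), (1.5, 1.5), (2.5, 2.5), (3.5, 3.5)])
```

Under these assumptions a group whose entities all land in one cell scores 1, and a group of n entities spread over n cells scores 1/n.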
Background: Robust outlier detection
• ROBPCA-AO: robust dimensionality reduction approach; finds outlying points
• Suitable for multivariate, high-dimensional data; independent of the features’ distribution; non-deterministic
1. Finds the “best” k-D space to project the data, based on a subset of points
2. Detects outliers based on two distance scores: orthogonal distance and robust score distance
M. Hubert, P.J. Rousseeuw, T. Verdonck. Robust PCA for skewed data and its outlier map. Comput. Stat. Dat. An., 53 (2009), 2264-2274.
Problem definition
SYNCFRAUD Problem.
• Given: a set of groups of entities G, with a variable number of entities e_{m,i} for each group g_m; p features for the entities’ representation,
• Extract: a set of features at the group level, and
• Identify: suspicious groups S with highly synchronized characteristics.
RTFRAUD Problem.
groups of entities → users; entities → retweet threads; suspicious groups → RTFraudsters
ND-SYNC pipeline
Given N groups of p-D entities and I iterations
Do
1. Feature subspace sweeping;
2. Group scoring;
3. Multivariate outlier detection;
Extract suspicious groups
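Steps 1 and 2 of the pipeline can be sketched as follows. The sketch assumes that sweeping means enumerating all feature subspaces of a fixed dimensionality (2 by default) and that each group receives one score per subspace, giving a score vector per group. The function names, the `dims` parameter, and the toy scoring function `repeat_share` are hypothetical stand-ins, not the scoring used by ND-SYNC.

```python
from collections import Counter
from itertools import combinations


def sweep_subspaces(num_features, dims=2):
    """Enumerate all feature-index subsets of the given dimensionality."""
    return list(combinations(range(num_features), dims))


def score_groups(groups, score_fn, dims=2):
    """groups: dict group_id -> list of p-D tuples. Returns (subspaces, score vectors)."""
    num_features = len(next(iter(groups.values()))[0])
    subspaces = sweep_subspaces(num_features, dims)
    vectors = {}
    for group_id, entities in groups.items():
        # Project every entity onto each subspace and score the projected group.
        vectors[group_id] = [
            score_fn([tuple(entity[d] for d in subspace) for entity in entities])
            for subspace in subspaces
        ]
    return subspaces, vectors


def repeat_share(points):
    """Toy score: share of entities equal to the most common projected point."""
    return max(Counter(points).values()) / len(points)


# Toy input (assumed data): two groups of three 3-D entities each.
groups = {
    "user_a": [(1, 2, 3), (1, 2, 3), (1, 2, 9)],
    "user_b": [(1, 5, 7), (2, 6, 8), (3, 4, 9)],
}
subspaces, vectors = score_groups(groups, repeat_share)
```

The resulting score vectors are what a multivariate outlier detector would consume in step 3.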
ND-SYNC: Multivariate outlier detection
Aim: given the suspiciousness score vectors, identify the suspicious groups
1. Apply ROBPCA-AO for I iterations and find outliers
2. Flag a group as suspicious based on majority vote over all iterations.
• To eliminate parameters
  • automatic selection of dimensionality k via the 95% cumulative variance explained heuristic
  • use of all entities for estimating the robust feature subspaces
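The two parameter-free choices above, the 95% cumulative variance rule and the majority vote, are simple enough to sketch directly. The outlier detector itself (ROBPCA-AO) is not implemented here; the sketch assumes its per-iteration output is available as a set of flagged group ids and that eigenvalues of the projected data are given. Function names and inputs are hypothetical.

```python
from collections import Counter


def choose_k(eigenvalues, threshold=0.95):
    """Smallest k whose leading eigenvalues explain at least `threshold` of the variance."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


def majority_vote(flags_per_iteration):
    """Keep the groups flagged as outliers in more than half of the iterations."""
    runs = len(flags_per_iteration)
    counts = Counter(g for flags in flags_per_iteration for g in flags)
    return {g for g, c in counts.items() if c > runs / 2}


# Toy inputs (assumed data): three detector runs over groups "a", "b", "c".
suspicious = majority_vote([{"a", "b"}, {"a"}, {"a", "c"}])
k = choose_k([5.0, 3.0, 1.0, 1.0])
```

Voting over several runs is a natural fit for a non-deterministic detector, since a group flagged in only one run is less likely to be a stable outlier.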
Features for retweet threads
Retweets: number of retweets
Response time: time from the tweet’s posting to the first retweet
Lifespan: time from the first to the last (observed) retweet, constrained to 3 weeks
RT-Q3 response time: time from the tweet’s posting until the first ¾ of retweets
RT-Q2 response time: time from the tweet’s posting until the first ½ of retweets
Arr-MAD: mean absolute deviation of RTs’ inter-arrival times
Arr-IQR: inter-quartile range of RTs’ inter-arrival times
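The quantile-based response times and Arr-MAD can be sketched as below. The assumptions: timestamps are plain numbers, “the first ½ (or ¾) of retweets” is read as the time at which that fraction of the thread's retweets has arrived, and the deviation is taken around the mean inter-arrival time. Function names are hypothetical.

```python
import math
from statistics import mean


def quantile_response_time(posted_at, retweet_times, fraction):
    """Time from posting until `fraction` of the thread's retweets have arrived."""
    if not retweet_times:
        raise ValueError("a retweet thread needs at least one retweet")
    times = sorted(retweet_times)
    needed = max(1, math.ceil(fraction * len(times)))
    return times[needed - 1] - posted_at


def arrival_mad(retweet_times):
    """Mean absolute deviation of inter-arrival times around their mean."""
    times = sorted(retweet_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if not gaps:
        return 0.0
    center = mean(gaps)
    return mean(abs(gap - center) for gap in gaps)


# Toy thread (assumed data): three quick retweets and one late one.
posted_at = 0
retweet_times = [10, 20, 30, 100]
rt_q2 = quantile_response_time(posted_at, retweet_times, 0.5)
rt_q3 = quantile_response_time(posted_at, retweet_times, 0.75)
arr_mad = arrival_mad(retweet_times)
```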
Microclusters of fraudulent retweet threads
[Figure: 2D feature subspaces showing high synchronicity for RTFraudsters]
Dataset generation
• Selection of target users (both honest and fraudulent)
  • users with the most retweeted tweets and heavy use of spammy keywords (casino, buy, followback, etc.) in a 2-day Twitter sample
  • active (frequent tweets) and popular (> 100 retweets) users (http://twittercounter.com/)
  • topic experts (European affairs and Automobile), based on Twitter lists
• Target users tracked for 2-6 months (all tweets & their retweets)
• Pruned “unpopular” users (all retweet threads with < 50 retweets, or fewer than 20 retweet threads)
Dataset overview
Type        #Retweet threads   #Retweets
honest      83,587             2,939,455
fraudulent  50,435             8,787,803
BOTH        134,022            11,727,258
User categorization
• 28 fraudulent: tweets with spammy links and terms, repetitive promotions; fabricated profiles
• 278 honest
(Available at http://oswinds.csd.auth.gr/project/NDSYNC)
ND-SYNC effectiveness & robustness
• Highly accurate and robust to the selection of k
• Best performance at k = 6 (selected with the 95% cumulative variance explained criterion)
• Only 1% decrease in F1-score using just 2D feature subspaces
97% accuracy, 0.82 F1-score
Detected outliers
[Figure: detected outliers. Labels: professional promoters; promiscuous; news media account (×2); politician. Annotation: 65 retweet threads in 4 months, 80% > 1k retweets, 60% > 10k retweets]
Conclusions
G1. Design a general, effective approach for collective anomaly detection
ND-SYNC is a general, effective pipeline that automatically detects group anomalies
G2. Customize it for Retweet Fraud detection
Carefully designed set of features for the retweet fraud case
G3. Find features that help distinguish fraudsters from honest users
ND-SYNC achieves 97% accuracy in distinguishing fraudulent from honest users on real Twitter data