Understanding Botnet-driven Blog Spam: Motivations and Methods

Brandon Bevans
brandonbevans@gmail.com
California Polytechnic State University, United States of America

Bruce DeBruhl
brandonbevans@gmail.com
California Polytechnic State University, United States of America

Foaad Khosmood
brandonbevans@gmail.com
California Polytechnic State University, United States of America

Introduction

Spam, or unsolicited commercial communication, has evolved from telemarketing schemes to a highly sophisticated and profitable black-market business. Although many users are aware that email spam is prominent, they are less aware of blog spam (Thomason, 2007). Blog spam, also known as forum spam, is spam that is posted to a public or outward-facing website. Blog spam can be used to accomplish many of the tasks that email spam is used for, such as posting links to a malicious executable.

Blog spam can also serve some unique purposes. First, blog spam can influence purchasing decisions by featuring illegitimate advertisements or reviews. Second, blog spam can include content with target keywords designed to change the way a search engine identifies pages (Geerthik, 2013). Lastly, blog spam can contain link spam, which spams a URL on a victim page to increase the inserted URL’s search engine ranking. Overall, blog spam weakens search engines’ model of the Internet popularity distribution. Much academic and industrial effort has been spent to detect, filter, and deter spam (Dinh et al., 2015; Spirin and Han, 2012).

Less effort has been placed in understanding the underlying distribution mechanisms of spambots and botnets. One foundational study in characterizing blog spam (Niu et al., 2007) provided a quantitative analysis of blog spam in 2007. This study showed that blogs in 2007 included large amounts of spam, but it did not try to identify linked behavior that would imply botnet behavior. A later study on blog spam (Stringhini, 2015) explores using IPs and usernames to detect botnets but does not characterize the behavior of these botnets. In 2011, a research team (Stone-Gross et al., 2011) infiltrated a botnet, which allowed for observations of the logistics around botnet spam campaigns. Overall, our understanding of blog spam generated by botnets is still limited.

Related Work

Various projects have attempted to identify the mechanics, characteristics, and behavior of botnets that control spam. In one important study (Shin et al., 2011), researchers fully evaluated how one of the most popular spam automation programs, XRumer, operates. Another study explored the behavior of botnets across multiple spam campaigns (Thonnard and Dacier, 2011). Others (Pitsillidis et al., 2012) examined the impact that spam datasets had on characterization results. Lumezanu and Feamster (2012) explored the similarities between email spam and blog spam on Twitter. They show that over 50% of spam links from emails also appeared on Twitter.

Figure 1: Browser rendering of the ggjx honeypot

The underground ecosystem built around the botnet community has been explored (Stone-Gross et al., 2011). In a surprising result, over 95% of pharmaceuticals advertised in spam were handled by a small group of banks (Levchenko et al., 2011). Our work is similar in that we are trying to characterize the botnet ecosystem, focusing on the distribution and classification of certain spam-producing botnets.

Experimental Design

In order to classify linguistic similarities and differences in botnets, we implement 3 honeypots to gather samples of blog spam. We configure our honeypots identically using the Drupal content management system (CMS), as shown in Figure 1. Our honeypots are identical except for the content of their first post and their domain name. Ggjx.org is fashion themed, npcagent.com is sports themed, and gjams.com is pharmaceutical themed. We combine the data collected from Drupal with the Apache server logs (Apache, 2016) to allow for content analysis of data collected over 42 days. To allow botnets time to discover the honeypots, we activate the honeypots at least 6 weeks before data collection.

We generate three tables of content for each honeypot (Bevans and Khosmood, 2016). In the user table, we record the information the spambot enters while registering and user login statistics that we summarize in Table 1. This includes the user id, username, password, date of registration, registration IP, and number of logins. In the content table, we record the content of spam posts and comments, which we summarize in Table 2. This includes the blog node id, the author’s unique id, the date posted, the number of hits, type of post, title of the post, text of the post, links in the post, language of the post, and a taxonomy of the post from IBM’s Alchemy API.

Table 1: User table characteristics for three honeypots

Table 2: Characteristics for the content tables

Table 3: Characteristics of entities

Lastly, in the access table, we include data and metadata from the Apache logs. This includes the user id, the access IP, the URL, the HTTP request type, the node ID, and an action keyword describing the type of access.
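To make the three tables concrete, the following is a minimal sketch of how one row of each might be represented. The class and field names are assumptions chosen for illustration; the released corpus may name and type its columns differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserRecord:
    """One row of the user table (hypothetical field names)."""
    user_id: int
    username: str
    password: str
    registered: str          # date of registration, kept as text here
    registration_ip: str
    login_count: int


@dataclass
class ContentRecord:
    """One row of the content table (hypothetical field names)."""
    node_id: int
    author_id: int
    posted: str              # date posted, kept as text here
    hits: int
    post_type: str           # e.g. a post or a comment
    title: str
    text: str
    links: List[str] = field(default_factory=list)
    language: str = ""
    taxonomy: List[str] = field(default_factory=list)


@dataclass
class AccessRecord:
    """One row of the access table (hypothetical field names)."""
    user_id: int
    access_ip: str
    url: str
    http_method: str         # the HTTP request type, e.g. GET or POST
    node_id: int
    action: str              # keyword describing the type of access
```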

Our honeypots received a total of 1.1 million requests for ggjx, 481 thousand requests for gjams, and 591 thousand requests for npcagent.

Entity Reduction

It is widely accepted that spambot networks, or botnets, are responsible for most spam. Therefore, we algorithmically reduce spam instances into unique entities representing botnets. For each entity, we define 4 attributes: entity id, associated IPs, usernames, and associated user ids. To construct entities, we scan through the users and assign each one to an entity as follows.

1. For a user, if an entity exists which contains its username or IP, the user is added to the entity.

2. If more than one entity matches the above criteria, all matching entities are merged.

3. If no entity matches the above criteria, a new entity is created.
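The three rules above can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the code used in the study: the function name, the `(user_id, username, ip)` tuple layout, and the dictionary representation of an entity are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed input layout: (user_id, username, registration_ip)
User = Tuple[int, str, str]


def reduce_to_entities(users: Iterable[User]) -> List[Dict[str, object]]:
    """Group users into entities that share a username or an IP address."""
    entities: List[Dict[str, object]] = []
    for user_id, username, ip in users:
        # Rule 1: find every entity that already holds this username or IP.
        matches = [e for e in entities
                   if username in e["usernames"] or ip in e["ips"]]
        if matches:
            target = matches[0]
            # Rule 2: fold all other matching entities into the first one.
            for other in matches[1:]:
                for key in ("ips", "usernames", "user_ids"):
                    target[key] |= other[key]
            merged_away = {id(e) for e in matches[1:]}
            entities = [e for e in entities if id(e) not in merged_away]
        else:
            # Rule 3: no match, so start a new entity.
            target = {"ips": set(), "usernames": set(), "user_ids": set()}
            entities.append(target)
        target["ips"].add(ip)
        target["usernames"].add(username)
        target["user_ids"].add(user_id)
    # Assign entity ids only after all merging is finished.
    for entity_id, entity in enumerate(entities):
        entity["entity_id"] = entity_id
    return entities
```

The linear scan over entities keeps the sketch close to the wording of the rules. For hundreds of thousands of users, an index from username and IP to entity (or a union-find structure) would likely be needed to keep the run time manageable.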

We summarize the entity characteristics in Table 3. The maximum number of users in one entity is almost 38 thousand for ggjx with over 100 unique IP addresses. These results confirm what is expected: the vast majority of bots interacting with our honeypots are part of large botnets. This also allows us to perform content analysis exploring what linguistic qualities differentiate botnets.

Table 4: NLP feature sets we consider for our content analysis and their effectiveness at differentiating botnets

Content Analysis

To better understand botnets, we use natural language processing (Collobert and Weston, 2008) for analyzing the linguistic content of entities. For our analysis, we consider various feature sets as proxies for linguistic characteristics as summarized in Table 4. We use a Maximum Entropy classifier (MegaM, 2016) to test which features differentiate botnets. In order to test a feature, we train the classifier with 70% of the posts, randomly selected, from the N largest entities and test it with the remaining 30% of the posts. Our final results are the average of three runs.
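The evaluation protocol (a random 70/30 split, repeated three times and averaged) can be sketched as follows. The helper names are hypothetical, and the classifier shown is a trivial majority-label baseline standing in for the Maximum Entropy model, which is not reproduced here; only the splitting and averaging logic is meant to mirror the text.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Assumed layout: (post text, entity id used as the class label)
Post = Tuple[str, int]
Predictor = Callable[[str], int]


def split_posts(posts: Sequence[Post], train_fraction: float = 0.7,
                seed: int = 0) -> Tuple[List[Post], List[Post]]:
    """Shuffle the posts and cut them into training and test portions."""
    shuffled = list(posts)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def majority_baseline(train: Sequence[Post]) -> Predictor:
    """Stand-in classifier (NOT MaxEnt): always predicts the most common label."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: most_common_label


def mean_accuracy(posts: Sequence[Post],
                  fit: Callable[[Sequence[Post]], Predictor],
                  runs: int = 3) -> float:
    """Average test accuracy over several independent random splits."""
    accuracies = []
    for run in range(runs):
        train, test = split_posts(posts, seed=run)
        predict = fit(train)
        correct = sum(1 for text, label in test if predict(text) == label)
        accuracies.append(correct / len(test))
    return sum(accuracies) / runs
```

Any function that takes training posts and returns a predictor can be passed as `fit`, so a real classifier can be swapped in without changing the protocol.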

The first feature set we test is Bag of Words (BoW), which models the lexical content of posts. Put simply, each word in a document is put into a ‘bag’ and the syntactic structure is discarded. For implementation details, see our technical report (Bevans, 2016). In Figure 2, we show our analysis of the BoW feature set.
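As a rough illustration of the idea, a bag of words can be built by counting tokens and ignoring their order. The tokenization rule here (lowercasing and keeping runs of letters, digits, and apostrophes) is an assumption; the technical report may tokenize differently.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count each token in the text; word order is discarded."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)
```

For example, `bag_of_words("Cheap shoes, cheap bags!")` counts "cheap" twice and "shoes" and "bags" once each, regardless of where they appear in the post.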

When considering the top 5 contributing entities, the classification accuracy is less than 95% which implies that the lexical content of botnets varies greatly.

The second feature we consider is the taxonomy provided by IBM Watson’s Alchemy API. Alchemy’s output is a list of taxonomy labels and associated confidences. For the purpose of our analysis, we discard any low or non-confident labels. In Figure 3, we show our analysis of the Alchemy Taxonomy feature set, which highlights the accuracy of Alchemy’s taxonomy. We note that the Alchemy Taxonomy feature set is dramatically smaller in size than the BoW feature set while still providing high performance. This indicates a full lexical analysis is not necessary but a taxonomic approach is sufficient.

Our third feature is based on the links in the posts. To create the feature, we parse each post for any HTTP links and strip the link to its core domain name.
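One plausible way to build the link feature is sketched below. The regular expression, the function name, and the interpretation of "core domain name" as the lowercased host with any leading `www.` removed are assumptions made for illustration.

```python
import re
from typing import List
from urllib.parse import urlparse

# Assumed pattern: http(s) URLs up to whitespace, quotes, or angle brackets.
LINK_PATTERN = re.compile(r"https?://[^\s\"'<>)]+", re.IGNORECASE)


def link_domains(post_text: str) -> List[str]:
    """Return the host of every HTTP link in a post, without 'www.'."""
    domains = []
    for url in LINK_PATTERN.findall(post_text):
        try:
            host = urlparse(url).hostname  # lowercased, port removed
        except ValueError:
            continue  # skip malformed URLs such as broken IPv6 literals
        if not host:
            continue
        host = host.rstrip(".")
        if host.startswith("www."):
            host = host[len("www."):]
        domains.append(host)
    return domains
```

This keeps subdomains such as `shop.example.com` intact. Reducing hosts to a registrable domain would need a public suffix list, which this sketch does not attempt.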

The classifier with the link feature set had varied results, as shown in Table 5, where it was reliable in differentiating ggjx entities but less reliable for the other two honeypots. These results correlate with link scarcity from Table 2.

Figure 2

Figure 3

We test the normalized vocabulary size of a post as a feature. We derive this from the number of unique words divided by the total number of words in the post. As shown in Table 5, the vocabulary size does not differentiate botnets.
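The ratio described above is straightforward to compute. The sketch below assumes a simple lowercase tokenization and returns 0.0 for a post with no words; both choices are assumptions rather than details from the study.

```python
import re


def normalized_vocabulary_size(text: str) -> float:
    """Number of unique words divided by the total number of words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0  # assumed convention for empty posts
    return len(set(words)) / len(words)
```

For the post "buy cheap cheap shoes", there are 3 unique words out of 4, giving 0.75.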

We also form a feature set based on the part-of-speech (PoS) makeup of a post using the Stanford PoS Tagger. The Stanford PoS tagger returns a pair for each word in the text: the original word and its corresponding PoS. We create a BoW from this response, which gives an abstract representation of the document’s syntax. As shown in Table 5, the PoS does not differentiate botnets.
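A sketch of the PoS bag follows. It does not call the Stanford tagger; it assumes the tagger output has already been collected as (word, tag) pairs and that the bag counts tags only, which is one possible reading of the description. The sample pairs are hand-written for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def pos_bag(tagged_words: Iterable[Tuple[str, str]]) -> Counter:
    """Count part-of-speech tags, discarding the words themselves."""
    return Counter(tag for _word, tag in tagged_words)


# Hand-written example input using Penn Treebank style tags.
sample = [("cheap", "JJ"), ("shoes", "NNS"), ("sell", "VBP"),
          ("fast", "RB"), ("cheap", "JJ")]
sample_bag = pos_bag(sample)
```

Because the words are dropped, two posts with different vocabulary but the same grammatical shape map to the same feature vector.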

Table 5: Accuracies for various features when identifying 10 and 60 entities using the maximum entropy classifier

Conclusions

In this paper, we examine interesting characteristics of spam-generating botnets and release a novel corpus to the community. We find that hundreds of thousands of fake users are created by a small set of botnets and that far fewer of them actually post spam. The spam that is posted is highly correlated by subject language, to the point where botnets labeled by their network behavior are to a large degree rediscoverable using content classification (Figure 3).

While link and vocabulary analysis can be good differentiators of these botnets, it is the content labeling (provided by Alchemy) that is the best indicator. Our experiment only spans 42 days, thus it’s possible the subject specialization is a feature of the campaign rather than the botnet itself.

Bibliography

Apache virtual host. (2016). http://httpd.apache.org/docs/current/vhosts Accessed: 2016-08-10.

Bevans, B., and Khosmood, F. (2016). Forum Spam Corpus. http://users.csc.calpoly.edu/~foaad/bfbevans Accessed: 2017-04-01.

Bevans, B. (2016). “Categorizing Forum Spam.” Master's Theses at Cal Poly Digital Commons. http://digitalcommons.calpoly.edu/theses/1623 Accessed: 2017-04-01.

Collobert, R., and Weston, J. (2008). “A unified architecture for natural language processing: Deep neural networks with multitask learning.” Proceedings of the 25th International Conference on Machine Learning, ACM: 160–67.

Dinh, S. et al. (2015). “Spam campaign detection, analysis, and investigation.” Digital Investigation, (12) S12–S21.

Geerthik, S. (2013). “Survey on internet spam: Classification and analysis.” International Journal of Computer Technology and Applications, 4(3): 384.

Levchenko, K. et al. (2011). “Click trajectories: End-to-end analysis of the spam value chain.” Symposium on Security and Privacy, IEEE. 431–446.

Lumezanu, C. and Feamster, N. (2012). “Observing common spam in twitter and email.” Proceedings of the 2012 ACM conference on Internet measurement, ACM. 461–466.

Mega M. (2016). “Mega model optimization package.” https://www.umiacs.umd.edu/~hal/megam/, Accessed: 2016-08-10.

Niu, Y. et al. (2007). “A quantitative study of forum spamming using context-based analysis.” NDSS.

Pitsillidis, A. et al. (2012). “Taster’s choice: A comparative analysis of spam feeds.” Proceedings of the 2012 ACM conference on Internet measurement, ACM. 427–440.

Shin, Y., Gupta, M., and Myers, S. A. (2011). “The nuts and bolts of a forum spam automator.” LEET.

Spirin, N., and Han, J. (2012). “Survey on web spam detection: Principles and algorithms.” ACM SIGKDD Explorations Newsletter, 13(2): 50-64.

Stone-Gross, B., et al. (2011). “The underground economy of spam: A botmaster’s perspective of coordinating large-scale spam campaigns.” LEET, 11: 4.

Stringhini, G. (2015). “Evilcohort: Detecting communities of malicious accounts on online services.” 24th USENIX Security Symposium (USENIX Security 15), 563–578.

Thomason, A. (2007). “Blog spam: A review.” CEAS, Citeseer.

Thonnard O. and Dacier, M. (2011). “A strategic analysis of spam botnets operations.” Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, ACM, 162–171.
