D4.1 Techniques and tools for OSINT-based threat...

ProjectDeliverable

D4.1Techniquesandtoolsfor

OSINT-basedthreatanalysisProjectNumber 700692ProjectTitle DiSIEM–Diversity-enhancementsforSIEMsProgramme H2020-DS-04-2015Deliverabletype ReportDisseminationlevel PUSubmissiondate August31st,2017Resubmissiondate May31st,2018Responsiblepartner FCiências.IDEditor PedroM.FerreiraRevision 2.0

TheDiSIEMproject has received funding from theEuropeanUnion’sHorizon2020researchandinnovationprogrammeundergrantagreementNo700692.

22

D4.1

EditorPedroFerreira,FCiências.IDContributorsPedroFerreira,FCiências.IDAlyssonBessani,FCiências.IDFernandoAlves,FCiências.IDEuniceBranco,FCiências.IDAnaRespício,FCiências.IDJoãoAlves,FCiências.IDSusanaGonzalez,AtosMarioFaiella,AtosGustavoGonzalez,AtosAbdullahiAdamu,DigitalMRIlirGashi,CityVersionHistoryVersion Author Description1.0 FCiências.ID,

Atos, DigitalMR,City

Firstversionofthedocument,submittedtotheEC

2.0 FCiências.ID Corrections of the minor problems requested bythe reviewers in the first period review. Inparticular, we removed the list of tweetercybersecurity-relatedaccountsfollowedbyprojectpartners to avoid possible privacy conflicts.SubmittedtoEC.

33

D4.1

ExecutiveSummaryThisreportpresentsanin-depthanalysisofsecurity-relatedOSINTdatasourcesand how the information from these sources can be extracted, including adescriptionof toolsandmethodsthatcanbeemployedforthis.Relevantopen-sourceandpaidtools,aswellasservicesavailable,areidentifiedanddescribed.A complete list of OSINT sources selected by the DiSIEM industrial partners,currentlybeingcollected intheproject, isalsoprovided. Itsmainpurpose is tocreate realistic case-studies that enable a sound evaluationof the technologiesbeingdeveloped.Thedeliverablealsodescribes initialworkonthemodelsandtechniquesthatcanbeusedtoprocessOSINTdataforpredictingthreatsagainstagivenorganisation’sITinfrastructure.Additionally, the techniques that can be used to express and share gatheredOSINTinastandardizedwayarealsoreviewed.Finally,thedeliverableendsbyproposing an architecture for infrastructure-awareOSINT integration, i.e., howto integrate relevantOSINTwith events from the infrastructure and provide arelatedthreatscore.Overall,themainresultsofthisdeliverableare:

• AtaxonomyofthetypesofOSINTsourcesconsideredontheproject;• Thevarioussourcesinusebythepartnersoftheproject;• A listof relevant tools and services that can be used for the processing

andanalysisofOSINT• Adescriptionofthestateoftheartusingsecurity-relatedOSINT;• The OSINT processing tools being developed on the project, and their

objectives;• HowtointegrateOSINTeventsintheSIEMs.

44

D4.1

TableofContents1 Introduction....................................................................................................................................71.1 OrganizationoftheDocument.....................................................................................8

2 OSINTDataSources....................................................................................................................92.1 Typesofsources.................................................................................................................92.1.1 Structureddatasources.............................................................................................92.1.2 Unstructureddatasources.....................................................................................102.1.3 Darkweb.........................................................................................................................11

2.2 OSINTextractiontools...................................................................................................112.3 DatasourcesconsideredinDiSIEM.........................................................................12

3 TechniquesandtoolsforOSINTanalysis.......................................................................173.1 Relatedwork......................................................................................................................173.1.1 Collectinginfrastructure-specificOSINT.........................................................173.1.2 OSINTcollectionandextractionmethodologies..........................................183.1.3 CorrelateuserbehaviourwithOSINTforsecurity.....................................203.1.4 FeedprotectionsystemswithOSINT................................................................213.1.5 GatherexploitdatafromOSINT...........................................................................223.1.6 Black-listedIPs.............................................................................................................223.1.7 Others...............................................................................................................................23

3.2 Existingtools......................................................................................................................243.2.1 Generalpurposeopensourcetools....................................................................243.2.2 OSINTpaidtools..........................................................................................................243.2.3 OSINTpaidservices...................................................................................................25

4 PreliminaryresultsonOSINTprocessing......................................................................274.1 BlacklistedIPsOSINTprocessing.............................................................................274.1.1 TrustworthyBlacklistinSIEMsystems............................................................274.1.2 IPsCollector...................................................................................................................284.1.3 TrustAssessment........................................................................................................284.1.4 TrustworthyAssessmentBlacklistsInterface...............................................29

4.2 Infrastructure-relatedOSINTprocessing.............................................................294.2.1 ExploratoryMachineLearningApproaches..................................................294.2.2 DigitalMRListening247platform........................................................................33

5 Context-awareOSINTintegration......................................................................................395.1 ThreatIntelligenceDataInterchangeFormats..................................................395.1.1 STIX....................................................................................................................................405.1.2 TAXII..................................................................................................................................44

5.2 IntegratingOSINTdatawithinfrastructureevents.........................................465.2.1 Architectureproposal...............................................................................................47

5.3 Context-awareThreatScore.......................................................................................485.3.1 Heuristics-basedthreatscore...............................................................................485.3.2 ThreatScoreMethodology.....................................................................................505.3.3 Preliminaryanalysisofheuristicsfeatures....................................................51

6 SummaryandConclusions....................................................................................................55References...............................................................................................................................................56ListofAcronyms...................................................................................................................................62AppendixA–OSINTsources..........................................................................................................63

55

D4.1

ListofFiguresFigure1-DigitalMRdatasources................................................................................................13Figure2-WorkflowoftheIPblacklistprocessingframework.....................................28Figure3-ProposedTwitterclassifierarchitecture............................................................31Figure 4 - Architecture of the deep learning approach to the classification of

tweets...............................................................................................................................................32Figure5-Deepneuralnetworkarchitectureforclassificationoftweets(adapted

from[KIM14])..............................................................................................................................33Figure6-DigitalMROSINTthreatpredictorproposal......................................................35Figure7-Noisefilteringstep.........................................................................................................37Figure8-RelevantdatagetsanalysedbythemachinelearningandNLPtools...37Figure9-AnLSTMNeuralNetworkCell..................................................................................38Figure10-context-awareOSINTDataArchitecture..........................................................48

66

D4.1

ListofTablesTable1-TaxonomyofOSINTsources.........................................................................................9Table2-ThevariousOSINTsourcesusedintheprojectandanexampleofeach.

.............................................................................................................................................................16Table3 -Examplesof generalpurposeopen-source tools that canbeextended

forOSINTprocessingandanalysis....................................................................................25Table4-Examplesofpaidsecurityfeeds/services............................................................26Table5-CVSSv3Ratings.................................................................................................................49Table6-IndividualThreatScore.................................................................................................49Table 7 - The OSINT sources used by the partners of the project, divided by

category..........................................................................................................................................68

77

D4.1

1 Introduction Cybersecurity is a matter of growing concern as cyber-attacks cause loss ofincome, sensitive information leaks, and even vital infrastructures to fail. Toproperly protect an infrastructure, a security analyst must have timelyinformationaboutsecuritythreatstotheITinfrastructureandthelatestnewsinterms of updates, patches, mitigation measures, vulnerabilities, attacks, andexploits.Therearetwomajorwaysofobtainingsecuritynewsfeeds.Oneistopurchaseacurated feed from a specialized company such as SenseCy1or SurfWatch.2Another is to collect Open Source Intelligence (OSINT) from various sourcesavailableontheInternet.Insummary,OSINTisinformationpubliclyavailableonthe news and on the web. Examples of cybersecurity-related OSINT feeds areCiscoSecurityAdvisory3andThreatpost.4Collecting and processing OSINT is becoming a fundamental approach forobtainingcybersecuritythreatawareness.Recently,theresearchcommunityhasdemonstratedthatmanydifferent typesofuseful informationandIndicatorsofCompromise(IoC)canbeobtainedfromOSINT[LIA16,SAB15,ZHU16].Besidestheseresearchorientedefforts,allSecurityOperationCentres(SOC)analyststryto be updated about possible threats against the IT infrastructure of theirorganizations by following cybersecurity OSINT. Nevertheless, skimmingthrough various news feeds is a time-consuming task for any security analyst.Furthermore, an analyst is not guaranteed to find news relevant to the ITinfrastructurehe/sheoversees.Therefore,toolsarerequirednotonlytocollectOSINT,butalsotoprocessittofilteronlytherelevantpartsfortheSOCanalysts,thusdecreasingtheamountofinformationandconsequently thetimerequiredtoanalyseitandactuponit.Whenappropriate,thefilteredinformationmustbefurtherprocessedtoextractIoCs.ThisreportprovidesataxonomyofOSINTsourcesreadilyavailableandgivesacomprehensivelistofOSINTdatasourcesthatarecurrentlybeingconsideredinthe scope of DiSIEM. A review on existing techniques and tools available forOSINTprocessingisgiveninthedocument,aswellaspreliminaryresultsontheinfrastructure-relatedOSINTprocessingapproachesbeingfollowedinDiSIEM.TheabilitytocollectandprocessOSINTisoftennotenough.Threatintelligencemustbeexpressedand then sharedusing specific standards, allowing involvedparties to speed up processing and analysis phases of received information,achievinginteroperabilityamongthem.Additionally,thegatheredOSINTshouldbeintegratedwitheventsoriginatingwithintheorganisation’sITinfrastructureandgivena threatscore indicating itsseverity.Thisdocumentalsodiscussesa

1 https://www.sensecy.com/2 https://www.surfwatchlabs.com/threat-intelligence-products/threat-analyst3 https://tools.cisco.com/security/center/psirtrss20/CiscoSecurityAdvisory.xml4 https://threatpost.com/feed/

88

D4.1

standarddesignedtotransmitOSINTdatathatwillbeusedtocreatedataflowsamongthesoftwarecomponentsdesignedintheproject,andtotheSIEMs.The final contributions presented in this document are a proposed systemarchitectureandthreatscoreforcontext-awareOSINTintegration.

1.1 Organization of the Document Chapter2isdevotedtopresentingthevarioustypesofOSINTthatareavailableontheInternetaswellascommonextractionandstoragetoolsthatcanbeused.Thesectionendsbypresentinga listofOSINTsourcesthatarebeingcollectedfor the development of DiSIEM tools. Chapter 3 presents related work ontechniques and tools for OSINT analysis and existing tools for that purpose.PreliminaryresultsonOSINTprocessingapproachesthatarebeing followed inDiSIEMaregiveninChapter4.Then,integrationofsecurity-relatedOSINTwithsecurityeventsfromtheorganisationITinfrastructureisapproachedinChapter5. Finally, Chapter 6 presents a summary of the work and draws someconclusions.

99

D4.1

2 OSINT Data Sources

2.1 Types of sources In this section we describe the data sources used when gathering security-relatedOSINT.Table1presentsataxonomyclassifyingsourcesasstructuredandunstructured, keeping a separate class for the dark web even though it isconsideredanunstructuredsource.Thetablepresentsexamplesofeachsourcetype,aswellasthetechnologiesrequiredtocollectdatafromthosesources.Thethreemajorclassesare:

Structured data sources: Resources that provide structured data, in a well-defined format. The data obtained from these sources comes in a machineparsableformat.

Unstructured data sources: Feeds that provide unstructured datawhere themain content is in free text format.Although thisdata type requires furtherprocessing, feeds in text format (such as news posts) are typically moreinformationrich.

Darkweb:The “dark sideof the Internet", aplaceknown forhacker sitesandforums, and exploit marketplaces. Both are rich information sources formaliciousactivity,mostlyunstructured.

Table1-TaxonomyofOSINTsources.

2.1.1 Structured data sources Organizeddatasourcespresent informationthat ismachine-parsable,andthuscan be directly fed to a machine. These sources usually provide an API forprogrammaticaccesstotheircontent.Vulnerability/exploit sources. Organized sources describing vulnerabilitiesand/or exploits related only to threats that have been confirmed. In case ofvulnerability databases, each vulnerability is described using several numericfields(suchasseverity),andatextdescription.Organizeddatasetsprovidethemostreliableinformationsincetheircontenthasbeenofficiallyconfirmed.Thisreliabilitycomesataprice;usually,thereisatime

Structureddatasources

Unstructureddatasources

Darkweb

Examples -IPwhitelists/blacklists-CVE

-Newssites-Twitter

-Forums-Marketplaces

Requiredtechnologies

-Feed/webscraper-Parser

-Feed/webscraper-NaturalLanguageProcessing(NLP)tools-Machinelearningtechniques

-Darkwebaccess-Darkwebscraper-NLPtools-Machinelearningtechniques

1010

D4.1

lapse between the detection of a vulnerability and its presence in this type ofdatabase.Twoof themost important structuredvulnerabilitydatabasesare theNationalVulnerability Database (NVD) 5 and Common Vulnerabilities and Exposures(CVE).6Others include the ExploitDatabase7andVulners.8TheNVDbelongs tothe U.S. government and describes checklists, security related software flaws,misconfigurations, product names, and impact metrics. The CVE provides astructureddatabaseforpubliclyknowninformation-securityvulnerabilitiesandexposures.Thevulnerabilitiesstoredtherearedescribedinvariouscomponents,aswellasreferencestothevulnerabilities.IPandrulessources.Inthecaseofblacklistsorwhitelists,orsetsofrules(e.g.,firewallrules)thedataisavailableintextfileswithanIP/ruleperline.Eachlinecanbefeddirectlytothecorrespondingsoftware.TherearemanysourcesofIPlistsandrules,suchastheonespresentedonAppendixA.

2.1.2 Unstructured data sources Unstructuredsourcesprovidetextdatadescribingeventsof allsorts, includingsecurityones.Blogsandnewsmaycontainmoreinformation(e.g.,aquickfixtoavulnerability), but pose a hard challenge for automated processing sinceextracting concepts from free text is still aNaturalLanguageProcessing (NLP)challenge.Therefore,asappealingastheymaybe,usingthemasOSINTsourcesisfarfromtrivial.Nevertheless,someauthorsshowitispossibletocollectdatafromtechnicalblogpostsandscientificliterature,sincetechnicalwritingtendstohaveastablestructureandmuchlessambiguitywhencomparingtoothertypesofwriting[LIA16,ZHU16].One of the unstructured data sources used inDiSIEM is Twitter,9amicro-blogservicewhereuserscanpublishtextandmediacontent.Tweetstendtoprovideconcise information due to the 140-character limit. Therefore, tweets areattractive for publishing quick status updates; news sites, bloggers, and otherfeeds post tweets containing the post’s title to increase the visibility of thecontent they produce. Twitter is a popular feed since a quick review of tweettitles provides an overview of current news and trends. Tweets are alsoattractive for automated processing, as small concisemessages are simpler toprocessthanlargetexts.

5 https://nvd.nist.gov/6 https://cve.mitre.org/7 https://www.exploit-db.com/8 https://vulners.com/#help9 https://twitter.com/

1111

D4.1

2.1.3 Dark web Accessibleonlyusinganonymitytools(e.g.,TORnetwork10),thedarkweboffersanonymitytotheusersaccessingitandtotheserviceshostedonit.Therefore,itis the ideal place for buying, selling, and discussing all types of illegalcommoditiesandservices.This isalsotrue forbotnets,exploits,virusesandallkindsofmaliciousITservices.The dark web is a known place where exploit discussion and developmenthappens. Collecting information about threats during their development phaseoraboutthreatsforsalewhichhavenotbeenusedyetisextremelyvaluable,asitallows defenders to act before the attackers. In fact, this approach has beensuccessfullyundertakenbyNunesetal.[NUN16],whoobtaineddataonzerodayvulnerabilitiesondarkwebmarketplacesandhackerforums.

2.2 OSINT extraction tools In termsofcollecting information freely fromthevarioussocialmediasources,the state-of-the-art uses crawlers in conjunction with parsers to extractinformation from the web pages of blogs, forums, marketplaces, and otherrelevant sites [NUN16, KER15, JEN16]. Some sources of data such as specificsecurity websites or blogs will require gathering data using a custom-builtcrawlerinconjunctionwithaparser.ForotherdatasourcessuchasTwitter,Instagram,ornewsfeeds,companieslikeDigitalMR(amemberoftheDiSIEMconsortium),whohaveexperiencecollectingOSINTfromstructuredandunstructuredsources,canprovidehistoricaldata.Real-time access and historical data from the complete feed of most socialnetworks iscommerciallyavailable fromproviderssuchasGnip11orDataSift.12ThefreealternativeconsistsinusingAPIsprovidedbythesocialmedianetworksto access information, although usually there is no access to the full streamofdata.DigitalMR’s Listening247 platform is a social media monitoring and analyticsplatform that is being used in DiSIEM to collect and filter relevant security-related OSINT. Listening247 provides access to social media networks data,blogs,forumsorweb-sites,bothhistoricandpresent.Thisdatacanbeaccessedwith carefully formed queries with specific keywords which populatesDigitalMR’s Elasticsearch13database with relevant data from all the varioussources. ConsideringDiSIEM, therewill be a need for specialized querieswithspecific keywords for the project, which will yield relevant data to theinfrastructurethatpartnersareinterestedinprotecting.

10https://www.torproject.org/11https://gnip.com/12http://datasift.com/13https://www.elastic.co/products/elasticsearch

1212

D4.1

Forming these queries requires careful attention and understanding of thedomain,somethingthathasbeenperfectedatDigitalMR.Justusingaquerywitha keyword for ‘Windows’ (i.e., the operating system) will yield data about‘windows’ (i.e., for buildings), and other more abstract uses of the word.Scraped data using the custom-built crawlers can be done at intervals thatoverlapwiththeintervalsofthedatafromotherListening247OSINTsourcestoallow for aggregating the data from the two data gathering pipelines.Elasticsearch can be used as a storage for the data from both pipelines.Therearepre-processingandnoisefilteringstepsthatwillbecarriedoutontherawdataaswellontheListening247platform’scustompipelineforthisproject.In the pre-processing step, in compliance with the ethics advice from theadvisory board, an additional step will involve anonymizing the data to stripaway any information that might violate the privacy of the users of theseplatforms. In the noise filtering step, a noise model will be created based onannotatedtrainingdatatofilteroutirrelevantdata.ArelatedworkbyNunesetal. [NUN16] also included a noise filtering step which used a classifier forfilteringoutirrelevantdata.Thereisahugeamountofdataontheinternetandreducingthedatatoonlytherelevant,savesbothtimeandcostneededtostoreandprocessirrelevantdata.Datawillbeaggregatedbytimestamp,sothatdatawithinthesametimeintervalends up in the same timeslice. These slices of data can be built by employingcloudservicessuchasAWSElasticMapReduce,14orlocalprocessingframeworkssuch asApache Spark15whichwill aggregate the various forms of data for thenext stage of processing. This not only gives the data a context of what ishappening in thevarious sourcesofOSINTdata sharing the same timeslice, italso gives a context of how all the content is changing over time. Otherapproaches that aggregate over some relevant property of data could also beexploredbefore implementation, ifnecessary.Forexample, laggingsomeof thedatasourcesthatgeneratedatafaster(e.g.,Twitter)sothatthesameinformationisnotsplitupindifferenttimeslices.

2.3 Data sources considered in DiSIEM Besidessecurity-relatedmachine-parsableinformationsuchasIPblackorwhitelistsorfirewallrules,andtweetsfromspecificsecurity-relatedtwitteraccounts,DigitalMR’sListening247platformwillbeusedinDiSIEMtogatherinformationfrom social media sources and unstructured sources such as blogs, forums orweb-sites.Listening247hasbeenusedsuccessfullyformarketresearchandcanbetailoredforcyber-securitypurposes.Regardingtheinformationcontainedindatabases like NVD or CVE, DiSIEM uses the vepRisk tool [AND17],16whichextracts,parsesandstoresdatafromtheseandotherpublicrepositories.14http://docs.aws.amazon.com/ElasticMapReduce/latest/API/Welcome.html15https://spark.apache.org/16http://veprisk.city.ac.uk/main/

1313

D4.1

Generically,DigitalMRcancollectdataofvarioustypesincludingtextandimagesfrom a variety of sources which range from blogs, social networks, news,darknet,boards/foraandotheropenlyavailabledataontheInternetformarketresearch (see Figure 1). The data is unstructured, and has different velocitiesdepending on the source. An article from the dailymail17highlighted somestatisticsfromInternetlivestats.com[LIB16],showingthatthereareabout7,620tweets per second, 790 photos uploaded to Instagram per second, and 1,259posts to Tumblr per second. For other sources like boards, news, and blogs,whichareusuallylongerinlength,thissortofvelocityisunlikely.

Figure1-DigitalMRdatasources.

Thesearetypically taggedwith informationpertainingtorelevance,sentiment,andemotion, formarket research. Itmayalsobe taggedwith information thatclassifies thesedata in termsofa taxonomywhichcanbeusedtoorganizethedata in a hierarchical format by their topic. For example, a tweet about twopeopledrinkingPepsiwhilewatchingafootballmatchwillbeclassifiedtobean‘occasion’, and specifically a ‘sport’ occasion. Essentially, this is a form ofhierarchicalclusteringwhichcanalsobedoneforcyber-threats.Interestingly,thereisalreadyataxonomyofthetypesofcyberthreatsdevelopedby ENISA (European Agency for Network and Information Security), includingassetexposureandvulnerabilityexploitation[CEB10,MAR16].Suchinformationcouldalsobetaggedwiththetrainingdatameantforthecyberthreatpredictionmodelwhichwillmakethereportsofcyberthreatsmorespecific.Thetagswillalso allow the cyber threat predictor to learn and predict future events withmore specificity. Furthermore, another advantage of tagging in relation to ataxonomyisthatitcanalsobeusedtoidentifyspecifictrends,e.g.seasonswhensomethreatsaremoreprominentthanothers,andanyotherrelatedtrends.

17http://dailym.ai/28YNsq9

1414

D4.1

Forinformationrelatedtovulnerabilities,exploits,andpatches,thevepRisktoolisused inDiSIEM.The toolhasbackendmodules thatmine, extract,parseandstore data from public repositories of vulnerabilities, exploits and patches.vepRiskservesasaknowledgebaseforpublicsecuritydataandprovidesawebinterface for analysing and visualizing the underlying data. It providesfunctionality for analysing relationships between the different security riskfactorsinpublicsecuritydata.Currentlysixdifferentvulnerabilitydatasourcesare considered: NVD, Security database,18CVE, CVE Details,19Security focus,20and CXSECURITY.21Additionally, vepRisk collects data from various vendorpatch sources (e.g., Microsoft, Debian, SUSE Linux, Cisco) and exploits fromExploitdatabase.22TheindustrialpartnersofDiSIEMthatoperateSIEMplatformsprovidedalistofOSINT sources that their security analysts regularlymonitor to receive eventsrelevanttotheirprotectedinfrastructure.Thesesourcesarebeingcontinuouslycollected to form sufficiently large representative data sets that enableresearching efficient OSINT processing and analysis technologies. During theproject execution, the list of sources may be updated according to therequirementsofthetoolsbeingdeveloped.AcomprehensivelistwithallOSINTsourcesbeingcollectedispresentedinAppendixA.

18https://www.security-database.com/19http://www.cvedetails.com/20http://www.securityfocus.com/21https://cxsecurity.com/22https://www.exploit-db.com/

Source Example

Twitter

https://twitter.com/threatmeter/status/887390382094516229@threatmeter:“Vuln: RETIRED: Linux Kernel 'saa7164-bus.c' Local Privilege Escalation Vulnerability http://ift.tt/2tcvTsM”

Newssites

DarkReading:http://www.darkreading.com/cloud/zero-day-exploit-surfaces-that-may-affect-millions-of-iot-users/d/d-id/1329380?“A zero-day vulnerability dubbed Devil's Ivy is discovered in a widely used third-party toolkit called gSOAP. Millions of IoT devices relying on widely used third-party toolkit gSOAP could face a zero-day attack, security firm Senrio disclosed Tuesday, which dubbed the vulnerability Devil's Ivy. <...>”

Expertblogs

SchneieronSecurity:https://www.schneier.com/blog/archives/2017/07/forged_document_1.html

1515

D4.1

“Forged Documents and Microsoft Fonts A set of documents in Pakistan were detected as forgeries because their fonts were not in circulation at the time the documents were dated.”

Securityvendorblogs

FireEyeThreatResearchBloghttps://www.fireeye.com/blog/threat-research/2017/05/cyber-espionage-apt32.html“Cyber Espionage is Alive and Well: APT32 and the Threat to GlobalCorporations<...>”

IPsforwhitelists

Bambenekconsultinghttp://osint.bambenekconsulting.com/feeds/c2-ipmasterlist.txt“5.101.153.16,IP used by banjori C&C,2017-07-1818:05, http://osint.bambenekconsulting.com/manual/banjori.txt 23.104.241.95,IP used by banjori C&C,2017-07-18 18:05, http://osint.bambenekconsulting.com/manual/banjori.txt <...>”

IPsforblacklists

abuse.chZeuSTracker:https://zeustracker.abuse.ch/blocklist.php?download=domainblocklist“039b1ee.netsolhost.com 03a6b7a.netsolhost.com 03a6f57.netsolhost.com <...>”

Domains/botnets

abuse.chRansomwareTrackerhttp://ransomwaretracker.abuse.ch/downloads/RW_DOMBL.txt“25z5g623wpqpdwis.onion.to 27c73bq66y4xqoh7.dorfact.at <...>”

Snort/Suricatarules

Emergingthreatshttp://rules.emergingthreats.net/blockrules/emerging-botcc.portgrouped.suricata.rules“alert tcp $HOME_NET any -> 50.116.1.225 22 (msg:"ET CNC Shadowserver Reported CnC Server Port 22 Group 1"; flow:to_server,established; flags:S; reference:url,doc.emergingthreats.net/bin/view/Main/BotCC; reference:url,www.shadowserver.org; threshold: type limit, track by_src, seconds 360, count 1; classtype:trojan-activity; flowbits:set,ET.Evil; flowbits:set,ET.BotccIP; sid:2405000; rev:4687;) <...>”

1616

D4.1

Table2-ThevariousOSINTsourcesusedintheprojectandanexampleofeach.

BroNoexampleavailable

Firewallrules

Emergingthreatshttp://rules.emergingthreats.net/fwrules/emerging-IPTABLES-DROP.rules“$IPTABLES -N ETBLOCKLIST $IPTABLES -I FORWARD 1 -j ETBLOCKLIST $IPTABLES -I INPUT 1 -j ETBLOCKLIST <...>”

Malware

Virustotalhttps://www.virustotal.com/en/file/3232fb8c336d280b8552a0f796a3b7e6ef2a67b603d9716a7053b17596e8b24c/analysis/“SHA256: 3232fb8c336d280b8552a0f796a3b7e6ef2a67b603d9716a7053b17596e8b24c File name: microsoft.visualbasic.dll Detection ratio: 0 / 64 Analysis date: 2017-07-28 17:49:37 UTC ( 53 minutes ago ) <...>”

IPreputation

TheCINSScorehttp://cinsscore.com/list/ci-badguys.txt“1.1.198.38 1.9.13.156 1.9.135.197 <...>”

Yararules

Yara-Ruleshttps://github.com/Yara-Rules/rules/blob/master/CVE_Rules/CVE-2010-0887.yar“rule JavaDeploymentToolkit { meta: ref = "CVE-2010-0887"(…)”

1717

D4.1

3 Techniques and tools for OSINT analysis

3.1 Related work In this chapterwepresentanoverviewof researchwork thatusesOSINT inasecuritycontext,divided intosevensectionsaccordinglyto themainobjectivesoftheseworks.

3.1.1 Collecting infrastructure-specific OSINT MostworkdescribedinthissectionusesTwitterastheOSINTdatasource,andfollowsthesamegeneralprinciple.First,theyobtainfromtheuserakeywordsetwhich is then used to select tweets containing one or more keywords. Thisapproachsetsaprimaryfilterforgatheringonlypossiblyrelevantcontentfortheuser.Then, another technique isused to classify the tweetsas relevantornot:Ritter et al.[RIT15] compare a few machine learning techniques, withExpectation-Maximization(EM)[MOO96]obtainingthebestresults;Mittaletal.[MIT16] use Naive Bayes[ZAK14], while Correia et al.[COR16] use SupportVector Machines (SVM) [ZAK14], and Santos et al. [SAN13] use plain-textsearches(ApacheLucene23)foracluster-likeapproach.Eachoftheseapproachespresentuniqueelements.Santosandco-workersfiltertweetsaccording to the following criteria:written inEnglish, correctly formed,andcontainingURLsofwebsitesfocusedonsecuritynews.Then,thetweetsareclusteredusing a specific similaritymeasure that considers the tweet size andthenumberof equalwords.Toavoidpresenting spammessagesas relevant, acluster is considered relevant only if the tweets contained were posted by asignificant number (10) of different users. Their results are evaluated in twoaspects:1)isitpossibletoremovespammessagesfromthelegitimatecontent?2) is itpossible to select the security tweetswithmost relevance?Theappliedtechniquesreducedtheamountofspammessagesby22%onaverage,andofthemessagespresentedassecurityrelevant,61.3%wereselectedcorrectly.Insteadofcollectingtweetsbykeyword,Correiaetal.gatheredtweetsonlyfromsecurity accounts to reduce the amount of non-security related tweets. Thecollected tweetswere filteredby thekeywordset, and thenmanually labelled;about 10 thousand tweets were manually classified. In this work two featureextraction methods were compared: TF-IDF (Term Frequency – InverseDocumentFrequency)[ZAK14]andword2vec[W2V].ThetweetswereclassifiedusinganSVMclassifier.Correiaetal.achievedhightruepositiverates(around90%)withlowfalsepositiveandfalsenegativerates.The base of Mittal et al.’s [MIT16] work is a knowledge base, created usingsecurityconcepts.Further,theyuseexternalontologiesforworddisambiguation(e.g.,applereferstothefruitorthecompany).KeyconceptsfromthetweetsareextractedthroughaspecificNamedEntityRecognizer.Theconceptsarequeried

23 https://lucene.apache.org/core/

1818

D4.1

to the knowledge base, which reasons about the importance of the tweetaccording to thekeywordsetprovidedby theuser.Thisapproach’s evaluationseemsinadequatelysmall,usingonly250tweetsfromthe10.004collected.Outof those250,60%were correctly identifiedby theknowledgebase,34%werecompletelyincorrect,andtheremainderwepartiallycorrect.Ritteretal.[RIT15]describeshowtouseasmallnumberofsampleswithanEMclassifiertoavoidmanualclassification.TheEMmodelbeginswithtentotwentypositivesamplesandnonegativesamples.Ritterdemonstrates thatbytrainingtheEMonlywithpositiveeventsheachievesbetterresults.Also,sinceEMdoesnot require a large training corpus, it is simple to train various EM classifiersusingadifferentseedforeach.Thisapproachwasevaluatedusing200manuallylabelled samples from the training corpus. EM achieves better results than theother tested machine learning approaches, although it shows a difficultcompromisebetweenprecisionandrecall;EMpresentshighprecisionratesbutlow recall (∼90% -∼30%), and as the recall rate increases the precision ratedecreases(∼50%-∼50%).Chang et al. [CHA16] show it is possible to improve Ritter et al.’s work usingneuralnetworks.Their architecture consistsofwordembeddingstomodel thetweets and Long Short TermMemoryNetworksto classify them. Chang et al.’sapproach managed to increase Ritter et al.’s results in about 10% in bothprecisionandrecall.Del Esposte et al. [ESP16] propose a recommender model to select OSINTrelevanttotheuser’spreferences.Nevertheless,thisworkwasevaluatedusingamoviedatabaseandpresentsnosuggestionsforpossibleOSINTsourcestouse.The framework receives a keyword set from the user to find candidateinformationofinteresttotheuser.Theinformationselectedbytheframeworkispresented to be rated by the user. The ratings are used to train arecommendation model, which iteratively learns the user’s preferences andtunesthemodel.Usingthisapproach,DelEsposteetal.achievedaprecisionrateofabout70%,andontwodifferenttests,recallratesof∼20%and∼7%.

3.1.2 OSINT collection and extraction methodologies Another line of work presents methodologies for gathering and processingOSINT, which could then be applied in more specific contexts. Nunes et al.[NUN16] crawl some deep web’s hacker forums and marketplaces. Thisapproachgoesdirectlytoamajormalicioususercommunity,wherestate-of-the-art attacks and the latest discovered vulnerabilities can be found. Informationgatheredtherecouldprovideimportantforewarningsandenoughtimetopatchvulnerable software.Nuneset al.’s approachbeginsby collectingwebpagesofvulnerability marketplaces and hacker forums. These pages are processed toobtain the textual contents discussed. The collected text is fed to an SVM thatclassifies it assecurityrelevantornot, obtainingaprecisionandrecallof85%and87%,respectively.

1919

D4.1

Mulwad et al.[MUL11],Neri et al. [NER09], andMcNeil et al.[MCN13]processfreetextinsearchforsecurityconcepts.Mulwadetal.’sframeworkreceivestextsnippets extracted from theweb (e.g., blogs, news),which are classified by anSVMascontainingsecuritytermsornot.Thesnippetsclassifiedasrelevantareprocessedbyaknowledgebasethatextracts therelevantconcepts,suchas themeans of attack and the target of the attack. The extracted concepts areconverted to the machine-readable OWL language format. This work wasevaluated using NVD text excerpts describing vulnerabilities, testing if theframeworkcouldobtain the correct concepts from those texts.The frameworkidentified71%ascontainingsecurityconcepts,andtheconceptsdetectedwerecorrectforroughly90%ofthecases.Instead of presenting results, Neri et al. focus on describing the variouscorrelation capabilities of their framework, which is widely used bygovernmental entities in Italy. The framework is composed of the followingelements:

• A crawler that selectively collects documents from the Internet anddatabases;

• Alexicalsystemthatdetectsrelevantconceptsandtherelationbetweenthoseconcepts;

• Asearchenginetoquerythecollectedknowledge;• And a classification system that processes the query results, obtaining

relationsbetweentheresults,andassigningthemthemes.

McNeil et al. present a novel approach to detect cyber-security concepts fromfree text, called PACE. As described by McNeil et al., the typical bootstrapalgorithmhasasetofseedsthatareusedtosearchforpatternsinthetexttobeprocessed, i.e., the algorithm has a set of initial sentences that are used as apattern to find similar sentences. These algorithms require a large amount ofseedstoobtainhighrecallpercentages.Besideshavingseedpatterns,PACEalsohasseedsofpairs(entity,context).Thesepairsprovideflexibilitytothepatternmatchingprocess,asamatchcanbefoundbyeitheraseedpatternorbyaseedcontext. This is especially useful for contextswhere the terminology can varygreatly(e.g.,application,software,programallrefertothesameconcept).PACEis evaluated using seeds manually extracted from ten cyber-security newsarticles, and by extracting entities from another seven. PACE obtained a highprecisionscore(90%),butlowrecall(12%).Nevertheless,theauthorscomparedPACEtothepreviousstate-of-the-artmethod,whichobtainedzeroresultsonthesamedataset.Erkal et al. [ERK15] also search for security concepts but use Twitter as datasource. To avoid manually labelling a large dataset for supervised machinelearning,theycollecttweetsfromaccountsfocusedonsecuritynewsforpositivesamples, and tweets from generalist accounts (e.g., health, news) for negativesamples. The tweets are processed using TF-IDF and classified as securityrelevant or not using Naive Bayes. Their approach is evaluated using cross-validationonthecollecteddataset,wherethe“percentageofcorrectdecision"is70%.

2020

D4.1

Jonesetal.[JON15]createda frameworkforextractingconcepts fromfreetext.Theyuseabootstrapalgorithmtoextractentitiesfromtextusingpatterns.Theiralgorithmisbasedontherelation(subjectentity,predicaterelation,objectentity)to extract concepts such as (Microsoft, is vendorof, InternetExplorer) from thesentence“MicrosofthasreleasedafixforacriticalbugthataffecteditsInternetExplorerbrowser."Anovelelementisinvolvingtheuserinthelearningphaseofthebootstrapalgorithm.Whenthealgorithmfindsanewpatterntobeincludedinitssetofpatterns,theuserisqueriedforthecorrectnessofthenewpattern.To evaluate thiswork, the algorithm is trainedwith seeds originated from 62securitynewsposts;then,recall(24%)iscalculatedbyrunningthealgorithmononemanuallylabellednewspost,andprecision(82%)iscalculatedbymanuallyverifyingthecorrectnessoftheentitiesextractedfromacorpusof41documents.Liao et al.[LIA16] developed a framework for extracting IoC from scientificliterature.IoCcanbeofmanyformatsandareusedtodescribevariousaspectsofanattack,suchasthevectorandthedamagecaused.LiaofocusesonextractingIoC from technical literature since it possesses a more predictable structure,enablinghighrecallinthisprocess.ThetextisprocessedbyacomplexpipelinecomposedofNLPprocessingtools.ThetermsareextractedandconvertedtotheOpenIoC (addressed in Section 5) format, which can then be processed byautomatictools.Liaoetal.’stoolpresentsaprecisionof98%andrecallof93%.Alqathanietal.[ALQ16]usetheApacheMAVENsoftwarerepository24andNVDintheirwork.MAVENisanopensourcesoftwarerepository(primarilyforJava)thatsimplifiesdependencyusageandsoftwarecompiling;thelibrariespublishedthere can be added asdependencies of anyproject using a simplemechanism.Alqathani’s objective is to search the NVD for vulnerabilities in MAVEN’slibraries. Then, following theMAVENdependency tree they can identifywhichprojects have those vulnerabilities. This work is evaluated by correctlyidentifying vulnerable projects. This approach’s precision sits at roughly 90%,withanimpressiverecallrateof100%.

3.1.3 Correlate user behaviour with OSINT for security Inthissectionwedescriberesearchworkthatprocessesuserbehaviourtoinfermalicious activity or vulnerabilities. Liu et al.[LIU15] try to predict if aninfrastructure is vulnerable based on its observable behaviour and systemconfiguration. First, they gather a set of system configurations; these includeDNS, SMTP, and certificate management. Then, they gather observablebehaviours of the system; they set upmonitors outside the infrastructure andcollectoutgoingcommunications.Theoutboundtrafficisanalysedinsearchforspammessages,phishingattempts,botnettraffic,andscantraffic.Thesedataarecompared against a ground truth of three databases: the Veris CommunityDatabase,25theHackmageddon,26andtheWebHackingIncidentsDatabase,27all

24 https://maven.apache.org/ 25 http://veriscommunity.net/index.html 26 http://hackmageddon.com/

2121

D4.1

containingdescriptionsofrealattacks.Thedatabases’descriptionof thevictimand the means of attack are compared against the two descriptions collectedfrom the user’s system. Using a Random Forest classifier, they were able topredict if the infrastructure is vulnerable,with 88%of true positives and only4%offalsepositives.Miller et al.[MIL11] also use behaviour descriptors but for examining userbehaviour insocialnetworks.Milleretal. buildagraphconnectingthevariouselementsof a socialnetworkbasedon their interactionsandbehaviouraldata.Through thegraph,Milleret al.wereable todiscover threatnetworks, i.e., if asubset of elements in a social network present danger. Miller et al. test theirapproach using a dataset containing online social interactions, including theinteractionsofaterroristgroupplanninganattack.Theirframeworkwastestedusing different parameters, and is able to show 100% precision rates whilepresentingarecallof~60%.

3.1.4 Feed protection systems with OSINT Inthissection,wedescriberesearchworkthatgathersOSINTandtransformsitinto a machine-readable format. This information can be fed to protectionsystems,suchasIDSsoranti-viruses.Mathews et al.[MAT12] and More et al.[MOR12] have the same objective:providing to an Intrusion Detection System information from traditional andnon-traditional sources. As traditional sources they consider network data,sensors, and logs. Non-traditional data is comprised of information collectedfromonlinesourcessuchasblogposts,newsfeedsortheNVD(i.e.,OSINT).Mathews et al. present a framework composed of an ontology, an IDS and a“TrafficFlowClassifier"(TFC).Theontologyhasthreeclasses:means(theattackvector), consequences (the attack outcomes), and targets (the target system).The ontology is fed with OSINT data, gathered from various structured andunstructuredsources.FromOSINT,theyextractthreeconceptscorrespondingtothe ontology classes, which are used to update the ontology’s reasoningcapabilities. The TFC componentmonitors the packets’ headers to infer if thetraffic is legitimate or malicious based on traffic flow and used ports. TheontologyusesasetofrulesandthecollectedOSINTtoprocessthedatareceivedfromboththeIDSandTFC,todetectattacks.Thisworkwasevaluatedusingasetof virtualmachinesgenerating benign andmalicious traffic; the objective is toobserveiftheontologycorrectlygeneratesalertsformalicioustraffic.Theirbestresult is achieved with TFC using port, TTL and timing data, obtaining a truepositiverateof84%,andafalsepositiverateof3%.Zhuetal.[ZHU16]takeadifferentapproach.Theirobjectiveistoprovethatitispossible to create an anti-malware solution using only information present inscientificliterature.AsdescribedbyLiaoetal.[LIA16],scientificliteraturetends

27 http://projects.webappsec.org/w/page/13246995/Web-Hacking-Incident-Database

2222

D4.1

to have a predictable structure, which simplifies NLP processing and contentextraction.Zhuetal.processthescientificliteraturedescribingAndroidmalwareandextractfeaturesdescribingtheattacks,creatingamachinelearningsolutionfor recognizing malware. The results obtained are compared with manuallyengineered solution for the same purpose (Android malware detection). Thisworkobtainedsimilarresultswithafullyautomatedapproach,usingmuchlessfeatures,andwithasystemthatcanbeeasilyupdated.Thisworkisevaluatedona fewparameters,butwehighlight that it achieves92.5% truepositive rateat1%offalsepositives.

3.1.5 Gather exploit data from OSINT Sabottke et al.[SAB15] use Twitter to gathermentions about existing exploitsthatarenotyetpresentinsecuritydatabasessuchastheNVD.Thisworkshowsthat information about exploits is published on Twitter twodays (in average)before these are included in theNVD. Thiswork shows that analysingTwitternewsstreamscanprovidevaluabledata,asmentionsaboutexploitscanbeseentherebeforetheyareformallyrecognizedandnormalizedinformationispresentinNVD.Edkrantz et al.[EDK15] try to predictwhich vulnerabilitieswill have exploits.Althoughtherearethousandsofvulnerabilitiesdescribed intheNVDdatabase,onlyasmallportionofthosevulnerabilitieshaveexploitscreatedbyhackersformaliciouspurposes.Therefore,Edkrantzetal.reasonthatitispossibletopredictwhetheravulnerabilityhasanexploitbasedon itsNVDdescriptor.Totest thismethodology, they created amodel usingNVD’s descriptions of vulnerabilitiesandtheirexploits.Then,thedescriptionsofnewvulnerabilitiesareclassifiedaslikelytohaveanexploitornot.Thisapproachpresentsbothprecisionandrecallaround80%.

3.1.6 Black-listed IPs As mentioned in [SHA15], feeds are a good way to obtain information aboutexternal threats and with the use of OSINT it is possible to collect pertinentinformation fromseveral feeds.ABlacklist isanexampleof apublic listwhichcontainsinformationaboutthreatsandmaliciousbehaviours.Therearesomearticlesthatinvestigatetheeffectivenessofblacklistsandwhichblacklists,inaperiod,providethemostreliableinformation.Blacklistscontainasignificant number of false positives, as described in [KÜH14, ROS10, SIN08].However, it is known that blacklist information is a widely-used measure formonitoring and detectingmalicious behaviour [KÜH14, SIN08]. Four blacklists(NJABL, SORBS, SpamCop, and SpamHaus), which report suspicious emailaddresses considered as spam, were analysed in [SIN08]. It was used anunsolicitedmail detection program for the confirmation and detection of falseand true positives. After analysing email traffic in an academic environment(more than 7000 computers) within 10 days, the results confirmed thatblacklistscontainahighnumberoffalsepositives.

2323

D4.1

Theworkdonein[KÜH14]aimstounderstandtheblacklists’contentsandhowitsinformationiscollected.Theauthorspresenttwomechanisms:thedetectionofparkeddomainsandthedetectionofsinkholes.Theyproposeamechanismtodistinguishparkeddomainsfrombenigndomains,thusreducingaconsiderablenumberofnon-benigndomainspresentinablacklist.Amethodforthedetectionof sinkholes it is also described, using a technique developed by the authors(graph-based), and their removal in blacklists. Sinkholes are, for example,serversthatcontainmaliciousdomains,buthavebeencontrolledandmitigatedby security organizations, which use them to monitor the network andcommunicationswithmalicious domains. The authors conclude that blacklistsonlycontainabout20%ofmaliciousdomains,resultinginasignificantnumberoffalsepositives.In both previous works, it is difficult to correctly determine whether theeffectivenessofablacklistwillincreaseordecreaseovertime.AlienVault’sOTX [ALI16] is amechanism like theonebeingdevelopedbyEDPand FCiências.ID inDiSIEM (described in Section 4.1).This framework gathersinformation on IP addresses through denunciations by a set of communities.After this collection is obtained, the threat level of each of the suspiciousaddresses isassessedconsideringthenumberofattacks, thenumberof lists inwhichtheaddressappears,andthetypeofmaliciousnesstowhichthesuspectedIPaddressisassociated.TheresultisalistofIPsthatcanbeusedformonitoringor blocking IP addresseswith a threat value calculated by OTX. However, theassessment is onlymade for OTX IPs and not for the blacklists chosen by theorganization’ssecurityteam.

3.1.7 Others In this section,we describe researchwork that are singular in theirobjectivesanddonotfitanyofthecategoriesmentionedabove.Kergletal.[KER15]suggesta new response to anomalies or attacks mentioned on Twitter. Attackdescriptions are collected from tweets, which are then compared to knownattack descriptions collected from a vulnerability database. If the new attack’sdescription matches a description present in the database, solutions to thevulnerabilitymayhavealreadybeendescribed;ifnot,itmaybethesignofnewzero-dayvulnerability.Zhang et al.[ZHA11] process the NVD to learn patterns between thecharacteristicsofapplicationsandtheirknownvulnerabilities.Theirobjectiveistousehistoricaldataaboutpiecesofsoftwareandtheirvulnerabilitiestotrainamodel that predicts the time to next vulnerability, i.e., for a given piece ofsoftware,howlongittakesuntilanewvulnerabilityisdisclosed.AlthoughZhanget al. present an interesting idea, they were unable to generate a modelpresentinggoodcorrelationcapabilities.Asavulnerabilityonasoftwareversiontypicallyalsoaffectsthepreviousones,varioussoftwareversionsarereportedascontainingthesamethevulnerability.Asdifferentsoftwareversionsthatsufferfromthesamevulnerability(whichisdetectedonasingleday),arereleasedin

2424

D4.1

differentdates, theauthorscouldnotobtainamodelachievinggoodpredictionaccuracy.

3.2 Existing tools ManytoolsareavailablethatcanbeusedtoexploreOSINTinformation,differingmainlyonhowtheyaredeliveredandonthefeaturestheyprovide.Thesetoolsmaybecategorised inthreeclasses:genericopensourcetools;paidtools;paidservices.

3.2.1 General purpose open source tools Many general purpose open source tools may be used to collect, store andorganizeOSINTingeneral,butnoneofthosefoundwasdesignedspecificallyforsecurity-related OSINT. For example, searching GitHub 28 using the “osint”keywordprovides324results.29Nevertheless,mosttoolsareeithergeneric(i.e.,collectOSINT fromall sourcessuchas theOSINT-Framework30)or collectonlyfromaspecificsource,suchasthevarioustoolsthatcollectTweets.31Theselackprocessingandanalysisfunctionalitiesthatwouldsuitthemforsecurity-relatedOSINTapplications,butcanbeadaptedtoasecuritycontextusingtwomeasures:

1. FocustheOSINTcaptureonsecurityeventsorsecurity-relatedsources;2. Filterthedatacaptured,keepingonlysecurityresults.

Thefirstmeasurereducesthescopeofdatatocapture.Forexample,whenusingTwitter, one can select only security related accounts (e.g., Kaspersky32). Thesecond measure reduces the amount of data to process by removing dataunrelatedtothecontext,whetherfilteringdatanotmentioningspecificsoftwareelements or capturingonly vulnerabilitydata.Note that bothmeasures can beusedtogether.Tousethesegeneric toolsone isrequiredtoconfigureOSINTsources, to tailorexisting analysis functions to the specific needs, or even implement requiredfunctionalitiesfromscratch.Table3presentstwoprominentexamplesofsuchopen-sourcegeneralpurposetoolsthatprovidemechanismstoimplementbothmeasuresmentionedabove.

3.2.2 OSINT paid tools WesearchedforspecializedtoolsforsalethatcancollectandprocessOSINTandthatcanbedeployedwithinanorganisationinfrastructure.Mostproductsfoundaresoldasservices,withtheexceptionofPaterva’s33Maltegoproduct.Although

28https://github.com29https://github.com/search?utf8=%E2%9C%93&q=osint&type=30https://github.com/lockfale/OSINT-Framework31https://dev.twitter.com/resources/twitter-libraries32https://twitter.com/kaspersky33https://www.paterva.com

2525

D4.1

Maltego isstructured ina client-serverway,where clientapplicationsare soldandPaterva’sserversprovidetheservicestocustomers,asanalternative,theseserverscanalsobesoldanddeployedwithinanorganisationnetwork.

Table3-Examplesofgeneralpurposeopen-sourcetoolsthatcanbeextendedforOSINTprocessingandanalysis.

3.2.3 OSINT paid services AnalternativetousingtoolstocollectOSINTistopurchaseasecurityfeedfromspecializedcompanies.Theseservicesareusuallysoldasasubscriptiontoafeedofsecurity-relevantnews,sometimesspecificallysuitedtotheIT infrastructureofthesubscriber.Themainadvantageofusingonesuchfeedistosimplypayforthedata,insteadofdevelopingandmanaginganotherpieceofsoftwaretogathersuch data. The main disadvantage is the cost of the service. Table 4 presentssomesecuritypaidfeedswefound,andtheirmainfeatures.Service FeaturesLookingGlass Besidesproving tools for threatanalysis,alsoprovidesasetof

specialized feeds: Threat Intelligence Services, a machine-readablefeed,andThreatMitigation

SecureWorks’ ThreatIntelligence

Theprovideddescriptionsarenotveryspecific,buttheirserviceprovides a threat intelligence feed tailored specifically to theirclients.

Kaspersky’s ThreatDataFeed

Providesafeedconsistingofrulesmainlyforbotnetprotection,andwhitelistoflegitimateservices.

34https://www.elastic.co/products/logstash35https://github.com/certtools/intelmq

Toolname DescriptionLogstash34 Logstash is one of the elements of the ElasticStack (together with

ElasticSearchandKibana–refertoDeliverable2.1“In-depthanalysisofSIEMs extensibility” formore details). Logstash can collect data from amultitudeofsources,includingTwitterandotherOSINT.InLogstashitispossible to place code toprocess andmodify anyof the data collected,whichcouldbeusedasaninitialfilterforthedatacollected.Afterbeingprocessed, the data could be sent to ElasticSearch,where it is indexed.Finally,thedatacanbeanalysedusingtheseveraltypesofvisualizationsprovidedbyKibana.

IntelMQ35 IntelMQ is a message queue implemented to receive data from a widevariety of sources, including OSINT. IntelMQ’s main feature is thecollection and processing of security feeds (such as logs, tweets, orblacklists)autonomously.Thistoolenablestheinformationsecurityteamto more efficiently collect information from a set of feeds. However, asetting is required for each source and, if the information we want tocollectisdifferentfromthestandardprograms,itbecomesnecessarytocreatemodules, oruse similarmodules, to correctly collect informationfromtheintendedsource.Yet,as[SHA15]refers,itisnecessarytogaugetowhichextentfeedsaretrustworthyandif indeedit ispossibletorelyon them, based on the information obtained to implement defencemechanisms.

2626

D4.1

Kaspersky’s TailoredThreatReporting

AsecurityfeeddesignedforaspecificITinfrastructure.Detailsthreat vectors, malware, attacks targeting specifically the ITinfrastructure, possible information leaks, and a status oncurrentattacksagainsttheinfrastructure.

RecordedFuture Paid service for automated collection and visualization ofsecuritydatafocusedonthecostumer’sneeds.Includessupportforthedarkweb,andalargesetofvisualizationtypes.

FireEyecyberthreatintelligence

Providesthreeservices:• Asubscriptiontoathreatintelligenceservice,consisting

of possible threats to an infrastructure, enriched withtheattack’scontext

• ThesupportofaFireEye’ssecurityanalyst• Aservicetousethreatintelligenceonthelifecycleofthe

company

Symantec DeepSightIntelligence

Depending on the subscriptionmodel,may include data aboutvulnerabilities, IPreputation, riskassessment; further, includesattacker data, such as active hacker campaigns and detectedincidents.

Kennavulnerability&riskintelligenceplatform

Integratestheresultsofvulnerabilityscandatawiththeresultsfrom 8 different threat feeds. Prioritizes vulnerabilities andprovidesriskreporting.

AirbusDSCyberSecuritycyberdefencecentres

ThreeEuropeancentresprotectandmonitorcustomer’sassetsin real time. Airbus DS CyberSecurity provides additionalservicesfordetectionandinvestigationofsophisticatedattacks,incidentresponse,andriskanalysis.

AnomaliThreatStream

Anomali ThreatStream is a Threat Intelligence Platform,allowingorganizationstoaccessintelligencefeedsandintegrateitwithinternalsecurityandITsystems.

Table4-Examplesofpaidsecurityfeeds/services.

2727

D4.1

4 Preliminary results on OSINT processing InthischapterwepresentthestatusandpreliminaryresultsofongoingworkinDiSIEMregardingtheprocessingandanalysisofOSINT,withtheaimofcreatingtools that integrate relevant information into SIEMs. Detailed results andconclusionswillbepresentedinaforthcomingdeliverable(D4.2),asplannedintheprojectDescriptionofAction.TheworkreportedisrelatedtotheprocessingoftwoformsofOSINT:structuredblack-lists of IPs; and unstructured textual information arising from variouskinds of sources. In the last case two main approaches are being followed:machine learning approaches to process posts of security-related twitteraccounts; andDigitalMR’s listening247platform,which is employed forminingmarket trends, using data fromvarious sources such as social networks, blogsandforums,tomentionafew.

4.1 Blacklisted IPs OSINT processing Blacklistsareliststhatcontaininformationaboutuntrustedelementsandareatypical tool used as a cyber-defence mechanism [KÜH14]. An example ofblacklists is a list of malware signatures, used by antivirus or IntrusionPrevention Systems (IPS). DiSIEM ongoing research focuses on IP blacklists,which are lists of IP addresses deemed as malicious; IPSs use them to blockinboundandoutboundconnectionstothoseIPs,whichisasimplebuteffectivesecuritymeasure.

4.1.1 Trustworthy Blacklist in SIEM systems EDP and FCiências.ID are working on a case study whose focus is on thetrustworthinessofIPblacklists.Oneoftheobjectivesofthisongoingworkisthereduction of false positives when assessing the legitimacy of communicationswithIPaddressessuspectedofmaliciousactivity.To obtain a more reliable list of malicious IPs leading to a reduction of falsepositives it is necessary to classify the reputation of each IP address and eachblacklist.Thisassessment isdoneusingspecificsecuritymetrics.Blacklistsandtheir contentsmust be evaluated continuously (orwhenever the lists change)and must consider the cases of communications from the organization’snetworkstoblacklistedIPaddresses.Figure 2 represents an overview of the framework being developed, whichincludes four modules that can be used independently. The first is the IPCollector, a program with the purpose of gathering information from publicblacklists. The second is the Trustworthiness assessment, which evaluates thereputationofthemaliciousIPaddressesandtheblackliststhatcontainthem.Thethird module, the Trustworthy Assessment of Blacklists Interface (TABI)application, consists of a web management interface on the IP addresses,blacklistsandcasesrelatedwithcommunicationsbetweentheorganizationandIP addresses suspicious of maliciousness. Finally, a reputable list of IPs

2828

D4.1

(BADIP.csv) is introduced in the SIEM and the rules for monitoring andgenerating alarms are defined. These components are described in the nextsections.

Figure2-WorkflowoftheIPblacklistprocessingframework.

4.1.2 IPs Collector Weconsiderasasource(orfeed)anentitythatprovidesoneormoreblacklists.In the undergoing study, we only consider public blacklists that containinformation about IP addresses (IP blacklists). The frameworkuses theOSINTconcept to gather informationof a pre-specified set of public blacklists. At theend of a period of three months of investigations and selection of publicblacklists,28 sourcesand121blacklistswere selected.This listof sourcesareincludedinAppendixA.

4.1.3 Trust Assessment For an effective cyber-defence, and when there is an extensive number of IPaddresses to consider, it becomes necessary to differentiate a suspicious IPaddress from another by its trustworthiness. The typical features used todifferentiate them are the criticality, credibility, impact,maliciousness and thenumber of reports. The trust assessment aims to classify the reputation ofmaliciousness of an IP address and the reputation of credibility of a blacklistconsideringtheseconditions.For the calculation of the reputation of maliciousness of an IP address, fourfeatures are used: Term Frequency (TF), precision, average of the reputationscoreofallblackliststhatreportedtheIP,anditspersistence.TheIPrankistheposition an IP occupies when sorting all IPs by the trust we have in theirmaliciousness,whichisgivenbyareputationscore.

2929

D4.1

TheTFcomponentistherelativefrequencyofoneIPconsideringthenumberofoccurrencesofallgatheredIPs.TheprecisionofanIPistheratioofthenumberofconfirmedcasesofmalwaredetectedbycommunicationswiththis IP, to thetotalnumberofcasesassociatedwithcommunicationswiththatIPinthecurrentmonth.Thepersistenceisdefinedforagiventimeperiod,whichmayberelatedwiththeSIEMeventretentionperiod,andisameasureoftheIPsappearanceinblacklistsorinpositivecasesreportedintheorganization.Asthesolutionshouldbeadaptableto theenvironmentofdifferentorganizations,whenanIPhasnotbeen informed by blacklists, it is only discarded if it is not associated withpositivecases.

4.1.4 Trustworthy Assessment Blacklists Interface TheTrustworthyAssessmentBlacklistInterface(TABI)isawebinterface,whichisbeingdeveloped toallowmanagingandvisualizing information relatedwiththeblacklists,suspicious IPs,organization’scasesandpublicorganization’s IPs.TABIwillallowforacentralizedmanagementoftheentireframework,withouttheneedforcodewritingorfileconfiguration.Thetoolwillconsidertheaddition,removalandeditionofblacklistsandincidentcases, to be used in the trustworthiness assessment of the IP addresses andblacklists.TheTABIapplicationwillhaveanextrafunctionalitythatindicatesifapublicIPoftheorganizationiscontainedinanyblacklist.Forthisfunctionalitytobeoperational,itisnecessarytohaveaccesstoalistofthepublicIPaddressesoftheorganization.

4.2 Infrastructure-related OSINT processing AnotherlineofworkinDiSIEMconcernstheprocessingofunstructuredtextualOSINTthatispostedonthewebbycybersecuritycompaniesandprofessionals,as well as hackers and attack victims. The information is posted on socialnetworkssuchasTwitter,dedicated forums,blogs,andnewsfeeds, tomentionsomeofthepublicationvenues.

4.2.1 Exploratory Machine Learning Approaches We are exploring different machine learning-based approaches to discoverrelevantsecurity-relatedOSINTforagivenITinfrastructure.TheseapproachesareorientedatkeepingSOCanalystsawareofthemostrelevantthreatsagainstthe infrastructuresunder theirresponsibility,without requiring them to spendtimesearchingforthatinformation.Forthispurpose,theDiSIEMindustrialpartnersprovidedadescriptionoftheITinfrastructure they wish to monitor. This allows decreasing the amount ofcollectedOSINT, thereforeenablingthedevelopmentofmodels tailored for thespecific descriptions, and enabling also a more concise assessment of theperformance of the approaches being followed and a more efficientinfrastructure-awareOSINTdiscovery.

3030

D4.1

CollectingOSINTimpliessearchingandcollectingdatafromthemostinterestingsources.Nevertheless,securityanalystshavea limitedtimebudget toseekthisinformation,eventhoughthequalityoftheirworkdependsonthisknowledge.Our proposal is meant to provide analysts with the most recent and relevantinformation regarding the protected infrastructure. We want to maximize theamountofrelevantinformationobtained,whileminimizingthetimerequiredtoviewit.Toachievethisobjective,weproposeaprocessingpipelinecomposedofanOSINTinformationgatherer,anautomaticmethodforselectingtherelevantinformation,andasummarizingfunction.Morespecifically,weuseanautomatedtool to gather tweets from security-relevant accounts, and we are testingdifferentmachine learning approaches to select the relevant ones consideringtheprotectedinfrastructure,andtogrouprelatedinformationgatheredtoavoidpresentingrepeatedorinformation.The proposed methodology aims to simultaneously achieve three mainobjectives:

1. Maximizetheamountofrelevantinformationpresentedtotheanalyst;2. Minimizetheamountofirrelevantinformationpresentedtotheanalyst;3. Aggregaterelatedinformation.

The first objective aims to avoid discarding relevant information, while thesecondaimstoavoidpresentingirrelevantinformationtotheanalyst.Thesetwoobjectives are fundamental to ensure the reliability of the system: analystsshould trust that the presented information is relevant and must be takenseriously.Thefinalobjectiveisimportanttoavoidthepresentationofduplicateinformation.Although there aremany sources ofOSINT, for these approacheswe focus onTwitter for twomainreasons.First,Twitter iswell-recognizedasan importantsourceofshortnoticesaboutwebactivityandabouttheoccurrenceofeventsinnear real-time.36This is also trueabout cyber securityevents, asmost securityfeedsandresearchersmaintainactiveaccountswheretheytweetthenews’titles[CAM13,SAB15].Therefore,Twitter isan interestingaggregatorof informationandactivity fromall kindsof sources.Secondly, sincea tweet is limited to140characters (around 20-30 words), these messages are simpler to processautomatically,enablingveryhighlevelsofaccuracyandlowfalsepositiverates.As agreed in the DiSIEM project proposal two types of machine learningapproaches are being evaluated: well established methodologies such as SVMandArtificialNeuralNetworks(ANN),anddeeplearningapproaches.

36 https://www.americanpressinstitute.org/publications/reports/survey-research/how-people-use-twitter-in-general/

3131

D4.1

SupportVectorMachinesandArtificialNeuralNetworks.SVMsandANNsarebeingtestedtoclassifyeachtweetasrelevantornotforagivenITinfrastructure.Apache Spark,37a scalable platform, and itsmachine leaning library are beingemployedforthispurpose.Figure3illustratestheproposedtwitterclassificationarchitecture.A data collector restricts tweets by collecting them only from relevant twitteraccounts.Collectedtweetsarethenpassedbyagroupoffiltersthatassignsthemto a given part of the IT infrastructure. Then, a specific classifier is used toclassify the tweets as relevant or not for the security of that part of themonitoredinfrastructure.

Figure3-ProposedTwitterclassifierarchitecture.

SinceTwitterisourdatasourceandwewanttoavoidpresentingretweetsandthestreamofsimilartweetsaboutthesameeventsandthreats,attheendofthearchitecturethereisaclusteringstepusedtogrouprelatedtweets.Noticethatatthis level we are not interested in what the analyst will dowith the relevantinformation, nor we aim to further process it to extract machine readableinformation(asisdonebyotherworks[LIA16,ZHU16]).OngoingworkseekstofindgooddesignparametersforSVMsandANNsandtoprovide a comparison on the performance of these well-established machinelearningtechniques.AnotherinterestingquestionbeingaddressedistofindoutifthereisaclearbenefitinusingmultipleclassifiersforspecificITinfrastructurepartsinsteadofusingasingleclassifierforthewholeinfrastructure.Preliminaryresults indicate that theobjectives specified in theprevious subsectionmaybemet, but are still inconclusive regarding this question and also on theapplicabilityofthemethodologiestoverylargedatasets.Deep Learning approach. Besides SVMs and shallow ANNs, a deep learningmethodologyisbeingtestedtoclassifyeachtweetasrelevantornotforagivenITinfrastructure.Forthispurpose,TensorFlow38isbeingemployed.Deep learningmechanisms have recently gainedmuchwarranted attention astheyhavebeenusedinanincreasingnumberofextremelycomplextasksonverydemandingbigdataproblems.Regardingtheproblemathandtheyareexpected

37https://spark.apache.org/38https://www.tensorflow.org/

3232

D4.1

to provide increased classification accuracy, less sensitiveness to theheterogeneity of data sources and data sets size, less requirements inprearranging a set of input features, and a higher level of generalisation andadaptation,thusincreasingautonomy.As such, the main objectives of following this approach consist in processinglargeramountsofOSINTdatawithbetter classificationaccuracyasalternativesimplerapproaches.Weareinterestedindeterminingwhetheratweetcontainsvaluableinformationregarding a cyber-threat to a certain architectural component or not. Thistranslatesintoabinaryclassificationtask:atweetmentionsathreatornot.Figure4illustratestheproposedarchitecturefortheneuralnetwork.Weexpectthatasingledeeplearningmodelwillpresenthigheraccuracywhenclassifyingtweets foracomplete IT infrastructurethanusingSVMsorANNs.The inputofthe model is a sentence (a tweet) and a description of part of the ITinfrastructure. The output should indicate if the sentencementions a threat tothatpartoftheinfrastructure.AlthoughtheinputmentionsonlyapartoftheITinfrastructure(astweetsgenerallymentiononesoftwareelementintheirtext),asinglemodelwillbeusedforthewholeinfrastructure.

Figure4-Architectureofthedeeplearningapproachtotheclassificationoftweets.

Related work [KAL14, KIM14,WAN15] suggests neural network architectureswheretheinputlayerisasentencecomprisedofconcatenatedword2vec[W2V]wordembeddings,followedbyaconvolutionallayerwithmultiplefilters,amax-

3333

D4.1

pooling layer, multiple fully connected layers, and finally a softmax classifier[CON17].Therefore,wedecidedtoemployasimilardesign.Figure5depictsthearchitecturebeingtested for tweetclassification,whichexploits thecorrelationbetween the tweet sentence and a specification of an IT infrastructurecomponentbyusingaconvolutionalneuralnetwork.Preliminaryresultsindicatethat the approach is successful, although the dataset’s size does not yet allowsolidcomparisonsandconclusions.

Figure5-Deepneuralnetworkarchitectureforclassificationoftweets(adaptedfrom[KIM14]).

4.2.2 DigitalMR Listening247 platform TheListening247platformisaservicewhichoffersanalysisofvarioussourcesofdata including blogs, social networks, news, boards/forums and other openlyavailabledataontheInternetformarketresearch.ItusesaSoftwareasaService(SaaS)modelthatenablesuserstomonitorthewebforspecificsubjects/topicswhileextractinginsightsandreports.Theplatformhasbeendesignedfororganisationstomanagetheirreputationnotonlyonsocialmedia,butvariousonlinelocations.Suchsystemsareincreasinglyindemandbyseniormarketingexecutiveswholookforwaystosiftthroughfast-changing data across geographies, languages and time zones. This makes itparticularlyusefulinthiscaseaswellasitprovidesasimpleinterfacetoprocessunstructureddata.Social listening helps organisations: accurately evaluate marketing campaigns;analysehotconversationtopics;discoverwhitespace/marketgaps;respondtonegative & leverage positive posts; benchmark your share of voice withcompetitors. Unlike conventional socialmedia dashboards, this combined datafromcorporateCRM/ERPsystemsandmillionsofblogs,boards,videosandnewsfrom three different socialmedia sites to present aggregated data quickly andclearly.Inthiscase,analysissuchasconversationtopicscanbeusefulforgivinginsightintounstructureddatarelatedtospecificinfrastructuresofcompanies.

3434

D4.1

Intermsofinfrastructure,currently,theplatformmakesuseofAmazon’scloudstorage (S3), database services and elastic compute capability. This allowscomplexandprocessor intensivetaskstobeoffloaded,allowingthemtoutilisesurplus processing and storage capability elsewhere in the cloud andconcurrentlywithother tasks viaAPIs tomultiple data aggregation engines tocoversourcesofonlinetextsuchasTwitter,Facebook,blogs,boards,videosandnews. The data is analysed to extract sentiment, brands, products and topics.Thisinformationretrievalstep,combinedwithmetadatacomingfromtheonlineposts,isessentialforthecreationofinsightfulreportsthatdescribewhatissaidonthewebaboutthesubjectofinterest.Intermsofprocessing,theListening247isadistributedplatform.ItusesHadoopMapReduce 39 for data processing and implements state-of-the-art machinelearningalgorithmsforinformationretrieval.Rawandanalyseddataarestoredin scalable distributed databases that offer a flexible query API used for ourreportingneeds.Theback-endarchitectureisdevelopedusingPython,whereasthefrontendisbuiltusingHTML5,JavaScriptandPHP.ThemainimpactoftheuseoftheListening247platformforDiSIEMisthatitwillbeabetterandmoreeffectiveuseofthesocialwebdatafrombothbusinessandsocial perspectives. In particular, it will provide new insights/tools to betterunderstand clients/citizens needs and activities that confinemalicious contentwhich will traditionally go unnoticed. Of course, personally identifiableinformation will be anonymised by any of the techniques that best fits thisapplicationincompliancewithprivacylaws.Additionally, the platform has been used with success on several majorlanguagesincludingEnglish,Spanish,German,Russian,Chinese,andVietnameseamong others. DigitalMR’s network of 250 experienced and tested curatorsworldwide, in addition to our industrial strength processes for noise removalanddisambiguatingpostsmakesourtrainingmachinelearningmodelsstandoutintermsofperformance.The proposed custom pipeline for the threat prediction based on theListening247platformwill consistof apre-processing step,noise filtering, andananalysisstepwheretheentitiesofthetweetandthelocation(ifavailable)canbe extracted, and a prediction can be made as to the type of threat it isaccompaniedbyapredictionconfidence.Inputs.Thekey inputs for theListening247willbekeywordsto focusthedatabeinggatheredfromOSINTsourcestoonlythoserelevant.Thisisthefirststeptonoisefilteringwhichinvolvesformingqueriesthatdisambiguatekeywordsthatmight be homonyms to other words. For example, “Windows – the operatingsystem” (an infrastructure),will yield information about “windows”which areused inbuildingsamongothers.Formingspecializedqueries for theserelevantinfrastructure as keywords such as for the Windows operating system will

39http://hadoop.apache.org/

3535

D4.1

narrowdownthevastamountsofOSINTdataavailableontheInternet,therebymakingtheamountofdatatobeprocessedmoremanageable.Thesequeriescanbeupdatedwithtimetoyieldmorerelevantdata.

Figure6-DigitalMROSINTthreatpredictorproposal.

Data Aggregation. To process all OSINT data, it needs to be aggregatedwithrespect to time so that data occurring within the same period gets groupedtogether.Thisallowsfortimeseriesanalysisofthedatafromvarioussources,forexample using LatentDirichlet Allocation (LDA) todetermine the topicsof thedata, or predicting the expected number of posts relevant to an infrastructurebasedonpreviouslyseennumbersforeachweek.Informationcouldbeaggregatedbyweeknumbers(i.e.ISOWeekformat)whichconsists of 52, or 53 weeks in a year. This makes it more likely to haveinformation related to each other within the same window (i.e. a week).Specifically, various sources of information have different velocities andaggregatinginformationbyweekmakesitmorelikelyforinformationrelatedtoeach other to be grouped together. By velocity,we are referring to the rate atwhich information isbeingproduced.Forexample,whenthere isaDistributedDenialofService(DDOS)attackonawebsite,socialmediasourceslikeTwitter,InstagramandFacebookareusuallythefirstonestoreportthenews.Followedby news agencies on their sites, and blog articles that follow up on the event.However,someofthevelocitiesfortheseoutletssuchasnewsagenciesarealsochangingdue to the changes in thewaynewsarebeing reported in this ageofconnectivity.Wecanconsidertwodesignsforaggregatingthisdatafromvarioussourcesintoatime-interval.Thefollowingprovidestwodifferentapproaches.

3636

D4.1

• Design#1

Inthisdesign,thetimeintervalisaday.{time-interval : (21-01-2017 , 22-01-2017) #daily {‘twitter’: <twitter_data>, ‘blogs’: <blog_data>, ‘forums’: <forums_data> ….} }

- Pros:§ Dataencapsulatedwithinacommontimeframe.

- Cons:§ Datasourceshavedifferentvelocities;§ Somedatasourceswillbeduplicatesasaresult.

• Design#2Inthisdesign,weaccountforthetimedifferenceofvelocitiesfordifferentsourceswhichallowsforsimilarinformationtohopefullybegroupedwithinthesametime-interval.{ 'period': (21-01-2017 , 28-01-2017) //one week data {'twitter': { 'period': (21-01-2017 , 22-01-2017) { <twitter_data> } . . 'period': (27-01-2017 , 28-01-2017) { <twitter_data> } }//end of twitter data {'blogs': { 'period': (21-01-2017 , 28-01-2017) { <blog_data> } } } //end of one week data

- Pros:§ Alsoencapsulatesdatawithinatimeframe;§ butalso considers thevelocityof the rateof capturing for

eachsource.- Cons:

§ Rate of capturing for each source might need moreoptimizing.

Wewill be experimentingwith variouswaysof aggregating data such that themost amount of relevant information (though at different velocities fromdifferentsources)arecapturedwithintheperiod.Cyber threat Modelling. The first model in the pipeline filters out noise,specifically thosedata found tonotbeuseful for the infrastructuresof interest

3737

D4.1

(seeFigure7).Thismodelwillneed tobe trainedona largenumberofOSINTdatataggedwithrelevancetoinfrastructuresofinterest.FilteredOSINTdatawill then be allowed to pass through to the secondmodel(see Figure 8), which will use NLP to obtain meta-information such as theinfrastructureinvolvedandpossiblythelocationsinvolvedwhichwillbeaddedtothedatainSTIXv2.0format(JSON).

Figure7-Noisefilteringstep.

Figure8-RelevantdatagetsanalysedbythemachinelearningandNLPtools.

Prediction of threat likelihood will require tagging data with threat forsupervised training. So, the STIX payload could include information of threatlikelihood to the infrastructure, and possibly a prediction confidence to avoidfalsealarms.ThiswilllikelyinvolvetheuseofrecurrentneuralnetworkssuchasLong ShortTermMemory (LSTM) neural networksorGatedRecurrentNeuralNetworks (GRU), which remember information over time and use that toinfluencetheirnextprediction.Thisisessentialforeventsthatunfoldovertime.

3838

D4.1

LSTMnetworkscontaina“memoryunit”,whichisselectivelyupdatedwithnewpatterns. This memory is also used to selectively influence the neuron's finaloutput. This selective process, which is handled by the gates, is learned overtime.Figure9showsthecanonicalstructureofanLSTMunit.Theyaregenerallytrainedbyaback-propagationalgorithm.ApotentialrestrictionthatwillaffecttheaccuracyoftheclassifierinpredictingthreatswillbetheavailabilityofsufficientOSINTdatarelatingtocyberthreats.Thereisaneedforgettingannotateddatainmajorlanguagesthatrelatetocyberthreats,whichcanthenbeusedastrainingdataforthethreatpredictor.LSTMs,inparticular, requirea lotofdatabecauseof theadditional free-parameters.Apotential solution could be the use of language processing to identify threatsfromtheuseofkeywordsthatwilltypicallyindicateathreatinmajorlanguages;such as ‘ddos’, ‘security breach’, ‘leak’ and more. This data can be verified byhumans and used as training data for the threat predictor. This implies thatcurators that areknowledgeablenotonly in the language,butalsounderstandcomputer security will be needed. In addition to the type of threat, otherinformationfromtheOSINTsourcessuchaslocationandentitiesinvolvedcouldalsobeextractedtoprovideamorecomprehensivedescriptionofthethreat.ThepredictionconfidenceoftheclassifiercanbeincludedinthedatasenttoSIEMS,whichwillhelpavoidtheissueoffalsealarms.

Figure9-AnLSTMNeuralNetworkCell.

Other information thatwill be produced at this stagewill include topics fromtime series topicmodelling which helps pass on information about the topicsdiscussed,howmanypostsarerelatedtoeachofthetopicsandhowthischangesovertime.Anotherfeatureisawordfrequencycountalsodoneovertimewhichcapturedthemostfrequentwordsforeachperiod,andhowsignificanttheyarerelative to other words. All this data can be passed on to the visualizationcomponentwhichhelps theenduseof theSIEMhave situational awarenessofwhatishappeningwithregardstotheOSINTdatasources.

3939

D4.1

5 Context-aware OSINT integration

5.1 Threat Intelligence Data Interchange Formats The number and impact of cyber-attacks has increased in recent years, asevidenced by reports from governments and organizations. To face theseemergingthreats, it iscrucial tohavetimelyaccesstorelevant,sensitivethreatintelligence information. Anyway, the ability to share this information is oftennot enough. Threat intelligence must be expressed and, then, shared usingspecificstandards,allowinginvolvedpartiestospeedupprocessingandanalysisphasesofreceivedinformation,achievinginteroperabilityamongthem.Some companies developed their own application framework, in order toexchange cyber threat intelligence, relying on different standards and/orprotocols.Anexample, is theonedevelopedby IntelMcAfee, calledOpenDataeXchange Layer (Open DXL) [MCA]. It supports a wide range of languages,allowing all the applications to communicate over a universal orchestrationlayer, and this interaction is totally independent of the underlying proprietaryarchitecture. This abstraction from vendor-specific APIsmakes the integrationpartmucheasier.Itcouldrepresentagoodsolutionfortheintegrationwithtoolsand platformswhich already supportOpen DXL interchangemethods, such asMcAfeeproducts,aspossiblefutureworks.However,inDiSIEM,asstatedin[DIS],oneofthearchitectureprinciplesaffirmsthatnoadditionalorsignificantmanualworkshouldberequiredtooperateandinteract with SIEMs, as well as some relevant modifications due to ourextensions. So, it is preferable to focus upon actual standards, instead offrameworks,torepresentandexchangecyberthreatintelligence,andcheckhowit could be injected into SIEMs, using interchange methods which are alreadysupportedbythem.Starting from these considerations, some standards have been considered; themostimportantarethefollowing:

• IncidentObjectDescriptionandExchangeFormat(IODEF)[DAN07]:XMLbased standard, mainly used for representing and sharing incidentreports, especially when Computer Security Incident Response Teams(CSIRTs)areinvolved

• CyBOX/STIX/TAXII [CYB], [STR], [TRU1]: open standards developed byMitreorganization,respectivelyusedforrepresentingIoCsanddetailedcyberthreatintelligence,butalsoforsharingitwithtrustedpartners

• OpenIOC [OPE]: vendor dependent XML based standard, introduced byMandiant and primarily used in their product, however it can beextendedinordertomeetorganizationneeds.Itfocusesespeciallyupontactical cyber threat intelligence, in fact it is not as complete as Mitrestandardsfromthispointofview,whichareabletocoveralsostrategiccyberthreatintelligenceinamoredetailedway

4040

D4.1

Comparisonamongthemwasaddressedinmanyarticlesandpublications,suchasin[KAM14],[FRA15],[FAR13],[SAU17].Currently,wecanstatethatthemostused,andalsothemostpromising,aretheonesdevelopedbyMitreorganization,specifically Structured Threat Information eXpression (STIX), for describingcyber threat information, and Trusted Automated eXchange of IndicatorInformation(TAXII),forsharingitinanautomatedandsecureway.Forthesereasons,wedecidedon investigatingmore indetailsabout thesetwostandards, to understand if they could represent a good solution for DiSIEMobjectives.

5.1.1 STIX StructuredThreat Information eXpression (STIX) is anopen standard used forrepresenting and exchanging Cyber Threat Intelligence (CTI), developed byMITREorganization,butnowismaintainedbyOASISCyberThreat IntelligenceTechnicalCommittee.Itallowssharingthisinformationamongdifferententities(e.g., organizations, governments, companies, research groups) in a consistentandmachine-readablemanner,withtheaimofincreasing:

• Collaborativethreatanalysis;• Incidentresponsecapabilities;• Automatedthreatsharing;• Interoperability;• Efficiency;• Situationalawareness.

It iswidelyusedbymanygovernmentsandorganizationssuchas theNationalCouncil of Information Sharing and Analysis Center (ISAC council), FederalGovernmentof theUnitedStates,U.S.DepartmentofHomelandSecurity(DHS),Japanese Information-technology Promotion Agency (IPA), besides bothcommercialandgovernmentfeedprovidesit,aswellasmanythreatintelligencetoolswhichareabletoprocessandproduceit.Verybriefly,itistargetedtosupportalargesetofcyberthreatmanagementusecases,forexample:

• Analysingcyberthreats;• Specifyingindicatorpatternsforcyberthreats;• Managingcyberthreatresponseactivities;

o Cyberthreatprevention;o Cyberthreatdetection;o Cyberthreatincidentresponse;

• Sharingcyberthreatinformation.

Thisstandardallowsbindingtogetheradiversesetofcyberthreatinformation,which will be individually described later, representing it with a commonstandardizedformat.Thisisveryimportant,especiallywhendatashouldbefedinto a Security Information and Event Management (SIEM) system. SIEMs are

4141

D4.1

verypowerfultoolsforempoweringtheorganizationsecurity[DIS17],buttheyshouldworkonlywithstructureddata,consideringtheir limitations forad-hocimportingandanalysingunstructuredformats.Thiscouldbeaproblem,infact,often,rawdataextractedfromexternalsources,such OSINT, Social Media Intelligence (SOCMINT), Human Intelligence(HUMINT), or other private or public repositories, are expressed throughdifferent data format (e.g., CSV, PDF, custom XML, custom JSON). ThisinformationisreferredasThreatData,andinjectingitdirectlyintoSIEMscouldled,forexample,toahighnumberoffalsepositives.So, ThreatData should be collected, aggregated and, then, normalized, using acommon structured format, before being analysed and enriched. The obtainedcleaneddataisreferredasThreatIntelligenceanditcouldbefedintoSIEMs,tolet thesesystemsprocessandcorrelate it.ThesearesomereasonsthatexplaintheimportanceofusingstandardsforrepresentingCTI,and,inparticular,STIXisactuallythemostused.Another great advantage of STIX is given by its extensibility. In fact, it iscompletely extensible, allowing each user to define their custom properties,customobjectsorcustomvaluesforpredefinedproperties.Obviously,itshouldbeconsideredthatsomesharingpartiescouldnotbeable toprocessacustomSTIX file. In this case, they could choose to ignore the entire file or just thecustomsections.However,ifalltheinvolvedpartiesareawareofeachadditionaldatathatcouldbepresent,therewillnotbeanyproblemwhenprocessingit.Besidesbeingawidelyused standard, lotsofdocumentations canbe foundontheInternet.Manyopen-sourcelibrariesareavailableforhelpingdeveloperstocreate and process CTI using STIX, written especially in Python, but alsosomething in Java is available. Both libraries and documentations arecontinuouslyupdated;currentlythelateststableversionisthe2.0.Next sectionswill proceedwith amore detailed description about the currentversion of STIX (2.0). Finally, a brief comparison among this version and theolderoneswillbemade,forpointingoutwhyitcouldrepresentagoodchoiceforrepresentingCTIinDiSIEM.STIX2.0.Therearemanydifferencesamongthisversionandtheothers.STIX2.0[OAS17], [OAS171], [OAS172], [OAS173] could be considered graph-based,where nodes and edges are respectively STIXDomainObjects (SDO) and STIXrelationships (that could be STIX Relationship Objects (SRO) or embeddedrelationships).STIXDomainObjectsareextensionsofSTIX1.x[STR]coreconstructs,whileSTIXRelationship Objects indicate explicit relationships among different objects,allowing to represent in a more understandable way the related threatintelligence. Before starting to explain the available SDOs, it should beconsidered thatCybOXstandardhasbeen integrated intoSTIX2.0, todescribe

4242

D4.1

simple IoCs and their associated patterns. The predefined SDOs are thefollowing:

• Observed Data: related to basic Indicator of Compromise, such as IPaddresses,hash-filesandregistrykeyvalues.TheseSTIXObjectsareusedtorepresentwhathasbeenmonitored.MaliciousactivitiesarerecognizedcheckingthemwithpatternsdescribedinSTIXIndicators,wheredetailedinformationaboutthethreatisprovided;

• Indicator: describes patterns used to detect malicious or suspiciouscyber activities. These patterns could be specified using the STIXPatterningLanguage[OAS174];

• Identity:representsindividuals,organizations,groups,butalsoclassesofindividuals,organizationsandgroups.Itcouldreferbothtoattackersandvictims;

• AttackPattern:typeofTTPforcategorisingattacks,generalizingthemtothe pattern that they follow, providing detailed information about howthey are performed. Reference to externally-defined taxonomy, such asCAPEC[CAP],couldbeattachedtothisSDO;

• Malware: refers tomaliciouscodeand/orsoftware.Thisobjecthelpstocharacterise, identify and categorising malware samples through a textdescriptionfield;

• Campaign: describes a set of malicious activities performed by anattackertoaspecificIdentityoveraspecificperiodoftime;

• IntrusionSet:setofCampaigns,performedbyanattacker,targetedtoaspecificresourceofaspecificIdentity;

• CourseofAction:countermeasurestobetakenagainstaspecificthreat,inordertomitigatethepossibleimpactsofincidents;

• ThreatActor:representsmaliciousactoridentity,includinghishistoricalobservedbehaviouragainstaspecificentity;

• Tool: legitimate software that can be used by adversaries to performattacks. Examples could be remote access tools (e.g., RDP) and networkaccesstools(e.g.,NMAP);

• Vulnerability: refers to mistakes in software that can be exploited byattackers.Externalreferences(e.g.,CVEs)couldbeattachedtotheSDO;

• Report: collections of threat intelligence related to one or more topic(e.g.,malware,attacktechnique,threatactor).

Instead,thepredefinedSROsarethefollowing:

• Relationship:usedforlinkingtwoSDOsinordertoexplicitlydefinedhowthey are related to each other. They can be considered as edges in ahypotheticalgraph,whereSDOsarethevertices.

• Sighting: refers to the belief that something in CTI was seen (e.g.,indicators, malware, observed data). Used for tracking threat actors,resourcestargeted,suspiciousbehaviours,etc.

STIX2.0objectsare completely customizable.Thereare twoprimarymeansofcustomization:

4343

D4.1

• CustomProperties:foraddingnotalreadydefinedpropertiestoSDOs• CustomObjects:forcreatingfromscratchnewSDOs

In order to perform these operations, some specific rules should be followed,regardingnaming,length,whatASCIIcharactershouldbeused,etc.Additionally,someSDOscontainaparticularpropertywherethesetofpossiblevalues that can be assigned to them, is associated to an open vocabulary. Itmeansthatthesesetofvaluescouldbeseenasasortof“suggestedvalues”,but,inpractice,anyothervaluescouldbeused.DifferentlyfromSTIX1.x,thisversionexploitsJSONstandardtorepresentSTIXobjects(itisforthisreasonthatSTIXObjectsareconsideredforthisversion,notSTIX files, as in the previous ones), instead of XML. OASIS CTI TechnicalCommittee(TC)statedthatJSONwasmorelightweightthanXML,andsufficienttoexpressthesemanticofcyberthreatintelligence.Besides,itissimplertouseand globally preferred by developers. Some open-source utilities and librariescan be downloaded from the website, for creating and processing STIX 2.0objects.AnotherimportantfeatureofSTIX2.0isthatitiscompletelytransport-agnostic.Itmeans that itdoesnot relyonany specific transportmechanism,and this isachievedembeddingtheSTIXObjectsthatshouldbesentintoaBundle,providedbySTIX,thatcanbeseenasasortofcontainerforSTIXObjects.Formoredetailed informationandsomepractical example, thedocumentationavailableinthewebsitecanbeconsulted.ComparisonbetweenSTIX1.xandSTIX2.0.Inthissection,abriefcomparisonamongtheversionswillbeconsidered,tounderstandwhySTIX2.0wouldbeabetterchoicewithrespecttoSTIX1.x,forDiSIEMproject:

• Itismorerecent.Itseemstrivial,butbeingmorerecent,moreeffortswillbespentinordertoupdateandimproveit,consideringthehighnumberofdifferencesthanpreviousversions;

• JSON vs XML. As explained in the previous section, JSON is morelightweight,simplertouseandpreferredbydevelopers;

• Onestandard.CybOXStandardiscompletelyintegratedinSTIX2.0,whileinSTIX1.xitisaseparatedstandard;

• STIXDomainObjects:inSTIX1.x,Objectsareembeddedintoeachother,while inSTIX2.0theyaredefinedat thetop level,andtherelationshipsamong them are expressed through SROs. Besides some STIX 1.xconstructweresplit,inordertogeneratedifferentandmoredetailedSTIX2.0 Object. For example, STIX 1.x TTP construct was split in STIX 2.0AttackPatterns,Tool,MalwareandVulnerabilityObjects;

• IntroductionofSROsastoplevelobjects;• Datamarkingsdon’tuseanymoreaserializationspecificlanguage,suchas

XPath. In STIX 2.0, markings could be applied to entire objects or tospecificpartsofthem;

4444

D4.1

• Indicator Pattern Language. In STIX 1.x, indicator patterns wereexpressed in XML, making very difficult to express complex patterns.While, in STIX 2.0 specific indicator pattern languages can be used,independentfromtheserializationlanguage,makingthemeasiertoread,processandfordescribingmorecomplexsituations.

Theonly(temporary)advantageofusingSTIX1.x,isrelatedtoavailableon-linedocumentation, open-source utilities and available samples. However, thisadvantagewilldisappearoncenewupdateswillbeavailable forSTIX2.0, and,consideringtheimportanceofthetopicandhowmuchthisstandardisactuallyusedbymanygovernments,companiesandorganizations,itshouldhappenveryfrequently.Anyway,actualavailableSTIX2.0librariesaregoodenoughtocreateand process STIX JSON objects, and considering the above differences amongconsidered versions, it can be stated that, for normalizing unstructured cyberthreatdatagatheredfromexternalsources,tostructureddatatobeinjectedintoSIEMs,STIX2.0willbeabetterchoicethanpreviousversions,inthecontextofDiSIEMproject.

5.1.2 TAXII Trusted Automated eXchange of Indicator Information (TAXII) [TRU1] is anapplicationlayerprotocol,developedbyMITREorganizationforcommunicationofCTI ina simple, automatedandscalablemanner. Itwas ledby theDHSandfacilitated by MITRE organization considering STIX standard, in fact it mustsupport theexchangeof STIX content; this feature ismandatory to implement.However, additional content types are permitted. Now it ismaintained by theOASISCyberThreatIntelligenceTechnicalCommittee,thesameasSTIX.Thanksto itshighlevelof interoperabilitywithSTIX itself, it iswidelyusedbymany organizations and also governments. Some examples are the AdvancedCyber Defence Center (ACDC), the ISAC Council and IBM for its cloud-basedplatformIBMX-ForceExchange.TAXIIgoalscanbesummarizedasfollows:

• EnabletimelyandsecuresharingofCTIincybersharingcommunities;• Support a broad range of use cases and practises common to cyber

informationsharingscenarios;• MinimizeoperationalchangesneededtoadoptTAXII.

Anyway, it is important topointout that thisstandarddoesnotallowdefiningtrustagreementsbetweensharingpartners,asanyaccesscontrollimitationsornon-technical aspects of cyber threat information sharing. Instead, it enablespartiestosharesituationalawareness,basingonalreadyexistingdataandtrustsharingagreements.It enables secure sharing of CTI, considering that it is a transportmechanismbuilt over HTTPS. Besides, it supportsmany common sharingmodels, such ashubandspoke,publisher/subscriberandpeer-to-peer,so,itissuitablebothforcentralizedanddecentralizedenvironments.

4545

D4.1

Next sectionwill proceed describing the newest version, TAXII2.0. Due to thechoiceofusingSTIX2.0,wedecidedonfocusingdirectlyonthisversionofTAXII,andtonotconsiderthepreviousones.ThemainreasonwasthatTAXIIversionswere specifically developed considering the related STIX version, in order toexploit all itspotentialities, although they couldbeused for transportingothercontenttypes.As stated in the previous section, there are huge differences between STIX 1.xand STIX 2.0, starting from the format used for representing cyber threatintelligence,so,wedecidedonfocusinguponlastversionofTAXII,forcheckingitsapplicabilityinDiSIEMcontext.TAXII 2.0.With STIX2.0, CTI started to be represented using JSON instead ofXML.Forthisreason,alsoTAXIIhadtobemodified,inordertodealwithJSONasmain standard used for representing CTI. Previous versions, in fact, weredeveloped for, mainly, dealing with STIX files expressed through XML format.Detailedspecificationsofthisversioncouldbefoundin[OAS175].ThesupportforexchangingSTIX2.0contentismandatorytoimplement,anywayadditional content typesarepermitted. It isdesigned towork specificallywithHTTPS, to enable secure and authenticated communication between sharingparties.ActualspecificationdoesnotdefineanyrequirementsforHTTP.Thisstandarddefinestwoprimaryservicesforsupportingmostcommonsharingmodels,bothforcentralizedanddecentralizedenvironments:

• Collections:producers(TAXIIServers)canhostasetofCTIthatcanbe,in turn, requested by consumers (TAXII Clients). Information isexchangedinarequest-responsemanner.

• Channels:usedbyproducers topushdata todifferent consumers, andbyconsumerstoreceivedata fromproducers.Channelsarewellsuitedfor publish/subscribe sharing models, where consumers performsubscriptionoperationsoverproducers,toreceivespecificCTI.

Channels,CollectionsandrelatedfunctionalitiescanbegroupedtogethertoformanAPIRoot (TAXII Servers couldhostmanyAPIRoots), allowingadivisionofcontentandaccesscontrolrulesbytrustgroupsoranyotherkindofgrouping.Asimple example, to better understand this concept, is given by a TAXII Server,whichcouldhosttwoAPIRoots,oneusedby“TrustGroupA”andtheotherby“TrustGroupB”.TAXII2.0definestwowaysforallowingTAXIIClientstoidentifyTAXIIServers.The first isanetworkleveldiscovery,whichallowsthe latter toadvertise theirlocationwithinanetwork.Thesecond,instead,usesaDiscoveryEndpoint,whichidentifiesanURLandanHTTPmethodwithadefinedrequestandresponse,forenablingauthorizedclientstogatherinformationabouttheserver.

4646

D4.1

Authenticationandauthorizationisimplementedasdefinedin[FIE14],throughthe Authorization and WWW-Authenticate HTTP Header respectively. HTTPBasic Authentication [RES15] is themandatory schema to implement in TAXII2.0.Anyway,otherauthenticationschemescanalsobesupported.Another important feature is related to the possibility of customizing thisstandard, adding new properties, in order to improve information exchange.Specific naming conventions should be followed for every custom property,beingcarefultoletalltheinvolvedpartiesawareofthesemodifications,toavoidprocessingproblemswhenthisdataisreceived.TAXIIinDiSIEMcontext.AfterabriefTAXII2.0description,inthissectionwillbeinferredifitsusagecouldbeavaluableadditionforDiSIEMproject.In theprevioussection, ithasbeenstatedthatTAXII2.0 is theactualstandardused for exchanging cyber threat intelligence represented using STIX 2.0standard.Thisisactually truebut,however, therearestillsomedisadvantages,forhowconcernitsusage,especiallyconsideringDiSIEMcontext.An importantdrawback regards the lackof availableopensource librariesandutilities,forhelpingdeveloperstoimplementTAXIIClientsandServers.Wecanstate that TAXII 2.0 is notmature enough, differently from STIX 2.0, from thispointofview.Besides,SIEMsdonotsupportyetTAXII2.0protocol,andoneoftheDiSIEMarchitectureprinciples[DIS]affirmsthattheyshouldnotbemodifieddue to our extensions and no additional or significantmanualwork should berequiredtooperatewiththem.So, considering the high number ofmandatory specifications to implement forbuilding a TAXII 2.0 service from scratch,without the possibility of relyingonexternal open source libraries and utilities, we decided on thinking aboutalternative solutions for exchanging STIXdata, in order to inject the output ofDiSIEMOSINT-basedcomponentsintoSIEMs.Inconclusion,STIXdataareexpressedthroughJSONformat,so,forDiSIEMusecases, itcouldbebetter toconsider interchangemethodsalreadysupportedbySIEMs(e.g.,Syslog,LogStash),mentionedmoreindetailsintheIntegrationPlanrelatedto[DIS],whichsupportJSONingestion.

5.2 Integrating OSINT data with infrastructure events In previous chapters, different approaches for OSINT data fusion and analysishavebeendescribed.ThefinalobjectiveistointegratetherelevantsecuritydatacomingfromthesepublicsourceswithdatagatheredfromtheinfrastructurebytheSIEMstoanticipateandimprovethethreatdetection.In this context, it arises the need of a component that considers what ishappening inside the monitored infrastructure providing a threat score forincomingOSINTdatathathelpstoidentifyitsrelevanceandpriority.Thisthreatscore will complement the usage of static information about the monitored

4747

D4.1

infrastructurewithdynamicandreal-timethreatintelligencedatareportedfrominsidetheownmonitoredinfrastructureinthewayofIoCs.Enteringmore indetail,weneeda componentable toperform this correlationconsideringstaticandreal-timeinformation,suchasIndicatorsofCompromise,related to the monitored infrastructure and data coming from OSINT sourcesthrough other DISIEM OSINT data fusion and analysis tools, for checking therelevance of the latter flowdependingon the former. This dynamic evaluationwillbebasedonheuristicanalysiswhichallowsdeterminingthepriorityoftheincomingOSINTdata, assigning a threat score to it. The details abouthow thescoreiscalculatedwiththisparticularmethodwillbeexplainedintheremainingofthischapter.The final STIX object integrating the information received from OSINT datasourceswithitscalculatedthreatscorefortheinfrastructure,canbesentdirectlyto theSIEMs for itsvisualization, storageorprocessing,orbe sentback to theDiSIEMOSINT-basedcomponentsasa feedback, inordertorefinethemachinelearning algorithms with relevant information based on real-time analysis,improving threatdetectionandprediction.Thiswill allowachievinga context-awareOSINTdataanalysis.

5.2.1 Architecture proposal Theproposedarchitecture,depicted inFigure10, iscomposedof the followingmodules: (i) the Entry Point, which obtains information coming frommultiplesources(e.g.,OSINTdata,infrastructure,IoCs,etc.),tobeusedinthethreatscoreanalysisperformedbytheheuristicsengine;(ii)theDatabasethatwillstoretheinformationoftheinfrastructureandtheOSINTdatacollector;(iii)theHeuristicsEngine, which will compute a threat score based on the information receivedfromtheinfrastructureandtheOSINTdatacollector;(iv)theThreatScoreAgent,thatwillbuildthefinalIoCobjectwiththeobtainedresultandwillshareitandinteractwiththeSIEMs.Entry Point: this module will be responsible of capturing useful data fromOSINT, IoCs and the infrastructure in order to evaluate the set of pre-definedheuristicsandtocomputeathreatscore.Theentrypointwillseparatetheinputdata into two main groups: infrastructure data and OSINT data. The formerneedstobestoredinthedatabase,whereasthelattercouldbedirectlyusedbytheenginewithoutstoringit.Heuristics Engine: It will be mainly responsible of using the input data (e.g.,context information, features) coming from the infrastructure in the analysisprocess. The latter considers a set of conditions that are evaluated for everysingle feature.Ascore(eitherpositiveornegative) isassignedtoevery feature(i.e., individual score). The sumof all individual scores results into theThreatScoreassociatedtothedatabeinganalysed.

4848

D4.1

Figure10-context-awareOSINTDataArchitecture

Database: Informationreceivedfromtheinfrastructureneedstobestoredinadatabase to be used later by the engine or the agent. The data received fromOSINTsourcesthatdonotneedtobestoredinthedatabasecanbeimmediatelysenttotheHeuristicEnginefortheiranalysis.Threat Score Agent: It will be mainly responsible for the generation of theresulting Indicator of Compromise, including the threat score for securityinformation received fromOSINTdata sources.This IoC thatwillbe sharedbythiscomponentwouldincludethesameinformationreceivedfromOSINT(JSONfollowing STIX format) but adding the threat score as well as the featuresconsidered in the evaluation. This module will provide the interfaces used tointeract with the SIEMs or any other component interested in this score andother useful information related to it. In addition, this module will interactdirectlywiththedatabase,whentoretrieveadditionalinformationrelatedtotheheuristicorthedatabeinganalysed.

5.3 Context-aware Threat Score

5.3.1 Heuristics-based threat score Theheuristics-basedthreatscore iscomposedofasetofindividualscoresthatcouldbeusedincomplementwithotherDiSIEMpredictiontoolstoindicatethepriority and relevance for the infrastructure of incoming security informationreceivedfromOSINTdatasources.DifferentaggregationtechniquescanbeusedforthecomputationoftheThreatScorefromthesimplestwayperformedasthesumof all individual scores (score assigned to each heuristic) using followingEquation1,tomoresophisticatedonesusingorderedweightedaveraging(OWA)operators [Jos12] or Weighted Ordered Weighted Aggregation (WOWA)operators[Ern06].

𝑇ℎ𝑟𝑒𝑎𝑡'()*+ = ∑ 𝑆𝑐𝑜𝑟𝑒12134 (1)

4949

D4.1

Dependingontheinformationthatisavailablefromboth,theinfrastructureandthethreat intelligencereceivedfromOSINTdatasource, itwillbeanalysedthebestaggregationmethodforcalculationofthefinalthreatscore.Considering,forinstance,thatoneofthefeaturestobeevaluatedisthepresenceofaCommonVulnerabilityExposure–CVE[MIT]– identifier in the inputdata,the engine will check if the word ‘CVE’ appears in the input data in order toretrieve the complete CVE number composed of the publication year and theidentificationnumber(i.e.,CVE-AAAA-NNNN).IfaCVEisfound,theenginethenchecksforitsassociatedCommonVulnerabilityScoringSystem(CVSS)[FOR17],more specifically, the engine will search for its associated base score, whichconsiders access vector, access complexity, authentication, and impact relatedinformationbasedonavailability,confidentialityandintegrity.DependingontheCVSSscore,thevulnerabilityislabelledasnone,low,medium,highorcritical,asshowninTable5.

Severity None Low Medium High CriticalLowerbound 0.0 0.1 4.0 7.0 9.0

Upperbound 0.0 3.9 6.9 8.9 10.0(Source:https://www.first.org/cvss/specification-document)

Table5-CVSSv3Ratings.

Each evaluated feature is assigned an individual score based on the definedthreshold(e.g., from0to5) thatwill indicatethe levelof impactof the featurewithrespecttotheevent.Thefollowingexampleillustratesthisassignment.Wedefinethevariable“Score_CVE”thatwillcomputetheindividualscorevalueassigned to the presence of a CVE in the input data based on the conditionsdescribedinTable6.Other features (e.g., source IP, created by, valid until, etc.) may use positiveand/or negative values in the assignment process. Such individual values arethentunedinthetrainingandcalibrationprocessessothatthefinalthreatscorereducesthenumberoffalsepositivesandnegatives.

Evaluation Condition Score_CVEEvaluateifthereisnotaCVEintheinputdata IfCVE==‘’ 0EvaluateifthereisCVEintheinputdatawithCVSS=‘none’or0.0

If CVE != ‘ ’ & CVSS =´none´|CVSS=0.0

1

EvaluateifthereisCVEintheinputdatawithCVSS=‘low’orlessthan4.0

IfCVE!=‘’&CVSS=´low´|CVSS<4.0

2

EvaluateifthereisCVEintheinputdatawithCVSS=‘medium’orlessthan7.0

If CVE != ‘ ’ & CVSS =´medium´|CVSS<7.0

3

EvaluateifthereisCVEintheinputdatawithCVSS=‘highorlessthan9.0

If CVE != ‘ ’ & CVSS =´high´|CVSS<9.0

4

EvaluateifthereisCVEintheinputdatawithCVSS=‘critical’orlessthan10.0

If CVE != ‘ ’ & CVSS =´critical´|CVSS<10.0

5

Table6-IndividualThreatScore.

5050

D4.1

5.3.2 Threat Score Methodology Thethreatscoreevaluationusesaheuristicanalysismethodologycomposedofthefollowingsteps:

1. Source Identification: Input data may come from different sources,

therefore, during this phasewe need to search and identify all possiblesources of information for our tool. Examples of these sources are:securitylogs,databases,reportdata,OSINTdatasources,IoCs,etc.

2. Heuristics Identification: Different features (e.g., heuristics) can beidentified from the input data. Such features must provide relevantinformation about the infrastructure (e.g., vulnerabilities, events, faults,errors, etc.) that couldbeuseful in the threatanalysis andclassificationprocess. Examples of heuristics are: CVE, IP source, IP destination, portsource,portdestination,timestamp,etc.

3. Threshold Definition: For each heuristic, we need to defineminimumand maximum values that could be assigned based on characteristicsassociated to the instance.We can check, for instance, if the input datacontainsornotaCVEforthedetectedthreat.Athresholdof(e.g.,0–5)canbeassignedtocoverallpossibleresultsdescribedinTable6-.

4. Score Computation: For each possible instance of the identifiedheuristic, a score value is assigned based on expert knowledge. Scoresassociatedtoeachheuristiccanbeeitherpositiveornegative,dependingon the impact of the selected heuristic. All individual scores are thensummed up and a final score is computed. The resulting value willindicate the priority and relevance of the security information comingfromOSINTdatasourcesforthemonitoredinfrastructure.

5. TrainingPeriod:Weneedtoperformasetofpreliminarytests(duringatraining process) to evaluate the performance of the engine. The testsshould include real data so that we can analyse the score obtainedindividually (for each heuristic) and globally (for thewhole event). Thehigher the number of tests during this phase, the better for the tool toevaluatefalsepositivesand/ornegatives,andtoavoiddeviations.

6. EngineCalibration:Sincepreliminarytestsarebasedontheassessmentmadebyexpertknowledge,weneedtominimizedeviations(e.g.,reducenumberoffalsepositive,falsenegative)byanalysingtheobtainedresults,addingotherheuristicsand/ormodifyingtheassignedvaluestocurrentattributes. It ispossible thatduringthetrainingprocess,werealizethatinsteadofgivingavalueof3or5toCVEswithmediumandhighimpactbasescoresweshouldgiveavalueof2and3respectively.

7. FinalTests:Oncetheengineiscalibrated,wecanrepeatprevioustestsoraddnewonesinordertoevaluatetheperformanceofourtool.

5151

D4.1

5.3.3 Preliminary analysis of heuristics features Thetwomaininputsofourcontext-awareOSINTdataanalyserarethefollowing:

• Security information coming from OSINT data-sources provided byDigitalMRplatformand/orFCiências.IDOSINT-basedcomponent.

• IoCscomingfromthemonitoredinfrastructure.

Several features coming from each of those sources can be considered in thethreat score evaluation. Similar features can be merged into a more enrichedgroupofheuristicsandasub-scorecouldalsobeassignedtothegroupsothatitsimpact canbeanalysedaccordingly. Someexamplesofpotential features tobeconsideredbyourcontext-awareOSINTdataanalyseraredescribedasfollows:

• Externalreferencestonon-STIXinformation,usedforabetterdescriptionof the threat, such as CVE for vulnerabilities and CAPEC for attackpatterns

• ValiditytimeintervaloftheSTIXObject• Identityofthreatactorsorentitywhoareactuallybeingtargeted• IPs or domain names included in IoCs and reported as source of some

incidentdetectedintheinfrastructure• TypeoftheactivityrelatedtoaparticularIoC(e.g.,malicious,anomalous,

benign)• Phaseofthekillchainwherethethreatwasdetected

Thesefeatures,andothers,willbestoredandusedbytheHeuristicEnginewhenrequested the assignment of a threat score to a specific IoC. New dynamicfeatures could be added in the future to the heuristic analysis to improve theevaluationperformed.As described in Section 5.1, in DiSIEM we are going to consider that all theincomingthreatinformationisexpressedthroughtheSTIX2.0standard,inJSONformat,andthisassumptionisvalidforboththeinputflows.Toperformthethreatscoreassignment,andbasedontheaforementionedinputdata,wedecidedtostarttheanalysisfocusingonthefollowingdynamicfeaturesfromtheSDOsdefinedinSTIX2.0:

• Indicators:usedfordetectingmaliciousactivities;• Vulnerabilities: detailed information about known vulnerabilities

which are related to some relevant assets of the monitoredinfrastructure;

• Attack Patterns: used for describing specific properties related tovariousattacks;

• Tools: information about tools that could be used for performing aspecificattack;

• Threat Actors: information aboutmalicious entitieswho are behindspecificmaliciousactivities;

5252

D4.1

• Identities:detailedpersonalinformationaboutbothmaliciousandnotmaliciousentities.

TheselectionofthisinitialsetofobjectscanbelaterextendedtomorecomplexSTIXDomainObjectssuchasIntrusionSet,CampaignandReportObjectsincasesome of them are relevant for the DiSIEM validation use cases or pilotdeployments.MalwareandCourseofActionsObjectshavenotbeenconsideredbythemomentbecauseintheactualSTIXversion2.0theyarestillstubs.ConcerningObservedDataObject,theycouldbesentbyaSIEMwithdatarelatedtowhatitismonitoringandbeusedformatchingspecificpatternsexpressedinSTIXIndicatorsreceivedfromOSINTdatasources.However,thiswouldtranslateintotheimplementationofthreatdetectioncapabilitiesinthiscomponentwhichisnotitspurpose.Thisoperationisoutofthescopeofthiscomponent,becauseitshoulddealwithintelligencereadytobeused,therefore,thiskindofSTIXObjectisnotconsideredinthethreatscoreevaluation.ComingbacktoSDOsthat,instead,willbeusedforthethreatscoreassignment,we had alsomade a preliminary identification of some features that could beinteresting tobeused for theheuristicsanalysis.DifferentSDOsactuallysharecommonproperties, sowe started studying theseones.Then,weproceeded toconsideralsospecificpropertiesforeachoftheseobjects.Common properties: SDOs are characterised by some common properties,whichareusedfordescribingrelatedfeatures.Theinitialsetofthemthatwillbetakeninconsiderationiscomposedbythefollowingones:

• Type: indicates the type of the SDO (e.g., indicator, vulnerabilities,tools);

• Created:timestampthatindicatedwhentheSDOwascreated;• Modified:timestampthatindicatedthelastupdatemadeontheSDO;• Revoked:timestampthatindicatedwhentheSDOwasrevoked;• External_references: refers tonon-STIX information,used foramore

accuratedescriptionof theobject (e.g., CVE forvulnerabilities,CAPECforattackpatterns).

Indicator:someinterestingspecificpropertiesofthisSDOare:

• Labels: open-vocabulary field that indicates the kind of the detectedactivity. Possible values could be “anomalous-activity”, “malicious-activity”,“anonymization”and“benign;”

• Pattern: detection pattern for this indicator. For example, it couldcontainasetofmaliciousIPaddressesordomainnames;

• Valid_from:timefromwhichthisobjectshouldbeconsideredvaluableintelligence;

• Valid_from_precision:precisionoftheprevioustimestamp;• Valid_until:timeuntilwhichthisobjectshouldbeconsideredavaluable

intelligence;• Valid_until_precision:precisionoftheprevioustimestamp;

5353

D4.1

• Kill_chain_phase: phases of the kill chain40 where the attack wasdetected.

Vulnerability: no specific properties for this SDO have been consideredinterestingforbeingincludedintheinitialset,apartfromthecommononesAttack Pattern: regarding this SDO, just one specific property has beenconsideredinteresting:

• Kill_chain_phase:phasesofthekillchainwheretheattackwasdetected.

Tool: for how concern Tool SDO, the list of the interesting properties iscomposedbythefollowingones:

• Label: open-vocabulary field that indicates the kind of tool considered.Possible values could be “denial-of-service “vulnerability-scanning” and“remote-access”, “privilege-escalation”, “password-cracking”, “password-sniffing”,“memory-analysis”,“reconnaissance;”

• Kill_chain_phase:phasesof thekillchainwheretheattackthat isusingthistoolwasdetected;

• Tool_version:versionofthetool.

Threat Actor: interesting specific properties of Threat Actor SDO are thefollowing:

• Labels:open-vocabularyfieldthatindicatesthetypeoftheThreatActor.Possiblevaluescouldbe“criminal”,“hacker”,“spy”and“terrorist;”

• Aliases:listofothernamesthatthisactorcoulduse;• Roles:open-vocabularyfieldthatindicatesalistofrolesthattheThreat

Actorcouldplay.Someexamplescouldbe“agent”and“director;”• Goals:highlevelgoalsoftheThreatActor;• Sophistication: open-vocabulary field which represents skills, training,

expertiseoftheactor.Someexamplescouldbe“minimal”,“intermediate”,“advanced;”

• Resource_level:open-vocabularyfieldthatrepresentstheorganizationallevel atwhich the actorworks,which, in turn, determines the resourceavailable for the attack. Some example could be “individual”, “team”,“government;”

• Primary_motivation: open-vocabulary field that represents themotivation of the Threat Actor. Some examples could be “accidental”,“dominance”, “personal-satisfaction”, “revenge”, “industrial-espionage”,“sabotage”,“hacktivism”,“data-theft;”

• Secondary_motivation:sameconsiderationsasPrimary_motivation;• Personal_motivation:sameconsiderationsasPrimary_motivation.

Identity: last SDO considered. The set of interesting specific properties iscomposedbythefollowingones:40https://en.wikipedia.org/wiki/Kill_chain

5454

D4.1

• Labels: list of roles that this Identity performs (e.g., CEO, Domain

Administrator,Doctor).Noopen-vocabularydefinedforthisproperty;• Identity_class:open-vocabularyfieldthatindicatesthetypeofentitythat

this Identity describes. Some examples could be “individual”,“organization”,“unknown”and“group;”

• Sectors: open-vocabulary field, which represents the list of industrysectorsthatthisIdentitybelongsto.Someexamplescouldbe“aerospace”,“automotive”,“defense”and“financial-services;”

• Regions:listofregionsorgeographiclocationsthisIdentityislocatedoroperatesin.

These features, when available from the IoCs coming from the monitoredinfrastructure, will be used in the training period of the heuristic analysis. Asnextsteps,thresholddefinitionandscorecomputationwillbeperformedbyeachofthemandnewfeatureswillbeidentified,inordertocalibratetheengineandrefine our tool, with the aim of improving the overall procedure used forassigning the threat score to security information coming from OSINT-datasources.Moreprecisely, otheralreadyexistingSDOsproperties couldbe considered, aswellas,thankstotheextensibilityofSTIXstandard,specificcustomproperties,notyetdefinedinthestandarditself,thatcouldbecreatedad-hoc,followingSTIXnaming guidelines described in the official documentation [OAS17], forrepresentingparticularfeaturesthatcouldimproveourheuristicanalysis,Thesetasksoftrainingandenginecalibrationwillbedoneinclosecollaborationwiththepartners involved intheDISIEMpilotdeploymentsandthe implementationoftheOSINTdatafusionandanalysistools.TheresultswillbeincludedinnextdeliverableD4.2-OSINTdatafusionandanalysisarchitecture.

5555

D4.1

6 Summary and Conclusions Thisdeliverablepresentsanin-depthanalysisofsecurity-relatedOSINTsources,whichcanbeclassifiedasstructuredorunstructured.Additionally,thedarkwebisconsideredaspecialclassduetoitsspecificcharacteristicsandrequirements.These classes differ mostly in the type of content, in the format of theinformation and in the tools required to extract it. Depending on the type ofsource, information is collected by means of parsers and crawlers (possiblycustom-built),byavailableAPIsorbyspecializedcommercialservices.AcompletelistofOSINTsourcesspecifiedbytheDiSIEMindustrialpartnershasbeencompiled. Information from these sources isbeing collected,which formsthebasisforvariouscase-studiesregardingtheprocessingofOSINTtointegraterelevantinformationintotheSIEMs.The literature review revealed that mostwork that uses OSINT in a security-relatedcontextisrelatedtocollectinginfrastructure-specificinformation;tothecollection and extraction methodologies; to the correlation of user behaviourwithOSINT; to feedprotectionsystemswithOSINT; to thegatheringof exploitdata;andtoblack-listedIPS.Regarding the processing and analysis of OSINT, there are some open-sourcegeneralpurposetoolsthatcanbeextendedforthatpurpose.Thealternativesarepaidtoolsandsecurity-relatednewsfeeds.The ongoing work on the models and techniques to process OSINT is firstlydescribedinthisdeliverable.Thiswillbethemainthemeofthenextdeliverableinworkpackage4.Finally, the deliverable provides the first insights on how the relevant OSINTdata can be merged with infrastructure-related IoCs, and communicated andsharedbetweensoftwarecomponentsandtheSIEM.

5656

D4.1

References [ALI16]AlienVault. (2016).AlienVaultOpenThreatExchange (OTX )TMUserGuide,1–44.[ALQ16] S.S. Alqahtani, E.E. Eghan, and J.Rilling. Tracing known securityvulnerabilities in software repositories–a semantic web enabled modelingapproach.ScienceofComputerProgramming,121:153–175,2016.[AND17]Andongabo,A.&Gashi,I.(2017).vepRisk-AWebBasedAnalysisToolforPublicSecurityData.13thEuropeanDependableComputingConference,4-8Sep2017,Geneva,Switzerland.[CAM13] Rodrigo Campiolo, Luiz Arthur F. Santos, DanielMacêdo Batista, andMarcoAurélioGerosa.2013.EvaluatingtheUtilizationofTwitterMessagesAsaSource of Security Alerts. In 28th Annual ACM Symposium on AppliedComputing.[CAP] “CAPEC™ Common Attack Pattern Enumeration and Classification,”[Online].Available:https://capec.mitre.org/.[CEB10] Cebula, J. J., & Young, L. R. (2010). A Taxonomy of Operational CyberSecurity Risks. Carnegie-Mellon Univ Pittsburgh Pa Software Engineering Inst,(December),1–47.[CHA16]C.-Y.Chang,Z.Teng,andY.Zhang.Expectation-regulatedneuralmodelforeventmentionextraction.InProceedingsofNAACL-HLT,pages400–410,2016.[CON17]FengyuCong,AndrewLeung,andQinglaiWei.AdvancesinNeuralNetworks-ISNN.Springer,2017.[COR16] A. Correia, P. Ferreira, and A. Bessani. Descoberta de ameaças desegurançaatravésdotwitter.InINForum,2016.[CYB] “Cyber Observable eXpression (CybOX™) Archive Website,” [Online].Available:https://cyboxproject.github.io/.[DAN07] R. Danyliw, J. Meijer and Y. Demchenko, “The Incident ObjectDescription Exchange Format,” December 2007. [Online]. Available:https://www.ietf.org/rfc/rfc5070.txt.[DIS]“DiSIEMD2.2–Referencearchitectureandintegrationplan”,2017.[DIS17]“DiSIEMD2.1In-depthanalysisofSIEMsextensibility”,2017.[ENI17]Enisa. (2017). IncidentHandlingAutomation.Retrieved June29,2017,from https://www.enisa.europa.eu/topics/csirt-cert-services/community-projects/incident-handling-automation.

5757

D4.1

[Ern06]ErnestoDamiani,SabrinaDeCapitanidiVimercati,PierangelaSamarati,Marco Viviani, S. D. (2006). A WOWA-based Aggregation Technique on TrustValuesConnectedtoMetadata.ElectronicNotesinTheoreticalComputerScience,131-142.[ESP16]A.d.M.DelEsposte,R.Campiolo,F.Kon,andD.Batista.Acollaborationmodel to recommend network security alerts based on the mixed hybridapproach. In Simpósio Brasileiro de Redes de Computadores e SistemasDistribuídos,2016.[EDK15]M.Edkrantz, S.Truvé, andA. Said.Predictingvulnerabilityexploits inthe wild. In Cyber Security and Cloud Computing (CSCloud), 2015 IEEE 2ndInternationalConferenceon,pages513–514.IEEE,2015.[ERK15]Y.Erkal,M.Sezgin,andS.Gunduz.Anewcybersecurityalertsystemfortwitter. In 2015 IEEE14th International Conference onMachine Learning andApplications(ICMLA),pages766–770.IEEE,2015.[FAR13] SANS, "Tools and Standards for Cyber Threat Intelligence Projects,"https://www.sans.org/reading-room/whitepapers/warfare/tools-standards-cyber-threat-intelligence-projects-34375,2013.[FIE14] R. Fielding and J. Reschke, “Hypertext Transfer Protocol (HTTP/1.1):Authentication,” Internet Engineering Task Force (IETF), 2014. [Online].Available:https://www.rfc-editor.org/rfc/pdfrfc/rfc7235.txt.pdf.[FOR17] Forum of Incident Response and Security Teams.: CommonVulnerability Scoring System v3.0 Speci_cation Document, Technical Paper,available at: https://www.first.org/cvss/specification-document, last accessedonJuly2017.[FRA15] F. Fransen, A. Smulders and R. Kerkdijk, “Cyber security informationexchange to gain insight into the effects of cyber threats and incidents,” E & IElektrotechnikundInformationstechnik132(2):106–112,2015.[Jos12] Gil-Lafuente, José M. Merigó and Anna M. (2012). Decision-makingtechniques with similarity measures and OWA operators, Sort (Statistics andOperationsResearchTransactions),36,81-102[JEN16]Jenhani,F.,Gouider,M.S.,&Said,L.Ben.(2016).AHybridApproachforDrugAbuseEventsExtractionfromTwitter.ProcediaComputerScience,96(September),1032–1040.https://doi.org/10.1016/j.procs.2016.08.121[JON15]C.L.Jones,R.A.Bridges,K.M.Huffer,andJ.R.Goodall.Towardsarelationextractionframeworkforcyber-securityconcepts.InProceedingsofthe10thAnnualCyberandInformationSecurityResearchConference,page11.ACM,2015.

5858

D4.1

[KAL14]NalKalchbrenner,EdwardGrefenstette,andPhilBlunsom.Aconvolutionalneuralnetworkformodellingsentences.arXivpreprintarXiv:1404.2188,2014.[KAM14]P.Kampanakis, "SecurityAutomationandThreat Information-SharingOptions,"IEEESecurity&Privacy,vol.12,no.5,pp.42-51,2014.[KER15] D. Kergl. Enhancing network security by software vulnerabilitydetection using social media analysis extended abstract. In Data MiningWorkshop(ICDMW),2015IEEEInternationalConferenceon,pages1532–1533.IEEE,2015.[KIM14]YoonKim.Convolutionalneuralnetworksforsentenceclassification.arXivpreprintarXiv:1408.5882,2014.[KÜH14]Kührer,M.,Rossow,C.,&Holz,T.(2014).Paintitblack:Evaluatingtheeffectivenessofmalwareblacklists.[LES14]J.Leskovec,A.Rajaraman,andJ.D.Ullman.Miningofmassivedatasets.CambridgeUniversityPress,2014.[LIA16]X.Liao,K.Yuan,X.Wang,Z.Li,L.Xing,andR.Beyah.Acingtheiocgame:Toward automatic discovery and analysis of open-source cyber threatintelligence. In Proceedings of the 2016ACMSIGSACConference on ComputerandCommunicationsSecurity,pages755–766.ACM,2016.[LIB16] LIBERATORE, S. (2016). What happens in an internet second: 54,907Google searches, 7,252 tweets, 125,406 YouTube video views and 2,501,018emails sent. DailyMail, [online] p.1. Available at:http://www.dailymail.co.uk/sciencetech/article-3662925/What-happens-internet-second-54-907-Google-searches-7-252-tweets-125-406-YouTube-video-views-2-501-018-emails-sent.html[Accessed4Jul.2017].[LIU15]Y.Liu,A.Sarabi,J.Zhang,P.Naghizadeh,M.Karir,M.Bailey,andM.Liu.Cloudywith a chance of breach: Forecasting cyber security incidents. In 24thUSENIXSecuritySymposium(USENIXSecurity15),pages1009–1024,2015.[MAR16] Marinos, L. (2016). ENISA threat taxonomy: A tool for structuringthreat information. Initial report., (January), 1–24. Retrieved fromhttps://www.enisa.europa.eu/topics/threat-risk-management/threats-and-trends/enisa-threat-landscape/etl2015/enisa-threat-taxonomy-a-tool-for-structuring-threat-information[MAT12] M. L. Mathews, P. Halvorsen, A. Joshi, and T. Finin. A collaborativeapproachtosituationalawarenessforcybersecurity.InCollaborativeComputing:Networking, Applications and Worksharing (CollaborateCom), 2012 8thInternationalConferenceon,pages216–222.IEEE,2012.

5959

D4.1

[MCA] McAfee, “Data Exchange Layer”, [Online]. Availablehttps://www.mcafee.com/us/solutions/data-exchange-layer.aspx.[MCN13]N.McNeil,R.A.Bridges,M.D. Iannacone,B.Czejdo,N.Perez,andJ.R.Goodall. Pace: Pattern accurate computationally efficient bootstrapping fortimely discovery of cyber-security concepts. In Machine Learning andApplications (ICMLA),201312th InternationalConferenceon, volume2,pages60–65.IEEE,2013.[MIL11]B.A.Miller,M. S.Beard, andN.T.Bliss.Eigenspaceanalysis for threatdetectioninsocialnetworks.InInformationFusion(FUSION),2011Proceedingsofthe14thInternationalConferenceon,pages1–7.IEEE,2011.[MIT] MITRE, “Common Vulnerability and Exposures. The Standard forInformation Security Vulnerability Names”, MITRE Website,https://www.mitre.org/[MIT16]S.Mittal,P.K.Das,V.Mulwad,A.Joshi,andT.Finin.Cybertwitter:Usingtwitter to generate alerts for cybersecurity threats and vulnerabilities. InInternational Symposium on Foundations of Open Source Intelligence andSecurityInformatics.IEEEComputerSociety,2016.[MOO96] T. K. Moon. The expectation-maximization algorithm. IEEE Signalprocessingmagazine,13(6):47–60,1996.[MOR12] S. More, M. Matthews, A. Joshi, and T. Finin. A knowledge-basedapproach to intrusion detectionmodeling. In Security and PrivacyWorkshops(SPW),2012IEEESymposiumon,pages75–81.IEEE,2012.[MUL11] V. Mulwad, W. Li, A. Joshi, T. Finin, and K. Viswanathan. Extractinginformationaboutsecurityvulnerabilitiesfromwebtext.InWebIntelligenceandIntelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM InternationalConferenceon,volume3,pages257–260.IEEE,2011.[NER09]F.NeriandP.Geraci.Miningtextualdatatoboostinformationaccessinosint. In 2009 13th International Conference Information Visualisation, pages427–432.IEEE,2009.[NUN16]E.Nunes,A.Diab,A.Gunn,E.Marin,V.Mishra,V.Paliath,J.Robertson,J.Shakarian,A.Thart,andP.Shakarian.Darknetanddeepnetminingforproactivecybersecurity threat intelligence. In Intelligence and Security Informatics (ISI),2016IEEEConferenceon,pages7–12.IEEE,2016.[OAS17]OASIS,“STIX™Version2.0.Part1:STIXCoreConcepts,”2017.[Online].Available: https://docs.oasis-open.org/cti/stix/v2.0/stix-v2.0-part1-stix-core.pdf.

6060

D4.1

[OAS171] OASIS, “STIX™ Version 2.0. Part 2: STIX Objects,” 2017. [Online].Available: https://docs.oasis-open.org/cti/stix/v2.0/stix-v2.0-part2-stix-objects.pdf.[OAS172]OASIS, “STIX™Version2.0.Part3:CyberObservableCoreConcepts,”2017. [Online]. Available: https://docs.oasis-open.org/cti/stix/v2.0/stix-v2.0-part3-cyber-observable-core.pdf.[OAS173] OASIS, “STIX™ Version 2.0. Part 4: Cyber Observable Objects,” 2017.[Online]. Available: https://docs.oasis-open.org/cti/stix/v2.0/stix-v2.0-part4-cyber-observable-objects.pdf.[OAS174] OASIS, “STIX™ Version 2.0. Part 5: STIX Patterning,” 2017. [Online].Available: https://docs.oasis-open.org/cti/stix/v2.0/stix-v2.0-part5-stix-patterning.pdf.[OAS175] OASIS, “TAXII™ Version 2.0,” 2017. [Online]. Available:https://docs.oasis-open.org/cti/taxii/v2.0/taxii-v2.0.pdf.[OPE]“OpenIOC-AnOpenFrameworkforSharingThreatIntelligence,”[Online].Available:http://www.openioc.org/.[RES15] R. J., “The 'Basic' HTTP Authentication Scheme,” 2015. [Online].Available:https://www.rfc-editor.org/rfc/pdfrfc/rfc7617.txt.pdf.[RIT15] A. Ritter, E. Wright, W. Casey, and T. Mitchell. Weakly supervisedextractionofcomputersecurityevents fromtwitter. InProceedingsof the24thInternationalConferenceonWorldWideWeb,pages896–905.ACM,2015.[ROS10] Rossow, C., Czerwinski, T., Dietrich, C. J., & Pohlmann, N. (2010).DetectingGrayinBlackandWhite.MITSpamConference.[SAB15]C.Sabottke,O.Suciu,andT.DumitraÈ™.Vulnerabilitydisclosureintheageofsocialmedia:exploitingtwitterforpredictingreal-worldexploits.In24thUSENIXSecuritySymposium(USENIXSecurity15),pages1041–1056,2015.[SAN13]L.A.F.Santos,R.Campiolo,M.A.Gerosa,D.M.Batista,andC.Mourao-PR-Brasil.Detecçãodealertasdesegurançaemredesdecomputadoresusandoredes sociais. In Simpósio Brasileiro de Redes de Computadores e SistemasDistribuídos,2013.[SAU17]C.Sauerwein,C.Sillaber,A.MussmannandR.Breu,"ThreatIntelligenceSharing Platforms: An Exploratory Study of Software Vendors and ResearchPerspectives," 13th International Conference on Wirtschaftsinformatik(WI2017),S.837-851,2017.[SHA15]Shackleford,D.(2015).Who’sUsingCyberthreatIntelligenceandHow?SANS.

6161

D4.1

[SIN08] Sinha, S., Bailey, M., & Jahanian, F. (2008). Shades of Grey: On theeffectiveness of reputation based black-lists. Proceedings of the InternationalConferenceonMaliciousandUnwantedSoftware(Malware),57–64.[STR]“StructuredThreatInformationeXpression(STIX™)1.xArchiveWebsite,”[Online].Available:http://stixproject.github.io/.[TRU1] “Trusted Automated eXchange of Indicator Information (TAXII™) 1.xArchiveWebsite,”[Online].Available:https://taxiiproject.github.io/.[W2V]TomasMikolovandteam.word2vec:VectorRepresentationsofWords(Tensorflow).https://www.tensorflow.org/tutorials/word2vec.Accessed:2017-07-17. [WAN15]PengWang,JiamingXu,BoXu,Cheng-LinLiu,HengZhang,FangyuanWang,andHongweiHao.Semanticclusteringandconvolutionalneuralnetworkforshorttextcategorization.InACL(2),pages352–357,2015.[ZAK14] M. J. Zaki, W. Meira Jr, and W. Meira. Data mining and analysis:fundamentalconceptsandalgorithms.CambridgeUniversityPress,2014.[ZHA11]S.Zhang,D.Caragea,andX.Ou.Anempiricalstudyonusingthenationalvulnerability database to predict software vulnerabilities. In InternationalConference on Database and Expert Systems Applications, pages 217–231.Springer,2011.[ZHU16] Z. Zhu and T. Dumitras. Featuresmith: Automatically engineeringfeaturesformalwaredetectionbyminingthesecurityliterature.InProceedingsofthe2016ACMSIGSACConferenceonComputerandCommunicationsSecurity,pages767–778.ACM,2016.

6262

D4.1

List of Acronyms Acronym DescriptionACDC AdvancedCyberDefenceCenterANN ArtificialNeuralNetworksCVE CommonVulnerabilitiesandExposuresCSIRTs ComputerSecurityIncidentResponseTeamsCVSS CommonVulnerabilityScoringSystemCTI CyberThreatIntelligenceDDOS DistributedDenialofServiceENISA EuropeanAgencyforNetworkandInformationSecurityEM Expectation-MaximizationGRU GatedRecurrentNeuralNetworksHUMINT HumanIntelligenceIODEF IncidentObjectDescriptionandExchangeFormatIoC IndicatorsofCompromiseIPS IntrusionPreventionSystemsIPA JapaneseInformation-technologyPromotionAgencyLDA LatentDirichletAllocationLSTM LongShortTermMemoryNVD NationalVulnerabilityDatabaseISACcouncil NationalCouncilofInformationSharingandAnalysisCenterNLP NaturalLanguageProcessingOpenDXL OpenDataeXchangeLayerOSINT OpenSourceIntelligenceSOC SecurityOperationCenterSOCMINT SocialMediaIntelligenceSaaS SoftwareasaServiceSDO STIXDomainObjectsSRO STIXRelationshipObjectsSTIX StructuredThreatInformationeXpressionSVM SupportVectorMachinesTTPs Tactics,TechniquesandProceduresTC TechnicalCommitteeTF TermFrequencyTABI TrustAssessmentofBlacklistsInterfaceTAXII TrustedAutomatedeXchangeofIndicatorInformationDHS U.S.DepartmentofHomelandSecurity

6363

D4.1

Appendix A – OSINT sources In this Appendix is presented a list of the OSINT sources used by the variouspartnersoftheproject,dividebycategories.Table7presentsthesourcesdividedbycategory,whileSection0presentsTwitteraccounts.NewssitesName SourceDarkReading http://www.darkreading.com/ComputerWorld http://www.computerworld.com/EuropeanUnionAgencyforNetworkandInformationSecurity

https://www.enisa.europa.eu/

SecurityFocus http://www.securityfocus.com/headlines

BlogsName SourceSchneieronSecurity https://www.schneier.com/DanchoDanchev'sBlog http://ddanchev.blogspot.pt/

NetworkSecurityBlog http://www.mckeay.net/Risky.biz https://risky.biz/KaiRoer’sSecurityCultureRamblings

https://roer.com/

IPsforwhitelistsName Sourceawesome-threat-intelligence

https://github.com/hslatman/awesome-threat-intelligence

CiscoUmbrella-UmbrellaPopularityList

http://s3-us-west-1.amazonaws.com/umbrella-static/index.html

IPsforblacklistsName SourceAutoShun.org https://www.autoshun.org/ThreatMiner–DataMiningforThreatIntelligence

https://www.threatminer.org/

Spamhaus https://www.spamhaus.org/SuspiciousDomains https://isc.sans.edu/suspicious_domains.htmlI-Blocklist https://www.iblocklist.com/listsbadips_cyrusauth https://www.badips.com/get/list/cyrusauth/age=1dbadips_squid https://www.badips.com/get/list/squid/?age=1d

security_researchhttp://security-research.dyndns.org/pub/botnet/ponmocup/ponmocup-finder/ponmocup-infected-domains-latest.txt

alienvault https://reputation.alienvault.com/reputation.datalists_blocklist_ssh https://lists.blocklist.de/lists/ssh.txt

badips_apache-overflows https://www.badips.com/get/list/apache-overflows/?age=1d

ci-badguys http://cinsscore.com/list/ci-badguys.txt

6464

D4.1

emergingthreats_compromised-ips

http://rules.emergingthreats.net/blockrules/compromised-ips.txt

lists_blocklist_ircbot https://lists.blocklist.de/lists/ircbot.txt

badips_apache-dokuwiki https://www.badips.com/get/list/apache-dokuwiki/?age=1d

badips_apache-defensible https://www.badips.com/get/list/apache-defensible/?age=1d

badips_Php-url-fopen https://www.badips.com/get/list/Php-url-fopen/?age=1d

nothink_http http://www.nothink.org/blacklist/blacklist_malware_http.txt

badips_qmail-smtp https://www.badips.com/get/list/qmail-smtp/?age=1d

badips_apache-scriddies https://www.badips.com/get/list/apache-scriddies/?age=1d

badips_apache-noscript https://www.badips.com/get/list/apache-noscript/?age=1d

badips_pop3 https://www.badips.com/get/list/pop3/?age=1dbadips_bruteforce https://www.badips.com/get/list/bruteforce/?age=1d

nothink_irc http://www.nothink.org/blacklist/blacklist_malware_irc.txt

badips_pureftpd https://www.badips.com/get/list/pureftpd/?age=1dvirustotal https://www.virustotal.com/vtapi/v2/ip-address/reportdshield http://www.dshield.org/ipsascii.html?limit=10000badips_local-exim https://www.badips.com/get/list/local-exim/?age=1dlists_blocklist_bots https://lists.blocklist.de/lists/bots.txtbadips_proxy https://www.badips.com/get/list/proxy/?age=1dbadips_php-cgi https://www.badips.com/get/list/php-cgi/?age=1dlists_blocklist_imap https://lists.blocklist.de/lists/imap.txtbadips_drupal https://www.badips.com/get/list/drupal/?age=1dbadips_nginx https://www.badips.com/get/list/nginx/?age=1dbadips_dovecot-pop3 https://www.badips.com/get/list/dovecot-pop3/?age=1dbadips_sql https://www.badips.com/get/list/sql/?age=1dbadips_unknown https://www.badips.com/get/list/unknown/?age=1dbadips_proftpd https://www.badips.com/get/list/proftpd/?age=1dbadips_sip https://www.badips.com/get/list/sip/?age=1dbadips_imap https://www.badips.com/get/list/imap/?age=1dbadips_http https://www.badips.com/get/list/http?age=1dmalc0de http://malc0de.com/bl/IP_Blacklist.txtbadips_ftp https://www.badips.com/get/list/ftp/?age=1dbadips_assp https://www.badips.com/get/list/assp/?age=1dbadips_vsftpd https://www.badips.com/get/list/vsftpd/?age=1dlists_blocklist_bruteforcelogin https://lists.blocklist.de/lists/bruteforcelogin.txt

badips_apacheddos https://www.badips.com/get/list/apacheddos/?age=1dbadips_xmlrpc https://www.badips.com/get/list/xmlrpc/?age=1dlists_blocklist_strongIP https://lists.blocklist.de/lists/strongips.txtbadips_postfix https://www.badips.com/get/list/postfix/?age=1dbadips_phpids https://www.badips.com/get/list/phpids/?age=1dbadips_wp https://www.badips.com/get/list/wp/?age=1dlists_blocklist_ftp https://lists.blocklist.de/lists/ftp.txtbadips_sql-attack https://www.badips.com/get/list/sql-attack/?age=1dnothink_ssh http://www.nothink.org/blacklist/blacklist_ssh_day.txtbadips_pureftp https://www.badips.com/get/list/pureftp/?age=1d

6565

D4.1

badips_courierauth https://www.badips.com/get/list/courierauth/?age=1dbadips_plesk-postfix https://www.badips.com/get/list/plesk-postfix/?age=1dbadips_vnc https://www.badips.com/get/list/vnc/?age=1dbadips_dns https://www.badips.com/get/list/dns/?age=1dbadips_exim https://www.badips.com/get/list/exim/?age=1dbadips_ssh https://www.badips.com/get/list/ssh/?age=1dbadips_wordpress https://www.badips.com/get/list/wordpress/?age=1d

zeustracker https://zeustracker.abuse.ch/blocklist.php?download=badips

badips_sasl https://www.badips.com/get/list/sasl/?age=1d

badips_apache-spamtrap https://www.badips.com/get/list/apache-spamtrap/?age=1d

badips_ssh-ddos https://www.badips.com/get/list/ssh-ddos/?age=1dbadips_rdp https://www.badips.com/get/list/rdp/?age=1ddragonForce_VNCPROBE https://dragonresearchgroup.org/insight/vncprobe.txturlvir http://www.urlvir.com/export-ip-addresses/badips_default https://www.badips.com/get/list/default/?age=1ddragonForce_SSH https://dragonresearchgroup.org/insight/sshpwauth.txtbadips_ssh-blocklist https://www.badips.com/get/list/ssh-blocklist/?age=1d

badips_apache-wordpress https://www.badips.com/get/list/apache-wordpress/?age=1d

badips_nginxpost https://www.badips.com/get/list/nginxpost/?age=1dbadips_apache https://www.badips.com/get/list/apache/?age=1d

badips_apache-w00tw00t https://www.badips.com/get/list/apache-w00tw00t/?age=1d

badips_nginxproxy https://www.badips.com/get/list/nginxproxy/?age=1dbadips_sql-injection https://www.badips.com/get/list/sql-injection/?age=1dbadips_cms https://www.badips.com/get/list/cms/?age=1d

feodotracker https://feodotracker.abuse.ch/blocklist/?download=ipblocklist

lists_blocklist_apache https://lists.blocklist.de/lists/apache.txtbadips_w00t https://www.badips.com/get/list/w00t/?age=1dbadips_sshd https://www.badips.com/get/list/sshd/?age=1dbadips_ssh-auth https://www.badips.com/get/list/ssh-auth/?age=1dbadips_courierpop3 https://www.badips.com/get/list/courierpop3/?age=1d

cryptophp_master https://raw.githubusercontent.com/fox-it/cryptophp/master/ips.txt

badips_smtp https://www.badips.com/get/list/smtp/?age=1dbadips_badbots https://www.badips.com/get/list/badbots/?age=1d

badips_apache-nohome https://www.badips.com/get/list/apache-nohome/?age=1d

danger_rulez http://danger.rulez.sk/projects/bruteforceblocker/blist.php

lists_blocklist_mail https://lists.blocklist.de/lists/mail.txt

emergingthreats_botcc http://rules.emergingthreats.net/blockrules/emerging-botcc.rules

turris_greylist https://www.turris.cz/greylist-data/greylist-latest.csvbadips_owncloud https://www.badips.com/get/list/owncloud/?age=1dopenbl https://www.openbl.org/lists/base_30days.txt

badips_username-notfound https://www.badips.com/get/list/username-notfound/?age=1dIPList_IPset https://raw.githubusercontent.com/firehol/blocklist-

6666

D4.1

ipsets/master/firehol_level1.netset

badips_screensharingd https://www.badips.com/get/list/screensharingd/?age=1d

malwaredomainlist http://www.malwaredomainlist.com/updatescsv.phpdragonForce_HTTP https://dragonresearchgroup.org/insight/http-report.txtransomwaretracker https://ransomwaretracker.abuse.ch/feeds/csvbadips_spam https://www.badips.com/get/list/spam/?age=1dlabs_snort http://labs.snort.org/feeds/ip-filter.blfbadips_sshddos https://www.badips.com/get/list/sshddos/?age=1dbadips_ddos https://www.badips.com/get/list/ddos/?age=1dcert http://www.cert.org/downloads/mxlist.ips.txtcruzit http://www.cruzit.com/xwbl2csv.phpbadips_apache-phpmyadmin

https://www.badips.com/get/list/apache-phpmyadmin/?age=1d

badips_postfix-sasl https://www.badips.com/get/list/postfix-sasl/?age=1dlists_blocklist_sip https://lists.blocklist.de/lists/sip.txtbadips_telnet https://www.badips.com/get/list/telnet/?age=1d

badips_dovecot-pop3imap https://www.badips.com/get/list/dovecot-pop3imap/?age=1d

badips_apache-php-url-fopen

https://www.badips.com/get/list/apache-php-url-fopen/?age=1d

badips_apache-404 https://www.badips.com/get/list/apache-404/?age=1dbadips_dovecot https://www.badips.com/get/list/dovecot/?age=1dbadips_asterisk https://www.badips.com/get/list/asterisk/?age=1d

badips_apache-modsec https://www.badips.com/get/list/apache-modsec/?age=1d

badips_named https://www.badips.com/get/list/named/?age=1dbadips_asterisk-sec https://www.badips.com/get/list/asterisk-sec/?age=1d

osint http://osint.bambenekconsulting.com/feeds/c2-ipmasterlist-high.txt

autoshun https://www.autoshun.org/download/?api_key=badips_rfi-attack https://www.badips.com/get/list/rfi-attack/?age=1dbadips_spamdyke https://www.badips.com/get/list/spamdyke/?age=1dsslbl https://sslbl.abuse.ch/blacklist/sslipblacklist.csv

charleshttp://charles.the-haleys.org/ssh_dico_attack_hdeny_format.php/hostsdeny.txt

BinaryDefense https://www.binarydefense.com/banlist.txtTalos http://www.talosintelligence.com/feeds/ip-filter.blfDomains/BotnetsName SourceMalwareINT https://intel.malwaretech.com/BambenekConsulting http://osint.bambenekconsulting.com/feeds/c2-

ipmasterlist.txtRansomwareTracker http://ransomwaretracker.abuse.chZeusTracker https://zeustracker.abuse.ch/

DNS-BH–MalwareDomainBlocklistbyRiskAnalytics

http://www.malwaredomains.com/

Snort/SuricataName Source

6767

D4.1

Proofpoint emergingthreatsintelligence

http://rules.emergingthreats.net/blockrules/

HailaTAXII http://hailataxii.com/

BroName SourceCriticalStackintelfeed https://intel.criticalstack.com/FirewallrulesName SourceProofpoint emergingthreatsintelligence

http://rules.emergingthreats.net/fwrules/

OWASPCoreRuleSet https://github.com/SpiderLabs/owasp-modsecurity-crsMalwareName SourceOPSWATMetadefender https://www.metadefender.com/threat-intelligence-feeds

VirusShare https://virusshare.com/

MISP-OpenSourceThreatIntelligence Platform &OpenStandardsForThreatInformationSharing

http://www.misp-project.org/features.html

IPreputationName SourceCINSscore http://cinsscore.com/list/ci-badguys.txtYararulesName SourceYara-rules https://github.com/Yara-Rules/rulesDNSsinkholesName SourceBambenekConsulting http://osint.bambenekconsulting.com/feeds/c2-

dommasterlist-high.txtIPaddresssinkholesName SourceBambenekConsulting http://osint.bambenekconsulting.com/feeds/c2-

ipmasterlist-high.txtBadDomainsName SourceZeustracker https://zeustracker.abuse.ch/blocklist.php?download=ba

ddomainsRansomwareName SourceRansomwareTracker https://ransomwaretracker.abuse.ch/downloads/RW_IPB

L.txtRansomwareTracker https://ransomwaretracker.abuse.ch/downloads/LY_C2_

DOMBL.txtRansomwareTracker https://ransomwaretracker.abuse.ch/downloads/CW_C2_

URLBL.txtPhisingsitesName Source

6868

D4.1

OpenPhish https://openphish.com/feed.txtTORnodesIPsName Sourcedan.me.uk https://www.dan.me.uk/torlist/TORproject https://check.torproject.org/exit-addressesVariousName SourceRiskIQ https://www.riskiq.com/

https://www.riskiq.com/blog/https://www.riskiq.com/products/security-intelligence-services/

Shodan https://www.shodan.io/about/productsBlocklist.de https://lists.blocklist.de/lists/all.txtComputerIncidentResponseCenter

https://www.circl.lu/doc/misp/feed-osint/

botvrij.eu http://www.botvrij.eu/data/feed-osint/inThreat https://feeds.inthreat.com/osint/misp/Pastebin https://pastebin.com/

Table7-TheOSINTsourcesusedbythepartnersoftheproject,dividedbycategory.

D4.1 Techniques and tools for OSINT-based threat...

Documents

Transcript of D4.1 Techniques and tools for OSINT-based threat...