D2.1_v1.0 Data lake integration plan

D2.1 DATA LAKE INTEGRATION PLAN

Deliverable No.: D2.1
Deliverable Title: Data lake integration plan
Project Acronym: Fandango
Project Full Title: FAke News discovery and propagation from big Data and Artificial iNtelliGence Operations
Grant Agreement No.: 780355
Work Package No.: 2
Work Package Name: Data Access, Interoperability and user requirements
Responsible Author(s): Sindice (Lead), ENG, LIVETECH, VRT, CERTH, CIVIO, UPM, ANSA
Date: 31.07.2018
Status: V1.1
Deliverable type: REPORT
Distribution: PUBLIC
Ref. Ares(2018)4118136 - 05/08/2018


REVISION HISTORY

VERSION | DATE | MODIFIED BY | COMMENTS
V0.1 | 07.05.2018 | Jeferson Zanim (Siren/Sindice) | First draft.
V0.2 | 15.05.2018 | Jeferson Zanim (Siren/Sindice) | Content structure adjustments and first integration addition.
V0.3 | 22.05.2018 | Theodoros Semertzidis (CERTH) | CERTH contributions.
V0.4 | 07.06.2018 | Jeferson Zanim (Siren/Sindice) | Merged UPM contributions to the document.
V0.5 | 14.06.2018 | Jeferson Zanim (Siren/Sindice) | Merged LvT contributions to the document.
V1.0 | 30.07.2018 | Monica Franceschini, Massimo Magaldi (ENG) | Internally reviewed version.
V1.1 | 03.08.2018 | Jeferson Zanim (Siren/Sindice) | Final version.


TABLE OF CONTENTS

1. Introduction
2. Architecture Overview
3. Data Integrations
3.1. Data Ingestion
3.1.1. Data Ingestion I – Partner's Data (FTP)
3.1.2. Data Ingestion II – Rest APIs
3.1.3. Data Ingestion III – Open Data
3.1.4. Data Ingestion IV – Web sites
3.1.5. Data Ingestion V – RSS Sites
3.1.6. Data Ingestion VI – Social Networks
3.2. Data Processing
3.2.1. Siren (Sindice) Data Processing Integrations
3.2.1.1. Siren Integration I – Siren Investigate
3.2.2. UPM Integrations
3.2.2.1. UPM Integration I – Spark
3.2.2.2. UPM Integration II – Hive
3.2.2.3. UPM Integration III – Elasticsearch
3.2.2.4. UPM Integration IV – Neo4J
3.2.2.5. UPM Integration V – Siren Investigate
3.2.3. CERTH Integrations
3.2.3.1. CERTH Integration I – HDFS
3.2.3.2. CERTH Integration II – HBase
3.2.3.3. CERTH Integration III, VI and V – Apache Zeppelin
3.2.3.4. CERTH Integration VI – Elasticsearch
3.2.3.5. CERTH Integration VII – Spark
3.2.3.6. CERTH Integration VIII – MLib
3.2.3.7. CERTH Integration IX – Elasticsearch
3.2.4. LvT Integrations
3.2.4.1. LvT Integration I – Kafka
3.2.4.2. LvT Integration II – Fandango Webapp
3.2.4.3. LvT Integration III, IV, V – Apache Zeppelin
3.2.4.4. LvT Integration VI – Elasticsearch
3.2.4.5. LvT Integration VII – Spark
3.2.4.6. LvT Integration VIII – Spark
3.2.4.7. LvT Integration IX – Spark
3.2.4.8. LvT Integration X – HBase
4. Conclusion
5. ANNEX – Data Silos for European Content, Climate Change and Migration
Podcasts and Vodcasts


LIST OF FIGURES

Figure 1 - Architecture Overview
Figure 2 - Data Ingestion
Figure 3 - Siren Integrations
Figure 4 - UPM Integrations
Figure 5 - CERTH Integrations
Figure 6 - LvT Integrations

LIST OF TABLES

Table 1 – Architecture Components Overview


ABBREVIATIONS

ABBREVIATION | DESCRIPTION
H2020 | Horizon 2020
EC | European Commission
WP | Work Package
EU | European Union


EXECUTIVE SUMMARY

This document is a deliverable of the FANDANGO project funded by the European Union's Horizon 2020 (H2020) research and innovation programme under grant agreement No 780355. It is a public report that describes the data lake integration plan for the software development within FANDANGO.

The main goal of this deliverable is to define the data sets that will be collected and processed in FANDANGO's data lake, ultimately describing how data is handled and curated in the different steps of the process.

Data lakes require data integration solutions that can work with structured and unstructured data, likely with schema-less data storage, and with streams of data that should be processed in near real-time. In other words, a data lake requires a completely different approach to data integration, and newer data integration technology, compared to a traditional data warehouse.

Therefore, this document describes the different data ingestions and integrations currently designed, based on the proposed architecture. For each of those, ownership of source and target repository, type of data, access control, persistence period and purpose are stated.

As the data lake evolves, so will its documentation, becoming more descriptive and precise during the lifespan of the project. Integrations and complementary information will be added to specific sections of deliverables D2.2 - Data Interoperability and data model design, D3.1 - Data model and components and/or Project Progress Periodic Reports, to define in more detail the data structures available in each repository and their conventions.


1. INTRODUCTION

FANDANGO's goal is to aggregate and verify different typologies of news data, media sources, social media and open data to detect fake news and provide a more efficient and verified form of communication for European citizens.

To achieve this goal, several different approaches must be used in conjunction to collect a large volume of data. The collection of these datasets is essential to ensure that the Machine Learning algorithms can process the inputs into meaningful information and provide high-quality interactions with the user that allow real-time analysis for investigation and validation purposes. Solutions like Spark, which will be used for fast processing of machine learning and graph analysis, need to work in conjunction with Elasticsearch, which is focused on semantic and statistical computation. Such a strategy requires different technologies to be used and interconnected into a single, meaningful solution. The parts responsible for such interconnections, between data and different parts of the software solution, are the integrations, which are described in further detail in this document.

2. ARCHITECTURE OVERVIEW

To define the data lake integration, it is crucial to analyse the overall architecture of the solution and how data will be collected and processed across different environments. Therefore, the initial architecture overview in Figure 1 serves as a base to describe the different parts of the solution.

Figure 1 - Architecture Overview

FANDANGO's features to support journalists in fake-news detection and verification, as well as in scoring the news with different trustworthiness scores, require the development of several different big data processing and analysis techniques. To optimize the solution and better comply with software quality standards, such as Functional Suitability, Reliability, Operability, Performance Efficiency, Security, Compatibility, Maintainability and Transferability, FANDANGO relies on well-established products that were brought together to form the proposed architecture. The components of the architecture that need to be integrated are described in Table 1 – Architecture Components Overview.


SOFTWARE | DESCRIPTION
Nifi | Data flow ingestion tool, open source, distributed and scalable, to model real-time pre-processing workflows from several different sources.
Kafka | Publish-subscribe distributed messaging system that grants high throughput and back pressure management.
Spark | Fast, in-memory, distributed and general engine for large-scale data processing, with machine learning (MLlib), graph processing (GraphX), SQL (Spark SQL) and streaming (Spark Streaming) features.
HDFS | The Hadoop distributed file system: open source, reliable, scalable, chosen as storage.
Elasticsearch + Siren Federate | Distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. The Siren Federate plugin is added to Elasticsearch to allow data set semi-joins and seamless integration with different data sources.
Hive | Query engine (SQL-like language) on HDFS (and HBase) with JDBC/ODBC interfaces.
Oozie | Workflow scheduler.
Ambari | It acts as both a workflow engine and a scheduler. In this case, its main role is to manage the scheduling of Spark jobs and the creation of Hive tables.
Hue | Hadoop HUE, used to build dashboards, run queries and browse the services.
Siren | Investigative Intelligence UI with connectivity to Elasticsearch, whose aim is to allow reporting, investigative analysis and alerting to users based on the indexed contents.
Rest APIs, RSS, Web Sites, Open Data, Social network | Data sources of the Fandango project. Specific crawlers will connect to these sources of data to get the information needed to verify the news.
FTP | The File Transfer Protocol (FTP) is a standard network protocol used for the transfer of computer files between a client and server on a computer network. In our architecture, it is where users can place files that will then be ingested into the data lake.
HBase | The Hadoop NoSQL database, to perform random reads and writes based on row key identifiers.
Zeppelin | The notebook dedicated to data scientists, to run scripts and algorithms in REPL mode on data stored in Hadoop.
Atlas | Apache Atlas provides scalable governance for Enterprise Hadoop that is driven by metadata, adding features for data lineage, governance controls to address compliance requirements and agile data modelling.
Ranger | Framework to enable, monitor and manage data security across the Hadoop platform according to fine-grained policies, with centralized security and auditing.
WebApp | Access point to the Fandango infrastructure. The journalist will use the Fandango Web application to insert news and verify the trustworthiness of certain publications.

Table 1 – Architecture Components Overview

3. DATA INTEGRATIONS

Implementing the data processing and analysis techniques behind FANDANGO's features, interconnecting the different parts of the solution and meeting the functional requirements designed for FANDANGO call for multiple data integration processes, identified by red arrows in Figure 1. This section describes each of these integration processes in further detail, in two main steps: data ingestion and data processing. The first covers the acquisition of external data by the solution; the second covers its different processing stages within FANDANGO.

3.1. DATA INGESTION

The initial step in the process is acquiring data from multiple sources. Some of the desired data inputs have been mapped; they will be specified in more detail as the requirements evolve and during the first software delivery iterations. These inputs can be seen in Figure 2 and are described in further detail in the following sections.


Figure 2 - Data Ingestion

All data ingestion processes will be implemented by CERTH, following the data model design that is to be defined in WP2.

3.1.1. DATA INGESTION I – PARTNER'S DATA (FTP)

FANDANGO's user partners own collections of valuable data that may be used in various situations, from training machine learning models to supporting FANDANGO's fake-news detection feature. The datasets that will be made available are of different types and in a variety of formats.

For each dataset, a custom data shipping script will be provided that ingests the data into the FANDANGO cluster and makes it available to the processing units.
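As an illustration only, a per-dataset shipping script could be reduced to mapping each file dropped on the FTP area to a destination path in HDFS and handing it to the standard `hdfs dfs` shell. The landing directory `/fandango/raw/partners`, the partner/date layout and the function names below are assumptions made for this sketch; the deliverable does not define them.

```python
import posixpath
from pathlib import PurePosixPath
from typing import List

# Hypothetical landing directory in HDFS; not defined by the deliverable.
HDFS_ROOT = "/fandango/raw/partners"


def hdfs_target(partner: str, local_file: str, ingest_date: str) -> str:
    """Build the HDFS destination as <root>/<partner>/<date>/<file name>."""
    name = PurePosixPath(local_file).name
    return posixpath.join(HDFS_ROOT, partner.lower(), ingest_date, name)


def shipping_commands(partner: str, local_file: str, ingest_date: str) -> List[List[str]]:
    """Return the `hdfs dfs` invocations that copy one file unchanged into HDFS."""
    target = hdfs_target(partner, local_file, ingest_date)
    return [
        # Create the destination directory if it does not exist yet.
        ["hdfs", "dfs", "-mkdir", "-p", posixpath.dirname(target)],
        # Copy the file as is; -f overwrites a previous copy of the same file.
        ["hdfs", "dfs", "-put", "-f", local_file, target],
    ]
```

Each returned command list could be passed to `subprocess.run(cmd, check=True)` on a node where the Hadoop client is installed. Keeping the file untouched matches the plan's statement that the initial ingestion applies no transformation.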

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | The datasets will come from the data collections of ANSA, VRT, CIVIO
Ownership | Each dataset will be owned by the partner that is sharing the data and will be used in FANDANGO only internally. Partner's data will not be exposed to the public.
Type | Most of the data are in text formats such as plain text, pdf files and word files. Another batch of data will be images and videos in known formats.
Access Control | Internal network only.
Persistence | The data will be kept for the duration of the FANDANGO project. After the finalisation of the project, a new agreement will be reached.


DATA TRANSFORMATION

The initial ingestion of the data will not follow any transformation procedure. This will permit the partners working on the processing modules to experiment with different configurations on the original data. After the establishment of a solid processing workflow, the defined data transformations and successive updates will be specified in dedicated annexes to deliverables D2.2 - Data Interoperability and data model design, D3.1 - Data model and components and/or Project Progress Periodic Reports.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | HDFS
Ownership | The partner that shared the data
Type | HDFS files
Access Control | Internal network
Persistence | Kept until the end of the project

3.1.2. DATA INGESTION II – REST APIS

A list of sources that give access through REST APIs will also be integrated in the FANDANGO data shippers. The REST API data shipper will be able to load data from different sources by changing only a small part of the script with the specificities of each service.
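One way to read "changing only a small part of the script" is to isolate the service-specific details in a small configuration object and keep the shipping logic generic. The sketch below assumes this design; `RestSource`, `ship`, the `items_key` field and the example URL are hypothetical names used for illustration, not part of the FANDANGO code base.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen


@dataclass
class RestSource:
    """Service-specific part of the shipper (hypothetical structure)."""
    name: str
    base_url: str
    params: Dict[str, str] = field(default_factory=dict)
    items_key: str = "items"  # assumed key holding the list of records

    def url(self) -> str:
        # Sort the parameters so the generated URL is deterministic.
        query = urlencode(sorted(self.params.items()))
        return f"{self.base_url}?{query}" if query else self.base_url


def fetch_json(url: str) -> dict:
    """Default fetcher: HTTP GET, then decode the JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def ship(source: RestSource, fetch: Callable[[str], dict] = fetch_json) -> List[dict]:
    """Generic part: fetch, then wrap each record unchanged with its origin."""
    payload = fetch(source.url())
    return [{"source": source.name, "raw": item}
            for item in payload.get(source.items_key, [])]
```

Adding a new service then means declaring another `RestSource`; the records are wrapped without modification, in line with the plan to apply no transformation until the data model is defined. Passing a custom `fetch` function also makes the shipper testable without network access.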

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | REST API sources will be defined and integrated during the project
Ownership | Third party
Type | JSON
Access Control | Public access
Persistence | Managed according to the terms of use of each REST API provider

DATA TRANSFORMATION

The data will follow the data model that will be defined. Until then, no transformation will be applied.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Elasticsearch
Ownership | Third party
Type | JSON
Access Control | Internal network
Persistence | Managed according to the terms of use of the data owner.

3.1.3. DATA INGESTION III – OPEN DATA

A list of open data sources is under development in the FANDANGO project. These sources are mainly text data coming from public organizations at either national or European level. Some of these open data portals share their data through programmable interfaces; however, a clear majority only provide downloadable links to pdf, csv or xls files. Sources such as the Eurobarometer, Eurostat, the European External Action Service and other organizations are in this category.

A list of all the open data sets is provided in Section 5, ANNEX – Data Silos for European Content, Climate Change and Migration.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Data publicly available from national and European organisations
Ownership | Open Data publishers.
Type | json, csv, xls, pdf formats are available
Access Control | Internal network
Persistence | Managed according to the terms of use of the data owner

DATA TRANSFORMATION

The data will be fed to the FANDANGO data lake as is. After the loading, the modules that perform the pre-processing will transform them to plain text.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | HBASE
Ownership | Open Data publishers.
Type | Data table
Access Control | Internal network
Persistence | Managed according to the terms of use of the data owner.

3.1.4. DATA INGESTION IV – WEB SITES

A focused web crawler is being implemented in WP3 to load web sites that are relevant to news and fake news debunking. The crawler will hold a list of predefined news sources such as newspaper sites, blogs, fact checkers and other related sources. The crawling of these sites will follow a delta approach that gathers only the updates after the original/initial crawling process.
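The deliverable does not specify how the delta is computed. A minimal sketch, assuming the crawler keeps one content fingerprint per URL and re-ingests a page only when that fingerprint changes, could look as follows; the function names and the in-memory `seen` dictionary are illustrative choices, and a real crawler would persist this state between runs.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def fingerprint(text: str) -> str:
    """Stable digest of the page content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def delta(pages: Iterable[Tuple[str, str]], seen: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return only the (url, text) pairs that are new or changed.

    `seen` maps each URL to the fingerprint stored by the previous crawl
    and is updated in place.
    """
    changed = []
    for url, text in pages:
        digest = fingerprint(text)
        if seen.get(url) != digest:
            changed.append((url, text))
            seen[url] = digest
    return changed
```

On the initial crawl `seen` is empty, so every page is returned; later runs return only pages whose content differs from the stored fingerprint.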

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | News sites and other related web pages that will be manually selected
Ownership | Publishers and/or authors
Type | JSON files that contain the text and the URIs of multimedia in each news post
Access Control | Internal network
Persistence | Managed according to the terms of use of the data owner.

DATA TRANSFORMATION

The data will be held in JSON format and no further transformation is needed.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Elasticsearch
Ownership | Publishers and/or authors
Type | JSON
Access Control | Internal network
Persistence | Managed according to the terms of use of the data owner.

3.1.5. DATA INGESTION V – RSS SITES

The data shippers will provide the means to gather data from RSS feeds. A list of monitored RSS feeds will be created with the help of our user partners.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | RSS feeds of news agencies and news sites
Ownership | Third party
Type | XML
Access Control | Internal network
Persistence | Managed according to the terms of use of the data owner.

DATA TRANSFORMATION

The data will be transformed into JSON format for a uniform approach to the handling of text data.
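A minimal sketch of this XML-to-JSON step, using only the Python standard library, is given below. It reads the standard RSS 2.0 `item` elements (`title`, `link`, `pubDate`, `description`); the output field names `published` and `summary` are assumptions, since the target data model is still to be defined in WP2.

```python
import json
import xml.etree.ElementTree as ET


def rss_to_json(xml_text: str) -> str:
    """Convert the <item> entries of an RSS 2.0 feed into a JSON array."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Output field names below are illustrative, not a defined schema.
            "published": item.findtext("pubDate", default=""),
            "summary": item.findtext("description", default=""),
        })
    return json.dumps(items, ensure_ascii=False)
```

Missing elements are mapped to empty strings so that every JSON record has the same set of keys, which simplifies uniform handling downstream.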

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | HBASE
Ownership | Third party
Type | Data table
Access Control | Internal network
Persistence | Managed according to the terms of use of the data owners.

3.1.6. DATA INGESTION VI – SOCIAL NETWORKS

Social networks are a special source for fake news. They are the main channels of news propagation, and as such FANDANGO must keep a constant eye on what is shared there. The first and most interesting network in terms of fake news and news propagation is Twitter. FANDANGO's data shippers will gather data using different parameters, such as keywords, hashtags, user accounts and geolocation queries.
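The gathering parameters listed above can be pictured as a small rule object that a shipper evaluates against each collected post. The sketch below is purely illustrative: `GatheringRule`, `matches` and the post fields `text`, `user` and `coordinates` are hypothetical and do not correspond to the Twitter API or to any FANDANGO component.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence, Tuple


@dataclass(frozen=True)
class GatheringRule:
    """Hypothetical description of what a social network shipper collects."""
    keywords: Sequence[str] = ()
    hashtags: Sequence[str] = ()
    accounts: Sequence[str] = ()
    # Bounding box as (west, south, east, north) in degrees.
    bbox: Optional[Tuple[float, float, float, float]] = None


def matches(rule: GatheringRule, post: Mapping) -> bool:
    """Return True if the post satisfies at least one criterion of the rule."""
    text = post.get("text", "").lower()
    words = set(text.split())
    if any(keyword.lower() in text for keyword in rule.keywords):
        return True
    if any("#" + tag.lower().lstrip("#") in words for tag in rule.hashtags):
        return True
    if post.get("user", "").lower() in {account.lower() for account in rule.accounts}:
        return True
    if rule.bbox and post.get("coordinates"):
        lon, lat = post["coordinates"]
        west, south, east, north = rule.bbox
        return west <= lon <= east and south <= lat <= north
    return False
```

Hashtag matching here is a simple whitespace-token comparison, so a tag followed directly by punctuation would not match; a production shipper would more likely rely on the filtering offered by each network's own API.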


SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Twitter and other social networks
Ownership | Third party
Type | JSON
Access Control | Internal network
Persistence | Managed according to the terms of use of the data owner.

DATA TRANSFORMATION

No transformations will be applied to the original data.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Elasticsearch
Ownership | Third party
Type | JSON
Access Control | Internal network
Persistence | Will be kept as long as it is permitted by the terms of use of each service


3.2. DATA PROCESSING

Once external data has been made available within FANDANGO's platform, multiple processing stages are required to enhance it, analyse it and ultimately make information available to the user, who will then provide more data to the system to continue its learning cycle.

To facilitate the control of the deliverables and project planning, and to provide better visibility of the required implementations, the data processing integrations have been broken into sub-groups that will be implemented by different partners in the FANDANGO project. Each group and its implementations are described in the following sections.

3.2.1. SIREN (SINDICE) DATA PROCESSING INTEGRATIONS

The integrations highlighted in red in Figure 3 are going to be implemented by the partner Siren.

Figure 3 - Siren Integrations

3.2.1.1. SIREN INTEGRATION I – SIREN INVESTIGATE

This integration is responsible for accessing consolidated datasets made available in Elasticsearch and bringing them to the Siren Investigate platform, where users can do investigative analysis through Dashboards and Knowledge Graphs.

The data retrieved depends on the user request and the assigned credentials, and it is only treated for presentation on the target software. While multi-dataset filters are allowed, data content and granularity are kept unchanged between the systems.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Elasticsearch and Siren Federate plugin
Ownership | FANDANGO
Type | JSON Documents
Access Control | User authentication
Persistence | Data will be preserved indefinitely for analysis purposes. Sanitizing policies might be created after production.

DATA TRANSFORMATION

Data is collected and transported without changes to its content. This allows the original data retrieval design to be preserved and ensures that the information presented to the user isn't altered by aggregation or transformation processes.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Siren Investigate
Ownership | FANDANGO
Type | JSON Documents consolidated into Dashboards and Knowledge Graphs
Access Control | User authentication
Persistence | Real-time screen visualization only or CSV export by users


3.2.2. UPM INTEGRATIONS

The integrations highlighted in red in Figure 4 - UPM Integrations are going to be implemented by the partner UPM.

Figure 4 - UPM Integrations

3.2.2.1. UPM INTEGRATION I – SPARK

Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in different languages including Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing (Apache Spark, n.d.).

Moreover, Spark is very flexible, and it allows the data to be preprocessed in order to store it in a proper format for future data transformations and data analysis.

Indeed, Spark will be employed to preprocess the data provided by the different data sources with the aim of reducing the complexity of raw data such as images, video content and text files. Since the Hortonworks Data Platform (HDP) supports Apache Spark, the integration of this component does not require an external procedure.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Apache NIFI.
Ownership | FANDANGO
Type | Binary and HDFS files for storing images, video content and meta-data.
Access Control | Internal Network
Persistence | The original data will be managed by a third party in order to be removed, processed or stored.

DATA TRANSFORMATION

Data preprocessing can be defined as a data mining technique that involves transforming raw data into an understandable format. The main problem with real-world data is that it is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Hence, a data preprocessing stage is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing.

In this project, different data sources will be stored in the Data Lake and, therefore, some data transformations should be applied to images and other media content in order to normalize and scale the original data.

Several data transformation procedures, including centering and normalizing the data, will be applied to the different data sources with the aim of standardizing the data for the future Machine and Deep Learning procedures.
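To make the terms concrete, the sketch below shows centering, standardization (zero mean, unit variance) and min-max scaling on a plain list of numbers. It is a pure-Python illustration of the arithmetic only; the deliverable does not state which of these variants will be used, and in FANDANGO the equivalent operations would run in Spark on much larger data.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def center(values: Sequence[float]) -> List[float]:
    """Subtract the mean so the result has mean zero."""
    m = mean(values)
    return [v - m for v in values]


def standardize(values: Sequence[float]) -> List[float]:
    """Center, then divide by the population standard deviation."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `min_max([0, 5, 10])` yields `[0.0, 0.5, 1.0]`. For 8-bit image pixels the same idea is often applied with the fixed bounds 0 and 255 rather than the observed minimum and maximum.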

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Apache ZEPPELIN
Ownership | FANDANGO
Type | HDFS files
Access Control | Internal Network
Persistence | The original data will be managed by a third party in order to be removed, processed or stored.

3.2.2.2. UPM INTEGRATION II – HIVE

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability for querying and analysis of large data sets stored in Hadoop files. The integration of this component in the HDP is similar to that of Spark, since Hive is a native component of Hortonworks.

Hive defines a simple SQL query language, called HiveQL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers to work with the MapReduce framework by plugging in custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language (see footnote 1).

Hive will be used in the project mainly to support Spark in the preprocessing methods, by providing flexibility and scalability in the required data queries. It may also be employed, through its user interface, to perform data analysis tasks over large data sets stored in HDFS files.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Spark
Ownership | FANDANGO
Type | HDFS files
Access Control | Internal Network
Persistence | The data is managed by a third party to be visualized or processed

DATA TRANSFORMATION

In this scenario, since the transformation step will be carried out by the Spark component, Apache Hive will not need to perform any data transformation. It will be employed to make flexible and scalable queries on the data stored in the Spark component, and to support any other component that requires a query system for real-time visualization or data analysis.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | FANDANGO WebApp
Ownership | FANDANGO
Type | HDFS files
Access Control | Internal Network
Persistence | The data will be queried for real-time screen visualization and managed by a third party.

1 https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_data-access/content/ch_using-hive.html

3.2.2.3. UPM INTEGRATION III – ELASTICSEARCH

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. It centrally stores the data, and it will be used to process the required queries in order to visualize the results in real time using the Siren Investigate dashboard.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Spark, HDFS, Apache NIFI
Ownership | FANDANGO
Type | JSON Document
Access Control | Internal Network
Persistence | In-memory processing only

DATA TRANSFORMATION

In this case, the data is collected and transported without any kind of change, since this component will be used to transport the information between pairs of modules. The original information should remain intact because this module will help users in the real-time visualization procedure.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | FANDANGO WebApp
Ownership | FANDANGO
Type | JSON document
Access Control | Internal Network
Persistence | In-memory processing

3.2.2.4. UPM INTEGRATION IV – NEO4J

Neo4j is a graph database (GDB): it uses graphs to represent the data (entities) and the relationships between them. There are multiple ways of representing these graphs:

- Undirected graph: the two nodes of a link can be exchanged, so the relationship can be interpreted regardless of direction (e.g. friend links in Facebook).

- Directed graph: nodes and relationships are not bidirectional (by default). An example of this type is given by Twitter's following relationships: a user can follow some profiles in this network without those profiles following him/her back.

- Weighted graph: the relationships between entities carry a numerical value (weight), which allows some special operations to be performed.

- Labeled graph: these graphs incorporate labels that define the multiple edges and the type of relationship between nodes (e.g. Facebook labeled relationships include friends, job colleague, partner of, friend of).

- Property graph: a weighted graph with labels, where properties can be assigned both to entities (journal, publisher, ...) and to relationships (general categories such as name, country, birthplace).
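The graph variants listed above can be combined in one small in-memory structure: edges are directed, carry a label and a weight, and both nodes and edges accept free-form properties. The sketch below only illustrates these concepts; the class and method names are hypothetical and it is not how Neo4j stores or queries data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    id: str
    label: str                       # entity type, e.g. "Author"
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str                      # directed: source -> target
    target: str
    label: str                       # relationship type, e.g. "FOLLOWS"
    weight: float = 1.0
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Directed, labeled, weighted graph with properties on nodes and edges."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.id] = node

    def add_edge(self, edge: Edge) -> None:
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def neighbours(self, node_id: str, label: Optional[str] = None) -> List[str]:
        """Targets reachable from node_id, optionally filtered by edge label."""
        return [e.target for e in self.edges
                if e.source == node_id and (label is None or e.label == label)]
```

With a `FOLLOWS` edge from `alice` to `bob`, `neighbours("alice", "FOLLOWS")` returns `["bob"]` while `neighbours("bob", "FOLLOWS")` returns an empty list, which mirrors the Twitter example of a directed relationship. An undirected link would be modelled here by adding the edge in both directions.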

In the context of FANDANGO, a particular application of these databases matches the ambition of Task T4.4, Source credibility scoring, profiling and social graph analytics. This task aims to detect nodes associated with fake content generation and the relationships with these entities. For this purpose, a complete definition of the multiple actors involved in the fake news detection paradigm is expected. This will make it possible to express the complete environment and to exploit sources from multiple entities (e.g. a news item has been published by an author with a bad reputation, so it is likely to be biased or even fake). This paradigm definition is also commonly known as an ontology. There are some general-purpose news-related ontologies in the field of news analysis (see footnote 2). These approaches will be taken as the starting point for the news analysis (see footnote 3).

The integration of Neo4j into the Hortonworks Data Platform (HDP) as a service has been done. The process is summarized as follows:

- In HDP, a folder must be created at '/var/lib/ambari-agent/cache/stacks/HDP/2.6/services' with the name of the service, 'Neo4J'.

- Go into the folder and clone the https://github.com/cas-bigdatalab/ambari-neo4j repository.

- [optional] Change the configuration (general parameters: IP, PORTS, SECURITY…) in the configuration file at '/master/configuration/neo4j.xml'.

- Start the HDP and go to add a service…

What this repository does is set up /etc/yum.repos.d/neo4j.repo, install the most recent version of the software and attach it to the whole stack of services in the HDP platform.

For the interest of the entire research/development community, a Docker Hub image has been created integrating the services required by UPM for FANDANGO (see footnote 4).

2 The IPTC is the global standards body of the news media that provides the technical foundation for the news ecosystem: http://dev.iptc.org/rNews

3 BBC Ontology: https://www.bbc.co.uk/ontologies/storyline

4 Docker Hub FANDANGO modules: https://hub.docker.com/r/tavitto16/fandango_hdp/

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Spark and FANDANGO WebApp
Ownership | FANDANGO
Type | JSON and csv documents
Access Control | Authenticated access
Persistence | The data will be used for graph analytics and data visualizations by a third party.

DATA TRANSFORMATION

Since Neo4j will be employed to perform the graph analysis, the transformations applied to the origin data will consist of a set of queries and graph operations. This set of operations will be used to find relevant patterns and to analyze the credibility of sources using graph algorithms. In the end, the output of these operations must be a new graph (if the graph has been modified) as well as a set of metrics or results, if they are required.
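The idea of a graph operation that returns both a new graph and a set of metrics can be sketched in plain Python as follows. The scoring rule (the share of a source's articles that are not flagged as fake), the `fake` flag and the JSON-like graph layout are assumptions made for this illustration only; the actual credibility algorithms are defined in Task T4.4, not here.

```python
import copy


def score_sources(graph):
    """Return (new_graph, metrics) where each scored source gains a 'credibility' property.

    Assumed rule (illustrative only): credibility is the share of a source's
    published articles that are not flagged as fake.
    """
    new_graph = copy.deepcopy(graph)  # the input graph is left untouched
    metrics = {}
    for source_id, source in new_graph["nodes"].items():
        if source["label"] != "Source":
            continue
        articles = [e["target"] for e in new_graph["edges"]
                    if e["source"] == source_id and e["label"] == "PUBLISHED"]
        if not articles:
            continue  # no evidence, so no score is assigned
        fake = sum(1 for a in articles if new_graph["nodes"][a].get("fake", False))
        credibility = 1.0 - fake / len(articles)
        source["credibility"] = credibility
        metrics[source_id] = credibility
    return new_graph, metrics


# Hypothetical input graph in a JSON-like structure.
graph = {
    "nodes": {
        "s1": {"label": "Source"},
        "a1": {"label": "Article", "fake": True},
        "a2": {"label": "Article", "fake": False},
    },
    "edges": [
        {"source": "s1", "target": "a1", "label": "PUBLISHED"},
        {"source": "s1", "target": "a2", "label": "PUBLISHED"},
    ],
}

scored_graph, source_metrics = score_sources(graph)
```

Working on a deep copy keeps the original graph available for comparison, which matches the requirement that a modified graph is emitted as a new graph.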

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | FANDANGO Web App and Siren Investigate
Ownership | FANDANGO
Type | JSON document
Access Control | Authenticated access
Persistence | The data will be used for graph analytics and data visualizations by a third party.

3.2.2.5. UPM INTEGRATION V – SIREN INVESTIGATE

Siren Investigate will be used to visualize the graph database with all the entities and relationships involved in FANDANGO’s ontology. In addition, basic modifications and analytics within the graph can also be performed using this module.

Moreover, Siren Investigate will communicate with Neo4j when the latter is required to perform more advanced graph analytics.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Neo4j, FANDANGO Web App
Ownership | FANDANGO
Type | JSON, ontology file (OWL, RDF)
Access Control | Internal network
Persistence | Real-time visualizations

DATA TRANSFORMATION

In this process, the data will be transformed only if the graph analysis employs algorithms that modify the current graph (i.e. add new entities or relationships, or remove some of them). In this case, the transformation will consist of updating the current graph and storing the new version in the platform so that it can be visualized later on. However, the format of the data must be the same as the original. This transformation will only affect the information provided by the graph, not its structure.
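A minimal sketch of such an update step, assuming a JSON graph document with `nodes` and `edges` lists, is given below. The key names in `NODE_KEYS` and `EDGE_KEYS` are a hypothetical agreed format introduced for this example; the point is only that entities and relationships may change while the document layout may not.

```python
import json

# Hypothetical agreed format: every node and edge must carry exactly these keys.
NODE_KEYS = {"id", "label", "properties"}
EDGE_KEYS = {"source", "target", "label"}


def check_format(graph):
    """Raise ValueError if the graph document deviates from the agreed format."""
    if set(graph) != {"nodes", "edges"}:
        raise ValueError("unexpected top-level keys")
    for node in graph["nodes"]:
        if set(node) != NODE_KEYS:
            raise ValueError(f"malformed node: {node}")
    for edge in graph["edges"]:
        if set(edge) != EDGE_KEYS:
            raise ValueError(f"malformed edge: {edge}")


def apply_update(graph, add_nodes=(), add_edges=(), remove_node_ids=()):
    """Return an updated copy of the graph in the same format as the input."""
    removed = set(remove_node_ids)
    updated = {
        "nodes": [n for n in graph["nodes"] if n["id"] not in removed] + list(add_nodes),
        # Edges touching a removed node are dropped as well.
        "edges": [e for e in graph["edges"]
                  if e["source"] not in removed and e["target"] not in removed]
                 + list(add_edges),
    }
    check_format(updated)
    return updated


current = {
    "nodes": [
        {"id": "s1", "label": "Source", "properties": {}},
        {"id": "a1", "label": "Article", "properties": {"title": "Example"}},
    ],
    "edges": [{"source": "s1", "target": "a1", "label": "PUBLISHED"}],
}

new_version = apply_update(
    current,
    add_nodes=[{"id": "a2", "label": "Article", "properties": {}}],
    add_edges=[{"source": "s1", "target": "a2", "label": "PUBLISHED"}],
    remove_node_ids=["a1"],
)
serialized = json.dumps(new_version)  # same JSON layout as the original document
```

An update that introduces a node or edge with different keys is rejected, so the stored version can always be read by the same visualization code.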

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | FANDANGO Web App
Ownership | FANDANGO
Type | JSON and ontology configuration files
Access Control | Internal network
Persistence | Real-time visualizations


3.2.3. CERTH INTEGRATIONS

The integrations highlighted in red in Figure 5 are going to be implemented by the partner CERTH.

Figure 5 - CERTH Integrations

3.2.3.1. CERTH INTEGRATION I – HDFS

HDFS will be the main storage module that FANDANGO will use to push data that will be immutable. It is convenient to work with and will hold any type of data in any format.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Data from data shippers through Apache NiFi
Ownership | FANDANGO
Type | HDFS holding any type of data
Access Control | Internal network
Persistence | Depends on each source, as described in the previous sections.

DATA TRANSFORMATION

As described in the data loading section.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | All processing systems, e.g. MLlib, Spark, etc.
Ownership | FANDANGO
Type | Mostly JSON files, but depends on the original data
Access Control | Internal network
Persistence | As defined in the data loading section

3.2.3.2. CERTH INTEGRATION II – HBASE

HBase will be used for structured data, such as information coming from RSS feeds or from the open data portals of European and national organizations.
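To illustrate what "structured" means for an RSS source, the sketch below turns each RSS item into a row key plus a set of `family:qualifier` columns, which is the general shape of an HBase row. The sample feed, the column family name `meta` and the choice of the item link as row key are all assumptions made for this example, and no HBase client is involved.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS fragment, used only to illustrate the mapping.
RSS_SAMPLE = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>First headline</title>
    <link>http://example.org/news/1</link>
    <pubDate>Mon, 07 May 2018 10:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Second headline</title>
    <link>http://example.org/news/2</link>
    <pubDate>Tue, 08 May 2018 11:30:00 GMT</pubDate>
  </item>
</channel></rss>"""


def rss_to_rows(rss_text):
    """Turn each RSS <item> into a (row_key, columns) pair.

    Column names follow the 'family:qualifier' convention; the family name
    'meta' and the use of the link as row key are assumptions.
    """
    root = ET.fromstring(rss_text)
    rows = []
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        columns = {
            "meta:title": item.findtext("title", default=""),
            "meta:link": link,
            "meta:published": item.findtext("pubDate", default=""),
        }
        rows.append((link, columns))
    return rows


rows = rss_to_rows(RSS_SAMPLE)
```

Each resulting pair could then be written to a data table by whichever HBase client the platform adopts.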

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | RSS and open data
Ownership | FANDANGO unless otherwise defined by the data provider
Type | Data table
Access Control | Internal network
Persistence | As defined in the data loading section

DATA TRANSFORMATION

No transformations required.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | All processing modules, e.g. MLlib and Spark
Ownership | FANDANGO
Type | Data table
Access Control | Internal network
Persistence | As defined in the data loading section


3.2.3.3. CERTH INTEGRATION III, IV AND V – APACHE ZEPPELIN

Apache Zeppelin is a notebook environment. It will be used to work on the data stored in the FANDANGO cluster and to experiment directly on the data, without the need to transfer data to and from the cluster, for research and prototyping purposes. As it is used as a sandbox area, different datasets and formats will be loaded into it. Data in this area will only be persisted while required for prototyping purposes, and access to the notebooks will be controlled through user authentication.

3.2.3.4. CERTH INTEGRATION VI – ELASTICSEARCH

Elasticsearch will collect all the processed data and the outcomes of the processing modules for the identification of trustworthiness markers.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Processing modules implemented in WP4
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | Deleted after processing

DATA TRANSFORMATION

No transformations required.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Elasticsearch
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | Kept for the duration of the project


3.2.3.5. CERTH INTEGRATION VII – SPARK

Spark will be used for processing the available data in the modules developed in WP4.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | HDFS
Ownership | FANDANGO
Type | Any format
Access Control | Internal network
Persistence | Kept in HDFS for the duration of the project

DATA TRANSFORMATION

No transformations are required.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Spark
Ownership | FANDANGO
Type | Any format
Access Control | Internal network
Persistence | In-memory processing only

3.2.3.6. CERTH INTEGRATION VIII – MLLIB

MLlib is used for the machine learning modules and works together with Spark. As such, all the details that apply to Spark apply to MLlib as well.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | HDFS or HBase
Ownership | FANDANGO
Type | Any format
Access Control | Internal network
Persistence | Kept in the storage as defined in the data loading section

DATA TRANSFORMATION

No transformations are required.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | MLlib
Ownership | FANDANGO
Type | Any format
Access Control | Internal network
Persistence | In-memory processing only

3.2.3.7. CERTH INTEGRATION IX – ELASTICSEARCH

This integration will be used initially for pilot 0.1, and it will then be evaluated whether it remains in future versions of the FANDANGO platform.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Apache NiFi
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | In-memory only

DATA TRANSFORMATION

No transformations apply here.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Elasticsearch
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | For the duration of the project, or as otherwise defined by the data sources’ terms of use.


3.2.4. LVT INTEGRATIONS

The integrations highlighted in red in Figure 6 are going to be implemented by the partner LvT.

Figure 6 - LvT Integrations

3.2.4.1. LVT INTEGRATION I – KAFKA

Kafka will be used for queueing up the news items before they are analyzed for fake content and saved.

The collection of news from different sources (websites, open data, web app, etc.) will produce hundreds or thousands of items. For this reason Kafka is needed; otherwise it would not be possible to process all the news together. In addition, we do not want to save all the retrieved news into databases, but only the fake news recognized by our machine learning system.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Apache NiFi, Fandango App
Ownership | FANDANGO
Type | JSON
Access Control | Authenticated access
Persistence | Real-time screen visualization only


DATA TRANSFORMATION

Kafka does not apply any kind of transformation; therefore all news items must follow an agreed JSON format before they are put into Kafka.

For example:

{
  "source": "aaaa",
  "date": "dddd",
  "title": "xxxx",
  "body": "yyyy",
  "urls_image": ["xxx", "xxx", ..., "xxx"],
  etc.
}
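Because Kafka passes messages through unchanged, the agreed format has to be checked before a message is published. The following is a minimal sketch of such a check using only the Python standard library. The field names come from the example above; the expected types are assumptions, and the Kafka producer call itself is deliberately left out.

```python
import json

# Field names taken from the example above; the type expectations are assumptions.
REQUIRED_FIELDS = {
    "source": str,
    "date": str,
    "title": str,
    "body": str,
    "urls_image": list,
}


def validate_news(raw_message):
    """Parse a JSON news message and return it, or raise ValueError if it is malformed."""
    try:
        news = json.loads(raw_message)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(news, dict):
        raise ValueError("a news message must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in news:
            raise ValueError(f"missing field: {name}")
        if not isinstance(news[name], expected_type):
            raise ValueError(f"field {name} should be of type {expected_type.__name__}")
    return news


message = json.dumps({
    "source": "aaaa",
    "date": "dddd",
    "title": "xxxx",
    "body": "yyyy",
    "urls_image": ["xxx", "xxx"],
})
news = validate_news(message)
```

A data shipper would call `validate_news` on each item and publish only the messages that pass, so that downstream consumers can rely on the format.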

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | HDFS, HBase
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | In-memory processing; deleted after processing

3.2.4.2. LVT INTEGRATION II – FANDANGO WEB APP

The web app provides a set of interfaces to manage, in a single touchpoint:

- data from the different FANDANGO layers;
- distributed data resources, transferring them efficiently and securely.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Fandango platform’s
Ownership | FANDANGO
Type | JSON
Access Control | Public access and authenticated access
Persistence | In-memory processing

DATA TRANSFORMATION

It is a web app control, so it does not need any kind of transformation.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Fandango platform’s
Ownership | FANDANGO
Type | JSON
Access Control | Public access, authenticated access
Persistence | Real-time screen visualization, and Hive/HBase to save the configurations

3.2.4.3. LVT INTEGRATION III, IV, V – APACHE ZEPPELIN

Apache Zeppelin is a notebook environment that supports Python among other interpreters. It will be used to work on the data stored in the FANDANGO cluster and to experiment directly on the data, without the need to transfer data to and from the cluster, for research and prototyping purposes. As it is used as a sandbox area, different datasets and formats will be loaded into it. Data in this area will only be persisted while required for prototyping purposes, and access to the notebooks will be controlled through user authentication.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | HBase, HDFS, Elasticsearch
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | In-memory processing


DATA TRANSFORMATION

HBase already contains data in JSON format; therefore no transformation is required.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Apache Zeppelin
Ownership | FANDANGO
Type | Serialized object, text, JSON and others
Access Control | Internal network
Persistence | In-memory processing

3.2.4.4. LVT INTEGRATION VI – ELASTICSEARCH

Elasticsearch centrally stores the data and will be used to analyze news using natural language processing tools.

Elasticsearch is necessary because it implements a search engine that will be used to search for similar news based on semantic context.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Spark
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | In-memory processing

DATA TRANSFORMATION

Elasticsearch will apply NLP pipelines depending on the language of the news. The main transformations that will be applied are tokenization, lemmatization, syntactic parsing, n-gram extraction, etc.
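Two of these transformations, tokenization and word n-gram extraction, can be sketched in a few lines of plain Python. This is only an illustration of the concepts, not a description of how Elasticsearch analyzers are implemented or configured; the regular expression and the sample sentence are assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Fake news spreads fast.")
bigrams = ngrams(tokens, 2)
```

Lemmatization and syntactic parsing are language dependent and would normally rely on a dedicated NLP component per language, which is why they are not sketched here.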

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Elasticsearch
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | Kept indefinitely

3.2.4.5. LVT INTEGRATION VII – SPARK

Apache Oozie is a workflow scheduler that is used to manage Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work, expressed as a directed acyclic graph (DAG) of actions. Oozie can weave a Spark job into the workflow: the workflow waits until the Spark job completes before continuing to the next action.
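The DAG semantics described here (an action starts only after everything it depends on has completed) can be illustrated with the Python standard library, as sketched below. This is not Oozie's workflow definition format; the action names and the `run_action` callback are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it must wait for.
workflow = {
    "ingest": set(),
    "spark_job": {"ingest"},
    "index_results": {"spark_job"},
    "notify": {"spark_job", "index_results"},
}


def run_workflow(dag, run_action):
    """Run every action once, only after all of its predecessors have completed."""
    executed = []
    for action in TopologicalSorter(dag).static_order():
        run_action(action)  # blocks until the action finishes
        executed.append(action)
    return executed


order = run_workflow(workflow, run_action=lambda name: None)
```

In this sketch the `spark_job` step stands in for the Spark action: nothing that depends on it is started until the call for it returns.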

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Oozie
Ownership | FANDANGO
Type | JSON
Access Control | Authenticated access and internal network
Persistence | In-memory processing

DATA TRANSFORMATION

Apache Oozie does not require any data transformation.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Spark
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | In-memory processing

3.2.4.6. LVT INTEGRATION VIII – SPARK

Spark is useful for analyzing millions of data items in a short time. Therefore, in this project, Spark will be employed to process the news provided by the different data sources. This is done using MLlib, Spark's machine learning library. The interaction with the saved news is useful for capturing user feedback about the truthfulness of a news item. This feedback will be processed by the machine learning models.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | HDFS
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | Kept indefinitely

DATA TRANSFORMATION

In order to re-train the machine learning and deep learning algorithms in MLlib, we will apply different data transformations to make the formats compatible with the libraries used.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Spark
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | In-memory processing


3.2.4.7. LVT INTEGRATION IX – SPARK

Spark is useful for analyzing millions of data items in a very short time. Therefore, in this project, Spark will be employed to process the news provided by the different data sources using MLlib, Spark's machine learning library.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Kafka
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | In-memory processing, Kafka persistence, deleted after processing.

DATA TRANSFORMATION

In order to use the machine learning and deep learning algorithms in MLlib, we will apply different data transformations to make the formats compatible with the libraries used.
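One common transformation of this kind is turning JSON news items into fixed-length numeric vectors, since learning algorithms generally expect numbers rather than text. The sketch below shows a simple bag-of-words encoding in plain Python; it is an illustrative stand-in, not the project's pipeline, and in practice the feature transformers shipped with Spark MLlib would play this role. The sample messages are hypothetical.

```python
import json
import re


def build_vocabulary(documents):
    """Map every distinct token in the documents to a column index (sorted for stability)."""
    tokens = sorted({t for doc in documents for t in re.findall(r"\w+", doc.lower())})
    return {token: index for index, token in enumerate(tokens)}


def to_vector(document, vocabulary):
    """Count how often each vocabulary token occurs in the document."""
    vector = [0] * len(vocabulary)
    for token in re.findall(r"\w+", document.lower()):
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


# Hypothetical messages in a JSON news format.
raw_messages = [
    '{"title": "Fake news", "body": "news about news"}',
    '{"title": "Weather", "body": "sunny"}',
]
texts = []
for raw in raw_messages:
    news = json.loads(raw)
    texts.append(news["title"] + " " + news["body"])

vocabulary = build_vocabulary(texts)
vectors = [to_vector(text, vocabulary) for text in texts]
```

Every vector has the same length as the vocabulary, so the items can be stacked into a matrix regardless of how long the original articles were.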

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | HDFS
Ownership | FANDANGO
Type | JSON, text file
Access Control | Internal network
Persistence | Data saved until they are removed

3.2.4.8. LVT INTEGRATION X – HBASE

HBase will be used to save structured data, such as analyzed news or news coming from Kafka.

SOURCE

Characteristics of the data origin.

CHARACTERISTIC | DESCRIPTION
Main Source | Kafka
Ownership | FANDANGO
Type | JSON
Access Control | Internal network
Persistence | Saved until they are removed

DATA TRANSFORMATION

Apache HBase does not require any data transformation.

TARGET

Characteristics of the data target.

CHARACTERISTIC | DESCRIPTION
Target System | Apache Zeppelin
Ownership | FANDANGO
Type | In-memory structure
Access Control | Internal network
Persistence | In-memory processing

4. CONCLUSION

The purpose of this document is to provide an initial overview of the required data integrations and of how data is going to be curated, in order to direct the initial development and facilitate the overall coordination of activities between the partners. This structure should evolve along with the development of the project and will be updated in later stages. Further definitions are already planned in deliverables D2.2 - Data Interoperability and data model design and D3.1 - Data model and components.

Nonetheless, the architectural structure, data ingestion and data integration definitions will direct the development of the first versions of the solution and play a crucial role in refining requirements and validating the platform.


5. ANNEX – DATA SILOS FOR EUROPEAN CONTENT, CLIMATE CHANGE AND MIGRATION

This annex lists the initial data silos used to collect articles for the FANDANGO project.

EU OPEN DATA PORTAL

NAME: EU Open Data Portal

URL: http://data.europa.eu/euodp/en/home

DESCRIPTION, PROVIDER AND AVAILABILITY: The European Union Open Data Portal (EU ODP) provides access to an expanding range of data from the European Union (EU) institutions and other EU bodies. Only EU institutions, agencies and bodies can provide data for the EU Open Data Portal – a single point of access for EU data. These data can be used and reused for commercial or non-commercial purposes.

Total datasets available: 12,209

PUBLISHERS: European Parliament (54 datasets), Council of the European Union (3 datasets), European Commission (11072 datasets), European Central Bank (31 datasets), European External Action Service (0 datasets), European Economic and Social Committee (2 datasets), Committee of the Regions (2 datasets), European Investment Bank (2 datasets), European Ombudsman (1 dataset), European Data Protection Supervisor (6 datasets), EU body or agency (1035 datasets)

MAIN SECTORS:
• Social questions
• Science
• Environment
• Employment and working conditions
• Economics
• Finance
• Trade
• Production, technology and research
• Industry
• European Union
• Agriculture, forestry and fisheries
• Energy
• Transport
• Business and competition
• International relations
• Geography
• Education and communications
• Law
• International organisations
• Politics
• Agri-foodstuffs

GRANULARITY:
• European
• Eurozone (when relevant)
• National
• Other

GEOGRAPHICAL COVERAGE: France (3014), Italy (2964), Austria (2948), Germany (2929), Spain (2908), Belgium (2874), Netherlands (2867), Denmark (2849), Ireland (2842), Portugal (2839), Sweden (2823), Luxembourg (2814), Greece (2813), United Kingdom (2804), Finland (2802), Slovenia (2730), Hungary (2727), Poland (2713), Czech Republic (2685), Latvia (2683), Estonia (2679), Slovakia (2678), Lithuania (2678), Bulgaria (2632), Romania (2623), Cyprus (2562), Malta (2551), Croatia (2299), Norway (1116), Switzerland (1038), Liechtenstein (988), Serbia (966), Former Yugoslav Republic of Macedonia (963), Bosnia and Herzegovina (953), Montenegro (950), Albania (922), Iceland (902), Russia (896), Turkey (885), Monaco (794), Ukraine (790), San Marino (777), Belarus (772), Andorra (767), Moldova (749), Vatican City (740), Kosovo (737), Åland Islands (726), Gibraltar (711), Guernsey (708)

LANGUAGE: EN, FR, DE

MEDIA: Data

RESOURCES FORMAT: ZIP (8800), HTML (7496), text/tab-separated-values (7326), PDF (876), XML (818), Excel (405), CSV (379), application/msaccess (157), RDF (155), OCTET STREAM (146), TXT (75), DOC (58), webservice/sparql (49), text/n3 (49), application/x-dbase (45), XLSX (36), JSON (18), JPEG (16), PNG (13), PPT (11), application/msword (9), WEBSERVICE/SPARQL (7), TIFF (7), RSS (5), N3 (3), DOCX (3), KML (2), GIF (2), interactive webpages (1), file (1), application/x-compress (1), application/javascript (1), OWL (1), MXD (1), E00 (1), Access (1)

NOTE/COPYRIGHT: Data can be reused free of charge and without any copyright restrictions. (REUSE OF EU DATA HAS TO BE SHARED WITH THE PORTAL). Already listed in the Fandango Agreement.


EUROPEAN DATA PORTAL

NAME: European Data Portal

URL: https://www.europeandataportal.eu

DESCRIPTION, PROVIDER AND AVAILABILITY: The European Data Portal harvests the metadata of Public Sector Information available on public data portals across European countries. Information regarding the provision of data and the benefits of re-using data is also included. At the last count, it provides access to 817,747 datasets. Among others:
- Italy – 38,259
- Spain – 28,693
- Netherlands – 22,672
- Belgium – 7,315
- Greece – 6,709
(THOSE DATASETS SHOULD BE CHOSEN CAREFULLY, TO AVOID UNNECESSARY INGESTIONS)

MAIN SECTORS: Datasets by categories:
- Agriculture, Fisheries, Forestry & Food
- Energy
- Regions & Cities
- Economy and Finance
- Health
- Population & Society
- Government & Public Sector
- International Issues
- Transport
- Environment
- Science & Technology

GRANULARITY: All EU Countries

LANGUAGE: All EU members’ languages

MEDIA: CSV, XLA, XML, RDF, JSON

RESOURCES FORMAT: Provides Datasets URI and SPARQL Queries (and 78 catalogues on EU Countries Datasets)

NOTE/COPYRIGHT: Already listed in the Fandango Agreement. LICENCE NEEDED. OPEN LICENSE ASSISTANT: https://www.europeandataportal.eu/en/content/show-license


EUROPEAN PRESS ROOM (Press releases)

NAME: European Union Newsroom – Press Releases

URL: https://europa.eu/newsroom/

DESCRIPTION, PROVIDER AND AVAILABILITY: Press releases by EU Institutions of the last 30 days. Press releases databases:
• Committee of the Regions
• Council of the EU and European Council
• Court of Justice of the European Union
• European Central Bank
• European Commission
• European Court of Auditors
• European Data Protection Supervisor
• European Economic and Social Committee
• European Investment Bank
• European Ombudsman
• European Parliament
• Eurostat – Statistical Office

MAIN SECTORS:
• Asylum and migration
• Business
• Business, taxation and competition
• Consumer affairs and public health
• Culture, education and youth
• Economy and the euro
• Employment and social rights
• Energy, environment and climate
• Enlargement, external relations and trade
• EU regional and urban development
• Food, farming and fisheries
• Institutional affairs
• International aid, development and cooperation
• Justice and citizens’ rights
• Research and innovation
• Security and defence
• Statistics
• Transport and travel

GRANULARITY: All EU Countries

LANGUAGE: EN, FR, DE

MEDIA: Audio, Video, Text

RESOURCES FORMAT: E-mail and RSS (https://europa.eu/newsroom/rss-feeds_en#press-releases). PODCASTS AND VODCASTS:
• European Parliament
• European Commission
• Daily press briefings Podcast
• Daily press briefings Vodcast
• Press conferences Vodcast

NOTE/COPYRIGHT: Already listed in the Fandango Agreement. (FREE TO USE?) We should ask for clarification: https://europa.eu/european-union/contact/write-to-us_en

EUROPEAN MEDIA MONITOR (NewsExplorer)

NAME: European Media Monitor NewsExplorer

URL: http://emm.newsexplorer.eu/

DESCRIPTION, PROVIDER AND AVAILABILITY: The NewsExplorer uses JRC-developed technology to automatically generate daily news summaries, allowing users to see the major news stories (news clusters) in various languages for any specific day and to compare how the same events have been reported in media written in different languages. Users can also see the list of most mentioned names and find further automatically derived information (e.g. variant name spellings, titles and phrases, list of the most recent articles and list of related persons and organizations). NewsExplorer carries out the following tasks:

• cluster all news articles of the day, separately for each language, into groups of related articles;
• for each cluster, identify names of people, places, organizations;
• apply approximate name matching techniques to all names found in the same cluster, in order to identify which name variants may belong to the same person;
• link the monolingual clusters with the related clusters in the other languages;
• identify the most typical article of each cluster and use its title for the cluster;
• store the extracted information in a database, learning more about each person, etc. every day;
• occasionally, the Wikipedia online encyclopedia is automatically searched for images and for further multilingual name variants.

MAIN SECTORS:
• Clustered news
• Countries
• People
• Other names
• Alerts
• Timeline

GRANULARITY: Austria, Belgium, Germany, Spain, France, U.K., Italy, Netherlands, United States

LANGUAGE: English, Spanish, Greek, Dutch and 15 other languages

MEDIA: Text only

RESOURCES FORMAT: Available on the web pages and in RSS format (http://emm.newsexplorer.eu/rss?type=clusters&language=it)

NOTE/COPYRIGHT: WE SHOULD ASK THEM DIRECTLY. Already listed in the Fandango Agreement.


EUROSTAT

NAME: Eurostat

URL: http://ec.europa.eu/eurostat

DESCRIPTION, PROVIDER AND AVAILABILITY: Eurostat is the statistical office of the European Union. Database of European Statistics, by theme and A to Z.

MAIN SECTORS: Statistics by theme:
- General
- Regional
- Economy and Finance
- Population and Social conditions
- Industry, Trade and Services
- Agriculture and Fisheries
- International Trade
- Transport
- Environment and Energy
- Science, Technology and Digital Society

GRANULARITY: All EU Countries

LANGUAGE: English, German, French (mainly)

MEDIA: Text only

RESOURCES FORMAT: http://ec.europa.eu/eurostat/data/database; SDMX Web Services; JSON and Unicode Web Services; BULK DOWNLOAD: http://ec.europa.eu/eurostat/data/bulkdownload

NOTE/COPYRIGHT: Already listed in the Fandango Agreement. All Eurostat databases and electronic publications are available free of charge via the website. EUROSTAT requires notification of use and the INDICATION of provenance (EUROSTAT).

EUROBAROMETER

NAME: Eurobarometer (EU Commission) on Public Opinion

URL: http://ec.europa.eu/commfrontoffice/publicopinion/index.cfm/General/index

DESCRIPTION, PROVIDER AND AVAILABILITY: Eurobarometer was established in 1974. Each survey consists of approximately 1000 face-to-face interviews per country. Reports are published twice yearly. Reproduction is authorized, except for commercial purposes, provided the source is acknowledged.

Special Eurobarometer reports are based on in-depth thematic studies carried out for various services of the European Commission or other EU Institutions and integrated in the Standard Eurobarometer's polling waves.

MAIN SECTORS: Public Opinion and… contains: Eurobarometer A-Z, Eurobarometer Timeline, Eurobarometer 40 years, Eurobarometer Almanac. Eurobarometer Interactive is the search engine with FAQ. Links (to Twitter). Archives points to the old website, renovated in 2016.

GRANULARITY: All the EU members since 1974

LANGUAGE: All the EU languages since 1974

MEDIA: Text only

RESOURCES FORMAT: No RSS. The data is generally provided in PDF format, but there is the possibility to download XLS format from the Open Data Portal (https://data.europa.eu/). ARCHIVES: http://ec.europa.eu/commfrontoffice/publicopinion/archives_en.htm

NOTE/COPYRIGHT: Already listed in the Fandango Agreement. Reuse is authorized, provided the source is acknowledged. The Commission's reuse policy is implemented by the Decision of 12 December 2011 - reuse of Commission documents [PDF, 728 KB].

OTHER POSSIBLE DATA SILOS

(Pending FURTHER verification)

EUROPEAN E-RESOURCES CENTRE

NAME: European Library and E-Resources

URL: http://ec.europa.eu/libraries/

DESCRIPTION, PROVIDER AND AVAILABILITY: Online search of the resources on EU policies, law and more in the European Commission Library's electronic collections

MAIN SECTORS: Open Access Resources. Books, eBooks, Journal Articles and more

GRANULARITY: All European Countries

LANGUAGE: English

MEDIA: Text only

RESOURCES FORMAT: Direct Search on the website (possible other access)

NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT.

EUROPEAN EXTERNAL ACTION (EU Foreign Policy)

NAME: EUROPEAN EXTERNAL ACTION SERVICE (EEAS)

URL: http://risis.eu/data/

DESCRIPTION, PROVIDER AND AVAILABILITY: The EEAS is the European Union's diplomatic service. It helps the EU's foreign affairs chief – the High Representative for Foreign Affairs and Security Policy – carry out the Union's Common Foreign and Security Policy.

MAIN SECTORS: The website contains many documents, publications and infographics on EU Foreign Policy, Security and Defence

GRANULARITY: Almost all EU Countries’ Research Projects

LANGUAGE: English, French, Italian, Spanish

MEDIA: Text, pictures, graphics

RESOURCES FORMAT: RSS or direct download

NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT.


EUROPEAN SCIENTIFIC DATA (Zenodo)

NAME: ZENODO

URL: https://zenodo.org/

DESCRIPTION, PROVIDER AND AVAILABILITY: Free and Open Digital Archive built by CERN and OpenAIRE to facilitate scientific data exchange among researchers

MAIN SECTORS: Zenodo offers access to 1831 Scientific ‘Communities’

GRANULARITY: European Countries

LANGUAGE: English

MEDIA: Text

NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT.

RESOURCES FORMAT: Output via OAI-PMH. Direct Search.

EUROPEAN UNIVERSITY INSTITUTE (EUI)

NAME: European University Institute (Firenze)

URL: https://www.eui.eu/

DESCRIPTION, PROVIDER AND AVAILABILITY: The European University Institute (EUI) is a unique international centre for doctorate and post-doctorate studies and research, situated in the Tuscan hills overlooking Florence.

Since its establishment 40 years ago by the six founding members of the then European Communities, the EUI has earned a reputation as a leading international academic institution with a European focus.

The European Commission supports the EUI through the European Union budget.

The EUI library boasts around half a million volumes in the Institute’s specialist areas, attracting external researchers with an interest in Europe. The campus also hosts the Historical Archives of the European Union (HAEU), which provides an unparalleled insight into the workings of the European Union.

MAIN SECTORS: The website has a search feature on the Historical Archives of the EU Institutions and on a unique collection of 150 private archives of pro-European associations and personalities

GRANULARITY: Almost all European Countries

LANGUAGE: English, French, German, Italian

MEDIA: Text

NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT.

RESOURCES FORMAT: Direct Search

RISIS (European Scientific Research)

NAME: RESEARCH INFRASTRUCTURE FOR RESEARCH AND INNOVATION POLICY STUDIES (RISIS)

URL: http://risis.eu/data/

DESCRIPTION, PROVIDER AND AVAILABILITY: The overall objective of the project is to build a distributed infrastructure on data relevant for research and innovation dynamics and policies.

RISIS aims at opening to European researchers a large number of (linked) datasets covering 7 themes:
1. Research funding: datasets that contain information about research projects funded by the EC (EUPRO, CORDIS), by trans-border funding programs between EC member states (JOREP), and by other funders (OPEN-AIR, Open Funder database).
2. Datasets on dominant sciences and technologies (nanotechnology dataset).
3. Datasets covering firm innovation dynamics.
4. Public sector research in Europe, with several data on European Higher Education Institutions (RISIS-ETER), on European public research organizations (under development) and on their academic performance (Leiden ranking).
5. Research careers, with access to the European mobility survey (MORE) and the German panel on doctoral students and their careers (early career facility) and, at a later stage, with access to a platform and/or dataset integrating multiple national sources (under development).
6. A specific repository, SIPER, on policy evaluations (articulated with the OECD-World Bank Innovation policy platform), giving access to the accumulated knowledge on policy instruments and policy mixes.
7. Several datasets that provide linked data, such as data from statistical offices, geographical classifications, patents (USPTO), open science (Open-Air), and others. For more information see the SMS Data Store.

MAIN SECTORS: The datasets cover five critical dimensions: ERA dynamics (3 datasets), firm innovation dynamics (3 datasets), public sector research (3 datasets), research careers (3 datasets) and a repository on research and innovation policy evaluations. Several of these datasets are accessible online.

GRANULARITY: Almost all EU Countries’ Research Projects

LANGUAGE: English

MEDIA: Text, pictures, graphics

NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT.

RESOURCES FORMAT: Access only through accreditation


OpenAIRE

NAME: The OpenAIRE 2020 Project

URL: https://www.openaire.eu

DESCRIPTION, PROVIDER AND AVAILABILITY: 50 partners, from all EU countries and beyond, will collaborate to work on this large-scale initiative that aims to promote open scholarship and substantially improve the discoverability and reusability of research publications and data. The initiative brings together professionals from research libraries, open scholarship organizations, national e-Infrastructure and data experts, IT and legal researchers, showcasing the truly collaborative nature of this pan-European endeavor.

MAIN SECTORS: The website has a search feature on 24 million publications and almost 700 thousand datasets on the 2020 Projects

GRANULARITY: Almost all European Countries

LANGUAGE: English, French, German, Italian

MEDIA: Text

NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT.

RESOURCES FORMAT: Direct Search