D2.1_v1.0 Datalakeintegrationplan
D2.1 DATA LAKE INTEGRATION PLAN
Deliverable No.: D2.1
Deliverable Title: Data lake integration plan
Project Acronym: Fandango
Project Full Title: FAke News discovery and propagation from big Data and Artificial iNtelliGence Operations
Grant Agreement No.: 780355
Work Package No.: 2
Work Package Name: Data Access, Interoperability and user requirements
Responsible Author(s): Sindice (Lead), ENG, LIVETECH, VRT, CERTH, CIVIO, UPM, ANSA
Date: 31.07.2018
Status: V1.1
Deliverable type: REPORT
Distribution: PUBLIC
Ref. Ares(2018)4118136 - 05/08/2018
REVISION HISTORY
VERSION   DATE         MODIFIED BY                                   COMMENTS
V0.1      07.05.2018   Jeferson Zanim (Siren/Sindice)                First draft.
V0.2      15.05.2018   Jeferson Zanim (Siren/Sindice)                Content structure adjustments and first integration addition.
V0.3      22.05.2018   Theodoros Semertzidis (CERTH)                 CERTH contributions.
V0.4      07.06.2018   Jeferson Zanim (Siren/Sindice)                Merged UPM contributions into the document.
V0.5      14.06.2018   Jeferson Zanim (Siren/Sindice)                Merged LvT contributions into the document.
V1.0      30.07.2018   Monica Franceschini, Massimo Magaldi (ENG)    Internally reviewed version.
V1.1      03.08.2018   Jeferson Zanim (Siren/Sindice)                Final version.
TABLE OF CONTENTS
1. Introduction
2. Architecture Overview
3. Data Integrations
3.1. Data Ingestion
3.1.1. Data Ingestion I – Partner’s Data (FTP)
3.1.2. Data Ingestion II – Rest APIs
3.1.3. Data Ingestion III – Open Data
3.1.4. Data Ingestion IV – Websites
3.1.5. Data Ingestion V – RSS Sites
3.1.6. Data Ingestion VI – Social Networks
3.2. Data Processing
3.2.1. Siren (Sindice) Data Processing Integrations
3.2.1.1. Siren Integration I – Siren Investigate
3.2.2. UPM Integrations
3.2.2.1. UPM Integration I – Spark
3.2.2.2. UPM Integration II – Hive
3.2.2.3. UPM Integration III – Elasticsearch
3.2.2.4. UPM Integration IV – Neo4J
3.2.2.5. UPM Integration V – Siren Investigate
3.2.3. CERTH Integrations
3.2.3.1. CERTH Integration I – HDFS
3.2.3.2. CERTH Integration II – HBase
3.2.3.3. CERTH Integration III, IV and V – Apache Zeppelin
3.2.3.4. CERTH Integration VI – Elasticsearch
3.2.3.5. CERTH Integration VII – Spark
3.2.3.6. CERTH Integration VIII – MLlib
3.2.3.7. CERTH Integration IX – Elasticsearch
3.2.4. LvT Integrations
3.2.4.1. LvT Integration I – Kafka
3.2.4.2. LvT Integration II – Fandango Webapp
3.2.4.3. LvT Integration III, IV, V – Apache Zeppelin
3.2.4.4. LvT Integration VI – Elasticsearch
3.2.4.5. LvT Integration VII – Spark
3.2.4.6. LvT Integration VIII – Spark
3.2.4.7. LvT Integration IX – Spark
3.2.4.8. LvT Integration X – HBase
4. Conclusion
5. ANNEX – Data Silos for European Content, Climate Change and Migration
Podcasts and Vodcasts
LIST OF FIGURES
Figure 1 - Architecture Overview
Figure 2 - Data Ingestion
Figure 3 - Siren Integrations
Figure 4 - UPM Integrations
Figure 5 - CERTH Integrations
Figure 6 - LvT Integrations
LIST OF TABLES
Table 1 – Architecture Components Overview
ABBREVIATIONS
ABBREVIATION      DESCRIPTION
H2020             Horizon 2020
EC                European Commission
WP                Work Package
EU                European Union
EXECUTIVE SUMMARY
This document is a deliverable of the FANDANGO project, funded by the European Union’s Horizon 2020 (H2020) research and innovation programme under grant agreement No 780355. It is a public report that describes the data lake integration plan for the software development within FANDANGO.
The main goal of this deliverable is to define the data sets that will be collected and processed in FANDANGO’s data lake, ultimately describing how data is handled and curated in the different steps of the process.
Data lakes require data integration solutions that can work with structured and unstructured data, likely with schema-less data storage, and with streams of data that should be processed in near real-time. In other words, a data lake requires a completely different approach to data integration, and newer data integration technology, compared to a traditional data warehouse.
Therefore, this document describes the different data ingestions and integrations currently designed, based on the proposed architecture. For each of those, the ownership of the source and target repository, type of data, access control, persistence period and purpose are stated.
As the data lake evolves so will its documentation, becoming more descriptive and precise during the lifespan of the project. Integrations and complementary information will be added into specific sections of deliverables D2.2 - Data Interoperability and data model design, D3.1 - Data model and components and/or Project Progress Periodic Reports to define more detailed data structures available in each repository and its conventions.
1. INTRODUCTION
FANDANGO’s goal is to aggregate and verify different typologies of news data, media sources, social media and open data to detect fake news and provide a more efficient and verified form of communication for European citizens.
To achieve this goal, several different approaches must be used in conjunction to collect a large volume of data. The collection of these datasets is essential to ensure that the Machine Learning algorithms can process the inputs into meaningful information and provide high-quality interactions with the user that allow real-time analysis for investigation and validation purposes. Solutions like Spark, which will be used for fast processing of machine learning and graph analysis, need to work in conjunction with Elasticsearch, which is focused on semantic and statistical computation. Such a strategy requires different technologies to be used and interconnected into a single, meaningful solution. The parts responsible for such interconnections, between data and different parts of the software solution, are the integrations, which are described in further detail in this document.
2. ARCHITECTURE OVERVIEW
To define the data lake integration, it is crucial to analyse the overall architecture of the solution and how data will be collected and processed across different environments. Therefore, the initial architecture overview in Figure 1 serves as the base to describe the different parts of the solution.
Figure 1 - Architecture Overview
FANDANGO’s features to support journalists in fake-news detection and verification, as well as scoring the news with different trustworthiness scores, require the development of several different big data processing and analysis techniques. To optimize the solution and better comply with software quality standards, such as Functional Suitability, Reliability, Operability, Performance Efficiency, Security, Compatibility, Maintainability and Transferability, FANDANGO relies on well-established products that were brought together to form the proposed architecture. The components of the architecture that need to be integrated are described in Table 1 – Architecture Components Overview.
SOFTWARE    DESCRIPTION
Nifi    Data flow ingestion tool, open source, distributed and scalable, to model real-time pre-processing workflows from several different sources.
Kafka    Publish-subscribe distributed messaging system that grants high throughput and back-pressure management.
Spark    Fast, in-memory, distributed and general engine for large-scale data processing with machine learning (MLlib), graph processing (GraphX), SQL (Spark SQL) and streaming (Spark Streaming) features.
HDFS    The Hadoop distributed file system, open source, reliable, scalable, chosen as storage.
Elasticsearch + Siren Federate    Distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. The Siren Federate plugin is added to Elasticsearch to allow dataset semi-joins and seamless integration with different data sources.
Hive    Query engine (SQL-like language) on HDFS (and HBase) with JDBC/ODBC interfaces.
Oozie    Workflow scheduler.
Ambari    It acts as both a workflow engine and a scheduler. In this case, its main role is to manage the scheduling of Spark jobs and the creation of Hive tables.
Hue    Hadoop HUE, used to build dashboards, run queries and browse the services.
Siren    Investigative Intelligence UI with connectivity to Elasticsearch, whose aim is to allow reporting, investigative analysis and alerting to users based on the indexed contents.
Rest APIs, RSS, Web Sites, Open Data, Social network    Data sources of the Fandango project. Specific crawlers will connect to these sources of data to get the information needed to verify the news.
FTP    The File Transfer Protocol (FTP) is a standard network protocol used for the transfer of computer files between a client and server on a computer network. In our architecture it is where users can place files that will then be ingested into the data lake.
HBase    The Hadoop NoSQL database, to perform random reads and writes based on row key identifiers.
Zeppelin    The notebook dedicated to data scientists, to run scripts and algorithms in REPL mode on data stored in Hadoop.
Atlas    Apache Atlas provides scalable governance for Enterprise Hadoop that is driven by metadata, adding features for data lineage, governance controls to address compliance requirements and agile data modelling.
Ranger    Framework to enable, monitor and manage data security across the Hadoop platform according to fine-grained policies and a centralized security and auditing.
WebApp    Access point to the Fandango infrastructure. The journalist will use the Fandango Web application to insert news and verify the trustworthiness of certain publications.
Table 1 – Architecture Components Overview
3. DATA INTEGRATIONS
To implement the data processing and analysis techniques needed to support FANDANGO’s features, to interconnect the different parts of the solution and to enable the functional requirements designed for FANDANGO, multiple Data Integration processes are required. They are identified by red arrows in Figure 1. This section describes each of these integration processes in further detail, broken down into two main steps: data ingestion and data processing. The first describes the acquisition of external data by the solution, and the second its different processing stages within FANDANGO.
3.1. DATA INGESTION
The initial step in the process is acquiring data from multiple sources. Some of the desired data inputs have been mapped, and they will be shaped in more detail as the requirements evolve and the first software delivery iterations take place. These inputs can be seen in Figure 2 and are described in further detail in the following sections.
Figure 2 - Data Ingestion
All data ingestion processes will be implemented by CERTH, following the data model design that is to be defined in WP2.
3.1.1. DATA INGESTION I – PARTNER’S DATA (FTP)
FANDANGO’s user partners own collections of valuable data that may be used in various situations, from training machine learning models to supporting FANDANGO’s fake-news detection feature. The datasets that will be made available are of different types and in a variety of formats.
For each dataset a custom data shipping script will be provided that will ingest the data into the FANDANGO cluster and make it available to the processing units.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC    DESCRIPTION
Main Source       The datasets will come from the data collections of ANSA, VRT and CIVIO.
Ownership         Each dataset will be owned by the partner that is sharing the data and will be used only internally in FANDANGO. Partner’s data will not be exposed to the public.
Type              Most of the data are in text formats such as plain text, PDF files and Word files. Another batch of data will be images and videos in known formats.
Access Control    Internal network only.
Persistence       The data will be kept for the duration of the FANDANGO project. After the finalisation of the project, a new agreement will be conducted.
DATA TRANSFORMATION
The initial ingestion of the data will not follow any transformation procedure. This will permit the partners working on the processing modules to experiment with different configurations on the original data. After the establishment of a solid processing workflow, the defined data transformations and successive updates will be specified in dedicated annexes to deliverables D2.2 - Data Interoperability and data model design, D3.1 - Data model and components and/or Project Progress Periodic Reports.
TARGET
Characteristics of the data target.
CHARACTERISTIC    DESCRIPTION
Target System     HDFS
Ownership         The partner that shared the data
Type              HDFS files
Access Control    Internal network
Persistence       Kept until the end of the project
3.1.2. DATA INGESTION II – REST APIS
A list of sources that give access through REST APIs will also be integrated in the FANDANGO data shippers. The REST API data shipper will be able to load data from different sources by changing only a small part of the script to account for the specificities of each service.
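The idea of a single shipper that adapts to each service by changing only a small part can be sketched as follows. This is a minimal illustration under stated assumptions, not the FANDANGO implementation: the `ServiceConfig` fields, the helpers `build_request_url` and `extract_records`, and the endpoint `https://api.example.org/v1/articles` are all hypothetical.

```python
import json
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class ServiceConfig:
    """Per-service settings: the only part that changes between REST sources."""
    name: str
    base_url: str                      # hypothetical endpoint
    params: dict = field(default_factory=dict)
    records_key: str = "results"       # JSON key holding the list of records


def build_request_url(config: ServiceConfig) -> str:
    """Compose the request URL from the service-specific settings."""
    query = urlencode(sorted(config.params.items()))
    return f"{config.base_url}?{query}" if query else config.base_url


def extract_records(config: ServiceConfig, payload: str) -> list:
    """Parse a JSON response body and return its records unchanged."""
    document = json.loads(payload)
    return document.get(config.records_key, [])


# Hypothetical service definition used only for illustration.
example = ServiceConfig(
    name="example-news-api",
    base_url="https://api.example.org/v1/articles",
    params={"q": "climate", "lang": "en"},
    records_key="articles",
)
```

Adding a new source would then mean adding one more `ServiceConfig` instance, while the request and parsing logic stays shared. The records are returned untransformed, in line with the statement that no transformation is applied until the data model is defined.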
SOURCE
Characteristics of the data origin.
CHARACTERISTIC    DESCRIPTION
Main Source       RSS sources will be defined and integrated during the project
Ownership         Third party
Type              JSON
Access Control    Public access
Persistence       Managed according to the terms of use of each REST API provider
DATA TRANSFORMATION
The data will follow the data model once it has been defined. Until then, no transformation will be applied.
TARGET
Characteristics of the data target.
CHARACTERISTIC    DESCRIPTION
Target System     Elasticsearch
Ownership         Third party
Type              JSON
Access Control    Internal network
Persistence       Managed according to the terms of use of the data owner.
3.1.3. DATA INGESTION III – OPEN DATA
A list of open data sources is under development in the FANDANGO project. These sources are mainly text data coming from public organizations at either national or European level. Some of these open data portals share their data through programmable interfaces; however, a clear majority only provide downloadable links to PDF, CSV or XLS files. Sources such as the Eurobarometer, Eurostat, the European External Action Service and other organizations are in this category.
A list of all the open datasets is provided in 5. ANNEX – Data Silos for European Content, Climate Change and Migration.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC    DESCRIPTION
Main Source       Data publicly available from national and European organisations
Ownership         Open Data publishers.
Type              JSON, CSV, XLS and PDF formats are available
Access Control    Internal network
Persistence       Managed according to the terms of use of the data owner
DATA TRANSFORMATION
The data will be fed to the FANDANGO data lake as is. After the loading, the modules that perform the pre-processing will transform them to plain text.
TARGET
Characteristics of the data target.
CHARACTERISTIC    DESCRIPTION
Target System     HBASE
Ownership         Open Data publishers.
Type              Data table
Access Control    Internal network
Persistence       Managed according to the terms of use of the data owner.
3.1.4. DATA INGESTION IV – WEBSITES
A focused web crawler is being implemented in WP3 to load websites that are relevant to news and fake news debunking. The crawler will hold a list of predefined news sources such as newspaper sites, blogs, fact checkers and other related sources. The crawling of these sites will follow a delta approach that gathers only the updates made after the initial crawling process.
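One possible way to realise a delta approach is to keep a fingerprint of each page from the previous crawl and ship only pages that are new or whose content changed. The sketch below assumes this fingerprinting strategy; the function names `content_fingerprint` and `select_updates` are hypothetical, and the document does not state how the WP3 crawler actually detects updates.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Return a stable SHA-256 digest of the page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_updates(pages: dict, seen: dict) -> dict:
    """Return only pages that are new or changed since the last crawl.

    pages: URL -> page text fetched in the current crawl
    seen:  URL -> fingerprint stored by earlier crawls (updated in place)
    """
    updates = {}
    for url, text in pages.items():
        fingerprint = content_fingerprint(text)
        if seen.get(url) != fingerprint:
            updates[url] = text
            seen[url] = fingerprint
    return updates
```

On the initial crawl `seen` is empty, so every page is returned; on later crawls only the differences are passed on to the ingestion step.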
SOURCE
Characteristics of the data origin.
CHARACTERISTIC    DESCRIPTION
Main Source       News sites and other related web pages that will be manually selected
Ownership         Publishers and/or authors
Type              JSON files that contain the text and the URIs of multimedia in each news post
Access Control    Internal network
Persistence       Managed according to the terms of use of the data owner.
DATA TRANSFORMATION
The data will be held in JSON format and no further transformation is needed.
TARGET
Characteristics of the data target.
CHARACTERISTIC    DESCRIPTION
Target System     Elasticsearch
Ownership         Publishers and/or authors
Type              JSON
Access Control    Internal network
Persistence       Managed according to the terms of use of the data owner.
3.1.5. DATA INGESTION V – RSS SITES
The data shippers will provide the means to gather data from RSS feeds. A list of monitored RSS feeds will be created with the help of our user partners.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC    DESCRIPTION
Main Source       RSS feeds of news agencies and news sites
Ownership         Third party
Type              XML
Access Control    Internal network
Persistence       Managed according to the terms of use of the data owner.
DATA TRANSFORMATION
The data will be transformed into JSON format for a uniform approach to the handling of text data.
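The XML-to-JSON step can be illustrated with a short sketch based on the Python standard library. The output field names (`title`, `link`, `published`, `summary`) are assumptions chosen for the example, since the FANDANGO data model is still to be defined, and `SAMPLE_FEED` is an invented feed.

```python
import json
import xml.etree.ElementTree as ET


def rss_to_json(rss_xml: str) -> str:
    """Convert the <item> entries of an RSS 2.0 feed into a JSON array."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            # Field names are illustrative, not the project's data model.
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "summary": item.findtext("description", default=""),
        })
    return json.dumps(items, ensure_ascii=False)


# Invented feed used only to exercise the function.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Headline</title><link>https://example.org/1</link>
<pubDate>Mon, 30 Jul 2018 10:00:00 GMT</pubDate>
<description>Short summary</description></item>
</channel></rss>"""
```

Missing elements are mapped to empty strings so that every record has the same keys, which keeps the downstream handling of text data uniform.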
TARGET
Characteristics of the data target.
CHARACTERISTIC    DESCRIPTION
Target System     HBASE
Ownership         Third party
Type              Data table
Access Control    Internal network
Persistence       Managed according to the terms of use of the data owners.
3.1.6. DATA INGESTION VI – SOCIAL NETWORKS
Social networks are a special source for fake news. They are the main channels of news propagation, and as such FANDANGO must keep a constant eye on what is shared there. The first and most interesting network in terms of fake news and news propagation is Twitter. FANDANGO’s data shippers will create data-gathering jobs with different parameters, such as keywords, hashtags, user accounts and geolocation queries.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC    DESCRIPTION
Main Source       Twitter and other social networks
Ownership         Third party
Type              JSON
Access Control    Internal network
Persistence       Managed according to the terms of use of the data owner.
DATA TRANSFORMATION
No transformations will be applied to the original data.
TARGET
Characteristics of the data target.
CHARACTERISTIC    DESCRIPTION
Target System     Elasticsearch
Ownership         Third party
Type              JSON
Access Control    Internal network
Persistence       Will be kept as long as it is permitted by the terms of use of each service
3.2. DATA PROCESSING
Once external data has been made available within FANDANGO’s platform, multiple processing stages are required to enhance it, analyse it and ultimately make information available to the user, who will then provide more data to the system to continue its learning cycle.
To facilitate the control of the deliverables and project planning, and to provide better visibility of the required implementations, the different data processing integrations have been broken into sub-groups that will be implemented by different partners in the FANDANGO project. Each group and its implementations are described in the following sections.
3.2.1. SIREN (SINDICE) DATA PROCESSING INTEGRATIONS
The integrations highlighted in red in Figure 3 are going to be implemented by the partner Siren.
Figure 3 - Siren Integrations
3.2.1.1. SIREN INTEGRATION I – SIREN INVESTIGATE
This integration is responsible for accessing consolidated datasets made available in Elasticsearch and bringing them to the Siren Investigate platform, where users can do investigative analysis through Dashboards and Knowledge Graphs.
The data retrieved depends on the user request and the assigned credentials, and it is only treated for presentation in the target software. While multi-dataset filters are allowed, data content and granularity are kept unchanged between the systems.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC    DESCRIPTION
Main Source       Elasticsearch and Siren Federate plugin
Ownership         FANDANGO
Type              JSON Documents
Access Control    User authentication
Persistence       Data will be preserved indefinitely for analysis purposes. Sanitizing policies might be created after production
DATA TRANSFORMATION
Data is collected and transported without changes to its content. That allows the original data retrieval design to be preserved and ensures that information being presented to the user isn’t altered by aggregation or transformation processes.
TARGET
Characteristics of the data target.
CHARACTERISTIC    DESCRIPTION
Target System     Siren Investigate
Ownership         FANDANGO
Type              JSON Documents consolidated into Dashboards and Knowledge Graphs
Access Control    User authentication
Persistence       Real-time screen visualization only or CSV export by users
3.2.2. UPM INTEGRATIONS
The integrations highlighted in red in Figure 4 - UPM Integrations are going to be implemented by the partner UPM.
Figure 4 - UPM Integrations
3.2.2.1. UPM INTEGRATION I – SPARK
Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in different languages including Scala, Java, Python and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. (Apache Spark, n.d.)
Moreover, Spark is very flexible, and it allows the data to be preprocessed in order to store it in a proper format for future data transformations and data analysis.
Indeed, Spark will be employed to preprocess the data provided by the different data sources with the aim of reducing the complexity of raw data such as images, video content and text files. Since the Hortonworks Data Platform (HDP) supports Apache Spark, the integration of this component does not require an external procedure.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC    DESCRIPTION
Main Source       Apache NIFI.
Ownership         FANDANGO
Type              Binary and HDFS files for storing images, video content and meta-data.
Access Control    Internal Network
Persistence       The original data will be managed by a third party in order to be removed, processed or stored.
DATA TRANSFORMATION
Data preprocessing can be defined as a data mining technique that involves transforming raw data into an understandable format. The main problem with real-world data is that it is often incomplete, inconsistent and/or lacking in certain behaviors or trends, and is likely to contain many errors. Hence, a data preprocessing stage is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing.
In this project, different data sources will be stored in the Data Lake, and therefore some data transformations should be applied to images and other media content in order to normalize and scale the original data.
Several data transformation procedures, including centering and normalizing the data, will be applied to the different data sources with the aim of standardizing the data for the future Machine Learning and Deep Learning procedures.
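Centering, standardizing and scaling can be sketched in plain Python as shown below. This is only an illustration of the operations named above: in the project these steps are expected to run on Spark, and the function names `center`, `standardize` and `min_max_scale` are hypothetical.

```python
from statistics import fmean, pstdev


def center(values):
    """Subtract the mean so that the data is centered on zero."""
    mean = fmean(values)
    return [v - mean for v in values]


def standardize(values):
    """Center the data and scale it to unit (population) standard deviation."""
    centered = center(values)
    std = pstdev(values)
    if std == 0:
        return centered  # constant feature: nothing to scale
    return [v / std for v in centered]


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale the data into the interval [low, high]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low for _ in values]
    span = vmax - vmin
    return [low + (v - vmin) * (high - low) / span for v in values]
```

For example, 8-bit pixel intensities in the range 0 to 255 can be mapped into [0, 1] with `min_max_scale`, while numeric features with different units can be made comparable with `standardize`.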
TARGET
Characteristics of the data target.
CHARACTERISTIC    DESCRIPTION
Target System     Apache ZEPPELIN
Ownership         FANDANGO
Type              HDFS files
Access Control    Internal Network
Persistence       The original data will be managed by a third party in order to be removed, processed or stored.
3.2.2.2. UPM INTEGRATION II – HIVE
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability for querying and analysis of large datasets stored in Hadoop files. The integration of this component in the HDP is similar to that of Spark, since HIVE is a native component of Hortonworks.
Hive defines a simple SQL query language, called HiveQL, that enables users familiar with SQL to query the data. At the same time, this language also allows working with the MapReduce framework by plugging in
custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.1
The use of HIVE in the project will basically be to support Spark in the preprocessing methods by providing flexibility and scalability in the required data queries. It may also be employed, through its user interface, to perform data analysis tasks over large datasets stored in HDFS files.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC    DESCRIPTION
Main Source       Spark
Ownership         FANDANGO
Type              HDFS files
Access Control    Internal Network
Persistence       The data is managed by a third party to be visualized or processed
DATA TRANSFORMATION
In this scenario, since the transformation step will be carried out by the Spark component, Apache HIVE will not need to perform any data transformation. It will be employed to make flexible and scalable queries over the data stored in the Spark component, as well as to support any other component that requires a query system for real-time visualization or data analysis.
TARGET
Characteristics of the data target.
CHARACTERISTIC    DESCRIPTION
Target System     FANDANGO WebApp
Ownership         FANDANGO
Type              HDFS files
Access Control    Internal Network
Persistence       The data will be queried for real-time screen visualization and managed by a third party.
1 https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_data-access/content/ch_using-hive.html
3.2.2.3. UPM INTEGRATION III – ELASTICSEARCH
Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. It stores the data centrally and will be used to process the required queries in order to visualize the results in real time using the Dashboard of Siren Investigate.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC    DESCRIPTION
Main Source       Spark, HDFS, Apache NIFI
Ownership         FANDANGO
Type              JSON Document
Access Control    Internal Network
Persistence       In-memory processing only
DATA TRANSFORMATION
In this case, the data is collected and transported without undergoing any kind of change, since this component will be used to transport the information between pairs of modules. The original information should remain intact, since this module will help users in the real-time visualization procedure.
TARGET
Characteristics of the data target.
CHARACTERISTIC    DESCRIPTION
Target System     FANDANGO WebApp
Ownership         FANDANGO
Type              JSON document
Access Control    Internal Network
Persistence       In-memory processing
3.2.2.4. UPM INTEGRATION IV – NEO4J
Neo4J is a graph database (GDB). This means that it uses graphs to represent the data (entities) and the relationships between them. There are multiple ways of representing these graphs:
- Undirected Graph: nodes and links can be exchanged, and their relationship can be interpreted regardless of the direction (e.g. friend links in Facebook).
- Directed Graph: nodes and relationships are not bidirectional (by default). An example of this type is given by Twitter following relationships. A user can follow some profiles in this network without these profiles following him/her back.
- Weighted graph: the relationships between entities are represented by a numerical value (weight). This allows some special operations to be performed.
- Labeled graph: these graphs have labels incorporated that can define the multiple edges and the type of relationship between nodes (e.g. Facebook labeled relationships include friends, job colleague, partner of, friend of).
- Property Graph: a weighted graph with labels, where properties can be assigned both to entities (journal, publisher….) and to relationships (general categories such as name, country, birthplace).
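The property graph, which combines the features listed above, can be sketched with a small in-memory structure. This is an illustrative model only, not the Neo4J API: the classes `Node`, `Relationship` and `PropertyGraph`, the label names and the `WROTE` relationship type are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                                       # e.g. "Author", "Article"
    properties: dict = field(default_factory=dict)   # properties on entities


@dataclass
class Relationship:
    source: str            # directed: from source ...
    target: str            # ... to target
    rel_type: str          # label describing the kind of relationship
    weight: float = 1.0    # numerical weight of the relationship
    properties: dict = field(default_factory=dict)   # properties on relationships


class PropertyGraph:
    """Directed, weighted, labeled graph with properties on nodes and edges."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_relationship(self, rel):
        if rel.source not in self.nodes or rel.target not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.relationships.append(rel)

    def outgoing(self, node_id, rel_type=None):
        """Relationships leaving a node, optionally filtered by type."""
        return [r for r in self.relationships
                if r.source == node_id
                and (rel_type is None or r.rel_type == rel_type)]


# Invented example data.
graph = PropertyGraph()
graph.add_node(Node("a1", "Author", {"name": "Example Author"}))
graph.add_node(Node("n1", "Article", {"title": "Example headline"}))
graph.add_relationship(Relationship("a1", "n1", "WROTE", weight=1.0))
```

Because relationships are stored with a direction, `outgoing("a1")` returns the authored article while `outgoing("n1")` returns nothing, which mirrors the directed behaviour described for Twitter following relationships.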
In the context of FANDANGO, a particular application of these databases matches the ambition of Task T4.4 Source credibility scoring, profiling and social graph analytics. This task aims to detect nodes associated with fake content generation and the relationships with these entities. For this purpose, it is expected to have a complete definition of the multiple actors involved in the fake news detection paradigm. This will make it possible to express the complete environment and to exploit sources from multiple entities (for example, a news item published by an author with a bad reputation is likely to be biased or even fake). This paradigm definition is also commonly known as an ontology. There are some general-purpose news-related ontologies in the field of news analysis2. These approaches will be taken as a starting point for the news analysis3.
The integration of NEO4J into the Hortonworks Data Platform (HDP) as a service has been done. The process is summarized as follows:
- In HDP, a folder must be created at ‘/var/lib/ambari-agent/cache/stacks/HDP/2.6/services’ with the name of the service, ‘Neo4J’.
- Go into the folder and clone the https://github.com/cas-bigdatalab/ambari-neo4j repository.
- [optional] Change the configuration (general parameters: IP, PORTS, SECURITY….) in the configuration file at ‘/master/configuration/neo4j.xml’.
- Start the HDP and go to add a service…
What this repository does is create /etc/yum.repos.d/neo4j.repo, install the most recent version of the software and attach it to the whole stack of services in the HDP platform.
For the interest of the entire research/development community, a Docker Hub image has been created integrating the services required by UPM for FANDANGO.4
SOURCE
Characteristics of the data origin.
2 The IPTC is the global standards body of the news media that provides the technical foundation for the news ecosystem: http://dev.iptc.org/rNews
3 BBC Ontology: https://www.bbc.co.uk/ontologies/storyline
4 Docker Hub FANDANGO modules: https://hub.docker.com/r/tavitto16/fandango_hdp/
CHARACTERISTIC    DESCRIPTION
Main Source       Spark and FANDANGO WebApp
Ownership         FANDANGO
Type              JSON and CSV documents
Access Control    Authenticated access
Persistence       The data will be used for graph analytics and data visualizations by a third party.
DATA TRANSFORMATION
Since NEO4J will be employed for graph analysis, the transformations applied to the origin data will consist of a set of queries and graph operations. This set of operations will be used to find relevant patterns and to analyze the credibility of some sources using graph algorithms. In the end, the output of these operations must be a new graph (if the graph has been modified) as well as a set of metrics or results, if they are required.
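As a toy illustration of a graph operation that yields metrics, the sketch below scores each news item by the mean credibility of the sources linked to it and flags the items below a threshold. The scoring rule, the default score of 0.5, the threshold of 0.3 and the function names `score_articles` and `flag_suspicious` are all assumptions made for this example; the document does not specify which graph algorithms will be used.

```python
def score_articles(edges, source_scores, default=0.5):
    """Return article_id -> mean credibility of the sources linked to it.

    edges:         iterable of (source_id, article_id) pairs
    source_scores: source_id -> credibility in [0, 1]
    default:       score assumed for sources without a known credibility
    """
    linked = {}
    for source_id, article_id in edges:
        linked.setdefault(article_id, []).append(
            source_scores.get(source_id, default))
    return {article: sum(scores) / len(scores)
            for article, scores in linked.items()}


def flag_suspicious(article_scores, threshold=0.3):
    """List the articles whose score falls below the threshold."""
    return sorted(a for a, s in article_scores.items() if s < threshold)
```

Here the graph itself is left unchanged and only metrics are produced; an operation that added or removed relationships would additionally return the updated graph.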
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    FANDANGO Web App and Siren Investigate.
Ownership        FANDANGO
Type             JSON document
Access Control   Authenticated access
Persistence      The data will be used for graph analytics and data visualizations by a third party.
3.2.2.5. UPM INTEGRATION V – SIREN INVESTIGATE
Siren Investigate will be used to visualize the graph database with all the entities and relationships involved in FANDANGO's ontology. In addition, basic modifications and analytics within the graph can also be performed using this module.
Moreover, Siren Investigate will communicate with Neo4j in case the latter is required to perform more advanced graph analytics.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      NEO4J, FANDANGO Web App
Ownership        FANDANGO
Type             JSON, ontology file (owl, rdf).
Access Control   Internal network
Persistence      Real-time visualizations
DATA TRANSFORMATION
In this process, the data will be transformed only if the graph analysis employs algorithms that modify the current graph (i.e. add new entities or relationships, or remove some of them). In this case, the transformation will consist of updating the current graph and storing the new version in the platform to be visualized later on. However, the format of the data must be the same as the original. This transformation will only affect the information provided by the graph, not its structure.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    FANDANGO Web App
Ownership        FANDANGO
Type             JSON and ontologies configuration files
Access Control   Internal network
Persistence      Real-time visualizations
3.2.3. CERTH INTEGRATIONS
The integrations highlighted in red in Figure 5 are going to be implemented by the partner CERTH.
Figure 5 - CERTH Integrations
3.2.3.1. CERTH INTEGRATION I – HDFS
HDFS will be the main storage module FANDANGO will be using to push data that will be immutable. It is convenient to work with and will hold any type of data of any format.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      Data from data shippers through Apache NiFi
Ownership        FANDANGO
Type             HDFS holding any type of data
Access Control   Internal network
Persistence      Depends on each source, as described in the previous sections.
DATA TRANSFORMATION
As described in the data loading section.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    All processing systems, e.g. MLlib, Spark, etc.
Ownership        FANDANGO
Type             Mostly JSON files, but depends on the original data
Access Control   Internal network
Persistence      As defined in the data loading section
3.2.3.2. CERTH INTEGRATION II – HBASE
HBASE will be used for structured data, such as information coming from RSS feeds or open data portals from European and national organizations.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      RSS and open data
Ownership        FANDANGO unless otherwise defined by the data provider
Type             Data table
Access Control   Internal network
Persistence      As defined in the data loading section
DATA TRANSFORMATION
No transformations required.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    All processing modules, e.g. MLlib and Spark
Ownership        FANDANGO
Type             Data table
Access Control   Internal network
Persistence      As defined in the data loading section
3.2.3.3. CERTH INTEGRATION III, IV AND V – APACHE ZEPPELIN
Apache Zeppelin is a notebook environment. It will be used to work on the data that are stored in the FANDANGO cluster and to experiment directly on the data, without the need to transfer data to and from the cluster, for research and prototyping purposes. As it is used as a sandbox area, different datasets and formats will be loaded into it. Data in this area will only be persisted while required for prototyping purposes, and access to the notebooks will be controlled through user authentication.
3.2.3.4. CERTH INTEGRATION VI – ELASTICSEARCH
Elasticsearch will collect all the processed data and the outcomes of the processing modules for the identification of trustworthiness markers.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      Processing modules implemented in WP4
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      Deleted after processing
DATA TRANSFORMATION
No transformations required.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    Elasticsearch
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      Kept for the duration of the project
3.2.3.5. CERTH INTEGRATION VII – SPARK
Spark will be used for processing the available data in the developed modules of WP4.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      HDFS
Ownership        FANDANGO
Type             Any format
Access Control   Internal network
Persistence      Kept in HDFS for the duration of the project
DATA TRANSFORMATION
No transformations are required.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    Spark
Ownership        FANDANGO
Type             Any format
Access Control   Internal network
Persistence      In-memory processing only
3.2.3.6. CERTH INTEGRATION VIII – MLLIB
MLlib is used for Machine Learning modules and works in collaboration with Spark. As such, all the details that apply to Spark apply to MLlib as well.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      HDFS or HBASE
Ownership        FANDANGO
Type             Any format
Access Control   Internal network
Persistence      Kept in the storage as defined in the data loading section
DATA TRANSFORMATION
No transformations are required.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    MLlib
Ownership        FANDANGO
Type             Any format
Access Control   Internal network
Persistence      In-memory processing only
3.2.3.7. CERTH INTEGRATION IX – ELASTICSEARCH
This integration will be used initially for pilot 0.1, and it will be evaluated whether it will remain in future versions of the FANDANGO platform.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      Apache NiFi
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      In-memory only
DATA TRANSFORMATION
No transformations apply here.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    Elasticsearch
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      For the duration of the project, or as otherwise defined by the data sources' terms of use.
3.2.4. LVT INTEGRATIONS
The integrations highlighted in red in Figure 6 - LvT Integrations are going to be implemented by the partner LvT.
Figure 6 - LvT Integrations
3.2.4.1. LVT INTEGRATION I – KAFKA
Kafka will be used for queueing up the news items before they are analyzed for fake content and saved.
The collection of news from different sources (web sites, open data, web app, etc.) will produce hundreds or thousands of items. For this reason Kafka is needed; otherwise it would not be possible to process all the news together. In addition, we do not want to save all the news we retrieve into databases, but only the fake news our Machine Learning system recognizes.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      Apache NiFi, FANDANGO App
Ownership        FANDANGO
Type             JSON
Access Control   Authenticated access
Persistence      Real-time screen visualization only
DATA TRANSFORMATION
Kafka does not apply any kind of transformation; therefore, all news items must follow an agreed JSON format before being put into Kafka.
For example:
{
  "source": "aaaa",
  "date": "dddd",
  "title": "xxxx",
  "body": "yyyy",
  "urls_image": ["xxx", "xxx", ..., "xxx"],
  etc.
}
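Because Kafka passes messages through unchanged, a producer may want to check each item against the agreed format before sending it. The sketch below is one possible check, assuming the five field names shown in the example above; the expected types, the function name `validate_news_message` and the error strings are hypothetical, and no Kafka client is involved.

```python
import json

# Field names come from the example above; the expected types are assumptions.
REQUIRED_FIELDS = {
    "source": str,
    "date": str,
    "title": str,
    "body": str,
    "urls_image": list,
}


def validate_news_message(raw):
    """Parse a JSON string and return (message, errors)."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(message, dict):
        return None, ["top-level value must be a JSON object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in message:
            errors.append(f"missing field: {name}")
        elif not isinstance(message[name], expected):
            errors.append(f"wrong type for field: {name}")
    return message, errors


if __name__ == "__main__":
    sample = json.dumps({
        "source": "aaaa", "date": "dddd", "title": "xxxx",
        "body": "yyyy", "urls_image": ["xxx"],
    })
    print(validate_news_message(sample)[1])
```

An item with a non-empty error list would be rejected or routed aside before it reaches the queue; extra fields are tolerated, since the agreed format may contain more than the five shown.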
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    HDFS, HBase
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      In-memory processing; deleted after processing
3.2.4.2. LVT INTEGRATION II – FANDANGO WEB APP
The web app provides a set of interfaces to manage, in a single touch point:
- data from the different FANDANGO layers
- distributed data resources, and transfers them efficiently and securely.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      FANDANGO platform
Ownership        FANDANGO
Type             JSON
Access Control   Public access and authenticated access
Persistence      In-memory processing
DATA TRANSFORMATION
It is a web app control, so it does not need any kind of transformation.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    FANDANGO platform
Ownership        FANDANGO
Type             JSON
Access Control   Public access, authenticated access
Persistence      Real-time screen visualization, and Hive/HBase to save the configurations
3.2.4.3. LVT INTEGRATION III, IV, V – APACHE ZEPPELIN
Apache Zeppelin is a notebook environment that supports Python. It will be used to work on the data that are stored in the FANDANGO cluster and to experiment directly on the data, without the need to transfer data to and from the cluster, for research and prototyping purposes. As it is used as a sandbox area, different datasets and formats will be loaded into it. Data in this area will only be persisted while required for prototyping purposes, and access to the notebooks will be controlled through user authentication.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      HBase, HDFS, Elasticsearch
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      In-memory processing
DATA TRANSFORMATION
HBase already contains data in JSON format; therefore, it does not require any transformations.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    Apache Zeppelin
Ownership        FANDANGO
Type             Serialized object, text, JSON and others
Access Control   Internal network
Persistence      In-memory processing
3.2.4.4. LVT INTEGRATION VI – ELASTICSEARCH
Elasticsearch centrally stores the data and will be used to analyze news using natural language processing tools.
Elasticsearch is necessary because it implements a search engine that will be used to search for similar news based on semantic context.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      Spark
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      In-memory processing
DATA TRANSFORMATION
Elasticsearch will apply NLP pipelines depending on the language of the news. The main transformations that will be applied are: tokenization, lemmatization, syntactic parsing, n-grams, etc.
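Two of the transformations named above, tokenization and n-grams, can be shown in a few lines. This is a plain-Python illustration of the concepts only, not Elasticsearch's analyzers; the function names are hypothetical, and lemmatization and syntactic parsing are left out because they depend on language-specific resources.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    tokens = tokenize("Fake news spreads fast.")
    print(tokens)
    print(ngrams(tokens, 2))
```

Word bigrams like these are one simple way to compare two news items for overlapping phrasing.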
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    Elasticsearch
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      Kept indefinitely
3.2.4.5. LVT INTEGRATION VII – SPARK
Apache Oozie is a workflow scheduler that is used to manage Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work as a directed acyclic graph (DAG) of actions. Oozie can weave a Spark job into a workflow. The workflow waits until the Spark job completes before continuing to the next action.
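The "wait until the previous action completes" behaviour of a DAG workflow can be sketched with the Python standard library. This is a stand-in for the concept, not Oozie's own workflow definition format; the action names (`ingest`, `spark_job`, `store_results`) and the function `run_workflow` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(dependencies, actions):
    """Run each action once all the actions it depends on have finished.

    dependencies maps an action name to the set of names it depends on.
    actions maps an action name to a zero-argument callable.
    Returns the order in which the actions were run.
    """
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        actions[name]()  # blocks until this action completes
        order.append(name)
    return order


if __name__ == "__main__":
    names = ("ingest", "spark_job", "store_results")
    actions = {n: (lambda n=n: print("running", n)) for n in names}
    dependencies = {"spark_job": {"ingest"}, "store_results": {"spark_job"}}
    print(run_workflow(dependencies, actions))
```

Here `store_results` cannot start before `spark_job` returns, which is the property the paragraph above describes for a Spark action inside a workflow.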
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      Oozie
Ownership        FANDANGO
Type             JSON
Access Control   Authenticated access and internal network
Persistence      In-memory processing
DATA TRANSFORMATION
Apache Oozie will not be required to perform any data transformation.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    Spark
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      In-memory processing
3.2.4.6. LVT INTEGRATION VIII – SPARK
Spark is useful for analyzing millions of records in a short time. Therefore, in this project, Spark will be employed to process the news provided by the different data sources. This is done using MLlib, Spark's Machine Learning library. The interaction with the saved news is useful for capturing user feedback about the truthfulness of a news item. This feedback will be processed by the machine learning models.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      HDFS
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      Kept indefinitely
DATA TRANSFORMATION
In order to re-train the Machine and Deep Learning algorithms in MLlib, we will apply different data transformations to make the formats compatible with the libraries used.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    Spark
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      In-memory processing
3.2.4.7. LVT INTEGRATION IX – SPARK
Spark is useful for analyzing millions of records in a very short time. Therefore, in this project, Spark will be employed to process the news provided by the different data sources using MLlib, Spark's Machine Learning library.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      Kafka
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      In-memory processing, Kafka persistence, deleted after processing.
DATA TRANSFORMATION
In order to use the Machine and Deep Learning algorithms in MLlib, we will apply different data transformations to make the formats compatible with the libraries used.
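The text does not say which transformations will be applied. As one common illustration of making text compatible with a learning library, the sketch below turns a news item into a fixed-length numeric vector with the hashing trick. It is plain Python and does not use Spark; MLlib ships its own feature transformers (for example `HashingTF`), which this stand-in does not reproduce. The function names and the vector size of 16 are hypothetical.

```python
import re
import zlib


def hashed_term_counts(text, num_features=16):
    """Map a text to a fixed-length vector of term counts (hashing trick)."""
    vector = [0] * num_features
    for token in re.findall(r"\w+", text.lower()):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


def news_to_features(message, num_features=16):
    """Turn a news item (dict with 'title' and 'body') into a feature vector."""
    text = f"{message.get('title', '')} {message.get('body', '')}"
    return hashed_term_counts(text, num_features)


if __name__ == "__main__":
    print(news_to_features({"title": "Fake news", "body": "fake story"}))
```

Every item yields a vector of the same length regardless of its text, which is the property most learning algorithms require of their input.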
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    HDFS
Ownership        FANDANGO
Type             JSON, text file
Access Control   Internal network
Persistence      Data saved until they are removed
3.2.4.8. LVT INTEGRATION X – HBASE
HBASE will be used to save structured data, such as analyzed news or news that comes from Kafka.
SOURCE
Characteristics of the data origin.
CHARACTERISTIC   DESCRIPTION
Main Source      Kafka
Ownership        FANDANGO
Type             JSON
Access Control   Internal network
Persistence      Saved until they are removed
DATA TRANSFORMATION
Apache HBase will not be required to perform any data transformation.
TARGET
Characteristics of the data target.
CHARACTERISTIC   DESCRIPTION
Target System    Apache Zeppelin
Ownership        FANDANGO
Type             In-memory structure
Access Control   Internal network
Persistence      In-memory processing
4. CONCLUSION
The purpose of this document is to provide an initial overview of the required data integrations and of how data is going to be curated, in order to direct the initial development and facilitate the overall coordination of activities between the partners. This structure should evolve along with the development of the project and will be updated in later stages. Further definitions are already planned in deliverables D2.2 - Data Interoperability and data model design and D3.1 - Data model and components.
Nonetheless, the architectural structure, data ingestion and data integration definitions will direct the development of the first versions of the solution and play a crucial role in refining requirements and validating the platform.
5. ANNEX – DATA SILOS FOR EUROPEAN CONTENT, CLIMATE CHANGE AND MIGRATION
This annex lists the initial data silos used to collect articles for the FANDANGO project.
EU OPEN DATA PORTAL
NAME: EU Open Data Portal
URL: http://data.europa.eu/euodp/en/home
DESCRIPTION, PROVIDER AND AVAILABILITY: The European Union Open Data Portal (EU ODP) provides access to an expanding range of data from the European Union (EU) institutions and other EU bodies. Only EU institutions, agencies and bodies can provide data for the EU Open Data Portal – a single point of access for EU data. These data can be used and reused for commercial or non-commercial purposes.
Total datasets available: 12,209
PUBLISHERS:
• European Parliament (54 datasets)
• Council of the European Union (3 datasets)
• European Commission (11072 datasets)
• European Central Bank (31 datasets)
• European External Action Service (0 datasets)
• European Economic and Social Committee (2 datasets)
• Committee of the Regions (2 datasets)
• European Investment Bank (2 datasets)
• European Ombudsman (1 dataset)
• European Data Protection Supervisor (6 datasets)
• EU body or agency (1035 datasets)
MAIN SECTORS: Social questions; Science; Environment; Employment and working conditions; Economics; Finance; Trade; Production, technology and research; Industry; European Union; Agriculture, forestry and fisheries; Energy; Transport; Business and competition; International relations; Geography; Education and communications; Law; International organisations; Politics; Agri-foodstuffs
GRANULARITY: European; Eurozone (when relevant); National; Other
GEOGRAPHICAL COVERAGE: France (3014), Italy (2964), Austria (2948), Germany (2929), Spain (2908), Belgium (2874), Netherlands (2867), Denmark (2849), Ireland (2842), Portugal (2839), Sweden (2823), Luxembourg (2814), Greece (2813), United Kingdom (2804), Finland (2802), Slovenia (2730), Hungary (2727), Poland (2713), Czech Republic (2685), Latvia (2683), Estonia (2679), Slovakia (2678), Lithuania (2678), Bulgaria (2632), Romania (2623), Cyprus (2562), Malta (2551), Croatia (2299), Norway (1116), Switzerland (1038), Liechtenstein (988), Serbia (966), Former Yugoslav Republic of Macedonia (963), Bosnia and Herzegovina (953), Montenegro (950), Albania (922), Iceland (902), Russia (896), Turkey (885), Monaco (794), Ukraine (790), San Marino (777), Belarus (772), Andorra (767), Moldova (749), Vatican City (740), Kosovo (737), Åland Islands (726), Gibraltar (711), Guernsey (708)
LANGUAGE: EN, FR, DE
MEDIA: Data
RESOURCES FORMAT: ZIP (8800), HTML (7496), text/tab-separated-values (7326), PDF (876), XML (818), Excel (405), CSV (379), application/msaccess (157), RDF (155), OCTET STREAM (146), TXT (75), DOC (58), webservice/sparql (49), text/n3 (49), application/x-dbase (45), XLSX (36), JSON (18), JPEG (16), PNG (13), PPT (11), application/msword (9), WEBSERVICE/SPARQL (7), TIFF (7), RSS (5), N3 (3), DOCX (3), KML (2), GIF (2), interactive webpages (1), file (1), application/x-compress (1), application/javascript (1), OWL (1), MXD (1), E00 (1), Access (1)
NOTE/COPYRIGHT: Data can be reused free of charge and without any copyright restrictions. (REUSE OF EU DATA HAS TO BE SHARED WITH THE PORTAL). Already listed in the Fandango Agreement.
EUROPEAN DATA PORTAL
NAME: European Data Portal
URL: https://www.europeandataportal.eu
DESCRIPTION, PROVIDER AND AVAILABILITY: The European Data Portal harvests the metadata of Public Sector Information available on public data portals across European countries. Information regarding the provision of data and the benefits of re-using data is also included. At the very last count, it provides access to 817,747 datasets. Among others:
- Italy – 38,259
- Spain – 28,693
- Netherlands – 22,672
- Belgium – 7,315
- Greece – 6,709
(THOSE DATASETS SHOULD BE CHOSEN CAREFULLY, TO AVOID UNNECESSARY INGESTIONS)
MAIN SECTORS: Datasets by categories: Agriculture, Fisheries, Forestry & Food; Energy; Regions & Cities; Economy and Finance; Health; Population & Society; Government & Public Sector; International Issues; Transport; Environment; Science & Technology
GRANULARITY: All EU Countries
LANGUAGE: All EU members' languages
MEDIA: CSV, XLA, XML, RDF, JSON
RESOURCES FORMAT: Provides Datasets URI and SPARQL Queries (and 78 catalogues on EU Countries Datasets)
NOTE/COPYRIGHT: Already listed in the Fandango Agreement. LICENCE NEEDED. OPEN LICENSE ASSISTANT: https://www.europeandataportal.eu/en/content/show-license
EUROPEAN PRESS ROOM (Press releases)
NAME: European Union Newsroom – Press Releases
URL: https://europa.eu/newsroom/
DESCRIPTION, PROVIDER AND AVAILABILITY: Press releases by EU Institutions of the last 30 days. Press releases databases:
• Committee of the Regions
• Council of the EU and European Council
• Court of Justice of the European Union
• European Central Bank
• European Commission
• European Court of Auditors
• European Data Protection Supervisor
• European Economic and Social Committee
• European Investment Bank
• European Ombudsman
• European Parliament
• Eurostat – Statistical Office
MAIN SECTORS: Asylum and migration; Business; Business, taxation and competition; Consumer affairs and public health; Culture, education and youth; Economy and the euro; Employment and social rights; Energy, environment and climate; Enlargement, external relations and trade; EU regional and urban development; Food, farming and fisheries; Institutional affairs; International aid, development and cooperation; Justice and citizens' rights; Research and innovation; Security and defence; Statistics; Transport and travel
GRANULARITY: All EU Countries
LANGUAGE: EN, FR, DE
MEDIA: Audio, Video, Text
RESOURCES FORMAT: E-mail and RSS (https://europa.eu/newsroom/rss-feeds_en#press-releases). PODCASTS AND VODCASTS: European Parliament; European Commission; Daily press briefings Podcast; Daily press briefings Vodcast; Press conferences Vodcast
NOTE/COPYRIGHT: Already listed in the Fandango Agreement. (FREE TO USE?) We should ask for clarification: https://europa.eu/european-union/contact/write-to-us_en
EUROPEAN MEDIA MONITOR (News Explorer)
NAME: European Media Monitor News Explorer
URL: http://emm.newsexplorer.eu/
DESCRIPTION, PROVIDER AND AVAILABILITY: The News Explorer uses JRC-developed technology to automatically generate daily news summaries, allowing users to see the major news stories (news clusters) in various languages for any specific day and to compare how the same events have been reported in the media written in different languages. Users can also see the list of most mentioned names and find further automatically derived information (e.g. variant name spellings, titles and phrases, list of the most recent articles and list of related persons and organizations). News Explorer carries out the following tasks:
• cluster all news articles of the day, separately for each language, into groups of related articles;
• for each cluster, identify names of people, places, organizations;
• apply approximate name matching techniques to all names found in the same cluster, in order to identify which name variants may belong to the same person;
• link the monolingual clusters with the related clusters in the other languages;
• identify the most typical article of each cluster and use its title for the cluster;
• store the extracted information in a database, learning more about each person, etc. every day;
• occasionally, the Wikipedia online encyclopedia is automatically searched for images and for further multilingual name variants.
MAIN SECTORS: Clustered news; Countries; People; Other names; Alerts; Timeline
GRANULARITY: Austria, Belgium, Germany, Spain, France, U.K., Italy, Netherlands, United States
LANGUAGE: English, Spanish, Greek, Dutch and 15 other languages
MEDIA: Text only
RESOURCES FORMAT: Available on the web pages and in RSS format (http://emm.newsexplorer.eu/rss?type=clusters&language=it)
NOTE/COPYRIGHT: WE SHOULD ASK THEM DIRECTLY. Already listed in the Fandango Agreement.
EUROSTAT
NAME: Eurostat
URL: http://ec.europa.eu/eurostat
DESCRIPTION, PROVIDER AND AVAILABILITY: Eurostat is the statistical office of the European Union. Database of European Statistics, by theme and A to Z.
MAIN SECTORS: Statistics by theme: General; Regional; Economy and Finance; Population and Social conditions; Industry, Trade and Services; Agriculture and Fisheries; International Trade; Transport; Environment and Energy; Science, Technology and Digital Society
GRANULARITY: All EU Countries
LANGUAGE: English, German, French (mainly)
MEDIA: Text only
RESOURCES FORMAT: http://ec.europa.eu/eurostat/data/database; SDMX Web Services; JSON and Unicode Web Services; BULK DOWNLOAD: http://ec.europa.eu/eurostat/data/bulkdownload
NOTE/COPYRIGHT: Already listed in the Fandango Agreement. All Eurostat databases and electronic publications are available free of charge via the website. EUROSTAT requires notification of use and the INDICATION of provenance (EUROSTAT).
EUROBAROMETER
NAME: Eurobarometer (EU Commission) on Public Opinion
URL: http://ec.europa.eu/commfrontoffice/publicopinion/index.cfm/General/index
DESCRIPTION, PROVIDER AND AVAILABILITY: Eurobarometer was established in 1974. Each survey consists of approximately 1000 face-to-face interviews per country. Reports are published twice yearly. Reproduction is authorized, except for commercial purposes, provided the source is acknowledged. Special Eurobarometer reports are based on in-depth thematic studies carried out for various services of the European Commission or other EU Institutions and integrated in the Standard Eurobarometer's polling waves.
MAIN SECTORS: Public Opinion and… contains: Eurobarometer A-Z, Eurobarometer Timeline, Eurobarometer 40 years, Eurobarometer Almanac. Eurobarometer Interactive is the search engine with FAQ. Links (to Twitter). Archives points to the old website, renovated in 2016.
GRANULARITY: All the EU members since 1974
LANGUAGE: All the EU languages since 1974
MEDIA: Text only
RESOURCES FORMAT: No RSS. The data is generally provided in PDF format, but there is the possibility to download XLS format from the Open Data Portal (https://data.europa.eu/). ARCHIVES: http://ec.europa.eu/commfrontoffice/publicopinion/archives_en.htm
NOTE/COPYRIGHT: Already listed in the Fandango Agreement. Reuse is authorized, provided the source is acknowledged. The Commission's reuse policy is implemented by the Decision of 12 December 2011 - reuse of Commission documents [PDF, 728 KB].
OTHER POSSIBLE DATA SILOS
(Pending FURTHER verification)
EUROPEAN E-RESOURCES CENTRE
NAME: European Library and E-Resources
URL: http://ec.europa.eu/libraries/
DESCRIPTION, PROVIDER AND AVAILABILITY: Online search of the resources on EU policies, law and more in the European Commission Library's electronic collections
MAIN SECTORS: Open Access Resources. Books, eBooks, Journal Articles and more
GRANULARITY: All European Countries
LANGUAGE: English
MEDIA: Text only
RESOURCES FORMAT: Direct Search on the website (possible other access)
NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT
EUROPEAN EXTERNAL ACTION (EU Foreign Policy)
NAME: EUROPEAN EXTERNAL ACTION SERVICE (EEAS)
URL: http://risis.eu/data/
DESCRIPTION, PROVIDER AND AVAILABILITY: The EEAS is the European Union's diplomatic service. It helps the EU's foreign affairs chief – the High Representative for Foreign Affairs and Security Policy – carry out the Union's Common Foreign and Security Policy.
MAIN SECTORS: The website contains many documents, publications and infographics on EU Foreign Policy, Security and Defence
GRANULARITY: Almost all EU Countries' Research Projects
LANGUAGE: English, French, Italian, Spanish
MEDIA: Text, pictures, graphics
RESOURCES FORMAT: RSS or Direct download
NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT
EUROPEAN SCIENTIFIC DATA (Zenodo)
NAME: ZENODO
URL: https://zenodo.org/
DESCRIPTION, PROVIDER AND AVAILABILITY: Free and Open Digital Archive built by CERN and OpenAIRE to facilitate scientific data exchange among researchers
MAIN SECTORS: Zenodo offers access to 1831 Scientific 'Communities'
GRANULARITY: European Countries
LANGUAGE: English
MEDIA: Text
NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT
RESOURCES FORMAT: Output via OAI-PMH. Direct Search
EUROPEAN UNIVERSITY INSTITUTE (EUI)
NAME: European University Institute (Firenze)
URL: https://www.eui.eu/
DESCRIPTION, PROVIDER AND AVAILABILITY: The European University Institute (EUI) is a unique international centre for doctorate and post-doctorate studies and research, situated in the Tuscan hills overlooking Florence. Since its establishment 40 years ago by the six founding members of the then European Communities, the EUI has earned a reputation as a leading international academic institution with a European focus. The European Commission supports the EUI through the European Union budget. The EUI library boasts around half a million volumes in the Institute's specialist areas, attracting external researchers with an interest in Europe. The campus also hosts the Historical Archives of the European Union (HAEU), which provides an unparalleled insight into the workings of the European Union.
MAIN SECTORS: The website has a search feature on the Historical Archives of the EU Institutions and on a unique collection of 150 private archives of pro-European associations and personalities
GRANULARITY: Almost all European Countries
LANGUAGE: English, French, German, Italian
MEDIA: Text
NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT
RESOURCES FORMAT: Direct Search
RISIS (European Scientific Research)
NAME: RESEARCH INFRASTRUCTURE FOR RESEARCH AND INNOVATION POLICY STUDIES (RISIS)
URL: http://risis.eu/data/
DESCRIPTION, PROVIDER AND AVAILABILITY: The overall objective of the project is to build a distributed infrastructure on data relevant for research and innovation dynamics and policies. RISIS aims at opening to European researchers a large number of (linked) datasets covering 7 themes:
1. Research funding: datasets that contain information about research projects funded by the EC (EUPRO, CORDIS), by trans-border funding programs between EC member states (JOREP), and other funders (OPEN-AIR, Open Funder database).
2. Datasets on dominant sciences and technologies (nanotechnology dataset).
3. Datasets covering firm innovation dynamics.
4. Public sector research in Europe, with several data on European Higher Education Institutions (RISIS-ETER) and on European public research organizations (under development) and on their academic performance (Leiden ranking).
5. Research careers, with access to the European mobility survey (MORE) and the German panel on doctoral students and their careers (early career facility) and, at a later stage, with access to a platform and/or dataset integrating multiple national sources (under development).
6. A specific repository, SIPER, on policy evaluations, articulated with the OECD-World Bank Innovation policy platform and giving access to the accumulated knowledge on policy instruments and policy mixes.
7. Several datasets that provide linked data, such as data from statistical offices, geographical classifications, patents (USPTO), open science (Open-Air), and others. For more information see the SMS Data Store.
MAIN SECTORS: The datasets cover five critical dimensions: ERA dynamics (3 datasets), firm innovation dynamics (3 datasets), public sector research (3 datasets), research careers (3 datasets) and a repository on research and innovation policy evaluations. Several of these datasets are accessible online
GRANULARITY: Almost all EU Countries' Research Projects
LANGUAGE: English
MEDIA: Text, pictures, graphics
NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT
RESOURCES FORMAT: Access only through accreditation
OpenAIRE
NAME: The OpenAIRE 2020 Project
URL: https://www.openaire.eu
DESCRIPTION, PROVIDER AND AVAILABILITY: 50 partners, from all EU countries, and beyond, will collaborate to work on this large-scale initiative that aims to promote open scholarship and substantially improve the discoverability and reusability of research publications and data. The initiative brings together professionals from research libraries, open scholarship organizations, national e-Infrastructure and data experts, IT and legal researchers, showcasing the truly collaborative nature of this pan-European endeavor.
MAIN SECTORS: The website has a search feature on 24 million publications and almost 700 thousand datasets on the 2020 Projects
GRANULARITY: Almost all European Countries
LANGUAGE: English, French, German, Italian
MEDIA: Text
NOTE/COPYRIGHT: Not listed in the Fandango Agreement Doc. REQUIRES DIRECT CONTACT
RESOURCES FORMAT: Direct Search