
Support Action

Big Data Europe – Empowering Communities with Data Technologies

Project Number: 644564
Start Date of Project: 01/01/2015
Duration: 36 months

Deliverable 3.2: Technical Requirements Specifications & Big Data Integrator Architectural Design I

Dissemination Level: Public

Due Date of Deliverable M6, 30 June, 2015 (officially shifted to M7)

Actual Submission Date M7, 31 July, 2015

Work Package WP3, Big Data Generic Enabling Technologies and Architecture

Task T 3.2, T 3.3

Type Report

Approval Status Final

Version v1.0

Number of Pages 36

Filename D3.2_Technical_Requirements_Specifications_and_Big_Data_Integrator_Architectural_Design_I.pdf

Abstract: A wide set of requirements has been gathered from each Societal Challenge based on interviews, workshops and surveys. Based on an analysis of these requirements, a profile platform architecture has been proposed. This generic profile will be made more specific for pilot instances for individual societal challenges in the later stages of the project. The information in this document reflects only the author's views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided "as is" without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/her sole risk and liability.

Project funded by the European Union’s Horizon 2020 Research and Innovation Program (2014 – 2020)


History

Version | Date | Reason | Revised by
0.0 | 21/04/2015 | Initial version | Timea Turdean (SWC)
0.1 | 07/05/2015 | First version | Bert Van Nuffelen (TF)
0.2 | 22/06/2015 | Intermediate version | Hajira Jabeen (InfAI)
0.3 | 25/06/2015 | Peer review | Mohamed N. Mami (FH), Paul Massey (TF)
0.4 | 27/06/2015 | Address peer review comments | Hajira Jabeen (InfAI)
0.5 | 30/06/2015 | Peer review | Iraklis Klampanos, Stasinos Konstantopoulos (NCSR-D)
0.6 | 31/07/2015 | Final version | Hajira Jabeen (InfAI)

Author List

Organisation | Name | Contact Information
SWC | Timea Turdean | t.turdean@semantic-web.at
InfAI | Hajira Jabeen | [email protected]
InfAI | Jens Lehmann | [email protected]
TenForce | Paul Massey | [email protected]
TenForce | Bert Van Nuffelen | [email protected]
FH | Mohamed Nadjib Mami | [email protected]


Executive Summary

This document details the technical requirements gathered from different societal challenges and outlines the approach for setting up the Big Data Europe data management platform.

The technical requirements gathered from all the societal challenges address the four Big Data challenges, namely Volume, Velocity, Variety and Veracity of data. The data requirements cover all four features of big data, with a particular focus on volume and velocity. The analysis of the data value chain has revealed that each societal challenge has a different set of requirements, resulting in a diverse set of tools and frameworks required for each step of the data value chain.

This, in turn, raises the need for a diverse and multifunctional platform that caters for the variety of big data management requirements. Different big data frameworks exist to handle different aspects of big data. It has been decided that the components of the Big Data platform will be based on existing open-source Big Data technologies and platforms. In order to integrate them in a single platform, a fault-resilient and flexible platform manager is needed. A deployment platform based on HDFS (Hadoop Distributed File System), Mesos and Docker has been chosen to handle the stack of big data components in harmony.


Abbreviations and Acronyms

LOD - Linked Open Data
SC - Societal Challenge
BDE - Big Data Europe
GB - Gigabyte
TB - Terabyte
PB - Petabyte
JSON - JavaScript Object Notation
SC1 - Societal Challenge 1
API - Application Programming Interface
PDF - Portable Document Format
JPEG - Joint Photographic Experts Group
PNG - Portable Network Graphics
GIF - Graphics Interchange Format
GML - Geography Markup Language
GeoTIFF - Geographic Tagged Image File Format
TIF - Tagged Image File
GMLJP2 - Geography Markup Language JPEG 2000
CCTV - Closed-circuit television
CMIP5 - Coupled Model Intercomparison Project Phase 5
CMIP6 - Coupled Model Intercomparison Project Phase 6
CORDEX - Coordinated Regional Climate Downscaling Experiment
SPECS - Seasonal-to-decadal climate Prediction for the improvement of European Climate Services
HDF - Hierarchical Data Format
NetCDF - Network Common Data Form
ASCII - American Standard Code for Information Interchange
MRI - Magnetic Resonance Imaging
INSPIRE - Infrastructure for Spatial Information in Europe
PPT - PowerPoint templates


Table of Contents

Introduction

Technical Requirements Specification

Summary of Community­Driven Data Management Requirements

Technical Requirements

Core Functional Requirements

Data Volume Requirement

Data Variety Requirement

Data Veracity Requirements

Requirements along the Data Value Chain

Multilinguality of Data as Requirement

Instances of Functional Requirements

Instance 1: Unstructured and Semistructured Data Processing

Instance 2: Multimedia Data Processing

Instance 3: Sensor Data Processing

Instance 4: Geospatial Data Processing

Specific Societal Challenge Requirements

Data Value Chain

Data Generation, Acquisition and Harvesting

Data Analysis and Processing

Data Curation & Storage

Data Dissemination, Visualization and Usage

Non­Functional Requirements

Big Data Integrator Platform Architectural Design

Platform Profiling

Selection of Platform Components

Architecture

File System

Resource Manager

Scheduler

Coordination


Data Input

Data Acquisition

Data Storage

Data Processing

Data Integration and Communication

Operational Frameworks

Challenge

Platform Deployment Solution

Docker

Summary of Proposed Profile Architecture

Configuration of Components

Extension of Components

Generation of Pilot Instances

Conclusion

References


List of Figures

Figure 1: BDE Platform deployment

Figure 2: Big Data Platform Components

List of Tables

Table 1: Findings across Societal Challenges

Table 2: Findings across Societal Challenges about the four Vs and different data types

Table 3: Specific findings across Societal Challenges grouped by the data processing velocity requirement

Table 4: Non-functional requirements for the BDE Platform across SCs

Table 5: Societal Challenges and corresponding problem instances

Table 6: BDE Profile Platform Components

Table 7: Platform Instances and corresponding data tools


1. Introduction

In deliverable D3.1 the technological landscape of Big Data technologies was presented. Based on this initial assessment, it was decided to base the BDE platform upon existing open-source Big Data distribution frameworks.

This is complemented with a standard deployment methodology that allows a full instantiation of the classical Lambda architecture using the selected Big Data components. Within BDE, work will be done to augment the Lambda architecture with semantic capabilities. The ambition is to integrate Linked Data technology frameworks seamlessly within a hybrid Lambda architecture.

This deliverable takes the past work in WP3.1 one step further. It bridges, for the first time in the project, the requirements and expectations expressed by possible users from almost all societal challenges with the BDE platform. In section 2, the user requirements analyzed in deliverable D2.4 are grouped and assessed from a technical perspective. The first insights show the possibility of BDE platform profiling.

In section 3, the BDE platform profiling is further elaborated from a technical perspective. It takes the decisions reported in deliverable D3.1 and describes an approach to profiling a generic BDE platform that can cater for a diverse set of requirements but is not yet tailored to a concrete pilot instance.

2. Technical Requirements Specification

The technical requirements were gathered from the community groups of the 7 Societal Challenges, which were asked to describe the data challenges they face. These 7 Societal Challenges are:

SC1: Health, demographic change and wellbeing;
SC2: Food security, sustainable agriculture and forestry, marine and maritime and inland water research, and the bioeconomy;
SC3: Secure, clean and efficient energy;
SC4: Smart, green and integrated transport;
SC5: Climate action, environment, resource efficiency and raw materials;
SC6: Europe in a changing world - inclusive, innovative and reflective societies;
SC7: Secure societies - protecting freedom and security of Europe and its citizens.


The outcome of the first requirements elicitation is documented in Deliverable 2.4.1; a summary is provided below. In the subsequent sections, these data management requirements are reformulated as technical requirements.

Summary of Community-Driven Data Management Requirements

The surveys, interviews and workshops conducted so far in the requirements elicitation phase gathered preliminary information about the requirements of a BDE Platform. It is important to note that the outcome per Societal Challenge depends on the people interviewed. Still, the results reveal great insight into the communities' needs and requirements. It should also be mentioned that inside each domain there is a great variety of needs and requirements.

The information presented next builds upon the findings presented in Deliverable 2.4.1 and comes from 131 online survey answers, around 40 interviews, and 2 workshops. The next deliverable on this topic (Deliverable 3.5) will try to complete the picture for all Societal Challenges, while here we report on the still-incomplete requirements elicitation findings, which are marked in Table 1 with a light gray background.

Table 1: Findings across Societal Challenges (columns: Societal Challenges 1-7; cell values follow the legend below)

Data Value Chain:
- Data Generation, Acquisition and Harvesting: using APIs; using metadata; ingestion components
- Data Analysis and Processing: data enriching; analysis
- Data Curation: cleaning; normalizing; data update
- Data Storage: amount of data used; storage
- Data Dissemination, Visualization and Usage & Data-driven Services: present/visualize the output; publish data

Specific:
- Multilingualism of data
- Temporal aspect of data

Legend:
- Common requirements for a common functionality identified
- Specific requirements for a common functionality identified
- Specific requirements for a specific (set of) functionality identified
- No common requirements identified so far
- No relevant data available

In Deliverable 2.4.1 we presented the findings of the requirements elicitation along the Data Value Chain, which can be revisited in Table 1. The Data Value Chain was used in the workshops to structure one of the group discussions, and some of the interview questions complement this information. Detailed findings along the Data Value Chain are discussed later on.

For data generation, acquisition and harvesting, most of the Societal Challenges use APIs, but each one uses specific metadata and specific crawling, ingestion and loading components. From the perspective of data analysis and processing, the methods and algorithms are also Societal Challenge specific. For data curation, all Societal Challenges need or have ways to solve deduplication, validation and data quality issues. Data storage and data dissemination, visualization and usage are also quite specific per Societal Challenge, or have requirements shared between two or three of them. Also, most of the Societal Challenges publish data one way or another and need a proper way to do so. The specific aspects of each Societal Challenge are presented later in this document.

Before talking about core technical requirements, we want to draw attention to the answers to the interview question:

"How would you judge if a Big Data Management solution is a success?"

The main answers include the desire to have a solution that can help:

- optimize management;
- assist in fostering informed decisions for each Societal Challenge;
- let researchers focus on research and problem solving rather than on data collection and acquisition (since in some Societal Challenges this, plus dealing with a large variety of data types, hinders them in making better-informed decisions).

Attention must also be given to the answers to the interview question:

"What should a Big Data Management solution not do?"

Whether we talk about a solution offering core requirements or instance-specific requirements, a Big Data Management solution will not be used, based on the opinions we gathered, if it fails regarding security, reliability and stability. At the same time, a new solution needs to be adopted by domain experts, in which case proper informative instructions on how to use it are needed, so that not only Big Data experts can take advantage of such a solution.


Technical Requirements

Technical requirements were collected via the online survey [1], through which participants could name the components, software and frameworks they work with within their domains. In the interviews, more details were gathered to find out how the data and technical aspects are used to solve their use cases. In the workshops, a more collaborative naming and listing of specific technical requirements was done, which helps shape the core and instance-specific technical requirements of the BDE Platform.

Table 2: Findings across Societal Challenges about the four Vs and data types (columns: Societal Challenges 1-7; cell values follow the legend below)

4 Vs:
- Volume
- Velocity
- Variety
- Veracity

Data types:
- Archives
- Documents & scientific publications
- Multimedia
- Social media & news & events
- Data storage
- Machine log
- Sensor data
- Geospatial data
- Organizations & researchers' profiles

Legend:
- Core requirement
- Specific requirements
- Not required
- No relevant data available so far

The functional and non-functional requirements follow the FURPS model [2] for classifying software quality attributes, developed at Hewlett-Packard and still widely used. The FURPS model includes aspects of:

Functionality: security, future extensibility, capability, ...

Usability: aesthetics, documentation, consistency, ...

Reliability: failure rate, recoverability, data consistency, ...

Performance: responsiveness, scalability, throughput, ...

Supportability: testability, maintainability, adaptability, ...


Core Functional Requirements

In the requirements elicitation phase, some of the core requirements that the BDE Platform must fulfil have been identified. These are presented in this section.

Table 2 complements the requirements found along the Data Value Chain with the requirements of the four Vs (volume, velocity, variety and veracity) and with the different data types, which have increased over the past 12 months for each Societal Challenge. The common requirements found across Societal Challenges follow next.

Data Volume Requirement

Data volume is an issue for all Societal Challenges. However, the relevant tools and platforms developed by their communities have reached different levels of maturity: while for some SCs the community has only recently come across the need for large-scale data processing infrastructures, for others infrastructures and tools already exist and are widely used by the community.

The BDE Platform must deal with genome files which are "~500GB for raw or aligned data" in Societal Challenge 1, "TB to PB of data" [2] in Societal Challenge 5, and "Earth Observation data" [3] in Societal Challenge 7.

Since data variety is one of the core requirements, we also look at the data types used across Societal Challenges. Our findings indicate that documents and publications, multimedia data and sensor data are the most common data types, shared by at least four of the Societal Challenges.

Data Variety Requirement

Societal Challenge 1 deals with "data with mixed levels of standardization":

"Electronic Health Records (ICD10 [4], CDISC [5], Snomed [6]: well-defined data standards, also free text with MESH [7]/UMLS [8] tags)"
"Publications (MESH [9] for abstracts, free text)"
"Medical imaging (MRI standards, raw image data). Data with low levels of standardization"
"Metabolomics (there are some databases, some identifiers, lots of free text)"

In Societal Challenge 5 "the kinds of data of the users mentioned were the following": "Agro-environmental data, remote sensing data, model input/output. Popular formats are [...] GIS-databases. Unstructured like documents." In Societal Challenge 2 data is "soil data mainly, INSPIRE geodata [10], geology data, land use data, land cover data, climate data, satellite images, publications mainly as literature". Societal Challenge 4 uses "maps, traffic movement, position, speed, original destination" and "vehicle sensors data measuring the environment of the vehicle. Like traffic signs, constructions areas, lane markers, other vehicles, pedestrians, traffic lights". Data such as "GIS-databases, shape, NetCDF [11], JSON" and "agro-environmental data, remote sensing data" are some of the types used in Societal Challenge 5. In SC6 "mostly economic and Social Science data" is used. And Societal Challenge 7 uses "samples of high and very high resolution images" and "video streaming" but also "video, text, webcam and other collateral data".

Data Veracity Requirements

For Societal Challenge 1 it was identified that "a platform would therefore need to have ways of validating data and conveying this validation to the user":

"record and display provenance"
"record reliability through probabilities (although participants indicate that these are not universal)"
"record whether data is expert-validated or not (user-generated/crowdsourcing)"
"visualize the provenance/reliability of data and statements"

Requirements along the Data Value Chain

Along the Data Value Chain, not many common needs were identified across all Societal Challenges. Previously we found that data generation, acquisition and harvesting must be possible through APIs, but further investigation is needed into which specific APIs were meant. There is a general need for data curation solutions, which should include deduplication of data, validation of data and general solutions that ensure high data quality, especially in Societal Challenge 1. When looking at the requirements for data dissemination, visualisation and usage, some common needs were discovered. In most Societal Challenges the final data output takes the form of interactive charts, graphs, spreadsheets and maps. Also, most of the Societal Challenges publish data one way or another and need a standardised way to do so.

Multilinguality of Data as Requirement

In six of the Societal Challenges, working with data in multiple languages was mentioned. This concerns research publications, which are not always just in English, as well as the research data itself. Dealing with multilinguality of data must therefore be part of the core requirements.

Societal Challenge 1 deals with "multilingual datasets including literature, patents, electronic health records". In the workshop of Societal Challenge 5 it was mentioned: "currently there are metadata and multilingual problems. In this context, libraries or services should facilitate data access."

Instances of Functional Requirements

So far we discussed core requirements which can be found in most of the Societal Challenges, as already existing or needed in the near future. In Table 2 we also observe many different and specific requirements from each Societal Challenge. In Table 3 we took out the core requirements previously identified and added some aspects which were found to be common to two or three of the Societal Challenges. We were looking to find whether cross-Societal-Challenge requirements can be grouped such that they can be met by specific BDE Platform instances.

The data processing velocity is related to the data types used in each Societal Challenge, such that:

- text analysis is needed when working with documents and publications;
- image interpretation, object recognition and pattern recognition are needed when using multimedia data.

The need for real-time processing of data splits the Societal Challenges into four groups at this point. Looking at the correlation between the type of data that needs to be processed and the velocity requirement, we decided to start with four example instances for the BDE Platform. The decision is based on the common data properties. In addition, we keep in mind that data volume, data variety and data veracity are core requirements which are part of these instances as well.


Table 3: Specific findings across Societal Challenges grouped by the data processing velocity requirement

Columns: Societal Challenges needing real-time processing (1, 4, 5, 7) and not needing it (2, 6).

Rows:
- Velocity
- Data in archives
- Documents & publications data
- Multimedia data
- Social media & news & events data
- Data in data storages
- Machine log data
- Sensor data
- Geospatial data
- Organizations and researchers' profiles
- Text analysis
- Metadata usage
- Multimedia data analysis
- Statistical methods
- GIS analysis
- Spreadsheets as output
- Graphs & interactive charts as output
- Maps as output
- Temporal aspect of data

Legend: Common requirement; Specific requirements; Not required.

The four real-time processing BDE Platform example instances chosen are:

Instance 1: unstructured and semistructured data processing

Instance 2: multimedia data processing

Instance 3: sensor data processing

Instance 4: geospatial data processing


Instance 1: Unstructured and Semistructured Data Processing

Besides the core functional requirements identified in Table 2, the BDE Platform should be able to handle the real-time processing of semistructured data. Four out of seven Societal Challenges work with documents of different formats, publications and, in general, textual information.

A BDE platform should be able to handle most of the following formats and capabilities:

CSV or Excel files (found in SC2)
PDF documents (found in SC2)
XML data formats (found in SC7)
Specific metadata on document level (title, abstract, author found in SC2; different metadata found also in SC5)
Entity extraction and automated tagging of content (found in SC2, SC4)
Text mining (found in SC1)
Opinion mining and sentiment extraction (found in SC4)

The previous examples also point to the variety of semistructured data types that need to be processed by the BDE Platform. Some of these data can be of large volume, which is indicated by the rate at which data is created: "Indication of publication speed is that 2 new pubmed articles are published per minute." (mentioned in the workshop of SC1)

These data are usually stored in:

archives, (remote) repositories, silos (in SC2, SC7, SC5)
file systems on personal computers or remote servers (in SC7, SC5)

The data output of Societal Challenges which work with semistructured data consists of:

spreadsheets (in SC1)
reports (in SC7)
charts, time series, scatter plots, infographics, integrated viewers and dashboards (in SC5, processed using R [12], Python [13] and Matlab [14])

Instance 2: Multimedia Data Processing

It was identified that four out of seven Societal Challenges work with multimedia data: SC1, SC2, SC4, SC7. The multimedia data comprises:

(satellite) images: jpeg, png, gif (found in SC2, SC7, SC1)
GML, GeoTIFF, .TIF, GMLJP2 (found in SC7)
video (streaming) (mp4) (found in SC2, SC7, SC4)
audio (found in SC7)

There is no standard for video data, so some Societal Challenges use a proprietary format from Elektrobit [15].

The data enrichment and processing required for these formats include:

image processing techniques (found in SC7)
visual analysis (found in SC7)
image interpretation (found in SC7, where they add structural detail of an object)
2D and 3D maps (found in SC5)
object recognition (found in SC4)


pattern recognition (found in SC4)
scene recognition (found in SC4)
detection of objects in videos (found in SC4)

The multimedia data used is also normalized in some cases. In Societal Challenge 7, "time-series of images" are normalized through "radiometric correction, atmospheric correction".

The output of the processed multimedia data is presented in the form of graphs, interactive charts and maps.

Instance 3: Sensor Data Processing

Sensor data is present in most Societal Challenges. However, it was found to be especially needed in Societal Challenge 4, where parallel processing of traffic data, video data and textual data is used to obtain information about real-time detection of objects, driving behaviour and traffic control. Streamlined data processing is also a topic for Societal Challenge 5, where sensor data is analysed.

Considering velocity, we also look at data acquisition, which in some cases happens at a fast rate as well: "information every 6 seconds" is updated in Societal Challenge 4, and "data are continuously acquired and updated" in Societal Challenge 7. In the case of Societal Challenge 1, it was mentioned in the workshop: "In a clinical setting speed could be interesting, especially with new opportunities such as crowdsensing".

Societal Challenge 4 also deals with floating car data and with data from cameras and Bluetooth detectors, as well as data from CCTV cameras [16] in cities on roads and motorways, police car cameras, train movements, buses, harbours and airports. With these data they do analytics based on statistical methods, create maps about the driving behaviour of vehicles and create complex models on interactive charts.

Instance 4: Geospatial Data Processing

Geospatial data is used in Societal Challenges 4, 5 and 7. In Societal Challenge 5, observation data is collected from more than 30 global and regional models. GIS data management and analysis is common practice in this Societal Challenge. Besides observational data, SC5 also uses:

"Climate Data, i.e. CMIP5/CMIP6, CORDEX [17], SPECS [18] and many others from different providers"
"Agro-environmental data, remote sensing data, model input/output"

During the Societal Challenge 5 workshop it was discussed:

"The data format is important. The tendency is towards GRIB [19]. NetCDF [20] is preferable for analysis. Tables of parameters, if local (e.g. from integrated models), constitute a problem. The World Meteorological Organization [20] is responsible for such guidelines. Popular formats are: structured e.g. multidimensional arrays, Netcdf3, Netcdf4, HDF5, GRIB1, GRIB2, and ASCII formats, GIS-databases, shape, JSON; and unstructured like documents.

Regarding the metadata, the data exchange between different geographical regions and communities must be standardized."


In Societal Challenge 7 the metadata mentioned is Earth Observation Metadata profile(s) and OGC Web Service Common (OWS) [21], and some metadata is defined by the Defence Geospatial Information Working Group [22]. Societal Challenge 7 uses ArcGIS [23] and presents its output in the form of geoportals, geospatial databases and web mapping.

In Societal Challenge 4, traffic data is geolocated and timestamped before being used.

Specific Societal Challenge Requirements

Societal Challenges also have some specific requirements along the Data Value Chain, which are presented here.

Data Value Chain

1. Data Generation, Acquisition and Harvesting

Societal Challenge 4 uses Solr [24] with Nutch [25] for web crawling and search. Each Societal Challenge uses different types of metadata: Societal Challenge 7 uses the gene ontology and the BioAssay ontology [26]. Societal Challenge 5 uses the federated querying tool SemaGrow for heterogeneous data integration [27]. In Societal Challenge 6 data is not harvested; much of the data is gathered via negotiation and human interaction with major depositors, some via self-deposit systems. They use DDI2 [28] for cataloguing the datasets.

2. Data Analysis and Processing

Societal Challenges 1, 4 and 5 need solutions to deal with the temporal aspect of the data. For Societal Challenge 1, electronic patient records change over time. In Societal Challenge 4, traffic data is geolocated and timestamped. At the latest when the data is stored, this information has to be included as well.

Data analysis must be ready to deal with, as mentioned above: geolocating and timestamping traffic data; visual analysis and image interpretation; for video, object recognition, pattern recognition, scene recognition and event detection; and for textual data, opinion and sentiment extraction, feature extraction, automated tagging and entity detection. From the Societal Challenge 1 workshop we found that data analysis and processing is about:

"Identify outliers"
"Combine analysis methods, including text mining and sequence analysis"
"Detect or predict events on both a patient level or at an epidemiological level"

In the workshop of Societal Challenge 5 it was stated: "Climate experts indicated that they typically make use of in-house analysis tools in order to carry out their work." The need to "integrate analysis software in the tools aggregator" was identified as a requirement, such that the time spent on acquisition and processing could be reduced. Since there is no closed set of analysis tools that can cover SC5 requirements, BigData Europe can offer a data management tool that can prepare datasets in the GRIB and NetCDF formats that are common in SC5, rather than trying to integrate SC5 analysis tools into the platform.


In Societal Challenge 6, data is statistically analysed using IBM SPSS Statistics software [29] and AMOS [30]. Societal Challenge 5 uses advanced statistics and machine learning techniques.

3. Data Curation & Storage

In the Societal Challenge 1 workshop, it was stated that there is a need to:

"Deal with mapping of standards and ontology alignment. Be able to deal with one-to-many mappings"
"Deal with data provenance and calculate reliability of data"

In the workshop of Societal Challenge 5, one of the topics discussed was the need for standards and ontology alignment, because this currently takes place manually after data from different sources has been gathered locally; in the end, the datasets are too heterogeneous to be processed and analysed further. Another requirement mentioned in the SC5 workshop was the need for a solution for data versioning and for identifying and dealing with incomplete and temporarily erroneous data sets.

In SC4, data cleaning is required for anonymity reasons and is done by aggregation. In SC1, validation pipelines exist for chemical structures and chemical properties to ensure 'proper' information is available.

Storage options are also different, and diverse even within a single Societal Challenge. In Societal Challenge 2 they work with RDF data as well as documents, and use specific databases for each type, including relational databases.

4. Data Dissemination, Visualization and Usage

The dissemination, visualization and usage of data have some common needs but are rather specific for each Societal Challenge. Societal Challenge 6 presents the data using IBM's SPSS software [29] or Microsoft Word, while Societal Challenge 4 needs to visualize 2D and 3D time-varying scenarios.

The output of Societal Challenge 1 is episodic disease predictions, e.g. "in 24 hours, patient X will have condition Y". Societal Challenge 2 shows high-level statistics available from Altmetric [32]. In Societal Challenge 5, R [12], Python [13] and Matlab [14] are used to present integrated viewers and dashboards, scatter plots, infographics and graphs.

Non-Functional Requirements

Based on the FURPS model [2], a table has been compiled with some of the answers gathered from workshops and interviews which are relevant to each requirement. At this stage in the requirements elicitation phase there is not enough data to conclude what the core non-functional requirements for the BDE Platform would be. From Table 4 we can only indicate the requirements which are most important at the moment in the 7 Societal Challenges. The most important non-functional requirements, appearing in most of the Societal Challenges, are interoperability, security and user-friendliness.

The people interviewed and the participants of the BDE workshops talk in general about APIs and the possibility to easily connect with other existing tools. This points to the requirement of interoperability for the BDE Platform. Recall the definition of interoperability: "a property of a product or system, whose interfaces are completely understood, to work with other products or systems, present or future, without any restricted access or implementation" [33].

Security is strongly correlated with privacy of data. In some cases sensitive data is used which is not meant to be disseminated, which makes a secure BDE Platform a key requirement. Data preservation and integrity are seen as core aspects of a secure BDE Platform.

User-friendliness is also an important requirement for the BDE Platform. The people we interviewed and the participants of the workshops are quite diverse: they have very different backgrounds, with some knowing more about Big Data and some much less. There is always a fear of a complex tool which is not well documented and not intuitive enough, even if the tool might help solve some of their biggest issues.

The other statements collected from interviews and workshops can be viewed in Table 4. These statements all represent quotations.


Table 4: Non-functional requirements for the BDE Platform across SCs (statements collected from Societal Challenges 1, 2, 4, 5, 6 and 7; empty fields in the original table mean no data available)

Reusability: Interoperability
- "Provide APIs" (this was deemed critical by the participants); interoperability is vulnerable to changes
- "We would prefer to access the platform through REST APIs; we would like the platform to support the use of URIs as identifiers and existing vocabularies for data models. Integration of heterogeneous data is the main problem."
- Integration with tools commonly used for pre- and post-processing; "Depends if R can be connected"; integration with existing tools used by researchers
- "The most important characteristics for a new technology are interoperability"

Reusability: Portability
- Cross-platform; "data transfer to cloud infrastructure is a bottleneck today"

Security: Safety
- Privacy and security concerns "hang over everything like a black cloud"
- "Data security is important, especially data preservation and integrity, considering that our platforms contain data provided by other partners"
- "Security: some projects we will not do on cloud. Sometimes we have to rely on a Spanish data provider to guarantee that the data is held in Spain"; "Privacy protection is the main issue"; "Security is a very big issue"; "Data protection and data privacy"
- "First of my concerns is privacy"; "There is the privacy and data protection concern, which dominates."
- "The platform must deal with securing the privacy of the data users and the information contained in it"; "Data security is a key point in our data management solution."

Future extensibility
- "I would use it mostly for real-time data cleaning and enrichment, especially disambiguation and mapping to external authority data"

User-friendly
- Easy to use, acceptance by customers
- "Simple and user friendly and fast"; community adoption
- "It is important to be user-friendly"
- "The platform should be easy to use"

Documentation
- Easy explanations of how to use the platform
- "It is important [...] with sufficient instructions about its purpose"; "Contains informative instruction"
- "The platform should provide supporting tools on how to use it (instructions, tutorials, etc.)"; "a new technology to be adopted must have a reasonable level of support."

Responsiveness
- High responsiveness; response time in terms of latency, within seconds
- "Fast"

Reliability
- "A Big Data Management solution creates problems if it fails regarding security, reliability and stability."

Availability
- "The system should be available"

Scalability
- "Must scale: up and down (for embedded solutions)"; worldwide scalability
- "The increasing flow of data (e.g. from EO satellites) will require new technologies to store, manage, analyse and disseminate them"; "a new technology to be adopted must have a reasonable level of stability"

Efficiency
- Efficient data filtering: "I would like to select the right data on a specific region fast and perform some image processing techniques in an automatic way"; "it should improve the efficiency in archive management"; "The most important characteristics for a new technology are: efficiency"

Adaptability
- "We may have some limitations if the response from the API is huge and has to be stored by the client for further processing."
- I/O capacity: "Can open/visualize different types of data files and not need to install another tool for each type"

Sustainability
- "The most important characteristics for a new technology are: sustainable (economically, technically)"


3. Big Data Integrator Platform Architectural Design

In Deliverable 3.1, the initial architectural design principles of the BDE platform were presented, resulting in a rather generic blueprint of the BDE platform. This section is devoted to materializing this blueprint into a deployable architecture.

Figure 1: BDE Platform deployment

The ambition of the BDE platform is to support diverse data management challenges, some of which have been identified in the first requirements assessments (see Deliverable 2.3 and section 2 of this deliverable). As detailed in section 2, different Societal Challenges correspond to different problem instances:

Instance 1: unstructured and semistructured data processing
Instance 2: multimedia data processing
Instance 3: sensor data processing
Instance 4: geospatial data processing


Table 5: Societal Challenges and corresponding problem instances (rows: Instances 1-4; columns: Societal Challenges 1-7; cells marked as Related or Not relevant)

To be able to respond to this diverse set of requirements, the BDE platform is based on a flexible, yet robust and scalable software deployment foundation (i.e. Mesos [34] and Docker [35]). Deliverable 3.1 introduced the technology; the subsequent Deliverable 3.3.1 will provide actual installation and deployment instructions. This deployment creates the necessary flexibility in the BDE platform to facilitate, on the one hand, the deployment of very diverse software components in a working instance and, on the other hand, the selection possibilities required to customize the BDE platform to the application needs of the various societal challenge partners.

Platform Profiling

The analysis of the requirements in section 2 (Instances of Functional Requirements) indicates that there exist data processing challenges that cut across societal challenges.

Platform profiling is a way to bring the BDE platform closer to potential end­users. Our BDE platform instantiation process passes through the following major stages:

1. Selection of platform components
2. Configuration of components in a running setup for a particular instance profile
3. Extension of components with generic domain processing
4. Generation of pilot instances


Selection of Platform Components

The first stage is the base BDE platform. A component is added to the BDE platform when it has been validated with respect to the deployment strategy of the BDE platform proposed in the following sections.

We need a platform that can meet all the Big Data challenges arising from the different Societal Challenges and their different needs. Therefore, in order to deal with the aforementioned data properties, the following components have been selected for the Big Data platform.

Architecture

The framework requires a generic, scalable and fault-tolerant distributed data processing architecture. It should satisfy the need for a robust system that is fault-tolerant, both against hardware failures and human mistakes, and able to serve a wide range of workloads and use cases, as required by the BDE problem instances. This set of requirements is met by the selection of the Lambda Architecture (LA) for the platform.
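To make the choice concrete, the following minimal Python sketch (illustrative only, not BDE code; all names are invented for the example) shows the LA idea: an immutable master dataset feeds a periodically recomputed batch view, a speed layer maintains a real-time view of recent data, and the serving layer merges both views at query time.

```python
from collections import defaultdict

class LambdaSketch:
    """Toy Lambda Architecture: batch view + real-time view, merged on query."""

    def __init__(self):
        self.master_dataset = []               # immutable, append-only record log
        self.batch_view = defaultdict(int)     # precomputed by the batch layer
        self.realtime_view = defaultdict(int)  # incrementally updated by the speed layer

    def ingest(self, key):
        # New data goes to both the master dataset and the speed layer.
        self.master_dataset.append(key)
        self.realtime_view[key] += 1

    def run_batch_layer(self):
        # Periodic full recomputation over the master dataset (e.g. a MapReduce job).
        view = defaultdict(int)
        for key in self.master_dataset:
            view[key] += 1
        self.batch_view = view
        self.realtime_view.clear()  # the speed layer now only covers data since this run

    def query(self, key):
        # Serving layer: merge the (stale) batch view with the (fresh) real-time view.
        return self.batch_view[key] + self.realtime_view[key]

la = LambdaSketch()
for event in ["sensor-a", "sensor-b", "sensor-a"]:
    la.ingest(event)
la.run_batch_layer()
la.ingest("sensor-a")        # arrives after the batch run
print(la.query("sensor-a"))  # 3: two from the batch view, one from the speed layer
```

In the real platform, the batch layer corresponds to Hadoop/Spark jobs over HDFS, the speed layer to Storm or Spark Streaming, and the serving layer to the selected data stores.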

File System

The platform requires a distributed file system providing storage, fault tolerance, scalability, reliability and availability to the multitude of SC partners. This has resulted in the selection of the Apache Hadoop Distributed File System (HDFS).
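As a minimal illustration of how a component would interact with HDFS, the sketch below uses the third-party `hdfs` Python package (pip install hdfs), which talks to the namenode over WebHDFS. The namenode URL, user and paths are assumptions for the example.

```python
# Minimal sketch: read and write files on HDFS over WebHDFS.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.org:50070", user="bde")

# Write a local file into the distributed file system.
client.upload("/data/sc5/climate.nc", "climate.nc", overwrite=True)

# Stream a (possibly very large) file back without loading it fully into memory.
with client.read("/data/sc5/climate.nc", chunk_size=64 * 1024) as reader:
    for chunk in reader:
        pass  # process each 64 KiB chunk

# List a directory, much like `hdfs dfs -ls` would.
print(client.list("/data/sc5"))
```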

Resource Manager

The platform should be able to provide resource management capabilities and support schedulers for high utilization and throughput. This set of properties is delivered by the Mesos distributed systems kernel, which offers optimal resource management for distributed applications.

Apache Mesos: Apache Mesos is a cluster manager, a kernel for distributed operating systems, that provides efficient resource isolation and sharing across distributed applications or frameworks. It operates between the application layer and the operating system layer, and makes it easier and more efficient to deploy and manage applications in large-scale clustered environments. It can run many applications on a dynamically shared pool of nodes.

Similar to a desktop operating system, which manages access to the resources on a computer, Mesos ensures that applications have access to the resources they need in a cluster. Instead of setting up numerous server clusters for different parts of an application, Mesos allows sharing a pool of servers that can all run different parts of an application without interfering with each other, with the ability to dynamically allocate resources across the cluster as needed.


Scheduler

The scheduler needs to schedule the distributed tasks and offer resources so as to increase the throughput of the overall system. Two schedulers, Marathon and Chronos, have been selected for task scheduling in the framework: Marathon keeps long-running services alive, while Chronos handles recurring, cron-like batch jobs. A minimal sketch of submitting an application to Marathon follows.
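Marathon is driven through a REST API (POST /v2/apps). The sketch below is a hedged illustration, not BDE configuration: the Marathon URL and the application definition are assumptions for the example.

```python
# Minimal sketch: submit a long-running Dockerized service to Marathon,
# which will keep the requested number of instances running and restart
# them on failure.
import requests

app = {
    "id": "/bde/demo-web",
    "cpus": 0.5,
    "mem": 256,
    "instances": 2,  # Marathon keeps two copies running
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "nginx:latest",
            "network": "BRIDGE",
            "portMappings": [{"containerPort": 80}],
        },
    },
}

resp = requests.post("http://marathon.example.org:8080/v2/apps", json=app)
resp.raise_for_status()
print(resp.json()["id"], "submitted")
```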

Coordination

The platform requires an efficient system to manage state, distributed coordination, consensus and lock management in the distributed platform. ZooKeeper will be used as a decentralized, fault-tolerant coordination framework.
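As a hedged illustration of the coordination primitives involved, the sketch below uses the `kazoo` ZooKeeper client (pip install kazoo); the ensemble addresses, znode paths and lock names are assumptions for the example.

```python
# Minimal sketch: shared configuration and a distributed lock via ZooKeeper.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.org:2181,zk2.example.org:2181")
zk.start()

# Shared configuration: create a znode if missing, then read it back.
zk.ensure_path("/bde/config")
if not zk.exists("/bde/config/storage"):
    zk.create("/bde/config/storage", b"hdfs://namenode.example.org:8020")
value, stat = zk.get("/bde/config/storage")
print(value.decode(), stat.version)

# Distributed lock: only one component at a time runs the critical section.
lock = zk.Lock("/bde/locks/ingest", identifier="flume-agent-1")
with lock:
    pass  # e.g. rotate an ingestion checkpoint

zk.stop()
```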

Data Input

The requirements from the different Societal Challenges indicate that the data can have the following properties:

- The data comes from numerous heterogeneous sources, dealing with different societal challenges, which requires different approaches to the data and the inference process (Variety).

- The velocity of data can range from streaming satellite data to stationary geospatial data (Velocity).

- Data can be as large as terabytes, e.g. gathered from satellite streams (Volume).

Data Acquisition

Owing to the wide range of input data properties, a set of tools is needed to support the process of gathering, filtering and cleaning data before it is put into a data warehouse or any other storage solution on which data processing can be carried out. The frameworks Apache Flume and Apache Sqoop have been chosen, with the ambition that they cater for all four properties of data. A small ingestion sketch follows the list below.

Apache Flume: A framework to populate HDFS with streaming event data. This corresponds to problem instances 3 and 4.

Apache Sqoop: A framework to transfer data between structured data stores, such as RDBMSs, and Hadoop.
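Flume agents themselves are wired together through properties-file configurations; as a rough, hedged illustration of event ingestion, the Python sketch below pushes a small batch of events to a Flume agent that is assumed to expose an HTTP source with Flume's default JSON handler on port 44444, from where a sink would typically deliver the events to HDFS.

```python
# Minimal sketch: push events to an assumed Flume HTTP source.
import json
import requests

events = [
    {"headers": {"source": "sc4-traffic", "ts": "1438336800"},
     "body": "vehicle=42 speed=87 lane=2"},
    {"headers": {"source": "sc4-traffic", "ts": "1438336806"},
     "body": "vehicle=42 speed=85 lane=2"},
]

resp = requests.post(
    "http://flume-agent.example.org:44444",  # assumed agent address
    data=json.dumps(events),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()  # Flume answers 200 once the batch is committed to its channel
```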

Data Storage

The system requires polyglot storage, with application-specific databases instead of a one-size-fits-all store. For example, different SC instances may require key-value data, linked data, or data in the form of documents or graphs. The intention is to adopt one of MongoDB, Cassandra, HBase or Project Voldemort, depending on the type of data in the platform instance.
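As a small illustration of the polyglot idea, the sketch below (hosts and names are invented for the example; pymongo is assumed available) stores and queries document-shaped records in MongoDB. Key-value or wide-column data would be routed to Voldemort, Cassandra or HBase by the same reasoning.

```python
# Minimal sketch: document-shaped data (e.g. SC1 publication metadata)
# fits a document store such as MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo.example.org:27017")
collection = client["bde"]["publications"]

doc = {"title": "Genome study", "mesh_terms": ["D005796"], "year": 2015}
collection.insert_one(doc)

# Query by a nested attribute, something rigid relational schemas
# handle less naturally.
for hit in collection.find({"mesh_terms": "D005796"}):
    print(hit["title"])
```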

Data Processing

A multitude of tools is available for the types of processing to be performed on the underlying data, including, but not limited to, MapReduce for batch processing, Spark GraphX for iterative processing, and Apache Spark and Apache Storm for data stream and real-time processing.

Data Processing Frameworks: The platform requires different frameworks for the diverse SC instances. Each framework has a different set of strengths and is applicable to a specific set of properties of the underlying data. A brief Spark usage sketch follows this list.

Apache Hadoop: A software framework to write applications that process large amounts of data in parallel using the MapReduce paradigm.

Apache Storm: A distributed real-time computation system for data streams.

Apache Spark: An in-memory data processing engine that provides four libraries:

Spark SQL - library that makes Spark work with (semi-)structured data.

Spark Streaming - library that adds stream data processing to Spark core.

MLlib - library that provides a machine learning framework on top of Spark core.

GraphX - library that provides a distributed graph processing framework on top of Spark core.

Apache Flink: A data processing engine for batch and real-time data processing, similar to Spark, but optimized for cyclic or iterative processes through iterative transformations on collections. This is achieved by optimized join algorithms, operator chaining, and reuse of partitioning and sorting. Flink is also a strong tool for batch processing. Flink streaming processes data streams as true streams, which allows flexible window operations on streams.
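As a brief illustration of how such a framework is used, the following PySpark sketch computes term frequencies over a document corpus on HDFS, the kind of batch text analysis called for by Instance 1. It is illustrative only: the HDFS path and the Mesos master URL are assumptions, and in practice spark-submit would supply them.

```python
# Minimal sketch: batch term counting with PySpark on a Mesos-managed cluster.
from operator import add
from pyspark import SparkContext

sc = SparkContext(master="mesos://zk://zk1.example.org:2181/mesos",
                  appName="bde-term-count")

counts = (
    sc.textFile("hdfs:///data/sc1/abstracts/*.txt")  # one abstract per line
      .flatMap(lambda line: line.lower().split())    # tokenize
      .map(lambda term: (term, 1))
      .reduceByKey(add)                              # aggregate across the cluster
)

for term, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(term, n)

sc.stop()
```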

Data Analysis Tools: The data analysis tools are languages that may be procedural or declarative. A minimal Hive sketch follows this list.

Apache Pig: A high-level platform for creating MapReduce programs to be used with Hadoop to analyse large datasets.

Apache Hive: A data warehouse infrastructure built on top of Hadoop to extract, transform and load data, providing data summarization, querying and analysis.
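For the declarative side, the sketch below issues a HiveQL aggregation from Python via the third-party PyHive package (pip install pyhive), letting Hive compile the query into cluster jobs. The HiveServer2 host, database and table are assumptions for the example.

```python
# Minimal sketch: a declarative analysis with Hive from Python.
from pyhive import hive

conn = hive.connect(host="hive.example.org", port=10000, database="bde")
cur = conn.cursor()

# HiveQL compiles down to jobs on the cluster; no hand-written MapReduce.
cur.execute("""
    SELECT year, COUNT(*) AS publications
    FROM sc1_publications
    GROUP BY year
    ORDER BY year
""")
for year, n in cur.fetchall():
    print(year, n)

conn.close()
```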

Data Manipulation Libraries: These libraries support out-of-the-box implementations of the most common data mining and machine learning algorithms.

SparkR: An R package that provides a lightweight frontend to use Apache Spark from R.

Apache Mahout: Provides free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering and classification.

Data Integration and Communication

This includes orchestration tools for managing data pipelines and metadata management. Owing to the wide range of technologies and schemas used to store and manage data, a set of tools is needed to support the integration of heterogeneous data.

Table 6: BDE Profile Platform Components

Architecture: Lambda
File system: Apache Hadoop Distributed File System (HDFS)
Resource manager: Mesos
Scheduler: Marathon
Coordination: ZooKeeper
Data Acquisition: Apache Flume, Apache Sqoop
Data Stores: MongoDB, Cassandra, HBase, Project Voldemort
Data Processing - Frameworks: Hadoop MapReduce, Apache Spark, Apache Storm, Apache Flink
Data Processing - Tools: Apache Pig, Apache Hive
Data Processing - Libraries: SparkR, Apache Mahout
Data Integration - Message Passing: Apache Kafka
Data Integration - Managing data heterogeneity: SemaGrow, Strabon
Operational Frameworks - Monitoring: Apache Ambari

SemaGrow: A federated querying system that transparently provides a homogeneous view over datasets that are heterogeneous both syntactically (RDF, NetCDF, etc.) and semantically (e.g., different RDF vocabularies, or different NetCDF variable names for the same physical properties).

Strabon: A semantic spatio-temporal RDF store that supports linked geospatial data and offers geo-temporal and relational querying (stSPARQL and GeoSPARQL). Strabon also integrates heterogeneous geo-data, such as data following multiple Coordinate Reference Systems.

Apache Kafka: A distributed publish-subscribe messaging system. A minimal producer/consumer sketch follows.
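The sketch below is a hedged illustration of the publish-subscribe pattern using the kafka-python package (pip install kafka-python); the broker address and topic name are assumptions for the example.

```python
# Minimal sketch: decoupled message passing between platform components.
from kafka import KafkaConsumer, KafkaProducer

# A producer component (e.g. a sensor-data adapter) publishes messages.
producer = KafkaProducer(bootstrap_servers="kafka.example.org:9092")
producer.send("sc4.traffic", b"vehicle=42 speed=87 lane=2")
producer.flush()

# Any number of downstream components subscribe independently to the topic.
consumer = KafkaConsumer(
    "sc4.traffic",
    bootstrap_servers="kafka.example.org:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.offset, message.value)
```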

Operational Frameworks

The operational frameworks provide capabilities for metrics, benchmarking and performance optimization to manage workloads.

Apache Ambari: A system for collecting, aggregating and serving Hadoop and system metrics.


Table 6 summarises the set of components chosen for the BDE platform profiling, corresponding to the set of requirements gathered in this phase.

Table 7: Platform Instances and corresponding data tools (rows: Flume, Sqoop, Scribe, Hadoop, Storm, Spark, Flink; columns: Instances 1-4; cells marked as Relevant or Not Relevant)

The selected set of components aligns neatly with the problem instances pointed out earlier. Table 7 shows the correspondence with the requirement instances.

Integrating such a large number of versatile components, while handling different workloads and varying resource requirements from the different components, is a challenging task; it requires a fault-resilient and highly available platform.

Challenge These multiple Big Data components must be integrated into a single, homogenous platform to fulfill the differing processing requirements from different Societal Challenges. This imposes the need to bundle a variety of Big Data frameworks together in a compatible and peacefully­coexisting environment of the Big Data Europe’s Platform. Such homogeneity is hard to achieve in a single layer.

It should also be kept in mind that the data (and use cases) constantly evolve, usually faster than the technology can keep up. One must therefore be prepared to leverage multiple execution frameworks and to switch between them as necessary.

Platform Deployment Solution

Docker

Docker containers provide a consistent, compact and flexible means of packaging application builds. Delivering applications with Docker on Mesos promises a truly elastic, efficient and consistent platform for delivering a range of applications on premises or in the cloud. Docker isolates each application's dependencies inside its container, irrespective of the host setup on which the Mesos slave is running, thereby making it possible to operate on a single cluster.
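As a sketch of this deployment path, the fragment below submits a Docker-packaged application to Marathon (the scheduler chosen in Table 6) through its /v2/apps REST endpoint; the host name, application id, image and resource figures are illustrative assumptions:

    import requests

    # Illustrative application definition: run one Docker container with
    # 0.5 CPU shares and 512 MB of memory under Mesos, scheduled by Marathon.
    app = {
        "id": "/bde/demo-service",
        "cpus": 0.5,
        "mem": 512,
        "instances": 1,
        "container": {
            "type": "DOCKER",
            "docker": {"image": "nginx:1.9", "network": "BRIDGE"},
        },
    }

    resp = requests.post("http://marathon.example.org:8080/v2/apps", json=app)
    resp.raise_for_status()
    print("Deployed:", resp.json()["id"])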


Figure 2: Big Data Platform Components

Summary of Proposed Profile Architecture

Figure 2 gives an overview of the Big Data Europe platform. Mesos will be used as a generic resource manager for memory and processing, and the data-dependent processing applications execute atop Mesos as Docker containers. The containers run as tasks within the Mesos framework and exit once their task is completed, while Mesos takes care of resource allocation and management for the simultaneous execution of multiple containers. This layout provides heterogeneous framework integration with efficient resource management for the Big Data Europe platform.

Configuration of Components

The second phase covers a deployable configuration of a selected group of components for the BDE platform. Each such instance is configured to achieve a particular data-management effect: for example, a scalable velocity pipeline, or batch processing with full provenance tracing. Moreover, the configuration must adhere to the nonfunctional requirements for the platform, including scalability, portability and interoperability, to name a few.

Extension of Components

In the third phase the generic component instances are replaced with application- and pilot-specific instances. For instance, the generic Storm setup is replaced with a Storm setup that extracts metadata from XML documents, or the generic Hadoop setup is replaced with a Hadoop job analysing satellite images (a minimal sketch of such an extension follows below). This, too, is done in light of the nonfunctional requirements of the platform pilot instance, including scalability and extendibility.
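The following minimal Python sketch illustrates the kind of pilot-specific logic meant here, extracting a few metadata fields from an XML document; the element names are hypothetical, and in a real deployment such a function would run inside a Storm bolt over a stream of documents:

    import xml.etree.ElementTree as ET

    def extract_metadata(xml_text):
        """Pull a few illustrative metadata fields from an XML document."""
        root = ET.fromstring(xml_text)
        return {
            "title": root.findtext("title"),
            "author": root.findtext("author"),
            "date": root.findtext("date"),
        }

    doc = "<doc><title>Report</title><author>BDE</author><date>2015-07-31</date></doc>"
    print(extract_metadata(doc))  # {'title': 'Report', 'author': 'BDE', 'date': '2015-07-31'}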

Generation of Pilot Instances

Aligned with the final step of requirements gathering, particular pilot instances will be created for all Societal Challenges, fulfilling the requirements of each pilot. Our ambition is to keep the platform profile as close as possible to the intended pilot instances, while keeping it flexible enough to allow modifications and enhancements for a variety of further pilots.


4. Conclusion

This deliverable contains the technical requirements and a preliminary Big Data Integrator architecture profile based on the platform instances identified for the different Societal Challenges.

It has been concluded that the different communities across Europe's Societal Challenges have diverse requirements and need different components and frameworks, but at the same time they all share a common data value chain and common Big Data properties.

This document also presents a roadmap to achieving the functionalities of the platform.

The proposed Big Data Integrator platform contains the components that fulfill the technical requirements and cover the identified problem instances. This requires a large number of components to be integrated and to operate in harmony.

This challenge is tackled by selecting open-source software from Hortonworks and by using the Lambda architecture, which is able to deal with multiple types of data streams. Deployment will be carried out using portable, self-contained Docker components that can be deployed on demand.

As mentioned in the document, this is a profile platform which could see refinements and changes later on, based on the particular platform instances for the different Societal Challenges.

The next deliverable, Deliverable 3.5, will contain the requirements elicitation across all seven Societal Challenges; further interviews and workshops will be conducted to this end. Based on these detailed requirements, the platform will be made more specific and pilot-oriented.

