Deliverable D7.5 - Cerfacs

European Data

Grant agreement number: RI-283304

Deliverable D7.5.1 Technology adaptation and development framework in EUDAT WP7.3

Authors: Emanuel Dima, Christian Pagé, Yvonne Kustermann, Reinhard Budich
Status: Draft/Review/Approval/Final
Version: v1.0
Date: November 21, 2013

Abstract:

This deliverable reports on the progress of the construction and integration of the Generic Execution Framework (GEF), as well as the additional tools and components it requires. The focus is on the lessons learned from the first stage of technology adaptation and construction. It describes how existing EUDAT user technologies have been incorporated, including any necessary adaptations. It also outlines the expected behaviour of the framework against User Community needs. The final report (D7.5.2) will describe and assess the EUDAT GEF, including any adaptations that were necessary to accommodate the requirements of the user communities.


Document identifier: EUDAT-DEL-WP7-D7.5.1
Deliverable lead: Christian Pagé
Related work package: 7
Author(s): Emanuel Dima, Christian Pagé, Yvonne Kustermann, Reinhard Budich
Contributor(s): Stephane Coutin, Pascal Dugenie
Due date of deliverable: 01/10/2013
Actual submission date: 21/11/2013
Reviewed by: Morris Riedel, Ari Lukkarinen
Approved by:
Dissemination level: PUBLIC
Website: www.eudat.eu
Call: FP7-INFRA-2011-1.2.2
Project number: 283304
Instrument: CP-CSA
Start date of project: 01/10/2011
Duration: 36 months

Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent the views expressed by the European Commission or its services.

While the information contained in the document is believed to be accurate, the author(s) or any other participant in the EUDAT Consortium make no warranty of any kind with regard to this material including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose.

Neither the EUDAT Consortium nor any of its members, their officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein.

Without derogating from the generality of the foregoing, neither the EUDAT Consortium nor any of its members, their officers, employees or agents shall be liable for any direct or indirect or consequential loss or damage caused by or arising from any information, advice, inaccuracy or omission herein.


Contents

1 Introduction
  1.1 Design Goals
  1.2 Functionality Overview
  1.3 Security
  1.4 Metadata
  1.5 Service Catalog
  1.6 Related work
2 API
  2.1 REST Generalities
  2.2 Request Parameters
  2.3 List of Web Services
    2.3.1 Basic Data Retrieval
    2.3.2 Execution Functions
    2.3.3 Filtering Function
    2.3.4 Map-Reduce Function
    2.3.5 Workflow Management Service
  2.4 Using the API
    2.4.1 Conceptual Example
    2.4.2 Extended Example
  2.5 Job Control and Garbage Collection
  2.6 Differences to Description of Work
3 User Interface
4 Implementation
  4.1 The Web service
  4.2 The iRODS-Based Backend
  4.3 Map-Reduce: Hadoop, Pig and PigLatin
5 Testing Use Cases
  5.1 ENES Use Case
    5.1.1 Data Download
    5.1.2 Data Subsetting
    5.1.3 Scientific Computing
  5.2 GEF implementation in a data node at CINES
    5.2.1 Background
    5.2.2 Scenario
  5.3 CLARIN Use Cases
    5.3.1 Metadata Query Service
    5.3.2 Google Books Ngram
6 Conclusion, notes, discussion

PUBLIC — Copyright © The EUDAT Consortium

1 Introduction

This deliverable reports on the progress of the construction and integration of the Generic Execution Framework (GEF), as well as the additional tools and components it requires. The GEF is a mechanism designed for the enactment of scientific workflows (with certain restrictions) on massive amounts of data, in an environment where the data is readily accessible.

This document focuses on the lessons learned from the first stage of technology adaptation and construction. It describes how existing EUDAT user technologies have been incorporated, including any necessary adaptations. It also outlines the expected behaviour of the framework against User Community needs. The final report (D7.5.2) will describe and assess the EUDAT GEF, including any adaptations that were necessary to accommodate the requirements of the user communities.

1.1 Design Goals

The GEF makes it possible to process datasets at a network location very close to the actual location of the data. The EUDAT CDI already hosts petabytes of data in its datacenters. Analyzing these data often implies transferring them to a different software environment, a prohibitive operation given the data volume and the available network bandwidth. The GEF offers topological proximity, which provides advantages such as fast access and lower network load. An especially useful application of the GEF is filtering and subsetting datasets, then transferring the resulting data to a local computer for further analysis, again lowering the network load and the time needed to perform the analysis.

Being designed as a general-purpose framework for use by many diverse scientific communities, the GEF must be able to work with the tools already in use in these communities, but it must also propose a framework for those communities that have not yet organized their data in a federation. The GEF implementation must therefore be highly flexible and designed for continuous enhancement of its functionality. The minimum required GEF capability is to execute command-line tools against the stored data. An additional execution module for map-reduce jobs will also be integrated. Plugging in additional execution modules must be possible and easy by design (e.g. modules for processing streaming data or modules for statistical analysis).

1.2 Functionality Overview

The GEF is defined as a collection of HTTP web services. The specification of the web services constitutes the API layer. The API is the only layer which is guaranteed to be stable against change; non-backwards-compatible changes will be introduced as new versions of the API are developed. The stable, web-service-based API layer should considerably ease the integration of the GEF with other software tools, including common workflow engines (e.g. Taverna, Kepler) and community-specific data federation interfaces. The API is implemented by various back-end modules (see Figure 1).

Figure 1: GEF structural overview (components: Services, GEF, Map-Reduce back end, Workflows back end, CLI scripts back end, iRODS federation)

In the initial phase the GEF will only support a limited number of fixed functions, or services. Calling a GEF service involves sending an HTTP request to the corresponding web service endpoint and providing the necessary parameters. Some services will be bound to a particular dataset; others will accept an identifier of the input dataset as a parameter. Services that need an input dataset to be specified should refuse the request when the dataset is not close to the service, or when the transfer would take too long.

The individual services will be distributed across various servers in the EUDAT infrastructure. However, a single root URL for the GEF will serve as a request dispatcher (e.g. https://eudat.eu/gef). A request coming to this URL will be redirected to a different GEF endpoint or refused, depending on the service name and the location of the input data. Each GEF endpoint would be close to a data center and thus able to work with the local datasets.

Input and output datasets should be specified indirectly (via URIs or handles/PIDs), not transferred over HTTP. This requirement also applies to temporary files when there is a sequence of steps (except PIDs). Also, datasets should be accessible via different protocols, at least one of which should be capable of resiliently handling large data transfers. It must be noted that although one of the goals of the GEF is to reduce the data volume that needs to be transferred over the network, for the near future the resulting data transfers can still be large by today's standards. The data transfer functionality will be unified with the Lightweight Replication service¹ when both services attain a mature status.

Many workflows consist, for example, of a set of GEF operations executing partially in sequence and partially in parallel. An efficient sequence of GEF operations would execute without any data transfers, the input for one service being the output of the preceding one (the input and output being specified by data handles). Commonly used workflows in the climate and linguistic user communities start with a 'filter' step, and only afterwards process the resulting data. A data transfer to the user would only occur when the workflow is finished.

¹The Lightweight Replication service is a simple service for data movement in and out of the CDI: https://confluence.csc.fi/display/Eudat/Lightweight+Replication+service

For the user, the GEF is the service framework together with the functions offered by this framework. From an implementation perspective, however, the GEF is distinct from the functions: it consists of a generic web service, the selection of the backend solution (iRODS/Hadoop/other), the transfer of parameters to the backend where the function is executed, and the return of results. A function is expressed either as a command-line script (executed via an iRODS rule), a scientific workflow (represented in one of the workflow management systems), or a Pig/Hadoop script. The GEF API is common, but the functions are to be created, owned and maintained by the user communities independently.

As specified above, a function can be implemented in the back end by a scientific workflow. The term workflow, however, encompasses a large variety of meanings. In the scope of the GEF, a workflow is understood as a representation of a data processing functionality, enactable by means of a suitable workflow management system and constrained to execute in the environment provided by a virtual machine with limited external connections.

The GEF execution backend of the prototype implementation uses the iRODS middleware currently employed in EUDAT for data management. A different backend module for executing map-reduce jobs will make use of the Apache Pig and Apache Hadoop projects. Other specific backends can be written by the communities themselves if they already have a mature data federation that does not use iRODS.

1.3 Security

All GEF web services must run over encrypted connections, using the HTTPS protocol. Ideally, a single sign-on (SSO) system will be used, which requires coordination with the EUDAT AAI task force². The currently envisioned solution is to use either OAuth, for simple authorization cases, or client certificates transferred over HTTPS, from which the identity and the relevant attributes of the client can be extracted.

1.4 Metadata

The GEF will only be fully usable if a Search Service (API) is available to look up specific data and return matching URIs/Handles/PIDs. This requires a Metadata Catalog, with at least the common semantics across EUDAT communities as well as supplemental community-specific metadata. A common semantic catalogue needs a well-defined API which allows typical search requests for the data needed as input for the processing in question. The returned URIs/Handles/PIDs of the requested data are then provided to the processing. Until this functionality is available, the user (or the interface) is expected to know the required URIs/Handles/PIDs beforehand.

The current implementation of the EUDAT Metadata Catalog is based on CKAN, an open source data management platform. CKAN provides a rich user interface with facet-based search facilities and also an HTTP REST API which can be used for this purpose.

²http://www.eudat.eu/authentication-and-authorization-infrastructure-aai

1.5 Service Catalog

A catalog of the services available through the GEF will be offered to users, through both an API and a user interface. The catalog will be generated automatically by querying all registered GEF endpoints for their available services. The metadata of each service will contain a human-readable description, the locations (i.e. GEF endpoints/data centers) where the service is available, the data types the service accepts, and details about the other parameters the service requires.

Currently the location of a dataset can be determined only indirectly and approximately, from the server domain specified in the URL.

At each GEF endpoint the same functionality should be available by using the OPTIONS method of the HTTP protocol.

1.6 Related work

SHIWA (SHaring Interoperable Workflows for large-scale scientific simulations on Available DCIs) is an FP7 project that aims to develop new technologies for workflow system interoperability³. The project provides an execution platform where workflows can be executed on various Distributed Computing Infrastructures (DCIs).

The SCAPE (SCAlable Preservation Environment) FP7 project aims to build a scalable platform for digital preservation⁴. The preservation processes will be realized as data pipelines and implemented as workflows expressed in the Taverna workflow system. SCAPE will deploy large-scale workflows and execute them on cloud infrastructures, also collecting the provenance data produced during this process.

Many other research projects use various workflow systems for complex data analysis or contribute to the workflow ecosystem in other ways. Contrail⁵ offers autonomic workflow execution on cloud infrastructures. e-LICO⁶ provides services and tools to assist the user in designing scientific workflows. Wf4Ever⁷ provides a management environment for Research Objects, which it defines as comprising scientific workflows, the provenance data gathered at execution, the interconnections between them and other resources, and the related social aspects.

Work on converting workflows from one representation to another has been done in the frame of the SCI-BUS⁸ project (conversion from the desktop-based KNIME system to the DCI-based system gUSE⁹). A more general solution to the problem of workflow translation was given in the frame of the SHIWA project by introducing an intermediate workflow language, IWIR¹⁰.

Web-based management of workflows has been developed in the P-GRADE project¹¹, where the execution environment is offered by various grid platforms¹².

³http://www.shiwa-workflow.eu
⁴http://www.scape-project.eu/
⁵http://contrail-project.eu
⁶http://www.e-lico.eu
⁷http://www.wf4ever-project.org/
⁸http://www.sci-bus.eu
⁹L. de la Garza, J. Kruger, C. Scharfe, M. Rottig, S. Aiche, K. Reinert, and O. Kohlbacher, 2013. From the Desktop to the Grid: Conversion of KNIME Workflows to gUSE. http://ceur-ws.org/Vol-993/paper9.pdf
¹⁰Kassian Plankensteiner, Johan Montagnat, and Radu Prodan. 2011. IWIR: a language enabling portability across grid workflow systems. In Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science (WORKS '11). ACM, New York, NY, USA, 97-106. http://doi.acm.org/10.1145/2110497.2110509

2 API

This section describes the principles of the API. A separate technical report will follow, describing the API in more detail.

2.1 REST Generalities

Representational State Transfer¹³ (REST) is an architectural style in software engineering, typically used in conjunction with the HTTP protocol for developing web applications. The REST style requires, among other constraints, client-server separation and stateless communication (no preservation of context on the server). In exchange, the resulting system is scalable, reliable and easily modifiable.

An HTTP web service built on REST principles offers its functionality via an HTTP URL that identifies the service. The client can create/update/get/delete resources on the server using the HTTP methods POST/PUT/GET/DELETE.

In addition, the HEAD and OPTIONS methods can be used for querying metadata about the resources or the available operations. The HEAD method is equivalent to the GET method but only returns the header information (the metadata part) and not the actual data. The OPTIONS method, which is usually used for determining the options or requirements of a resource¹⁴, can be used in the GEF to reflect the data operations available for a specific dataset.
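The mapping of resource operations onto HTTP methods described above can be sketched in a few lines. The resource URL below is a hypothetical example, not part of the GEF API:

```python
# Sketch: mapping CRUD operations to HTTP methods for a REST resource.
# The resource URL used in the example call is purely illustrative.

CRUD_TO_HTTP = {
    "create": "POST",
    "update": "PUT",
    "get":    "GET",
    "delete": "DELETE",
}

def rest_request(operation, resource_url):
    """Return the (HTTP method, URL) pair for a CRUD operation."""
    return CRUD_TO_HTTP[operation], resource_url

method, url = rest_request("get", "https://eudat.eu/gef/dataset/12345/00-6789-ABCDE-F")
```

A client would then issue `method` against `url`; HEAD and OPTIONS follow the same pattern but carry no body.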

2.2 Request Parameters

An HTTP request can carry an arbitrary number of parameters. Raw data can also be transferred to the service as part of a request. As an example, a common form of a web service URL using GET parameters is:

https://eudat.eu/gef/webservice?parameter1=value1&parameter2=value2

Each GEF web service is expected to take a number of parameters, of various kinds. Some of the parameters are common to all web services. These parameters are defined as key-value pairs, where the values can be either free-form strings or drawn from a limited set of values (controlled vocabulary).

The common parameters are:

certificate This parameter contains the certificate for authenticating a user (in case a certificate is needed for authentication and authorization).

dataID A data handle that identifies the concrete resource that the service is processing. This handle can be either a PID or, for transient data, a temporary data identifier.

The parameters can be of different kinds. Path parameters show up in the path of the web service URL; e.g. the dataID parameter, with the value 12345/00-6789-ABCDE-F, can be part of the path:

https://eudat.eu/gef/dataset/12345/00-6789-ABCDE-F?queryType=SRU_CQL&query=dc.title+2001

Query parameters are appended to the URL with a special syntax (e.g. the queryType and query parameters in the previous example). Form parameters are sent with the data payload in the request body and are not visible in the URL.

¹¹http://portal.p-grade.hu/
¹²Peter Kacsuk and Gergely Sipos, 2005. Multi-Grid, Multi-User Workflows in the P-GRADE Grid Portal, Journal of Grid Computing 3:3-4, http://link.springer.com/article/10.1007%2Fs10723-005-9012-6
¹³http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm
¹⁴http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
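The three parameter kinds can be illustrated with a small helper. This is a sketch only: the base URL follows the examples in the text, while the helper functions themselves are hypothetical, not part of the GEF:

```python
from urllib.parse import urlencode

# Sketch: composing a GEF request from path, query and form parameters.
# BASE follows the examples in the text; build_url/build_form_body are illustrative.

BASE = "https://eudat.eu/gef"

def build_url(service_path, path_param="", query=None):
    """Path parameters go into the URL path; query parameters after '?'."""
    url = f"{BASE}/{service_path}"
    if path_param:
        url += f"/{path_param}"          # e.g. a dataID such as 12345/00-6789-ABCDE-F
    if query:
        url += "?" + urlencode(query)    # percent-encoded key=value pairs
    return url

def build_form_body(form):
    """Form parameters travel in the request body, not in the URL."""
    return urlencode(form).encode("utf-8")

url = build_url("function/filter", query={"queryType": "SRU_CQL", "query": "dc.title 2001"})
body = build_form_body({"scriptType": "PigLatin"})
```

Note that `urlencode` percent-encodes reserved characters, which is why a dataID passed as a query parameter appears with `/` encoded as `%2F`, as in the extended example later in this section.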

2.3 List of Web Services

2.3.1 Basic Data Retrieval

The /gef/dataset/{dataID} service can be used to extract data out of the EUDAT CDI over HTTP, using the GET method; it may differ from the Lightweight Replication service (which is designed to transfer data in and out of EUDAT servers) by accepting both PIDs and temporary IDs for dataset identification. Also, it is not designed for ingesting data into EUDAT, but only for getting data out of the system. The Lightweight Replication service could be used as a backend for part of the implementation. This service should only be used for transferring relatively small amounts of data.

dataID : a string identifying the dataset (either a PID or another temporary identifier)

2.3.2 Execution Functions

The /gef/function/{funcID} service is the actual execution service, calling various data processing workflows on the data identified by dataID, a required query parameter. The funcID path parameter identifies the function to be called.

2.3.3 Filtering Function

The most important service of the GEF in this first phase is the filtering service (/gef/function/filter), which is used for filtering/subsetting the dataset identified by the required query parameter dataID, and should return a list of data handles.

The parameters are:

dataID : string
queryType : string
query : string

The filter web service depends on the requirements of the individual communities. Each queryType value directs the execution of the service to an internal community-specific subservice.
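A filter call can be sketched by preparing (without sending) the corresponding HTTP request. The endpoint follows the text; the parameter values are example placeholders:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Sketch: preparing a POST to the filter service. The request is only
# constructed here, not sent; the parameter values are illustrative.

params = {
    "dataID": "1234/00-1111-2222-3333",   # required: PID or temporary ID
    "queryType": "SRU_CQL",               # selects the community-specific subservice
    "query": "dc.title 2001",
}
url = "https://eudat.eu/gef/function/filter?" + urlencode(params)
req = Request(url, method="POST")

# A client would dispatch it with urllib.request.urlopen(req); the GEF
# would answer with a redirect or a job URI, as described later.
```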

2.3.4 Map-Reduce Function

The /gef/function/mapreduce service will offer map-reduce functionality using Pig Latin scripts and a Hadoop backend (see section 4.3).

dataID : string
scriptType : string; currently, the only allowed value is "PigLatin"
script : string
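Since a script can be long, sending it as a form parameter in the request body fits the parameter kinds described earlier. The following sketch composes such a body; the parameter names come from the list above, while the Pig Latin script itself is only a toy example:

```python
from urllib.parse import urlencode

# Sketch: composing the form-encoded body of a mapreduce call.
# The Pig Latin script is a toy placeholder, not a GEF-provided example.

pig_script = "A = LOAD 'input'; B = FILTER A BY $0 > 0; STORE B INTO 'output';"

body = urlencode({
    "dataID": "1234/00-1111-2222-3333",
    "scriptType": "PigLatin",          # currently the only allowed value
    "script": pig_script,
}).encode("utf-8")

# `body` would be sent as the payload of a POST to /gef/function/mapreduce.
```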

2.3.5 Workflow Management Service

The /gef/workflow service allows authorized users to enhance the functionality of the GEF by uploading and installing scientific workflows on site. The workflows can subsequently be called just like any other function of the execution service.

As an example, consider a scientific workflow that takes a data file as input, processes the input in some way and writes the result of this processing to an output port. The workflow can be POST-ed to the workflow management service, which would analyze it, test it for conformance, install it in the backend and return an identifier (workflowID) for the workflow and the function it performs. The workflow can subsequently be executed by invoking the execution service (a POST to /gef/function/{workflowID}) or removed by a DELETE to /gef/workflow/{workflowID}.

The workflow management service is expected to be used much less frequently than the execution and data-retrieval services. It is only available to some users (usually community representatives), as it can introduce security and stability issues. An important requirement is for the back end to have a flexible way of providing support for additional workflow systems or other software environments, the currently envisioned solution being the encapsulation of the necessary tools in virtual machines.
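The upload/invoke/remove lifecycle described above can be summarized as a sequence of (method, path) pairs. The workflowID value below is a placeholder standing in for whatever identifier the management service would return on upload:

```python
# Sketch: the workflow lifecycle as (HTTP method, path) pairs.
# "wf-42" is a hypothetical workflowID, used only for illustration.

def workflow_lifecycle(workflow_id):
    """Upload a workflow, invoke it as a function, then remove it."""
    return [
        ("POST",   "/gef/workflow"),                 # upload and install
        ("POST",   f"/gef/function/{workflow_id}"),  # execute as a function
        ("DELETE", f"/gef/workflow/{workflow_id}"),  # uninstall
    ]

steps = workflow_lifecycle("wf-42")
```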

2.4 Using the API

2.4.1 Conceptual Example

In broad lines, a researcher would make use of the EUDAT/GEF software stack as in the following use case:

1. The researcher first searches for data sets relevant to her problem or hypothesis. This involves going to the Metadata Catalogue (the web service which indexes all the EUDAT-ingested data), exploring the data sets and selecting a relevant few (with their PIDs).

2. The researcher then goes to the Service Catalogue (the human-readable list of GEF services), reviews the GEF functions available for the selected datasets at the different locations, and selects a set of relevant functions.

3. Once a set of relevant functions is identified, the researcher starts processing the data. This can be done either from the GEF web-based user interface, or by directly using the API from a client-side workflow system, a programming language environment or a command-line environment. During this step, the researcher iteratively produces new datasets by applying functions to existing datasets.

4. The final result(s) are downloaded to the local computer either via the basic data service or by using other available protocols, and/or can be stored in the EUDAT CDI and assigned a PID. In the latter case the researcher would probably also store the provenance data generated by the workflow systems.

2.4.2 Extended Example

A large climatic/linguistic dataset is stored in an EUDAT datacenter and has a public PID: 1234/00-1111-2222-3333. This is the PID of the collection; the dataset comprises multiple files. A user wants to process a subset of this data and get back the results. The processing consists of 3 major steps:

a) filter the data based on metadata (year and location for climatic data / year and language for linguistic data)

b) run a simulation and predict a future statistical variable based on the filtered data / create a collocation table and select the values for a set of words

c) create a visualization of the resulting data (color-coded map / frequency chart)

The services used in steps b) and c) are not currently included in the GEF; they are examples of extending the framework's functionality based on the needs of the user communities.

Step 0.

The user makes the first request for filtering to the central GEF endpoint:

POST https://eudat.eu/gef/function/filter?dataPID=1234/00-1111-2222-3333
     &queryType=SRU_CQL&query=dc.title+2001+sortBy+dc.date/sort.descending

The endpoint responds with HTTP 307 (Temporary Redirect) to the following local GEF endpoint:

Location: https://specific.datacenter.eu/eudat/gef/function/filter
          ?dataPID=1234/00-1111-2222-3333&queryType=SRU_CQL
          &query=dc.title+2001+sortBy+dc.date/sort.descending

Step 1.

The user reissues the request to the local GEF endpoint:

POST https://specific.datacenter.eu/eudat/gef/function/filter
     ?dataPID=1234/00-1111-2222-3333&queryType=SRU_CQL
     &query=dc.title+2001+sortBy+dc.date/sort.descending

The GEF responds with HTTP 202 (Accepted):

Location: https://specific.datacenter.eu/eudat/gef/jobs/job1

The user busy-loops waiting for the job to end:

GET https://specific.datacenter.eu/eudat/gef/jobs/job1
response: HTTP 204 (No Content) { job_status: 'running' }

Eventually the user tries again and succeeds:

GET https://specific.datacenter.eu/eudat/gef/jobs/job1
response: HTTP 200 (OK) (job done):
{
  size: '192GB',
  dataId: 'jobs/job1/result',
  url: 'https://specific.datacenter.eu/eudat/gef/jobs/job1/result',
  irodsUrl: 'irods://specific.datacenter.eu/vzDATA/eudat/gef/jobs/job1/result'
}

Step 2.

The user requests a fixed functionality of the GEF, using the previous result as input (forward

slashes being encoded as %2F):

POST https://specific.datacenter.eu/eudat/gef/function/func

?dataID=jobs%2Fjob1%2Fresult &...

response: HTTP 202 (Accepted):

Location: https://specific.datacenter.eu/eudat/gef/jobs/job2

User requests the result:

GET https://specific.datacenter.eu/eudat/gef/jobs/job2

response: HTTP 200 (OK) (job done): {

size: ’20MB’,

dataId: ’jobs/job2/result’,

url: ’https://specific.datacenter.eu/eudat/gef/jobs/job2/result’,

irodsUrl: ’irods://specific.datacenter.eu/vzDATA/eudat/gef/jobs/job2/result’

}

Step 3.

In the final visualization step, the same pattern is used:

Copyright c© The EUDAT Consortium PUBLIC 13/27

Page 14: Deliverable D7.5 - CerfacsEuropean Data Grant agreement number: RI-283304 Deliverable D7.5.1 Technology adaptation and development framework in EUDAT WP7.3 Authors Emanuel Dima, Christian

EUDAT – 283304 D7.5

POST https://specific.datacenter.eu/eudat/gef/function/visualize

?dataID=jobs%2Fjob2%2Fresult &...

server response: HTTP 202 (Accepted):

Location: https://specific.datacenter.eu/eudat/gef/jobs/job3

User requests the result:

GET https://specific.datacenter.eu/eudat/gef/jobs/job3

answered with HTTP 200 (OK) (job done): {

size: ’1MB’,

dataId: ’jobs/job3/result’,

url: ’https://specific.datacenter.eu/eudat/gef/jobs/job3/result’,

irodsUrl: ’irods://specific.datacenter.eu/vzDATA/eudat/gef/jobs/job3/result’

}

Step 4. The user chooses to download the end result over HTTP:

GET https://specific.datacenter.eu/eudat/gef/jobs/job3/result

result: image.png

and also downloads the result of step 3 using iRODS icommands:

$ iinit #specific.datacenter.eu, port 1247, user, pass

$ iget /vzDATA/eudat/gef/jobs/job3/result ./function.dat
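The submit-then-poll pattern of steps 0–4 can be sketched in client code. The following is an illustrative Python sketch, not part of the GEF sources: the polling logic is separated from the HTTP transport so the pattern itself is visible; the 204/200 response convention follows the example above, and `fetch_job_over_http` is a hypothetical helper.

```python
import json
import time
import urllib.request

def poll_job(fetch_status, interval=5.0, max_tries=100):
    """Poll a GEF job resource until it reports completion.

    `fetch_status` is any callable returning (http_status, body). Per the
    convention above, the GEF answers 204 while the job runs and 200 with
    a JSON body describing the result when the job is done.
    """
    for _ in range(max_tries):
        status, body = fetch_status()
        if status == 200:      # job done: the body describes the result
            return json.loads(body)
        if status != 204:      # anything else is treated as an error
            raise RuntimeError(f"unexpected HTTP status {status}")
        time.sleep(interval)   # job still running: wait and retry
    raise TimeoutError("job did not finish within the allotted tries")

def fetch_job_over_http(job_url):
    """Build a real fetcher for a GEF job URL (assumes the API above)."""
    def fetch():
        with urllib.request.urlopen(job_url) as resp:
            return resp.status, resp.read().decode("utf-8")
    return fetch
```

In practice a client would call `poll_job(fetch_job_over_http("https://specific.datacenter.eu/eudat/gef/jobs/job1"))` and then download the `url` field of the returned description.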

2.5 Job Control and Garbage Collection

Each job started as a result of a GEF request is referenced by the URI returned to the user (e.g. https://specific.datacenter.eu/eudat/gef/jobs/job1). A GET request to this resource returns the state of the job (scheduled, running, done, failed). A DELETE request forcefully terminates the process and removes the URI resource. Requesting the state of a job that has been deleted will necessarily return an HTTP 404 (Not Found) error.

When a job ends normally, the user collects the result and has the option to DELETE the job. If the user does not DELETE the job, a garbage collection mechanism removes the job results and the URI resource after a sufficiently large grace period. The garbage collection mechanism has no information on whether the user retrieved the data or not, and its implementation can be as simple as a cron job running every hour that removes all jobs older than a number of days.
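The selection step of such a collector can be sketched as a small pure function. This is an illustration only; the job-record layout (a mapping from job id to end timestamp) and the grace period value are assumptions, not part of the GEF specification.

```python
import time

GRACE_PERIOD = 7 * 24 * 3600  # assumed grace period: seven days, in seconds

def jobs_to_collect(jobs, now=None, grace=GRACE_PERIOD):
    """Return the ids of finished jobs older than the grace period.

    `jobs` maps a job id to its end timestamp (seconds since the epoch).
    Mirroring the text above, the collector needs no knowledge of whether
    the results were ever retrieved; age alone decides.
    """
    now = time.time() if now is None else now
    return [job_id for job_id, ended in jobs.items() if now - ended > grace]
```

A cron-driven script would call this hourly and issue a DELETE for each returned job id.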

The GEF should keep and manage as little state as possible. It should only expose the state of the processes it starts, and should not cache this state but retrieve it from the appropriate backend. The job state reported back to the user will be one of WAITING, RUNNING or DONE. Each state can have substates, activated whenever the corresponding backend provides more information on the jobs.

2.6 Differences to Description of Work

The DoW stresses that the GEF should support data streaming instead of in-memory or file-based processing. Streaming implies the availability, in the GEF backends, of software components designed for streaming data. A simple example of such a workflow would be

input → processing node 1 → processing node 2 → output

All steps of the processing should work in parallel, like a bash pipeline. The simplicity of the GEF makes this scenario possible, given an appropriate backend. A suitable project for this case is Storm15, a free and open-source framework for processing massive streams of data (already used by Twitter).
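The pipeline semantics described above can be illustrated with Python generators, where each stage pulls records lazily from the previous one, so a record flows through all stages before the next is read, much like a bash pipeline. The stage names and the normalisation/filtering operations are invented for illustration; a real backend such as Storm would run the stages on separate cluster nodes.

```python
def source(lines):
    """Input stage: yields raw records one at a time."""
    yield from lines

def node1(records):
    """First processing node: normalises each record (illustrative)."""
    for r in records:
        yield r.strip().lower()

def node2(records):
    """Second processing node: drops empty records (illustrative)."""
    for r in records:
        if r:
            yield r

def pipeline(lines):
    """input -> processing node 1 -> processing node 2 -> output."""
    return list(node2(node1(source(lines))))
```

Because the stages are generators, no intermediate file or full in-memory copy of the data is ever materialised, which is the point of the streaming requirement.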

The DoW also specifies that the GEF should translate any workflow into a common format and execute it. The current prototype implementation can only accept some types of workflows and executes them using their native engines, thus ensuring 100% compatibility. The existing work on workflow translation can be used to provide a common enactment engine for all the workflow systems.

3 User Interface

Meeting the needs of all users of a software system is a challenging process, as one can conclude from the diversity of existing user interface technologies and design choices. For example, in the climate community, the “ESGF based ENES data infrastructure provides a rich set of different data access methods to meet the different user demands”16. There are too many interfaces available for downloading ESGF17 data to list them here, as a first glance at the diversity documented in the data how-to proves.

An attempt to categorize users could, for example, look like the following:

The expert user wants an efficient interface, no matter how complicated it is. This user repeats very similar tasks many times; if the interface is not optimized, using it is too tedious to allow concentrating on the relevant scientific tasks. A web interface is not a good solution here: clicking through options takes too much time compared to a command line interface, which allows varying only one aspect of the request and then executing the variation.

The novice or occasional user needs a rich interface, but not a complicated one with a lot of possibilities, which could be overwhelming.

Anybody else: people generally interested in the topic, or people following a link in a newspaper article. They need an interface to the GEF with an extremely restricted selection of workflows and options.

15 http://storm-project.net
16 Citation from https://verc.enes.org/help/how-to-./data-access, last access November 21, 2013.
17 ESGF means Earth System Grid Federation and is an international collaboration for a data infrastructure in climate science.

With this in mind, we need to design a generic interface API in the GEF to serve such different interfaces. A typical minimal set for a community is a web interface (simple) and a scripting interface (complex). Another typical set of two interfaces is a graphical user interface (self-explanatory) and a command line interface (not self-explanatory, but faster to use after some tedious repetition).

These can be programmed separately, depending on community needs. Our first choice in EUDAT is a web interface.

4 Implementation

The current prototype implementation is organized as a system integrating the front-end

web service (which implements the GEF API) and the backend, which currently depends on

an iRODS environment and contains:

• The iRODS command trigger

• The command executor

• The workflow system

The prototype currently implements the basic data retrieval service and the workflow management service. It also partially implements the execution service (but without the special cases of filtering and without map-reduce functionality), with support for Taverna workflows.

The sources are available on the EUDAT SVN18.

4.1 The Web service

The web service is the software component directly implementing the GEF API. It is a Java

servlet using Jersey19 as a REST framework.

The web service receives the HTTP requests from the users and acts accordingly. In the case of a data transfer request, it mediates between the user and the iRODS server (via the Java Jargon20 library). When the user requests data with a PID, the web service also interrogates the Handle system server for the actual URL of the data.

In the case of a workflow execution request, the web service does the following:

1. translates PIDs to iRODS URLs

2. aggregates the request parameters in a local file

3. transfers the parameters file to iRODS

4. returns a token to the user, identifying the new job in progress
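The four steps above can be sketched as a single handler function. This is a hedged illustration, not the actual Java servlet code: `resolve_pid` stands in for the Handle system lookup, `deposit` for the Jargon-based transfer into iRODS, and the JSON parameters-file layout and `jobs/<id>` token shape are assumptions made for the sketch.

```python
import json
import tempfile
import uuid
from pathlib import Path

def handle_workflow_request(params, resolve_pid, deposit):
    """Sketch of the web service's handling of a workflow execution request."""
    # 1. translate PIDs to iRODS URLs (keys ending in "PID" are assumed
    #    to carry handles; this convention is invented for the sketch)
    resolved = {k: (resolve_pid(v) if k.endswith("PID") else v)
                for k, v in params.items()}
    # 2. aggregate the request parameters in a local file
    local = Path(tempfile.mkdtemp()) / "params.json"
    local.write_text(json.dumps(resolved))
    # 3. transfer the parameters file to iRODS
    deposit(local)
    # 4. return a token to the user, identifying the new job in progress
    return f"jobs/{uuid.uuid4().hex}"
```

The ingestion of the deposited file is what later triggers the backend rule described in Section 4.2.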

18 https://svn.eudat.eu/EUDAT/Services/WorkflowEngine/gef
19 https://jersey.java.net
20 https://www.irods.org/index.php/Jargon

4.2 The iRODS-Based Backend

The iRODS middleware is a core technology in EUDAT, allowing for data interchange via

federation and secure levels of access, facilitating distributed replicas of the data objects

and easing administration through the rule system and customizable data management

primitives.

Workflow Structured Objects (WSOs) are the iRODS facility designed to provide basic support for data management workflows. A workflow is effectively defined in iRODS as any sequence of operations, allowing for the possibility of cycles21. An iRODS WSO (which is actually a type of script) defines the operations and the mitigating procedures in case of errors. For workflows, the system offers various customization options, including automatic data staging in and out of the execution environment. Reading a special file triggers the execution of the workflow and provides its success state. As a side effect, a collection of files is created, containing staged data and workflow results that can subsequently be read or discarded.

Unfortunately, we were not able to use WSOs for integrating workflow support in iRODS. The WSO is limited in the current implementation due to the following factors:

1. The WSO is a new feature in iRODS and potentially unstable; it only started working correctly in the latest iRODS version to date (v.3.3). It is also insufficiently documented.

2. The workflow objects are difficult to access via the existing APIs. In particular, the Jargon library, which interfaces iRODS to the Java Virtual Machine, does not have (at the time of this writing) full support for WSOs.

3. A single stage area is available for a single workflow object. Multiple workflows running in parallel can potentially encounter data races.

Therefore we chose a different solution based on custom iRODS rules. The parameters sent

by the user are collected by the web service and deposited in a file in the iRODS system. The

ingestion of this file triggers the activation of a custom rule which starts the GEF command

executor; the executor reads the parameters from the file, prepares the required data and

runs whatever command line script or workflow system is needed. This general pattern is

also used for managing PIDs during Safe Replication.
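The executor side of this pattern can be sketched as follows. The parameters-file format (a JSON object naming a command and its arguments) is an assumption made for illustration; the real format is internal to the prototype, and the triggering iRODS rule itself is not shown.

```python
import json
import subprocess

def run_from_params_file(path):
    """Sketch of the GEF command executor started by the custom iRODS rule.

    Reads the parameters deposited by the web service, then runs whatever
    command line script or workflow system the request names, returning
    the exit status of that process.
    """
    with open(path) as f:
        params = json.load(f)  # assumed layout: {"command": ..., "args": [...]}
    completed = subprocess.run([params["command"], *params.get("args", [])])
    return completed.returncode
```

In the prototype the equivalent step would also stage the required data before launching the process; that preparation is omitted here for brevity.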

A schematic view of the execution process is shown in the provided sequence diagram (see Figure 2); an architectural view of the system is also provided (see Figure 3).

4.3 Map-Reduce: Hadoop, Pig and PigLatin

Map-Reduce is a programming model for processing large datasets on computer clusters. A computation expressed in this model consists of a “map” step, during which the large computation is split into smaller computations, and a “reduce” step, during which the results of the smaller computations are coalesced into a final result.
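The two steps can be made concrete with a minimal single-machine word-count sketch; on a real cluster, Hadoop would distribute the map calls over many nodes. The chunking into strings is an assumption made to keep the example self-contained.

```python
from collections import Counter
from functools import reduce

def map_step(chunk):
    """Map: turn one chunk of the input into a partial word count."""
    return Counter(chunk.split())

def reduce_step(partials):
    """Reduce: coalesce the partial counts into a final result."""
    return reduce(lambda a, b: a + b, partials, Counter())

def word_count(chunks):
    """The full computation: map over all chunks, then reduce."""
    return reduce_step(map_step(c) for c in chunks)
```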

Apache Hadoop22 is an entire software ecosystem built around the map-reduce paradigm.

Part of this ecosystem, the Hadoop software library is an open-source software framework

21 https://www.irods.org/index.php/Introduction_to_Workflow_as_Objects
22 http://hadoop.apache.org

Figure 2: GEF service sequence diagram

Figure 3: GEF implementation architecture

that manages map-reduce jobs. A related project, Apache Pig23, is a higher level platform

for data analysis that relies on Hadoop as a backend. Pig has its own analysis language, Pig

Latin, which is a query algebra for expressing data transformation.

The GEF map-reduce execution module will be backed by Hadoop clusters and will use the iRODS-Hadoop integration project developed in EUDAT work package 7.2.

5 Testing Use Cases

In this section we describe the testing use cases of our two communities, the linguistic and

the climate community.

5.1 ENES Use Case

The ENES use case consists of downloading the data and then applying a scientific workflow to it; the download alone is already a challenging and complicated part of the workflow, involving more than a simple network transfer of the data.

5.1.1 Data Download

In the ENES community, data is distributed through a federation of worldwide data servers,

with a few main gateways and several data nodes. The federation is called the Earth System

Grid Federation (ESGF).

23 http://pig.apache.org

Accessing the data is possible via web interfaces, via scripting interfaces, e.g. in Python24, or via scripts called from the command line. This diversity can confuse: which is the right method for downloading the data? To the authors’ knowledge, there is no summary text that describes all access methods. When asked, the support staff of DKRZ, one of the three access sites for the ENES data, point to this page25, adding that it is also possible to get the data from CERA26 in case it is replicated there.

To check how well the current federation infrastructure works, we have interviewed members of the community. To illustrate how unnecessarily difficult the data download is, here is – to express it in Scrum terms27 – one user story of a PhD student of the Max-Planck-Fellowship Program at the Institute for Meteorology in Hamburg. She had a task in mind (let us call it her scientific workflow) and knew which data she needed (data download workflow). In the first attempt, using one of the web interfaces for the download did not work, and using one of the download scripts failed for lack of knowledge of how the experiment names are encoded (internal knowledge of file management). After finding the right web interface and the right credentials, the download started but was interrupted because a quota was exceeded. The PhD student ended up solving the download task by asking her supervisor. The supervisor downloaded the data for her instead of telling her how to do it (we may assume that explaining how to succeed in downloading would also have been difficult). Thus, in the end, the download part of the workflow was accomplished via “social engineering”. A lot of knowledge played a role: knowing the right web portals, the replication sites for faster download, a machine with fast network access to download to, and directories with sufficient permissions and quota, to mention only a part of the required knowledge. Choosing a PhD student for this user story was done on purpose: an experienced user cannot show all the difficulties, because this type of user no longer notices them. This user story illustrates how important even a mere download workflow is for the climate community.

Another example of the required technical “download knowledge” is given on the CERA page: “Jblob is a command-line based program for downloading data from the CERA database. Please note, this program does not replace the graphical user interface. It is mostly useful for people who know which data to download and for batch downloads.”28 When it comes to subsetting the data, to avoid transferring data which is not needed, there is so far not even an interface to use.

The ESGF implements the required standards for distributed data, such as the Data Reference Syntax (DRS)29, which specifies how the data files must be structured, as well as the required metadata described by a Common Information Model (CIM)30 and Controlled Vocabulary (CV). This ensures uniformity among the data centers and the data sets.

Currently there is capability embedded in the ESGF software stack that enables the ex-

traction of spatial and temporal data subsets through the data query. However, the ENES

24 https://github.com/stephenpascoe/esgf-pyclient
25 https://verc.enes.org/help/how-to-./data-access
26 http://cera-www.dkrz.de
27 In Scrum, so-called user stories are used to guide implementation.
28 Emphasis added.
29 Taylor, K. E., Balaji, V., Hankin, S., Juckes, M., Lawrence, B., and Pascoe, S. (2010). CMIP5 Data Reference Syntax (DRS) and Controlled Vocabularies.
30 Guilyardi, E., Balaji, V., Callaghan, S., DeLuca, C., Devine, G., Denvil, S., Valcke, S. et al. (2011). The CMIP5 model and simulation documentation: a new standard for climate modelling metadata. CLIVAR Exchanges, 16(2), 42-46.

community has several data processing tools which can be used to apply complex data processing to data subsets. But, as said before, these tools can only be used after the data has been downloaded, on the user’s own computer systems.

The data volumes used by the community’s users are currently increasing rapidly. This happens not only because there are more users, but also because of the increase in the data volumes generated by the community, due for example to increased spatial resolution, ensembles of simulations, and a larger number of experiments developed to enable the community to answer more scientific questions.

This raises the question of whether there is a need for reserving bandwidth. Reserving bandwidth on a normal internet connection is not possible with current network technologies and requires network research. A possible approach is a software-defined network architecture. This would mean that the user receives information from the GEF on how long the download takes, depending on when the user starts it. The scientist could then decide whether, and if so when, to start the download. This is investigated in EUDAT in task WP7.2.

The scientist can then concentrate on the semantics: what data to use for which computation to answer the scientific question. This means that the time for the data subsetting must also be estimated. Whether the data just needs to be transferred, or subsetted prior to the transfer, is handled under the hood by the GEF.

The metadata task force is implementing the search functionality, which returns a handle (PID) for every request. This PID is subsequently used by the GEF. Currently, in ESGF, the user has to search manually31. However, a search interface returning a PID is a long-term aim; in the near future the request will return a URL, a DOI or a PID.

5.1.2 Data Subsetting

Currently, subsetting the data is done after downloading it. This results in unnecessary data transfer and thus bandwidth usage. The subsetting in our workflow should instead be done directly at the data centers, prior to the data transfer.

This approach is realistic since the subsetting is done via the cdo command suite32, which is portable and thus easy to install at all the heterogeneous data centers. To accelerate a cdo run, it would be possible to distribute it over a compute cluster via a Pig call (see Section 2.3.4), if cdo were made map-reduce capable. The current version of cdo does support OpenMP; the map-reduce paradigm is not supported yet.
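A server-side subsetting step of this kind could be assembled as a cdo invocation built from the user’s request. The sketch below only builds the command line (using the standard cdo operators `sellonlatbox` and `seldate`); the file names and the exact operator chain are illustrative assumptions, not part of the GEF.

```python
def cdo_subset_command(infile, outfile, lon1, lon2, lat1, lat2,
                       date1=None, date2=None):
    """Build a cdo command that subsets a NetCDF file near the data.

    Spatial subsetting uses sellonlatbox; an optional temporal subset is
    chained in front with seldate, as cdo allows operator chaining.
    """
    ops = [f"-sellonlatbox,{lon1},{lon2},{lat1},{lat2}"]
    if date1 and date2:
        ops.insert(0, f"-seldate,{date1},{date2}")
    return ["cdo", *ops, infile, outfile]
```

The resulting list could then be handed to the command executor described in Section 4.2, so that only the subset ever leaves the data center.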

5.1.3 Scientific Computing

For some research questions it makes sense to provide standard workflows where the user can choose custom parameters.33 For more complex questions it is necessary that the scientist can design the part of the workflow that follows the data retrieval.

For designing workflows it is possible to use cross-community tools which provide a GUI. Examples are Kepler, Taverna and VisTrails. These three examples were all investigated

31 Some search capabilities are http://esgf.org/wiki/ESGF_Search_API, http://esgf.org/wiki/ESGF_Search_REST_API and https://github.com/stephenpascoe/esgf-pyclient.
32 https://code.zmaw.de/projects/cdo
33 Example workflows can be found at https://verc.enes.org/computing/workflows.

in work package 7.3. Developments from the specific community, in this case the climate community, should also be taken into account. One example is a domain specific language close to Python which can encode workflows. But no matter which way we choose to define the workflow, it will call the GEF. The GEF API is, as said before, the “fixed point”.

5.2 GEF implementation in a data node at CINES

5.2.1 Background

CINES is located in Montpellier (France) and is part of the EUDAT Consortium. It offers computer services to the scientific community in public research and higher education. CINES is one of the Tier-1 computer operators and sites of national relevance selected by GENCI, which is in charge of funding large HPC infrastructures for French public research. CINES is also involved in PRACE, PRACE 1IP, PRACE 2IP, HPC Europa 2, and in the initiative, supported by the European Commission, to put in place an integrated framework for auditing and certifying digital repositories.

With this expertise, a collaboration has taken place between ENES (CERFACS) and CINES to install a first demo of the (draft) GEF in the ESGF infrastructure. The demo aims to present an operational prototype for a data workflow use case based on ENES requirements. Due to a tight schedule, the prototype uses some simplified solutions which would need to be reviewed should we want to deliver a production system.

5.2.2 Scenario

The scenario is based on use case 9 (UC9), defined as part of the WP7.3 initial tests on the ENES workflows (see EUDAT MS23 Data Exploration Technology Experiments and Benchmarking, section 3.2).

The objective of this use case was described as:

Generating data to support a Surface Temperature / Total Precipitation anomaly graph

over the largest possible number of scenarios:

• 30-year average 2050-2079 of rcp85 compared to

• 30-year average 1970-1999 of historical

(global, over France only and also over Europe only)

For the demo, the sequence is:

• The user enters the geographical coordinate box and two date ranges on a web form, then launches the job.

• The job kicks off in batch mode: based on the entered parameters, relevant files are selected and a set of cdo commands calculates spatial and temporal averages for each model. This creates a set of result files.

• The user can check the status of the job (ongoing, finished) from a web page.

• Once finished, the user can either display the results on a map or download them.

• The user can also decide to store the result in the EUDAT node, choosing either the basic safe replication storage or a Data Seal of Approval compliant storage.

High level diagram

The diagram below is a high level representation of the scenario:

[Figure: web pages (entry form, job control page, result display on a map, result download, result store or archive) make REST calls to the Generic EUDAT Framework (API), which launches a batch job in the YODA environment; the job copies the required input files (CMIP5 / netCDF) from standard storage (which could be an ESGF node), performs the data calculation using CDO, and copies the result files (NetCDF) back; results can then be stored in EUDAT basic storage (files with PID) or archived in EUDAT DSA storage (AIP).]

Figure 4: GEF demo ENES implementation at CINES

One of the constraints is to use the GEF (Generic Execution Framework) as it is defined in its draft description. Even though this is an early-stage definition, the demo implements it as closely as possible. This is what drives the usage of the REST interface between client and server.

Sequence

‘Launch a new job’ page

For the demo, we assume only one job will be launched at a time; no control mechanism is implemented.

Click on ‘Launch’ button

This click makes a REST call to the server:

https://server.cines.fr/eudat/gef/filter?queryType=UC9&lat1=xxx

&long1=xxx&lat2=xxx&long2=xxx&year1=yyyy&year2=yyyy&year3=yyyy&year4=yyyy

(where xxx or yyyy are the entered parameters)

Receiving this, the server launches a job, passing it the parameters.

Once the job is launched, the server responds with HTTP 202 (Accepted):

https://server.cines.fr/eudat/gef/filter/job-id

(to simplify, the job-id will always be eudat-cerfacs-demo)

‘Job tracking’ page

Click on ‘Track’ button

This click makes a REST call to the server:

https://server.cines.fr/eudat/gef/filter/eudat-cerfacs-demo

The server checks the status of the job and returns:

If the job is running: HTTP 204 (No Content) { job status: running }

If the job is finished: HTTP 200 (OK) (job done): {

size: ’192GB’, // Size of the result file

url: ’https://server.cines.fr/eudat/gef/filter/eudat-cerfacs-demo/result’,

// URI for the result files

}

The fields Job-id, Status and HTTP result are populated according to the answer. If the status is ‘Finished’, the three other buttons are activated.

Click on ‘Display result’ button

This opens the Display result page if the job is finished.

Click on ‘Download result’ button

This downloads the result files using the standard browser download.

Click on ‘Store in EUDAT’ button

This triggers a script which copies the result set of files into the iRODS EUDAT space. This script runs in batch mode. Then it opens the CINES ISAAC web application on the SIP page.

Conclusion of the ENES Use Case

The demo of this ENES use case has been presented live both at the EUDAT Workshop Days (25-26 September 2013, Barcelona) and at the EUDAT 2nd Conference (28-30 October 2013, Rome). Useful feedback has been received from experts to better design the API. It has also been demonstrated that the GEF draft API can be installed on a server near the data storage to perform useful data reduction. This kind of use case can help in designing a proper GEF API useful for the ENES community and other scientific communities.

5.3 CLARIN Use Cases

5.3.1 Metadata Query Service

In the CLARIN community, data is usually accompanied by exhaustive metadata. This metadata is stored as CMDI34 files together with the object data. A filtering mechanism can utilize this metadata: for example, a query on the publication date of the objects in scope can identify all objects which were published in the 19th century.

Such a filtering mechanism on top of metadata can be implemented using the GEF infrastructure as a workflow taking a filtering expression as input (e.g. an XPath expression, depending on the form of the metadata). The output is then a list of resources which match the filter criteria (see Figure 5).
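The core of such a filter can be sketched with the XPath subset of Python’s standard library. This is a simplified illustration only: real CMDI records carry namespaces and a richer schema, and the element names and the 19th-century date predicate below are invented for the example.

```python
import xml.etree.ElementTree as ET

def filter_by_metadata(cmdi_documents, xpath, predicate):
    """Filter resources by an XPath query over their metadata records.

    `cmdi_documents` maps a resource id to its metadata XML (namespaces
    omitted for brevity). A resource matches when any node selected by
    `xpath` satisfies `predicate`; the matching ids are returned.
    """
    matches = []
    for resource_id, xml_text in cmdi_documents.items():
        root = ET.fromstring(xml_text)
        if any(predicate(node.text) for node in root.findall(xpath)):
            matches.append(resource_id)
    return matches
```

A GEF workflow would wrap this behind the REST service described below, with the list of matching resources as its output.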

Figure 5: Query service: metadata based filtering

For interfacing with the rest of the CLARIN infrastructure, the filtering mechanism can be wrapped in a REST-style web service. Subsequently, the filtering web service can be integrated in other workflows, where the subsequent (web) services can act on the list of resources produced by the filter web service.

A full data querying mechanism, covering not only the metadata, is also possible. The implementation of this service would use various streaming libraries for XML with support for (a subset of) XPath (e.g. Nux35, Joost36).

34 http://www.clarin.eu/node/3219
35 http://acs.lbl.gov/software/nux/
36 http://joost.sourceforge.net/

5.3.2 Google Books Ngram

The Google Books Ngram dataset37 is a diachronic collection of n-grams classified by language and sorted by the number of occurrences. Google provides a simple viewer of the data, but for more advanced queries this functionality is insufficient; the users must download and process the dataset locally. The size of the dataset makes this operation prohibitively expensive for most linguistic researchers. The solution is therefore to place the dataset in a data center and use the GEF with custom workflows as a filtering mechanism.

The dataset is freely available for download and licensed under a Creative Commons Attribution 3.0 Unported License. It consists of a collection of tabular text files with a compressed size of 2.2TB, approximately 10TB uncompressed.

6 Conclusion, notes, discussion

This document presents a theoretical design view of the GEF, backed by an in-progress prototype implementation. The goal of the GEF is to offer a cross-community interface API able to provide data reduction (through data subsetting and variable combination) to deal with today’s data volumes. This means that it must be generic enough to access data in a heterogeneous landscape of data centres and federations, with several data typologies and types, hosted by several disjoint communities, yet useful and transparent enough to be adopted by most EUDAT communities.

There already exists a plethora of workflow engines used in the scientific world which can be seen as generic. Many communities have adopted one or several workflow engines, such as Kepler and VisTrails, but the implementation of the processing within these workflow engines is very dependent on each community. Parts of these established workflow engines could be refactored to interface with EUDAT services using the EUDAT GEF, which would then provide a cross-community execution engine supporting these differences. Given that, it has to be stressed that the GEF will be highly dependent on the outcomes of the Metadata TF and the AAI TF. There are also some dependencies on Semantic Annotation, Data Staging and Data Replication.

The current implementations of the GEF in CLARIN and ENES are still at an alpha stage, but within the next months they will be enhanced and brought much closer to the current GEF description, which will itself also evolve. Following the Workflows Track at the EUDAT Workshop Days (25-26 September 2013, Barcelona), discussions with workflow experts identified four recommendations (Table 1), which will need to be explored, discussed further, and taken into account in further development. EUDAT must ensure that the GEF can cope efficiently with current and foreseen large data volumes in federated data environments, and that it is appealing enough to communities that they see large advantages in using it.

37 http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

26/27 PUBLIC Copyright © The EUDAT Consortium


Table 1: Four EUDAT Workshop Days Workflows Track Recommendations

Action 1: Provide EUDAT Service APIs for use within Workflows (Priority: High)
Short description: Projects like EUDAT should not 'create new complete WF tools', but instead provide service APIs for workflows. This enables researchers to seamlessly take advantage of current and new EUDAT services such as data staging, data transfer, data replication, simple store, and PID assignment.
Next steps: Some EUDAT services already offer APIs; create a document that provides an overview of how EUDAT service APIs can be used.

Action 2: Explore solutions for EUDAT Workflow Provenance Service(s) (Priority: High)
Short description: There is an increasing variety of WF systems, and many communities have already chosen their solutions but might re-use components of others. EUDAT could offer a service that enables 'workflow component sharing': a repository/registry where workflow components are stored together with provenance information. Such information includes, but is not limited to, PID assignments for workflow components (including concrete software elements), information about concrete execution runs, and sample data that enables other researchers to better understand the shared workflow components.
Next steps: PPNL has done some work on sharing components and describing workflows independently of concrete implementations; this work needs to be surveyed and could be a baseline for a potential new EUDAT service.

Action 3: Provide higher-level Analysis & Analytics Workflow Components & Service APIs (Priority: Medium)
Short description: The presentations across all fields have shown that statistical computing, data mining, and machine learning algorithms (e.g. classification, clustering, or regression techniques) are used in some parts of the workflows. A potential set of 'higher-level data analysis/analytics services' could be hosted by EUDAT close to the researchers' data. This includes the provision of service APIs for seamless integration into (existing) analysis workflows and their 'application enabling process'.
Next steps: Statistical computing (e.g. R) and machine learning (e.g. Apache Mahout) software already exists; provide an overview of which of these packages could conveniently be hosted by EUDAT and which service APIs could be provided.

Action 4: Investigate solutions for data workflow recommender services (Priority: Medium)
Short description: Data formats are set by user communities, and the limited amount of standardization is having an impact. EUDAT could investigate the possibility of recommender services that provide advice on suitable workflows in context, depending on data formats, scalability, portability, etc. This might include benchmarks of workflows in context and access to (captured) best practices in the community.
Next steps: Some data formats, such as HDF5 or NetCDF, are used especially widely across communities; survey the use of common data formats in the communities.
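As a concrete illustration of the 'workflow component sharing' recommendation in Table 1, the sketch below assembles one possible component record combining a PID, the concrete software element, execution-run information, and a pointer to sample data. The schema, PID, and all field names are illustrative assumptions made for this sketch, not a defined EUDAT format.

```python
import json

# Illustrative provenance record for a shared workflow component.
# Every identifier below is a made-up placeholder, not a real PID.
component_record = {
    "pid": "11858/00-0000-0000-0000-0000-2",  # hypothetical component PID
    "name": "ngram-frequency-filter",
    "software": {"package": "gef-ngram-filter", "version": "0.1"},
    "runs": [
        # One concrete execution of the component, for reproducibility.
        {"started": "2013-09-25T10:00:00Z",
         "input_pid": "11858/00-0000-0000-0000-0000-3",
         "status": "ok"}
    ],
    # Sample data helps other researchers understand the component.
    "sample_data": "11858/00-0000-0000-0000-0000-4",
}

serialized = json.dumps(component_record, indent=2)
```

Storing such records in a registry would let a researcher discover a component, inspect how it was actually run, and try it on the sample data before adopting it.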
