GRDI2020 – A Coordination Action:
Towards a 10-year vision for global research data infrastructures
Funded under the Seventh Framework Programme (FP7) - Infrastructures,
“Capacities – Research Infrastructures” - Project Number: 246682
GRDI2020 Final Roadmap Report
Global Research Data Infrastructures: The Big Data Challenges
This document has been generated by the GRDI2020 Consortium and its principal author is CNR-
ISTI. The GRDI2020 roadmap is an official GRDI2020 project deliverable submitted to the European
Commission in February 2012.
DISCLAIMER
GRDI2020 is funded by the European Commission under the 7th Framework
Programme (FP7).
The goal of the GRDI2020 project, Towards a 10-year vision for global research data
infrastructures, is to establish a framework for obtaining technological,
organisational, and policy recommendations guiding the development of ecosystems
of global research data infrastructures. This framework will be established by mobilising
user communities, large initiatives, projects, leading experts, and policy makers
throughout the world and involving them in GRDI2020 activities.
This document contains information on core activities, findings, and outcomes of
GRDI2020. It also contains contributions from the distinguished experts in its two
external groups: the Advisory Board (AB) and the Technological and
Organisational Working Groups. Any reference to content in this document should
clearly indicate the authors, source, organisation, and date of publication.
The document has been produced with the funding of the European Commission. The content of this
publication is the sole responsibility of the GRDI2020 Consortium and its experts, and it cannot be
considered to reflect the views of the European Commission.
The European Union (EU) was established in accordance with the Treaty on the European Union
(Maastricht). There are currently 27 member states of the European Union. It is based on the European
Communities and the member states’ cooperation in the fields of Common Foreign and Security Policy and
Justice and Home Affairs. The five main institutions of the European Union are the European Parliament,
the Council of Ministers, the European Commission, the Court of Justice, and the Court of Auditors
(http://europa.eu.int/).
Copyright © The GRDI2020 Consortium. 2010.
See http://www.grdi2020.eu/StaticPage/About.aspx for details on the copyright holders.
GRDI2020 (“Towards a 10-Year Vision for Global Research Data Infrastructures”) is a project funded by the European Commission
within the framework of the 7th Framework Programme for Research and Technological Development (FP7), Research
Infrastructures Coordination Action under the Capacities Programme - Géant & eInfrastructures Unit. For more information on the
project, its partners and contributors please see http://www.grdi2020.eu. You are permitted to copy and distribute verbatim
copies of this document containing this copyright notice, but modifying this document is not allowed. You are permitted to copy
this document in whole or in part into other documents if you attach the following reference to the copied elements:
“Copyright © 2010. The GRDI2020 Consortium. http://www.grdi2020.eu/StaticPage/About.aspx”
The information contained in this document represents the views of the GRDI2020 Consortium as of the date they are published.
The GRDI2020 Consortium does not guarantee that any information contained herein is error-free, or up to date. THE GRDI2020
CONSORTIUM MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, BY PUBLISHING THIS DOCUMENT.
GLOSSARY
AB Advisory Board
BELIEF II BELIEF-II (Bringing Europe’s eLectronic Infrastructures to Expanding Frontiers -
Phase II) is a 25-month EU FP7 project, started on 1 April 2008, that supports the
goals of e-Infrastructure projects in order to maximise synergies between research,
scientific and industrial communities in specific application areas.
CODATA CODATA, the Committee on Data for Science and Technology, is an
interdisciplinary Scientific Committee of the International Council for Science
(ICSU). CODATA works to improve the quality, reliability, management and
accessibility of data of importance to all fields of science and technology.
DC-NET DC-NET (Digital Cultural heritage NETwork) is an ERA-NET (European Research
Area Network) project, financed by the European Commission under the e-
Infrastructure - Capacities Programme of FP7 (December 2009 – December
2011). The main objective of the DC-NET project is to develop and strengthen the
coordination, among the European countries, of public research programmes in
the sector of digital cultural heritage.
DCI Distributed Computing Infrastructure.
DL.org Digital Library Interoperability, Best Practices and Modelling Foundations is a
two-year Coordination Action, which started in December 2008, funded by the
European Commission under the 7th Framework Programme ICT Thematic Area
"Digital Libraries and Technology-Enhanced Learning".
DoW Description of Work / Annex 1 / Technical Annex
DRIVER DRIVER II (Digital Repositories Infrastructure Vision for European Research) is a
collaboration, co-funded by the European Commission, to build a network of
freely accessible digital repositories with content across academic disciplines.
e-IRG e-Infrastructure Reflection Group
EC European Commission
ECRI2010 European Conference on Research Infrastructures.
EGEE Enabling Grids for E-SciencE
EGI European Grid Infrastructure
ERA-NET European Research Area Network
EU European Union
FP6 European Commission’s Sixth Framework Programme
FP7 European Commission’s Seventh Framework Programme
GA Grant Agreement
GRDI2020 Global Research Data Infrastructures 2020 (FP7 EC-funded Coordination Action aimed at establishing a framework for technical, organisational, and policy recommendations guiding the development of GRDI ecosystems)
GRL2020 Global Research Library 2020
HELIO HELIO is the Heliophysics Integrated Observatory that aims to deploy a
distributed network of services that will address the needs of a broad
community of researchers in heliophysics.
HLG High Level Expert Group
ICT2010 ICT2010 is Europe's most visible forum for ICT research and innovation,
organised by the European Commission and hosted by the Belgian Presidency of
the European Union.
IMPACT IMPACT (IMproving Protein Annotation through Coordination and Technology) is
a project that aims to harness existing technologies (such as web services and
distributed computing) and use them to dramatically improve existing
information resources.
METAFOR METAFOR (Common Metadata for Climate Modelling Digital Repositories) is a
project that aims to develop a Common Information Model (CIM) to describe
climate data and the models that produce it in a standard way, and to ensure
the wide adoption of the CIM.
OGF Open Grid Forum
OpenAIRE OpenAIRE (Open Access Infrastructure for Research in Europe) is an initiative
that aims to support the implementation of Open Access in Europe by
establishing the infrastructure for researchers to support them in complying
with the EC OA pilot and the ERC Guidelines on Open Access.
SDI Science Data Infrastructures
SEALS SEALS (Semantic Evaluation At Large Scale) is a co-funded project that aims to
provide an independent, open, scalable, extensible and sustainable
infrastructure (the SEALS Platform) that allows the remote evaluation of
semantic technologies thereby providing an objective comparison of the
different existing semantic technologies.
WG Working Group
WP Work Package
Table 1 - Glossary
TABLE OF CONTENTS
Acknowledgments ...................................................................................................................... 7
1. Executive Summary ............................................................................................................. 8
2. Methodology..................................................................................................................... 10
3. The New Science Paradigm ................................................................................................ 12
4. Research Data Infrastructures ........................................................................................... 13
4.1 Data-Intensive Science .......................................................................................................... 15
4.2 Multidisciplinary – Interdisciplinary Science ........................................................................... 15
5. A Strategic Vision for a Global Research Data Infrastructure .............................................. 17
5.1 Defining the Digital Science Ecosystem................................................................................... 17
5.1.1 Digital Data Libraries (Science Data Centres) .......................................................................... 18
5.1.2 Digital Data Archives ................................................................................................................ 19
5.1.3 Digital Research Libraries ........................................................................................................ 20
5.1.4 Communities of Research ........................................................................................................ 20
5.2 Digital Science Ecosystem Concepts ....................................................................................... 20
5.3 Global Research Data Infrastructures: The GRDI2020 Vision ................................................... 21
6. Technological Challenges ................................................................................................... 27
6.1 Data Challenges ..................................................................................................................... 27
6.1.1 Data Modeling ......................................................................................................................... 27
6.1.2 Metadata Modeling ................................................................................................................. 28
6.1.3 Data Provenance Modeling ..................................................................................................... 28
6.1.4 Data Context Modeling ............................................................................................ 29
6.1.5 Data Uncertainty Modeling ..................................................................................................... 30
6.1.6 Data Quality Modeling ............................................................................................................. 30
6.2 Data Management Challenges ............................................................................................... 31
6.2.1 Data Curation ........................................................................................................................... 31
6.3 Data Tools Challenges ............................................................................................................ 31
6.3.1 Data Visualization .................................................................................................................... 32
6.3.2 Massive Data Mining ............................................................................................... 33
7. System Challenges ............................................................................................................. 34
7.1 Virtual Research Environment (VRE) ...................................................................................... 34
7.2 Science Gateways .................................................................................................................. 35
7.3 Ontology Management .......................................................................................................... 35
7.4 Scientific Workflow Management .......................................................................................... 40
7.5 Interoperability and Mediation Software ............................................................................... 44
7.5.1 Exchangeability – The Heterogeneity Problem ....................................................................... 45
7.5.2 Compatibility – The Logical Inconsistency Problem ................................................................ 45
7.5.3 Usability – The Usage Inconsistency Problem ......................................................................... 45
7.5.4 Mediation Software ................................................................................................................. 46
8. Infrastructural Challenges .................................................................................................. 48
8.1 Data and Data Service/Tool Findability .................................................................................. 48
8.1.1 Data Registration ..................................................................................................................... 49
8.1.2 Data Citation ............................................................................................................................ 52
8.1.3 Data Discovery ......................................................................................................................... 56
8.1.4 Data Tool/Service Discovery .................................................................................................... 58
8.2 Data Federation .................................................................................................................... 62
8.2.1 Data Integration....................................................................................................................... 62
8.2.2 Data Harmonization ................................................................................................................. 64
8.2.3 Data Linking ............................................................................................................................. 65
8.3 Data Sharing.......................................................................................................................... 66
9. Application Challenges ...................................................................................................... 71
9.1 Complex Interaction Modes ................................................................................................... 71
9.2 Multidisciplinary – Interdisciplinary Research ........................................................................ 71
9.2.1 Boundary Objects .................................................................................................................... 73
9.3 Globalism and Virtual Proximity ............................................................................................ 74
10. Organizational Challenges .............................................................................................. 75
10.1 Digital Data Libraries (Science Data Centres) ....................................................................... 75
10.2 Digital Data Archives .......................................................................................................... 76
10.3 Personal Workstations ....................................................................................................... 77
11. A New Computing Paradigm: Cloud Computing .............................................................. 78
12. A New Programming Paradigm: MapReduce .................................................................. 86
13. Policy Challenges ........................................................................................................... 90
14. Open Science – Open Data ............................................................................................. 95
15. Recommendations ....................................................................................................... 100
16. References .................................................................................................................. 104
Acknowledgments
The GRDI2020 Consortium would like to thank all those who have provided feedback and suggestions to
the roadmap, in particular:
Malcolm Atkinson, GRDI2020 Advisory Board Member
Dan Atkins, Univ. of Michigan, Vice-President for Research Cyber-infrastructure & GRDI2020 Advisory
Board Member
Antonella Fresa, DC-Net Project Coordinator, Italy
Fabrizio Gagliardi, EMEA Director, Microsoft Research Connections (External Research), Microsoft
Research, UK & GRDI2020 Advisory Board Member
David Giaretta, STFC and Alliance for Permanent Access, United Kingdom & HLG-SDI Rapporteur*
Stephen M. Griffin, Program Director Information Integration and Informatics (III) cluster, National Science
Foundation, US
Ray Harris, Emeritus Professor, University College London & GRDI2020 Advisory Board Member
Jane Hunter, University of Queensland, Australia
Simon Lin, Academia Sinica & GRDI2020 Advisory Board Member
Monica Marinucci, Director, Oracle Public Sector, Education & Research Business Unit & HLG-SDI
Member*
Reagan Moore, Director of the Data Intensive Cyber Environments Center (DICE Center) & Prof. in the School
of Library Information Science at the Univ. of North Carolina at Chapel Hill, US
Vanderlei Perez Canhos, President Director of the Reference Center on Environmental Information (CRIA),
Brazil & GRDI2020 Advisory Board Member
Laurent Romary, INRIA & Humboldt University & HLG-SDI Member*
Michael Wilson, e-Science Department, STFC Rutherford Appleton Laboratory, United Kingdom
Peter Wittenburg, Technical Director, Max Planck Institute for Psycholinguistics, The Netherlands & HLG-
SDI Member*
*European Commission, Directorate General Information Society & Media High Level Expert Group on Scientific Data
The GRDI2020 Consortium is composed of Trust-IT Services Ltd. (UK); Consiglio Nazionale delle Ricerche –
Istituto di Scienza e Tecnologie dell'Informazione “A. Faedo” (CNR-ISTI) (IT); ATHENA Research & Innovation
Center in Information, Communication & Knowledge Technologies (GR); and PDC Center for High Performance
Computing, Kungliga Tekniska Hoegskolan (SE).
1. Executive Summary
New high-throughput scientific instruments, telescopes, satellites, accelerators, supercomputers,
sensor networks and simulations are generating massive amounts of data.
The availability of huge volumes of data is at once a major opportunity and a major challenge
for scientists.
This data availability can revolutionize the way research is carried out and lead to a new data-
centric way of thinking, organizing and carrying out research activities.
However, in order to be able to exploit these huge volumes of data, new techniques and
technologies are needed. A new type of e-infrastructure, the Research Data Infrastructure, must
be developed for harnessing the accumulating data and knowledge produced by the communities
of research.
Research Data Infrastructures can be defined as managed networked environments for digital
research data consisting of services and tools that support: (i) the whole research cycle, (ii) the
movement of research data across scientific disciplines, (iii) the creation of open linked data
spaces by connecting data sets from diverse disciplines, (iv) the management of scientific
workflows, (v) the interoperation between research data and literature and (vi) an integrated
Science Policy Framework.
Science is a global undertaking and research data are both national and global assets. Therefore,
global research data infrastructures are needed to overcome language, policy, and social
barriers and to reduce the geographic, temporal, and national barriers to the discovery, access
and use of data.
The next generation of global scientific data infrastructures is facing two main challenges:
To effectively and efficiently support data-intensive Science
To effectively and efficiently support multidisciplinary/interdisciplinary Science
In this Report an organizational model of the science universe is defined, based on the ecosystem
metaphor, in order to conceptualize all the “research relationships” between the components of
the science universe.
According to this metaphor a digital science ecosystem is a complex system composed of Digital
Data Libraries (Data Centres), Digital Data Archives, Digital Research Libraries and Communities of
Research. A Global Research Data Infrastructure should act as: an enabler of an open, extensible
and evolvable digital science ecosystem; a facilitator of data, information and knowledge
discovery; and an enhancer of problem-solving processes.
The Report describes a core set of functionality that must be provided by Global Research Data
Infrastructures in order to make the holdings (data collections and data tools) of the science
ecosystem components discoverable, interoperable, correlatable and usable.
To make this happen several technological breakthroughs must be achieved in the fields of
research data modelling and management, data tools and services. In addition, several system,
application, organization, and policy challenges must be successfully tackled.
This Report, whose intended audience is policy makers, scientists, engineers, computer scientists
and theoreticians, highlights some of the many difficult technological problems that must be
solved to make it feasible to build theoretically founded global scientific data
infrastructures.
To pursue a strategy aimed at achieving our vision for global scientific data infrastructures, a
number of recommendations are given; they are explained in detail in Section 15:
1. The social and organizational aspects of science, as well as the potential tensions that global
research data infrastructures could face or provoke, should be taken into due consideration
when designing such infrastructures.
2. Global Research Data Infrastructures must be based on scientifically sound foundations.
3. Formal models and query languages for data, metadata, provenance, context, uncertainty
and quality must be defined and implemented.
4. New advanced data tools must be developed.
5. Future Research Data Infrastructures must support open linked data spaces.
6. Future Research Data Infrastructures must support interoperation between science data
and literature.
7. New advanced infrastructural services must be developed.
8. For the principles of open science and open data to be widely accepted, they must be
realized within an integrated science policy framework implemented and enforced by
global research data infrastructures.
9. A new international research community must be created.
10. New professional profiles must be created.
2. Methodology
When producing a Roadmap Report on scientific data infrastructures the main difficulty faced by
the authors is how to distinguish it from a plethora of roadmaps and white papers produced by
international organizations (ICSU, ESFRI, e-IRG, etc.), projects & initiatives (PARADE, e-SciDR, etc.),
funding agencies (EC, ERC, NSF, NSTC, etc.), and expert committees (HLG, WGs, etc.).
The challenge is to avoid repetitions and lists of well-known application requirements and
recommendations.
There are two main constituencies which play a crucial role in the development of scientific data
infrastructures: one that uses data-intensive methods and another that creates these methods.
So far, most of the roadmaps/white papers/visions that are circulating have been produced or
heavily influenced by members of the first constituency with very little involvement by the second
constituency. Consequently, these reports mainly describe the application requirements that must
be met by the future data infrastructures. The challenges that the researchers of the second
constituency have to overcome in order to make feasible the building of the next generation of
data infrastructures are generally ignored. The current reports concentrate on the application
constituency.
The approach taken by the authors of this report is the opposite. We take for granted a number of
core application requirements which can be found in almost all the existing reports and investigate
and describe the technical/scientific challenges, from the computer science point of view, that
must be overcome in order to meet them.
We think that this approach on one hand highlights the many difficult technical problems that
must be solved in order to make feasible the building of research data infrastructures and on the
other hand presents a complementary vision, with respect to the other roadmaps reports, to the
problems which hinder the implementation of these systems.
In order to be able to carry out this task, the GRDI2020 project mobilized the international
research community. Two Working Groups on technological and organizational/policy issues have
been created and were composed of internationally recognized experts with the mandate of
identifying and describing the most critical technical and organizational problems in terms of
state-of-the-art and current practices, recommendations for research directions and potential
impact.
In addition, two international GRDI2020 workshops were organized over the lifetime of the
project, Cape Town, South Africa (Oct. 2010) and Brussels (Oct. 2011) where intermediate versions
of the GRDI2020 Roadmap Report as well as the findings of the two Working Groups were
presented to a selected representation of the international research community for further
validation.
Two versions of the GRDI2020 Roadmap Report were published. The first, preliminary version was
delivered at the end of the first year of the project (January 2011); it mainly addressed some of the
main data, system, application and organization challenges to be faced by global research data
infrastructures, and its primary target audience was policy makers and scientists. This first version
of the Roadmap Report was widely circulated among the interested scientific communities in
order to collect their feedback. It took into consideration discussions at the working group
meetings and at the GRDI2020 South Africa workshop (Oct. 2010).
The Final Version of the Report is an enriched and enhanced version of the first and focuses
mainly on the description of the services to be provided by a global research data
infrastructure in order to enable global collaboration and thus to create a collaborative science
environment. The policy challenges faced by global research data infrastructures, as well as new
computing and programming paradigms relevant for data intensive processing are also described.
This version was delivered at the end of the project (January 2012) and took into consideration
suggestions and recommendations of the international research community as well as feedback
collected from the main stakeholders, including relevant scientific organizations/foundations such
as CODATA, ESFRI, e-IRG, e-Science Centre, NSF and on-going projects implementing data
infrastructures.
In addition, a Short Version of the Final Version of the Report was produced, printed, widely
distributed, and presented at several scientific events (Workshop on “Global Scientific Data
Infrastructures: The Big Data Challenges”, Capri, May 2011; “Data-Intensive Research Theme Final
Workshop”, Edinburgh, June 2011; Workshop on “Global Research Data Infrastructures: The GRDI
Vision”, Brussels, October 2011; Workshop on “Global Research Data Infrastructures: The GRDI
Vision”, Stockholm, December 2011, organized in conjunction with the international e-Science
Conference; and the “GCOE-NGIT 2012 Conference on Knowledge Discovery and Federation”, Sapporo,
Japan, January 2012).
Finally, a journalistic version of the Short Version was produced, edited by a
professional journalist (Richard Hudson, CEO & Editor, Science/Business). This report was also
widely disseminated.
3. The New Science Paradigm
Some areas of science are currently facing a hundred- to thousand-fold increase in
data volumes compared to those generated only a decade ago. This data is coming from
satellites, telescopes, high-throughput instruments, sensor networks, accelerators,
supercomputers, simulations, and so on [1]. The availability and use of huge datasets presents
both new opportunities and new challenges for scientific research.
Often referred to as a data deluge, these massive datasets are revolutionizing the way research is
carried out, resulting in the emergence of a new, fourth paradigm of science based on data-intensive
computing [2]. The new data-dominated science will lead to a new data-centric way of
conceptualizing, organizing and carrying out research activities, prompting new approaches to
problems that were previously considered extremely hard or, in some cases, even impossible to
solve, and also leading to serendipitous discoveries.
The new availability of huge amounts of data, along with advanced tools of exploratory data
analysis, data mining/machine learning and data visualization, offers a whole new way of
understanding the world. One view put forward is that in the new data-rich environment
correlation supersedes causation, and science can advance even without coherent models, unified
theories, or really any mechanistic explanation at all [3].
In order to be able to exploit these huge volumes of data, new techniques and technologies are
needed. A new type of e-infrastructure, the Research Data Infrastructure, must be developed for
harnessing the accumulating data and knowledge produced by the communities of research,
optimizing the data movement across scientific disciplines, enabling large increases in multi- and
inter- disciplinary science while reducing duplication of effort and resources, and integrating
research data with published literature.
To make this happen several breakthroughs must be achieved in the fields of research data
modelling, management and data tools and services.
4. Research Data Infrastructures
Research Data Infrastructures can be defined as managed digital research data and resources in
networked environments that include services and tools that support: (i) the whole research cycle,
(ii) the movement of scientific data across scientific disciplines, (iii) the creation of open linked
data spaces by connecting data sets from diverse disciplines, (iv) the management of scientific
workflows, (v) the interoperation between scientific data and literature, and (vi) an Integrated
Science Policy Framework.
Research data infrastructures are not systems in the traditional sense of the term; they are
networks that enable locally controlled and maintained digital data and library systems to
interoperate more or less seamlessly. Genuine research data infrastructures should be ubiquitous,
reliable, and widely shared resources operating on national and transnational scales.
A research data infrastructure should include organizational practices, technical infrastructure
and social forms that collectively provide for the smooth operation of collaborative scientific work
across multiple geographic locations. All three should be objects of design and engineering; a data
infrastructure will fail if any one is ignored [4].
Another school of thought considers (data) infrastructure as a fundamentally relational concept: it
becomes infrastructure in relation to organized (research) practices [5]. The relational property of
(data) infrastructure refers to that which lies between – between communities and collections of
data and publications, mediated by services and tools. According to this school of thought,
the exact sense of the term (data) infrastructure and its “betweenness” are both theoretical and
empirical questions.
In [6] (data) infrastructure emerges with the following dimensions:
Embeddedness: Infrastructure is “sunk” into, inside of, other structures, social
arrangements and technologies
Transparency: Infrastructure is transparent to use, in the sense that it does not have to be
reinvented each time or assembled for each task, but invisibly supports those tasks.
Reach or scope: Infrastructure has reach beyond a single event or one-site practice.
Learned as part of membership: The taken-for-grantedness of artifacts and organizational
arrangements is a sine qua non of membership in a community of practice. Strangers and
outsiders encounter infrastructure as a target object to be learned about. New participants
acquire a naturalized familiarity with its objects as they become members.
Links with conventions of practice: Infrastructure both shapes and is shaped by the
conventions of a community of practice.
Embodiment of standards: Modified by scope and often by conflicting conventions,
infrastructure takes on transparency by plugging into other infrastructures and tools in a
standardized fashion.
Built on an installed base: Infrastructure does not grow de novo; it wrestles with the
“inertia of the installed base” and inherits strengths and limitations from that base.
Becomes visible upon breakdown: The normally invisible quality of working infrastructure
becomes visible when it breaks: the server is down, the bridge washes out, there is a power
blackout. Even when there are back-up mechanisms or procedures, their existence further
highlights the now-visible infrastructure.
Research data infrastructures should be science- and engineering-driven and, when coupled with
high-performance computational systems, should increase the overall capacity and scope of scientific
research. Optimization for specific applications may be necessary to support the entire research
cycle, but work in this area is already mature in many problem domains.
Science is a global undertaking and research data are both national and global assets. There is a
need for a seamless infrastructure to facilitate collaborative arrangements necessary for the
intellectual and practical challenges the world faces.
Therefore, there is a need for global research data infrastructures able to interconnect the
components of a distributed worldwide science ecosystem by overcoming language, policy,
methodology, and social barriers. Advances in technology should enable the development of
global research data infrastructures that reduce the geographic, temporal, social, and national
barriers to discovering, accessing, and using data.
Their ultimate goal should be to enable researchers to make the best use of the world’s growing
wealth of data.
The next generation of global research data infrastructures is facing two main challenges:
To effectively and efficiently support data-intensive Science
To effectively and efficiently support multidisciplinary/interdisciplinary Science

By data we mean any digitally encoded information that can be stored, processed and transmitted
by computers, including [7]:
Collections of data from instruments, observatories, surveys and simulations;
Results from previous research and earlier surveys;
Data from engineering and built-environment design, planning and production processes;
Data from diagnostic, laboratory, personal and mobile devices;
Streams of data from sensors in man-made and natural environments;
Data from monitoring digital communications;
Data transferred during transactions that enable business, administration, healthcare, and
government;
Digital material produced by news feeds, publishing, broadcasting and entertainment;
Documents in collections and held privately, texts and multi-media “images” in web pages,
wikis, blogs, emails and tweets; and
Digitized representations of diverse collections of objects, e.g. of museums’ curated
objects.
4.1 Data-Intensive Science
By data-intensive science we mean any scientific research activity whose progress is heavily
dependent on careful thought about how to use data. Such research activities are characterized
by:
increasing volumes and sources of data,
complexity of data and data queries,
complexity of data processing,
high dynamicity of data,
high demand for data,
complexity of the interaction between researchers and data, and
importance of data for a large range of end-user tasks.
Fundamentally, data-intensive disciplines face two major challenges [8]:
Managing and processing exponentially growing data volumes, often arriving in time-sensitive
streams from arrays of sensors and instruments, or as the outputs from simulations; and
Significantly reducing data analysis cycles so that researchers can make timely decisions.
4.2 Multidisciplinary – Interdisciplinary Science
By multidisciplinary approach to a research problem we mean an approach that draws
appropriately from multiple disciplines in order to redefine the problem outside of normal
boundaries and reach solutions based on a new understanding of complex situations.
There are several barriers to the multidisciplinary approach, of both a behavioural and a
technological nature.
Among the major technological barriers we identify those that must be overcome when moving
data, information, and knowledge between disciplines. There is a risk that representations are
interpreted in different ways because the interpretative context is lost. This can lead to a
phenomenon called “ontological drift”, in which the intended meaning becomes distorted as the
information object moves across semantic boundaries (semantic distortion) [9].
A relatively similar concept is the interdisciplinary approach to a research problem. It involves the
connection and integration of expertise belonging to different disciplines for the purpose of
solving a common research problem.
Again, the barriers faced by an interdisciplinary approach are of two types: behavioural and
technological.
Among the major technological barriers we identify the need for integrating data, information,
and knowledge created by different disciplines. In fact, one of the major barriers to be overcome
concerns the integration of activities that are taking place on different ontological foundations.
The requirements described above, imposed by data-intensive multidisciplinary and interdisciplinary
science, are the motivation for building the theoretical foundations of the next generation of data
infrastructures. To make this happen, a considerable number of difficult data, application, system,
organizational, and policy challenges must be successfully tackled.
The breakthrough technologies needed to address many of the critical problems in data-intensive
multidisciplinary-interdisciplinary computing will come from collaborative efforts involving many
domain application disciplines as well as computer science, engineering and mathematics.
5. A Strategic Vision for a Global Research Data Infrastructure
We envision that in the future several Digital Science Ecosystems will be established.
We use the ecosystem metaphor in order to conceptualize all the “research relationships”
between the components of the science universe.
The traditional notion of an ecosystem in biological sciences describes a habitat for a variety of
different species that co-exist, influence each other, and are affected by a variety of external
forces. Within the ecosystem, the evolution of one species affects and is affected by the evolution
of other species.
We think that a digital-ecosystem model of scientific research allows a better
understanding of its dynamic nature. We believe that the ecological metaphor for understanding
the complex network of data-intensive multidisciplinary research relationships is appropriate as it
is reminiscent of the interdependence between species in biological ecosystems. It emphasizes
that advances and transformations in scientific disciplines are as much a result of the broader
research environment as of technological progress alone.
In the world of science, there are many factors that influence the evolution of a specific scientific
discipline. By considering the digital science ecosystem as an interrelated set of data collections,
services, tools, computations, technologies and communities of research we can contribute to
identifying the factors that impact scientific progress.
5.1 Defining the Digital Science Ecosystem
We introduce a digital science ecosystem approach that considers a complex system composed of
Digital Data Libraries, Digital Data Archives, Digital Research Libraries, and Communities of
Research (see figure below).
Figure 1 – GRDI2020 Digital Science Ecosystem
5.1.1 Digital Data Libraries (Science Data Centres)
Increasingly, the volumes of data produced by high-throughput instruments and simulations are so
large, and the application programs are so complex, that it is far more economical to move the
end-user’s programs to the data rather than moving the source data and its applications to the
user’s local system. The volumes of data used in large-scale applications today simply cannot be
moved efficiently or economically, even over very high-bandwidth internet links.
From the organizational point of view, moving end-user applications to data stores calls for the
creation of Service Stations called Digital Data Libraries or Science Data Centers [10].
These should be designed to ensure the long-term stewardship and provision of quality-assessed
data and data services to the international science community and other stakeholders. Each Digital
Data Library will have responsibilities for curation of datasets and the applications that provide
access to them, and employ technical support staff that understand the data and manage the
growth, quality and inherent value of the datasets.
Digital Data Libraries fall into one of several categories [11]:
Research Digital Data Libraries: contain the products of one or more focused research
projects. Typically, these data require limited processing or curation. They may or may not
conform to community standards for file formats, metadata structure, or content access
policies. Quite often, applicable standards may be nonexistent or rudimentary because the
data types are novel and the size of the user community small. Research data collections
may vary greatly in size but are intended to serve a specific group, for a specific purpose, for
an identifiable period of time. There may be no intention to preserve the collection beyond
the end of a project.
Discipline or Community Digital Data Libraries: serve a single science or engineering
disciplinary or scholarly community. These digital data libraries often establish community-
level standards either by selecting from among preexisting standards or by bringing the
community together to develop new standards where they are absent or inadequate.
Reference Digital Data Libraries: are intended to serve large segments of the scientific and
education community. Characteristic features of this category of digital data libraries are
their broad scope and diverse set of user communities including scientists, students, and
educators from a wide variety of disciplinary, institutional, and geographical settings. In
these circumstances, conformance to robust, well-established, and comprehensive
standards is essential, and the selection of standards often has the effect of creating a
universal standard.
Specialized Service Digital Data Libraries: while the data libraries described above are
intended to provide access to data collections and a set of basic services (including
collection, curation, provision, short-term preservation, and publishing), another category
of data libraries will emerge offering specialized data services, i.e., data analysis, data
visualization, massive data mining, etc. These will play a very important role in advancing
data-intensive research.
5.1.2 Digital Data Archives
Scientific Data archiving refers to the long-term storage of scientific data and methods, that is, the
process of moving data that is no longer actively used to a separate data storage device for long-
term preservation. Data archives consist of older data that is still important and necessary for
future reference, as well as data that must be retained for regulatory compliance. Provisions for
long-term preservation (including means for continuously assessing what to keep and for how
long) should be provided.
Data archiving is more important in some fields than in others. The requirement for digital data
archiving is a recent development in the history of science and has been made possible by
advances in information technology that allow large amounts of data to be stored and accessed
from central locations.
Data Archives should be indexed and have search capabilities so that files and parts of files can be
easily located and retrieved [12].
Data Preservation issues become very important in data archiving. All storage media deteriorate
over time, although some more rapidly than others. As systems change, data formats and access
methods also change and it is already the case that a considerable amount of digital data cannot
be retrieved because the systems and software that created these data no longer exist. Data
preservation is an active area of computer science research and its importance will continue to
grow as data archives become larger and more numerous.
5.1.3 Digital Research Libraries
A Digital Research Library is a collection of electronic documents. The mission of research libraries
is to acquire information, organize it, make it available and preserve it. To meet user needs, the
founders of a Digital Research Library must accomplish two general tasks: establishing the
repository of electronic scholarly materials, and implementing the tools to use it. More
importantly, sustainability models must be established and put into place so that scholarly
information in a repository will be available to future generations of researchers. This implies
strong and enduring organizational, fiscal and institutional commitments [13].
5.1.4 Communities of Research
Science is conducted in a dynamic, evolving landscape of communities of research organized
around disciplines, methodologies, model systems, project types, research topics, technologies,
theories, etc. These communities facilitate scientific progress and can provide a coherent voice for
their constituents, enhancing communication and cooperation and enabling processes for quality
control, standards development, and validation [14].
5.2 Digital Science Ecosystem Concepts
Digital Science Ecosystem Views
A community of research is interested in performing a research activity based on a specific set of data
and tools. A specific ecosystem view can be defined by identifying a community of research and a
set of data collections, data services, and data tools necessary for the research activity undertaken
by this community. Ecosystem views are materialized by Science Gateways or Virtual Research
Environments.
Digital Science Ecosystem Channels
Research channels can be established across the components of a science ecosystem. The data
and information exchanged between these components flow through the ecosystem channels. We
classify these ecosystem channels according to the research results they enable.
Channels enabling Multidisciplinary/interdisciplinary research: research channels across different
types of digital data libraries allow scientists belonging to different communities of research
and/or to different disciplines to work together.
Channels enabling Data Preservation: channels across digital data libraries and digital data
archives allow data together with the appropriate preservation information to move from short-
term to long-term preservation states based on well defined provision policies.
Channels enabling Unification of Research Data with Scientific Literature: channels across digital
data libraries and digital research libraries make it possible to merge scientific data with research
literature resulting in new document models in which the data and text gain new functionalities. By
integrating scientific data and research publications it will become possible for one to read a paper
and examine the original data on which the paper’s conclusions or claims are based. It will also
be possible to reuse the data and replicate the research and data analysis. Yet another possibility
will be to locate all the literature referencing the data [15].
Channels enabling Cooperative research: channels across the members of communities of
research allow scientific cooperation and collaboration.
Digital Science Ecosystem Services
Digital ecosystem services are necessary in order to enable researchers to efficiently and
effectively carry out research activities. They include data registration, data discovery, data
citation, data service/tool discovery, data search, data integration, data sharing, data linking, data
transportation, data service transportation, ontology/taxonomy management, workflow
management and policy management.
5.3 Global Research Data Infrastructures: The GRDI2020 Vision
The Vision
We envision a Global Research Data Infrastructure as vital to the realization of an open,
extensible and evolvable digital science ecosystem. A Global Research Data Infrastructure both
creates and sustains a reliable operational digital science ecosystem environment.
Therefore it must support:
the creation and maintenance of science ecosystem views through:
o Science Gateways: A Science Gateway is a community-specific set of tools,
applications, and data collections that are integrated together and accessed via a
portal or a suite of applications.
o Virtual Research Environments: A Virtual Research Environment (VRE) is a
“technological framework”, i.e., digital infrastructure and services, that together
allow on-demand creation of “virtual working environments” in which
“communities of research” can effectively and efficiently conduct their research
activities.
the creation and maintenance of research channels across the several components of the
ecosystem.
This implies that all the components of the ecosystem are able, along the research channels, to
exchange data and information without semantic distortions within a negotiated framework of
shared policies. The final result is a highly productive “interoperable science ecosystem” that
reduces the fragmentation of science due to disparate data, contributes to reducing the
geographic fragmentation of datasets, and accelerates the rate at which data and
information can be made available and used to advance science.
In addition, a mediation technology capable of reconciling data and language heterogeneities
associated with different scientific disciplines must be developed in order to enable the next
generation data infrastructures to support ecosystem research paths.
the creation and maintenance of Service Environments that enable the efficient delivery of
ecosystem services. They include:
o Data Registration Environment: By Data Registration Environment we mean an
environment enabling researchers to make data citable as a unique piece of work
and not only as a part of a publication. Once accepted for deposit and archived,
data is assigned a “Digital Object Identifier” (DOI) for registration. A Digital Object
Identifier (DOI) [16] is a unique name (not a location) within a science ecosystem
and provides a system for persistent and actionable identification of data. DOIs
could logically be assigned to every single data point in a set; however in practice,
the allocation of a DOI is more likely to be to a meaningful set of data. Identifiers
should be assigned at the level of granularity appropriate for an envisaged
functional use. The Data Registration Environment should be composed of a
number of capabilities, including a specified numbering syntax, a resolution service,
a data model, and an implementation mechanism determined by policies and
procedures for the governance and application of DOIs.
o Data Discovery Environment: By Data Discovery Environment we mean an
environment enabling researchers to quickly and accurately identify and find data
that supports research requirements within the science ecosystem. It should be
composed of a number of capabilities and tools that support the pinpointing of the
location of relevant data.
o Data Citation Environment: By Data Citation Environment we mean an
environment enabling researchers to provide a reference to data in the same way
as researchers routinely provide a bibliographic reference to printed resources.
Data citation is recognized as one of the key practices underpinning the recognition
of data as a primary research output rather than as a by-product of research. The
essential information provided by a citation links it to the cited data set.
Therefore, a Data Citation Environment should support a number of capabilities
including a standard for citing data sets that addresses issues of confidentiality,
verification, authentication, access, technology changes, existing subfield-specific
practices, and possible future extensions, as well as a naming resolution service.
Data citation will ensure scholarly recognition and credit attribution.
o Data Service/Tool Discovery Environment: By Data Service/Tool Discovery Environment we
mean an environment enabling the automatic location of data services/tools that
fulfill a researcher’s goal. It should support a
number of capabilities, including ontology-based descriptions both of the
researcher’s goal and of the data service/tool functionality, as well as mediation
support in case these descriptions use different ontologies.
o Data Search Environment: By Data Search Environment we mean an
environment in which researchers can identify, locate and access required data. It
should provide capabilities that support a complex search process
characterized by multiple steps and spanning multiple data sources, which may require
long-term sessions and continuous refinement of the search process.
o Data Integration Environment: By Data Integration Environment we mean an
environment enabling researchers to combine data residing at different sources,
and provide them with a unified view of these data. The Data Integration
Environment should include capabilities and tools that support data
transformation, duplicate detection and data fusion.
o Data Sharing Environment: By Data Sharing Environment we mean an environment
enabling the sharing of research results among the members of the Communities
of Research of the science ecosystem. It should be composed of a number of
capabilities and tools that support the contexts for shared data use.
o Data Linking Environment: By Data Linking Environment we mean an environment
enabling the connection of data sets from diverse domains of the science
ecosystem. It should be composed of a number of capabilities and tools that
support the creation of common data spaces that allow researchers to navigate
along links into related data sets.
o Ontology/Taxonomy Management Environment: By Ontology/Taxonomy
Management Environment we mean an environment enabling a wide range of
semantic science ecosystem data services. Ontologies and taxonomies provide the
semantic underpinning enabling intelligent data services including data and service
discovery, search, access, integration, sharing and use of research data. This type
of Service Environment should contain numerous capabilities including, but not
limited to, ontology and taxonomy models, ontology and taxonomy metadata, and
reasoning engines. These are necessary in order to efficiently create, modify,
query, store, maintain, integrate, map, and align top-level and domain ontologies
and taxonomies within the larger science ecosystem.
o Transportable Data Environment: By Transportable Data Environment we mean an
environment enabling researchers to copy data from a source database to a target
database. This environment should be based on a transport technology supporting
the creation of transportable modules which function like a shipping service that
moves a package of objects from one site to another at the fastest possible speed.
Transportable modules enable one to rapidly copy a group of related database
objects from one database to another. The physical and logical structures of the
objects contained in the transportable modules being restored are re-created in
the target database [17].
o Transportable Data Services/Tools Environment: Increasingly, the volumes of data
produced by high-throughput instruments and simulations are so large, that it is
much more economical to move computation to the data rather than moving the
data to the computation. A Transportable Data Services/Tools Environment should
support this model made possible through service-oriented architectures (SOA)
that encapsulate computation into transportable compute objects that can be run
on computers that store targeted data. SOA compute objects function like
applications that are temporarily installed on a remote computer, perform an
operation, and then are uninstalled [18].
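The following Python sketch illustrates, under strongly simplifying assumptions, the "move the computation to the data" pattern: a small compute object is executed at the site holding the data and only the result travels back. Function and variable names are placeholders, and a real implementation would rely on service interfaces, packaging, and sandboxing rather than a direct function call.

```python
import statistics

def remote_site_execute(compute_object, local_data):
    """Stand-in for the data-holding site: install the compute object
    temporarily, run it against the local data, and return only the result."""
    return compute_object(local_data)

def mean_and_spread(values):
    # The compute object encapsulates the analysis and is tiny compared to the data.
    return {"n": len(values), "mean": statistics.mean(values),
            "stdev": statistics.pstdev(values)}

local_archive = [float(i % 97) for i in range(100_000)]   # pretend this is too large to move
print(remote_site_execute(mean_and_spread, local_archive))
```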
o Scientific Workflow Management Environment: By Scientific Workflow we mean a
precise description of a scientific procedure – a multi-step process to coordinate
multiple tasks. Each task represents the execution of a computational process.
Scientific Workflows orchestrate e-Science services so that they cooperate to
efficiently perform a scientific application. A Workflow Management Service should
support the creation, maintenance, and operation of scientific workflows.
o Policy Management Environment: By Policy Management Environment we mean
an integrated set of formal semantic policies that enhances the authorization,
obligation, and trust processes that permit regulated access and use of data and
services (data policies). The same formal semantic policies are also used to
estimate trust based on parties’ properties (trust management policies). A Policy
Management Environment should provide accurate and explicit policy
representation and specification languages, policy editor tools, policy administrator
tools, algorithms for conflict detection and resolution, and graphical tools for
editing, updating, removing, and browsing policies as well as de-conflicting newly
defined policies.
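A toy Python sketch of the kind of explicit, machine-readable policies such an environment would manage is given below; the rule structure, the deny-overrides decision strategy and the conflict check are illustrative assumptions, not a standard policy language.

```python
policies = [
    {"id": "p1", "subject": "researcher", "action": "read",  "resource": "survey-2011", "effect": "permit"},
    {"id": "p2", "subject": "researcher", "action": "write", "resource": "survey-2011", "effect": "deny"},
    {"id": "p3", "subject": "researcher", "action": "read",  "resource": "survey-2011", "effect": "deny"},
]

def decide(subject, action, resource):
    # Deny-overrides combining: any applicable deny rule wins.
    applicable = [p for p in policies
                  if (p["subject"], p["action"], p["resource"]) == (subject, action, resource)]
    if any(p["effect"] == "deny" for p in applicable):
        return "deny"
    return "permit" if applicable else "not applicable"

def conflicts():
    # Conflict detection: same subject/action/resource with opposite effects.
    seen, clashes = {}, []
    for p in policies:
        key = (p["subject"], p["action"], p["resource"])
        if key in seen and seen[key]["effect"] != p["effect"]:
            clashes.append((seen[key]["id"], p["id"]))
        seen.setdefault(key, p)
    return clashes

print(decide("researcher", "read", "survey-2011"))  # deny (p3 overrides p1)
print(conflicts())                                  # [('p1', 'p3')]
```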
The ultimate aim of a Global Research Data Infrastructure is to enable global collaboration in key
areas of science by supporting science ecosystem views, channels, and services, thus creating a collaborative science environment.
Social and Organizational Dimensions of a Research Data Infrastructure
A Digital Science Ecosystem model must also consider the fact that external environmental forces
influence research advances. Specifically, three major types of external environmental forces
should be considered: social and governmental forces, economic forces, and technical forces [19].
A viable vision of research data infrastructure must take into account social and organizational
dimensions that accompany the collective building of any complex and extensive resource.
A robust Global Research Data Infrastructure must consist not only of a technical infrastructure
but also a set of organizational practices and social forms that work together to support the full
range of individual and collaborative scientific work across diverse geographic locations. A data
infrastructure will fail or quickly become encumbered if any one of these three critical aspects is
ignored. By considering data infrastructure as just a technical system to be designed, the
importance of social, institutional, organizational, legal, cultural, and other non-technical problems
is marginalized and the outcome is almost always flawed or less useful than originally anticipated
[4].
Tensions
New research data infrastructures are encountering and often provoking a series of tensions [4].
Because of its potential to upset or recast previously accepted relations and practices, the development of new data infrastructures may generate, to some degree, what economists have
labeled “creative destruction”. This occurs when established practices, organizational norms,
individual and institutional expectations adjust in a positive or negative fashion in reaction to the
new possibilities and challenges posed by infrastructure. In the best circumstances, individuals and
institutions take advantage of and build upon new resources. In other cases, the inertia of long-standing organizational arrangements, scholarly approaches, and research practices proves too difficult to change, with disastrous consequences. Tensions should be thought of as
both barriers and resources to infrastructural development, and should be engaged constructively.
A second class of tensions can be identified in instances where changing infrastructures bump up
against the constraints of political economy: intellectual property rights regimes, public/private
investment models, ancillary policy objectives, etc. Clearly, the next generation of research data
infrastructures pose new challenges to existing regimes of intellectual property. Indeed,
intellectual property concerns are likely to multiply with the advent of increasingly networked and
collaborative forms of research supported by the data infrastructures [4].
Similar tensions arise in determining relationships between national policy objectives and the
transnational pull of science. Put simply, where large-scale policy interests (in national economic
competitiveness, security interests, global scientific leadership, etc.) stop at the borders of the
nation-state, the practice of science spills into the world at large, connecting researchers and
communities from multiple institutional and political locales. This state of affairs has a long
history of creating tension in science and education policy, revealing in very practical terms the
complications of co-funding arrangements across multiple national agencies [4].
To the extent that research data infrastructures support research collaborations across national
borders, such national/transnational tensions must be carefully considered and efforts must be
continually undertaken to resolve them.
6. Technological Challenges
6.1 Data Challenges
Data challenges include research data modelling, data management and data tools challenges.
There is a need for radically new data models and query languages and tools that enable scientists
to follow new paths, try new techniques, build new models and test them in new ways that
facilitate innovative research activities.
Data Modeling Challenges
There is a need for radically new approaches to research data modelling. In fact, the current data
models (relational model) and management systems (relational database management systems)
were developed by the database research community for business/commercial data applications.
Research data has completely different characteristics from business/commercial data and thus
the current database technology is inadequate to handle it efficiently and effectively.
There is a need for data models and query languages that:
more closely match the data representation needs of the several scientific disciplines;
describe discipline-specific aspects (metadata models);
represent and query data provenance information;
represent and query data contextual information;
represent and manage data uncertainty;
represent and query data quality information.
6.1.1 Data Modeling
While most scientific users can use relational tables and have been forced to do so by current
systems, we can find only a few users for whom tables are a natural data model that closely
matches their data. Conventional tabular (relational) database systems are adequate for analysing
objects (galaxies, spectra, proteins, events, etc.), but the support for time-sequence, spatial, text
and other data types is awkward.
For some scientific disciplines (astronomy, oceanography, fusion and remote sensing) an array
data model is more appropriate. Database systems have not traditionally supported science’s core
data type: the N-dimensional array. Simulating arrays on top of tables is difficult and results in
poor performance.
Other disciplines, such as biology and genomics, consider graphs and sequences more appropriate for their needs.
Lastly, solid modelling applications want a mesh data model [20].
The net result is that “one size will not fit all” and science users will need a mix of specialized
database management systems.
This collection of problems is generally called the “impedance mismatch” – meaning the mismatch
between the programming model and the database capabilities. The impedance mismatch has
made it difficult to map many science applications into conventional tabular database systems.
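The contrast can be illustrated with a small Python example using the NumPy library: the same two-dimensional sensor grid is handled natively as an N-dimensional array and, alternatively, simulated as a table of (x, y, value) rows; a spatial slice is a single operation in the array model but requires filtering and re-assembly in the tabular encoding.

```python
import numpy as np

grid = np.arange(16, dtype=float).reshape(4, 4)    # native N-dimensional array model
window = grid[1:3, 1:3]                            # spatial slice: one direct operation

# Tabular simulation of the same data: one row per cell.
rows = [(x, y, grid[x, y]) for x in range(4) for y in range(4)]
window_rows = [(x, y, v) for (x, y, v) in rows if 1 <= x < 3 and 1 <= y < 3]

print(window)
print(window_rows)
```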
6.1.2 Metadata Modeling
Metadata is the descriptive information about data that explains the measured attributes, their
names, units, precision, accuracy, data layout and ideally a great deal more. Most importantly,
metadata includes the data lineage that describes how the data was measured, acquired, or computed.
The metadata is as valuable as the data itself [10].
If the data is to be analysed by generic tools, the tools need to “understand” the data; in other words, they need to know the metadata.
If scientists are to read data collected by others, then the data must be carefully documented and
must be published in forms that allow easy access and automated manipulation. In the next
generation data infrastructures, there will be powerful tools to make it easy to capture, organize,
analyse, visualize, and publish data. The tools will do data mining and machine learning on the
data, and will make it easy to script workflows and analyse the data. Good metadata for the inputs
is essential to make these tools automatic. Preserving and augmenting this metadata as part of the
processing (data lineage) will be a key benefit for next generation tools.
All the derived data that the scientist produces must also be carefully documented and published
in forms that allow easy access. Ideally, much of this metadata would be automatically generated
and managed as part of the workflow, reducing the scientist’s intellectual burden.
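As a minimal sketch of such automatically managed metadata (field names are invented for the example), the following Python fragment records measured attributes, units and precision together with a lineage list, and shows a processing step appending to that lineage rather than replacing it.

```python
from datetime import datetime, timezone

def make_metadata(name, units, precision, source):
    return {"attribute": name, "units": units, "precision": precision,
            "lineage": [{"step": "acquired", "source": source,
                         "at": datetime.now(timezone.utc).isoformat()}]}

def apply_step(metadata, description, tool):
    # Each processing step appends to the lineage instead of overwriting it.
    derived = dict(metadata)
    derived["lineage"] = metadata["lineage"] + [{"step": description, "tool": tool,
                                                 "at": datetime.now(timezone.utc).isoformat()}]
    return derived

raw = make_metadata("sea_surface_temperature", "degC", 0.01, "buoy-42")
calibrated = apply_step(raw, "calibration correction applied", "calibrate-v2")
print(calibrated["lineage"])
```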
The use of purpose-oriented descriptive data models is of paramount importance to achieve data
usability.
The type of descriptive information to be provided by the data producer depends very much on
the requirements imposed by the data consumer tasks. For example, if the consumer entity wants to perform a data analysis task on the imported information, then information about its quality is of paramount importance; without such information the data analysis task cannot be performed.
Consequently, if a researcher intends to export/publish the data produced, its possible uses by potential users must be carefully taken into account and the data must be endowed with appropriate descriptive information. Appropriate purpose-oriented metadata models to represent the
descriptive information must be chosen and used.
6.1.3 Data Provenance Modeling
In its most general form, provenance (also sometimes called lineage) captures where data came from and how it has been updated over time. Provenance can serve a number of important functions:
Explanation: Users may be particularly interested in or wary of specific portions of a derived data
set. Provenance supports “drilling down” to examine the sources and the evolution of data
elements of interest, enabling a deeper understanding of the data [21].
Verification: Derived data may appear suspect – due to possible bugs in data processing and
manipulation, due to stale data, or even due to maliciousness. Provenance enables auditing of how data was produced, either to verify its correctness or to identify the erroneous or out-dated source data or processing nodes responsible.
Recomputation / Repeatability: Having found out-dated or incorrect source data, or buggy
processing nodes, users may want to correct the errors and propagate the corrections forward to
all “downstream” data that is affected. Provenance helps to recompute only those data elements
that are affected by the corrections.
There has been a large body of very interesting work in lineage and provenance over the past two
decades. Nevertheless, there are still many limitations and open areas. Specifically, the primary
focus is on modelling and capturing provenance: how is provenance information represented?
How is it generated? There has been considerably less work on querying provenance: What can we
do with provenance information once we’ve captured it?
In the long term, the development of a standard, open representation and query model is necessary.
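To make the recomputation use of provenance concrete, the following Python sketch represents a small derivation graph and answers one typical provenance query: which downstream elements are affected by a corrected source. The graph and element names are invented for the example.

```python
from collections import deque

derived_from = {                 # element -> the elements it was computed from
    "clean_A": ["raw_A"],
    "clean_B": ["raw_B"],
    "merged":  ["clean_A", "clean_B"],
    "report":  ["merged"],
}

def downstream(changed, graph):
    """Return every element that directly or indirectly depends on `changed`."""
    children = {}
    for child, parents in graph.items():
        for p in parents:
            children.setdefault(p, []).append(child)
    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for c in children.get(node, []):
            if c not in affected:
                affected.add(c)
                queue.append(c)
    return affected

print(downstream("raw_A", derived_from))   # {'clean_A', 'merged', 'report'}
```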
6.1.4 Data Context Modeling
Humans are quite successful at conveying ideas to each other and reacting appropriately. This is
due to many factors: the richness of the language they share, the common understanding of how
the world works, and an implicit understanding of everyday situations. When humans talk with
humans, they are able to use implicit situational information, or context, to increase the
conversational bandwidth. Unfortunately, this ability to convey ideas does not transfer well to
humans interacting with computers. In fact, context is a poorly used source of information in our
computing environments. As a result, we have an impoverished understanding of what context is
and how it can be used.
Contextual information is any information which can be used to characterize the situation of a
digital information object. In essence, this information documents the relationship of the data to
its environment [22].
Context is the set of all contextual information that can be used to characterize the situation of a
digital information object.
Several context modelling approaches exist and are classified by the scheme of data structures
which are used to exchange contextual information in the respective system [23]:
Key-value Models,
Mark-up Scheme Models,
Object Oriented Models,
Logic Based Models,
Ontology Based Models.
Future scientific data infrastructures should be context-aware, i.e. they should use context to
provide relevant information and/or services to the user, where relevancy depends on the user’s
task.
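As a toy illustration of the simplest approach listed above, a key-value context model, the Python sketch below filters a service catalogue against the researcher's current task; service names and context keys are invented.

```python
def relevant_services(services, context):
    # Context-aware filtering: keep only services whose advertised task
    # matches the task recorded in the user's current context.
    return [s for s in services if s["task"] == context.get("task")]

catalogue = [
    {"name": "spectral-classifier", "task": "classification"},
    {"name": "grid-visualizer",     "task": "visualization"},
]
user_context = {"task": "visualization", "locale": "en", "project": "ocean-2012"}  # key-value context

print(relevant_services(catalogue, user_context))   # [{'name': 'grid-visualizer', 'task': 'visualization'}]
```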
6.1.5 Data Uncertainty Modeling
As models of the real world, scientific databases are often permeated with forms of uncertainty
including imprecision, incompleteness, vagueness, inconsistency, and ambiguity.
Uncertainty is the quantitative estimation of the error present in data; all measurements contain
some uncertainty generated through systematic error and/or random error. Acknowledging the
uncertainty of data is an important component of reporting the results of scientific investigation
[24].
Essentially, all scientific data is imprecise, and without exception science researchers have
requested a database management system that supports uncertain data elements. Of course,
current commercial products do not support uncertainty [20].
Among science users there is broad consensus that support for uncertainty is required, although requirements differ: some users request a simple model of uncertainty while others request a more sophisticated one.
There has been a significant amount of work in areas variously known as “uncertain, probabilistic,
fuzzy, approximate, incomplete and imprecise” data management.
Undoubtedly, the development of suitable database theory to deal with uncertain database information and transactions, and the successful deployment of this theory in an actual database system, remain challenges that have yet to be met.
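One possible shape of a "simple model of uncertainty" is sketched below in Python: each data element carries a best estimate and a standard error, and addition propagates the error in quadrature under an independence assumption. This is an illustration, not a proposal for a probabilistic database design.

```python
import math

class Uncertain:
    """A data element carrying a best estimate and a standard error."""
    def __init__(self, value, sigma):
        self.value, self.sigma = value, sigma

    def __add__(self, other):
        # For independent errors, the variances add.
        return Uncertain(self.value + other.value,
                         math.sqrt(self.sigma ** 2 + other.sigma ** 2))

    def __repr__(self):
        return f"{self.value:.2f} ± {self.sigma:.2f}"

print(Uncertain(10.0, 0.3) + Uncertain(4.2, 0.4))   # 14.20 ± 0.50
```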
6.1.6 Data Quality Modeling
Quality of data is a complex concept, the definition of which is not straightforward. There is no
common or agreed definition or measure for data quality, apart from such general notions as
fitness for use.
The consequences of poor data quality are often experienced in every scientific discipline, but
without making the necessary connections to its causes [25]. Awareness of the importance of
improving the quality of data is increasing in all scientific fields.
In order to fully understand the concept, researchers have traditionally identified a number of
specific quality dimensions. A dimension or characteristic captures a specific facet of quality. The
more commonly referenced dimensions include accuracy, completeness, and consistency.
An important aspect of data is how often it varies in time: data may be stable or time-variable. In order to capture aspects concerning the temporal variability of data, different data quality dimensions
need to be introduced. The principal time-related dimensions are currency, timeliness and
volatility.
Data quality dimensions are not independent of each other but correlations exist among them. If
one dimension is considered more important than the others for a specific application, then the
choice of favouring it may imply negative consequences for the others. Establishing trade-offs
among dimensions is an interesting problem.
The core set of data quality dimensions introduced here is shared by most proposals for data
quality dimensions in research literature. However, for specific categories of data and for specific
scientific disciplines, it may be appropriate to have more specific sets of dimensions. As an
example, for geographic information systems, specific standard sets of data quality dimensions are
under investigation (ISO 2005).
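For illustration, the following Python fragment computes two of the dimensions named above, completeness (fraction of non-missing values) and currency (age of the most recent update), over a tiny invented record set; field names and the reference date are assumptions of the example.

```python
from datetime import datetime

records = [
    {"value": 21.4, "updated": "2012-01-10"},
    {"value": None, "updated": "2011-06-02"},
    {"value": 19.8, "updated": "2012-02-01"},
]

# Completeness: fraction of records whose value is present.
completeness = sum(r["value"] is not None for r in records) / len(records)

# Currency: how old the most recent update is with respect to a reference date.
latest = max(datetime.fromisoformat(r["updated"]) for r in records)
currency_days = (datetime(2012, 2, 15) - latest).days

print(f"completeness = {completeness:.2f}, currency = {currency_days} days")
```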
6.2 Data Management Challenges
If research data are well organized, documented, preserved and accessible, and their accuracy and
validity are controlled at all times, the result is high-quality data, efficient research, findings based on
solid evidence and the saving of time and resources. Researchers themselves benefit greatly from
good data management. It should be planned before research starts and may not necessarily incur
much additional time or costs if it is ingrained in standard research practice. A Data Management
Plan helps researchers consider, when research is being designed and planned, how data will be
managed during the research process and shared afterwards with the wider research community
[26].
6.2.1 Data Curation
By data curation we denote the activity of managing the use of data from its point of creation, to
ensure it is fit for contemporary purpose and available for discovery and re-use. For dynamic
datasets this may mean continuous enrichment or updating to keep it fit for purpose. Higher levels
of curation will also involve maintaining links with annotation and other published materials [27].
Most scientific data comes from instruments observing a physical process of some sort. Such
sensor readings enter a “curation process” whereby raw information is “curated” in order to
produce finished information. Curation entails converting sensor information into standard data
types, correcting for calibration information, etc. There are two schools of thought concerning
curation. One school suggests loading raw data into a database management system and then performing all curation inside the system. The other school of thought suggests curating the data
externally, employing custom hardware if appropriate. In some applications the curation process is
under the control of a separate group, with little interaction with the storage group [20].
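The first school of thought can be illustrated, in a deliberately simplified way, by the Python sketch below: raw instrument readings are converted to standard data types and corrected with calibration information before being stored; the calibration values are invented.

```python
raw_readings = ["12.7", "13.1", "err", "12.9"]   # strings straight from the instrument
calibration = {"offset": -0.2, "scale": 1.01}    # invented calibration information

def curate(raw):
    curated = []
    for r in raw:
        try:
            value = float(r)                     # convert to a standard data type
        except ValueError:
            continue                             # drop (or flag) unparsable readings
        curated.append(round((value + calibration["offset"]) * calibration["scale"], 3))
    return curated

print(curate(raw_readings))   # [12.625, 13.029, 12.827]
```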
6.3 Data Tools Challenges
Currently, the available data tools for most scientific disciplines are not adequate. It is essential to
build better tools in order to make scientists more productive. There is a need for better
computational tools to visualize, analyze, and catalog the enormous available research datasets in order to enable data-driven research.
Scientists will need better analysis algorithms that can handle extremely large datasets with
approximate algorithms (ones with near-linear execution time), they will need parallel algorithms
that can apply many processors and many disks to the problem to meet CPU-density and
bandwidth-density demands, and they will need the ability to “steer” long-running computations
in order to prioritize the production of data that is more likely to be of interest [28].
Scientists will need better data mining algorithms to automatically extract valid, authentic and
actionable patterns, trends and knowledge from large data sets. Data mining algorithms such as
automatic decision tree classifiers, data clusters, Bayesian predictions, association discovery,
sequence clustering, time series, neural networks, logistic regression integrated directly in
database engines will increase the scientist’s ability to discover interesting patterns in their
observations and experiments [28].
Large observational data sets, the results of massive numerical computations, and high-
dimensional theoretical work all share one need: visualization. Observational data sets such as
astronomic surveys, seismic sensor output, tectonic drift data, ephemeris data, protein shapes,
and so on, are infeasible to comprehend without exploiting the human visual system [28].
In essence, scientists need advanced tools that enable them to follow new paths, try new
techniques, build new models and test them in new ways that facilitate innovative
multidisciplinary/interdisciplinary activities and support the whole research cycle.
6.3.1 Data Visualization
Visual data analysis, facilitated by interactive interfaces, enables the detection and validation of
expected results while also enabling unexpected discoveries in science. It allows the validation of
new theoretical models, provides comparison between models and datasets, enables quantitative
and qualitative querying, improves interpretation of data and facilitates decision making. Scientists
can use visual data analysis systems to explore “what if” scenarios, define hypotheses, and
examine data using multiple perspectives and assumptions. They can identify connections among
large numbers of attributes and quantitatively assess the reliability of hypotheses. In essence,
visual data analysis is an integral part of scientific discovery and is far from a solved problem.
Many avenues for future research remain open [29].
Fundamental advances in visualization techniques must be made to extract meaning from large
and complex datasets derived from experiments and from upcoming petascale and exascale
simulation systems. Effective data analysis and visualization tools in support of predictive
simulations and scientific knowledge discovery must be based on strong algorithmic and
mathematical foundations and must allow scientists to reliably characterize salient features in
their data. New mathematical methods in areas such as topology, high-order tensor analysis and
statistics will constitute the core of feature extraction and uncertainty modelling using formal definitions of complex shapes, patterns, and space-time distributions.
New visual data analysis techniques will need to dynamically consider high-dimensional probability
distributions of quantities of interest.
New approaches to visual data analysis and knowledge discovery are needed to enable
researchers to gain insight into the emerging forms of scientific data. Such approaches must take
into account the multi-model nature of the data and provide the means for scientists to easily
transition views from global to local model data. Tools that leverage semantic information and
hide details of dataset formats will be critical to enabling visualization and analysis experts to
concentrate on the design of these approaches rather than becoming mired in the trivialities of
particular data representations [28].
6.3.2 Massive Data Mining
Data mining, the extraction of hidden predictive information from large databases, is a powerful
new technology with great potential to help researchers conduct many of their research
activities. Data mining tools predict future trends and behaviours, allowing researchers to make
proactive, knowledge-driven decisions [30].
Data mining techniques are the result of a long process of research and product development.
Data mining overcomes the limitations of retrospective data access and navigation and allows for prospective
and proactive information delivery. Data mining is supported by three technologies that are now
sufficiently mature: massive data collection, powerful multiprocessor computers and data mining
algorithms.
The most commonly used techniques in data mining are: artificial neural networks, decision trees,
genetic algorithms, nearest neighbour method and rule induction. Many of these technologies
have been in use for more than a decade in specialized analysis tools that work with relatively
small volumes of data. These capabilities are now evolving to integrate directly with large data
warehouses.
Researchers need timely and sophisticated analysis on an integrated view of data stored in huge
warehouses. However, there is a growing gap between more powerful storage and retrieval
systems and the user’s ability to effectively analyse and act on the information they contain. Both
relational and OLAP technologies have tremendous capabilities for navigating massive data
warehouses, but brute force navigation of data is not enough. A new technological leap is needed
to structure and prioritize information for specific end-user problems. The data mining tools can
make this leap.
Massive data mining: In many scientific disciplines, data now arrives faster than we are able to
mine it. To avoid wasting this data, we must switch from the traditional “one-shot” data mining
approach to systems that are able to mine continuous, high-level, open-ended data streams as
they arrive.
Data mining systems that are able to “keep up” with massive data streams are required.
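A minimal Python sketch of this "keeping up" requirement is given below: a statistic is updated incrementally, in one pass and bounded memory, as items arrive from an open-ended stream, instead of being recomputed in a one-shot pass over a stored data set.

```python
class StreamingMean:
    """Running mean maintained in one pass and constant memory."""
    def __init__(self):
        self.n, self.mean = 0, 0.0

    def update(self, x):
        # Incremental update: no need to retain previously seen items.
        self.n += 1
        self.mean += (x - self.mean) / self.n

stat = StreamingMean()
for reading in (0.9, 1.4, 1.1, 0.7, 1.6):   # stand-in for an open-ended stream
    stat.update(reading)
print(stat.n, round(stat.mean, 3))          # 5 1.14
```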
7. System Challenges
System challenges include virtual research environments, science gateways, scientific workflow
management, and policy management.
7.1 Virtual Research Environment (VRE)
By Virtual Research Environment (VRE) we mean the “technological framework”, i.e., digital
infrastructure and services that enable the creation of “virtual working environments” in which
“communities of practice” can effectively and efficiently conduct their research activities.
A Virtual Research Environment is always associated with a community of practice.
Communities of practice are groups of researchers who share a concern for a research problem.
Three features characterize a community of practice [31]: the domain, the community, and the
practice.
The domain: A community is not merely a network of connections between people. It has an
identity defined by a shared scientific domain of interest. Membership implies a commitment to
the domain, and therefore a shared competence that distinguishes the members of the
community from other people.
The community: In pursuing their interest in the domain, members are engaged in joint activities,
share information and help each other. They build relationships that enable them to learn from
each other.
The practice: A community of practice is not merely a community of interest. Members of a
community of practice are practitioners. They develop a shared repertoire of results:
methodologies, approaches, and scientific results both theoretical and implementations.
A Virtual Research Environment can be viewed as a framework within which tools, services, and
resources can be plugged [32]. VREs are part of an e-Infrastructure rather than a free-standing
product. They are the result of joining together new and existing components to support the
research activities of a community of practice. It is usually assumed that a large proportion of
existing components will be distributed and heterogeneous. There is a difference between VREs
and an e-Infrastructure; a VRE presents a holistic view of the context in which research takes place
whereas an e-Infrastructure focuses on the core, shared services over which the VRE is expected
to operate. A VRE is more than middleware!
The development of VREs is of paramount importance for data-intensive science. Within a VRE
framework access to grid-based computing resources may be one suite of services among others.
VREs have the potential to be profoundly multidisciplinary, both in their use and in their
development. It is expected that computer science will act in partnership with other disciplines to
lay the foundations, integrating methods and knowledge from the relevant subject areas.
By facilitating data-intensive multidisciplinary research, a VRE should transform how research is
undertaken within a particular community of practice. This effect may apply at various points in
the research “life cycle”.
The VRE vision is still evolving and should be informed by the various on-going e-Infrastructure
projects under the 7th FP.
Next generation scientific data infrastructures must provide the necessary architectural and
management tools in order to be able to build, support, and maintain VREs.
7.2 Science Gateways
Increasingly, scientists are using portals and desktop applications as gateways to access
computational resources (data services and tools), data, and even instruments that are integrated
within a data infrastructure [33].
A Science Gateway is a community-specific set of tools, applications, and data
collections that are integrated together and accessed within a research data infrastructure via a
portal or a suite of applications.
These gateways can support a variety of capabilities including workflows, visualization as well as
resource discovery and job execution services.
As more and more communities build science gateways, it will be useful to develop a set of
conventions (policies, technical approaches, interactions) by which a science gateway interacts
with a data infrastructure or data infrastructure resource.
7.3 Ontology Management
Rationale and Definition
Ontologies/taxonomies/thesauri constitute a key technology enabling a wide range of science
ecosystem services. The growing availability of data has shifted the focus from closed, relatively
data-poor applications, to mechanisms and applications for searching, integrating and making use
of the vast amounts of data that are now available. A global research data infrastructure faces,
thus, the problem of accessing several heterogeneous data resources by means of flexible
mechanisms that are both powerful and efficient. Ontologies are widely considered as a suitable
formal tool for sophisticated data access [34]. In fact, ontologies provide the semantic
underpinning enabling intelligent search, access, integration, sharing and use of research data.
Services to be provided by a research data infrastructure that require extensive use of ontologies also include data/information discovery, data service/tool discovery, data classification, data exchangeability and interoperability, dealing with inconsistent information, etc. In addition,
in several communities of research ontologies are considered as the ideal formal tool to provide a
shared conceptualization of the domain of interest.
“In the context of knowledge sharing, I use the term ontology to mean a specification of a
conceptualization. That is, an ontology is a description (like a formal specification of a program) of
the concepts and relationships that can exist for an agent or a community of agents. This
definition is consistent with the usage of ontology as a set-of-concept-definitions, but more
general” [35].
Ontologies were initially developed by the Artificial Intelligence community to facilitate knowledge
sharing and reuse. An ontology consists of a set of concepts, axioms, and relationships that
describes a domain of interest.
While an ontology may be in principle an abstract conceptual structure, from the practical
perspective, it makes sense to express it in some selected formal language to realize the intended
shareable meaning.
An important issue regards the effort to be undertaken in order to understand which language
would be best suited for representing ontologies in a setting where an ontology is used for
carrying out an activity, for example, accessing large volumes of data.
Ontology Management
We are now entering a phase of knowledge system development, in which ontologies are
produced in large numbers and exhibit greater complexity. Therefore, ontology management is a capability that a global research data infrastructure needs in order to effectively support the semantic science ecosystem services.
An ontology management capability has three main components: the ontology model, the ontology metadata, and the reasoning engine [36].
The core component is the ontology model. The ontology model, typically an in-memory
representation, maintains a set of references to the ontologies, their respective contents and
corresponding instance data sources. From a logical perspective, it can be seen as a set of
(uninterpreted) axioms. The ontology model can be typically modified by means of an API or via
well-defined interfaces.
The ontology metadata store maintains metadata information about the ontologies themselves,
for example author, version, and compatibility information.
The reasoning engine operates on top of the ontology model, giving the set of axioms an
interpretation and thus deducing new facts from the given primitives. Obviously, the behavior of
the reasoning engine will depend on the supported semantics, e.g., Description Logics in the case
of the OWL reasoning engines. The interaction with the reasoning engine from outside takes place
either by means of a defined API or by means of a query interface. Again, the query interface
needs to support a standardized query language, e.g., SPARQL for OWL/RDF data. The basic
reasoning capabilities based on the ontology model can be further extended by means of external
builtins.
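The interplay of these components can be sketched with the open-source rdflib Python library: a small in-memory ontology model (a set of axioms plus instance data) is queried through the standardized SPARQL interface. Full Description Logics reasoning is approximated here by a SPARQL property path, and the ontology names are invented.

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/onto#")
g = Graph()                                           # in-memory ontology model
g.add((EX.OceanSensor, RDFS.subClassOf, EX.Sensor))   # axiom: OceanSensor is a kind of Sensor
g.add((EX.buoy42, RDF.type, EX.OceanSensor))          # instance data

# Standardized query interface: find all individuals that are (directly or
# transitively) Sensors, using a property path instead of a full reasoner.
results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?x WHERE {
        ?x a ?c .
        ?c rdfs:subClassOf* <http://example.org/onto#Sensor> .
    }""")
for row in results:
    print(row.x)   # http://example.org/onto#buoy42
```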
The following key requirements for ontology management have been identified [37]:
1. Scalability, Availability, Reliability and Performance – These are considered essential for any ontology management solution, both during the development and maintenance phase and the ontology deployment phase.
2. Ease of Use – The ontology development and maintenance process had to be simple, and the tools usable by ontologists as well as domain experts and analysts.
3. Extensible and Flexible Knowledge Representation – The ontology model needs to incorporate the best knowledge representation practices available and be flexible and extensible enough to easily incorporate new representational features and incorporate and interoperate with different ontology models.
4. Distributed Multi-User Collaboration – Collaboration is considered to be a key to knowledge sharing and building. Ontologists, domain experts and analysts need a tool that allows them to work collaboratively to create and maintain ontologies even if they work in different geographic locations.
5. Security Management – The system needs to be secure to protect the integrity of the data, prevent unauthorized access, and support multiple access levels.
6. Difference and Merging – Merging facilitates knowledge reuse and sharing by enabling existing knowledge to be easily incorporated into an ontology. The ability to merge ontologies is also needed during the ontology development process to integrate versions created by different individuals into a single, consistent ontology.
7. Internationalization – A global research data infrastructure enables applications using ontological data that have to serve researchers around the world. The ontology management capability needs to allow users to create ontologies in different languages and support the display or retrieval of ontologies using different locales based on the user’s geographic location.
8. Versioning – Since ontologies continue to change and evolve, a versioning system for ontologies is critical. As an ontology changes over time, applications need to know what version of the ontology they are accessing and how it has changed from one version to another so that they can perform accordingly.
The requirements of scalability, reliability, availability, security, internationalization and versioning
are considered to be the most important for an industrial strength ontology management
capability.
Ontology Integration/Alignment and Mappings
In large distributed information systems, individual data sources, service functionalities, etc. are
often described based on different, heterogeneous ontologies. To enable interoperability across
these sources, it is necessary to specify how the resource residing at a particular node corresponds
to resource residing at another node. This is formally done using the notion of mapping. There are
mainly two lines of work connected to the problem of ontology mapping [36]:
(i) identifying correspondences between heterogeneous ontologies;
(ii) representing these correspondences in an appropriate mapping formalism and using the mappings for a given integration task.
The first task of identifying correspondences between heterogeneous ontologies is also referred to
as mapping discovery or ontology matching. Today, there exist a number of tools that support the
process of ontology matching in automated or semi-automated ways.
A generic process for ontology matching is described here below. It is composed of six main steps
[36]:
1. Feature Engineering: The role of feature engineering is to select relevant features of the ontology to describe specific ontology entities, based on which the similarity with other entities will later be assessed.
2. Search Step Selection: The derivation of ontology matches takes place in a search space of candidate matches. This step may choose to compute the similarity of certain candidate element pairs and to ignore others in order to prune the search space.
3. Similarity Computation: For a given description of two entities from the candidate matches this step computes the similarity of the entities using the selected features and corresponding similarity measures.
4. Similarity Aggregation: In general, there may be several similarity values for a candidate pair of entities, e.g., one for the similarity of their labels and one for the similarity of their relationship to other entities. These different similarity values for one candidate pair have to be aggregated into a single aggregated similarity value. Often a weighted average is used for the aggregation of similarity values.
5. Interpretation: The interpretation finally uses individual or aggregated similarity values to derive matches between entities. The similarities need to be interpreted. Common mechanisms are to use thresholds or to combine structural and similarity criteria. In the end, a set of matches for the selected entity pairs is returned.
6. Iteration: The similarity of one entity pair influences the similarity of the neighboring entity pairs, for example, if the instances are equal this affects the similarity of the concepts and vice versa. Therefore, the similarity is propagated through the ontologies by following the links in the ontology. Several algorithms perform an iteration over the whole process in order to bootstrap the amount of structural knowledge. In each iteration, the similarities of a candidate alignment are recalculated based on the similarities of the neighboring entity pairs. Iteration terminates when no new alignments are proposed.
Output: The output is a representation of matches and possibly with additional confidence values
based on the similarity of the entities.
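A condensed Python illustration of steps 3–5 of this process is given below: label and structural similarities are computed for candidate entity pairs, aggregated with a weighted average, and interpreted with a threshold to propose matches. The similarity measures, weights and threshold are simplifications chosen for the example.

```python
from difflib import SequenceMatcher

def label_similarity(a, b):
    # Similarity computation on the label feature.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def structural_similarity(parents_a, parents_b):
    # Similarity computation on a structural feature: overlap of parent concepts.
    union = parents_a | parents_b
    return len(parents_a & parents_b) / len(union) if union else 0.0

candidates = [   # (entity, parent concepts) pairs drawn from two ontologies
    (("OceanTemperature", {"Measurement"}), ("SeaTemperature", {"Measurement"})),
    (("Cruise", {"Campaign"}), ("Vessel", {"Platform"})),
]

WEIGHTS, THRESHOLD = (0.7, 0.3), 0.6   # aggregation and interpretation parameters
for (name1, parents1), (name2, parents2) in candidates:
    aggregated = (WEIGHTS[0] * label_similarity(name1, name2) +
                  WEIGHTS[1] * structural_similarity(parents1, parents2))
    if aggregated >= THRESHOLD:        # interpretation: keep pairs above the threshold
        print(f"match: {name1} <-> {name2} (similarity {aggregated:.2f})")
```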
Mapping Representation
In contrast to the area of ontology languages where the Web Ontology Language OWL has
become a de facto standard for representing and using ontologies, there is no agreement yet on
the nature and the right formalism for defining mappings between ontologies.
Recent research works study formalisms for specifying the correspondences between elements
(concepts, relations, individuals) in different ontologies, ranging from simple correspondences
between atomic elements, to complex languages allowing for expressing complex mappings.
Here, some general aspects of formalisms for mapping representation are briefly discussed [36].
What do mappings define? We can see mappings as axioms that define a semantic relation
between elements in different ontologies. Most common are the following kinds of semantic
relations:
Equivalence: Equivalence states that the connected elements represent the same aspect of the
real world according to some equivalence criteria.
Containment: Containment states that the element in one ontology represents a more specific
aspect of the world than the element in the other ontology.
Overlap: Overlap states that the connected elements represent different aspects of the world, but
have an overlap in some respect. In particular, it states that some objects described by the
element in one ontology may also be described by the connected element in the other ontology.
In some approaches, these relations are supplemented by their negative counterparts. The
corresponding relations can be used to describe that two elements are not equivalent, not
contained in each other or not overlapping respectively.
What do mappings connect? Most mapping languages allow defining correspondences between ontology elements of the same type, e.g., classes to classes, properties to
properties, etc. These mappings are often referred to as simple, or one-to-one mappings. While
this already covers many of the existing mapping approaches, there are a number of proposals for
mapping languages that rely on the idea of view-based mappings and use semantic relations
between queries to connect models, which leads to a considerably increased expressiveness.
Possible applications of such mappings are manifold, including for example data transformation and
data exchange. However, most commonly mappings are applied for the task of ontology-based
information integration, which addresses the problem of integrating a set of local information
sources, each described using a local ontology.
Ontology Alignment
The Ontology Mapping Language of the Ontology Management Working Group is an ontology
alignment language that is independent of the language in which the two ontologies to be aligned
are expressed.
The alignment between two ontologies is represented through a set of mapping rules that specify
a correspondence between various entities, such as concepts, relations, and instances. Several
concept and relation constructors are offered to construct complex expressions to be used in
mappings.
Networked Ontologies [38]
In the context of a science ecosystem, ontologies are not standalone artifacts. They relate to each
other in ways that might affect their meaning, and are inherently distributed in a network of
interlinked semantic resources, taking into account in particular their dynamics, modularity and
contextual dependencies. It is important to investigate the entire development and evolution
lifecycle of networked ontologies that enable complex, semantic applications. Methodologies and
tools must be developed which support the:
management of the dynamics and evolution of ontologies in an open, networked environment;
collaborative development of networked ontologies;
facilitation of contextual awareness and use of context for developing, sharing, adapting and maintaining networked ontologies;
improvement of human-ontology interaction, i.e. making it easier for users with different levels of expertise and experience to browse and make sense of ontologies.
Ontology and Science Ecosystem
For each science ecosystem, we envision the need to build a domain-independent top-level ontology, supplemented by several domain ontologies, one for each community of research belonging to the same science ecosystem.
The top-level ontology would confine itself to the specification of such high level general (domain-
independent) categories as: time, space, inherence, instantiation, identity, measure, quantity,
functional dependence, process, event, attribute, boundary, and so on. The top-level ontology
would be designed to serve as a common neutral backbone [39].
It would be supplemented by the work of ontologists working on more specialized domains, e.g.
the domain ontologies. Such ontologies should extend or specify the top-level ontology with
axioms and definitions for the objects in some given domain. Each community of research should
develop its own domain ontology.
A research data infrastructure should have the capability of managing the top-level and domain ontologies of the science ecosystem and keeping them aligned.
7.4 Scientific Workflow Management
Today, scientists face many of the same challenges found in enterprise computing, namely
integrating distributed and heterogeneous resources. Scientists no longer use just a single
machine, or even a single cluster of machines, or a single source of data. Research collaborations
are becoming more and more geographically dispersed and often exploit heterogeneous tools,
compare data from different sources, and use machines distributed across several institutions
throughout the world. Therefore, the task of running and coordinating a scientific application
across several administrative domains remains extremely complex.
Definition
Scientific Workflow is a key component in a research data infrastructure. It orchestrates e-Science
services so that they cooperate to efficiently implement a scientific application.
Scientific Workflow has seen massive growth in recent years as science becomes increasingly reliant on the analysis of massive data sets and the use of distributed resources. The workflow programming paradigm is seen as a means of managing the complexity in defining the analysis, executing the necessary computations on distributed resources, collecting information about the analysis results, and providing means to record and reproduce the scientific analysis [40].
Workflows provide [41]:
A systematic and automated means of conducting analyses across diverse datasets and applications;
A way of capturing this process so that results can be reproduced and the method can be reviewed, validated, repeated, and adapted;
A visual scripting interface so that computational scientists can create pipelines without low-level programming concern;
An integration and access platform for the growing pool of independent resource providers so that computational scientists need not specialize in each one.
The workflow is thus becoming a paradigm for enabling science on a large scale by managing data
preparation and analysis pipelines, as well as the preferred vehicle for computational knowledge
extraction.
Workflow Definition
A workflow is a precise description of a scientific procedure – a multi-step process to coordinate
multiple tasks, acting like a sophisticated script. Each task represents the execution of a
computational process, such as running a program, submitting a query to a database, submitting a
job to a compute cloud or grid, or invoking a service over the Web to use a remote resource. Data
output from one task is consumed by subsequent tasks according to a predefined graph topology
that “orchestrates” the flow of data [41].
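The definition can be made concrete with a minimal Python sketch: a workflow is expressed as a graph of tasks whose outputs are consumed by dependent tasks, and the tasks are executed in dependency order using the standard-library graphlib module. Task bodies and names are placeholders.

```python
from graphlib import TopologicalSorter

def acquire():      return [3, 1, 4, 1, 5]             # e.g. read from an instrument
def clean(data):    return [x for x in data if x > 1]  # e.g. drop invalid readings
def analyse(data):  return sum(data) / len(data)       # e.g. compute a summary statistic

tasks = {"acquire": acquire, "clean": clean, "analyse": analyse}
depends_on = {"clean": {"acquire"}, "analyse": {"clean"}}   # the graph topology

results = {}
for name in TopologicalSorter(depends_on).static_order():
    upstream = [results[d] for d in depends_on.get(name, ())]
    results[name] = tasks[name](*upstream)   # output of one task feeds the next

print(results["analyse"])   # 4.0
```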
Workflow Systems
Workflow systems generally have three components: an execution platform, a visual design suite,
and a development kit [41].
The platform executes the workflow on behalf of applications and handles common crosscutting
concerns, including: invocation of services and handling of heterogeneous data types; recovery from failures; optimization of memory, storage, and execution, including concurrency and parallelization; data handling (mapping, referencing, movement, streaming and staging); and security and monitoring of access policies.
The design suite provides a visual scripting application for authoring and sharing workflows and
preparing the components that are to be incorporated as executable steps.
The development kit enables developers to extend the capabilities of the system and enables
workflows to be embedded into applications, Web portals, or databases.
Workflow Usage [41]
Workflows liberate scientists from the drudgery of routine data processing so they can
concentrate on scientific discovery. They shoulder the burden of routine tasks, they represent the
computational protocols needed to undertake data-centric science, and they open up the use of
processes and data resources to a much wider group of scientists and scientific application
developers. Workflows are ideal for systematically, accurately, and repeatedly running routine
procedures: managing data capture from sensors or instruments; cleaning, normalizing, and
validating data; securely and efficiently moving and archiving data; comparing data across
repeated runs; and regularly updating data warehouses.
Workflow Types
In [40] four basic types of workflows encountered in business have been identified, and most have
direct counterparts in science and engineering. The first type, referred to as collaborative workflows, covers those that have high business value to the company and involve a
single large project and possibly many individuals. For example, the production, promotion,
documentation, and release of a major product fall into this category. The workflow is usually
specific to the particular project, but it may follow a standard pattern used by the company.
Within the scientific community, it can refer to the management of data produced and distributed
on behalf of a large scientific experiment such as those encountered in high-energy physics.
Another example may be the end-to-end tracking of the steps required by a biotech enterprise to
produce and release a new drug.
The second type of workflow is ad hoc. These activities are less formal in both structure and
required response; for example, a notification that a business practice or policy has changed that
is broadcast to the entire workforce. Any required action is up to the individual receiving the
notification.
Within science, notification-driven workflows are common. A good example is an agent process
that looks at the output of an instrument. Based on events detected by the instrument, different
actions may be required and sub-workflow instances may need to be created to deal with them.
The third type of workflow is administrative, which refers to enterprise activities such as internal
bookkeeping, database management, and maintenance scheduling, that must be done frequently
but are not tied directly to the core business of the company.
On the other hand, the fourth type of workflow, referred to as production workflow, is involved
with those business processes that define core business activities. For example, the steps involved
with loan processing are one of the central business processes of a bank. These are tasks that are
repeated frequently, and many such workflows may be concurrently processed.
Both the administrative and production forms of workflow have obvious counterparts in science
and engineering. For example, the routine tasks of managing data coming from instrument
streams or verifying that critical monitoring services are running are administrative in nature.
Production workflows are those that are run as standard data analyses and simulations by users
on a daily basis. For example, doing a severe storm prediction based on current weather
conditions within a specific domain or conducting a standard data-mining experiment on a new,
large data sample are all central to e-Science workflow practice.
Workflow-enabled e-Science [40]
Scientific workflows have emerged and been adapted from the business world as a means to
formalize and structure the data analysis and computations on the distributed resources.
There is a clear case for the role of workflow technology in e-Science; however, there are technical
issues unique to science. Business workflows are typically less dynamic and evolving in nature.
Scientific workflows tend to change more frequently and may involve very voluminous data
translations. In addition, while business workflows tend to be constructed by professional
software and business flow engineers, scientific workflows are often constructed by scientists
themselves. While they are experts in their domains, they are not necessarily experts in
information technology, the software, or the networking in which the tools and workflows
operate. Therefore, the two cases may require considerably different interfaces and end-user
robustness both during the construction stage of the workflows and during their execution.
In composing a workflow, scientists often incorporate portions of existing workflows, making
changes where necessary. Business workflow systems do not currently provide support for storing
workflows in a repository and then later searching this repository during workflow composition.
The degree of flexibility that scientists have in their work is usually much higher than in the
business domain, where business processes are usually predefined and executed in a routine
fashion. Scientific research is exploratory in nature. Scientists carry out experiments, often in a
trial-and-error manner wherein they modify the steps of the task to be performed as the
experiment proceeds. A scientist may decide to filter a data set coming from a measuring device.
Even if such filtering was not originally planned, that is a perfectly acceptable option. The ability to
run, pause, revise, and resume a workflow is not exposed in most business workflow systems.
Finally, the control flow found in business workflows may not be expressive enough for highly
concurrent workflows and data pipelines found in leading edge simulation studies. Scientific
workflows may require a new control flow operator to succinctly capture concurrent execution
and data flow.
Workflow–Enabled Data–Centric Science [41]
Workflows offer techniques to support the new paradigm of data-centric science. They can be
replayed and repeated. Results and secondary data can be computed as needed using the latest
sources, providing virtual data (or on-demand) warehouses by effectively providing distributed
query processing.
Workflows, as first-class citizens in data-centric science, can be generated and transformed
dynamically to meet the requirements at hand. In a landscape of data in considerable flux,
workflows provide robustness, accountability, and full auditing.
Workflows enable data-centric science to be a collaborative endeavor on multiple levels. They
enable scientists to collaborate over shared data and shared services, and they grant non-
developers access to sophisticated code and applications without the need to install and operate
them. Consequently, scientists can use the best applications, not just the ones with which they are
familiar. Multidisciplinary workflows promote even broader collaboration. In this sense, a
workflow system is a framework for reusing a community’s tools and datasets that wraps the original codes and overcomes diverse coding styles.
Although the impact of workflow tools on data-centric science is potentially profound – scaling
processing to match the scaling of data – many challenges exist over and above the engineering
issues inherent in large-scale distributed software.
There are a confusing number of workflow platforms with various capabilities and purposes and
little compliance with standards. Workflows are often difficult to author, using languages that are
at an inappropriate level of abstraction and expecting too much knowledge of the underlying
infrastructure.
The reusability of a workflow is often confined to the project it was conceived in – or even to its
author – and it is inherently only as strong as its components. Although workflows encourage
providers to supply clean, robust, and validated data services, component failure is common.
Unfortunately, debugging failing workflows is a crucial but neglected topic. Contemporary
workflow platforms fall short of adequately supporting rapid deployment into the user
applications that consume them, and legacy application codes need to be integrated and
managed.
7.5 Interoperability and Mediation Software
There is a wide range of views as to what “interoperability” means. Interoperability, understood as
the ability of two entities to work together, depends very much on the working context in which
these two entities are embedded (Web services, DRM, Command and Control Systems, Digital
Libraries, e-Science, etc.). Due to this inherent complexity and multifaceted nature,
interoperability has often been misunderstood and confused with simple information
exchangeability or other forms of compatibility. Furthermore, when addressing interoperability
between two entities, the fact that they often belong to two different organizations, each with its
own policies, has been ignored.
Based on the IEEE definition of interoperability - “The ability of two or more systems or
components to exchange information and to use the information that has been exchanged” - in
order to achieve interoperability between two entities (producer, consumer) three conditions
must be satisfied [42]:
the two entities must be able to exchange “meaningful” information objects
(exchangeability);
the two entities must be able to exchange “logically consistent” information objects (when
the exchanged information objects are descriptions of functionality, policy, or behaviour)
(compatibility);
the consumer entity must be able to use the exchanged information in order to perform a
set of tasks that depend on the utilization of this information (usability).
7.5.1 Exchangeability – The Heterogeneity Problem
During the information object exchange process between producer and consumer entities
different sources of heterogeneity can be encountered depending on: how information objects are
requested by the consumer entity; the use of different terminologies; how information objects will
be represented; the semantic meaning of each information object; how information objects are
actually transported over a network.
Therefore, there are three types of heterogeneity to be overcome in order to achieve a
meaningful exchange of information objects:
Firstly, heterogeneity between the native data/query language (of the consumer entity) and the
target data/query language (of the producer entity). When this heterogeneity is resolved we say
that syntactic exchangeability between the two entities has been achieved.
Secondly, heterogeneity between the models adopted by the producer and the consumer entities
for representing information objects. When this heterogeneity is resolved we say that structural
exchangeability between the two entities has been achieved.
Finally, heterogeneity between the “semantic universe of discourse” of the producer and
consumer entities (differences in granularity, differences in scope, temporal differences,
synonyms, homonyms, etc.). When this heterogeneity is resolved we say that semantic
exchangeability between the two entities has been achieved.
These three levels of exchangeability, i.e., syntactic, structural, and semantic allow a meaningful
exchange of information objects between the two entities and thus guarantee the exchangeability
between them.
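As a minimal illustration of these three levels, the Python sketch below shows how a consumer-side adapter might handle a producer's records and queries. All field names, terms, and vocabularies are hypothetical; real mediation would rely on agreed schemas and terminologies rather than hard-coded mappings.

# Illustrative only: hypothetical producer/consumer schemas and vocabularies.

# Semantic level: map producer terminology onto the consumer's vocabulary
# (e.g. resolving synonyms such as "sea_surface_temp" vs "SST").
TERM_MAP = {"sea_surface_temp": "SST", "obs_time": "timestamp"}

def to_consumer_record(producer_record):
    """Structural + semantic exchangeability: rename fields and flatten structures."""
    record = {}
    for field, value in producer_record.items():
        record[TERM_MAP.get(field, field)] = value
    # Structural difference: the producer nests coordinates, the consumer expects flat fields.
    if "position" in record:
        position = record.pop("position")
        record["lat"], record["lon"] = position["lat"], position["lon"]
    return record

def to_producer_query(consumer_filters):
    """Syntactic exchangeability: consumer filters become the producer's query string."""
    return " AND ".join(f"{key}={value}" for key, value in consumer_filters.items())

producer_record = {"sea_surface_temp": 287.4, "obs_time": "2011-06-01T12:00Z",
                   "position": {"lat": 43.7, "lon": 10.4}}
print(to_consumer_record(producer_record))
print(to_producer_query({"region": "NorthSea", "year": 2011}))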
7.5.2 Compatibility – The Logical Inconsistency Problem
Some logical inconsistencies may arise between functional descriptions of services (producer) and
requests (consumer). In fact, when the exchanged information objects specify the functionality of
a service or what is required to satisfy a request, some inconsistencies may arise between the
logical relationships of these descriptions. When these inconsistencies are resolved we say that
the two entities/functionalities are compatible (a logic compatibility between the two entities has
been established – service compatibility, policy compatibility, etc.).
7.5.3 Usability – The Usage Inconsistency Problem
Usage inconsistency means that the consumer's goal, that is, the objectives that she/he wants to
achieve by using the provider's resources, cannot be reached. Usage inconsistency may arise when
the consumer goal description and the producer resource description are inconsistent.
Possible causes for the inability of the consumer entity to use the exchanged information objects are:
Quality mismatching
Data-incomplete mismatching
Quality mismatching occurs when the quality profile associated with the exported information
object does not meet the quality expectations of the consumer entity.
Data-incomplete mismatching occurs when the exported information object lacks some useful
information needed to enable the consumer entity to fully exploit the received information object.
In fact, in order to allow a consumer entity to use the exchanged information, it must be
complemented with “descriptive” information, such as contextual, provenance/lineage, quality,
security, and privacy information, which gives it additional meaning. This descriptive information
should be modelled by purpose-oriented descriptive data models/metadata models.
The use of purpose-oriented descriptive data models is of paramount importance to achieve
usability.
7.5.4 Mediation Software
The main concept enabling interoperability of data/services/policies is mediation. This concept has
been used to cope with many dimensions of heterogeneity spanning data language syntaxes, data
models and semantics. The mediation concept is implemented by a mediator, which is a software
device capable of establishing interoperability of resources by resolving heterogeneities and
inconsistencies. It supports a mediation schema capturing user requirements, and an
intermediation function between this schema and the distributed information sources [43].
A key characteristic of the mediation process is the kind of intermediation function implemented
by a mediator. There are three main functions: mapping, matching and consistency checking.
Mapping refers to how information structures, properties, relationships are mapped from one
representation scheme to another one, equivalent from the semantic point of view.
Matching refers to the action of verifying whether two strings/patterns match, or whether
semantically heterogeneous data match.
Consistency checking refers to the action of checking whether the logical relationships between
functional/policy/organizational descriptions of two entities share a logical framework.
There are four main mediation scenarios:
Mediation of data structures: permits data to be exchanged according to syntactic, structural and
semantic matching. The functions of mapping, matching and integration are mainly adopted to
implement this kind of mediation.
Mediation of functionalities: makes it possible to overcome mismatching of functional
descriptions of two entities that are expressed in terms of pre- and post-conditions. The functions
of mapping, matching and consistency checking are mainly adopted to implement this kind of
mediation.
Mediation of policies/business logics: employs techniques to solve policy or business mismatches.
The functions of mapping, matching and consistency checking are mainly adopted to implement
this kind of mediation.
Mediation of protocols: makes it possible to overcome behavioural mismatches among protocols
run by interacting parties.
Automated mediation: heavily relies on adequate modelling of the exchanged information. In
essence, the intermediation functions (mapping, matching and consistency checking) must
translate languages, data structures, logical representations and concepts between two systems.
The effectiveness, efficiency, and computational complexity of the intermediation function very
much depend on the characteristics of the information models (expressiveness, levels of
abstraction, semantic completeness, reasoning mechanisms, etc.) and languages adopted by the
two systems. Ideally, they must provide a framework for semantics and reasoning. Therefore, the
interoperable systems must adopt formally defined and scientifically sound information models
and ontologies. They constitute the conceptual (semantic) and syntactic basis for data languages.
Several formal information models and languages have been defined and developed for
representing, organizing and exchanging information objects (for example, RDF, XML, etc.).
Several discipline-specific standard models have been proposed and developed for representing
discipline-specific descriptive information (discipline-specific metadata models) which greatly
support the mediation process.
Logic-based and ontology-based models and languages have been defined for specifying
behaviour, functionality, and policy (for example, OWL-S).
An important role in the mediation process is played by ontologies. Several domain-specific
ontologies are being developed (CIDOC, etc.). Ontologies were initially developed by the Artificial
Intelligence community to facilitate knowledge sharing and reuse. An ontology is a set of concepts,
axioms, and relationships that describes a domain of interest.
Ontologies have been extensively used to support all the mediation functions, i.e. mapping,
matching and consistency checking because they provide an explicit and machine-understandable
conceptualization of a domain.
Therefore, automated mediation relies on:
Adequate modelling of structural, formatting, and encoding constraints of the exchanged
information resources;
Adequate modelling of data descriptive information (metadata);
Formal domain-specific Ontologies;
Abstract models and languages for policy specification;
Abstract models and languages for functionality specification; and
Formally defined transfer and message exchange protocols.
The ultimate aim should be the definition and implementation of an “integrated mediation
framework” capable of providing the means to handle and resolve all kinds of heterogeneities and
inconsistencies that might hamper the effective usage of the resources of a global scientific data
infrastructure [44].
We envision that mediation software will be one of the most important features of the future
scientific data infrastructures.
8. Infrastructural Challenges
An infrastructural service is defined as a network-enabled entity that provides some capability.
Entities are network-enabled when they are accessible from computers other than the one on
which they reside. Research data infrastructures must provide some network-enabled “support
services” in order to achieve the conditions needed to facilitate effective collaboration among
spatially and institutionally separated communities of research.
A support service should be [45]:
Shareable: it must be able to be used by any set of users in any context consistent with its overall
goals.
Common: it must present a common, consistent interface to all users, accessible by standard
means. The term “common” may be synonymous with the term “standard”.
Enabling: it must provide the basis for any user or set of users to create, develop, and implement
any applications, utilities, or services consistent with its goals.
Enduring: it must be capable of lasting for an extensive period of time. It must have the capability
of changing incrementally and in an economically feasible fashion to meet slight changes in the
environment, while remaining consistent with the overall world view. In addition, it must change in
a fashion that is transparent to the users.
Scalable: it must be able to add any number of users or uses and, by its very nature, expand in a
structured manner in order to ensure consistent levels of service.
Economically sustainable: it must have economic viability.
The infrastructural services must make the holdings of the components of a digital science
ecosystem findable, aggregable and interoperable.
8.1 Data and Data Service/Tool Findability
By findability we mean ease in discovering data/information/knowledge, as well as data tools/services
for specific researcher needs, while taking into account relevant aspects of data attributes, tool/service
functionality and deployability, context, provenance, researcher profiles and goals, etc.
Currently, there is a conceptual shift from search to findability, as the Internet search paradigm is
characterized by a lack of context, with searches conducted independently of professional profiles,
context, provenance, and work goals.
In the context of a digital science ecosystem searching for data/information/knowledge, tools and
services is better served by findability.
A findability capability should be of paramount importance for the next generation of global
research data infrastructures. Such a capability must be supported by semantic data models,
semantically rich metadata models, ontology/logic based languages for specifying data
tool/service functionality, personalization and contextualization support, ontology/taxonomy
management, mapping/matching techniques, etc.
8.1.1 Data Registration
Registration of scientific primary data, to make these data citable as a unique piece of work and
not only a part of a publication, has always been an important issue.
It has long been recognized that unique identifiers are essential for the management of
information in any digital environment.
Once accepted for deposit and archived, data is assigned a “Digital Object Identifier” (DOI) by a
Registration Agency. A Digital Object Identifier (DOI) [46] is a unique name (not a location) within a
networked data environment and provides a system for persistent and actionable identification of
data. DOIs provide persistent identification together with current information about the object. In
this way, scientific data is not understood exclusively as part of a scientific publication, but has its
own identity.
The data itself is accessible through resolving the DOI in any web browser.
If a scientist reads a publication where the registered data is used, she/he might be interested in
analyzing the data under different aspects. After gaining permission to do so from the research
institution maintaining the data, she/he can cite the data in her/his own publications using its DOI,
referring to the uniqueness and separate identity of the original data.
Identifiers assigned in one context may be encountered, and may be re-used, in another place (or
time) without consulting the assigner, who cannot guarantee that his assumptions will be known
to someone else. To enable such interoperability requires the design of identifiers to enable their
use in services outside the direct control of the issuing assigner. The necessity of allowing
interoperability adds the requirement of persistence to an identifier: it implies interoperability
with the future. Further, since the services outside the direct control of the issuing assigner are by
definition arbitrary, interoperability implies the requirement of extensibility: users will need to
discover and cite identifiers issued by different bodies, supported by different metadata
declarations, combine these on the basis of a consistent data model, or assign identifiers to new
entities in a compatible manner. Hence DOI is designed as a generic framework applicable to any
digital object, providing a structured, extensible means of identification, description and
resolution. The entity assigned a DOI can be a representation of any logical entity.
DOI may be used to offer an interoperable common system for identification of science data.
A DOI system should be composed of the following components:
a specified numbering syntax,
a resolution service,
a data model, and
an implementation mechanism through policies and procedures for the governance and application of DOIs.
Specified Numbering Syntax
The DOI syntax is a standard for constructing an opaque string with naming authority and
delegation. It provides an identifier “container” which can accommodate any existing identifier.
The word identifier can mean several things: (i) labels: the output of numbering schemes : e.g.
“ISBN 3-540-40465-1”; (ii) specifications for using labels, e.g., on the internet: URL, URN, URI; or (iii)
implemented systems: labels, following a specification, in a system – e.g. the DOI system, which is
a packaged system offering labels, tools and implementation mechanisms [46].
DOI Resolution
Resolution is the process in which an identifier is the input (a request) to a network service to
receive in return a specific output of one or more pieces of current information (state data)
related to the identified entity: e.g. a location (such as URL) where the object can be found.
Resolution provides a level of managed indirection between an identifier and the output.
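As a small illustration of resolution, the Python sketch below asks the public doi.org proxy for the current location associated with an identifier without following the redirect; the DOI string used here is entirely hypothetical, and real applications would normally let an HTTP client follow the redirect to the landing page.

# Illustrative only: "10.1234/example-dataset" is a hypothetical DOI.
import http.client

def resolve_doi(doi):
    """Return the location (state data) the resolver currently associates with the DOI."""
    connection = http.client.HTTPSConnection("doi.org")
    connection.request("HEAD", "/" + doi)   # query the resolver; http.client does not follow redirects
    response = connection.getresponse()
    location = response.getheader("Location", "")
    connection.close()
    return location

print(resolve_doi("10.1234/example-dataset"))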
DOI Data Model
The DOI data model consists of a data dictionary and a framework for applying it [46]. Together
these provide tools for defining what a DOI specifies (through use of a data dictionary), and how
DOIs relate to each other (through a grouping mechanism, Application Profiles, which associate
DOIs with defined common properties). This provides semantic interoperability, enabling
information that originates in one context to be used in another in ways that are as highly
automated as possible.
The data dictionary is built from an underlying ontology. It is designed to ensure maximum
interoperability with existing metadata element sets; the framework allows the terms to be
grouped in meaningful ways (DOI Application Profiles) so that certain types of DOIs all behave
predictably in an application through association with specified Services.
A data dictionary is a set of terms, with their definitions, used in a computerized system. Some
data dictionaries are structured, with terms related through hierarchies and other relationships:
structured data dictionaries are derived from ontologies. An interoperable data dictionary
contains terms from different computerized systems or metadata schemes, and shows the
relationships they have with one another in a formal way. The purpose of an interoperable data
dictionary is to support the use together of terms from different systems.
Metadata defined through a dictionary is not essential to all applications of DOIs: their use as
persistent links for example can be supported without metadata, once the link is defined.
However, many other systems have their own metadata declarations, often using quite different
terminology and identifiers.
DOI Implementation
DOI is implemented through a federation of Registration Agencies which use policies and tools
developed through a parent body, the Governance Body of the DOI system. It safeguards (owns or
licences on behalf of registrants) all intellectual property rights relating to the DOI System. It works
with RAs and with the underlying technical standards of the DOI components to ensure that any
improvements made to the DOI system (including creation, maintenance, registration, resolution
and policymaking of DOIs) are available to any DOI registrant, and that no third party licenses
might reasonably be required to practice the DOI standard.
DOIs and Scientific Data
The identification of scientific data is logically a separate issue from the identification of the
primary publication of such data to the scientific community in the form of articles, tables, etc. DOI
is already the core technology for maintaining cross-references via persistent links between a
citation and internet access to the article.
The DOI as a long-term linking option from data to source publication is of fundamental
importance.
Some projects or communities have developed their own identifier schemes, which may be useful
for their own area. Such identifiers can be incorporated into a DOI to make them globally
interoperable and extensible and take advantage of other features provided in a DOI system.
DOIs for Scientific Data Sets
DOIs could logically be assigned to every single data point in a set; however in practice, the
allocation of a DOI is more likely to be to a meaningful set of data following the indecs Principle of
Functional Granularity [47]: identifiers should be assigned at the level of granularity appropriate
for a functional use which is envisaged.
This use of DOI will provide for the effective publication of primary data using a persistent
identifier for long-term data referencing, allowing scientists to cite and re-use valuable primary
data. The DOI’s persistent and globally resolvable identifier, associated with both a stable link to the
data and a standardized description of the identified data, offers the necessary functionality and
ready interoperability with other material such as scientific articles.
A key problem regards the reliable re-use of existing data sets, in terms both of attribution of
data source, and the archiving of data in context so as to be discoverable and interoperable
(usable by others). The extensibility of the mechanism is provided by allocating DOIs for data sets,
with associated metadata using a core management metadata (applicable to all datasets) and
structured metadata extensions (mapped to a common ontology) applicable to specific science
disciplines.
DOIs for Taxonomic Data
DOIs can also act as persistent identifiers of taxonomic definitions.
A name ascribed to a given group in a biological taxonomy is fixed in both time and scope and may
or may not be revised when new information becomes available. Changes in names, genera, families,
classes, and relationships occur over time. When taxonomic revisions do occur,
resulting in the division or joining of previously described taxa, authors frequently fail to address
synonymies or formally emend the descriptions of higher taxa that are affected. DOI has been
proposed as a tool to manage a data model of nomenclature and taxonomy (enabling
disambiguation of synonyms and competing taxonomies) using a metadata resolution service
(enabling dissemination of archived and updated information objects through persistent links to
articles, strain records, gene annotations and any other data) [48].
A Global Research Data Infrastructure must create and operate a Data Registration Environment
enabling an efficient identification of research data.
8.1.2 Data Citation
Data citation refers to the practice of providing a reference to data in the same way as researchers
routinely provide a bibliographic reference to printed resources. The need to cite data is starting
to be recognized as one of the key practices underpinning the recognition of data as a primary
research output rather than as a by-product of research. While data has often been shared in the
past, it is seldom cited in the same way as a journal article or other publication might be. This
culture is, however, gradually changing. If datasets were routinely cited, this would enable scholarly
recognition and credit attribution [49].
Unfortunately, no universal standards exist for citing quantitative data. Practices vary from field to
field, archive to archive, and often from article to article. The data cited may no longer exist, may
not be available publicly, or may have never been held by anyone but the investigator. Data listed
as available from the author are unlikely to be available for long and will not be available after the
author retires or dies. Sometimes URLs are given, but they often do not persist.
A standard for citing quantitative data sets should go beyond the technologies available for printed
matter and respond to issues of confidentiality, verification, authentication, access, technology
changes, existing subfield-specific practices, and possible future extensions, among others.
A quantitative data set represents a systematic compilation of measurements intended to be
machine readable. The measurements may be the intentional result of scientific research for any
purpose, so long as it is systematically organized and described [50].
A data set must be accompanied by “metadata”, which describes the information contained in the
data set, details of data formatting and coding, how the data were collected and obtained,
associated publications, and other research information. Metadata formats range from a text
“readme” file, to elaborate written documentation, to systematic computer-readable definitions
based on common standards.
A Minimal Citation Standard [50]
Citations to numerical data should include, at a minimum, five required components. The first
three components are traditional, directly paralleling print documents. They include:
The author of the data set
The date the data set was published or otherwise made
public
The data set title
The author, date, and title are useful for quickly understanding the nature of the data being cited,
and when searching for the data. However, these attributes alone do not unambiguously identify a
particular data set, nor can they be used for reliable location, retrieval, or verification of the study.
Thus, at least two additional components using modern technology, each of which is designed to
persist even when the technology inevitably changes, are needed. They are designed to take
advantage of the digital form of quantitative data.
The fourth component is a Unique Global Identifier, which is a short name or character string
guaranteed to be unique among such names, that permanently identifies the data set
independent of its location. The chosen naming scheme must:
unambiguously identify the data set object;
be globally unique;
be associated with a naming resolution service that takes the
name as input and shows how to find one or more copies of
the identical data set
Some examples of unique global identifiers include the Life-Science Identifier (LSID), the Digital
Object Identifier (DOI), and the Uniform Resource Name (URN). All are used to name data sets in
some places, and under specific sets of rules and practices.
All unique global identifiers are designed to persist (and remain unique) even if the particular
naming authority that created it goes out of business or changes names or location. Including such
an identifier provides enough information to identify unambiguously and locate a data set, and to
provide many value-added services, such as on-line analyses, or forward citation to printed works
that cite the data set, for any automated systems that are aware of the naming scheme chosen.
Uniqueness is also guaranteed across naming schemes, since they each begin with a different
identifying string.
It is recommended that the unique identifier resolve to a page containing the descriptive and
structural metadata describing the data set, presented in human readable form to web browsers,
instead of the data set itself. This metadata description page should include a link to the actual
data set, as well as a textual description of the data set, the full citation in standard format,
complete documentation, and any other pertinent information. The advantage of this general
approach is that identifiers in citations can always be resolved, even if the data are proprietary,
require licensing agreements to be signed prior to access, are confidential, demand security
clearance, are under temporary embargo until the authors execute their right of first publication,
or for other reasons. Metadata description pages like these also make it easier for search engines
to find data.
The fifth component is a Universal Numeric Fingerprint (UNF). The need for introducing a fifth
component is justified by the fact that unique global identifiers do not guarantee that the data
does not change in any meaningful way when the data storage formats change. The UNF is a short,
fixed-length string of numbers and characters that summarizes all the content in the data set, such
that a change in any part of the data would produce a completely different UNF. A UNF works by
first translating the data into a canonical form with fixed degrees of numerical precision and then
applying a cryptographic hash function to produce the short string. The advantage of
canonicalization is that UNFs are format-independent: they keep the same value even if the data
set is moved between software programs, file storage systems, compression schemes, operating
systems, or hardware platforms. Finding an altered version of a data set that produces the same
UNF as the original data is theoretically possible given enough time and computing power, but the
time necessary is so vast and the task so difficult that for good hash functions no examples have
ever been found.
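The Python sketch below illustrates the canonicalize-then-hash idea. It is not the official UNF algorithm, merely a simplified analogue under the assumption of a column-oriented data set: numeric values are rounded to a fixed precision, the data is serialized deterministically, and the result is hashed into a short printable string.

# Illustrative analogue of a UNF, not the standardized UNF algorithm.
import base64
import hashlib

def fingerprint(dataset, digits=7):
    """Canonicalize a column-oriented data set and return a short, format-independent digest."""
    canonical_columns = []
    for column in sorted(dataset):                        # fixed column order
        values = dataset[column]
        canonical = [format(float(v), f".{digits}e") if isinstance(v, (int, float))
                     else str(v).strip() for v in values]
        canonical_columns.append(column + ":" + ",".join(canonical))
    blob = "\n".join(canonical_columns).encode("utf-8")   # deterministic serialization
    digest = hashlib.sha256(blob).digest()                # one-way cryptographic hash
    return base64.b64encode(digest)[:22].decode("ascii")  # short printable string

data = {"temperature": [287.40, 288.1], "station": ["A", "B"]}
print(fingerprint(data))   # same value regardless of file format or platform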
The metadata page to which the global unique identifier resolves should include a UNF calculated
from the data, even if the data are highly confidential, available only to those with proper security
clearance, or proprietary. The one-way cryptographic properties of the UNF mean that it is
impossible to learn about the data from its UNF, and so UNFs can always be freely distributed.
Most importantly, this means that editors, copyeditors, or others at journals and book publishers
can verify whether the actual data exists and is cited properly even if they are not permitted to see
a copy. Moreover, even if they can see a copy, having the UNF as a short summary that verifies the
existence, and validates the identity, of an entire data set is far more convenient than having to
study the entire original data set.
The essential information provided by a citation is that which enables the connection between it
and the cited data set. Yet, authors, editors, publishers, data producers, archives, or others may
still wish to add optional features to the citation, such as to give credit more visibly to specific
organizations, or to provide advertising for aspects of the data set.
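Purely as an illustration (the author names, title, identifier, and UNF value below are all hypothetical), a citation following the five-component scheme might look like:
Rossi, A.; Keller, J., 2011, “North Sea Surface Temperature Observations, 1995-2010”, doi:10.1234/example-dataset, UNF:5:Xq2PmWc1TFabcdEFGH+1Aw==, Version 2.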
Institutional Commitment
The persistence of the connection between data citation and the actual data ultimately must also
depend on some form of institutional commitment.
This means that, at least early on, readers, publishers, and archives will have to judge the degree
of institutional commitment implied by a citation, just as with print citations. Obviously, if the
citation is backed by a major archive, i.e., a major university/ research center, there is less to
worry about than there might otherwise be. Journal publishers may wish to require that data be
deposited in places backed by greater institutional commitment, such as established archives.
A science ecosystem should be strongly committed to the development of a distributed archive
that keeps and organizes all data used by its communities of research, thus creating a trustworthy
structure.
Deep Citation
“Deep Citation” refers to references to subsets of data sets, which are analogous to page references
in printed matter. Data may be subsetted by row, by column, or both. Subsets also often include
additional processing, such as imputation of missing data.
Devising a simple standard for describing the chain of evidence from the data set to the subset
would be highly valuable. The task of creating subsets is relatively easy and is done in a large
variety of ways by researchers. It is suggested that a citation be made to the entire data set as
described above, and that scholars provide an explanation for how each subset is created in the
text, and refer to a subset by reference to the full data set citation with the addition of a UNF for
the subset (i.e., just as occurs now for page numbers in citations to printed matter).
Huge data sets sometimes come with more specific methods of referencing data subsets, which may
easily be added as optional elements. Any ambiguity in what constitutes a definable “data set”,
which may be an issue in very large collections of quantitative information, is determined by the
author who creates the global unique identifier and UNF. If the subset includes substantial value-
added information, such as imputation of missing data or corrections for data errors, then it will
often be more convenient to store and cite the subset as a new data set, with documentation that
explains how it was created.
Versioning
It is recommended that subsequent versions of the same data set be treated as separate data
sets, with links back to the first from the metadata description page. Forward links to new versions
from the original are easily accomplished via a search on the unique global identifier. New versions
of very large data sets (relative to available storage capacity) can be kept by creating a data set
that contains only differences from the original, and describing how to combine the differences
with the original on the data set’s metadata description page. Version changes may also be noted
in the title, date, or using the extended citation elements.
Concluding Remarks
Together, the global unique identifier and UNF ensure permanence, verifiability, and accessibility
even in the situations where the data are confidential, restricted, or proprietary; the sponsoring
organization changes names, moves, or goes out of business; or new citation standards evolve.
Together with the author, title, and date, which are easier for humans and search engines to
understand, all elements of the proposed citation scheme for quantitative data should achieve
what print citations do and, in addition to being somewhat less redundant, take advantage of the
special features of digital data to make it considerably more functional. The proposed citation
scheme enables forward referencing from the data set to subsequent citations or versions
(through the persistent identifier) and even a direct search for all citations to any data set (by
searching for the UNF and appropriate version number).
A Global Research Data Infrastructure must efficiently and effectively support a data citation
scheme.
8.1.3 Data Discovery
One big challenge faced by researchers when conducting a research activity in a networked
multidisciplinary environment is pinpointing the location of relevant data.
The ability to determine where data sets are located, what is in those data sets, and who can
access them is a critical and necessary step towards accessing all the data relevant to a researcher's
activities that is stored in the many data collections distributed across a science ecosystem.
By Data Discovery we mean the capability to quickly and accurately identify and find data that
supports research requirements.
The process of discovering data that exist within a data set is supported by search and query
capabilities which exploit metadata descriptions contained in data categorization/classification
schemes, data dictionaries, data inventories, and metadata registries.
8.1.3.1 Data Classification
Data classification is the categorization of data for its most effective and efficient use. In a basic
approach to storing computer data, data can be classified according to its critical value or how
often it needs to be accessed, with the most critical or often-used data stored on the fastest media
while other data can be stored on slower (and less expensive) media. This kind of classification
tends to optimize the use of data storage for multiple purposes - technical, administrative, legal,
and economic. Data can be classified according to any criteria. A well-planned data classification
system makes essential data easy to find. This can be of particular importance in data discovery.
In the field of data management, data classification, as part of the Information Lifecycle Management (ILM) process, can be defined as a tool for categorizing data in order to enable/help researchers to effectively answer the following questions:
What data types are available?
Where are certain data located?
What access levels are implemented?
What protection level is implemented and does it adhere to compliance regulations?
Data Classification Tools: Data classification is typically a manual process; however, there are
many tools from different vendors that can help gather information about the data. They help
“categorize” data, primarily for the purpose of tiered storage, and are focused on finding
unstructured data on a variety of file shares. This data can be categorized by content, file type,
usage and many other variables.
8.1.3.2 Data Dictionary
Data dictionaries contain the information about the data contained in large data collections. Each
data element is defined by its data type, the location where it can be found, and the location that
it came from. Often the data dictionary includes the logic when a field is derived. The logic can be
business logic or research logic but it must be defined.
The data dictionary also includes the physical location, such as a server DNS (domain name
system) name or the IP address. The data collection name, the instance, the table, and the field
name are particularly important for the researcher seeking relevant data. This information is
even more important if the researcher must cross multiple systems to gather the necessary pieces
of information for her/his research.
A data collection administrator should be responsible for keeping this important information up
to date and accurate. Typically, each data collection has its own data dictionary. It is good
practice to have a single owner for each data dictionary.
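As a small sketch of the elements mentioned above, a data dictionary entry could be recorded as a simple structured record; every name, address, and derivation rule in the Python snippet below is hypothetical.

# Illustrative only: all names, addresses and rules below are hypothetical.
data_dictionary_entry = {
    "element":    "sea_surface_temperature",
    "data_type":  "float (kelvin)",
    "collection": "ocean_observations",          # data collection name
    "instance":   "prod01",
    "table":      "daily_measurements",
    "field":      "sst_k",
    "server":     "obsdb.example.org",           # DNS name (or IP address)
    "source":     "buoy network raw feed",       # where the value came from
    "derivation": "mean of hourly readings, gaps filled by linear interpolation",
    "owner":      "ocean data curation team",
}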
8.1.3.3 Metadata Registry
Metadata registries are used whenever data must be used consistently within a research community or in a multidisciplinary context. Examples of these situations include:
Communities that transmit data using structures such as XML, Web Services or EDI
Communities that need consistent definitions of data across time, between databases, between communities or between processes, for example when a community builds a large data collection
Communities that are attempting to break down "silos" of information captured within applications or proprietary file formats
Central to the charter of any metadata management program is the process of creating trusted relationships with stakeholders and ensuring that definitions and structures have been reviewed and approved by the appropriate parties. A metadata registry typically has the following characteristics:
Protected environment where only authorized individuals may make changes
Stores data elements that include both semantics and representations
Semantic areas of a metadata registry contain the meaning of a data element with precise definitions
Representational areas of a metadata registry define how the data is represented in a specific format, such as in a database or a structured file format (e.g., XML)
8.1.3.4 Data Inventory
Research communities will need to develop their own “data inventory” focused on identifying and
describing all the data elements contained across their different data collections.
The goal of a data inventory is to inventory the data researchers actually need. Inventorying the
data that moves between systems, data collections and scientific communities accomplishes two
things: it identifies the most valuable data elements in use, and it will also help identify data that’s
not high-value, as it is not being shared or used. This approach also provides a way to tackle initial
data quality efforts by identifying the most “active” data used by a research community. It
ultimately helps the data management team understand where to focus its efforts, and prioritize
accordingly.
Legacy data inventory and profiling is a structured and comprehensive way of learning about the
corporate data asset. This activity is centered on a professional data analyst who gathers
information, runs a variety of reports and ad hoc queries to assess the existing data and creates or
updates documentation about the existence, scope, meaning and quality of the data asset.
The resulting documentation (including high-level summaries and easily navigated detail data
behavior documentation) should provide answers to the following key questions:
What data do we have?
Where can it be found?
What constraints limit read-only access?
Where did each unit of data come from?
What is its age distribution?
What is its scope?
What is its quality?
Does it include test data?
What is the business meaning of the data?
Are there ambiguities in usage, meaning and expectations?
In addition to the technical issues of where the data is located, the lead data analyst should also
assess and document the political issues surrounding the data—an issue such as who “owns” any
portion of the data or who seeks to limit read-only access to the data.
A Research Data Infrastructure must efficiently support a Data Discovery Environment composed
of query and search capabilities as well as data discovery tools including data
categorization/classification schemes and/or data dictionaries and/or data inventories and/or
metadata registries.
8.1.4 Data Tool/Service Discovery
Acceptance of the open science principle entails open access not only to research data but also to
data services/tools/analyses/methods which enable researchers to conduct their research activities
efficiently and effectively.
Enabling automated location of data services/tools that adequately fulfill a given research need is
an important challenge for a global research data infrastructure.
Description of Data Services/Tools
Publishing a data service/tool requires a description of the data service/tool capability, i.e., what
functionality the data service/tool provides.
We have identified three different levels of service description:
a first level which describes the static characteristics of the service, also called abstract capabilities; the abstract capabilities of a service describe only what a published service can provide, but not under which circumstances a concrete service can actually be provided [51].
a second level which describes the dynamic characteristics of the service, also called contracting capabilities; the contracting capabilities describe what input information is required for providing a concrete service and what conditions it must fulfill (i.e., service pre-conditions), and what conditions the objects delivered fulfill depending on the input given (i.e., post-conditions) [51]. The abstract capability might be automatically derived from the contracting capability, and both must be consistent with each other.
a third level which describes the characteristics of the operational environment where the service will be hosted, i.e., the operational conditions, capacity requirements, the service's resource dependencies, and integrity and access constraints, also called deployment capabilities; the deployment capabilities describe the hosting operational environment.
Description of Researcher Needs [52]
Researchers may describe their desires in a very individual and specific way that makes immediate
mapping with data service/tool descriptions very complicated. Therefore, each service discovery
attempt requires a process where user expectations are mapped onto more generic need
descriptions.
A researcher is expected to specify her/his needs in terms of what she/he wants to achieve by using a
concrete data service/tool. We assume that a researcher will in general care about what she/he
wants to get from a concrete service, but not about how it is achieved. Her/his desire is formally
described by the so-called goal. In particular, goals describe what kind of outputs and effects are
expected by the researcher.
Data service/tool requesters (researchers) are not expected to have the required background to
formally describe their goals. Thus, either goals can be expressed in a language they are familiar
with (like natural language) or appropriate tools should be available which can support requesters
to express their precise needs in a simpler manner. Hence, a possible approach could be the
availability of pre-defined, generic, formal and reusable goals defining generic objectives
requesters may have. They can be refined (or parameterized) by the requester to reflect her/his
concrete needs, as requesters are not expected to write formalized goals from scratch. It is
assumed that there will be a way for requesters to easily locate such pre-defined goals, e.g. by
keyword matching.
Modeling Approaches to Goals and Data Services/Tools [52]
Keyword based Representation Models
By adopting this model, both requester and provider use keywords to describe their goals and
services respectively.
Controlled vocabularies
Another approach assumes that requester and provider use (not necessarily the same) controlled
vocabularies in order to describe goals and services respectively.
Ontologies
The border between controlled vocabularies and ontologies is thin and open to smooth and
incremental evolution. Ontologies are consensual and formal conceptualizations of a domain.
Controlled vocabularies organized in taxonomies already resemble the necessary elements of an ontology.
Ontologies may simply add some logical axioms for further restricting the semantics of the
terminological elements. Notice that a service requester or provider gets these logical definitions
"for free". She/He can select a couple of concepts for annotating her/his service/request, but
she/he does not need to write logical expressions as long as she/he only reuses the ones already
included in the ontology.
Full-fledged Logic
Simply reusing existing concept definitions as described in the previous section has the advantage
of simplicity in annotating services and in reasoning about them. However, this approach has
limited flexibility, expressivity, and grain-size in describing services and requests. Therefore, it is
only suitable for scenarios where a more precise description of requests and services is not
required. For these reasons, a full-fledged logic is needed when higher precision in the results
of the discovery process is required.
The Data Service/Tool Location Process
Based on formal models for the description of data services/tools and goals, a conceptual model
for the semantic-based location process of services can be defined [51].
This process is composed of five steps:
Goal Discovery: starting from a user desire (expressed using natural language or any other means), goal discovery will locate the pre-defined goals, resulting in a selected pre-defined goal. Such a pre-defined goal is an abstraction of the requester desire into a generic and reusable goal.
Goal Refinement: The selected pre-defined goal is refined, based on the given requester desire, in order to actually reflect that desire. This step will result in a formalized requester goal.
Service Discovery: Available services that can, according to their abstract capabilities, potentially fulfill the requester goal are discovered.
Service Contracting: Based on the contracting capability, the abstract services selected in the previous step will be checked for their ability to deliver a suitable concrete service that fulfills the requester’s goal. Such service(s) will be selected. This step might involve interaction between service requester and provider.
Service Workability: the final step has to verify whether the hosting computing environment is suitable for efficiently running the selected service(s).
Mediation Support in the Data Service/Tool Discovery Process
Data service/tool discovery is based on matching abstracted goal and service descriptions. In order
to lift the discovery process to an ontological level, two processes are required: a) the concrete user
input has to be generalized to more abstract goal descriptions, and b) concrete services and their
descriptions have to be abstracted to the classes of services a science ecosystem can provide [52].
In order to successfully carry out the data service/tool discovery process a mediation support must
be offered by the data infrastructure; such mediation support should establish a mapping
between the controlled vocabularies or ontologies used to describe goals and services. In fact, we
assume that goals and data services/tools most likely use different controlled vocabularies or
ontologies.
Depending on the modeling approach adopted for representing goals and data services/tools,
different mediation support should be provided:
Keyword based Representation Models
The mediation process consists in matching a set of keywords extracted from a goal description
against a set of keywords extracted from a service description.
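A minimal sketch of such keyword-based matching in Python, ranking published services by their overlap with the requester's goal; the service registry and keyword sets below are hypothetical, and real registries would of course use richer descriptions.

# Illustrative only: hypothetical service registry and goal keywords.
def jaccard(a, b):
    """Overlap between two keyword sets (1.0 = identical, 0.0 = disjoint)."""
    return len(a & b) / len(a | b) if a | b else 0.0

service_registry = {
    "species-distribution-modeller": {"species", "distribution", "model", "occurrence"},
    "climate-downscaling-tool":      {"climate", "downscaling", "regional", "model"},
}

goal_keywords = {"species", "occurrence", "model"}

ranked = sorted(service_registry.items(),
                key=lambda item: jaccard(goal_keywords, item[1]),
                reverse=True)
for name, keywords in ranked:
    print(name, round(jaccard(goal_keywords, keywords), 2))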
Controlled vocabularies
The mediation process consists in mapping a set of concepts extracted from a goal description into
a semantically equivalent set of concepts contained in a data service/tool description. Reasoning
over hierarchical relationships may be required in the case of taxonomies.
Mediation support is needed in case the requester and provider use different controlled
vocabularies.
Ontologies
The descriptions of abstract services and goals are based on ontologies that capture general
knowledge about the problem domains under consideration.
The mediation process consists in mapping the set of concepts describing the goal into
semantically equivalent concepts contained in the service description.
Mediation support is needed in case the requester and provider use different ontologies.
Full-fledged Logic
If a full-fledged logic is adopted for describing goals and services, a mediation support can only be
provided if the terminology used in the logical expressions is grounded in ontologies. Therefore,
the mediation support required is the same as for ontology-based discovery.
Data Service/Tool Registration
A research data infrastructure should maintain a registry containing all the static (abstract), dynamic
(contracting) and deployment data service/tool descriptions made publicly available by the
communities of research of the science ecosystem, as well as the pre-defined, generic, formal and
reusable goals. These descriptions contain all the information necessary to enable an efficient and
effective automated location of data services/tools that adequately fulfill a given research need.
In addition, research data infrastructures should maintain the appropriate tools, i.e., controlled
vocabularies, ontologies, etc. as well as mapping algorithms in order to be able to provide an
efficient mediation support.
8.2 Data Federation
Data federation is an umbrella term for a wide range of decentralized data practices.
At one extreme of this range we have data integration. It is the process of combining data residing
at different sources, and providing the user with a unified view of these data. Data integration has
two broad goals: increasing the completeness and increasing the conciseness of data that is
available to users and applications.
A slightly different concept is data harmonization; it refers to the process of comparing similar
conceptual and logical data models to determine the common data elements, similar data
elements and dissimilar data elements in order to produce a resulting unified data model that can
be used consistently across organizational units.
On the other extreme of the range we have the concept of data linking. It refers to the process of
publishing data on a scientific data space in such a way that its meaning is explicitly defined, it is
linked to other external data sets and can in turn be linked to from external data sets.
A data linking capability does not act as a data integration system as this requires semantic
integration before any service can be provided; instead it follows a co-existence approach, i.e., to
provide base functionality over all data sets, regardless of how integrated they are.
An aggregation/federation capability should be of paramount importance for the next generation
of global research data infrastructures.
8.2.1 Data Integration
Data Integration is the problem of combining data residing at different sources, and providing the
user with a unified view of these data. Data integration has two broad goals [53]: increasing the
completeness and increasing the conciseness of data that is available to users and applications. An
increase in completeness is achieved by adding more data sources (more objects, more attributes
describing objects) to the system. An increase in conciseness is achieved by removing redundant
data, by fusing duplicate entries and merging common attributes into one.
The problem of designing data integration systems is important in current scientific applications
and is characterized by a number of issues that are interesting from a theoretical point of view
[54]. The data integration systems are characterized by an architecture based on a global schema
and a set of sources. The sources contain the real data, while the global schema provides a
reconciled, integrated, and virtual view of the underlying sources.
Data integration is a three-step process: data transformation, duplicate detection, and data
fusion [53].
8.2.1.1 Data Transformation
This step is concerned with the transformation of the data present in the sources into a common
representation (renaming, restructuring). Data from the data sources must be transformed to
conform to the global schema of an integrated information system.
Modelling the relation between the sources and the global schema is therefore a crucial aspect.
Two basic approaches have been proposed for this purpose. The first approach, called global-as-
view (or schema integration), requires that the global schema be expressed in terms of the data
sources. In essence, this approach regards the individual schemata and tries to generate a new
schema that is complete and correct with respect to the source schemata, and that is minimal and
understandable. The second approach, called local-as-view (or schema mapping), requires the
global schema to be specified independently from the sources, and the relationships between the
global schema and the sources are established by defining every source as a view over the global
schema. This approach is driven by the need to include a set of sources in a given integrated
information system. A set of correspondences between elements of a source schema and
elements of the global schema are generated to specify how data is to be transformed.
The goal of both approaches is the same: transform data of the sources so that it conforms to a
common global schema.
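A minimal global-as-view flavoured sketch in Python (all schemas and field names are hypothetical): each source is wrapped by a transformation that renames and restructures its records so that they conform to a common global schema.

# Illustrative only: hypothetical source and global schemas.
def from_source_a(record):
    """Source A stores temperatures in Celsius under different field names."""
    return {"station_id": record["id"],
            "date":       record["obs_date"],
            "temp_k":     record["temp_c"] + 273.15}

def from_source_b(record):
    """Source B already uses Kelvin but nests the station identifier."""
    return {"station_id": record["station"]["code"],
            "date":       record["date"],
            "temp_k":     record["temperature"]}

def global_view(sources):
    """Expose every source record under the single global schema."""
    transforms = {"A": from_source_a, "B": from_source_b}
    return [transforms[name](rec) for name, records in sources.items() for rec in records]

sources = {"A": [{"id": "S1", "obs_date": "2011-06-01", "temp_c": 14.3}],
           "B": [{"station": {"code": "S2"}, "date": "2011-06-01", "temperature": 288.4}]}
print(global_view(sources))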
8.2.1.2 Duplicate Detection
This step regards the identification of multiple, possibly inconsistent representations of the same
real-world objects; it provides the basic input to data fusion. The result of the duplicate detection step is the
assignment of an object-ID to each representation. Two representations with the same object-ID
indicate duplicates. Note that more than two representations can share the same object-ID, thus
forming duplication clusters. It is the goal of data fusion to fuse these multiple representations
into a single one.
8.2.1.3 Data Fusion
This step combines and fuses the duplicate representations into a single representation while
inconsistencies in the data are resolved. A number of data fusion strategies have been defined.
Conflict-ignoring strategies do not make a decision as to what to do with conflicting data. They
escalate conflicts to user or application or create all possible value combinations.
Conflict-avoiding strategies acknowledge the existence of possible conflicts in general, but do not
detect and resolve single existing conflicts.
Conflict resolution strategies regard all data and metadata before deciding on how to resolve a
conflict. They can further be subdivided into deciding and mediating strategies, depending on
whether they choose a value from all the already present values (deciding) or choose a value that
does not necessarily exist among conflicting values (mediating).
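As a small illustration, the Python sketch below groups duplicate representations by their assigned object-ID and applies one possible deciding strategy, preferring the most recently updated non-missing value for each attribute; the records and the resolution rule are hypothetical.

# Illustrative only: hypothetical duplicate cluster and resolution rule.
from collections import defaultdict

def fuse(records):
    """Deciding strategy: for each attribute keep the newest non-missing value."""
    fused = {}
    for record in sorted(records, key=lambda r: r["last_updated"]):
        for attribute, value in record.items():
            if value not in (None, ""):
                fused[attribute] = value   # later (newer) values overwrite older ones
    return fused

duplicates = defaultdict(list)   # object-ID -> cluster of duplicate representations
for rec in [
    {"object_id": "P1", "name": "A. Rossi", "affiliation": None, "last_updated": "2010-03-01"},
    {"object_id": "P1", "name": "Anna Rossi", "affiliation": "CNR-ISTI", "last_updated": "2011-09-12"},
]:
    duplicates[rec["object_id"]].append(rec)

print({object_id: fuse(cluster) for object_id, cluster in duplicates.items()})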
When handling the data integration problem, special attention must be devoted to the following
aspects: modelling a data integration application, processing queries in data integration, dealing
with inconsistent data sources, and reasoning on queries.
8.2.2 Data Harmonization
By and large, most information is stored in databases on mainframe computers. Many of these
systems are not accessible via network connections due to security concerns, which has hindered
the development of “real time” data aggregation efforts. Additionally, most of the data models of
these systems are not harmonized, since they have evolved separately from different sets of
requirements. To further complicate matters, most existing systems have been implemented by
different vendors using different products and information models [55].
Data Harmonization is the process of comparing similar conceptual and logical data models to
determine the common data elements, similar data elements and dissimilar data elements in order
to produce a resulting unified data model that can be used consistently across organizational units,
business systems, and Data Warehouses.
Data Harmonization covers both the data as well as the underlying business definitions.
There is a need for efficient harmonization tools to support the harmonization process. Some
possible features of a harmonization tool are:
The ability to import data models into the tool from various representation formats.
The ability to linguistically and semantically analyze the components of multiple data
models to determine equivalence, similarity and dissimilarity.
The ability to construct multi-modal views of the data models to assist in comparison
analysis.
The ability to harmonize and visualize models with multiple users simultaneously and to
jointly create the resultant data model.
The automated ability to extract common data elements across models to produce a new
resultant skeleton model that can be further refined by a manual process.
The ability to save the resultant data model in multiple representations (the same set of
representations available to import).
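As a rough illustration of the second capability, the sketch below classifies elements of two hypothetical data models as equivalent, similar or dissimilar using simple name matching; a real harmonization tool would apply much richer linguistic and semantic analysis.

```python
# Minimal sketch (hypothetical data models): a naive name-based comparison that
# classifies elements as equivalent, similar or dissimilar.
from difflib import SequenceMatcher

model_a = {"sample_id", "collection_date", "site_name", "depth_m"}
model_b = {"sample_id", "date_collected", "station", "salinity"}

def score(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def classify(s: float) -> str:
    if s == 1.0:
        return "equivalent"
    return "similar" if s >= 0.5 else "dissimilar"

for element in sorted(model_a):
    best = max(model_b, key=lambda other: score(element, other))
    print(f"{element:16} ~ {best:16} -> {classify(score(element, best))}")
```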
Data harmonization has a number of dimensions [56]:
Legal requirements: in essence they concern political commitments.
Technical aspects: modalities applied in the data harmonization.
Operational aspects: adoption of international standards related to data length, format,
attributes and semantic interoperability.
A real time, event-driven and harmonized view of all data is desired. To facilitate this
development, a methodology and practice must be adopted to ensure that any future work is built
upon a strong foundation of data harmonization.
8.2.3 Data Linking
In the context of a science ecosystem, linking data becomes an imperative, as it allows users to
benefit from multiple datasets created by different research communities/organizations, including
the connection of publications with the underlying data.
A scientific data infrastructure must lower the barrier to publishing and accessing data, thus
leading to the creation of scientific data spaces by connecting data sets from diverse domains,
disciplines, regions and nations. The concept of a scientific data space is an answer to the rapidly
increasing demand of researchers for data-everywhere.
Linking data refers to the capability, supported by a scientific data infrastructure, of publishing
data on a scientific data space in such a way that it is machine-readable, its meaning is explicitly
defined, it is linked to other external data sets and can in turn be linked to from external data sets.
A data infrastructure supporting a data space does not act as a data integration system, which
would require semantic integration before any service can be provided; instead it follows a
co-existence approach, i.e., it provides base functionality over all data sets, regardless of how
integrated they are [57].
A scientific data infrastructure should offer researchers the possibility to start browsing in one
data set and then navigate along links into related data sets; or it can support data search
engines that crawl the data space by following links between data sets and provide expressive
query capabilities over aggregated data. To achieve this, the data infrastructure must support the
creation of typed links between data from different sources.
Data providers willing to add their data to a scientific data space which allows data to be
discovered and used by various applications must publish them according to some principles.
These provide a basic recipe for publishing and connecting data using the scientific data
infrastructure architecture, services, and standards. The principles should include [58]:
the assignment of permanent universal data identifiers, i.e., strings or tokens that are
unique within a data space;
setting links to other data sources so that users can navigate the data space as a whole by
following the links; and
the provision of metadata so that users can assess the quality of published data.
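A minimal sketch of a dataset description following these three principles is given below; the identifiers, link types and metadata fields are purely illustrative.

```python
# Minimal sketch (illustrative identifiers, link types and fields): a dataset
# description that follows the three publishing principles above.
dataset_description = {
    "id": "hdl:10013/example-dataset-42",   # permanent universal identifier (hypothetical)
    "title": "Coastal temperature profiles 2011",
    "links": [
        # typed links that let users navigate the data space as a whole
        {"type": "derivedFrom", "target": "hdl:10013/example-raw-7"},
        {"type": "describedBy", "target": "doi:10.9999/example-article"},
    ],
    "metadata": {
        # provenance and quality information so users can assess the data
        "creator": "Example Marine Institute",
        "created": "2011-09-30",
        "method": "CTD casts with calibrated sensors",
        "license": "CC-BY",
    },
}

print(dataset_description["id"], "->", [l["target"] for l in dataset_description["links"]])
```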
A scientific data space enabled by a data infrastructure enjoys the following properties [58]:
it contains data specific to the scientific disciplines supported by the data infrastructure;
any scientific community belonging to the disciplines supported by the data infrastructure
can publish on the scientific data space;
data providers are not constrained in the choice of vocabularies with which to represent data;
data is connected by links, supported by the data infrastructure, creating a global data
graph that spans data sets and enables the discovery of new data sets.
From an application development perspective the scientific data space should have the following
characteristics [58]:
data is strictly separated from formatting and presentational aspects;
data is self-describing;
the scientific data space is open, meaning that applications do not have to be implemented
against a fixed set of data sets, but can discover new data sets at run time by following the
data links.
From a system perspective, a data infrastructure should provide:
a registry service whose purpose is to manage a collection of identifiers and make it
actionable and interoperable, where that collection can include identifiers from many
other controlled collections;
a name resolution system, which resolves data identifiers into the information necessary to
locate and access the data;
automated or semi-automated generation of links.
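The first two of these services can be sketched together in a few lines; the identifiers, locations and fields below are hypothetical.

```python
# Minimal sketch (hypothetical identifiers and locations): a registry records what
# is needed to locate and access each data set, and a resolver turns an identifier
# from any controlled collection into an access location.
registry = {
    "hdl:10013/sample-1": {"location": "https://data.example.org/sample-1", "format": "netCDF"},
    "doi:10.9999/ds-7": {"location": "https://archive.example.org/ds-7", "format": "CSV"},
}

def resolve(identifier: str) -> str:
    """Return the access location recorded for a data identifier."""
    try:
        return registry[identifier]["location"]
    except KeyError:
        raise LookupError(f"unknown identifier: {identifier}") from None

print(resolve("doi:10.9999/ds-7"))
```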
8.3 Data Sharing
Definition
Openness in the sharing of research results is one of the norms of modern science. The
assumption behind this openness is that progress in science demands the sharing of results within
the scientific community as early as possible in the discovery process.
Data sharing is the use, by one or more consumers, of information produced by a source other
than the consumer.
Why Share Research Data [59]
Research data are a valuable resource, usually requiring much time and money to be produced.
Many data have a significant value beyond usage for the original research.
Sharing research data is important for several reasons:
Encourages scientific enquiry and debate
Promotes innovation and potential new data uses
Leads to new collaborations between data users and data creators
Maximizes transparency and accountability
Enables scrutiny of research findings
Encourages the improvement and validation of research methods
Reduces the cost of duplicating data collection
Increases the impact and visibility of research
Promotes the research that created the data and its outcomes
Can provide a direct credit to the researcher as a research output in its own right
Provides important resources for education and training.
How to Share Data [59]
There are various ways to share research data, including:
Depositing them with a specialist data center, data archive or data bank
Submitting them to a journal to support a publication
Depositing them in an institutional repository
Making them available online via a project or institutional website
Making them available informally between researchers on a peer-to-peer basis
Approaches to data sharing may vary according to research environments and disciplines, due to
the varying nature of data types and their characteristics.
Data Documentation [59]
A crucial part of making data user-friendly, shareable and usable in the long term is to ensure
they can be understood and interpreted by any user. This requires clear and detailed data
description, annotation and contextual information.
Data documentation explains how data were created or digitized, what data mean, what the
content and structure are and any data manipulations that may have taken place. Documenting
data should be considered best practice when creating, organizing and managing data and is
important for data preservation. Whenever data are used, sufficient contextual information is
required to make sense of them.
Good data documentation includes information on:
The context of data collection: project history, aim, objectives and hypotheses
Data collection methods: sampling, data collection processes, instruments used, hardware and software used, scale and resolution, temporal and geographic coverage, and secondary data sources used
Dataset structure: data files, study cases, relationships between files
Data validation, checking, profiling, cleaning and quality assurance procedures carried out
Changes made to data over time since their original creation and identification of different versions of data files
Information on access and use conditions or data confidentiality
Difficulties in Data Sharing [60]
Despite its importance, however, sharing data is not easy. There are three categories of problems
hindering data sharing: (i) willingness to share, (ii) locating shared data, and (iii) using shared data.
First, there is a strong sense in which the scientist’s ability to profit from data collection depends
on maintaining exclusive control over the data – economists would say that the data are a source
of “monopoly rents” for the scientist. In this case, however, the profit, or rent, accrues largely in
the form of scientific reputation and its accompanying benefits, such as publications, grants, and
students [61]. The point here is that the competition (and its associated benefits) in science is
intense, and there may be a strong reluctance on the part of scientists to share data, as such
sharing may amount to a sacrifice of future rents that could be extracted from the data were they
not shared.
Second, researchers must become aware of who has the data they need or where the data are
located, which can be a nontrivial problem [62]. After finding appropriate data, they often must
negotiate with the owner or develop trusting relationships to gain access [63].
Third, once in possession of a data set, understanding it requires knowledge of the context of its
creation [64]. How was each datum collected and analyzed? What format are the data in? If the
data are in electronic form, is there a key or metadata available to indicate what the various fields
in the database mean? Researchers also need to know something about the quality of the data
they are receiving, and if the original purpose of the data set is compatible with the proposed use.
Answering these questions requires a large amount of effort on the part of the data creator, but
the benefit of such effort goes largely to the secondary user. This renders it unlikely that adequate
documentation will be produced.
Even if documentation is provided, however, it is often the case that much of the knowledge
needed to make sense of data sets is tacit. Scientists are not necessarily able to explicate all of the
information that is required to understand someone else’s work.
Approaches to Data Sharing
Shared data is only useful if sufficient context is provided about the data such that collaborators
may comprehend and effectively apply it. Typically, data context is passed among collaborating
scientists via one-to-one discussion (face-to-face interactions, over the phone, and through emails).
More recently, data sharing environments have developed a number of approaches to the above
issues. For example, standardized reporting formats and metadata protocols can allow the same
data to be read across different hardware or software. Password and security systems can give a
certain degree of control over who does and does not have access to data sets. Metadata may
provide context about who collected data and how they were processed.
While these approaches deal effectively with the explicit technological problems inherent in data
sharing, it is not clear that they adequately deal with many of the tacit and social issues outlined
above. Indeed, a metadata model can only provide so much contextual information, leading to a
potentially recursive situation in which metadata models require “meta-metadata” in order to be
effectively understood.
Recent calls for data sharing suggest that funding agencies believe that groundbreaking scientific
research requires more data sharing among scientists. Even if we provide the technical means to
move data from one lab to another, however, there may be social barriers to effectively using this
data in practice. To design technologies that truly support the conduct of science and not just the
sharing of a data set, the designer must understand both the scientific role that data play in
producing knowledge, and the social role that data play in the conduct of scientific work.
Data features and properties
Among the most prominent data features and properties which enable the sharing of data we
include [65]:
General data set properties (Basic data set properties such as owner, creation date, size, format, etc)
Experimental properties (Conditions and properties of the scientific experiment that generated or is to be applied to the data)
Data provenance (Relationship of data to previous versions and other data sources)
Integration (Relationship of data subsets within a full data set)
Analysis and interpretation (Notes, experiences, interpretations, and knowledge generated from analysis of data)
Physical organization (Mapping of data sets to physical storage structure such as a file system, database, or some other data repository)
Project organization (Mapping of data sets to project hierarchy or organization)
Task (Research task(s) that generated or applies data set)
Experimental process (Relationship of data and tasks to overall experimental process)
User community (Application of data sets to different organizations of users)
Data Sharing Environments
A data sharing environment is composed of a number of capabilities and tools that support the
contexts for shared data use. Here, we list some of these data sharing capabilities/tools:
Data Browsing tools
Data Viewing tools
Data Translation capabilities and tools
Capabilities for accessing and applying external databases
Database Connection tools
Subscription capabilities
Notification capabilities
Annotation tools
Data Provenance tools
As one important objective of research data infrastructures is the creation of scientific
collaborative environments, they have to support efficient and effective data sharing
environments, which constitute a key component of such collaborative environments.
9. Application Challenges
9.1 Complex Interaction Modes
Scientific data infrastructures must tackle challenges that emerge when solving data-intensive
problems that take into account interaction. Interaction includes the ways in which researchers
interact with running analyses and visualizations, and the longer-term succession of operations
that researchers perform to progressively find, understand and use data. Many problems will
involve geographically distributed teams. Collaboration across institutions and disciplines is
becoming the norm, to gather skills and knowledge and to access sufficient information for
statistical significance. This leads to privacy and ownership issues, especially where sensitive
personal data is involved.
Challenges in supporting researchers interacting with data include: finding the right or sufficient
data, coping with missing or poor quality data, integrating data from diverse sources with
significant differences in form and semantics, understanding and dealing with the complexity of
the data, re-purposing or inventing and implementing analysis strategies that can extract the
relevant signal, planning and composing multiple steps from raw data to presentable answers,
engineering technology that can handle the data scale, the computational complexity of the data
and the demand created by many practitioners pursuing myriads of answers, accommodating
ways in which data and practices are changing, etc. [7].
Challenges are not restricted to interaction with large data or even complex data. It is important to
understand that users have different skill sets and objectives, which lead to different patterns of
interaction. To ensure an inclusive community, users should be able to access technology at
different levels of intuitiveness. This calls for “learning ramps” to help everyone to progress as far
as they wish towards expert usage modes.
Data-intensive environments are often highly dynamic. It is important to understand how, in this
context, researchers build up their portfolio of data and analysis providers, and how they react as
new data and tools become available. The social behaviour and networks involved in making and
influencing choices will have an impact on what data and which tools become an established
standard [66].
9.2 Multidisciplinary – Interdisciplinary Research
The characteristics of knowledge that drive innovative problem solving “within” a discipline
actually hinder problem solving and knowledge creation “across” disciplines. In fact, knowledge
can be described as “localized”, “embedded” and “invested” [67].
Firstly, knowledge is localized around particular problems faced by a given discipline.
Secondly, knowledge is embedded in the technologies, methods, and rules of thumb used by
individuals in a given discipline.
Thirdly, knowledge is invested in practice – invested in methods, ways of doing things, and
successes that demonstrate the value of the knowledge developed.
This specialization of “knowledge in practice” makes working across discipline boundaries and
accommodating the knowledge developed in another discipline especially difficult.
In essence, data/information/knowledge when moving between disciplines have to cross a
number of “knowledge boundaries”.
A first boundary (syntactic boundary) is constituted by the different syntaxes of the languages used
by the communities/disciplines to interact with each other. A shared and stable syntax across a
given boundary can guarantee accurate communication between two communities/disciplines.
Alternatively, a function that maps the syntax used by one community/discipline into a
semantically equivalent syntax used by the other community/discipline is sufficient to overcome
the syntactic boundary.
A second boundary (semantic boundary) can arise even when a common syntax or language is
present, because interpretations often differ. Different communities/disciplines may interpret
representations in different ways, leading to serious problems caused by the loss of the
interpretative context that accompanies the representation of information (semantic distortion). A
shared and stable ontology together with “portable” contexts or representations across a given
boundary could allow the interacting communities/disciplines to share the meaning of the
exchanged information. Particular attention must be paid to the challenges of “conveyed
meaning” and the possible interpretations by individuals; context specific aspects of creating and
transferring knowledge must also be considered.
Overcoming syntactic and semantic boundaries guarantees that a “meaningful” exchange of
data/information/knowledge between different communities/disciplines is achieved
(exchangeability). By meaningful information object exchange between two
communities/disciplines we mean that the information flow between them crosses the existing
syntactic and semantic boundaries without any semantic distortion. However, pure
exchangeability does not guarantee that the members of two different communities/disciplines
can work together. For this to occur a third boundary must be overcome.
A third boundary (pragmatic boundary) arises when a community/discipline is trying to influence
or transform the knowledge created by another community/discipline. A shared syntax and
meaning are not always sufficient to permit cooperation between communities/disciplines. In
fact, some aspects of the information, for example its quality, or the policies and rules that
“discipline” the activities of the communities/disciplines, can hinder the exploitation of the
information they receive.
Compatible quality dimensions or policies established by cooperating communities/disciplines can
assist in overcoming pragmatic boundaries.
Therefore, a global scientific data infrastructure must ensure that the
“data/information/knowledge flow” between cooperating communities/disciplines can cross
syntactic and semantic boundaries without distortions. In addition, it must be able to check the
logical consistency of the policies and the quality dimensions adopted by these communities.
9.2.1 Boundary Objects
A useful means of representing, learning about and transforming knowledge to resolve the
consequences that exist at a given boundary is known as the “boundary object” [68].
‘….both plastic enough to adapt to local needs and constraints of the several parties employing
them, yet robust to maintain a common identity across sites. They are weakly structured in
common use, and become strongly structured in individual site-use. Like a blackboard, a boundary
object “sits in the middle” of a group of actors with divergent viewpoints…”
The concept of a boundary object, developed by Star, describes information objects that are
shared and shareable across different problem solving contexts.
Below we have adapted Star’s four categories of boundary objects (repositories,
standardized forms and methods, objects or models, and maps of boundaries) to describe the
information objects and their use by individuals in the settings present in a
multidisciplinary/interdisciplinary environment.
A boundary object should establish a shared syntax, a shared means for representing and
specifying differences, and a shared means for representing and specifying dependencies in order
to overcome a syntactic/semantic/pragmatic boundary.
At a syntactic boundary, a boundary object should establish a shared (meta) data model, a shared
data language, a shared database, a shared taxonomy, etc.
At a semantic boundary, a boundary object should provide a concrete means for all individuals to
specify and learn about their differences. Examples are a shared ontology, a shared methodology,
etc.
At a pragmatic boundary, a boundary object should facilitate a process where individuals can
jointly transform their knowledge. Examples are a shared quality framework, a shared policy
framework, etc.
Communities of practice in order to be able to work together must create a consistent set of
boundary objects at syntactic, semantic, and pragmatic boundaries.
A data infrastructure supporting cooperation of communities of practice has to efficiently
implement their set of boundary objects.
It is easy to foresee the development, in the future, of discipline specific boundary objects
(metadata models, data models, data languages, taxonomies, ontologies, quality and policy
frameworks, etc.) in order to allow the interoperation between communities of practice belonging
to the same discipline. By interoperation between two “communities of practice” we mean the
ability of their members to exchange meaningful information objects and effectively use them.
In fact, in many disciplines a major effort towards the definition of such discipline specific
boundary objects is currently underway.
A number of projects currently funded by the EC FP7 aim to develop “disciplinary data
infrastructures”. “Disciplinary data infrastructure” means a data infrastructure which implements
a consistent number of “discipline specific” boundary objects at the syntactic, semantic, and
pragmatic boundaries that allows different “communities of practice” involved in a given discipline
to work effectively together.
We, thus, envisage that one of the most important features of future “disciplinary data
infrastructures” will be the efficient implementation of a set of boundary objects defined by the
members of the disciplines being supported.
The definition of boundary objects between different scientific disciplines is more problematic.
We foresee that in order to enable multidisciplinary/interdisciplinary research new methods and
techniques must be developed which implement “a mediation function” between boundary
objects of different disciplines.
By mediation function we mean a function able to map a boundary object defined by a discipline
into a semantically equivalent boundary object of another discipline.
The future “multidisciplinary/interdisciplinary data infrastructures” must effectively support
multidisciplinary/interdisciplinary research by developing a mediation technology [see section
7.5.4].
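As a rough illustration, a mediation function over a very simple boundary object (a shared vocabulary) might be sketched as follows; the disciplines and terms are invented.

```python
# Minimal sketch (invented disciplines and terms): a mediation function mapping
# terms of one discipline's vocabulary (a very simple boundary object) onto
# semantically equivalent terms of another discipline's vocabulary.
oceanography_to_ecology = {
    "sea_surface_temperature": "water_temperature",
    "chlorophyll_a": "primary_producer_biomass",
}

def mediate(term: str, mapping: dict = oceanography_to_ecology) -> str:
    """Translate a term, failing loudly when no equivalent is known."""
    if term not in mapping:
        raise KeyError(f"no semantically equivalent term for {term!r}")
    return mapping[term]

print(mediate("chlorophyll_a"))
```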
9.3 Globalism and Virtual Proximity
Globalism refers to “any description and explanation of a world which is characterized by
networks of connections that span multi-continental distances”. Applied to scientific data
infrastructures, globalism means an infrastructure able to interconnect the components of a
science ecosystem (digital data libraries, digital data archives, and digital research libraries)
distributed worldwide, overcoming language, policy, methodology, social, and other barriers.
Science is a global undertaking and scientific data are both national and global assets. There is a
need for a seamless infrastructure to facilitate collaborative (multi/inter-disciplinary) behaviour
necessary for the intellectual and practical challenges the world faces.
Therefore, the technology should enable the development of global scientific data infrastructures
which diminish geographic, temporal, social, and national barriers to discovery, access, and use of
data.
Virtual Proximity
Working together in the same time and place continues to be important, but through the next
generation of global scientific data infrastructures this can be augmented to enable collaboration
between people at different locations, at the same (synchronous) or different (asynchronous)
times. The distance dimension can be generalized to include not only geographical but also
organizational and/or disciplinary distance.
Future data infrastructures will contribute to collapsing the barrier of distance and removing
geographic location as an issue.
10. Organizational Challenges
From the organizational point of view a research data infrastructure must support the Research
and Publication Process.
This process is composed of the following phases: (i) the original scientist produces, through
research activity, primary, raw data; (ii) this data is analysed to create secondary data, results
data; (iii) this is then evaluated, refined, to be reported as tertiary information for publication; (iv)
with the mediation of the pre-print and peer review mechanisms, this then goes into the
traditional publishing process and feeds publication archives. As an alternative to phase (i), a
scientist may perform research based on data, i.e. using data to make new discoveries or to obtain
further insights [27].
Primary data is archived into dynamic digitally curated data repositories (Digital Data Libraries).
By curated data we mean that this data is associated with metadata and kept dynamic with
annotations and linking to other research.
Two roles are important: the data archivist and the data curator.
Data archivist: in general people in this role need to interact with the data generator to prepare
data for archiving (such as generating metadata which will ensure that it can be found, and can be
rendered or used in the future).
Data curator: people in this role need to keep data dynamic with annotations and linked to other
research, as well as continuously reviewing the information in their care, though they may still
maintain archival responsibilities. They should also take an active role in promoting and adding
value to their holdings, and in managing the value of their collection.
Static digital data is stored into Digital Data Archives for long-term preservation.
The relationship between constantly curated, evolving datasets and those in static digital archives
is one which needs to be explored, through research and accumulation of practical experience.
Publications are archived into publication archives (Digital Research Libraries).
Future research data infrastructures must guarantee interoperability between Digital Data
Libraries, Digital Data Archives and Digital Research Libraries in order to be able to support the
scientific processes.
10.1 Digital Data Libraries (Science Data Centres)
Why archive and curate primary research data? Major reasons to keep primary research data
include [69]:
Re-use of data for new research;
Retention of unique observational data which is impossible to re-create;
More data is available for research projects;
Compliance with legal requirements;
Ability to validate research results;
Use of data in teaching;
For the public good.
Indirect benefits include the provision of primary research data to commercial entities, and use in
commercial products.
However, increasingly, the datasets are so large and the application programs are so complex, that
it is much more economical to move the end-user’s programs to the data and only communicate
questions and answers rather than moving the source data and its applications to the user’s local
system. This will require a new work style.
From the organizational point of view, this new work style has to be supported by service stations
called Science Data Centers or Digital Data Libraries [10]. Each scientific discipline should have its
own science data centre(s) and it should provide access to both the data and the applications that
analyse the data.
Each of these Digital Data Libraries curates one or more massive datasets, curates the applications
that provide access to that dataset, and supports staff that understand the data and are constantly
adding to and improving the dataset.
This new work style consists of sending questions to applications running at a Digital Data Library
and receiving answers, rather than bulk-copying raw data from the Digital Data Library to a local
server for further analysis. Many scientists will prefer doing much of their analysis at Digital Data
Libraries because it will save them from having to manage local data and computer farms. Some
scientists may bring the small data extracts “home” for local processing, analysis and visualization
– but it will be possible to do all the analysis at the Digital Data Library.
Future Scientific Data Infrastructures should support the federation of these Digital Data Libraries.
10.2 Digital Data Archives
A Digital Data Archive is an archive, consisting of an organization of people and systems that have
accepted responsibility to preserve data and make it available for a designated scientific
community. The data being maintained has been deemed to need long-term preservation. Long-
term is long enough to be concerned with the impact of changing technologies, including support
for new media and data formats, or with a changing user community [69].
Mandatory responsibilities of a Digital Data Archive are to [69]:
Negotiate for and accept appropriate information from data generators;
Obtain sufficient control of the data provided at the level needed to ensure long-term
preservation;
Determine which communities should become the designated community and, therefore,
should be able to understand the data provided;
Ensure that the data to be preserved is independently understandable to the designated
community. In other words, the community should be able to understand the data without
needing the assistance of the experts who generated the data;
Follow documented policies and procedures which ensure that the data is preserved
against all reasonable contingencies, and which enables the data to be disseminated as
authenticated copies of the original, or as traceable to the original;
Make the preserved data available to the designated community.
Scientific Data Infrastructures must support the following three categories of archive association
[12]:
Cooperating: Archives with potential data generators, common submission standards and
common dissemination standards, but no common finding aids.
Federated: Archives with both a local community (i.e., the original Designated Community served
by the archive) and a global community (i.e., an extended Designated Community) which has
interests in the holdings of several Data Archives and has influenced those activities to provide
access to their holdings via one or more common finding aids.
Shared Resources: Archives that have entered into agreements with other archives to share
resources, perhaps to reduce cost.
10.3 Personal Workstations
There is an emerging trend to store a personal workspace at Digital Data Libraries and deposit
answers there. This minimizes data movement and allows collaboration among a group of
scientists doing joint analysis. Longer term, personal workspaces at the Digital Data Library could
become a vehicle for data publication – posting both the scientific results of an experiment or
investigation along with the programs used to generate them in public read-only databases [10].
11. A New Computing Paradigm: Cloud Computing
Definition
Cloud Computing is a new term for a long-held dream of computing as a utility, which has recently
emerged as a commercial reality.
In fact, five decades ago, in 1961, computing pioneer John McCarthy predicted that “computation
may someday be organized as a public utility”. Cloud Computing is the realization of this prediction,
as the paradigm facilitates the delivery of computing-on-demand much like other public utilities,
such as electricity and gas. However, Cloud Computing isn’t an entirely new concept: other
computing paradigms – utility computing, grid computing, and on-demand computing – precede
Cloud Computing in addressing the problem of organizing computational power as a publicly
available and easily accessible resource.
By Cloud Computing it is meant a large-scale distributed computing paradigm that is driven by
economies of scale, in which a pool of abstracted, virtualized, dynamically-scalable, managed
computing power, storage, platforms, and services are delivered on demand to external customers
over the Internet [70].
The key points of this definition are:
Cloud computing is a specialized distributed computing paradigm.
It is massively scalable.
It can be encapsulated as an abstract entity that delivers different levels of services to customers outside the Cloud.
It is driven by economies of scale.
The services can be dynamically configured (via virtualization or other approaches) and delivered on demand.
Three main factors have contributed to the surge and interests in Cloud Computing:
Rapid decrease in hardware costs and increase in computer power and storage capacity and the advent of multi-core architecture and modern supercomputers consisting of hundreds of thousands of cores.
The exponentially growing data size in scientific instrumentations/simulations and Internet publishing and archiving.
The wide-spread adoption of Services Computing and Web 2.0 applications.
Key Characteristics
Here we highlight the key characteristics of the Cloud Computing paradigm [70]:
Business Model: a customer pays the provider on a consumption basis, very much like utility
companies charge for basic utilities; the model relies on economies of scale in order to drive prices
down for users and profits up for providers.
Architecture: There are multiple definitions of Cloud architecture; we adopt the one given in [71],
which describes a four-layer architecture for Cloud Computing:
The fabric layer contains the raw hardware level resources, such as compute resources, storage
resources, and network resources.
The unified resource layer contains resources that have been abstracted/encapsulated (usually by
virtualization) so that they can be exposed to upper layer and end users as integrated resources,
for instance, a virtual computer/cluster, a logical file system, a database system, etc.
The platform layer adds on a collection of specialized tools, middleware and services on top of the
unified resources to provide a development and/or deployment platform. For instance, a Web
hosting environment, a scheduling service, etc.
The application layer contains the applications that would run in the Clouds.
Services: Clouds in general provide services at three different levels (IaaS, PaaS, and SaaS) as
follows, although some providers can choose to expose services at more than one level.
Infrastructure as a Service (IaaS) [71] provisions hardware, software, and equipment (mostly at
the unified resource layer, but can also include part of the fabric layer) to deliver software
application environments with a resource usage-based pricing model. Infrastructure can scale up
and down dynamically based on application resource needs, and different resources may be
provided via a service interface:
Data and Storage Clouds deal with reliable access to data of potentially dynamic size, weighting
resource usage with access requirements and/or quality definition [72].
Examples: Amazon S3, SQL Azure.
Compute Clouds provide computational resources, i.e. CPUs.
Examples: Amazon EC2, Zimory, Elastichosts.
Platform as a Service (PaaS)[71] offers a high-level integrated environment to build, test, and
deploy custom applications. Generally, developers will need to accept some restrictions on the
type of software they can write in exchange for built-in application scalability. PaaS typically makes
use of dedicated APIs to control the behavior of a server hosting engine which executes and
replicates the execution according to user requests.
Examples: Force.com, Google App Engine, Windows Azure (Platform)
Software as a Service (SaaS) [71] delivers special-purpose software that is remotely accessible by
consumers through the Internet with a usage-based pricing model.
Examples: Google Docs, Salesforce CRM, SAP Business by Design.
Overall, Cloud Computing is not restricted to Infrastructure/Platform/Software as a service
systems, even though it provides enhanced capabilities which act as (vertical) enablers to these
systems. As such I/P/SaaS can be considered specific “usage patterns” for cloud systems which
relate to models already approached by Grid, Web Services, etc. Cloud systems are a promising
way to implement these models and extend them further [72].
Compute Model [70]: The Cloud Computing compute model will likely look very different from the
Grid’s compute model, with resources in the Cloud being shared by all users at the same time (in
contrast to dedicated resources governed by a queuing system). This should allow latency-sensitive
applications to operate natively on Clouds, although ensuring that a good enough level of QoS
is delivered to the end users will not be trivial, and will likely be one of the major challenges
for Cloud Computing as Clouds grow in scale and number of users.
Combining compute and data management [70]: The combination of the compute and data
resource management is critical as it leverages data locality in access patterns to minimize the
amount of data movement and improve end-application performance and scalability. Attempting
to address the storage and computational problems separately forces much data movement
between computational and storage resources, which will not scale to tomorrow’s peta-scale
datasets and millions of processors, and will yield significant underutilization of the raw resources.
It is important to schedule computational tasks close to the data, and to understand the costs of
moving the work as opposed to moving the data. Data-aware schedulers and dispersing data close
to processors are critical in achieving good scalability and performance.
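A minimal sketch of a data-aware scheduling decision is given below; the node names, datasets and costs are invented, and a real scheduler would weigh far more factors.

```python
# Minimal sketch (invented nodes, datasets and costs): prefer running a task on a
# node that already holds the needed data, i.e. move the work to the data, and
# fall back to the cheapest data transfer otherwise.
nodes = {
    "node-a": {"datasets": {"survey-2011"}, "free_cores": 8},
    "node-b": {"datasets": {"survey-2010"}, "free_cores": 16},
}

def schedule(task_dataset: str, transfer_cost: dict) -> str:
    # Candidates that already hold the dataset and have spare capacity.
    local = [n for n, state in nodes.items()
             if task_dataset in state["datasets"] and state["free_cores"] > 0]
    if local:
        return max(local, key=lambda n: nodes[n]["free_cores"])
    # Otherwise pick the node to which moving the data is cheapest.
    return min(nodes, key=lambda n: transfer_cost.get(n, float("inf")))

print(schedule("survey-2011", {"node-a": 5.0, "node-b": 1.0}))  # node-a: data already there
```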
Virtualization [70]: Virtualization has become an indispensable ingredient for almost every Cloud;
the most obvious reasons are abstraction and encapsulation. Clouds need to run multiple (or
even up to thousands or millions of) user applications, and all the applications appear to the users
as if they were running simultaneously and could use all the available resources in the Cloud.
Virtualization provides the necessary abstraction such that the underlying fabric (raw compute,
storage, network resources) can be unified as a pool of resources and resource overlays (e.g. data
storage services, Web hosting environments) can be built on top of them. Virtualization also
enables each application to be encapsulated such that they can be configured, deployed, started,
migrated, suspended, resumed, stopped, etc., and thus provides better security, manageability,
and isolation.
Elasticity [72]: Elasticity is an essential core feature of cloud systems and circumscribes the
capability of the underlying infrastructure to adapt to changing, potentially non-functional
requirements, for example amount and size of data supported by an application, number of
concurrent users, etc. One can distinguish between horizontal and vertical scalability, whereby
horizontal scalability refers to the number of instances needed to satisfy, e.g., a changing number
of requests, and vertical scalability refers to the size of the instances themselves and thus,
implicitly, to the amount of resources required to maintain them. Cloud scalability involves both
(rapid) up- and down-scaling.
Elasticity goes one step further, though, and also allows the dynamic integration and removal
of physical resources to and from the infrastructure.
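The distinction between horizontal and vertical scaling can be sketched with two toy policies; the thresholds and numbers are purely illustrative.

```python
# Minimal sketch (illustrative thresholds and numbers): horizontal scaling changes
# the number of instances in response to load, vertical scaling changes the size
# of each instance.
import math

def scale_horizontally(instances: int, requests_per_instance: float, target: float = 100.0) -> int:
    """Return the number of instances needed to bring per-instance load near the target."""
    return max(1, math.ceil(instances * requests_per_instance / target))

def scale_vertically(memory_gb: int, utilisation: float, high: float = 0.9) -> int:
    """Double the instance size when memory utilisation stays above the threshold."""
    return memory_gb * 2 if utilisation > high else memory_gb

print(scale_horizontally(instances=4, requests_per_instance=180.0))  # 8 instances
print(scale_vertically(memory_gb=8, utilisation=0.95))               # 16 GB
```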
Security [70]: Clouds mostly comprise dedicated data centers belonging to the same organization,
and within each data center, hardware and software configurations, and supporting platforms are
in general more homogeneous as compared with those in Grid environments. Interoperability can
become a serious issue for cross-data center, cross-administration domain interactions. Currently,
the security model for Clouds seems to be relatively simpler and less secure than the security
model adopted by Grids. Cloud infrastructures typically rely on Web forms (over SSL) to create and
manage account information for end-users, and allow users to reset their passwords and receive
new passwords via email in unsafe and unencrypted communications.
New Application Opportunities [73]
While we have yet to see fundamentally new types of applications enabled by Cloud Computing,
there are several important classes of existing applications which will become even more
compelling with Cloud Computing and contribute further to its momentum. Here we are
examining what kinds of applications represent particularly good opportunities and drivers for
Cloud Computing:
Mobile interactive applications: Such services will be attracted to the cloud not only because they
must be highly available, but also because these services generally rely on large datasets that are
most conveniently hosted in large datacenters.
Parallel batch processing: Although thus far we have concentrated on using Cloud Computing for
interactive SaaS, Cloud Computing presents a unique opportunity for batch-processing and
analytics jobs that analyze terabytes of data and can take hours to finish. If there is enough data
parallelism in the application, users can take advantage of the cloud’s new “cost associativity”:
using hundreds of computers for a short time costs the same as using a few computers for a long
time.
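The “cost associativity” argument can be made concrete with a trivial calculation, assuming a flat, made-up per machine-hour price.

```python
# Minimal sketch with made-up prices: under pay-as-you-go pricing, 1000 machines
# for 1 hour cost the same as 1 machine for 1000 hours, but the answer arrives
# roughly 1000 times sooner.
price_per_machine_hour = 0.10  # hypothetical price in dollars

cost_many_short = 1000 * 1 * price_per_machine_hour   # 1000 machines for 1 hour
cost_few_long = 1 * 1000 * price_per_machine_hour     # 1 machine for 1000 hours
assert cost_many_short == cost_few_long == 100.0
print(cost_many_short, cost_few_long)
```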
Extension of compute-intensive desktop applications: The latest versions of the mathematics
software packages Matlab and Mathematica are capable of using Cloud Computing to perform
expensive evaluations. Other desktop applications might similarly benefit from seamless extension
into the cloud.
“Earthbound” applications: Some applications that would otherwise be good candidates for the
cloud’s elasticity and parallelism may be thwarted by data movement costs, the fundamental
latency limits of getting into and out of the cloud, or both. Until the cost of wide-area data transfer
decreases, such applications may be less obvious candidates for the cloud.
Deployment Types [72]
Similar to P/I/SaaS, clouds may be hosted and employed in different fashions, depending on the
use case, respectively the business model of the provider.
The following deployment types can be distinguished:
Private Clouds (example: eBay)
Public Clouds (example: Amazon, Google Apps, Windows Azure)
Hybrid Clouds (there are not many hybrid clouds in use today, though initial initiatives such as the one by IBM and Juniper already introduce base technologies for their realization)
Community Clouds (Community Clouds as such are still just a vision); and
Special Purpose Clouds (example: Google App Engine).
Obstacles and Opportunities for Cloud Computing
In [73] a ranked list of 10 obstacles to the growth of Cloud Computing is offered:
Availability of a Service: Organizations worry about whether Utility Computing services will have
adequate availability, and this makes some wary of Cloud Computing.
Data Lock-In: Software stacks have improved interoperability among platforms, but the APIs for
Cloud Computing itself are still essentially proprietary, or at least have not been the subject of
active standardization. Thus, customers cannot easily extract their data and programs from one
site to run on another. Concern about the difficulty of extracting data from the cloud is preventing
some organizations from adopting Cloud Computing. Customer lock-in may be attractive to Cloud
Computing providers, but Cloud Computing users are vulnerable to price increases, to reliability
problems, or even to providers going out of business.
Security (including Data Confidentiality and Auditability):
Current cloud offerings are essentially public (rather than private) networks, exposing the system
to more attacks. There are also requirements for auditability.
There are no fundamental obstacles to making a cloud-computing environment as secure as the
vast majority of in-house IT environments; many of the obstacles can be overcome with
well-understood technologies such as encrypted storage, virtual local area networks, and network
middleboxes.
Similarly, auditability could be added as an additional layer beyond the reach of the virtualized
guest OS, providing facilities arguably more secure than those built into the applications
themselves and centralizing the software responsibilities related to confidentiality and auditability
into a single logical layer.
Data Transfer Bottlenecks: Applications continue to become more data-intensive. If we assume
applications may be “pulled apart” across the boundaries of clouds, this may complicate data
placement and transport. Cloud users and cloud providers have to think about the implications of
placement and traffic at every level of the system if they want to minimize costs.
Performance Unpredictability: One unpredictability obstacle regards the fact that multiple Virtual
Machines can share CPUs and main memory surprisingly well in Cloud Computing, but that I/O
sharing is more problematic. There is a problem of I/O interference between virtual machines.
Another unpredictability obstacle concerns the scheduling of virtual machines for some classes of
batch processing programs, specifically for high performance computing.
Scalable Storage: There are three properties whose combination gives Cloud Computing its
appeal: short-term usage (which implies scaling down as well as up when resources are no longer
needed), no up-front cost, and infinite capacity on-demand. While it’s straightforward what this
means when applied to computation, it’s less obvious how to apply it to persistent storage. The
challenge is to create a storage system able not only to support the complexity of data structures
(e.g., schema-less blobs vs. column-oriented storage) but also to combine them with the cloud
advantages of scaling arbitrarily up and down on-demand, as well as meeting programmer
expectations in regard to resource management for scalability, data durability, and high
availability.
Bugs in Large-Scale Distributed Systems: One of the difficult challenges in Cloud Computing is
removing errors in these very large scale distributed systems. A common occurrence is that these
bugs cannot be reproduced in smaller configurations, so the debugging must occur at scale in the
production datacenters.
Scaling Quickly: Pay-as-you-go certainly applies to storage and to network bandwidth, both of
which count bytes used. Computation is slightly different, depending on the virtualization level.
There is a need for automatically scaling quickly up and down in response to load in order to save
money, but without violating service level agreements.
Reputation Fate Sharing: Reputations do not virtualize well. One customer’s bad behavior can
affect the reputation of the cloud as a whole. Another legal issue is the question of transfer of
legal liability—Cloud Computing providers would want legal liability to remain with the customer
and not be transferred to them.
Software Licensing: Current software licenses commonly restrict the computers on which the
software can run. Users pay for the software and then pay an annual maintenance fee. Hence,
many cloud computing providers originally relied on open source software in part because the
licensing model for commercial software is not a good match to Utility Computing.
Cloud Computing Interoperability – Intercloud [74]
More and more service providers are adopting the Cloud Computing paradigm for offering various
computational services on a “utility basis”. In fact, as software and expertise becomes more
available, enterprises and smaller service providers are also building Cloud Computing
implementations.
With next generation services being developed in Cloud environments, these services effectively
become more centralized. Federated Clouds may facilitate the ability to openly discover the
services residing within them. To realize this opportunity, the Cloud community must investigate
new approaches to cross-cloud connectivity.
The concept of a Cloud operated by one service provider interoperating with a Cloud operated by
another is a powerful idea. So far that is limited to use cases where code running on one Cloud
references a service on another Cloud. There is no implicit and transparent interoperability.
Cloud Computing services are offered with a self-contained set of conventions, file formats, and
programmer interfaces. If one wants to utilize that variation of Cloud, one must create
configurations and code specific to that Cloud.
Of course, from within one Cloud, explicit instructions can be issued over the Internet to another
Cloud. However, there are no implicit ways in which Cloud resources and services can be exported
or caused to interoperate.
Active work needs to occur to create interoperability amongst varied implementations of Clouds.
From the lower level challenges around network addressing, to multicast enablement, to virtual
machine mechanics, to the higher level interoperability desires of services, this is an area
deserving of much progress and will require the cooperation of several large industry players.
In conclusion, we can affirm that the long dreamed vision of computing as a utility is finally
emerging. The elasticity of a utility matches the need of businesses providing services directly to
customers over the Internet, as workloads can grow (and shrink) far faster than 20 years ago. It
used to take years to grow a business to several million customers – now it can happen in months.
From the Cloud provider’s view, the construction of very large datacenters at low cost sites using
commodity computing, storage, and networking uncovered the possibility of selling those
resources on a pay-as-you-go model below the costs of many medium-sized datacenters, while
making a profit by statistically multiplexing among a large group of customers.
From the Cloud user’s view, it would be as startling for a new software startup to build its own
datacenter as it would for a hardware startup to build its own fabrication line [73].
Data-Intensive Science and Cloud Computing
Current grid computing environments are primarily built to support large-scale batch
computations, where turnaround may be measured in hours or days – their primary goal is not
interactive data analysis.
While these batch systems are necessary and highly useful for the repetitive “pipeline processing”
of many large scientific collaborations, they are less useful for subsequent scientific analyses of
higher level data products, usually performed by individual scientists or small groups. Such
exploratory, interactive analyses require turnaround measured in minutes or seconds so that the
scientist can focus, pose questions and get answers within one session. The databases, analysis
tasks and visualization tasks involve hundreds of computers and terabytes of data. Of course this
interactive access will not be achieved by magic – it requires new organizations of storage,
networking and computing, new algorithms, and new tools.
As CPU cycles become cheaper and data sets double in size every year, the main challenge for a
rapid turnaround is the location of the data relative to the available computational resources –
moving the data repeatedly to distant CPUs is becoming the bottleneck.
Interactive users measure a system by its time-to-solution: the time to go from hypothesis to
results. The early steps might move some data from a slow long-term storage resource. But the
analysis will quickly form a working set of data and applications that should be co-located in a high
performance cluster of processors, storage, and applications.
A data and application locality scheduling system can observe the workload and recognize data
and application locality. Repeated requests for the same services lead to a dynamic rearrangement
of the data: the frequently called applications will have their data “diffusing” into the grid, most of
it residing in local, and thus fast, storage, reaching a near-optimal “thermal equilibrium” with the
processes competing for the resources. The process arbitrating data movement is aware of all
relevant costs, which include data movement, computing, and starting and stopping applications.
Such an adaptive system can respond rapidly to small requests, in addition to the background
batch processing applications. The system architecture requires aggressive use of data partitioning
and replication among computational nodes, extensive use of indexing, pre-computation,
computational caching, detailed monitoring of the system and immediate feedback
(computational steering) so that execution plans can be modified. It also requires resource
scheduling mechanisms that favor interactive uses.
The key to a successful system will be “provisioning”, i.e., a process that decides how many
resources to allocate to different workloads.
We think that all these requirements of data-intensive scientific applications can be efficiently and
cost-effectively addressed by the Cloud Computing paradigm.
Although data-intensive applications may not be typical of the applications that Clouds deal with
today, as the scale of Clouds grows it may just be a matter of time before they become typical for
many Clouds.
We envision that the future Digital Data Libraries (Science Data Centers) will be based on cloud
philosophy and technology. Each scientific community of practice will have its own Cloud(s); the
federation of these Clouds will allow collaboration among them.
12. A New Programming Paradigm: MapReduce
Many data-intensive applications require hundreds of special-purpose computations that process
large amounts of raw data. Most such computations are conceptually straightforward. However,
the input data is usually large and the computations have to be distributed across hundreds or
thousands of machines in order to finish in a reasonable amount of time. The main issues for such
applications are how to parallelize the computation, distribute the data, and handle failures.
What is MapReduce?
MapReduce is a programming model and an associated implementation for processing and
generating large data sets while hiding the messy details of parallelization, fault-tolerance, data
distribution, and load balancing.
The basic idea of MapReduce is straightforward. It consists of two programs that the user writes
called map and reduce plus a framework for executing a possibly large number of instances of
each program on a compute cluster [75].
The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.
In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of the N nodes, for a total of N * M files: Fi,j, 1 ≤ i ≤ N, 1 ≤ j ≤ M.
The key thing to observe is that all map instances use the same hash function. Hence, all output
records with the same hash value will be in corresponding output files.
The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤
M. The input for each reduce instance Rj consists of the files Fi,j, 1 ≤ i ≤ N. Again notice that all
output records from the map phase with the same hash value will be consumed by the same
reduce instance — no matter which map instance produced them. After being collected by the
map-reduce framework, the input records to a reduce instance are grouped on their keys (by
sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is
an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with
its records. For example, it might compute some additional function over other data fields in the
record. Each reduce instance can write records to an output file, which forms part of the “answer”
to a MapReduce computation.
The MapReduce Framework
The MapReduce Framework is composed of the following components/functions:
Input reader: It divides the input into appropriately sized 'splits' (in practice typically 16 MB to 128 MB)
and the framework assigns one split to each Map function. The input reader reads data from
stable storage (typically a distributed file system) and generates key/value pairs.
Map Function: Each Map function takes a series of key/value pairs, processes each, and generates
zero or more output key/value pairs. The input and output types of the map can be (and often are)
different from each other.
Partition Function: Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reducer. A typical default is to hash the key and take it modulo the number of reducers (a minimal sketch of such a default follows the component list below). It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load-balancing purposes; otherwise the MapReduce operation can be held up waiting for slow reducers to finish. Between the map and reduce stages, the data is shuffled (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation time, depending on network bandwidth, CPU speeds, the amount of data produced, and the time taken by the map and reduce computations.
Compare Function: The input for each Reduce is pulled from the machine where the Map ran and
sorted using the application's comparison function.
Reduce Function: The framework calls the application's Reduce function once for each unique key
in the sorted order. The Reduce can iterate through the values that are associated with that key
and output 0 or more values.
Output Writer: It writes the output of the Reduce to stable storage, usually a distributed file
system.
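As a small illustration of the default partitioning strategy mentioned above (hash of the key modulo the number of reducers), the following Python sketch shows how map output records could be assigned to reduce shards. The function name is hypothetical and the sketch is not taken from any particular MapReduce implementation.

```python
import hashlib

def default_partition(key: str, num_reducers: int) -> int:
    """Return the index of the reduce shard that should receive this key.

    A stable hash is used so every map instance, on any node, sends the
    same key to the same shard.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

# Every occurrence of the same key maps to the same shard, so all records
# sharing a key are guaranteed to reach the same reduce instance.
records = [("cat", 1), ("dog", 1), ("cat", 1), ("fish", 1)]
num_reducers = 4
for key, value in records:
    print(key, "-> shard", default_partition(key, num_reducers))
```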
An Example
Consider the problem of counting the number of occurrences of each word in a large collection of
documents [76]. The user would write code similar to the following pseudo-code:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
The map function emits each word plus an associated count of occurrences (just `1' in this simple
example). The reduce function sums together all counts emitted for a particular word.
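For readers who prefer runnable code, the following is a minimal, single-process Python sketch of the same word count. It imitates the map, shuffle, and reduce steps in memory and is purely illustrative; it is neither the pseudo-code of [76] nor the API of any production MapReduce framework.

```python
from collections import defaultdict

# --- map phase: emit (word, 1) for every word in every document -------------
def map_fn(doc_name, doc_contents):
    for word in doc_contents.split():
        yield word, 1

# --- reduce phase: sum all counts emitted for one word ----------------------
def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce_word_count(documents):
    # Shuffle: group intermediate values by key, as the framework would do.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    # Reduce: one call per unique key.
    return dict(reduce_fn(w, c) for w, c in grouped.items())

if __name__ == "__main__":
    docs = {"d1": "to be or not to be", "d2": "to see or not to see"}
    print(mapreduce_word_count(docs))
    # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```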
Advantage of MapReduce
The advantage of MapReduce is that it allows for distributed processing of the map and reduction
operations. Provided each mapping operation is independent of the others, all maps can be
performed in parallel — though in practice it is limited by the data source and/or the number of
CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is
required is that all outputs of the map operation which share the same key are presented to the
same reducer, at the same time. While this process can often appear inefficient compared to
algorithms that are more sequential, MapReduce can be applied to significantly larger datasets
than "commodity" servers can handle — a large server cluster can use MapReduce to sort a
petabyte of data in only a few hours. The parallelism also offers some possibility of recovering
from partial failure of servers or storage during the operation: if one mapper or reducer fails, the
work can be rescheduled — assuming the input data is still available.
In fact, MapReduce achieves an excellent fault-tolerance by parceling out a number of operations on the set of data to each node in the network. Each node is expected to report back periodically with completed work and status updates; if a node falls silent for longer than this reporting interval, the master node records the node as dead and sends out the node's assigned work to other nodes.
Individual operations use atomic operations for naming file outputs, as a check to ensure that there are no conflicting threads running in parallel. When files are renamed, it is also possible to copy them to another name in addition to the name of the task.
The reduce operations operate in much the same way. Because of their inferior properties with regard to parallel operation, the master node attempts to schedule reduce operations on the same node, or in the same rack, as the node holding the data being operated on. This property is desirable because it conserves bandwidth across the backbone network of the datacenter.
Criticisms
The database research community has expressed some concerns regarding this new computing
paradigm [75]:
MapReduce is a step backwards in database access.
The database community has learned three important lessons: (i) schemas are good, (ii)
separation of the schema from the application is good, and (iii) high-level access languages
are good. MapReduce has learned none of these lessons.
MapReduce is a poor implementation.
All modern DBMSs use hash and B-tree indexes to accelerate access to data. In addition,
there is a query optimizer to decide whether to use an index or perform a brute-force
sequential search.
MapReduce has no indexes and therefore has only brute force as a processing option.
MapReduce is not novel.
The MapReduce community seems to feel that they have discovered an entirely new
paradigm for processing large data sets. In actuality, the techniques employed by
MapReduce are more than 20 years old.
MapReduce is missing features. All of the following features are routinely provided by
modern DBMSs, and all are missing from MapReduce: Bulk loader, Indexing, Updates,
Transactions, Integrity constraints, Referential integrity, Views.
MapReduce is incompatible with DBMS tools. A modern SQL DBMS has available all of
the following classes of tools: Report writers, Business intelligence tools, Data mining tools,
Replication tools, Database design tools. MapReduce cannot use these tools and has none
of its own.
The advocates of the MapReduce paradigm reject these views. They assert that DeWitt and Stonebraker's entire analysis is groundless, as MapReduce was never designed nor intended to be used as a database. MapReduce is not a data storage or management system; it is an algorithmic technique for the distributed processing of large amounts of data.
However, it is worthwhile to note that some database researchers are beginning to explore using
the MapReduce framework as the basis for building scalable database systems. The Pig project at
Yahoo! Research is one such effort.
Factors of success
The MapReduce programming model has been successfully used for many different purposes. This
success can be attributed to several reasons. First, the model is easy to use, even for
programmers without experience with parallel and distributed systems, since it hides the details of
parallelization, fault-tolerance, locality optimization, and load balancing. Second, a large variety of
problems are easily expressible as MapReduce computations. For example, MapReduce is used
for the generation of data for Google's production web search service, for sorting, for data mining,
for machine learning, and many other systems. Third, several implementations of MapReduce
have been developed that scale to large clusters comprising thousands of machines.
These implementations make efficient use of these machine resources and therefore are suitable
for use on many of the large computational problems encountered in data-intensive applications.
13. Policy Challenges
The need for using semantic policies in science ecosystem environments is widely recognized.
It is important to adopt a broad notion of policy, encompassing not only access control policies,
but also trust, quality of service, and others. In addition, all these different kinds of policies should
eventually be integrated into a single coherent framework, so that (i) this policy framework can be
implemented and maintained by a research data infrastructure, and (ii) the policies themselves
can be harmonized and synchronized [77].
Policy Management
The interactions between the different components of a science ecosystem should be governed by
formal semantic policies that enhance their authorization processes, making it possible to regulate
access to and use of data and services (data policies), and to estimate trust based on parties'
properties (trust management policies).
Policies are means to dynamically regulate the behavior of system components without changing
code and without requiring the consent or cooperation of the components being governed [78].
Policies, which constrain the behavior of system components, are becoming an increasingly
popular approach to dynamic adjustability of applications in academia and industry.
Policies are pervasive in distributed and networked environments, for example in web and Grid
applications. They play crucial roles in enhancing security, privacy, and usability of distributed
services, and indeed may determine the success (or failure) of a service. However, users will not
be able to benefit from these protection mechanisms unless they understand and are able to
personalize policies applied in such contexts.
Policy-based approaches offer many benefits, including reusability, efficiency,
extensibility, context-sensitivity, verifiability, support for both simple and sophisticated
components, and reasoning about component behavior. In particular, policy-based network
management has been the subject of extensive research over the last decade.
Policy based Interaction [77]
Policies allow security, privacy, authorization, obligation, and similar requirements to be described
in a machine-understandable way. More specifically, service or data providers may use security policies to
control access to resources by describing the conditions a requester must fulfill (e.g. a requester to
resource A must belong to institution B and prove it by means of a credential). At the same time,
service or data consumers may regulate the data they are willing to disclose by protecting it with
privacy policies. Given two sets of policies, an engine may check whether they are compatible, that
is, whether they match. The complexity of this process varies depending on the sensitivity of
policies (and on the expressivity of the policies). If all policies are public on both sides, the provider and
requester may supply the relevant policies together with the request, and the evaluation can be
performed in a single step by the provider's policy engine, which returns a final decision.
Otherwise, if some policies are private, the process may consist of a multi-step
negotiation in which new policies and credentials are disclosed at each step, advancing after each
iteration towards a common agreement.
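As a purely illustrative sketch of the one-step evaluation described above, a provider-side engine might check a requester's disclosed credentials against a simple attribute-based access policy as follows. The attribute names and policy format are assumptions invented for this example and do not correspond to any particular policy language.

```python
# Illustrative one-step policy evaluation: the provider checks whether the
# credentials disclosed with a request satisfy its access policy.
# Attribute names and policy format are assumptions for this sketch only.

access_policy = {
    "resource": "dataset-A",
    "required": {"institution": "B", "role": "researcher"},
}

def evaluate(policy, request):
    """Return True if every attribute required by the policy is proven
    by a credential disclosed in the request."""
    credentials = request.get("credentials", {})
    return all(credentials.get(attr) == value
               for attr, value in policy["required"].items())

request_ok = {"resource": "dataset-A",
              "credentials": {"institution": "B", "role": "researcher"}}
request_ko = {"resource": "dataset-A",
              "credentials": {"institution": "C", "role": "researcher"}}

print(evaluate(access_policy, request_ok))   # True  -> access granted
print(evaluate(access_policy, request_ko))   # False -> access denied (or start a negotiation)
```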
Policy Specification [77]
Multiple approaches for policy specification have been proposed that range from formal policy
languages that can be processed and interpreted easily and directly by a computer, to rule-based
policy notation using if-then-else format, and to the representation of policies as entries in a table
consisting of multiple attributes. There are also ongoing standardization efforts toward common
policy information models and frameworks.
Policy specification tools like the KAoS Policy Administrator Tool [79] and the PeerTrust Policy
Editor provide an easy-to-use application to help policy writers. This is important because the
policies will be enforced automatically, and errors in their specification or
implementation may allow outsiders to gain inappropriate access to resources, possibly inflicting
huge and costly damages. In general, the use of ontologies in policy specification reduces the
burden on administrators, helps them with maintenance, and decreases the number of
errors.
The KAoS policy language provides a framework for specifying both authorization policies and obligation
policies. A policy in KAoS may be a positive (respectively negative) authorization, i.e., constraints
that permit (respectively forbid) the execution of an action, or a positive (respectively negative)
obligation, i.e., constraints that require an action to be executed. A policy is then represented as
an instance of the appropriate policy type, associating values to its properties, and giving
restrictions on such properties.
In Rei [80] policies are described in terms of deontic concepts: permissions, prohibitions,
obligations and dispensations, equivalently to the positive/negative authorizations and
positive/negative obligations of KAoS.
Rule-based languages are commonly regarded as the best approach to formalizing policies due to
their flexibility, formal semantics, and closeness to the way people think.
Policy Constraints. A constraint can optionally be defined as part of a policy specification to restrict
the applicability of the policy. It is defined as a predicate referring to global attributes such as time
(temporal constraints) or action parameters (parameter value constraints). Preconditions could
define the resources which must be available for a management policy to be accomplished.
Propagation to Sub-domains. Policies apply to sets of objects within domains, but domains may
contain sub-domains. To avoid having to re-specify policy for each sub-domain, policy applying to
a parent domain should propagate to member sub-domains of the parent. A sub-domain is said to
inherit the policy applying to its parent domains.
Policies can be specified in many different ways and multiple approaches have been proposed in
different application domains. There are, however, some general requirements that any policy
representation should satisfy regardless of its field of applicability:
Expressiveness to handle the wide range of policy requirements arising in the system being managed;
Simplicity to ease the policy definition tasks for administrators with different degrees of expertise;
Enforceability to ensure a mapping of policy specifications into implementable policies for various platforms;
Scalability to ensure adequate performance; and
Analyzability to allow reasoning about policies.
The existing policy languages differ in expressivity, kind of reasoning required, features and
implementation provided, etc. However, specifying policies, getting a policy right and maintaining
a large number of them is hard. Fortunately, ontologies and policy reasoning may help users and
administrators with the specification, conflict detection, and resolution of such policies.
An ontology-based description of the policy enables the system to use concepts to describe the
environments and the entities being controlled, thus simplifying their description and facilitating
the analysis and the careful reasoning over them. Several capabilities can benefit from this powerful
feature, such as policy conflict detection and harmonization.
In addition, ontology-based approaches simplify the access to policy information, with the
possibility of dynamically calculating relations between policies and environments, entities or
other policies based on ontology relations rather than fixing them in advance.
Ontologies can also simplify the sharing of policy knowledge thus increasing the possibility for
entities to negotiate policies and to agree on a common set of policies.
Policy Classification
Authorization Policy defines what activities a subject is permitted to do in terms of the operations
it is authorized to perform on a target object. In general an authorization policy may be positive
(permitting) or negative (prohibiting) i.e. not permitted = prohibited.
Activity Based Authorization: The simplest policies are expressed purely in terms of subject, target
and activity.
State Based Authorization policies include a predicate based on object state (i.e. a value of an
object attribute) in the policy specification.
Obligation policy defines what activities a subject must (or must not) do. The underlying
assumption is that all subjects are well behaved, and attempt to carry out obligation policies with
no freedom of choice. Obligation policies are subject based in that the subject is responsible for
interpreting the policy and performing the activity specified.
Activity based Obligations: Simple obligation policies can also be expressed in terms of subject,
target and activity, but may also specify an event which triggers the activity.
State Based Obligation: An obligation may also be specified in terms of a predicate on object state.
Conflict Detection and Resolution [77]
A conflict may occur between any two policies if one policy prevents the activities of another
policy from being performed or if the policies interfere in some way that may result in the
managed objects being put into unwanted states. As the activity of a policy can specify a set of
actions, there may also be conflicts between these actions within a single policy.
Policy languages allow for advanced algorithms for conflict detection and resolution. Conflicts
may arise between policies either at specification time or runtime. A typical example of a conflict
is when several policies apply to a request and one allows access while another denies it (positive
vs negative authorization). Description Logic based languages may use subsumption reasoning to
detect conflicts by checking if two policies are instances of conflicting types and whether the
action classes that the policies control are not disjoint. Both KAoS and Rei handle such conflicts
within their frameworks and both provide constructs for specifying priorities between policies,
hence the most important ones override the less important ones.
KAoS also provides a conflict resolution technique called "policy harmonization": if a conflict is
detected, the policy with the lower priority is modified, refining it to the minimum degree
necessary to remove the conflict.
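The following minimal Python sketch illustrates priority-based overriding of the kind KAoS and Rei provide: when a positive and a negative authorization both apply to a request, the policy with the higher priority wins. The policy representation is a simplification invented for this example, not the representation used by either framework.

```python
# Illustrative priority-based resolution of a positive/negative authorization
# conflict. The policy representation is a simplification for this sketch.

policies = [
    {"id": "p1", "effect": "permit", "action": "read", "target": "dataset-A", "priority": 1},
    {"id": "p2", "effect": "deny",   "action": "read", "target": "dataset-A", "priority": 5},
]

def decide(action, target, policies):
    """Among all policies applicable to (action, target), let the one with the
    highest priority determine the decision; default to deny if none applies."""
    applicable = [p for p in policies
                  if p["action"] == action and p["target"] == target]
    if not applicable:
        return "deny (no applicable policy)"
    winner = max(applicable, key=lambda p: p["priority"])
    return f'{winner["effect"]} (by {winner["id"]})'

print(decide("read", "dataset-A", policies))  # deny (by p2): the higher-priority policy overrides
```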
Policy Management
The adoption of a policy based-approach for controlling a system requires an appropriate policy
representation and the design and development of a policy management framework.
The scope of policy management is increasingly going beyond these traditional applications in
significant ways. New challenges for policy management include:
Sources and methods protection, digital rights management, information filtering and transformation, capability-based access;
Active networks, agile computing, pervasive and mobile systems;
Organizational modeling, coalition formation, formalizing cross-organizational agreements;
Trust models, trust management, information pedigrees;
Effective human-machine interaction: interruption/notification management, presence management, adjustable autonomy, teamwork facilitation, safety; and
Intelligent retrieval of all policies relevant to some situation.
Graphical tools should be provided for editing, updating, removing, and browsing policies as well
as de-conflicting newly defined policies.
Policy Enforcement [77]
Cooperative policy enforcement involves both machine-to-machine and human-machine aspects.
The former is handled by negotiation mechanisms: published policies, provisional actions, hints,
and other metalevel information can be interpreted by the client to identify what information is
needed to access a resource and how to obtain that information.
A cooperative policy enforcement approach is recommended, where negative responses are enriched with
suggestions and other explanations wherever such information does not violate confidentiality.
For these reasons, greater user awareness of and control over policies is one of our main objectives,
making policies easier for the common user to understand and formulate in the following ways: (i)
adopt a rule-based policy specification language, (ii) make the policy specification language more
friendly, and (iii) develop advanced explanation mechanisms.
Trust Management [77]
Currently, two major approaches for managing trust exist: policy-based and reputation-based trust
management. The two approaches have been developed within the context of different
environments. On the one hand, policy-based trust relies on “strong security” mechanisms such as
signed certificates and trusted certification authorities in order to regulate access of users to
services. On the other hand, reputation-based trust relies on a “soft computational” approach to
the problem of trust. In this case, trust is typically computed from local experiences together with
the feedback given by other entities in the network. The reputation-based approach is more
suitable for environments such as Peer-to-Peer, Semantic Web and Science Ecosystems, where the
existence of certifying authorities cannot always be assumed but where a large pool of individual
user ratings is often available.
Another approach, very common in today's applications, is based on forcing users to commit to
contracts or copyrights by having them click an "accept" button on a pop-up window.
During the past few years, some of the most innovative ideas on security policies arose in the area
of automated trust negotiation. That branch of research considers peers that are able to
automatically negotiate credentials according to their own declarative, rule-based policies. Rules
specify for each resource or credential request which properties should be satisfied by the
subjects and objects involved. At each negotiation step, the next credential request is formulated
essentially by reasoning with the policy, e.g. by inferring implications or computing abductions.
Applying Policies on Science Ecosystems
The need for using semantic policies in science ecosystem environments is widely recognized.
It is important to adopt a broad notion of policy, encompassing not only access control policies,
but also trust, quality of service, and others. In addition, all these different kinds of policies should
eventually be integrated into a single coherent framework, so that (i) this policy framework can be
implemented and maintained by a research data infrastructure, and (ii) the policies themselves
can be harmonized and synchronized.
In the general view depicted above, policies may also establish that some events must be logged,
that user profiles must be updated, and that when an operation fails, the user should be told how
to obtain missing permissions. In other words, policies may specify actions whose execution may
be interleaved with the decision process. Such policies are called provisional policies. In this
context, policies act both as decision support systems and as declarative behavior specifications.
14. Open Science – Open Data
The Concept
There is an emerging consensus among the members of the academic research community that
“e-science” practices should be congruent with “open science”. The essence of e-science is
“global collaboration” in key areas of science, and the next generation of data infrastructures must
enable it. Global scientific collaboration takes many forms, but from the various initiatives around
the world a consensus is emerging that collaboration should aim to be “open”, or at least that there
should be a substantial measure of “open access” to the data and information underlying published research,
and to communication tools [81].
The concept of Open Data is not new; but although the term is currently in frequent use, there are
no commonly agreed definitions (unlike, for example, Open Access where several formal
declarations have been made and signed).
The Organization for Economic Co-operation and Development, OECD, an international
organization helping governments tackle the economic, social, and governance challenges of a
globalized economy, has produced a Report “Recommendations of the Council concerning Access
to Research Data from Public Funding” [82] where the following definition of openness is given:
“Openness means access on equal terms for the international research community at the lowest
possible cost, preferably at no more than the marginal cost of dissemination. Open access to
research data from public funding should be easy, user-friendly and preferably Internet-based”.
Open Data is focused on data from scientific research. Problems often arise because these data are commercially valuable or can be aggregated into works of value. Access to, or re-use of, the data is controlled by organisations, both public and private. Control may be exercised through access restrictions, licenses, copyright, patents, and charges for access or re-use.
Open Data has a similar ethos to a number of other "Open" movements and communities, such as open source and open access. However, these are not logically linked, and many combinations of practice are found. The practice and ideology are themselves well established, but the term "open data" is recent.
An essential prerequisite for successful adoption of the Open Data principle is the willingness of
scientific communities to share data. Researchers acknowledge that data sharing increases the
impact, utility, and profile of their work. Conversely, research is highly competitive, and
publications depend on individual ability to produce novel data, which can be a disincentive for
collaboration.
There are also major ethical considerations in sharing data between researchers and between
countries and in making data available for open access. Acceptance of the open data principle
entails a cultural shift in science regarding the importance of data sharing and mining.
Much data is made available through scholarly publication, which now attracts intense debate under "Open Access". The Budapest Open Access Initiative (2001) coined this term:
By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
The logic of the declaration permits re-use of the data, although the term "literature" has connotations of human-readable text and can imply a scholarly publication process. In Open Access discourse the term "full-text" is often used, which does not emphasize the data contained within or accompanying the publication.
While "open data" will enhance and accelerate scientific advance, there is also a need for "open science", where not only data but also analyses and methods are preserved, providing better transparency and reproducibility of results.
The extent of openness is an important issue that must be regulated by rules and norms governing
the disclosure of data and information about research methods and results [81]:
How fully and quickly is information about research procedures and data released?
How completely is it documented and annotated so as to be not only accessible but also useable by those outside the immediate research group?
On what terms and with what delays are external researchers able to access materials, data and project results?
Are findings held back, rather than being disclosed in order to first obtain IPRs on a scientific project’s research results, and if so how long is it usual for publication to be delayed?
The Open Data Principle has three dimensions: policy, legal, and technological. Technology must
render physical and semantic barriers irrelevant, while policies and laws must make it possible to
overcome legal jurisdictional boundaries.
Policy Dimension
There are several political reasons advocating for open access to data:
“Data belong to human race”. Typical examples are genomes, environmental data, medical science data, etc.
Public money was used to fund the work and so it should be universally available.
In scientific research, the rate of discovery is accelerated by better access to data.
Facts cannot legally be copyrighted.
Sponsors of research do not get full value unless the resulting data are freely available.
There are also important economic and social reasons for openness in the pursuit of knowledge, which make explicit the supportive role played by norms that tend to reinforce cooperative behaviors among scientists. In brief, rapid disclosure:
abets rapid validation of findings,
reduces excess duplication of research effort,
enlarges the domain of complementarities, and
yields beneficial "spill-overs" among research programs.
Advocates of open data argue that access restrictions are against the communal good and that these data should be made available without restriction or fee. In addition, it is important that the data are re-usable without requiring further permission, though the types of re-use (such as the creation of derivative works) may be controlled by license.
As the term Open Data is relatively new, it is difficult to collect arguments against it. Unlike Open Access, where groups of publishers have stated their concerns, Open Data is normally challenged by individual institutions. Their arguments may include:
this is a non-profit organisation and the revenue is necessary to support other activities (e.g. learned society publishing supports the society)
the government gives specific legitimacy for certain organizations to recover costs (NIST in US, Ordnance Survey in UK)
government funding may not be used to duplicate or challenge the activities of the private sector (e.g. PubChem)
It may be noted that it is the difficulty of monitoring research effort that makes it necessary for
both the open science system and the intellectual property regime to tie researchers’ rewards in
one way or another to priority in the production of observable “research outputs” that can be
transmitted to “validity tests and valorization” – whether directly by peer assessment, or indirectly
through their application in the markets for goods and services.
An acceptable compromise between open and closed data could be the introduction of a data-
disclosure norm that defines a time-limited embargo period. During this period the novel data
produced by a researcher remain closed, allowing him/her to produce publications based on these
data without running the risk of fraudulent publications by others. After the embargo period has
elapsed, the data become open (publicly available). The length of the embargo period must be
agreed among the scientific communities and could be discipline-dependent.
The issues around consent and ownership are yet more complex within networked science
environments.
Indeed, we appear to be in the midst of a massive collision between unprecedented increases in
data production and availability and the privacy rights of human beings worldwide.
Common frameworks and defined principles first need to be established if an Open Science Data
space is to be established, particularly when it comes to the ethical and privacy issues.
The key strategy in ensuring that international policies requiring “full and open exchange of data”
are effectively acted on in practice lies in the development of a coherent policy and legal
framework at a national level. The national framework must support the international principles
for data access and sharing but also be clear and practical enough for researchers to follow at a
research project level [83].
The development of a national framework for data management based on principles promoting
data access and sharing (such as the OECD recommendations) would help to incorporate
international policy statements and protocols such as the Antarctic Treaty and the GEOSS
Principles into domestic law.
Legal Dimension
It is generally held that factual data cannot be copyrighted. However, publishers frequently add their copyright statements (often forbidding re-use) to scientific data accompanying (supporting, supplementing) a publication. It is also usually unclear whether the factual data embedded in full text are part of the copyright. While the human abstraction of facts from paper publications is normally accepted as legal, there is often an implied restriction on machine extraction by robots.
Some Open Access publishers do not require the authors to assign copyright, and the data associated with these publications can normally be regarded as Open Data. Other publishers have Open Access strategies where the publisher requires assignment of the copyright, and it is unclear whether the data in those publications can truly be regarded as Open Data.
The ALPSP and STM publishers have issued a statement about the desirability of making data freely available: "Publishers recognise that in many disciplines data itself, in various forms, is now a key output of research. Data searching and mining tools permit increasingly sophisticated use of raw data. Of course, journal articles provide one 'view' of the significance and interpretation of that data – and conference presentations and informal exchanges may provide other 'views' – but data itself is an increasingly important community resource. Science is best advanced by allowing as many scientists as possible to have access to as much prior data as possible; this avoids costly repetition of work, and allows creative new integration and reworking of existing data." And: "We believe that, as a general principle, data sets, the raw data outputs of research, and sets or sub-sets of that data which are submitted with a paper to a journal, should wherever possible be made freely accessible to other scholars. We believe that the best practice for scholarly journal publishers is to separate supporting data from the article itself, and not to require any transfer of or ownership in such data or data sets as a condition of publication of the article in question."
However, this statement has had no effect on the open availability of primary data related to publications in journals of the ALPSP and STM members: data tables provided by the authors as supplements to a paper are still available to subscribers only.
Technological Dimension
The importance of building scientific data infrastructures in which research findings can be readily
made available to and used by other researchers has long been recognized in international
scientific collaborations. However, creating and maintaining conditions of openness might mean
not simply putting data on-line but making them sufficiently robust and well-documented to be
widely utilized.
There are two main options in making data openly accessible and sharable:
Open access to data/metadata with re-use restrictions
Open access to data/metadata without re-use restrictions
In order to implement the Open access to data/metadata with re-use restrictions policy the data
producer/provider must make them understandable to the researchers belonging to the same or
different scientific disciplines.
Therefore, the data must be endowed with some auxiliary information which contributes to enriching
their semantics. Such auxiliary information could include contextual information as well as open
access community ontologies, terminologies and taxonomies.
In addition, the adoption of annotation practices based on standardized terminologies will greatly
increase the understandability of the published data.
In order to implement the Open access to data/metadata without re-use restrictions policy the
data producer/provider must make them not only understandable but also usable.
Therefore, the data must be endowed with some auxiliary information which contributes to making
them usable. Such auxiliary information could include provenance, quality, and uncertainty
information.
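As an illustration of the kind of auxiliary information discussed here, a published data set might be accompanied by a record like the following Python dictionary. The field names and values are invented for this sketch and follow no specific metadata standard; the identifier is a placeholder.

```python
# Hypothetical auxiliary-information record accompanying an open data set.
# Field names are invented for illustration; they follow no specific standard.

dataset_record = {
    "identifier": "doi:10.xxxx/example-dataset",      # persistent identifier (placeholder)
    "context": {
        "discipline": "marine biology",
        "ontology_terms": ["http://example.org/onto#SeaSurfaceTemperature"],
    },
    "provenance": {
        "derived_from": ["doi:10.xxxx/raw-sensor-data"],
        "processing": "calibrated and gridded, v2 pipeline",
    },
    "quality": {"completeness": 0.97, "validation": "peer-reviewed"},
    "uncertainty": {"temperature_error_celsius": 0.1},
    "license": "open access, re-use permitted with attribution",
}

# Such a record makes the data not only discoverable but also understandable
# and usable by researchers outside the producing group.
for section, value in dataset_record.items():
    print(f"{section}: {value}")
```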
The challenge for the next generation of global scientific data infrastructures is to support
automated agents and search engines able to crawl the science data space in order to discover,
mine, relate and interpret data from datasets as well as the literature.
We envision that the future research data infrastructures will constitute infrastructures for open
scientific research.
The principles of open science data and open science can be widely accepted only if realized
within an Integrated Science Policy Framework to be implemented and enforced by global
research data infrastructures.
15. Recommendations
1. Future Scientific Data Infrastructures must enable Science Ecosystems.
Several discipline-specific Digital Data Libraries, Digital Data Archives and Digital Research
Libraries are under development or will be developed in the near future. These systems
must be able to interwork and constitute disciplinary and/or multidisciplinary ecosystems.
The next generation of global research data infrastructures must enable the creation of
efficient and effective Science Ecosystems.
2. The organizational aspects of science, as well as the potential tensions which global research
data infrastructures could face or provoke, should be taken into due consideration when designing
such infrastructures.
A viable vision of research data infrastructure must take into account social and
organizational dimensions that accompany the collective building of any complex and
extensive resource. A robust Global Research Data Infrastructure must consist not only of a
technical infrastructure but also a set of organizational practices and social forms that work
together to support the full range of individual and collaborative scientific work across
diverse geographic locations. A data infrastructure will fail or quickly become encumbered
if any one of these critical aspects is ignored.
New research data infrastructures are encountering and often provoking a series of
tensions. Tensions should be thought of as both barriers and resources to infrastructural
development, and should be engaged constructively.
3. Global Research Data Infrastructures must be based on scientifically sound foundations.
It is widely recognized that current database technology is not adequate to support data-
intensive multidisciplinary research.
Existing research data infrastructures suffer from the following main limitations:
not based on scientifically sound foundations
application-specific software of limited long term value coupled with the absence
of a consistent computer science perspective
discipline-specific
Science re-builds rather than re-uses software and has not yet come up with a set of
common requirements.
It is time to develop the theoretical foundations of research data infrastructures. They will
allow the development of generic data infrastructure technology and incorporate it into
industrial-strength systems.
4. Formal models and query languages for data, metadata, provenance, context,
uncertainty and quality must be defined and implemented.
Radically new approaches to scientific data modelling are required. The data models developed
by the database research community (e.g., the relational model) are appropriate for
business/commercial data applications. Scientific data has completely different
characteristics from business/commercial data and therefore current database technology
is inadequate to handle it efficiently and effectively.
Data models and query languages that more closely match the data representation needs
of the several scientific disciplines, describe discipline-specific aspects (metadata models),
represent and query data provenance information, represent and query data contextual
information, represent and manage data uncertainty, and represent and query data quality
information are necessary.
Formally defined data models and data languages will allow the development of
automated data tools and services (e.g., mediation software, curation tools, etc.) as well as
generic software.
5. New advanced data tools (data analysis, massive data mining, data visualization) must be
developed.
Current data management tools, as well as other data tools, are completely inadequate for most
science disciplines. It is essential to build better tools and services in order to make
scientists more productive; tools helping them to capture, curate, analyze and visualize
their data; in essence tools and services that support the whole research cycle.
Advanced tools and services are needed so as to enable scientists to follow new paths, try
new techniques, build new models and test them in new ways that facilitate innovative
multidisciplinary/interdisciplinary activities.
6. New advanced infrastructural services (data discovery, tool discovery, data integration,
data/service transportation, workflow management, ontology/taxonomy management,
policy management, etc) must be developed.
The ultimate aim of a Global Research Data Infrastructure is to enable global collaboration
in key areas of science. Therefore, the infrastructural services must achieve the conditions
needed to facilitate effective collaboration among geographically and institutionally
separated communities of research. To this end a Global Research Data Infrastructure must
provide advanced support services that make the components of a science ecosystem
interoperable and their holdings discoverable, aggregable and usable.
7. Future Research Data Infrastructures must support open linked data spaces.
A research data infrastructure must lower the barrier to publishing and accessing data
leading, therefore, to the creation of open scientific data spaces by connecting data sets
from diverse domains, disciplines, regions and nations. Researchers should be able to
navigate along links into related data sets.
8. Future Research Data Infrastructures must support interoperation between science data
and literature.
In future all scientific literature and data will be on-line. The scientific data must be unified
with all the literature to create a world in which the data and the literature interoperate
with each other. Such a capability will increase the “information velocity” of the sciences
and will improve the scientific productivity of researchers. Future research data
infrastructures must make this happen by supporting the interoperation between digital
data libraries and digital research libraries.
9. The principles of open science and open data, in order to be widely accepted, must be
realized within an Integrated Science Policy Framework to be implemented and enforced
by global research data infrastructures.
There is an emerging consensus among the members of the academic research community
that “e-science” practices should be congruent with “open science”. The open science
principle entails not only open access to data but also to scientific analyses, methods, etc.
This principle has not only a technological dimension but also policy and legal dimensions.
Policies and laws must deal with legal jurisdictional boundaries and they must be
integrated into a shared Science Policy Framework.
10. A new international research community must be created.
The building of scientifically sound research data infrastructures can only be achieved if
supported by an active new international research community capable of tackling all the
scientific and technological challenges that such an enterprise implies. This community
would embrace two main components:
researchers who use data-intensive methods and tools (biologists, astronomers,
etc.), and
researchers who create or enable these models and methods (computer scientists,
mathematicians, engineers, etc.).
So far, these two components have operated in isolation and have only come together
sporadically. We firmly believe that the development of the new Data-Intensive
Multidisciplinary Science must spring from a synergetic action between these two
components. We firmly believe that without the creation and active involvement of such a
research community it is illusory to think about the development of the Data-Intensive
Multidisciplinary Science.
11. New Professional Profiles must be created.
In order to be able to exploit the huge volumes of data available and the expected new
data and network technologies, new professional profiles must be created: data scientist,
data-intensive distributed computation engineer, data curator, data archivist, and data
librarian.
These professionals must be capable of operating in the fast moving world of network and
data technologies. Education and training activities to enable them to use and manage the
data and the infrastructures must be defined and put in action.
16. References
[1] G. Bell, T. Hey, and A. Szalay, “Beyond the Data Deluge”, Science, 323, 1297-1298, March
2009
[2] T. Hey, S. Tansley and K. Tolle (Eds.), The Fourth Paradigm: Data Intensive Scientific
Discovery. Redmond, WA: Microsoft, 2009.
[3] C. Anderson, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”,
Wired Magazine: 16.07 . Retrieved from
http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
june 2011
[4] P. Edwards, S. Jackson, G. Bowker and C. Knobel, Understanding Infrastructure: Dynamics,
Tensions, and Design, Final Report of the Workshop on History & Theory of Infrastructure:
Lessons for New Scientific Cyberinfrastructures, Jan. 2007. Retrieved from
http://hdl.handle.net/2027.42/49353
June 2011
[5] T. Jewett, R. Kling, “The dynamics of computerization in a social science research team: a case
study of infrastructure, strategies, and skills”, Social Science Computer Review, 9 (2), 246-275,
1991.
[6] S. Star, and K. Ruhleder, “Steps toward an ecology of infrastructure: Design and access for
large information spaces”, Information Systems Research, 7 (1), 111-134, 1996.
[7] D. De Roure and M. Atkinson, “Realizing the Power of Data-Intensive Research”
[8] I. Gorton, P. Greenfield, A. Szalay and R. Williams, “Data Intensive Computing in the 21st
Century”, Computer, 41 (4), 30-32, April 2008.
[9] L. Bannon, and S. Bodker, “Constructing Common Information Spaces” in: J. Hughes, T.
Rodden, W. Prinz and K. Schmidt (eds.), ECSCW’97: Proceedings of the Fifth European
Conference on Computer-Supported Cooperative Work, Kluwer Academic Publishers, 1997.
[10] J. Gray, D. Liu, A. Szalay, D. DeWitt and G. Heber, “Scientific Data management in the Coming
Decade”, SIGMOD Record, 34 (4), 35-41, Dec. 2005
[11] National Science Board, Long-Lived Digital Data Collections: Enabling Research and Education
in the 21st Century, National Science Foundation, 2005
Retrieved from
http://www.nsf.gov/pubs/2005/nsb0540/
June 2011
[12] What is Data Archiving? [Definition]
Retrieved from
http://searchdatabackup.techtarget.com/definition/data-archiving
June 2011
[13] P. Graham, The Digital Research Library: Tasks and Commitments, 1995.
Retrieved from
www.csdl.tamu.edu/DL95/papers/graham/graham.html
June 2011
[14] NSF’ Cyberinfrastructure Vision for 21st Century Discovery, NSF Cyberinfrastructure Council,
March 2007.
Retrieved from
http://www.nsf.gov/pubs/2007/nsf0728/index.jsp
June 2011
[15] “Jim Gray on eScience: A Transformed Scientific Method”, in: The Fourth Paradigm: Data
Intensive Scientific Discover [2], xix-xxxiii.
[16] N. Paskin, “Digital Object Identifiers for scientific data”, Data Science Journal, 4, 12-20,
March 2005
[17] “Moving Large Volumes of Data Using Transportable Modules” in Oracle® Warehouse Builder
Data Modeling, ETL, and Data Quality Guide” 11g Release 2 (11.2)
Retrieved from
http://download.oracle.com/docs/cd/E14072_01/owb.112/e10935/trans_mod.htm
June 2011
[18] S. Kahn, “On the Future of Genomic Data”, Science, 331 (6018), 728-729.
[19] G. Adomavicius, J. Bockstedt, A. Gupta and R. Kauffman, “Understanding Patterns of
Technology Evolution: An Ecosystem Perspective”, in: System Sciences, 2006. HICSS '06.
Proceedings of the 39th Annual Hawaii International Conference on , vol.8, pp. 189a, 04-07 Jan. 2006.
[20] M. Stonebraker, J. Becla, D. DeWitt, K. Lim, D. Maier, O. Ratzesberger, S. Zdonik,
“Requirements for Science Data Bases and SciDB” in CIDR Perspectives, 2009
[21] R. Ikeda, J. Widom, “Panda: A System for Provenance and Data” in IEEE Data Engineering
Bulletin, Special Issue on Data provenance, 33(3), September 2010
[22] “Reference Model for an Open Archival Information System (OAIS)” CCSDS 650.0-B-1, BLUE
BOOK, 2002
[23] T. Strang, C. Linnhoff-Popien, “A Context Modeling Survey” in Workshop on Advanced Context
Modelling, Reasoning and Management associated with the Sixth International Conference on
Ubiquitous Computing (UbiComp 2004), Nottingham/England
[24] A. Carpi, A. Egger, “Data: Uncertainty, Error, and Confidence”
[25] C. Batini, M. Scannapieco, “Data Quality: Concepts, Methodologies and Techniques” Springer
[26] V. Van den Eynden, L. Corti, M. Woollard, L. Bishop and L. Horton, Managing and Sharing
Data: Best Practices for Researchers. UK Data Archive, University of Essex, May 2011
[27] P. Lord, A. Macdonald, “Data Curation for e-Science in the UK”, Report, 2004
[28] Microsoft Draft Roadmap “Towards 2020 Science”
Retrieved from: http://www. Jyu.edu.cn
[29] C. Hansen, C. Johnson, V. Pascucci, C. Silva, “Visualization for Data-Intensive Science” in “The
Fourth Paradigm” [2]
[30] K. Thearling, “An Introduction to Data Mining”
http://www.thearling.com/dmintro/dmintro_2.htm
[31] E. Wenger, “Communities of practice: A brief introduction”, 2006
www.ewenger.com/theory/
[32] M. Fraser, “Virtual Research Environments: Overview and Activity” in Ariadne issue 44 July
2005
[33] N. Wilkins-Diehr, “Special Issue: Science Gateways – Common Community Interfaces to Grid
Resources”, Concurrency and Computation: Practice and Experience, 19 (6) 743-749, April
2007.
[34] A. Poggi, D. Lembo, D. Calvanese, G. de Giacomo, M. Lenzerini, R. Rosati, “Linking Data to
Ontologies”
[35] T. R. Gruber,”Towards principles for the design of ontologies used for knowledge sharing” Intl.
Journal of Human-Computer Studies 43 (5/6) 1993
[36] S. Bloehdorn, P. Haase, Z. Huang, Y. Sure, J. Voelker, F. van Harmelen, R. Studer, “Ontology
Management”, in J. Davies et al. (eds), Semantic Knowledge Management, Springer-Verlag
Berlin Heidelberg 2009
[37] A. Das, W. Wu, D. McGuinness, “Industrial Strength Ontology Management”, Stanford
Knowledge Systems Laboratory Technical Report KSL-01-09, 2001; in Proc. of the
International Semantic Web Working Symposium, Stanford, CA, July 2001
[38] NeOn Project Deliverable D1.1.1: Networked Ontology Model
[39] Barry Smith, “Ontology” Preprint version of chapter “Ontology” in L. Floridi (ed.), Blackwell
Guide to the Philosophy of Computing and Information, Oxford: Blackwell, 2003
[40] I. Taylor, D, Gannon, and M. Shields (eds), “Workflows for e-Science”, Springer, ISBN 978-1-
84628-519-6
[41] C. Goble and D. De Roure, “The Impact of Workflows on Data-centric Research” in [2]
[42] C. Thanos, Interoperability: A Holistic Approach, Manuscript, 2010.
[43] G. Wiederhold, “Mediators in the architecture of future information systems”, Computer, 25
(3), 38-49, March 1992.
[44] M. Stollberg, E. Cimpian, A. Mocan and D. Fensel, “A Semantic Web Mediation
Architecture”, in: Proceedings of the 1st Canadian Semantic Web Working Symposium (CSWWS
2006), 22 pp., Springer, 2006.
[45] O. Hanseth and E. Monteiro, Understanding Information Infrastructure, Manuscript, 1998.
[46] N. Paskin, “Digital Object Identifiers for Scientific Data”, Paper presented at the 19th
International CODATA Conference, Berlin, 2004
[47] G. Rust, M. Bide, “The <indecs>Metadata Framework: Principles, model and data dictionary”
http://www.indecs.org/pdf/framework.pdf
[48] G. Garrity, C. Lyons, “Future-proofing Biological Nomenclature”. Omics, 2003, Volume 7,
Number 1
[49] Australian National Data Service, “Data Citation Awareness”
http://www.ands.org.au/guides/data-citation-awareness.html
[50] M. Altman and G. King “A Proposed Standard for the Scholarly Citation of Quantitative
Data”, in D-Lib Magazine March/April 2007
[51] U. Keller, U. Lara, H. Lausen, A. Polleres, D. Fensel, “Automatic Location of Services”, Proc. of
2nd European Semantic Web Conference (ESWC), 2005
[52] U. Keller, R. Lara (eds), WSMO Web Service Discovery” Deliverable D5.1vo.1, Nov. 12, 2004
[53] J. Bleiholder, F. Naumann, “Data Fusion” ACM Computing Surveys, Vol. 41, No. 1, Dec. 2008
[54] M. Lenzerini, “Data Integration: A Theoretical Perspective” Proc. PODS 2002
[55] D. Nickul, “A Modelling Methodology to Harmonize Disparate Data Models” (A White Paper to
aid intelligence gathering capabilities for the Office of Homeland Security)
[56] “ASW&Data Harmonization” Uniscap Symposium 2009
[57] M. Franklin, A. Halevy, D. Maier, “From Databases to Dataspaces: A New Abstraction for
Information Management” in ACM SIGMOD Record, December 2005
[58] C. Bizer, T. Heath, T. Berners-Lee, "Linked Data", in Special Issue on Linked Data, International Journal on Semantic Web and Information Systems
[59] V. Van den Eynden, L. Corti, M. Woollard, L. Bishop and L. Horton “Managing and Sharing
Data” Published by: UK Data Archive, University of Essex, Third edition 2011
[60] J. Birnholtz, M. Bietz, “Data at Work: Supporting Sharing in Science and Engineering” in
GROUP’03, Nov 9-12, 2003, Sanibel Island, Florida, USA
[61] R. Whitley, “The Intellectual and Social Organization of the Sciences” Oxford University Press,
Oxford, 2000
[62] A. Zimmerman, “Data Sharing and Secondary Use of Scientific Data: Experiences of
Ecologists”, Unpublished Dissertation, Information and Library Studies, University of Michigan,
Ann Arbor, 2003
[63] N. Van House, M. Butler, and L. Schiff, "Cooperative knowledge work and practices of trust: Sharing environmental planning data sets", in Proc. of CSCW 1998
[64] J. Haas, H. Samuels, and B. Simmons, “Appraising the records of modern science and
technology: A guide”, MIT Press, Cambridge, MA, 1985
[65] G. Chin Jr and C. Lansing, “Capturing and Supporting Contexts for Scientific Data Sharing via
Biological Science Collaboratory” in CSCW’04, 2004, Chicago, Illinois, USA
[66] "Data-Intensive Research Workshop Report", workshop run by the e-Science Institute, University of Edinburgh, 15-19 March 2010
[67] P. Carlile, “A Pragmatic View of Knowledge and Boundaries: Boundary Objects in New Product
Development” in Organization Science Vol. 13, No. 4, 2002
[68] S. Star, “The structure of ill-structured solutions: Boundary Objects and heterogeneous
distributed problem solving” in Readings in Distributed Artificial Intelligence, 1989
[69] “Reference Model for an Open Archival Information System (OAIS)” CCSDS 650.0-B-1, BLUE
BOOK, 2002
[70] I. Foster, Y. Zhao, I. Raicu, S. Lu, "Cloud Computing and Grid Computing 360-Degree Compared", in: Proc. IEEE Grid Computing Environments Workshop, IEEE Press, 2008
[71] "What is Cloud Computing", Whatis.com.
http://searchsoa.techtarget.com/sDefinition/0,,sid26_gci12788,00html
[72] M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing", Technical Report No. UCB/EECS-2009-28
[73] "The Future of Cloud Computing", Expert Group Report, Version 1.0, European Commission
[74] D. Bernstein, E. Ludvigson, K. Sankar, S. Diamond, M. Morrow, “Blueprint for the Intercloud –
Protocols and Formats for Cloud Computing Interoperability” in Proc. 4th International
Conference on Internet and Web Applications and Services, 2009
[75] D. DeWitt, M. Stonebraker, "MapReduce: A major step backwards", craig-henderson.blogspot.com. Retrieved 2008-08-27.
[76] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in: OSDI 2004, USENIX Symposium on Operating System Design and Implementation, 137-150. Retrieved from http://www.usenix.org/events/osdi04/tech/dean.html, June 2011
[77] P. Bonatti, D. Olmedilla, “Rule-Based Policy Representation and Reasoning for the Semantic
Web”
[78] N. Damianou, N. Dulay, E. Lupu, M. Sloman, "The Ponder Policy Specification Language", in Proc. of Workshop on Policies for Distributed Systems and Networks, Springer-Verlag, LNCS 1995, Bristol, UK, 2001
[79] A. Uszok et al., "KAoS policy and domain services: Towards a description-logic approach to policy representation, deconfliction, and enforcement", in POLICY, 2003
[80] L. Kagal, T. Finin, A. Joshi, "A Policy Language for a Pervasive Computing Environment", in Proc. of IEEE Fourth International Workshop on Policy (POLICY 2003)
[81] P. David, M. den Besten, R. Schroeder, “Will e-Science be Open Science?” in World Wide
Research, W. Dutton, P. Jeffreys (eds), MIT Press
[82] OECD, "Recommendation of the Council concerning Access to Research Data from Public Funding", C(2006)184, Dec. 14, 2006.
[83] A. Fitzgerald, B. Fitzgerald, K. Pappalardo, “The Future of Data Policy” in The Fourth Paradigm
[2]