2016 05-20-clariah-wp4


Page 1: 2016 05-20-clariah-wp4


WP4: the structured datahub: linked data, big and small

Auke Rijpma, [email protected]

May 20, 2016

Page 2: 2016 05-20-clariah-wp4


Today

▶ Data problem in ES(D)H
▶ Linked-data solution
▶ Demos: interacting with the triplestore

Page 3: 2016 05-20-clariah-wp4


Problems to solve

▶ Economic, social, and demographic historians have a long tradition of data-intensive research.
▶ Two issues:
  1. As databases grow bigger and more complex, working with them becomes more difficult. Examples: HSN, CamPop, NAPP.
  2. There are many small-to-medium-sized datasets that are isolated (they exist on one computer or in one repository, or are not harmonised with other datasets): the “long tail” of research data.
▶ Difficult to describe, share, and replicate research results.
▶ Equally important: difficult to answer questions that span more than one dataset (comparative or multilevel research).

Page 4: 2016 05-20-clariah-wp4


Disconnected data


Page 5: 2016 05-20-clariah-wp4


Disconnected data

Data Preparation

Common Motifs in Scientific Workflows: An Empirical Analysis

Daniel Garijo*, Pinar Alper†, Khalid Belhajjame†, Oscar Corcho*, Yolanda Gil‡, Carole Goble†

*Ontology Engineering Group, Universidad Politecnica de Madrid. {dgarijo, ocorcho}@fi.upm.es
†School of Computer Science, University of Manchester. {alperp, khalidb, carole.goble}@cs.manchester.ac.uk
‡Information Sciences Institute, Department of Computer Science, University of Southern California. [email protected]

Abstract—While workflow technology has gained momentum in the last decade as a means for specifying and enacting computational experiments in modern science, reusing and repurposing existing workflows to build new scientific experiments is still a daunting task. This is partly due to the difficulty that scientists experience when attempting to understand existing workflows, which contain several data preparation and adaptation steps in addition to the scientifically significant analysis steps. One way to tackle the understandability problem is through providing abstractions that give a high-level view of activities undertaken within workflows. As a first step towards abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna and Wings systems. Our analysis has resulted in a set of scientific workflow motifs that outline i) the kinds of data intensive activities that are observed in workflows (data oriented motifs), and ii) the different manners in which activities are implemented within workflows (workflow oriented motifs). These motifs can be useful to inform workflow designers on the good and bad practices for workflow development, to inform the design of automated tools for the generation of workflow abstractions, etc.

I. INTRODUCTION

Scientific workflows have been increasingly used in the last decade as an instrument for data intensive scientific analysis. In these settings, workflows serve a dual function: first as detailed documentation of the method (i.e. the input sources and processing steps taken for the derivation of a certain data item) and second as re-usable, executable artifacts for data-intensive analysis. Workflows stitch together a variety of data manipulation activities such as data movement, data transformation or data visualization to serve the goals of the scientific study. The stitching is realized by the constructs made available by the workflow system used and is largely shaped by the environment in which the system operates and the function undertaken by the workflow.

A variety of workflow systems are in use [10] [3] [7] [2] serving several scientific disciplines. A workflow is a software artifact, and as such once developed and tested, it can be shared and exchanged between scientists. Other scientists can then reuse existing workflows in their experiments, e.g., as sub-workflows [17]. Workflow reuse presents several advantages [4]. For example, it enables proper data citation and improves quality through shared workflow development by leveraging the expertise of previous users. Users can also re-purpose existing workflows to adapt them to their needs [4]. Emerging workflow repositories such as myExperiment [14] and CrowdLabs [8] have made publishing and finding workflows easier, but scientists still face the challenges of re-use, which amounts to fully understanding and exploiting the available workflows/fragments. One difficulty in understanding workflows is their complex nature. A workflow may contain several scientifically-significant analysis steps, combined with various other data preparation activities, and in different implementation styles depending on the environment and context in which the workflow is executed. The difficulty in understanding causes workflow developers to revert to starting from scratch rather than re-using existing fragments.

Through an analysis of the current practices in scientific workflow development, we could gain insights on the creation of understandable and more effectively re-usable workflows. Specifically, we propose an analysis with the following objectives:

1) To reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence.

2) To identify workflow abstractions that would facilitate understandability and therefore effective re-use.

3) To detect potential information sources and heuristics that can be used to inform the development of tools for creating workflow abstractions.

In this paper we present the result of an empirical analysis performed over 177 workflow descriptions from Taverna [10] and Wings [3]. Based on this analysis, we propose a catalogue of scientific workflow motifs. Motifs are provided through i) a characterization of the kinds of data-oriented activities that are carried out within workflows, which we refer to as data-oriented motifs, and ii) a characterization of the different manners in which those activity motifs are realized/implemented within workflows, which we refer to as workflow-oriented motifs. It is worth mentioning that, although important, motifs that have to do with scheduling and mapping of workflows onto distributed resources [12] are out the scope of this paper.

The paper is structured as follows. We begin by providing related work in Section II, which is followed in Section III by brief background information on Scientific Workflows, and the two systems that were subject to our analysis. Afterwards we describe the dataset and the general approach of our analysis. We present the detected scientific workflow motifs in Section IV and we highlight the main features of their distribution

Fig. 3. Distribution of Data-Oriented Motifs per domain

Fig. 4. Distribution of Data Preparation motifs per domain

databases and shipping data to necessary locations for analysis.

The impact of the environmental difference of Wings and Taverna on the workflows is also observed in the workflow-oriented motifs (Figure 7). Stateful invocations motifs are not present in Wings workflows, as all steps are handled by a dedicated workflow scheduling framework and the details are hidden from the workflow developers. In Taverna, the workflow developer is responsible for catering for various different invocation requirements of 3rd party services, which may include stateful invocations requiring execution of multiple consecutive steps in order to undertake a single function.

Regarding workflow-oriented motifs, Figure 8 shows that Human-interaction steps are increasingly used in scientific workflows, especially in the Biodiversity and Cheminformatics domains. Human interactions in Taverna workflows are handled either through external tools (e.g., Google Refine), facilitated via a human-interaction plug-in, or through simple local scripts (e.g., selection of configuration values from multi-choice lists). We have observed that non-trivial human interactions involving external tooling require a large number of workflow steps dedicated to deploying or configuring the external tools, resulting in very large and complex workflows. Wings workflows do not support human interaction steps.

Finally, the large proportion of the combination of Composite Workflows and Atomic Workflows motif in Figure 8 shows that the use of sub-workflows is an established best practice for modularizing functionality.

Fig. 5. Data Preparation Motifs in the Genomics Workflows

Fig. 6. Data-Oriented Motifs in the Genomics Workflows

VI. DISCUSSION

Our analysis shows that the nature of the environment in which a workflow system operates can bring-about obstacles against the re-usability of workflows.

A. Obfuscation of Scientific Workflows

Data-intensive scientific analysis could be large and complex with several processing steps corresponding to different phases of data analysis performed over various kinds of data. This complexity is exacerbated when the workflow operates in an open environment, like Taverna’s, and composes multiple third party services supporting different data formats and protocols. In such cases the workflow contains additional steps for coping with different format and protocol requirements. This obfuscation of the workflow burdens the documentation function and creates difficulty for the workflow re-user scientists, who seeks to have a complete understanding of the function and the details of the workflow that they are re-using in order to be able make scientific claims with their workflow based studies.

Obfuscation is caused by the abundance of data preparation steps, data movement operations and multi-step stateful invocations. One way to overcome obfuscation is to encapsulate


Page 6: 2016 05-20-clariah-wp4


The team

Rinke Hoekstra, Kathrin Dentler, Albert Meroño Peñuela (VU), Laurens Rietveld (Triply), Richard Zijdeman, Ashkan Ashkpour (IISH).

Page 7: 2016 05-20-clariah-wp4


Linked data as a solution

▶ To solve this, we use web-based linked data technology.
▶ Method of publishing data on the web so that it can be interlinked and given semantic meaning.
  + : Very flexible, sidesteps harmonisation, very expressive query language (SPARQL), cross-database queries (even on different servers), live querying, browseable database, ability to combine metadata and “codebook” with actual data.
  – : Not optimised for very big databases (10m+ observations → 100m triples), unfamiliar technology.
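
The “10m+ observations → 100m triples” estimate follows from how tabular data becomes linked data: each non-empty cell of a row turns into one subject-predicate-object triple. A minimal Python sketch, assuming roughly ten variables per observation (the ratio implied by the figures above); the base URI, variable names, and values are made up for illustration and are not the project's actual URI scheme or data:

```python
from typing import Dict, Iterator, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace; not the project's actual URI scheme.
BASE = "http://example.org/"


def row_to_triples(dataset: str, row_id: str, row: Dict[str, object]) -> Iterator[Triple]:
    """Yield one (subject, predicate, object) triple per non-empty cell."""
    subject = f"{BASE}{dataset}/observation/{row_id}"
    for variable, value in row.items():
        if value is None:
            continue  # missing values produce no triple
        yield (subject, f"{BASE}variable/{variable}", str(value))


# One made-up census-style observation with ten variables.
observation = {
    "age": 34, "sex": "f", "occupation": "weaver", "birthplace": "Utrecht",
    "marital_status": "married", "household_id": 17, "relation": "head",
    "year": 1891, "municipality": "Amsterdam", "religion": None,
}

triples = list(row_to_triples("census", "1", observation))
triples_per_observation = len(observation)  # upper bound: one triple per variable
estimated_total = 10_000_000 * triples_per_observation
print(len(triples), estimated_total)
```

Metadata and codebook triples come on top of this, so the real count depends on how richly each dataset is described.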

Page 8: 2016 05-20-clariah-wp4


Plan

▶ Offer updated versions of important databases for economic and social historians (HSN, Clio-Infra, Campop, Mosaic, Henry-Fleury, CMGPD, Opgaafrollen, etc.) in one place, as linked data, in an accessible way.
▶ Allow users to upload and share their datasets.
▶ Allow and encourage users to link datasets to other datasets or important standards to grow a graph of connected datasets.
▶ Provide direct, browseable, queryable access and tooling for visualisation and analysis.

Empower Individual Researchers

• Augment and link individual datasets according to best practices of the community or against colleagues

• Share machine-interpretable code books with fellow researchers

• Align codes and identifiers across datasets

• Publish standards-compliant, reusable datasets

Grow a giant graph of interconnected datasets

Page 9: 2016 05-20-clariah-wp4


Demos

Demonstrate triplestore and tools to interact with it.
QBer: http://qber.clariah-sdh.eculture.labs.vu.nl
Brwsr: http://data.clariah-sdh.eculture.labs.vu.nl/doc/resource/napp/observation/canada1891/62489
YASGUI: http://virtuoso.clariah-sdh.eculture.labs.vu.nl/yasgui-auth/
Grlc: http://grlc.clariah-sdh.eculture.labs.vu.nl/CLARIAH/wp4-queries/api-docs

Page 10: 2016 05-20-clariah-wp4


Demos

▶ QBer: “Connect your data to the cloud” (Rinke)
  1. Augment and link datasets according to best practice.
  2. Align codes and identifiers across datasets.
  3. Share machine-readable codebooks.
  4. Publish standardised datasets.
▶ http://qber.clariah-sdh.eculture.labs.vu.nl
▶ http://inspector.clariah-sdh.eculture.labs.vu.nl
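
Aligning codes and sharing machine-readable codebooks can be pictured as mapping each dataset's local codes onto shared identifiers, so that values from different datasets become comparable. A minimal Python sketch of that idea; the dataset names, local codes, and URIs are all hypothetical and do not reflect QBer's actual data model:

```python
from typing import Dict, Optional

# Hypothetical shared vocabulary for a "sex" variable.
SHARED = {
    "male": "http://example.org/code/sex/male",
    "female": "http://example.org/code/sex/female",
}

# Hypothetical machine-readable codebooks: local code -> shared concept.
CODEBOOKS: Dict[str, Dict[str, str]] = {
    "dataset_a": {"1": "male", "2": "female"},
    "dataset_b": {"M": "male", "V": "female"},
}


def align(dataset: str, local_code: str) -> Optional[str]:
    """Return the shared URI for a local code, or None if it is unmapped."""
    concept = CODEBOOKS.get(dataset, {}).get(local_code)
    return SHARED.get(concept) if concept is not None else None


# Two differently coded values resolve to the same identifier.
same = align("dataset_a", "2") == align("dataset_b", "V")
print(same)
```

Once both datasets point at the same identifier, a single query can select the matching observations from either of them.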

Page 11: 2016 05-20-clariah-wp4


Demos

▶ Brwsr: Lightweight linked data browser (Rinke).
  1. Browse linked data as if it were a web page.
▶ http://data.clariah-sdh.eculture.labs.vu.nl/doc/resource/napp/observation/canada1891/62489

Page 12: 2016 05-20-clariah-wp4


Demos

▶ Querying: Virtuoso, SPARQL, and YASGUI (Kathrin, Laurens)
  1. Virtuoso triplestore, currently with 1b+ triples (1 188 852 440).
  2. SPARQL as expressive query language.
  3. YASGUI as feature-rich editor.
▶ http://virtuoso.clariah-sdh.eculture.labs.vu.nl/yasgui-auth/
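
As an illustration of what such a query looks like outside the editor, the sketch below builds a SPARQL SELECT query and encodes it as an HTTP GET request following the SPARQL 1.1 Protocol, where the query text goes in the `query` parameter. The endpoint URL and the predicate URI are placeholders, not the project's actual endpoint or vocabulary, and nothing is sent over the network:

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute the SPARQL endpoint you have access to.
ENDPOINT = "http://example.org/sparql"

# Count observations per value of a hypothetical "country" predicate.
QUERY = """
PREFIX ex: <http://example.org/variable/>
SELECT ?country (COUNT(?obs) AS ?n)
WHERE { ?obs ex:country ?country . }
GROUP BY ?country
ORDER BY DESC(?n)
LIMIT 10
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as a GET URL (SPARQL 1.1 Protocol, 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching such a URL with an `Accept: application/sparql-results+json` header should return the result bindings as JSON on a standards-compliant endpoint.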

Page 13: 2016 05-20-clariah-wp4


Demos

▶ Grlc: git repository linked data API constructor (Albert).
  1. Builds Web APIs using SPARQL queries stored in git repositories.
  2. Store and share queries.
  3. Parametrise queries to overcome unfamiliarity with SPARQL.
▶ http://grlc.clariah-sdh.eculture.labs.vu.nl/CLARIAH/wp4-queries/api-docs
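
The idea behind parametrised queries can be sketched as a stored query with a named placeholder that is filled in from an API argument, so that users never have to edit the query text themselves. The Python sketch below is an illustration only, not Grlc's implementation; the `?_country` placeholder style is assumed to resemble Grlc's parameter convention, and the query, prefix, and function name are hypothetical:

```python
import re

# A stored query with one placeholder, ?_country (hypothetical vocabulary).
STORED_QUERY = """
PREFIX ex: <http://example.org/variable/>
SELECT ?obs ?age
WHERE {
  ?obs ex:country ?_country ;
       ex:age ?age .
}
"""


def fill_parameters(query: str, **params: str) -> str:
    """Replace each ?_name placeholder with a quoted, escaped string literal."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        # Escape backslashes and quotes so the value stays a single literal.
        value = params[name].replace("\\", "\\\\").replace('"', '\\"')
        return f'"{value}"'

    return re.sub(r"\?_(\w+)", substitute, query)


print(fill_parameters(STORED_QUERY, country="canada"))
```

Ordinary variables such as `?obs` and `?age` are left untouched because they lack the underscore prefix; only the placeholder is replaced.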

Page 14: 2016 05-20-clariah-wp4


Demos

Identify locally, extrapolate globally…?

                 canada       sweden
(Intercept)      3.616***     4.430***
                 (0.134)      (0.033)
log(gdppc)       0.036**      -0.070***
                 (0.018)      (0.004)
I(age^2)         -0.000***    -0.000***
                 0.000        0.000
age              0.007***     0.001***
                 0.000        0.000
R2               0.013        0.021
Adj. R2          0.012        0.021
Num. obs.        36201        275127
RMSE             0.142        0.102

[Figure: four panels plotting log(hiscam) against age and against log(gdppc), for Canada and Sweden.]

Page 15: 2016 05-20-clariah-wp4


Future work

▶ Continue working with researchers (Ivo Zandhuis, Ruben Schalk) on use cases, and spread the gospel.
▶ Increase data volume.
▶ Make sure the triplestore can handle the data volume.
▶ Make all components work together.
▶ Make a good (appealing and accessible) interface.

More info:
▶ http://datalegend.net
▶ https://github.com/clariah
▶ See(n) us at conferences: ESSHC, Posthumus, WHiSe, …