Collaborative Workflow Development and Experimentation in the Digital Humanities

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation Clemens Neudecker, KB @cneudecker Zeki Mustafa Dogan, SUB-DL Sven Schlarb, ÖNB @SvenSchlarb Juan Garcés, GCDH @juan_garces eHumanities Seminar 2012 University of Leipzig 10-10-2012


A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation in the Digital Humanities. 2012 Leipzig eHumanities Seminar, 10 October 2012, Leipzig, Germany.

Transcript of Collaborative Workflow Development and Experimentation in the Digital Humanities

Page 1: Collaborative Workflow Development and Experimentation in the Digital Humanities

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation

Clemens Neudecker, KB @cneudecker

Zeki Mustafa Dogan, SUB-DL

Sven Schlarb, ÖNB @SvenSchlarb

Juan Garcés, GCDH @juan_garces

eHumanities Seminar 2012

University of Leipzig

10-10-2012

Page 2: Collaborative Workflow Development and Experimentation in the Digital Humanities

Idea

• Provide web-based versions of tools (web services)

• Package web services, data and documentation into ready-to-run “components” (encapsulation)

• Chain the components to create workflows via drag-and-drop operation

• Share and use workflows to re-run experiments and to demonstrate results

Page 3: Collaborative Workflow Development and Experimentation in the Digital Humanities

Background

• High degree of diversity in research topics, but also in the tools and frameworks being used

• Technical resources should be easy to use, well documented, accessible from anywhere

• Avoid reinventing the wheel

Page 4: Collaborative Workflow Development and Experimentation in the Digital Humanities

Requirements

• Interoperability = connect different resources

• Flexibility = easy to deploy and adapt

• Modularity = allow different combinations of tools

• Usability = simple to use for non-technical users

• Re-usability = easy to share with others

• Scalability = apt for large-scale processing

• Sustainability = resources simple to preserve

• Transparency = tools evaluated separately

• Distributed development and deployment

Page 5: Collaborative Workflow Development and Experimentation in the Digital Humanities

Interoperability Framework (IIF)

• Modules:

- Java Wrapper for command line tools

- Web Services (incl. format converters)

- Taverna Workflow Engine

- Client interfaces

- Repository connectors

Page 6: Collaborative Workflow Development and Experimentation in the Digital Humanities

Sources

https://github.com/impactcentre/interoperability-framework

Page 7: Collaborative Workflow Development and Experimentation in the Digital Humanities

IIF Command Line Wrapper

• Java project, built with Maven 2

• Creates a web service project from a given tool description (XML)

• Web service exposes SOAP & REST endpoints and Java API interface

• Requirements: command line call, no direct user interaction
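
The wrapping idea on this slide can be sketched as follows. This is a minimal illustration in Python, not the IIF code (which is a Java project driven by an XML tool description). The dictionary `TOOL_DESCRIPTION`, the `convert` command line in it, and the functions `build_command` and `run_tool` are all hypothetical names chosen for the example.

```python
import shlex
import subprocess
from string import Template

# Hypothetical tool description; the IIF uses an XML file for this purpose.
TOOL_DESCRIPTION = {
    "name": "image-convert",
    "command": "convert ${input} -resize ${size} ${output}",
    "defaults": {"size": "50%"},
}


def build_command(description, params):
    """Fill the command template and split it into an argument list."""
    values = {**description.get("defaults", {}), **params}
    # Quote every value so that spaces or shell metacharacters stay in one argument.
    quoted = {key: shlex.quote(str(value)) for key, value in values.items()}
    command_line = Template(description["command"]).substitute(quoted)
    return shlex.split(command_line)


def run_tool(description, params):
    """Run the tool without user interaction; return exit code, stdout, stderr."""
    completed = subprocess.run(
        build_command(description, params),
        capture_output=True,
        text=True,
        stdin=subprocess.DEVNULL,  # the wrapped tool must not wait for input
    )
    return completed.returncode, completed.stdout, completed.stderr
```

Calling `build_command(TOOL_DESCRIPTION, {"input": "in.tif", "output": "out.png"})` yields the argument list that `run_tool` would execute. A web service layer would sit on top of a function like `run_tool`, which is why the tool has to work from a single command line call.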

Page 12: Collaborative Workflow Development and Experimentation in the Digital Humanities

IIF Web Services

• Web services are described by a WSDL

• Input/output data structures

• Data is referenced by URL

• Annotations

• Default values
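
Two of the points above, data referenced by URL and default values, can be illustrated with a small request structure. This is a sketch under assumptions: the class `OcrRequest` and its fields are hypothetical and are not taken from an actual IIF WSDL.

```python
from dataclasses import asdict, dataclass
from urllib.parse import urlparse


@dataclass
class OcrRequest:
    """Hypothetical input structure of a wrapped tool."""

    input_url: str                # data is passed by reference, not inline
    language: str = "eng"         # default value, used when the caller omits it
    output_format: str = "text"   # default value

    def validate(self):
        """Reject anything that is not an absolute http(s) URL."""
        parts = urlparse(self.input_url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"input_url must be an http(s) URL: {self.input_url!r}")
        return self


request = OcrRequest(input_url="http://example.org/images/00000001.tif").validate()
payload = asdict(request)  # dictionary that could be serialised for a REST or SOAP call
```

Passing a URL instead of the file content keeps the messages small, at the cost that every service in a chain must be able to reach the referenced location.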

Page 13: Collaborative Workflow Development and Experimentation in the Digital Humanities

REST

Page 14: Collaborative Workflow Development and Experimentation in the Digital Humanities

SOAP

Page 15: Collaborative Workflow Development and Experimentation in the Digital Humanities

IIF Workflows

• What is a workflow? (Yahoo Pipes, etc.)

• Different kinds of workflows: for a single command, application, chain of processes

• Main benefits: encapsulation, reuse

• Workflows as “components”: include link to WS endpoint, sample input data and documentation = ready-to-use resource

• Web 2.0 workflow registry: myExperiment
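
The notion of a workflow as a chain of self-describing components can be sketched in a few lines. This is an illustration only and not how Taverna represents workflows. The `Component` class and the two text-processing steps are hypothetical stand-ins for calls to web service endpoints.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Component:
    """A ready-to-use resource: processing step plus documentation and sample data."""

    name: str
    documentation: str
    sample_input: str
    run: Callable[[str], str]  # stands in for a call to a web service endpoint


def run_workflow(components: List[Component], data: str) -> str:
    """Feed the output of each component into the next one."""
    for component in components:
        data = component.run(data)
    return data


# Two stand-in steps; real components would call remote services.
normalise = Component(
    "normalise", "Collapse runs of whitespace.", "a  b",
    lambda text: " ".join(text.split()),
)
lowercase = Component("lowercase", "Lower-case the text.", "ABC", str.lower)

result = run_workflow([normalise, lowercase], "Digital   Humanities")
```

Because each component carries its own sample input, a shared workflow can be re-run by someone else without first assembling test data.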

Page 17: Collaborative Workflow Development and Experimentation in the Digital Humanities

Why workflows?

• “In-silico experimentation”

• Good structuring of experiment setup:

- Challenge/Research question

- Dataset definition

- Processing with algorithms

- Evaluation/Provenance

- Presentation of results

• All this can be modelled into a workflow

Page 18: Collaborative Workflow Development and Experimentation in the Digital Humanities

Integration into Taverna

• Web Services (SOAP and REST)

• Command line tools (SH and SSH)

• Beanshells (can import Java libraries)

• R (statistics)

• Excel, CSV

• Additional service types can be added through dedicated plug-ins

Page 19: Collaborative Workflow Development and Experimentation in the Digital Humanities

Taverna flavours

• Workbench – local GUI client for Linux, Windows, OS X

• Command line tool – run workflows from the command line

• Server – Webapp with REST API and Java/Ruby client libs

• Web-Wf-Designer – JavaScript version for designing workflows in a browser

Page 20: Collaborative Workflow Development and Experimentation in the Digital Humanities

Workbench

Page 21: Collaborative Workflow Development and Experimentation in the Digital Humanities

Webapp

Page 22: Collaborative Workflow Development and Experimentation in the Digital Humanities

Workflow registry

Page 23: Collaborative Workflow Development and Experimentation in the Digital Humanities

Client interfaces

• Web service client: create a simple HTML form from a given web service description

• Taverna client: create a simple HTML form from a given Taverna workflow description

• Integration into production and presentation environments via iframes
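
Generating a form from a service description can be sketched as below. This is a hypothetical illustration and not the WS-client or T2-client code: the function `render_form`, its parameters and the example endpoint are assumptions, and a real client would read the input names from the WSDL or the Taverna workflow file.

```python
from html import escape


def render_form(service_name, endpoint, inputs):
    """Build a simple HTML form; `inputs` maps field names to default values."""
    rows = []
    for name, default in inputs.items():
        rows.append(
            f"<label>{escape(name)} "
            f'<input type="text" name="{escape(name)}" value="{escape(str(default))}">'
            "</label>"
        )
    return (
        f'<form method="post" action="{escape(endpoint)}">'
        f"<h1>{escape(service_name)}</h1>"
        + "".join(rows)
        + '<input type="submit" value="Run"></form>'
    )


form_html = render_form(
    "OCR service",
    "http://example.org/services/ocr",
    {"input_url": "", "language": "eng"},
)
```

The resulting markup is plain HTML, so it can be embedded in another page through an iframe as described on the slide.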

Page 24: Collaborative Workflow Development and Experimentation in the Digital Humanities

WS-client

Page 25: Collaborative Workflow Development and Experimentation in the Digital Humanities

T2-client

Page 26: Collaborative Workflow Development and Experimentation in the Digital Humanities

Repositories

• Accessible via web service API

- Fedora Commons

- WebDAV

- PRImA

Page 27: Collaborative Workflow Development and Experimentation in the Digital Humanities

Architecture

Page 28: Collaborative Workflow Development and Experimentation in the Digital Humanities

Examples

• Use case 1: OCR (IMPACT)

• Start: Images (scanned documents)

• Processing: OCR, NLP, Evaluation

• Result: Full text, Entities, Sentiments

Page 33: Collaborative Workflow Development and Experimentation in the Digital Humanities

Examples

• Use case 2: Preservation (SCAPE)

• Start: Document collection preparation

• Processing: Hadoop, Hive

• Result: Statistics

Page 36: Collaborative Workflow Development and Experimentation in the Digital Humanities

Reading image metadata

Steps: Jp2PathCreator (find, reading files from NAS), HadoopStreamingExiftoolRead

File paths found on the NAS:

/NAS/Z119585409/00000001.jp2
/NAS/Z119585409/00000002.jp2
/NAS/Z119585409/00000003.jp2
…
/NAS/Z117655409/00000001.jp2
/NAS/Z117655409/00000002.jp2
/NAS/Z117655409/00000003.jp2
…
/NAS/Z119585987/00000001.jp2
/NAS/Z119585987/00000002.jp2
/NAS/Z119585987/00000003.jp2
…
/NAS/Z119584539/00000001.jp2
/NAS/Z119584539/00000002.jp2
/NAS/Z119584539/00000003.jp2
…
/NAS/Z119599879/00000001.jp2
/NAS/Z119589879/00000002.jp2
/NAS/Z119589879/00000003.jp2
...

Output (identifier, value read from the image metadata):

Z119585409/00000001 2345
Z119585409/00000002 2340
Z119585409/00000003 2543
…
Z117655409/00000001 2300
Z117655409/00000002 2300
Z117655409/00000003 2345
…
Z119585987/00000001 2300
Z119585987/00000002 2340
Z119585987/00000003 2432
…
Z119584539/00000001 5205
Z119584539/00000002 2310
Z119584539/00000003 2134
…
Z119599879/00000001 2312
Z119589879/00000002 2300
Z119589879/00000003 2300
...

Data sizes: 1.4 GB, 1.2 GB

Duration: ~ 5 h + ~ 38 h = ~ 43 h

Collection: 60,000 books, 24 million pages
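
The metadata step can be pictured as a Hadoop Streaming style mapper: it receives one file path per input line and emits one tab-separated key/value line per page. The sketch below is an assumption about the shape of such a mapper, not the SCAPE component itself. The function names are hypothetical, and `read_width` is injected so that the example runs without Exiftool or access to the NAS.

```python
from pathlib import PurePosixPath


def page_id(path):
    """Turn '/NAS/Z119585409/00000001.jp2' into 'Z119585409/00000001'."""
    p = PurePosixPath(path.strip())
    return f"{p.parent.name}/{p.stem}"


def map_paths(lines, read_width):
    """Emit 'identifier<TAB>width' for every non-empty input line.

    `read_width` is any callable that returns the image width for a path;
    in a real job it would wrap a metadata reader such as Exiftool.
    """
    for line in lines:
        path = line.strip()
        if path:
            yield f"{page_id(path)}\t{read_width(path)}"


# Usage with made-up widths instead of real image files.
fake_widths = {
    "/NAS/Z119585409/00000001.jp2": 2345,
    "/NAS/Z119585409/00000002.jp2": 2340,
}
records = list(map_paths(fake_widths.keys(), fake_widths.get))
```

In a streaming job the `lines` argument would be standard input and each record would be written to standard output.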

Page 38: Collaborative Workflow Development and Experimentation in the Digital Humanities

Sequence file creation

Steps: HtmlPathCreator (find, reading files from NAS), SequenceFileCreator

File paths found on the NAS:

/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
/NAS/Z138682341/00000707.html
/NAS/Z138682341/00000708.html
/NAS/Z138682341/00000709.html
…
/NAS/Z178791257/00000707.html
/NAS/Z178791257/00000708.html
/NAS/Z178791257/00000709.html
…
/NAS/Z967985409/00000707.html
/NAS/Z967985409/00000708.html
/NAS/Z967985409/00000709.html
…
/NAS/Z196545409/00000707.html
/NAS/Z196545409/00000708.html
/NAS/Z196545409/00000709.html
...

Identifiers in the sequence file:

Z119585409/00000707
Z119585409/00000708
Z119585409/00000709
Z119585409/00000710
Z119585409/00000711
Z119585409/00000712

Data sizes: 1.4 GB, 997 GB (uncompressed)

Duration: ~ 5 h + ~ 24 h = ~ 29 h

Collection: 60,000 books, 24 million pages

Page 40: Collaborative Workflow Development and Experimentation in the Digital Humanities

HTML parsing

Step: HadoopAvBlockWidthMapReduce (input: SequenceFile, output: Textfile)

Input identifiers:

Z119585409/00000001
Z119585409/00000002
Z119585409/00000003
Z119585409/00000004
Z119585409/00000005
...

Map output:

Z119585409/00000001 2100
Z119585409/00000001 2200
Z119585409/00000001 2300
Z119585409/00000001 2400
Z119585409/00000002 2100
Z119585409/00000002 2200
Z119585409/00000002 2300
Z119585409/00000002 2400
Z119585409/00000003 2100
Z119585409/00000003 2200
Z119585409/00000003 2300
Z119585409/00000003 2400
Z119585409/00000004 2100
Z119585409/00000004 2200
Z119585409/00000004 2300
Z119585409/00000004 2400
Z119585409/00000005 2100
Z119585409/00000005 2200
Z119585409/00000005 2300
Z119585409/00000005 2400

Reduce output:

Z119585409/00000001 2250
Z119585409/00000002 2250
Z119585409/00000003 2250
Z119585409/00000004 2250
Z119585409/00000005 2250

Duration: ~ 6 h

Collection: 60,000 books, 24 million pages
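
The map and reduce phases shown on this slide can be imitated in plain Python. The sketch below is not the HadoopAvBlockWidthMapReduce code. It assumes that the HTML pages are hOCR files whose block elements carry a `title` attribute of the form `bbox x0 y0 x1 y1`, and the sample markup is made up for the example.

```python
import re
from collections import defaultdict
from statistics import mean

# hOCR-style bounding box: "bbox x0 y0 x1 y1" (assumed input format).
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def map_block_widths(page_id, html):
    """Map phase: emit (page_id, width) for every bounding box on the page."""
    for x0, _y0, x1, _y1 in BBOX.findall(html):
        yield page_id, int(x1) - int(x0)


def reduce_average(pairs):
    """Reduce phase: average all widths that share the same page identifier."""
    grouped = defaultdict(list)
    for page_id, width in pairs:
        grouped[page_id].append(width)
    return {page_id: round(mean(widths)) for page_id, widths in grouped.items()}


# Made-up page with four blocks of width 2100, 2200, 2300 and 2400.
sample_html = (
    "<div class='ocr_carea' title='bbox 100 50 2200 400'></div>"
    "<div class='ocr_carea' title='bbox 0 450 2200 800'></div>"
    "<div class='ocr_carea' title='bbox 50 850 2350 1200'></div>"
    "<div class='ocr_carea' title='bbox 0 1250 2400 1600'></div>"
)
mapped = list(map_block_widths("Z119585409/00000001", sample_html))
averages = reduce_average(mapped)
```

With the four widths from the slide (2100, 2200, 2300, 2400) the average for the page is 2250, matching the reduce output shown above. On a cluster the grouping by key is done by the framework between the two phases, so the reducer only has to compute the mean.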

Page 42: Collaborative Workflow Development and Experimentation in the Digital Humanities

Analytic Queries

Steps: HiveLoadExifData & HiveLoadHocrData

Duration: ~ 6 h

Collection: 60,000 books, 24 million pages

The files jp2width and htmlwidth are loaded into Hive tables:

CREATE TABLE jp2width(jid STRING, jwidth INT)

jid                  jwidth
Z119585409/00000001  2250
Z119585409/00000002  2150
Z119585409/00000003  2125
Z119585409/00000004  2125
Z119585409/00000005  2250

CREATE TABLE htmlwidth(hid STRING, hwidth INT)

hid                  hwidth
Z119585409/00000001  1870
Z119585409/00000002  2100
Z119585409/00000003  2015
Z119585409/00000004  1350
Z119585409/00000005  1700

Page 43: Collaborative Workflow Development and Experimentation in the Digital Humanities

Analytic Queries

Step: HiveSelect

Duration: ~ 6 h

Collection: 60,000 books, 24 million pages

Table jp2width:

jid                  jwidth
Z119585409/00000001  2250
Z119585409/00000002  2150
Z119585409/00000003  2125
Z119585409/00000004  2125
Z119585409/00000005  2250

Table htmlwidth:

hid                  hwidth
Z119585409/00000001  1870
Z119585409/00000002  2100
Z119585409/00000003  2015
Z119585409/00000004  1350
Z119585409/00000005  1700

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

Result:

jid                  jwidth  hwidth
Z119585409/00000001  2250    1870
Z119585409/00000002  2150    2100
Z119585409/00000003  2125    2015
Z119585409/00000004  2125    1350
Z119585409/00000005  2250    1700
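
The join can be tried out locally on the five sample rows from the slide. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, which is an assumption made for the sake of a runnable example: the column type is written as `TEXT` instead of Hive's `STRING`, and an `ORDER BY` is added so that the row order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE jp2width(jid TEXT, jwidth INT)")
conn.execute("CREATE TABLE htmlwidth(hid TEXT, hwidth INT)")

# Sample rows as shown on the slide.
conn.executemany("INSERT INTO jp2width VALUES (?, ?)", [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
])
conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
    ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
])

# Same join as on the slide, with a fixed ordering.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
).fetchall()

for jid, jwidth, hwidth in rows:
    print(jid, jwidth, hwidth)
```

Each result row pairs the image width from the JP2 metadata with the width derived from the HTML for the same page, which is what makes a comparison of the two possible.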

Page 44: Collaborative Workflow Development and Experimentation in the Digital Humanities

Examples

• Use case 3: Curation (GDZ)

• Start: Get documents from repository

• Processing: Enrichment (OCR, Entities, GeoNames)

• Result: Online presentation

Page 52: Collaborative Workflow Development and Experimentation in the Digital Humanities

ROPEN (= Resource Oriented Presentation ENvironment)

Page 53: Collaborative Workflow Development and Experimentation in the Digital Humanities

Scalability

• Multiple options:

- Service parallelization

- Cloud

- Grid

- Hadoop
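
Of the options above, service parallelization is the simplest to sketch: independent items are sent to the same service concurrently. The example below is an illustration under assumptions, not part of the IIF. `process_page` is a hypothetical stand-in for one web service call, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_id):
    """Stand-in for one web service call; a real step would do network I/O."""
    return page_id, len(page_id)


def process_all(page_ids, max_workers=4):
    """Run the calls concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_page, page_ids))


results = process_all(["Z119585409/00000001", "Z119585409/00000002"])
```

Threads fit here because a web service call spends most of its time waiting for the network. For collections of the size shown in the SCAPE use case, a cluster framework such as Hadoop takes over the distribution instead.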

Page 54: Collaborative Workflow Development and Experimentation in the Digital Humanities

Compatibility

• Taverna and UIMA

• Taverna and Galaxy

• Taverna and Kepler

• Taverna and WebLicht

• Taverna and SEASR

Page 55: Collaborative Workflow Development and Experimentation in the Digital Humanities

But…

• Multi-layered approach increases complexity (debugging, maintenance)

• Diverse set of endpoints (OS, CPU, etc.)

• Multiple dependencies

• Shared responsibilities

• Authentication & Authorization

• Error handling / Fail-over / Monitoring

Page 56: Collaborative Workflow Development and Experimentation in the Digital Humanities

Demo(s)

Page 57: Collaborative Workflow Development and Experimentation in the Digital Humanities

Discussion

• Potential/use cases in DH?

• Tools/features to make available?

• Questions, comments or remarks?

Page 58: Collaborative Workflow Development and Experimentation in the Digital Humanities

Thank you!