IMPACT Final Conference - Clemens Neudecker

The IMPACT Interoperability Framework: Workflows for OCR and beyond

Clemens Neudecker, KB National Library of the Netherlands

2nd IMPACT Conference, British Library, London, 24/25 October 2011



Page 2: IMPACT Final Conference - Clemens Neudecker

Background

> 20 individual software components for specific challenges

Prototyping new algorithms, improving commercial solutions

Different frameworks (C, C++, Java, etc.), platforms (Win/Linux)

Extensible with 3rd party applications

IMPACT Interoperability Framework (IIF)

Page 3: IMPACT Final Conference - Clemens Neudecker

Architecture

Java

Web Services

Apache

Taverna

Open Source available on https://github.com/impactcentre

Free Hackathon 14/15 November, University of Manchester: http://impact-mygrid-taverna-hackathon.wikispaces.com/

Page 4: IMPACT Final Conference - Clemens Neudecker

Integration

Only requirement: command line executable

Generic command line wrapper produces web service

Web service exposed as workflow module with documentation

Quick & easy integration: developers can focus on their application and have to worry less about integration = higher quality software
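The wrapping idea on this slide can be illustrated with a minimal sketch. It is not the IIF wrapper itself: the `make_service` helper and the `{text}` placeholder convention are assumptions made for illustration, and the step of exposing the resulting function over HTTP is left out. A small Python one-liner stands in for a real command line tool.

```python
import subprocess
import sys


def make_service(argv_template):
    """Turn a command line template into a callable 'service' function.

    argv_template is a list of arguments; parts such as "{text}" are
    placeholders filled from keyword parameters (hypothetical convention).
    """
    def service(**params):
        # Fill the placeholders, run the tool, and return what it printed.
        argv = [part.format(**params) for part in argv_template]
        done = subprocess.run(argv, capture_output=True, text=True, check=True)
        return done.stdout
    return service


# Stand-in for a real tool: a one-liner that upper-cases its first argument.
upper = make_service(
    [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", "{text}"]
)
result = upper(text="hello ocr")
```

Because the tool is only ever reached through its command line, the tool author does not need to know anything about the service layer.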

Page 5: IMPACT Final Conference - Clemens Neudecker

Workflows

OCR workflow = data pipeline

Building blocks = processing modules (nodes)

Integration = interaction between nodes (mashups)

Collaboration with
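The view of a workflow as a data pipeline built from processing nodes can be sketched in a few lines. The three node functions below are hypothetical stand-ins, not IMPACT modules; they only show how the output of one node becomes the input of the next.

```python
from functools import reduce


def pipeline(*nodes):
    """Compose processing nodes so each node's output feeds the next node."""
    return lambda data: reduce(lambda value, node: node(value), nodes, data)


# Hypothetical stand-ins for real modules (binarisation, OCR, post-correction).
def binarise(page):
    return {**page, "binarised": True}


def recognise(page):
    return {**page, "text": "Tbe quick brown fox"}


def correct(page):
    return {**page, "text": page["text"].replace("Tbe", "The")}


ocr_workflow = pipeline(binarise, recognise, correct)
output = ocr_workflow({"image": "page-0001.tif"})
```

Swapping one node for another leaves the rest of the pipeline untouched, which is the property that makes modules reusable across workflows.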

Page 6: IMPACT Final Conference - Clemens Neudecker
Page 7: IMPACT Final Conference - Clemens Neudecker

Evaluation features

Text comparison of result with ground truth, using Levenshtein distance method

Word evaluation (with reading order)

Layout based comparison of result with ground truth, using the Page Analysis And Ground Truth Elements Framework
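As an illustration of the text comparison, here is a minimal sketch of the standard Levenshtein distance. It is not the IMPACT evaluation tool: the `error_rate` helper and the sample strings are assumptions added for the example. The same function works on characters (strings) and on words (lists of tokens).

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def error_rate(result, ground_truth):
    """Edit distance normalised by the length of the ground truth."""
    if len(ground_truth) == 0:
        return 0.0 if len(result) == 0 else 1.0
    return levenshtein(result, ground_truth) / len(ground_truth)


ground_truth = "The quick brown fox"
ocr_result = "Tbe quick brovvn fox"

char_errors = levenshtein(ocr_result, ground_truth)                  # character level
word_errors = levenshtein(ocr_result.split(), ground_truth.split())  # word level
```

Comparing token lists in sequence is a simple way to respect reading order at the word level: words that appear in the wrong order count as errors.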

Page 8: IMPACT Final Conference - Clemens Neudecker

Community

Web 2.0 style workflow registry

Ready-to-use and documented resources

Community of experts

Sharing of experiments and know-how

Page 9: IMPACT Final Conference - Clemens Neudecker

Local client: Taverna Workbench

Background: BioSciences

Developed and maintained by myGrid, UK

Open source

GUI for design and execution of web services & workflows

Page 10: IMPACT Final Conference - Clemens Neudecker

Remote client: Portal

SOAP/REST API

Remote execution of web services & workflows

Page 11: IMPACT Final Conference - Clemens Neudecker

Results Repository

Custom service for IMPACT: automatic storage of workflow outputs and provenance via WebDAV

Fully interoperable, since HTTP-based

Configurable storage of result sets

Create reports using POI
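Since WebDAV is HTTP-based, storing a workflow output amounts to an HTTP PUT. The sketch below only builds such a request with the Python standard library and does not send it. The repository URL, the run identifier and the folder layout are assumptions for illustration, not the actual IMPACT repository layout.

```python
import urllib.request


def build_put(base_url, workflow_run, filename, payload):
    """Build (but do not send) an HTTP PUT request that stores one output file."""
    url = f"{base_url.rstrip('/')}/{workflow_run}/{filename}"
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


# Hypothetical repository location and run identifier.
request = build_put(
    "http://repository.example.org/dav",
    "run-42",
    "page-0001.txt",
    "The quick brown fox".encode("utf-8"),
)
# Sending would be urllib.request.urlopen(request); it needs a reachable WebDAV
# server on which the parent collection (here "run-42") already exists.
```

Any HTTP client in any language can write to or read from such a store, which is the interoperability point made on the slide.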

Page 12: IMPACT Final Conference - Clemens Neudecker

Scalability

Central ESB proxy manages multiple service copies

Process parallelization, load distribution, failover, security

Served >2M requests

Throughput improvements of 94% with every additional instance

Tested on Dutch Supercomputing Cloud (“Enlighten Your Research”)
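Load distribution and failover across several copies of one service can be sketched as a round-robin proxy. This is a toy model, not the ESB used in IMPACT: the `RoundRobinProxy` class and the three instance functions are hypothetical, and a real proxy would forward HTTP requests rather than call Python functions.

```python
import itertools


class RoundRobinProxy:
    """Spread requests over service copies and skip copies that fail."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._cycle = itertools.cycle(range(len(self._instances)))

    def call(self, request):
        # Try each copy at most once, starting with the next one in turn.
        for _ in range(len(self._instances)):
            instance = self._instances[next(self._cycle)]
            try:
                return instance(request)
            except ConnectionError:
                continue  # fail over to the next copy
        raise ConnectionError(f"all {len(self._instances)} instances failed")


# Hypothetical service copies: two healthy, one unreachable.
def copy_a(request):
    return f"a:{request}"


def copy_down(request):
    raise ConnectionError("instance unreachable")


def copy_b(request):
    return f"b:{request}"


proxy = RoundRobinProxy([copy_a, copy_down, copy_b])
responses = [proxy.call(n) for n in range(4)]
```

Clients only ever talk to the proxy, so copies can be added or removed without changing any workflow.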

Page 13: IMPACT Final Conference - Clemens Neudecker

Outlook

Online service for testing/evaluation

Specification & Guidelines

Extending the scope:

Workflows for linguistic analysis: CLARIN

Workflows for preservation: SCAPE

Even better scalability: Map/Reduce

Supported by a community of developers & practitioners

Page 14: IMPACT Final Conference - Clemens Neudecker
Page 15: IMPACT Final Conference - Clemens Neudecker

“Anyway, the thing about progress is that it always seems greater than it really is.”

Ludwig Wittgenstein, Philosophical Investigations (quoting Johann Nestroy)

xkcd.com/688