
Dr. Ross King
AIT Austrian Institute of Technology GmbH

SCAPE/OPF Executive Seminar: Managing Digital Preservation
The Hague, April 2, 2014

SCAPE Tools and Solutions

Outline

• SCAPE Project
• SCAPE Tools
• SCAPE Solutions
• SCAPE and Preservation Management
• SCAPE Additional Information
  • Online Resources
  • Events
  • Contact Information

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

SCAPE – what is it about?

• Planning and executing computing-intensive digital preservation processes such as the large-scale ingestion, characterisation or migration of large (multi-Terabyte) and complex data sets
• SCAPE results include
  • Preservation scenarios
  • Preservation tools
  • Preservation workflows
  • Preservation infrastructure
  • Preservation best-practices

SCAPE is a follow-up to the highly successful FP6 IP Planets.

SCAPE Project Data

• Project instrument: FP7 Collaborative Project
• 20 Partners from 11 countries
• 6th Call
  • Objective ICT-2009.4.1: Digital Libraries and Digital Preservation
  • Target outcome (a) Scalable systems and services for preserving digital content
• 10th Call
  • Objective ICT-2013.11.4: Supplements to Strengthen Cooperation in ICT R&D in an Enlarged European Union
• Duration: 44 months
  • February 2011 – September 2014
• Budget: 12.0 Million Euro
  • Funded: 9.2 Million Euro

SCAPE Consortium


SCAPE Tools


Scalable Tools

• Toolwrapper
  • Application that adapts existing tools to the SCAPE Platform
  • https://github.com/openplanets/scape-toolwrapper
  • Enhances wrapped tools
    • Standard naming scheme for CC, AS and QA tools
    • Standard invocation method (CLI)
    • Debian packages for easy deployment on the cluster
    • Support for data streaming (useful for Hadoop jobs)
  • Generates Preservation Components
    • Taverna workflows with embedded metadata for easy discovery
    • Automatic publication of components on myExperiment (to support discoverability)
    • Standard ports to enable composition of Preservation Components (based on well-defined component profiles: CC, AS & QA)
• Digital Preservation Toolkit
  • Software suite that contains a large set of DP tools
    • 77 operations in total
  • Easy to deploy on Linux machines (via apt-get)
    • apt-get install digital-preservation-tools

Scalable Tools

• Jpylyzer
  • JP2 (JPEG 2000 Part 1) validator and properties extractor
  • http://openplanets.github.io/jpylyzer/
• Pagelyzer
  • Suite of tools for detecting changes in web pages and their rendering
  • http://openplanets.github.io/pagelyzer/
• xcorrSound
  • Suite of tools for automated quality assurance of audio migration processes
  • https://github.com/openplanets/scape-xcorrsound
• Matchbox
  • Duplicate image detection tool
  • http://openplanets.github.io/matchbox/
• ToMaR
  • Supports the scalable execution of preinstalled tools or other applications
  • Wraps command-line invocation of a tool into a MapReduce program
  • https://github.com/openplanets/scape/tree/master/pt-mapred
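The idea of wrapping a command-line invocation into a map step, as described for ToMaR, can be illustrated with a minimal Python sketch. This is not ToMaR's code or API: the function names, the `{input}` placeholder convention and the one-path-per-line record format are assumptions made purely for illustration. Each input record is a file path; the map function fills the path into a command template, runs the tool, and returns the path together with the tool's exit code and output.

```python
import shlex
import subprocess


def build_command(template, input_path):
    """Split the template into argv and fill the hypothetical {input} placeholder."""
    return [part.replace("{input}", input_path) for part in shlex.split(template)]


def map_record(template, input_path):
    """Run the wrapped tool on one record; return (path, exit code, stdout)."""
    result = subprocess.run(
        build_command(template, input_path),
        capture_output=True,
        text=True,
    )
    return input_path, result.returncode, result.stdout.strip()


def run_mapper(template, lines):
    """Apply map_record to every non-empty input line (one file path per line)."""
    return [map_record(template, line.strip()) for line in lines if line.strip()]
```

In a streaming-style job, `run_mapper` would be fed the lines of an input split, for example `run_mapper("file {input}", sys.stdin)`. The template is split before substitution so that paths containing spaces stay a single argument.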

Planning and Watch Tools

• SCOUT: an automated preservation watch system
  • Enables planning tools and decision makers to monitor the world and the organisation
  • Collects relevant knowledge and enables automated notification
  • Open and extensible
• c3po: scalable content profiling
  • c3po analyses characterisation data based on FITS
  • Scale-out MongoDB (100k/min/node)
  • Visual drill-down and well-documented profile
  • Automated sample selection
• PLATO 4.4: scalable preservation planning
  • www.ifs.tuwien.ac.at/dp/plato
  • Technology upgrade: refactored, rebuilt, standardised, tested
  • New features
    • Groups allow collaborative planning
    • Integration of control policies for group
    • Quality domain – measures
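The kind of content profiling described for c3po can be illustrated with a small Python sketch. It does not use c3po's data model or code: the record layout (dicts with `format` and `size` keys) and the sample-selection rule (largest object per format) are assumptions chosen only to show what aggregating characterisation data and selecting samples might look like.

```python
from collections import Counter


def profile(records):
    """Summarise hypothetical characterisation records (dicts with 'format' and 'size')."""
    formats = Counter(r["format"] for r in records)
    total_size = sum(r["size"] for r in records)
    return {"count": len(records), "total_size": total_size, "formats": formats}


def sample_largest_per_format(records):
    """Pick one representative per format: here, simply the largest object."""
    best = {}
    for r in records:
        current = best.get(r["format"])
        if current is None or r["size"] > current["size"]:
            best[r["format"]] = r
    return best
```

A real profile would draw on many more properties than format and size; the point is only that a collection-level summary and a small set of representative samples can both be derived from per-object characterisation output.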

Repositories

• Fedora 4.0.0
  • All REST, no SOAP
  • RDF as first class objects
  • JCR 2.0 Implementation (ModeShape)
  • Infinispan distributed NoSQL datastore
• RODA
  • KEEP Solutions’ open source repository
  • Implements all SCAPE APIs
• Rosetta
  • Ex Libris’ commercial long-term preservation system
  • Implements SCAPE Data Connector API

SCAPE Architecture

SCAPE Platform

[Architecture diagram. Legend: SCAPE Components; 3rd Party Components; 3rd Party Components with SCAPE contributions. Labels: nodes 1, 2, ... n; HDFS; Hadoop; Pig; ToMaR; STAGER; LOADER; Fedora 4; Rosetta; RODA; Taverna; Data Connector API; SCAPE APIs; PPL; toolspec; Digital Objects.]

SCAPE Platform + Preservation Components

[Architecture diagram. Legend: SCAPE Components; 3rd Party Components; 3rd Party Components with SCAPE contributions. Labels: nodes 1, 2, ... n; HDFS; Hadoop; Pig; ToMaR; STAGER; LOADER; Fedora 4; Rosetta; RODA; Taverna; Data Connector API; SCAPE APIs; PPL; toolspec; Tool wrapper; Components; Digital Objects; Preservation Tools.]

SCAPE Planning and Watch

[Architecture diagram. Legend: SCAPE Components; 3rd Party Components; 3rd Party Components with SCAPE contributions. Labels: nodes 1, 2, ... n; HDFS; Hadoop; Pig; ToMaR; STAGER; LOADER; Fedora 4; Rosetta; RODA; PLATO 4; Taverna; Data Connector API; Report API; Plan Management API; SCOUT; SCAPE APIs; PPL; toolspec; Tool wrapper; Components; Digital Objects; Preservation Tools.]

SCAPE Solutions

see also http://wiki.opf-labs.org/display/SP/SCAPE+Stories

Migration: Large Scale Image Migration

• User Story
  As a curator of image files, I need a digital preservation system that can migrate a large number of images from one format to another, ensuring that the migrated images conform to our institutional profile, that no image data is lost and that the migration is cost effective (saving storage for example).
• SCAPE Solution
  • SCAPE Platform
  • ImageMagick (with SCAPE toolspec description)
  • Jpylyzer
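The migrate-then-validate step in this solution can be sketched in Python for a single file. This is an illustrative assumption, not the SCAPE workflow itself: the helper names are hypothetical, ImageMagick is invoked through its `convert` command and jpylyzer through its command-line tool, and the validity element in the jpylyzer XML report is matched by the prefix `isValid` because the exact element name may differ between jpylyzer versions. The sketch covers format validation only, not the profile-conformance or data-loss checks mentioned in the user story.

```python
import subprocess
import xml.etree.ElementTree as ET


def migration_command(src, dst):
    """ImageMagick invocation; the output format follows the dst file extension."""
    return ["convert", src, dst]


def validation_command(jp2_path):
    """jpylyzer invocation; the tool writes an XML report to stdout."""
    return ["jpylyzer", jp2_path]


def is_valid_report(xml_text):
    """Return True if the first isValid* element in the report has the text 'True'."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if local_name.startswith("isValid"):
            return (elem.text or "").strip() == "True"
    return False


def migrate_and_validate(src, dst):
    """Convert one image, then validate the result; needs both tools on PATH."""
    subprocess.run(migration_command(src, dst), check=True)
    report = subprocess.run(
        validation_command(dst), capture_output=True, text=True, check=True
    )
    return is_valid_report(report.stdout)
```

At scale, a call such as `migrate_and_validate("page.tif", "page.jp2")` would be executed once per image on the cluster nodes rather than in a local loop.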

Migration: Large Scale Audio Migration

• User Story
  As the owner of a large audio collection, I need a digital preservation system that can migrate large numbers of audio files from one format to another and ensure that the migration is a good and complete copy of the original.
• SCAPE Solution
  • SCAPE Platform
  • xcorrSound
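As the name xcorrSound suggests, checking that a migrated audio file is a good copy can be based on cross-correlating the two waveforms. The following pure-Python sketch shows that general idea only; it is not xcorrSound's algorithm or command-line interface, and the function names and the brute-force search over lags are assumptions for illustration.

```python
import math


def correlation_at_lag(a, b, lag):
    """Normalised correlation of a[i] with b[i + lag] over the overlapping samples."""
    pairs = [(a[i], b[i + lag]) for i in range(len(a)) if 0 <= i + lag < len(b)]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    norm = math.sqrt(sum(x * x for x, _ in pairs) * sum(y * y for _, y in pairs))
    return dot / norm if norm else 0.0


def best_match(a, b, max_lag):
    """Return (lag, score) with the highest correlation for |lag| <= max_lag."""
    return max(
        ((lag, correlation_at_lag(a, b, lag)) for lag in range(-max_lag, max_lag + 1)),
        key=lambda item: item[1],
    )
```

A score close to 1 at some lag indicates that the two sample sequences match up to a time offset and a gain factor; a low best score would flag the migrated file for inspection. Production tools typically compute the correlation with FFTs and in blocks rather than with this quadratic loop.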

Analysis: File Format Identification and Characterisation of Web Archives

• User Story
  As a web archive, I need a digital preservation system that can process both ARC and WARC files and identify the file formats of, and characterise, the items they contain, so that I can assess preservation risks and plan which tools will be required for access to those formats.
• SCAPE Solution
  • SCAPE Platform
  • ARC Unpacker
  • FITS Tool (with SCAPE toolspec description)

Quality Control: Comparison of Web Snapshots

• User Story
  In order to be confident that we have preserved a website, we need a digital preservation system that can automate the comparison of two web snapshots, for example a harvested copy and a previously harvested copy that has been manually verified as an accurate representation of the site. This will enable us to ensure that web content has been successfully harvested and to inform harvesting policies.
• SCAPE Solution
  • Pagelyzer
  • Hadoop Platform

Solving Preservation Problems the SCAPE Way

• Open Source Development
  • And/or implementation of open APIs
• Uniform Deployment
  • Use the SCAPE Toolspec+Toolwrapper to publish tools
    • As Advanced Package Tool (APT) packages
    • As SCAPE Components
• Preservation Planning
  • Use PLATO to test tools (as SCAPE Components) and make policy-based plans
• Process Modelling
  • Use Taverna to model preservation workflows
    • Taverna works directly with SCAPE components for experimental workflows
    • Taverna workflows can be converted to Hadoop/Pig workflows in some cases
• Hadoop Deployment
  • Use APT packages to deploy to a Hadoop environment
• Scalable Execution
  • SCAPE ToMaR can directly access tools through the toolspec

from digitalbevaring.dk

SCAPE and Preservation Management


The Wall

Research and Development
• Focus on innovation
• Services are prototypes
  • Unstable
  • Buggy
  • Maintenance pool limited to a few (or one) expert(s)

Production
• Focus on daily business needs
• Service availability is a priority
  • Services are stable
  • Enjoy a large maintenance pool

The Wall

[Diagram. Labels: Research and Development; Production.]

The Wall

[Diagram. Labels: Research and Development; Production; nodes 1, 2, ... n; HDFS; Hadoop; Pig; ToMaR; Fedora; Rosetta; RODA; Digital Objects.]

The Wall

[Diagram. Labels: nodes 1, 2, ... n; HDFS; Hadoop; Pig; ToMaR; STAGER; LOADER; Fedora; Rosetta; RODA; Data Connector API; Digital Objects.]

• Other problems with The Wall?
• How can we break through The Wall?

SCAPE Additional Information


Additional Resources of Interest

• Development Infrastructure
  • Code repository hosted by the Open Planets Foundation and GitHub
    • https://github.com/openplanets/scape/
  • Development Wiki
    • http://wiki.opf-labs.org/display/SP/Home
  • Experimental Workflows
    • http://www.myexperiment.org/search?query=SCAPE&type=all&commit=Search
• Publications
  • http://www.scape-project.eu/category/publication
• Public Deliverables
  • http://www.scape-project.eu/category/deliverable
• Tools
  • http://www.scape-project.eu/tools

SCAPE Events

• DL2014: Joint SCAPE/APARSEN Workshop
  • September 8, 2014, London
  • Registration: http://scape-future-formats-first.eventbrite.co.uk/

See http://www.scape-project.eu/events

SCAPE Contact Information

• http://www.scape-project.eu/
• Twitter: #scapeproject
• office@list.scape-project.eu
• Dr. Ross King
  AIT Austrian Institute of Technology GmbH
  Donau-City-Strasse 1
  A-1220 Wien

Thank you for your attention!

Questions?
