EUDOR: Digital Archive at the Publications Office of the European Union PREMIS ongoing implementation
Lina Bountouri
Collections and formats
• Legislative collections (EUR-Lex)
• 1M works, +120M files, 15 TB
• Content formats
• PDF/A + signed (OJ) XML
• XML (Formex), XHTML, JPEG
• Other XML
• General publications (EU-Bookshop)
• 700K files, 9 TB
• Content formats
• PDF/X, PDF/A
• JPEG for thumbnails
• Other: ePub, XML…
• Representation information
• Ontologies & other KOS, XML schemas, etc.
Automated Production Workflow of the EU Official Publications
[Workflow diagram, stage labels: IMMC XML + Digital Objects; RDF/OWL + Digital Objects in METS; METS SIPs]
Architecture: production system and archive
• Production system: Cellar
• Public
• Ingestion server not exposed
• Replicated to N read-only servers exposed to the internet
• Two parts
• Repository (Fedora)
• Triplestore (Virtuoso and Oracle)
• Ontology based on FRBR
• Normalized KOS
• Archive: Eudor
• Not public
• RODA/OAIS
• SIPs, DIPs: METS
• AIPs: E-ARK
• Fed mainly by Cellar
Representation information: descriptive metadata
• Available at <http://publications.europa.eu/mdr/>
• CDM ontology based on FRBR: RDF/OWL
• Work / Expression (language) / Manifestation (file format) / Item (file/s)
• Complexity: compound works, +24 languages (sometimes multiple), volume (massive updates), dependencies (works linked +40K)…
• KOS: SKOS/XML
• Over 70 tables, incl. Eurovoc, updated often, backwards compatibility, other clients…
• Access
• RESTful interface, dereferencing with same URI
• SPARQL endpoint: <http://publications.europa.eu/webapi/rdf/sparql/>
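As a minimal sketch of the access route above, the following builds (but does not send) a request against the SPARQL endpoint listed on the slide. The `cdm:` namespace URI and the property `expression_belongs_to_work` are assumptions about the CDM ontology and should be checked against the published files before use.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint as given on the slide.
SPARQL_ENDPOINT = "http://publications.europa.eu/webapi/rdf/sparql/"

# ASSUMPTION: the namespace and property name below are illustrative guesses
# at the CDM vocabulary, not confirmed by the slides.
QUERY = """
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT ?work ?expression
WHERE {
  ?expression cdm:expression_belongs_to_work ?work .
}
LIMIT 5
"""


def build_sparql_request(query: str, endpoint: str = SPARQL_ENDPOINT) -> Request:
    """Return a GET request for `query`, asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(QUERY)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would execute the query; that call is left out here so the sketch runs without network access.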
Representation information: technical metadata
• Ontology and files not public (yet)
• Cannot make statistical SPARQL queries on it
• Contains
• File format
• Fixity: checksum and algorithm
• Size
• …
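To make the fields above concrete, here is a small sketch that derives size and fixity (checksum plus algorithm) for a single file. The record layout and field names are hypothetical, not the actual technical metadata ontology, and file format identification is left out because it needs a dedicated tool.

```python
import hashlib
import os
import tempfile


def technical_metadata(path: str, algorithm: str = "sha256") -> dict:
    """Compute size and fixity information for one file."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return {
        "file_name": os.path.basename(path),  # hypothetical field names
        "size_bytes": os.path.getsize(path),
        "fixity": {"algorithm": algorithm, "checksum": digest.hexdigest()},
    }


# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".xml") as tmp:
    tmp.write(b"<doc/>")
record = technical_metadata(tmp.name)
os.remove(tmp.name)
print(record)
```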
Provenance/contextual metadata in the EUPO
• What drove us in this direction?
• Many events are taking place related to each resource
• Internal decision of our management
• New software for our digital archival repository (RODA/KEEP Solutions)
• OAIS/ISO 16363 Consultants for Auditing
• “Are there any metadata for the custody/context/provenance of your metadata/digital objects?”
Why have we chosen PREMIS?
• PROV-O
• PREMIS
• It was already implemented in the new version of Eudor (v3) - RODA
• It suited our provenance/contextual documentation needs
• It is widely implemented by libraries and archives
• It has a strong community of users
• It is based on RDF
Which provenance/contextual events will we encode?
• We had to limit the number of events to be encoded
• Decision was influenced by the number of triples
• Based partially on the LoC (Library of Congress) event list; new types of events are needed
• For all our WEMI objects
• Modelling almost completed/implementation in our systems
Which provenance/contextual events will we encode?
• Events in newCeres
• Transmission (metadata by the Data Providers)
• Reception (metadata by newCeres)
• Events in Cellar
• Validation (against METS, KOS, CDM Ontology)
• Operations in Cellar, such as creation, deletion, update, embargo/disembargo
• METS-export (could be useful for traceability and to avoid replication of provenance data in different environments)
• Provenance/contextual metadata from newCeres and Cellar will be stored as RDF triples in Cellar.
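Because the number of triples constrains which events can be encoded, a rough sketch of one event as triples, plus a back-of-the-envelope growth estimate, may help. The `http://example.org/` vocabulary, the five-triple event shape, and the figure of five events per work are all hypothetical; only the 1M works figure comes from the slides.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/prov#"  # hypothetical vocabulary, not PREMIS OWL


def iri(value: str) -> str:
    """Serialize an IRI as an N-Triples term."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Serialize a plain string literal as an N-Triples term."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def event_triples(event, resource, event_type, timestamp, agent):
    """Describe one provenance event with five triples."""
    return [
        (iri(event), iri(RDF_TYPE), iri(EX + "Event")),
        (iri(event), iri(EX + "eventType"), literal(event_type)),
        (iri(event), iri(EX + "eventDateTime"), literal(timestamp)),
        (iri(event), iri(EX + "relatedObject"), iri(resource)),
        (iri(event), iri(EX + "agent"), literal(agent)),
    ]


def to_ntriples(triples) -> str:
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = event_triples(
    "http://example.org/event/1",   # hypothetical identifiers and timestamp
    "http://example.org/work/1",
    "validation",
    "2016-01-01T00:00:00Z",
    "Cellar",
)
print(to_ntriples(triples))

WORKS = 1_000_000      # from the slides (legislative collections)
EVENTS_PER_WORK = 5    # ASSUMPTION, for illustration only
estimated = WORKS * EVENTS_PER_WORK * len(triples)
print(estimated)       # 25000000
```

Since events are planned for all WEMI objects and not only for works, the real count would be higher than this estimate suggests.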
Which provenance/contextual events will we encode?
• Events in Eudor v3
1. Ingest start: The ingest process has started.
2. Unpacking: Extracted objects from package in file/folder format.
3. Wellformedness check: Checked that the received SIP is well formed, complete and that no unexpected files were included.
4. Virus check: Scanned package for malicious programs using ClamAV.
5. Wellformedness check: Checked whether the descriptive metadata is included in the SIP and if this metadata is valid according to the established policy.
6. Message digest calculation: Created base PREMIS objects with file original name and file fixity information (SHA-256).
7. Format identification: Identified the object's file formats and versions using Siegfried.
8. Authorization check: Producer permissions have been checked to ensure that the producer has sufficient authorization to store the AIP under the desired node of the classification scheme.
9. Accession: Added package to the inventory. After this point, the responsibility for the digital content’s preservation is passed on to the repository.
10. Ingest end: The ingest process has ended.
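One of the ingest events listed above could be encoded as a PREMIS event element roughly as follows. This is a sketch using the PREMIS 3 XML element names, not the serialization RODA actually writes; the identifier, timestamp, and outcome value are hypothetical.

```python
import uuid
import xml.etree.ElementTree as ET

PREMIS_NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the PREMIS namespace."""
    return f"{{{PREMIS_NS}}}{tag}"


def premis_event(event_type, detail, outcome, date_time, identifier=None):
    """Build a minimal PREMIS event element."""
    event = ET.Element(q("event"))
    ident = ET.SubElement(event, q("eventIdentifier"))
    ET.SubElement(ident, q("eventIdentifierType")).text = "UUID"
    ET.SubElement(ident, q("eventIdentifierValue")).text = (
        identifier or str(uuid.uuid4())
    )
    ET.SubElement(event, q("eventType")).text = event_type
    ET.SubElement(event, q("eventDateTime")).text = date_time
    info = ET.SubElement(event, q("eventDetailInformation"))
    ET.SubElement(info, q("eventDetail")).text = detail
    outcome_info = ET.SubElement(event, q("eventOutcomeInformation"))
    ET.SubElement(outcome_info, q("eventOutcome")).text = outcome
    return event


event = premis_event(
    event_type="virus check",
    detail="Scanned package for malicious programs using ClamAV.",
    outcome="success",                  # hypothetical outcome value
    date_time="2016-01-01T00:00:00Z",   # hypothetical timestamp
    identifier="example-event-1",       # hypothetical identifier
)
xml_text = ET.tostring(event, encoding="unicode")
print(xml_text)
```

A fuller record would also link the event to the agent (here the virus scanner) and to the objects it touched; those links are omitted to keep the sketch short.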
Encoding of preservation actions
Open issues/Key points
• We have not yet implemented provenance/contextual metadata in Cellar
• We are currently writing the specs
• The number of RDF triples should not get too high: this limits the events we will encode in our production workflow
• The current list of events in the LOC does not cover most of our cases