Aggregation Workflow at Europeana Aggregator Forum
Aggregation workflow
Dimitra Atsidis
Aggregator Forum, 22-23 May, The Hague
Content
Publication policy
Potential partners process
Submission deadlines for new and existing providers
Europeana ingestion workflow
Acceptance criteria and Europeana validation
Guidance and help – Europeana Professional
Future plans for Europeana aggregation workflow
Exercise – The ideal aggregation workflow
Publication Policy
Clear criteria for acceptance or decline of metadata for publication and for take down of legacy metadata from the Europeana database
Ingestion workflow (deadlines, timelines, prioritisation)
Content scope (what is a digital object, kind of content)
Technical validation of metadata quality (expected values)
Metadata licensing (CC0)
Rights Statements for digital objects
• All digital objects with valid edm:rights
• Public domain (PD) objects labelled as public domain
• edm:rights and dc:rights must not contradict each other
Publication Process and Workflow
How to become a data provider to Europeana?
New Provider Timeline
Provider: first delivery of data (samples or full datasets)
Europeana: feedback on
- structure of the metadata
- mandatory elements
- rights statements
Provider: feedback taken into account, new delivery of datasets ready to be ingested
Europeana: ingestion of the datasets that are compliant with the publication policy
Europeana: publication in Europeana of the submitted datasets
TIMELINE
Between the 1st and the 5th of month 1
After the 5th of month 1
Before the 21st of month 1
Before the 21st of month 2
Between the 10th and the 20th of month 2
Between the 10th and the 20th of month 3
Before the 21st of month 3
Between the 10th and the 20th of month 4
Regular Ingestion Cycle Diagram Timeline
TIMELINE
Month 1
1st of month 1
Around the 21st of month 1
Around the 1st of month 2
Between the 10th and the 20th of month 2
Provider for Europeana: delivers samples of metadata or full datasets for feedback
Europeana aggregation team: sends feedback about
- structure of the metadata
- mandatory elements
- rights statements
SUBMISSION DEADLINE: no new dataset will be accepted for month 2 publication after the deadline
Provider for Europeana: delivers full datasets to be included in the coming publication
Europeana aggregation team: processes the datasets and/or sends feedback about
- structure of the metadata
- mandatory elements
- rights statements
Monthly ingestion complete
Publication is finalized
Provider for Europeana: if needed, delivers a corrected version of datasets ready to be ingested
Europeana aggregation team: ingests the datasets to be included in the coming publication, provided that they are compliant with the publication policy
Europeana aggregation team: communicates that the publication is live and sends the final report for the processed datasets
Processed datasets can be retrieved on the Europeana portal
Europeana Ingestion Workflow
In SugarCRM populate with info about organizations and datasets
In UIM Enrich Collection workflow (to index and enrich dataset/s)
In Thumbler caching links to thumbnails (objects/previews)
SugarCRM (to control and set the correct Ingestion Status)
Content Contribution Form (for new partners)
Scheduling of Ingestion
Load dataset/s in UIM
Harvest in REPOX
In MINT map dataset/s
In UIM Dereference Collection (if needed)
In UIM Create Record Redirects (if needed)
Dataset/s published in Europeana.eu
Via UIM load dataset/s in MINT
Via MINT load mapped dataset/s in UIM
Acceptance criteria
Completed and submitted the Data Exchange Information Form.
Data Exchange Agreement to Europeana
o Aggregators need to submit the signed Data Exchange Agreements of their data providers
o Aggregators can use template clauses for the agreement between aggregators and data providers:
http://pro.europeana.eu/ensuring-permissions-for-aggregators
Metadata are accepted for publication after the feedback of the Europeana Operations Officers
o EDM schema and guidelines
o Rights labeling
Datasets are prioritised for publication if the edm:rights value in the majority of the dataset's records is PDM, CC0, CC BY or CC BY-SA
Datasets can be submitted via the OAI-PMH protocol, via FTP or as a file
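The prioritisation rule above can be sketched as follows. This is a minimal illustration, not Europeana's implementation: it assumes that "majority" means more than half of the records, that each record carries a single edm:rights URI, and that the four statements are recognised by their Creative Commons URI prefix. The function names are hypothetical.

```python
# Assumption: rights statements are identified by Creative Commons URI prefix.
OPEN_RIGHTS_PREFIXES = (
    "http://creativecommons.org/publicdomain/mark/",  # PDM
    "http://creativecommons.org/publicdomain/zero/",  # CC0
    "http://creativecommons.org/licenses/by/",        # CC BY
    "http://creativecommons.org/licenses/by-sa/",     # CC BY-SA
)


def is_open_rights(rights_uri):
    """Return True if the edm:rights URI is PDM, CC0, CC BY or CC BY-SA."""
    return rights_uri.startswith(OPEN_RIGHTS_PREFIXES)


def is_prioritised(rights_values):
    """Return True if more than half of the edm:rights values are open.

    rights_values is a list with one edm:rights URI per record.
    An empty dataset is never prioritised.
    """
    if not rights_values:
        return False
    open_count = sum(1 for uri in rights_values if is_open_rights(uri))
    return open_count * 2 > len(rights_values)
```

Because the CC BY prefix ends in a slash, it does not match more restrictive licences such as CC BY-NC, whose URIs start with "licenses/by-nc/".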
Europeana validation
Automatic validation:
Validation according to the EDM schema (or ESE v3.4)
Validation of the mandatory properties
Unique identifiers
o Metadata records that don’t meet this validation are discarded
o Providers can fix issues first and resubmit, or let Europeana ingest the records that are valid and fix the invalid records at a later stage
Validation of URLs for thumbnail creation (ImageMagick)
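The unique-identifier step of the automatic validation can be illustrated with a short sketch. It assumes each record is a dictionary with an "id" key and that, when two records share an identifier, the first one is kept and later ones are discarded. Both the record shape and that tie-breaking rule are assumptions for illustration; the slides do not say how duplicates are resolved.

```python
def split_by_unique_id(records):
    """Partition records into (valid, discarded) by identifier.

    A record is discarded when it has no identifier or when its
    identifier was already seen earlier in the dataset (assumption:
    the first occurrence wins).
    """
    seen = set()
    valid, discarded = [], []
    for record in records:
        identifier = record.get("id")
        if not identifier or identifier in seen:
            discarded.append(record)
        else:
            seen.add(identifier)
            valid.append(record)
    return valid, discarded
```

Returning the discarded records separately mirrors the two options given to providers: the valid part can be ingested immediately while the discarded part is fixed and resubmitted later.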
Mandatory properties
Applicable class    Mandatory properties (or alternatives)
Aggregation         edm:dataProvider
Aggregation         edm:isShownAt or edm:isShownBy
Aggregation         edm:provider
Aggregation         edm:rights
Aggregation         edm:aggregatedCHO
Aggregation         edm:ugc (when applicable)
ProvidedCHO         dc:title or dc:description
ProvidedCHO         dc:language for text objects
ProvidedCHO         dc:subject or dc:type or dc:coverage or dcterms:spatial
ProvidedCHO         edm:type
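The mandatory properties listed above can be expressed as a small check. This is a sketch under stated assumptions, not the validator Europeana runs: each class is represented as a plain dictionary from property name to a string value, "text objects" is taken to mean edm:type equal to TEXT, and edm:ugc is left out because it applies only in some cases. The function names are hypothetical.

```python
# Each tuple lists alternatives; at least one of them must be present.
AGGREGATION_RULES = [
    ("edm:dataProvider",),
    ("edm:isShownAt", "edm:isShownBy"),
    ("edm:provider",),
    ("edm:rights",),
    ("edm:aggregatedCHO",),
]
CHO_RULES = [
    ("dc:title", "dc:description"),
    ("dc:subject", "dc:type", "dc:coverage", "dcterms:spatial"),
    ("edm:type",),
]


def _missing(props, rules):
    """Return a label for every rule with no non-empty alternative."""
    return [
        " or ".join(alternatives)
        for alternatives in rules
        if not any(props.get(name) for name in alternatives)
    ]


def missing_mandatory(aggregation, provided_cho):
    """List the mandatory properties a record lacks (empty list if none)."""
    problems = ["Aggregation: " + m for m in _missing(aggregation, AGGREGATION_RULES)]
    problems += ["ProvidedCHO: " + m for m in _missing(provided_cho, CHO_RULES)]
    # Assumption: a "text object" is a record whose edm:type is TEXT.
    if provided_cho.get("edm:type") == "TEXT" and not provided_cho.get("dc:language"):
        problems.append("ProvidedCHO: dc:language")
    return problems
```

A record with an empty result would pass this step; any other result names the properties to add before resubmitting.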
Validation by the operations officers:
Feedback follows the EDM schema and guidelines
Check that links work and are direct links to objects of reasonable size
Recommendations to include thumbnails, geolocations, etc.
Feedback on (near) duplicate records, and on taking advantage of EDM
Feedback on rights statements in edm:rights and dc:rights
Relations between the EDM classes
Correct use of vocabularies
Literals vs resources (e.g. a thumbnail always needs to be a valid URL)
Feedback on any other metadata quality related matters (duplication of properties, encoding in the data, wrongly mapped properties, etc.)
Etc.
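The literal-versus-resource point can be illustrated with a syntactic check on a thumbnail value. This sketch only tests whether the value looks like an absolute http or https URL; it does not test that the link works, which is a separate check. The function name is hypothetical.

```python
from urllib.parse import urlparse


def looks_like_url(value):
    """Return True if value is an absolute http(s) URL, syntactically.

    A free-text literal such as "see image 12" or a relative path
    such as "thumb.jpg" returns False.
    """
    try:
        parts = urlparse(value.strip())
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. a broken IPv6 host.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```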
Guidance and help
Europeana Professional: http://pro.europeana.eu/provide-data
Content inbox – for all ingestion & metadata related matters [email protected]
Questions?
Future plans for aggregation workflow
The big plan is to open up part of the ingestion workflow to providers
• Providers can log in and identify the aggregator/project they work for
• Providers can select the datasets they want to update, or add new datasets
• Providers can upload their data – protocols besides OAI-PMH and FTP are under discussion
• Providers can map their data to EDM, or edit data that is already EDM
• Providers can validate the data against the EDM schema and preview in a preview portal
Europeana wants to provide tools for uploading data, validating, mapping, and previewing
Other tools and workflows being considered: link checking, thumbnail caching, enrichment
Start with a test environment, to preview and validate a subset of the data before sending it to Europeana
Eventually, open up part of Europeana's workflow to providers, not only for testing but as an integrated part of the ingestion workflow.
Benefits for providers:
• Possibility to map to EDM
• Validation according to the EDM schema (with schematron rules we implemented)
• Previewing before publication
• Self-service: less dependent on Europeana and saving time (you can do many steps yourself, and you spot errors earlier)
Benefits for Europeana:
• Scale up operations: the number of projects, aggregators and therefore datasets has grown exponentially in recent years
• Focus more on metadata quality and on assisting providers as much as possible with EDM, modelling and metadata related questions
• Make the ingestion process transparent and more connected to the process on the aggregators' side
The ideal aggregation workflow
Consider your own aggregation route, from the data provider to the aggregator to data provision to Europeana
Consider also the current aggregation workflow of Europeana and the future plans presented
Now, draw the ideal workflow to get your data from the data provider, through your aggregator into Europeana. Make a diagram, a mindmap, or whatever comes to mind.
Think, for instance, about the following questions:
What steps in your current workflow could you use help with (e.g. mapping, validation, rights clearance)?
Would you use any of the workflow steps Europeana plans to open up? Why, or why not?
Are there any tools you already use that you could recommend to everyone?
Would the aggregator or the data providers (or both) benefit from and use the tools?
Use the yellow post-its to signal positive things, improvements, easy wins (and why?)
Use the pink post-its to signal foreseeable issues or difficulties (and why?)