Introduction to Persistent Identifiers

www.eudat.eu EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065 Introduction to Persistent Identifiers PIDs in EUDAT This work is licensed under the Creative Commons CC-BY 4.0 licence. Attribution: EUDAT – www.eudat.eu

Transcript of Introduction to Persistent Identifiers

Page 1: Introduction to Persistent Identifiers

www.eudat.eu

EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065

Introduction to Persistent Identifiers
PIDs in EUDAT

This work is licensed under the Creative Commons CC-BY 4.0 licence. Attribution: EUDAT – www.eudat.eu

Page 2: Introduction to Persistent Identifiers

Content

What are persistent identifiers?
Why use persistent identifiers?
Different persistent identifier systems
The HANDLE system
EPIC PID system
Policies
Use cases

Page 3: Introduction to Persistent Identifiers

PID Training

PERSISTENT IDENTIFIERS

Page 4: Introduction to Persistent Identifiers

Science Data

Data generation is getting easier/cheaper
Complexity shift from data generation to data processing & analysis
The volume of data output is increasing

Data needs to be:
Findable
Accessible
Interoperable
Reusable

Page 5: Introduction to Persistent Identifiers

Briefly, what are PIDs?

Pointers to data resources
  Data files, metadata files, documents …

Globally unique
Exist indefinitely

Identify and retrieve resources
  Can be resolved to the resource

Examples: ISBN, DOIs, PURLs, Handles …

Page 6: Introduction to Persistent Identifiers

Data Creation Cycle

[Figure: data creation cycle with the stages raw data, registration & preservation, analysis & enrichment, and citable publication; the data is labelled temporary, referable and citable; caption: persistent and robust identification]

Page 7: Introduction to Persistent Identifiers

PIDs are static

[Figure: four data objects (Data 1 to Data 4) in the world of data infrastructure, each referenced by its own PID (PID 1 to PID 4)]

Page 8: Introduction to Persistent Identifiers

Example: Data hierarchies

[Figure: a hierarchy of data objects, each with its own PID and metadata, covering raw data, analysed data and published data]

Page 9: Introduction to Persistent Identifiers

What is the Problem? Why not use simple URLs?

A URL specifies the location, on a particular server, from which a resource can be retrieved. URLs are strictly network locations for digital resources.

domain may change
resource may be relocated
link may change

BUT in the long term, often just a year or two later, URLs no longer work: “link rot”

Page 10: Introduction to Persistent Identifiers

Persistent over time

[Figure: timeline from today (2016) to 2030. The PID 11839/abc123 stays the same, while the location of the data it resolves to changes from http://www.example.com/ to http://www.moved.com/]

A PID supports access to a resource as it moves from one location to another, by design.

Page 11: Introduction to Persistent Identifiers

Persistent Identifiers

A persistent identifier is distinct from a URL: it is not strictly bound to a specific server or filename.

“A persistent identifier (PID) is a long-lasting reference to a digital object—a single file or set of files.“

https://en.wikipedia.org/wiki/Persistent_identifier

[Figure: the PID 11839/abc123 (prefix 11839, suffix abc123) is resolved to the data object 111000010001111]

The identifier points to a resource without any knowledge of the resource itself.

It is the responsibility of the PID owner to keep the PID up to date when the resource changes.

Page 12: Introduction to Persistent Identifiers

Persistent Identifier

points to one or more resources
is globally unique

[Figure: the PID 11839/abc123 (prefix 11839, suffix abc123) is resolved to a resource, which can be data, metadata, a document or code]

Once the PID is created, the resource is globally addressable.

Prefix: designates the administrative domain and is assigned by an issuing authority
Suffix: unique within the realm of the prefix

Page 13: Introduction to Persistent Identifiers

Managing Persistent Identifiers

Managing data includes managing the persistent identifier for the data.

These changes cause “link rot”:
domain may change
resource may be relocated
link may change

With PIDs:
The PID needs to be updated to point to the new location (URL).
The PID continues to provide the latest information about the resource.

Page 14: Introduction to Persistent Identifiers

Redirection Layer

[Figure: a stable PID is mapped through an administrative record to the unstable http:// locations of the data]

The redirection layer bridges the stable and unstable worlds, at the cost of some administrative responsibilities.

Data moves over time

Data is always reachable by its PID
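
The redirection idea can be illustrated with a toy in-memory registry. This is only a sketch: the class PidRegistry and its method names are invented for illustration and are not part of any Handle or EUDAT software. The PID and URLs are the ones used on the earlier slides.

```python
class PidRegistry:
    """Toy redirection layer: maps a stable PID to its current URL."""

    def __init__(self):
        self._records = {}

    def register(self, pid, url):
        # A PID is created once; registering it twice is an error.
        if pid in self._records:
            raise ValueError(f"PID already registered: {pid}")
        self._records[pid] = url

    def update_location(self, pid, new_url):
        # The administrative duty: the owner updates the record when data moves.
        if pid not in self._records:
            raise KeyError(pid)
        self._records[pid] = new_url

    def resolve(self, pid):
        # The reference (the PID) never changes, only what it resolves to.
        return self._records[pid]


registry = PidRegistry()
registry.register("11839/abc123", "http://www.example.com/")
before = registry.resolve("11839/abc123")

# The data moves; only the record behind the PID is changed.
registry.update_location("11839/abc123", "http://www.moved.com/")
after = registry.resolve("11839/abc123")
```

Anything that cited 11839/abc123 before the move still works afterwards, which is the whole point of the extra layer.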

Page 15: Introduction to Persistent Identifiers

PID Advantages

Persistent identity via indirection
  Static references into fluid systems over time
  Data on networks moves
  Ownership/responsibility changes
  Formats change

Embedded IDs
  For the data object in hand: current state data
  Updates
  New related entities

Networks of persistent links
  Data / metadata links
  Provenance chains

Page 16: Introduction to Persistent Identifiers

PID Disadvantages

Extra level of effort / cost on creation
  Analysis: what to identify / granularity
  Coordination across organisations
  Maintain resolution system

Persistence requires sustained effort
  Organisational discipline
  Technology necessary but not sufficient

Analyse cost/benefit ratio
  Don’t start unless it is worthwhile
  Is your data worth it?

Page 17: Introduction to Persistent Identifiers

PID SYSTEMS

Page 18: Introduction to Persistent Identifiers

Identifier – PID

Every identifier consists of two parts: its prefix and a unique local name under the prefix, known as its suffix.

Prefix: designates the administrative domain and is generated by an issuing authority, which makes sure that all prefixes are unique.
Suffix: the local name, which must be unique under its prefix.

The uniqueness of a prefix and the local name under that prefix ensures that any identifier is globally unique within the context of the System.

< PREFIX > / < SUFFIX > (e.g. 11111/123456745)
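
A minimal sketch of splitting an identifier into its two parts. The function name parse_identifier is hypothetical. Splitting at the first slash only is an assumption made here so that a suffix may itself contain slashes.

```python
from typing import NamedTuple


class Identifier(NamedTuple):
    prefix: str
    suffix: str


def parse_identifier(pid: str) -> Identifier:
    """Split '<prefix>/<suffix>' at the first slash."""
    prefix, sep, suffix = pid.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not of the form <prefix>/<suffix>: {pid!r}")
    return Identifier(prefix, suffix)


example = parse_identifier("11111/123456745")
```

Global uniqueness then follows from two local checks: the issuing authority keeps prefixes unique, and each prefix owner keeps suffixes unique under that prefix.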

Page 19: Introduction to Persistent Identifiers

PID Systems

Persistent URLs (PURLs)
Cost: none
Metadata: no additional metadata
Example: purl: GPO/gpo46189

Handle System
Cost: $50 annual fee per prefix
Metadata: associate any metadata
Example: hdl:11210/123

Archival Resource Key (ARK)
Cost: none
Metadata: ERC (Electronic Resource Citation) metadata
Example: ark: /12025/654xz321

Digital Object Identifier (DOI)
Cost: fee per DOI + annual fee
Metadata: the INDECS schema
Example: DOI: 10.1000/182

Page 20: Introduction to Persistent Identifiers

PID System Requirements

Attach multiple URLs to a PID

Allow part identifiers for complex objects (granularity issue)

Allow attaching of extra data records to the PID (MD5 check, etc)

Actionable (URLified) PIDs

HTTP proxy for resolving (use port 80 only)

Control by user community

REST or SOAP interface for administration of PIDs from applications

Delegation of PID administration to other organisations

Distributed, robust, highly-available, scalable

No single point of failure

Acceptable non-commercial business model

Page 21: Introduction to Persistent Identifiers

Identifier String Requirements that contribute to persistence

Not based on any changeable attributes of the entity
  Location
  Ownership
  Any other attribute that may change without changing identity

Unique
  Avoids collisions and referential uncertainty

Opaque, preferably a ‘dumb number’
  A well-known pattern invites assumptions that may be misleading
  Meaningful semantics invite IP wars and language problems

Nice to have
  Human-readable
  Cut-able, paste-able
  Fits common systems, e.g., the URI specification
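
One way to obtain an opaque suffix that is not derived from location, ownership or any other changeable attribute is to use a random UUID. This is only a sketch: the function name mint_opaque_pid is hypothetical, the prefix 11111 is the placeholder from the earlier example, and the slides do not prescribe UUIDs.

```python
import uuid


def mint_opaque_pid(prefix: str) -> str:
    """Build '<prefix>/<suffix>' with a random, meaning-free suffix."""
    # uuid4() is random, so the suffix carries no semantics and
    # collisions are extremely unlikely.
    return f"{prefix}/{uuid.uuid4()}"


first = mint_opaque_pid("11111")
second = mint_opaque_pid("11111")
```

A UUID suffix is cut-and-paste friendly and fits URI syntax, but it is not particularly human-readable, which shows that the "nice to have" properties may have to be traded off against opacity.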

Page 22: Introduction to Persistent Identifiers

PIDs in EUDAT

EUDAT has adopted Handle-based persistent identifiers

A combined solution of the Handle System and the EPIC service (today)

Employing the latest Handle v.8

EUDAT developed a library, B2HANDLE, to interact with Handle v.8

Page 23: Introduction to Persistent Identifiers

HANDLE SYSTEM

Page 24: Introduction to Persistent Identifiers

The Handle System

The Handle System is a technology specification for assigning, managing, and resolving persistent identifiers for digital objects and other resources. The protocols specified enable a distributed computer system to store identifiers (names, or handles) of digital resources and resolve those handles to the information necessary to locate, access, and otherwise make use of the resources. That information can be changed as needed to reflect the current state or location of the identified resource without changing the handle.

Page 25: Introduction to Persistent Identifiers

Handle System

The main goal of the Handle System is to contribute to persistence.

The Handle System is:
reliable
scalable
flexible
trusted
built on open architecture
transparent

Page 26: Introduction to Persistent Identifiers

A Handle Record

Handle: 10232/1234

Data Type   Index   Handle data                 Timestamp
URL         1       https://www.eudat.eu/ex1    2014-04-09 12:46:53Z
INST        2       EUDAT                       2014-04-09 12:46:53Z
HS_ADMIN    100     eudat/user1                 2014-04-09 12:46:53Z

PID (handle): 10232/1234
Actionable URL: http://hdl.handle.net/10232/1234
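
The record above can be modelled as a handle plus a list of typed, indexed values. The sketch below is an illustrative in-memory representation only. The names HandleValue and values_of_type are hypothetical, and this is not the wire format or API of the Handle System.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandleValue:
    data_type: str   # e.g. URL, INST, HS_ADMIN
    index: int       # position of the value within the record
    data: str
    timestamp: str


HANDLE = "10232/1234"
VALUES = [
    HandleValue("URL", 1, "https://www.eudat.eu/ex1", "2014-04-09 12:46:53Z"),
    HandleValue("INST", 2, "EUDAT", "2014-04-09 12:46:53Z"),
    HandleValue("HS_ADMIN", 100, "eudat/user1", "2014-04-09 12:46:53Z"),
]


def values_of_type(values, data_type):
    """Return the data of every value with the given type, ordered by index."""
    matching = [v for v in values if v.data_type == data_type]
    return [v.data for v in sorted(matching, key=lambda v: v.index)]


locations = values_of_type(VALUES, "URL")
```

Because a record is a list of typed values rather than a single URL, further fields such as a checksum can sit alongside the location.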

Page 27: Introduction to Persistent Identifiers

HANDLE Record Types

Common types
URL: one or more, pointing to the location(s) referenced by this HANDLE

HS_ADMIN: special record encoding the permissions configured for this HANDLE

10320/LOC: supports multiple locations, with the resolver making an intelligent decision about which one to return

Custom types
Checksum: useful for integrity verification

EUDAT/ROR: EUDAT-specific, for B2SAFE. ROR (Repository of Records) is the repository where the data was stored first.

EUDAT/PPID: EUDAT-specific, for B2SAFE. The PID associated with the source object in a replication chain. If the chain has only two elements, the master copy and the first replica, then PPID = ROR.
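
The relation between EUDAT/ROR and EUDAT/PPID can be made concrete with a small sketch. It assumes, as the remark PPID = ROR suggests, that both fields hold PIDs: ROR the PID of the master copy, PPID the PID of the object a replica was copied from. The function name and the PIDs are invented for illustration and do not come from B2SAFE.

```python
def replication_fields(chain):
    """Given the PIDs of a replication chain, master copy first,
    return the ROR and PPID values for each replica."""
    fields = {}
    if not chain:
        return fields
    ror = chain[0]  # the master copy, where the data was stored first
    for parent, replica in zip(chain, chain[1:]):
        fields[replica] = {"EUDAT/ROR": ror, "EUDAT/PPID": parent}
    return fields


# Hypothetical PIDs: a master copy and two successive replicas.
short_chain = replication_fields(["10232/master", "10232/replica-1"])
long_chain = replication_fields(
    ["10232/master", "10232/replica-1", "10232/replica-2"]
)
```

In the two-element chain the only replica has PPID equal to ROR. In the longer chain the second replica keeps the same ROR but its PPID points at the first replica.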

Page 28: Introduction to Persistent Identifiers

10320/loc Handle Type

The 10320/LOC field is specifically designed to allow the HTTP handle resolver to make an intelligent decision about which location to return if multiple locations are available.

Options:
Weight: specifies a weight per location. Load is distributed over all locations according to their assigned weights.
Country: specifies where this location is hosted. This allows the HTTP resolver to return the location closest to the user (based on a GeoIP lookup).
Weighted: selects a single location based on a random choice.

Page 29: Introduction to Persistent Identifiers

10320/loc Handle Type Example

<locations>
  <location id="0" href="http://uk.example.com/" country="gb" weight="0" />
  <location id="1" href="http://www1.example.com/" weight="1" />
  <location id="2" href="http://www2.example.com/" weight="1" />
</locations>

PID: 10232/1234

Reference 1: from a client located in the UK
Reference 2: from a client located outside the UK
Reference 3: 10232/1234?locatt=id:1
Reference 4: 10232/1234?locatt=id:0
Reference 5: 10232/1234?locatt=country:us
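
How a resolver might pick a location for the references above can be sketched as follows. This is a simplified reading, not the actual resolver implementation. The order tried here (explicit locatt match, then country match, then weighted random choice), the default weight of 1, and the function name choose_location are all assumptions.

```python
import random
import xml.etree.ElementTree as ET

LOCATIONS_XML = """<locations>
  <location id="0" href="http://uk.example.com/" country="gb" weight="0" />
  <location id="1" href="http://www1.example.com/" weight="1" />
  <location id="2" href="http://www2.example.com/" weight="1" />
</locations>"""


def choose_location(xml_text, client_country=None, locatt=None):
    locations = [loc.attrib for loc in ET.fromstring(xml_text).findall("location")]

    # 1. Explicit request, e.g. ?locatt=id:1 means attribute id == "1".
    if locatt:
        key, _, value = locatt.partition(":")
        for loc in locations:
            if loc.get(key) == value:
                return loc["href"]

    # 2. Country of the client (a real resolver would use a GeoIP lookup).
    if client_country:
        for loc in locations:
            if loc.get("country") == client_country.lower():
                return loc["href"]

    # 3. Weighted random choice; weight 0 is never picked at this step.
    candidates = [(loc["href"], float(loc.get("weight", "1"))) for loc in locations]
    candidates = [(href, w) for href, w in candidates if w > 0]
    if not candidates:
        return None
    hrefs, weights = zip(*candidates)
    return random.choices(hrefs, weights=weights, k=1)[0]
```

Under these assumptions a UK client is sent to uk.example.com, any other client to www1 or www2 with equal probability, and locatt=id:0 or id:1 selects that location directly. A request for locatt=country:us matches nothing and falls through to the weighted choice.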

Page 30: Introduction to Persistent Identifiers

Part Identifiers

Part identifiers make it possible to compute an unlimited number of handles on the fly by registering just one.

A single template handle can be created as a base that will allow any number of extensions to that base to be resolved as full handles, according to a pattern, without each such handle being individually registered.

In the Handle System, part (fragment) identifiers are enabled with a template. The template is a syntax that defines a delimiter and an extension; the extension can be any string added after the delimiter.

Page 31: Introduction to Persistent Identifiers

Part Identifiers - Examples

Use part identifiers:
to reference a part of a dictionary
to reference an unlimited number of ranges within a video
to reference a part of a collection of items

Video example
Create one handle: 10232/1234576
A range: 10232/1234576@from=1:05&to=1:14

A PID is used to point to a location. Please note that when your system offers part identifiers, it is also responsible for maintaining the part identification fragment.
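
The video example can be sketched as follows: only the base handle is registered, and anything after the delimiter is handed to the application. The delimiter @ and the from/to extension come from the example above. The function name resolve_part, the set of registered handles, and parsing the extension as key=value pairs are assumptions for illustration.

```python
from urllib.parse import parse_qs

# Only the base (template) handle is actually registered.
REGISTERED_HANDLES = {"10232/1234576"}


def resolve_part(pid, delimiter="@"):
    """Split a part identifier into its registered base and extension."""
    base, sep, extension = pid.partition(delimiter)
    if base not in REGISTERED_HANDLES:
        raise KeyError(f"no registered handle for {base}")
    # The extension is free-form; here it is read as key=value pairs.
    params = {k: v[0] for k, v in parse_qs(extension).items()} if sep else {}
    return base, params


base, video_range = resolve_part("10232/1234576@from=1:05&to=1:14")
```

Interpreting the extension (here a time range) remains the job of the system that serves the data, in line with the note above.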

Page 32: Introduction to Persistent Identifiers

PID SYSTEM IN EUDAT
Handle System and EPIC

Page 33: Introduction to Persistent Identifiers

PID System: How does it work?

PID Service: generates and manages PIDs for digital objects

PID Replication: replicates the database of Handles to guarantee a robust and highly available PID resolution function

Resolution Service: guarantees reliable resolution of PIDs, forwarding the user to the resource

Global Handle Mirror: a mirror of the Global Handle Registry in Europe

Page 34: Introduction to Persistent Identifiers

PID Service

A RESTful web service, using the HTTP application protocol.

[GET]: for getting the data of a selected PID, search for PIDs
[POST]: for creating a new PID with automatic generation of suffix name
[PUT]: for creating/updating a PID with manual generation of suffix name
[DELETE]: for deleting a PID
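
The four verbs can be sketched as HTTP requests built with the Python standard library. Nothing is sent in this sketch. The base URL, the URL layout and the JSON body are placeholders invented for illustration, not the documented interface of the EUDAT PID service, and authentication is left out.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the address of your PID service.
BASE_URL = "https://pid.example.org/api/handles"


def build_request(method, prefix, suffix=None, values=None):
    """Build (but do not send) a request for a PID operation."""
    # Without a suffix the request targets the prefix, so the service
    # can generate the suffix itself (the POST case).
    url = f"{BASE_URL}/{prefix}/" + (suffix or "")
    body = json.dumps(values).encode("utf-8") if values is not None else None
    headers = {"Content-Type": "application/json"} if body else {}
    return urllib.request.Request(url, data=body, headers=headers, method=method)


location = [{"type": "URL", "value": "https://www.eudat.eu/ex1"}]

get_req = build_request("GET", "10232", "1234")                    # read a PID
post_req = build_request("POST", "10232", values=location)         # suffix generated
put_req = build_request("PUT", "10232", "1234", values=location)   # suffix chosen
delete_req = build_request("DELETE", "10232", "1234")              # remove a PID
```

A prepared request would be sent with urllib.request.urlopen once real credentials and a real endpoint are in place.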

Page 35: Introduction to Persistent Identifiers

Resolution Service

The web address for the handle resolution service that EUDAT uses is http://hdl.handle.net. 
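
A small sketch of turning a handle reference into an actionable URL for this resolver. The accepted input forms (a bare handle, an hdl: reference, or an existing resolver URL) and the function names are assumptions for illustration.

```python
from urllib.parse import quote

RESOLVER = "http://hdl.handle.net/"
KNOWN_LEADS = ("hdl:", "http://hdl.handle.net/", "https://hdl.handle.net/")


def bare_handle(reference):
    """Strip an 'hdl:' or resolver-URL lead, leaving '<prefix>/<suffix>'."""
    ref = reference.strip()
    for lead in KNOWN_LEADS:
        if ref.lower().startswith(lead):
            return ref[len(lead):]
    return ref


def actionable_url(reference):
    """Return the resolver URL for a handle given in any accepted form."""
    # Keep the slash between prefix and suffix; escape anything unsafe.
    return RESOLVER + quote(bare_handle(reference), safe="/")


url_from_hdl = actionable_url("hdl:11210/123")
url_from_bare = actionable_url("11239/GRNET")
```

Opening such a URL in a browser asks the resolver to forward the user to the current location of the resource.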

Page 36: Introduction to Persistent Identifiers

EUDAT options for PIDs

In order to access a data object stored in EUDAT, an associated persistent identifier (PID) is needed. EUDAT requires integration of Handle in your infrastructure. Before your community or data centre can create PIDs you need a prefix. There are two options:

you can run your own Handle system; or
you can pass the details to EUDAT partners to manage it on your behalf.

An additional benefit of using the EUDAT systems is access to a REST API to manage your PID handles.

Page 37: Introduction to Persistent Identifiers

POLICIES

Page 38: Introduction to Persistent Identifiers

How may I use a PID?

Once you own a PID, use it
Online
In your publications
In your linked data

You may also use it
To get the data
To refer to the data

Use it as an actionable URL: http://hdl.handle.net/11239/GRNET

Page 39: Introduction to Persistent Identifiers

Policy Document

When to use persistent identifiers?
There is no one-size-fits-all strategy for implementing PIDs.

Create a policy document of what & when
Analyse the use of PIDs and create a policy for their management
What to register
When in the data management life cycle

This requires analysis and thought.

Page 40: Introduction to Persistent Identifiers

Policy Document

Simple questions
Which data objects need a PID (collections, files, metadata records)?
What kinds of data are likely to stay online long enough?
What kinds of data are likely to be linked to?
What kinds of data are likely to be analysed/processed with tools?
What will happen after data goes off-line?
etc.

Answering them requires analysis and thought.

Page 41: Introduction to Persistent Identifiers

USE CASES

Page 42: Introduction to Persistent Identifiers

Example 1: B2SHARE

B2SHARE is a user-friendly, reliable and trustworthy way for researchers, scientific communities and citizen scientists to store and share small-scale research data from diverse contexts.

Page 43: Introduction to Persistent Identifiers

Example 2: B2SAFE

B2SAFE employs PIDs to keep track of and link replicas of data in the EUDAT network

Page 44: Introduction to Persistent Identifiers

Example 3: Enable data flows

Link directly to the data (?locatt=id:0)
Optionally include a (MIME) type in the handle record, which can be used to select appropriate tooling
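
Both ideas can be sketched together: a direct link that selects a location by id, and a lookup that picks tooling from a type stored in the record. The field name MIMETYPE and the tool table are hypothetical. Only the ?locatt=id:0 form and the resolver address come from the slides.

```python
RESOLVER = "http://hdl.handle.net"

# Hypothetical mapping from MIME type to the tool a workflow would start.
TOOLS = {
    "text/csv": "table viewer",
    "application/x-netcdf": "netCDF reader",
}


def direct_link(handle, location_id):
    """Link that asks the resolver for one specific location of the data."""
    return f"{RESOLVER}/{handle}?locatt=id:{location_id}"


def pick_tool(record):
    """Choose tooling from the (assumed) MIMETYPE field of a handle record."""
    return TOOLS.get(record.get("MIMETYPE"), "generic download")


link = direct_link("10232/1234", 0)
tool = pick_tool({"URL": "https://www.eudat.eu/ex1", "MIMETYPE": "text/csv"})
```

A workflow can then fetch the data from the direct link and hand it to the selected tool without a person having to inspect the file first.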

Page 45: Introduction to Persistent Identifiers

Summary

Persistent Identifiers provide a solution to the “link rot” problem by providing an extra layer of indirection

Several systems are available; some offer additional functionality in the form of support for storing additional metadata, providing a global resolver, etc.

Policy Document: How to use persistent identifiers in your repository requires some analysis and thought

Page 46: Introduction to Persistent Identifiers

Summary

The HANDLE system, via the EPIC system, is the EUDAT PID framework of choice because:

Low cost, only a flat annual fee
Robust, scalable and performant
Flexible, allows addition of any metadata
Provides a global resolver

However, there are some challenges, especially in scenarios where multiple administrative domains are involved.

Page 47: Introduction to Persistent Identifiers

Hands-on material

Material on PID hands-on (part 7)

Hands-on tutorial which shows how to:
Create, manage and delete PIDs
Work with PIDs in workflows

Examples for Handle v8:
Epicclient.py
cURL commands
B2HANDLE library

https://github.com/EUDAT-Training/B2SAFE-B2STAGE-Training

Training module which provides hands-on material for:
EUDAT B2SAFE
iRODS4
B2HANDLE
and the EUDAT B2STAGE service.

Page 48: Introduction to Persistent Identifiers

Thanks

Page 49: Introduction to Persistent Identifiers

Authors Contributors

Themis Zamani, GRNET
Willem Elbers, CLARIN
Christine Staiger, SURFsara

Ellen Leenarts, DANS

Thank you