arXiv:1605.06154v1 [cs.DL] 19 May 2016 · A study [23] estimated that at least 1.8M articles were...

Web Infrastructure to Support e-Journal Preservation (and More)

Herbert Van de SompelLos Alamos National Laboratory

[email protected]://orcid.org/0000-0002-0715-6126

David S. H. RosenthalLOCKSS Program, Stanford University Libraries


Michael L. NelsonOld Dominion University


AbstractE-journal preservation systems have to ingest millions ofarticles each year. Ingest, especially of the “long tail”of journals from small publishers, is the largest elementof their cost. Cost is the major reason that archives con-tain less than half the content they should. Automationis essential to minimize these costs. This paper examinesthe potential for automation beyond the status quo basedon the API provided by CrossRef, ANSI/NISO Z39.99ResourceSync, and the provision of typed links in pub-lishers’ HTTP response headers. These changes wouldnot merely assist e-journal preservation and other cross-venue scholarly applications, but would help remedy thegap that research has revealed between DOIs’ potentialand actual benefits.

Copyright c©2016 Herbert Van de Sompel & David S. H. Rosenthal & Michael L. Nelson

1 IntroductionA study [23] estimated that at least 1.8M articles werepublished in academic journals in 2012. Systems thatpreserve these journals in electronic form must ingestthese articles in real time, so their ingest processes mustbe highly automated. Current systems have implementedtwo different techniques for doing so. Harvest uses webarchiving technology to collect newly-published con-tent from the journal publisher’s web site, whereas FileTransfer relies on the publisher to push such content tothe archive via FTP or rsync. Portico [19] relies on filetransfer, the Global LOCKSS Network [15] uses harvest,and the CLOCKSS Archive [2] uses both,

In each case, the automated ingest process has to an-swer three questions:

1. Manifest: what should be ingested?

2. Completeness: was everything that should havebeen ingested actually successfully ingested?

3. Bibliography: what bibliographic metadata de-scribes the ingested content? This includes deter-

mining the Identity of the object.

Via an organization named CrossRef [4], articles inmost journals are assigned a Digital Object Identifier(DOI), a location-independent identifier that can be, butalas often is not [7], used to cite the article. CrossRefhas developed an API for querying information aboutDOIs [3]. This paper examines various possible waysthis API, ANSI/NISO Z39.99 ResourceSync, and the re-cent “Signposting” proposal from Van de Sompel andNelson [22], could together be used to improve the ingestprocesses of e-journal preservation systems. It concludeswith a set of recommendations to CrossRef and publish-ers, aimed not merely at assisting e-journal preservation,and other applications such as citation management, butalso at helping remedy the gap research has revealedbetween DOIs’ potential and actual benefits [21]. Assuch, this paper calls upon CrossRef, academic publish-ers, and, more broadly, all operators of portals involvedin scholarly communication to start a discussion aboutthe feasibility of the recommendations, and to eventuallyimplement (a version of) them, not just as a means tosupport archival use cases, but to increase overall inter-operability for web-based scholarly communication.

2 Current Harvest Ingest ProcessesThe current harvest content ingest pipeline for theCLOCKSS Archive is documented in detail in theArchive’s TRAC audit materials [18]. Briefly, the pro-cess of ingesting journal content involves:

• At regular intervals a web crawler visits a known“start page” for the journal. This is typically thejournal’s home page or the current volume’s tableof contents (ToC) page. It collects the current stateof the start page.

• The crawler identifies as yet unvisited links on thestart page and follows them to collect the linked-tocontent. These pages frequently link to many things

arX

iv:1

605.

0615

4v1

[cs

.DL

] 1

9 M

ay 2

016

http://orcid.org/0000-0002-0715-6126

http://orcid.org/0000-0002-3094-1376

http://orcid.org/0000-0003-3749-8116

other than journal content, so the crawler has heuris-tics, called crawl rules, intended to eliminate thoselinks unlikely to point to journal content.

• The process repeats to find articles. Typical jour-nals have a structure in which a home page links tovolume ToC pages, which link to issue ToC pages,which link to article “landing pages” (See Figures 1,and 2).

• Landing pages typically contain an abstract andmetadata for the article. The crawler collects thelanding page and analyzes its links. Among themare typically full-text HTML and PDF versions, andany supplemental materials. Landing pages are notuniversal, in some cases (e.g., [22]) the ToC pagelinks directly to the full-text HTML.

• The crawler again identifies these links usingheuristics, and follows them to collect the HTML,the PDF and the supplemental materials.

• The crawler analyzes the links in the HTML. Again,heuristics are used to identify the link targets thatform part of the article and must thus be saved.

– Some will not be part of this or any of thisjournal’s other articles, such as links to articlescited from other journals. These links shouldnot be followed.

– Some of their targets will be part of the ar-ticle, and not part of any other article, suchas the images of figures, and the spreadsheetsthat generated graphs. These links should befollowed.

– Some will be to targets shared by many arti-cles in the journal, an obvious example beingthe link back to the ToC page. These linksshould only be followed if they have not pre-viously been.

• The metadata extractor [10] analyzes the land-ing page, the HTML or the PDF, depending uponwhere suitable metadata may be found, and cre-ates an entry containing the metadata in a metadatadatabase [9]. This again uses heuristics to locate themetadata in one or more of the article’s files, and tonormalize it into a canonical form [16].

In an ideal world, the heuristics work well and thisprocess collects everything the publisher “objectively”published, in the sense of making it available to read-ers, shortly after they did so and obtains useful meta-data. This process works well, especially for journalspublished on one of the major publishing platforms, such

as HighWire, Atypon, PLOS or Open Journal System.But in the real world, the heuristics are less than perfect.

Publishing platforms and publishers are continually“optimizing their user’s experience” by making changesto their web sites look-and-feel. Some of these changescan render the heuristics ineffective. The LOCKSS sys-tem implements automated “substance” monitoring. Thesystem is configured with per-publisher information de-scribing the minimum number and size of PDF andHTML files expected from each journal in a publishingcycle; it alerts if these limits are violated.

Further, these heuristics operate on a single source ofdata, the publisher’s web site. All large-scale data feedscontain noise, and this data is no exception. Examples in-clude aberrant spellings of publisher and journal names,and malformed, unregistered or incorrect DOIs.

The ingest process thus needs continual monitoringand adjustment to ensure that the three questions areanswered satisfactorily, and inconsistencies addressed.This is a significant part of the total cost of ingest, whichis the largest part of the total cost of preservation. Au-tomation of this monitoring would be greatly aided ifthere were a second, independent source of data againstwhich the data extracted from the publisher’s web sitecould be compared.

To what extent can data obtained from CrossRef playthis role? It is not completely independent, as it alsocomes from the publisher. But CrossRef applies somevalidation techniques to the data it receives, and the chan-nel by which it is received is quite different, so it mayplay a significant role.

3 Current File Transfer Ingest ProcessesOne might expect that the process of ingesting contentvia file transfer would be reliable, and would not needheuristics, but this is not the case. A 2014 study byUniversity of California librarians [14] assessed the reli-ability of the holdings Portico obtains via file transfer byexamining “a random sample of 104 titles” from “a listof 10,460 CDL licensed titles from the Portico e-JournalPublishers list”:

The holdings for each title were deter-mined by looking up each title on the pub-lisher’s website, recording the volumes, is-sues/volume, year, month and supplements (ifany), and then totaling the number of issues foreach title. The holdings reported by the pub-lisher were compared to the holdings currentlyaccessible in the Portico Archive and the per-centage of “preserved” issues/title was calcu-lated. No attempt was made in this study toascertain if each issue is complete.

The study concluded that:

2

The major finding is that only 50% of theissues to which UC currently has access via thepublishers’ sites are accessible now in the Por-tico Archive.

Portico’s response was “that some lacunae were causedby issues lacking data needed to properly ingest them;some others are in the backlog of ingest work”.

Portico is not alone in this respect. As regards “thebacklog of ingest work”, a study by the CLOCKSSArchive compared the dates at which content was re-ceived at the archive from a major publisher via file trans-fer with the data of publication. It concluded that ap-proximately half the articles were received in the year ofpublication, about a quarter in the year following publi-cation, and the remaining quarter over the decade follow-ing publication.

As regards “lacking data needed to properly ingestthem”, the CLOCKSS Archive’s experience suggeststhat the quality of publisher-supplied metadata dependsfar more on the publisher than on the mode of ingest.

The current file transfer content ingest pipeline for theCLOCKSS Archive is similar in outline to those of othere-journal archives using file transfer. It is documentedin detail in the Archive’s TRAC audit materials [18].Briefly, the process of ingesting journal content involves:

• Obtaining packages of content from each publisherby one of the following techniques:

– Pull FTP, in which the publisher gives thearchive a account on their FTP server. Thearchive uses it to poll the FTP server to findand download content newly placed there bythe publisher.

– Push FTP, in which the archive gives the pub-lisher an account on an FTP server at thearchive. The publisher uses it to place contentthere, which the archive subsequently ingests.

– rsync, in which rsync [20] is used to synchro-nize a directory tree at the publisher with a di-rectory tree at the archive. The publisher addscontent to their tree, the archive ingests con-tent that arrives in their tree.

• At scale, network transfers of content are not com-pletely reliable. Ideally, the package format used bythe publisher provides checksums against which thereceived package can be validated. Packages thatfail validation should be discarded. Some feedbackmechanism is needed:

– In the Pull and rsync cases to tell the publisherwhen the content has been successfully down-loaded so that they can free up the space ittakes.

– In the Push case to tell the publisher that thecontent they sent was corrupted in transit andshould be re-sent.

• Some archives, especially those which normalizethe content for dissemination, unpack the contentfrom its publisher-specific format for storage. Oth-ers, especially dark archives, postpone this step un-til the content is triggered for dissemination.

• The metadata extractor [10] is programmed to un-derstand each publisher’s package format, and toknow where in it to find the necessary metadata.This may be in separate XML files or, as withharvest content, in the content itself. It then cre-ates an entry containing the metadata in a metadatadatabase [9]. This may use heuristics to locate themetadata in one or more of the article’s files, anddoes use heuristics to normalize it into a canonicalform [16].

File transfer has the advantage that it uses fewerheuristics than harvest, but it has the disadvantage thatit obtains content only when the publisher deigns to sendit, rather than when it is made available to the readers.In the real world, both the heuristics and the publishers’processes are less than perfect. It works well for the ma-jor publishers that use it, but it is too complex for smallerpublishers in the “long tail” of journals, which are, afterall, those at most risk and thus at most need of preserva-tion.

Although, per article, the file transfer ingest processneeds less monitoring and adjustment than the harvestprocess, it can’t be left to operate unsupervised. Automa-tion of this monitoring would be greatly aided if therewere a second, independent source of data against whichthe data extracted from the content the publisher trans-ferred could be compared. To what extent can data ob-tained from CrossRef play this role?

4 Building Blocks for a SolutionThe previous sections described the significant problemsthat e-journal archiving efforts face in the current envi-ronment, when trying to address the crucial Manifest,Completeness, and Bibliography (including Identity) re-quirements. These include:

• The lack of a reliable and machine-actionable startpage to support collecting newly published objects.

• The reliance on publisher-specific heuristics to de-termine links on human start pages that lead tonewly published objects. These pages frequentlylink to many things other than new objects.

• The reliance on publisher-specific heuristics to de-termine the web boundary of new objects. The entry

3

pages for these objects link to web resources that arepart of the object but also to many resources that arenot. In addition, different publishers have differentimplementations of these entry pages: sometimesthey are landing pages providing an abstract, some-times they are an HTML version of an article.

• The reliance on publisher-specific heuristics to ob-tain bibliographic metadata describing the new ob-jects, including the object’s identifier. Publishermetadata is of variable quality, can be located in anumber of places (e.g., in HTML meta-tags, in thetext, in dedicated bibliographic records arranged ac-cording to various metadata formats).

• In case of the File Transfer approach, the inability toverify whether content that is published is also madeavailable (in a timely manner) by the publisher forarchiving.

This section outlines a solution to fundamentally ad-dress these problems. The solution involves new webinfrastructure to be operated by CrossRef and academicpublishers, and also leverages existing CrossRef infras-tructure. And, although the focus of this paper is one-journals, the proposed solution can also be imple-mented by institutional repositories, discipline reposito-ries, and, more generally, portals that contribute to web-based scholarly communication. In addition, the solutionhas merits beyond e-journal archiving. It can empowerother applications such as citation managers and alt met-rics. And it can help DOIs achieve their full potential aspersistent identifiers addressing the problem that, manytimes, DOI-identified objects are referenced by meansof their locating URI instead of their persistent DOI atdx.doi.org [21].

The proposed solution builds on two components:

• Machine-Actionable Change Feeds - As a means toprovide information about new, changed, or deletedscholarly objects at an aggregate level, that is, at thelevel of CrossRef, a publisher’s platform, a journal,an institutional repository, etc.

• Signposting - As a means to provide informationabout the resources that make up the web presenceof a scholarly object, when HTTP-interacting witheach of those resources individually.

4.1 ModelingBoth components crucially depend on an approach tomodel the various resources that make up the web pres-ence of a scholarly object in a manner that can be appliedacross publishers, repositories, and portals; modeling re-source types is addressed in Section 4.1.1. Both compo-nents also crucially rely on typed links that can be used

to interconnect these various web resources as a meansto express identity of the object, its web boundary, thelocation of bibliographic metadata describing the object,etc.; typed links used for these purposes are addressed inSection 4.1.2.

4.1.1 Resource Types and Resource PatternsSeveral web resources of different types make up thedistributed web presence of a scholarly digital object.For example, for most academic publications, a DOI atdx.doi.org provides a persistent HTTP identity for the ob-ject, and a so-called landing page that describes the ob-ject serves as the page of first entry into a publisher’sportal when dereferencing that DOI. But, sometimes thepage of entry is not a landing page but rather an HTMLversion of a publication. Also, several institutional anddiscipline repositories do not assign DOIs but use lo-cal persistent HTTP URIs that are preferred for linkingand citing. To address the Manifest, Completeness, andBibliography requirements, these types of web resourcesmust be distinguished:

• Identifying HTTP URI - An HTTP URI that pro-vides a scholarly object with a web identity thatis intended to be persistent. The persistent Iden-tifying HTTP URI redirects to a location URI thatcan change over time because the identified con-tent moves from one web location to another, forexample, because of platform migrations (locationURI changes within the same domain) or transfer ofcontent to a new custodian (location URI changesto another domain). There are two common ap-proaches to achieve the intended persistence. Oneconsists of operating shared, cross-domain, persis-tent identifier infrastructure. In this case, Identify-ing HTTP URIs are minted in a dedicated domainand are redirected to locating URIs in a wide va-riety of domains. Examples include DOI, handle,identifiers.org, PURL, and W3ID. In this approach,the correspondence between the Identifying HTTPURI and the locating URI is contained in a sharedlook-up table that is kept up-to-date by custodiansof the identified content. Another approach consistsof minting Identifying HTTP URIs in the same do-main as the locating URI. The correspondence be-tween the Identifying HTTP URIs and the locatingURI is locally maintained, either in a look-up tableor by means of web server rewrite rules.

• Entry Page - The page in a publisher’s portal whereone typically enters when accessing a scholarly ob-ject, usually by following redirects from an Identi-fying HTTP URI or via a search engine result. Formany publishers, this Entry Page is an HTML land-ing page that describes the object. But other pub-

4

lishers provide an HTML version of an article asthe Entry Page.

• Publication Resource - A web resource provided byor via the publisher that is considered an integralpart of the scholarly object. Examples include var-ious renderings (HTML, PDF, PS, XML, etc.) ofan article but also supplementary materials that areconsidered part of the object.

• Bibliographic Resource - A web resource that de-scribes the object in a structured manner, for exam-ple, using BibTeX or RIS formatting styles.

Not all platforms have all of the above types of webresources associated with a digital object. Also, the wayin which these resources are combined differs dependingon the platform. Figures 1, 2, and 3 show various con-figurations as well as example platforms that use them.Each configuration shows the pattern involving Publica-tion Resources to the left, and the pattern with Biblio-graphic Resources to the right.

4.1.2 Relation TypesIn order to connect all resources that make up the webpresence of a digital object, typed links are used. Theselinks serve the purpose of expressing the type of a re-source, when necessary, of indicating the identity of theobject and delineating its web boundary, and to provideaccess to bibliographic metadata. Each typed link con-tains:

• The URI that is the source of the link. Sometimes,the source URI is provided implicitly such as, forexample, in HTML where it is the URI of the docu-ment in which the link is embedded. Sometimes thesource is provided explicitly.

• The URI that is the target of the link. A link be-tween a source URI and a target URI expresses thatthe target is somehow related to the source.

• The type of the relationship between the source andthe target. This is expressed by means of the relattribute. Values for this attribute are strings regis-tered in the IANA Link Relation Type Registry [6]and typically represent rather coarse relation typesintended for web-wide interoperability. Values canalso be URIs minted by communities, allowing forincreased expressiveness but leading to decreasedinteroperability.

• Optionally, additional attributes can be conveyed onthe link, such as type to express the mime-type of thetarget resource and others listed in RFC5988 [12].

Both proposed components - Change Feeds and Sign-posting - can and should use the same link relation typesand link attributes. But the way in which these are con-veyed differs per component:

• Machine-Actionable Change Feeds - The links andattributes are conveyed as part of a feed entry that isdedicated to a change event pertaining to a specificscholarly object.

• Signposting - The links and attributes are conveyedin the HTTP Link header [12] when issuing anHTTP HEAD/GET on the resource.

Research conducted since the publication of [22],which introduced the Signposting concepts, has led toa shortlist of relation types - summarized in Table 1 -that can be used to address the Manifest, Completeness,and Bibliography requirements. Links with these rela-tion types must be applied to the various configurationsdiscussed above to interlink all resources in such a waythat a crawler or other application can unambiguously(and hence automatically) navigate its way around theweb resources that make up a digital object. With oneexception, all required relation types are already regis-tered in the IANA Link Relation Type Registry [6]:

• item & collection: In order to group all the pub-lisher’s Publication Resources that are part of ascholarly object, the Entry Page for the object ismodeled as a collection that contains the other re-sources as items. This Entry Page collection re-source binds the other Publication Resources bymeans of the item relation type. These item Publi-cation Resources in their turn link back to the EntryPage collection resource using the collection rela-tion type. Which resources should be considered asitems for an object is at a publisher’s discretion butdefinitely included should be all renderings of an ar-ticle (PDF, HTML, PS, ...) and any supplementaryfiles considered to be an integral part of the object.

• persistent-id: In order for the Entry Page and thePublication Resources to express that they are partof an object with an Identifying HTTP URI they linkto that URI using the persistent-id relation type.For example, for a DOI-identified object the targetof this persistent-id is the DOI at dx.doi.org. Thepersistent-id relation type is not yet registered butthe plan is to author an RFC to define and registerit. At the time of writing, the name persistent-id isused as a placeholder.

• type: In order for the Entry Page and the Pub-lication Resources to express their own nature,the type relation type is used. To avoid end-less classification discussions, the proposal is to

5

IdentifyingHTTP URI

Entry Page

BibliographicResources

IdentifyingHTTP URI

Entry Page

PublicationResources

DOI

PublicationHTML

PublicationPDF

PublicationXML

Supple-mentary

redirect

DOI

PublicationHTML

MetadataBibTex

MetadataRIS

redirect

Pattern used by PLOS

IdentifyingHTTP URI

Entry Page


IdentifyingHTTP URI

Entry Page


DOI

LandingPage

PublicationPDF

Supple-mentary

redirect

DOI

LandingPage

MetadataBibTex

redirect

Pattern used by APS

HTML

Figure 1: Patterns used by PLOS and APS, respectively

6

IdentifyingHTTP URI

Entry Page


IdentifyingHTTP URI

Entry Page


handle

LandingPage

PublicationPDF

Supple-mentary

redirect

handle

LandingPage

MetadataDC

redirect

Pattern used by DSpace

IdentifyingHTTP URI

Entry Page


IdentifyingHTTP URI

Entry Page


DOI

LandingPage

DataFile

DataFile

redirect

DOI

LandingPage

MetadataBibTex

redirect

Pattern used by Dryad

DataFile

MetadataRIS

Figure 2: Patterns used by DSpace and Dryad, respectively

7

IdentifyingHTTP URI

Entry Page


IdentifyingHTTP URI

Entry Page


LandingPage

PublicationPDF

PublicationPS

LandingPage

MetadataBibTex

Pattern used by arXiv.org

IdentifyingHTTP URI

Entry Page


IdentifyingHTTP URI

Entry Page


LandingPage

LandingPage

Pattern used by eprints

OtherFormats

Publicationfor print

Supple-mentary

HTTP URI

HTTP URI

MetadataBibTex

MetadataRIS

redirect redirect

Figure 3: Patterns used by arXiv.org and eprints, respectively

8

use as targets of these type links URIs that arecoarse indicators of the resource’s own nature.The info:eu-repo/ vocabulary1, originated by SURF2

and meanwhile maintained by COAR3 providesgood basic candidates. For example, it uses theURI info:eu-repo/semantics/humanStartPage (or aPURL variant) to indicate that a resource is alanding page, info:eu-repo/semantics/article to con-vey it’s an article, info:eu-repo/semantics/datasetfor a dataset, info:eu-repo/semantics/objectFile fora file that is neither article nor dataset, andinfo:eu-repo/semantics/descriptiveMetadata for bib-liographic metadata.

• describedby & describes: The describedby relationtype is used to link to a metadata resource that de-scribes the scholarly object. A publisher uses it tolink from the Entry Page to a metadata resource andCrossRef uses it to link from a DOI at dx.doi.orgto its metadata that describes the DOI-identified ob-ject. The inverse relation type describes is used tolink back from the metadata resource. Following thesame examples, this links from a publisher’s meta-data record to the Entry Page and from CrossRef’smetadata to the DOI at dx.doi.org.

For typed links that use these relation types, the addi-tion of the following attributes - summarized in Table 1- can be important. The first attribute is already widelyused, and work is under way to define the two additionalattributes4 [13].

• type: In order to express the mime-type of the targetresource, the type attribute is used. This can, for ex-ample, be very useful on item links pointing at Pub-lication Resources to allow distinguishing between,e.g., the PDF, HTML, PS version of an article.

• profile: The profile attribute can be used to providefurther details about the representation of a linkedresource. It can be especially helpful for Biblio-graphic Resources. Many have text/plain, appli-cation/xml, and, more recently, application/json asmime-type and it would be helpful to further in-form clients about the specific nature of a format,for example BibTeX, RIS, Dublin Core, or MODS.This can be achieved by adding a profile attributeto the describedby links with as value a URI thatuniquely identifies the format, e.g., http://www.bibtex.org/ for BibTeX, and an XML SchemaURI for XML metadata.

1https://wiki.surfnet.nl/display/standards/info-eu-repo

2https://www.surf.nl/3https://www.coar-repositories.org/4https://github.com/mnot/I-D/issues/175

Link Relation Type Attributescollection type, profile, sem-typedescribedby type, profile, sem-typedescribes type, profile, sem-typeitem type, profile, sem-typepersistent-idtype type, profile, sem-type

Table 1: Relation types and attributes to addressthe Manifest, Completeness, and Bibliography require-ments.

• sem-type As mentioned above, a link with a typerelation type is used to point from a resource to aURI that expresses the resource’s own nature, e.g.,whether it is a landing page, an article, etc. Thesem-type attribute can be used on a link to expressthe nature of the target resource. As such, whilethe type attribute on a link expresses the mime-type of the linked resource (e.g., application/pdf),the sem-type attribute expresses its semantic type(e.g., info:eu-repo/semantics/article). At the time ofwriting, the name sem-type is used as a placeholder.

Typical relationships among web resources that makeup a scholarly article are shown in Figure 4. The figuredepicts a pattern used by several publishers, includingPLOS. Relations pertaining to identification (persistent-id) are in red. Links pertaining to delineating the webboundary of an object (item & collection) are shown inblue, and the ones expressing the nature of a resource(type) in green. In this pattern, the Entry Page is theHTML version of the article. A crawler that has fol-lowed redirects from the DOI at dx.doi.org and has ar-rived at the Entry Page can use links in that page’s HTTPheader to find all Publication Resources without havingto screen-scrape the page. In this pattern, Publication Re-sources can be gathered by following the item links andby collecting the Entry Page itself because it has a typelink that identifies itself as an article. Note also how thecollection links allow a crawler to collect all PublicationResources even if it did not land on the Entry Page. Sim-ilarly, the persistent-id links allow the crawler to deter-mine the Identifying HTTP URI of a Publication Resourcewhen it lands on one. These links are also very useful forapplications other than archival crawlers. For example,they allow a citation manager to automatically detect theDOI of an article, even when landing on its PDF version.

The Figure also shows the use of the type attributeon item links to convey the mime-type of a linked re-source. This can significantly assist selective crawlers,which only want to collect components of the object suit-able for their processes. An example would be a crawlersupporting data-mining, which only needs the text andwould much prefer HTML over PDF [1]. The sem-type

9

http://www.bibtex.org/


https://wiki.surfnet.nl/display/standards/info-eu-repo

https://wiki.surfnet.nl/display/standards/info-eu-repo

https://www.surf.nl/

https://www.coar-repositories.org/

https://github.com/mnot/I-D/issues/175

DOI Identifying HTTP URI

PUB

PDF

PUB

HTML

rel="item"

type="application/pdf"

HTTPredirect(s)

rel="collection"

PUB

XML

Entry Page

rel="collection"

info:eu-repo/semantics/publication

rel="type"rel="type"

rel="type"

rel="persistent-id" rel="persistent-id"

rel="persistent-id"

Publication Resources

rel="item"type="application/xml"

Figure 4: Typed links to connect resources that make up the web presence of a scholarly digital object.

10

DOI Identifying HTTP URI

PUB

HTMLEntry Page

info:eu-repo/semantics/descriptiveMetadata

BIB

TEX

BIB

RISBibliographic Resources

BIB

XREF

rel="describedby"type="application/json"profile="crossref"

rel="describedby"type="text/plain"profile="ris"

rel="describedby"type="text/plain"profile="bibtex"

rel="describedby"type="application/json"profile="crossref"

rel="type"

rel="type"

rel="type"

rel="describes"

rel="describes"

rel="describes"

HTTPredirect(s)

Figure 5: Typed links to support discovery of bibliographic metadata.

11

Infrastructure Manifest Completeness BibliographyImplemented by CrossRef

CrossRef API +\- - +ResourceSync + - +Signposting - - +

Implemented by PublisherCrossRef API - - -ResourceSync + + +Signposting - + +

Table 2: Infrastructure to address the Manifest, Com-pleteness, and Bibliography requirements

attribute, not shown in the figure, would support simi-lar selectivity regarding the nature of a linked resourcerather than its mime-type.

Figure 5 shows the relationships pertaining to Biblio-graphic Resources for the same pattern. It shows twometadata records exposed by the publisher. Note that itis essential that a metadata record be accessible at a URIby using HTTP GET. Strangely enough, some publish-ers use a form-based HTTP POST approach for meta-data access. For brevity, the figure merely shows strings(e.g., BibTeX) as the value of the profile attribute, in real-ity, a URI (e.g., http://www.bibtex.org/) mustbe used. Note also that, in the figure, CrossRef providesa describedby link pointing from the DOI at dx.doi.org atits metadata that describes a DOI-identified object. In ad-dition, the publisher also links to that metadata. It can doso because the URI of CrossRef’s metadata can be con-structed when the DOI is known. All these links supportdiscovery of metadata without out-of-band knowledge.

4.2 Web InfrastructureThis section introduces concrete technologies that canbe used to address the Manifest, Completeness, andBibliography requirements: the CrossRef API [3], AN-SI/NISO Z39.99 ResourceSync [11], and Signposting.These technologies are used in a manner that leveragesthe modeling approach described above. The section alsohighlights which technology addresses which of the threerequirements; this is summarized in Table 2.

4.2.1 CrossRef APIThe CrossRef API provides two functionalities that canbe used as building blocks for a solution:

• A means to retrieve recently registered DOIs and as-sociated metadata, which can address the Manifestand Bibliography requirements.

• A means to access metadata about a DOI-identifiedobject, which can address the Bibliography require-ment.

Listing 1 shows an API query issued on March 17,2016 that returned the then 20 most recently depositedDOIs; only two entries are shown.

This output illustrates some of the problems of us-ing the CrossRef API to address the Manifest require-ment. The second entry doi:10.1016/j.jmaa.2016.03.023is a current paper published by Elsevier, whichan archive would want to ingest. The first entrydoi:10.1029/jd094id06p08425 was published June 20,1989. The 10.1029 prefix belongs to the American Geo-physical Union, whose member number is 13. The mem-ber number 311 is Wiley-Blackwell. This entry is pre-sumably the result of a journal transfer, about whicharchives need to be informed but which may or may notrequire that the journal’s back content be re-ingested, de-pending upon arrangements with the previous and newpublisher. As such, the fact that the API response doesnot discriminate between newly registered and changedDOIs is problematic.

Archives could poll this API, concatenate and dedu-plicate the results, filter for the publishers they preserve,and synthesize what they really want, which is a feed ofnewly registered DOIs for their publishers. Using thisAPI query would also address the Bibliography require-ment. But, since this approach to address the Manifestrequirement involves recurrent queries against the entireCrossRef database, it may not be the most efficient ap-proach for either CrossRef or the users of their API. Thenext section describes how the use of the ANSI/NISOZ39.99 ResourceSync standard is an appealing alterna-tive.

Listing 2 shows how the CrossRef API is used to re-turn JSON bibliographic metadata pertaining to a DOI-identified object. Assuming the DOI of a newly reg-istered or changed DOI is known, this API query canbe used to address the Bibliography requirement. Thismetadata is obtained without using heuristics and Cross-Ref applies validation processes to it on receipt. Thismetadata can be used as is but can also be compared tometadata that might be exposed by publishers as Biblio-graphic Resources.

4.2.2 ResourceSync

ANSI/NISO Z39.99 ResourceSync [11] was published in2014, as the result of a 3-year joint effort by NISO andthe Open Archives Initiative aimed at devising a “webby”successor to OAI-PMH [8], the popular Protocol forMetadata Harvesting. ResourceSync is about synchro-nizing web resources - anything with a URI - across sys-tems, not merely about synchronizing repository meta-data like its predecessor OAI-PMH. But, since metadatacan and is made available at URIs, ResourceSync canalso be used to synchronize metadata. ResourceSync isbased on the Sitemap protocol [5] that is used by searchengines to index the web, but extends that protocol inthree ways that make it rather attractive for the purposeof the e-journal archiving use case:

12


Listing 1: CrossRef API used to retrieve the 20 most recently registered DOIsRequest

wget -O query.out \protect\vrule width0pt\protect\href{http://api.crossref.org/works?sort=deposited}{http://api.crossref.org/works?sort=deposited}&order=desc

Response

{ "message" : { "facets" : { },"items" : [ { "DOI" : "10.1029/jd094id06p08425",

"ISSN" : [ "0148-0227" ],"URL" : "http://dx.doi.org/10.1029/jd094id06p08425","archive" : [ "Portico" ],"author" : [ { "affiliation" : [ ],

"family" : "Livingston","given" : "John M."

},{ "affiliation" : [ ],

"family" : "Russell","given" : "Philip B."

}],

"container-title" : [ "Journal of Geophysical Research: Atmospheres","J. Geophys. Res."

],"created" : { "date-parts" : [ [ 2008,

2,6

] ],"date-time" : "2008-02-06T13:40:44Z","timestamp" : 1202305244000

},"deposited" : { "date-parts" : [ [ 2016,

3,17


},"indexed" : { "date-parts" : [ [ 2016,

3,17


},"issue" : "D6","issued" : { "date-parts" : [ [ 1989,

6,20

] ] },"license" : [ { "URL" : "http://doi.wiley.com/10.1002/tdm_license_1",

"content-version" : "tdm","delay-in-days" : 0,"start" : { "date-parts" : [ [ 1989,

6,20


}},{ "URL" : "http://onlinelibrary.wiley.com/termsAndConditions",

"content-version" : "vor","delay-in-days" : 8494,"start" : { "date-parts" : [ [ 2012,

9,21


}}

],"link" : [ { "URL" : "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1029%2FJD094iD06p08425",

"content-type" : "application/pdf","content-version" : "vor","intended-application" : "text-mining"

} ],"member" : "http://id.crossref.org/member/311","page" : "8425-8433","prefix" : "http://id.crossref.org/prefix/10.1002","published-online" : { "date-parts" : [ [ 2012,

9,21

] ] },"published-print" : { "date-parts" : [ [ 1989,

6,20

] ] },"publisher" : "Wiley-Blackwell","reference-count" : 16,"score" : 1.0,"source" : "CrossRef","subject" : [ "Earth and Planetary Sciences (miscellaneous)",

"Space and Planetary Science","Atmospheric Science","Geophysics"

],"subtitle" : [ ],"title" : [ "Retrieval of aerosol size distribution moments from multiwavelength particulate extinction measurements" ],"type" : "journal-article","volume" : "94"

},...

13

{ "DOI" : "10.1016/j.jmaa.2016.03.023","ISSN" : [ "0022-247X" ],"URL" : "http://dx.doi.org/10.1016/j.jmaa.2016.03.023","alternative-id" : [ "S0022247X16002419" ],"author" : [ { "affiliation" : [ ],

"family" : "Li","given" : "C."

},{ "affiliation" : [ ],

"family" : "Ng","given" : "K.F."

}],

"container-title" : [ "Journal of Mathematical Analysis and Applications" ],"created" : { "date-parts" : [ [ 2016,

3,17



3,17



3,17


},"issued" : { "date-parts" : [ [ 2016,

3] ] },

"license" : [ { "URL" : "http://www.elsevier.com/tdm/userlicense/1.0/","content-version" : "tdm","delay-in-days" : 0,"start" : { "date-parts" : [ [ 2016,

3,1


}} ],

"link" : [ { "URL" : "http://api.elsevier.com/content/article/PII:S0022247X16002419?httpAccept=text/xml","content-type" : "text/xml","content-version" : "vor","intended-application" : "text-mining"

},{ "URL" : "http://api.elsevier.com/content/article/PII:S0022247X16002419?httpAccept=text/plain",

"content-type" : "text/plain","content-version" : "vor","intended-application" : "text-mining"

}],

"member" : "http://id.crossref.org/member/78","prefix" : "http://id.crossref.org/prefix/10.1016","published-print" : { "date-parts" : [ [ 2016,

3] ] },

"publisher" : "Elsevier BV","reference-count" : 35,"score" : 1.0,"source" : "CrossRef","subject" : [ "Applied Mathematics",

"Analysis"],

"subtitle" : [ ],"title" : [ "Extended Newton methods for conic inequalities: Approximate solutions and the extended Smale \alpha-theory" ],"type" : "journal-article"

},...

"message-type" : "work-list","message-version" : "1.0.0","status" : "ok"

}

14

Request

wget \protect\vrule width0pt\protect\href{http://api.crossref.org/works/10.1045/september2015-rosenthal}{http://api.crossref.org/works/10.1045/september2015-rosenthal}

Response

{ "message" : { "DOI" : "10.1045/september2015-rosenthal","ISSN" : [ "1082-9873" ],"URL" : "http://dx.doi.org/10.1045/september2015-rosenthal","author" : [ { "affiliation" : [ ],

"family" : "Rosenthal","given" : "David S. H."

},{ "affiliation" : [ ],"family" : "Vargas","given" : "Daniel L."

},{ "affiliation" : [ ],"family" : "Lipkis","given" : "Tom A."

},{ "affiliation" : [ ],"family" : "Griffin","given" : "Claire T."

}],

"container-title" : [ "D-Lib Magazine" ],"created" : { "date-parts" : [ [ 2015,

9,15



9,15



12,22


},"issue" : "9/10","issued" : { "date-parts" : [ [ 2015,

9] ] },

"member" : "http://id.crossref.org/member/72","prefix" : "http://id.crossref.org/prefix/10.1045","published-online" : { "date-parts" : [ [ 2015,

9] ] },

"publisher" : "CNRI Acct","reference-count" : 0,"score" : 1.0,"source" : "CrossRef","subject" : [ "Library and Information Sciences" ],"subtitle" : [ ],"title" : [ "Enhancing the LOCKSS Digital Preservation Technology" ],"type" : "journal-article","volume" : "21"

},"message-type" : "work","message-version" : "1.0.0","status" : "ok"

}

Listing 2: CrossRef API used to retrieve metadata about a DOI-identified article

15

• The Sitemap protocol only allows for the publi-cation of an inventory of web resources, essen-tially a list of URIs currently exposed by a system.This basic capability is called Resource Lists in Re-sourceSync. But, in addition, ResourceSync alsoallows the publication of Change Lists (lists of re-cently created, updated, or deleted resources), andChange Notifications (resource change events com-municated on a PubSubHubbub channel to whichapplications subscribe). While the payloads in thesecommunications point at resources by means oftheir URI, ResourceSync also provides Dump ap-proaches that allow packaging resource representa-tions in ZIP files while conveying their web URIsin a manifest embedded in the ZIP. For exam-ple, a Change Dump points at a ZIP file of re-cently created/updated/deleted resources. ChangeLists, Change Notifications, and Change Dumps arestandard-based approaches to implement Machine-Actionable Change Feeds. All can play a role forcrawl-based as well as file transfer archiving.

• ResourceSync defines a variety of elements and at-tributes in support of resource synchronization, in-cluding ways to express fixity information, to detailthe temporal interval covered by a Change List, toconvey the datetime and nature (created, updated,deleted) of a change.

• ResourceSync allows to express typed links froma resource subject to synchronization to related re-sources, for example, a link from an article to meta-data describing it. As such, all the relation types(and attributes) described in the above can be usedin ResourceSync payloads.

CrossRef can use ResourceSync Change Lists to ad-dress the Manifest and Bibliography requirements bycommunicating about newly registred, changed, anddeleted (if such a thing exists) DOIs. This can bedone by pre-computation instead of by evaluating aquery against its database at each poll. The use ofResourceSync Change Notifications would additionallyeliminate polling. Listing 3 in Section 5 illustrates howCrossRef can use ResourceSync to report on newly reg-istered DOIs using ResourceSync.

Publishers can use ResourceSync Change Lists orChange Notifications to communicate about newly pub-lished objects. These ResourceSync feeds would ad-dress the Manifest requirement by serving as a machine-actionable, article-centric, alternative to the current jour-nal “start page” visited by the archival web crawler. But,since ResourceSync can provide typed links to relatedresources, it can also be used to address the Complete-ness and Bibliography requirements. Moreover, since

ResourceSync also supports reporting on changed anddeleted resources, publishers could use it to generallykeep applications informed about their evolving content.In addition, since ResourceSync allows expressing fix-ity information, it could also support verification of col-lected resources. Listing 4 in Section 5 shows how a pub-lisher can use ResourceSync to announce the availabilityof a new article.

If CrossRef, publishers, institutional and disciplinerepositories, as well as other portals involved in web-based scholarly communication would uniformly use Re-sourceSync to report on holdings and changes to hold-ings, clients would have a uniform method to stay in-formed about evolving scholarly content. This was theoriginal motivation for creating the ResourceSync stan-dard in the first place. But, unfortunately, so far, Re-sourceSync has not enjoyed wide adoption despite itspower and simplicity. Possible explanations includecompetition with its significantly less powerful and 15year old predecessor OAI-PMH, an overall tendency byportals to resort to rolling their own bespoke APIs, and alack of concrete action regarding interoperability despitea general sentiment that more is needed.

4.2.3 Signposting

A recent paper by Van de Sompel and Nelson [22] showshow simple changes to the publishing platforms can ad-dress the Completeness and Bibliography requirements.Their “Signposting the Scholarly Web” proposal useslinks conveyed in HTTP Link headers, as defined inRFC5988 [12], to interconnect the web resources thatmake up a scholarly digital object. The essence of theSignposting idea is that each such web resource can pro-vide informative typed links in an HTTP Link header inresponses to HTTP HEAD/GET requests.

In order to address the Completeness requirement,publishers can express typed links and attributes, as de-scribed in Section 4.1.2 and as illustrated in Figure 4, us-ing HTTP Link headers when a client issues an HTTPHEAD/GET on the Entry Page and Publication Re-sources. These links allow a crawler or other applica-tion to unambiguously navigate across the web resourcesthat make up a scholarly object. Listing 6 in Section 5shows a response header for the Entry Page, which in-cludes links to the Identifying HTTP URI and to each ofthe Publication Resources. Another link indicates thatthe Entry Page is actually an article. Listing 7 shows aresponse header for the PDF version of an article con-taining links that indicate the PDF is actually an article,what its Identifying HTTP URI is, and where the EntryPage can be found from which other resources that arepart of the object can be discovered.

In order to address the Bibliography requirement,CrossRef can include a link in the HTTP response header

16

for a DOI at dx.doi.org that points with the describedbyrelation type to the URI of the metadata record. Thislink should also have the type attribute to convey themime-type of CrossRef’s metadata, i.e., application/json,and the profile attribute to provide further informationabout the specific JSON used by CrossRef. Listings5 in Section 5 provides an illustration. The metadatarecord can link back to the DOI at dx.doi.org using theinverse describedby relation type. Publishers can addexactly the same describedby link to point from an En-try Page to the guessable URI for metadata at Cross-Ref’s API. In addition, bibliographic metadata exposedby publishers can be handled in a similar way by in-cluding describedby links, each pointing from the EntryPage to the URI of a metadata record expressed accord-ing to a specific format. Again, these links should usethe type and profile attributes. Listing 6 in Section 5 il-lustrates these links. The URIs of bibliographic recordsboth at CrossRef and at the publisher can further in-clude a link in their HTTP response header that charac-terizes the resources as bibliographic metadata (info:eu-repo/semantics/descriptiveMetadata in the info:eu-repovocabulary) using the type relation type.

5 SummaryThe remainder of this section provides a concrete in-sight into how CrossRef and publishers can support theManifest, Completeness, and Bibliography requirementsby implementing ResourceSync and Signposting, and byleveraging existing CrossRef API functionality. The ex-amples shown below are for the PLOS One paper “Schol-arly Context Not Found: One in Five Articles Suffersfrom Reference Rot” [7] for which Table 3 shows theURIs of web resources that make up the paper’s webpresence. The examples use all the relation types andattributes discussed above.

5.1 ResourceSyncPresume CrossRef uses ResourceSync Change Lists orChange Notifications to convey newly registered DOIs.The payload to convey the new PLOS One paper isshown in Listing 3. Note that the design choice ismade to express the payload in terms of the Identify-ing HTTP URI (the DOI at dx.doi.org in the loc element)and include a link from that URI to the metadata recordthat describes the DOI-identified object. As such, Re-sourceSync is used to communicate about change eventsin the DOI namespace. These can include new DOIregistration (created event), changes to a registration(updated event), and deletions of a DOI (deleted event),if such a thing exists.

The publisher can also use ResourceSync ChangeLists or Change Notifications to convey the publicationof a new paper. Listing 4 shows an intentionally rich

payload that does this for paper [7], and that is expressedin terms of the URI of the Entry Page (in this case theHTML article) modelled as a collection. Links have theURI of the Entry Page (the URI in the loc element) assource and express the nature of the Entry Page (info:eu-repo/semantics/article), reveal the Identifying HTTP URI,and provide a pathway to Publication Resources and Bib-liographic Resources. Note that the mime-type of linkedresources is conveyed using the type attribute, and theirnature using the sem-type attribute.

In addition to the information shown, ResourceSyncalso allows expressing fixity information pertaining tothe resource that is the subject of the payload (the URIin the loc element) as well as pertaining to linked re-sources. Changes to the paper, for example, an updateor deletion of a Publication Resources can be communi-cated in a similar way. In this case, however, the changeevent should be expressed in terms of the URI of thechanged resource (the URI in the loc element), and linksto neighboring resources (Entry Page, Persistent HTTPURI) should be provided.

5.2 Signposting

Signposting is about communicating information in theLink header of a web resource associated with a schol-arly digital object when HTTP-interacting with it. List-ing 5 shows that CrossRef can use Signposting to supportdiscovery of its metadata for a DOI-identified object.

Publishers can also use Signposting to make meta-data about their articles discoverable. Listing 6 showslinks from the Entry Page to metadata available from theCrossRef API as well as two types of publisher meta-data. Most importantly, it also shows links to all Publica-tion Resources as well as links that reveal the IdentifyingHTTP URI. In general, publishers can use Signpostingfor each of their web resources associated with a schol-arly object. As an example, Listing 7 shows an HTTPHEAD interaction with the PDF version of the PLOSarticle. The links provided in the HTTP response Linkheader characterize the resource as an article, convey itsIdentifying HTTP URI (DOI at dx.doi.org), and link backto the collection resource (the Entry Page) from whichother components of the scholarly object are linked.

Note that Link headers could eventually become bigif additional relations that serve other use cases would beincluded. While web servers and clients can deal with ex-tensive HTTP headers, it may eventually become worth-while to consider defining a relation type, for examplelinks dedicated to pointing at a document that containslinks as an alternative for providing all links in the HTTPheader.

17

<url><loc>\protect\vrule width0pt\protect\href{http://dx.doi.org/10.1371/journal.pone.0115253</loc}{http://dx.doi.

org/10.1371/journal.pone.0115253</loc}><rs:md change="created" datetime="2014-12-26T00:00:00Z"/><rs:ln rel="describedby"

href="http://api.crossref.org/works/10.1371/journal.pone.0115253"type="application/json"profile="https://github.com/CrossRef/rest-api-doc"/>

</url>

Listing 3: Payload for a ResourceSync change event in which CrossRef announces a new DOI

<url><loc>\protect\vrule width0pt\protect\href{http://journals.plos.org/plosone/article?id=10.1371/journal.pone

.0115253</loc}{http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253</loc}><rs:md change="created" datetime="2014-12-26T00:00:00Z"/><rs:ln rel="type"

href="info:eu-repo/semantics/article" /><rs:ln rel="persistent-id"

href="http://dx.doi.org/10.1371/journal.pone.0115253" /><rs:ln rel="item"

href="http://journals.plos.org/plosone/article/asset?id=10.1371%2Fjournal.pone.0115253.PDF"type="application/pdf"sem-type="info:eu-repo/semantics/article"/>

<rs:ln rel="item"

href="http://journals.plos.org/plosone/article/asset?id=10.1371%2Fjournal.pone.0115253.XML"type="application/xml"sem-type="info:eu-repo/semantics/article"/>

<rs:ln rel="item"

href="http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0115253.s012"

type="text/html"sem-type="info:eu-repo/semantics/objectFile"/>

<rs:ln rel="describedby"

href="http://api.crossref.org/works/10.1371/journal.pone.0115253"type="application/json"profile="https://github.com/CrossRef/rest-api-doc"/>

<rs:ln rel="describedby"

href="http://journals.plos.org/plosone/article/citation/bibtex?id=10.1371%2Fjournal.pone.0115253"type="text/plain"profile="http://bibtex.org"/>

<rs:ln rel="describedby"

href="http://journals.plos.org/plosone/article/citation/ris?id=10.1371%2Fjournal.pone.0115253"type="text/plain"profile="https://en.wikipedia.org/wiki/RIS_(file_format)"/>

</url>

Listing 4: Payload for a ResourceSync change event in which PLOS announces a new paper

18

Resource Resource URICrossRef

DOI http://dx.doi.org/10.1371/journal.pone.0115253CrossRef metadata http://api.crossref.org/works/10.1371/journal.pone.0115253

Publisher - PLOSHTML article http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253PDF article http://journals.plos.org/plosone/article/asset?id=10.1371%2Fjournal.pone.

0115253.PDFXML article http://journals.plos.org/plosone/article/asset?id=10.1371%2Fjournal.pone.

0115253.XMLSupplementary file http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/

journal.pone.0115253.s012PLOS BibTeX metadata http://journals.plos.org/plosone/article/citation/bibtex?id=10.1371%2Fjournal.

pone.0115253PLOS RIS metadata http://journals.plos.org/plosone/article/citation/ris?id=10.1371%2Fjournal.

pone.0115253

Table 3: URIs of resources that make up the web presence of the PLOS One example article [7]

Request

curl -I \protect\vrule width0pt\protect\href{http://dx.doi.org/10.1371/journal.pone.0115253}{http://dx.doi.org/10.1371/journal.pone.0115253}

Response

HTTP/1.1 303 See OtherVary: AcceptLocation: \protect\vrule width0pt\protect\href{http://dx.plos.org/10.1371/journal.pone.0115253}{http://dx.

plos.org/10.1371/journal.pone.0115253}Link: <\protect\vrule width0pt\protect\href{http://api.crossref.org/works/10.1371/journal.pone.0115253}{http

://api.crossref.org/works/10.1371/journal.pone.0115253}>; rel="describedby"; type="application/json"; profile="https://github.com/CrossRef/rest-api-doc"Content-Type: text/html;charset=utf-8Content-Length: 179Date: Thu, 05 May 2016 21:59:52 GMT

Listing 5: HTTP HEAD on a DOI provides a link to metadata describing the DOI-identified object

19

http://dx.doi.org/10.1371/journal.pone.0115253

http://api.crossref.org/works/10.1371/journal.pone.0115253

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253

http://journals.plos.org/plosone/article/asset?id=10.1371%2Fjournal.pone.0115253.PDF

http://journals.plos.org/plosone/article/asset?id=10.1371%2Fjournal.pone.0115253.PDF

http://journals.plos.org/plosone/article/asset?id=10.1371%2Fjournal.pone.0115253.XML

http://journals.plos.org/plosone/article/asset?id=10.1371%2Fjournal.pone.0115253.XML

http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0115253.s012

http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0115253.s012

http://journals.plos.org/plosone/article/citation/bibtex?id=10.1371%2Fjournal.pone.0115253

http://journals.plos.org/plosone/article/citation/bibtex?id=10.1371%2Fjournal.pone.0115253

http://journals.plos.org/plosone/article/citation/ris?id=10.1371%2Fjournal.pone.0115253

http://journals.plos.org/plosone/article/citation/ris?id=10.1371%2Fjournal.pone.0115253

Request

curl -I \protect\vrule width0pt\protect\href{http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253}{http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253}

Response

HTTP/1.1 200 OKDate: Mon, 09 May 2016 19:18:04 GMTLink: <info:eu-repo/semantics/article>; rel="type" ,<\protect\vrule width0pt\protect\href{http://dx.doi.org/10.1371/journal.pone.0115253}{http://dx.doi.org

/10.1371/journal.pone.0115253}>; rel="persistent-id" ,<\protect\vrule width0pt\protect\href{http://journals.plos.org/plosone/article/asset?id=10.1371}{http://

journals.plos.org/plosone/article/asset?id=10.1371}%2Fjournal.pone.0115253.PDF>; rel="item"; type="application/pdf"; sem-type="info:eu-repo/semantics/article"/> ,<\protect\vrule width0pt\protect\href{http://journals.plos.org/plosone/article/asset?id=10.1371}{http://

journals.plos.org/plosone/article/asset?id=10.1371}%2Fjournal.pone.0115253.XML>; rel="item"; type="application/xml"; sem-type="info:eu-repo/semantics/article"/> ,<\protect\vrule width0pt\protect\href{http://journals.plos.org/plosone/article/asset?unique}{http://journals

.plos.org/plosone/article/asset?unique}&id=info:doi/10.1371/journal.pone.0115253.s012>; rel="item"; type="text/html"; sem-type="info:eu-repo/semantics/objectFile"/> ,<\protect\vrule width0pt\protect\href{http://api.crossref.org/works/10.1371/journal.pone.0115253}{http://api

.crossref.org/works/10.1371/journal.pone.0115253}>; rel="describedby"; type="application/json"; profile="https://github.com/CrossRef/rest-api-doc",<\protect\vrule width0pt\protect\href{http://journals.plos.org/plosone/article/citation/bibtex?id=10.1371}{

http://journals.plos.org/plosone/article/citation/bibtex?id=10.1371}%2Fjournal.pone.0115253>; rel="describedby"; type="text/plain"; profile="http://bibtex.org",<\protect\vrule width0pt\protect\href{http://journals.plos.org/plosone/article/citation/ris?id=10.1371}{http

://journals.plos.org/plosone/article/citation/ris?id=10.1371}%2Fjournal.pone.0115253>; rel="describedby"; type="text/plain"; profile="https://en.wikipedia.org/wiki/RIS_(file_format)",Content-Type: text/html;charset=utf-8Content-Language: en-USContent-Length: 300137

Listing 6: HTTP HEAD on the Entry Page provides links to CrossRef and publisher metadata describing the DOI-identified object

Request

curl -I \protect\vrule width0pt\protect\href{http://journals.plos.org/plosone/article/asset?id=10.1371}{http://journals.plos.org/plosone/article/asset?id=10.1371}%2Fjournal.pone.0115253.PDF

Response

HTTP/1.1 200 OKDate: Thu, 05 May 2016 21:59:52 GMTLast-Modified: Mon, 02 Mar 2015 22:52:02 GMTContent-Type: application/pdfContent-Length: 1794628Link: <info:eu-repo/semantics/article>; rel="type" ,<\protect\vrule width0pt\protect\href{http://dx.doi.org/10.1371/journal.pone.0115253}{http://dx.doi.org

/10.1371/journal.pone.0115253}>; rel="persistent-id" ,<\protect\vrule width0pt\protect\href{http://journals.plos.org/plosone/article?id=10.1371/journal.pone

.0115253}{http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253}>; rel="collection"; type="text/html"; sem-type="info:eu-repo/semantics/article"

Listing 7: HTTP HEAD on the PDF article provides links to the DOI, the canonical URI, the collection resource, andcharacterizes the PDF as an article

20

5.3 RecommendationsBased on the above discussions, this section providesrecommendations for CrossRef, academic publishers,and, more broadly, all operators of portals involved inscholarly communication.

5.3.1 Recommendations - CrossRefIn order to support the Manifest and Bibliography re-quirements, CrossRef is invited to consider these recom-mendations:

• Recommendation 1 - Regarding Manifest: Imple-ment ResourceSync Change Lists and/or ChangeNotifications to provide timely information regard-ing changed DOI registrations, i.e., new, updated,deleted. See Listing 3.

• Recommendation 2 - Regarding Bibliography - Ex-press ResourceSync Change Lists and/or ChangeNotifications in terms of the DOI at dx.doi.org andprovide a describedby link pointing at the metadatathat describes the DOI-identified object. See Listing3.

• Recommendation 3 - Regarding Bibliography: Im-plement Signposting as a means to support discov-ery of bibliographic metadata describing a DOI-identified object without the need for out-of-boundknowledge, by providing a describedby link and as-sociated attributes (as per Table 1) pointing from theDOI at dx.doi.org to the metadata available via theCrossRef API. See Listing 5.

Note that both the CrossRef API and ResourceSynccan support the Manifest requirement. However, regard-ing Manifest, the CrossRef API has drawbacks whencompared to ResourceSync in that it involves recurrentsearches against the entire database and returns bothnewly registered and changed DOIs without discrimi-nation. In addition, the use of the same technology -ResourceSync - by CrossRef, publishers, institutionalrepositories, etc. would yield a considerable increase ininteroperability for web-based scholarly communication.

5.3.2 Recommendations - Publishers & PortalsIn order to support the Manifest, Completeness, and Bib-liography requirements, publishers are invited to con-sider these recommendations:

• Recommendation 4 - Regarding Manifest: Imple-ment ResourceSync Change Lists and/or ChangeNotifications to provide timely information regard-ing new objects. See Listing 4.

• Recommendation 5 - Regarding Manifest: ExpressResourceSync Change Lists and/or Change Notifi-cations for new objects in terms of the URI of the

Entry Page (e.g., the landing page, the HTML arti-cle), which is modeled as a collection. See Listing4.

• Recommendation 6 - Regarding Completeness: InResourceSync Change Lists and/or Change Notifi-cations, provide item links and associated attributes(as per Table 1) that point from the Entry Page (e.g.,the landing page, the HTML article) to each of thePublication Resources. See Listing 4.

• Recommendation 7 - Regarding Completeness: Forthe Entry Page (e.g., the landing page, the HTMLarticle), implement Signposting by providing itemlinks and associated attributes (as per Table 1) thatpoint at each of the Publication Resources. See List-ing 6

• Recommendation 8 - Regarding Completeness: Foreach Publication Resource, implement Signpostingby providing a collection link and associated at-tributes (as per Table 1) pointing at the Entry Page.See Listings 7.

• Recommendation 9 - Regarding Bibliography: InResourceSync Change Lists and/or Change Notifi-cations, provide describedby links and associatedattributes (as per Table 1) that point at bibliographicmetadata describing the object, including the meta-data available from the CrossRef API. See Listing4.

• Recommendation 10 - Regarding Bibliography: InResourceSync Change Lists and/or Change Notifi-cations, include the persistent-id link ponting to theIdentifying HTTP URI (e.g., DOI at dx.doi.org), ifone exists. See Listing 4.

• Recommendation 11 - Regarding Bibliography: Forthe Entry Page (e.g., the landing page, the HTMLarticle), implement Signposting by providing de-scribedby links and associated attributes (as perTable 1) that point at bibliographic metadata de-scribing the object, including the metadata avail-able from the CrossRef API. See Listing 6. Frompublisher metadata resources, link back to the EntryPage using the describes relation type.

• Recommendation 12 - Regarding Bibliography: Forthe Entry Page and all Publication Resources, im-plement Signposting by providing a persistent-idlink ponting to the Identifying HTTP URI (e.g., DOIat dx.doi.org), if one exists. See Listings 6 and 7.

6 ConclusionIngest is a major contributor to the cost of preserving e-journals. Cost is the major reason that archives contain

21

less than half the content they should [17]. This paper hasshown that the use of existing infrastructure (CrossRefAPI) combined with the introduction of new infrastruc-ture (ResourceSync Change Lists and/or Change Notifi-cations, Signposting) promises significant cost savings,and the potential for improvement in the quality of boththe completeness of archived content, and of the meta-data (including identity) describing it. The implementa-tion of new infrastructure certainly comes at a cost. But,software that implements ResourceSync Change Notifi-cations is available5 and individual notifications can eas-ily be grouped into Change Lists. Also, while the Linkheaders used for Signposting may come across as intim-idating, in many cases, they can be automatically gen-erated using web server rewrite rules because publisherplatforms typically have a consistent approach regardingURI patterns for web resources that are part of an object.In other cases, typed links can be implemented using in-ternal API calls.

The combination of ResourceSync implemented byCrossRef and Signposting implemented by publisherswould go a long way to address crucial problems facedby current archiving efforts, when DOI-identified objectsare concerned. For publishers that assign DOIs, im-plementing ResourceSync would also be beneficial be-cause it would help address the asynchronous registra-tion of a DOI and the publication of the DOI-identifiedobject, and would allow monitoring whether newly pub-lished articles are also made available for archiving byfile transfer. For publishers that don’t assign DOIs, andhence can’t rely on Change feeds from CrossRef to ad-dress the Manifest requirement, implementation of Re-sourceSync is essential to allow (archiving) applicationsto remain up to date with evolving content. Generally,implementation of ResourceSync by publishers, reposi-tories, etc. would significantly augment interoperabilityfor web-based scholarly communication.

Because the ResourceSync and Signposting guidelinesapply to the URIs at dx.doi.org, and to the targets theyredirect to, CrossRef could also build a tool that au-dits publishers’ compliance. The carrot could be, say,lower dues for conformance, and the stick could be, say,naming-and-shaming non-compliant publishers. Most e-journals are now published by one of a limited number ofpublishing platforms, which would have to make the nec-essary changes. They would be driven by demand fromtheir customer publishers.

The proposed infrastructure would also benefit otherapplications. For example, although DOIs have beenwidely adopted, research has shown that their full po-tential benefits are not being achieved. The fundamentalbenefit DOIs should provide is as a persistent identifier,

5https://github.com/resync/resourcesync_push

but:

we found a significant number of references topapers by their locating URI instead of theiridentifying URI. For these links, the persis-tence intended by the DOI persistent identifierinfrastructure was not achieved. [21]

Locating URIs are used for many reasons, but a key oneis that they, and not the DOI URI, are available to becut-and-pasted in the usual way from the browser’s ad-dress bar. The persistent-id link provided by Signpost-ing and ResourceSync would supply the information de-stroyed by the use of locating URIs rather than DOIURIs, if publishers could be persuaded to include it inthe HTTP response header of web resources that are partof a DOI-identified object. Tools like citation managers,browsers, search engines could evolve to use this link torecord the Identifying HTTP URI instead of the locationURI of the web resource. Adding this link would alsoallow connecting open annotations6 made to an HTMLor PDF version of a paper back to the DOI. More gener-ally, it would uniformly provide the HTTP version of theDOI, in contrast to current practice in citations, landingpages, and bibliographic records that still frequently usea string rather than URI version of the DOI, and, manytimes even use a publisher HTTP URI instead of the DOIHTTP URI. Hence, we believe it would be in CrossRef’sinterest to encourage their members to adopt Signpostingand ResourceSync, as described above.

The low-barrier Signposting approach as such hasenormous potential for web-based scholarly communi-cation. For example, the describedby links potentiallyallow the creation of a citation graph by crawling, as-suming article reference lists are openly accessible witheach reference linking to an HTTP URIs that, via Sign-posting, leads to structured bibliographic metadata aboutthe referenced article. If, additionally, author links wouldbe provided, ideally pointing to the ORCID at orcid.org,co-author graphs could be created by crawling. As weexplore Signposting beyond the pattern described in thispaper, information will be made available at http://signposting.org. What are you waiting for?

AcknowledgmentsGrateful thanks are due to Fernando García-Loygorri,Daniel Vargas, Craig Van Dyck, Victoria Reich, SimeonWarner, Geoff Bilder, Harihar Shankar, and ShawnJones.

References[1] Michelle Brook, Peter Murray-Rust, and Charles

Oppenheim. The Social, Political and Legal As-

6https://hypothes.is/annotating-all-knowledge/

22

https://github.com/resync/resourcesync_push

http://signposting.org

http://signposting.org

https://hypothes.is/annotating-all-knowledge/

https://hypothes.is/annotating-all-knowledge/

pects of Text and Data Mining (TDM). D-Lib Mag-azine, 20(11/12), November 2014. http://dx.doi.org/10.1045/november14-brook.

[2] CLOCKSS. CLOCKSS: A Trusted Community-Governed Archive. http://www.clockss.org/.

[3] CrossRef. CrossRef REST API. https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md.

[4] CrossRef. crossref.org. http://www.crossref.org/.

[5] Google Inc., Yahoo Inc., and Microsoft Corpora-tion. Sitemap protocol. http://sitemaps.org.

[6] IANA. Link Relations. https://www.iana.org/assignments/link-relations/link-relations.xml.

[7] Martin Klein, Herbert Van de Sompel, RobertSanderson, Harihar Shankar, Lyudmila Balakireva,Ke Zhou, and Richard Tobin. Scholarly Con-text Not Found: One in Five Articles Suf-fers from Reference Rot. PLoS One, Decem-ber 2014. http://dx.doi.org/10.1371/journal.pone.0115253.

[8] Carl Lagoze, Herbert Van de Sompel, MichaelNelson, and Simeon Warner. The OpenArchives Initiative Protocol for Metadata Har-vesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2002.

[9] LOCKSS. LOCKSS: Metadata Database. http://documents.clockss.org/index.php/LOCKSS:_Metadata_Database, July 2014.

[10] LOCKSS. LOCKSS: Metadata Extractor.http://documents.clockss.org/index.php/LOCKSS:_Extracting_Bibliographic_Metadata, July 2014.

[11] NISO and The Open Archives Initiative. Z39.99-2014, ResourceSync Framework Specification.http://www.openarchives.org/rs/toc, 2014.

[12] Mark Nottingham. RFC 5988: Web Link-ing. http://tools.ietf.org/html/rfc5988, October 2010.

[13] Mark Nottingham. Link Hints Internet Draft.https://tools.ietf.org/html/draft-nottingham-link-hint-00,June 2013.

[14] Preservation CKG. Reliability of Por-tico for preservation of UC’s e-journals.https://wiki.library.ucsf.edu/download/attachments/328601907/portico_report_2014_11_03.docx,November 2014.

[15] LOCKSS Program. Lots Of Copies Keep StuffSafe. http://lockss.org/.

[16] David S. H. Rosenthal. Talk on LOCKSS Meta-data Extraction at IIPC 2013. http://blog.dshr.org/2013/04/talk-on-lockss-metadata-extraction-at.html, April2013.

[17] David S. H. Rosenthal. The Half-Empty Archive.http://blog.dshr.org/2014/03/the-half-empty-archive.html, March 2014.

[18] David S. H. Rosenthal. TRAC Certification ofthe CLOCKSS Archive. http://blog.dshr.org/2014/07/trac-certification-of-clockss-archive.html, July 2014.

[19] ITHAKA S+R. PORTICO: Your content. Pre-served here. http://www.portico.org/digital-preservation/.

[20] Andrew Tridgell and Paul Mackerras. The rsyncalgorithm. Technical Report TR-CS-96-05, Aus-tralian National University, 1996.

[21] Herbert Van de Sompel, Martin Klein, andShawn M Jones. Persistent URIs Must Be Used ToBe Persistent. In 25th international world wide webconference, April 2016. http://arxiv.org/abs/1602.09102.

[22] Herbert Van de Sompel and Michael Nelson.Reminiscing About 15 Years of InteroperabilityEfforts. D-Lib Magazine, 21(11/12), Novem-ber 2015. http://dx.doi.org/10.1045/november2015-vandesompel.

[23] Mark Ware and Michael Mabe. The STMReport: An overview of scientific and schol-arly journal publishing. Technical Report 4thEdition, International Association of Scientific,Technical and Medical Publishers, March 2015.http://www.stm-assoc.org/2015_02_20_STM_Report_2015.pdf.

23

http://dx.doi.org/10.1045/november14-brook

http://dx.doi.org/10.1045/november14-brook

http://www.clockss.org/

http://www.clockss.org/

https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md



http://www.crossref.org/

http://www.crossref.org/

http://sitemaps.org

http://sitemaps.org

https://www.iana.org/assignments/link-relations/link-relations.xml





http://www.openarchives.org/OAI/openarchivesprotocol.html

http://www.openarchives.org/OAI/openarchivesprotocol.html

http://documents.clockss.org/index.php/LOCKSS:_Metadata_Database



http://documents.clockss.org/index.php/LOCKSS:_Extracting_Bibliographic_Metadata



http://www.openarchives.org/rs/toc

http://www.openarchives.org/rs/toc

http://tools.ietf.org/html/rfc5988

http://tools.ietf.org/html/rfc5988

https://tools.ietf.org/html/draft-nottingham-link-hint-00

https://tools.ietf.org/html/draft-nottingham-link-hint-00

https://wiki.library.ucsf.edu/download/attachments/328601907/portico_report_2014_11_03.docx



http://lockss.org/

http://blog.dshr.org/2013/04/talk-on-lockss-metadata-extraction-at.html



http://blog.dshr.org/2014/03/the-half-empty-archive.html

http://blog.dshr.org/2014/03/the-half-empty-archive.html

http://blog.dshr.org/2014/07/trac-certification-of-clockss-archive.html



http://www.portico.org/digital-preservation/

http://www.portico.org/digital-preservation/

http://arxiv.org/abs/1602.09102

http://arxiv.org/abs/1602.09102

http://dx.doi.org/10.1045/november2015-vandesompel

http://dx.doi.org/10.1045/november2015-vandesompel

http://www.stm-assoc.org/2015_02_20_STM_Report_2015.pdf

http://www.stm-assoc.org/2015_02_20_STM_Report_2015.pdf

arXiv:1605.06154v1 [cs.DL] 19 May 2016 · A study [23] estimated that at least 1.8M articles were...

Documents

Transcript of arXiv:1605.06154v1 [cs.DL] 19 May 2016 · A study [23] estimated that at least 1.8M articles were...