OAI-Publishers in Repository Infrastructures

Michele Artini, Federico Biagini, Paolo Manghi, Marko Mikulicic

Istituto di Scienza e Tecnologie dell’Informazione “Alessandro Faedo” - CNR, Via G. Moruzzi, 1 - 56124 PISA - Italy

Email: {michele.artini, federico.biagini, paolo.manghi, marko.mikulicic}@isti.cnr.it

Abstract: OAI-Publishers are software modules implementing the OAI-PMH protocol interface, enabling harvesting of metadata records describing the digital resources of a Digital Repository. Typically, such modules are manually configured and implemented to export records contained in a local store according to the metadata formats supported by the relative Repository. Due to their static nature, they are not suited to the dynamic context of Repository Infrastructures, where a virtual store, made of a dynamic pool of distributed physical stores, may host records of unpredictable metadata formats. In this paper, we describe the internal architecture of the OAI-Publisher Service built for the DRIVER Repository Infrastructure. The Service is capable of self-configuring its internals so as to adapt to the variable content of the virtual store. Finally, we show how immersion into a rich “living” infrastructure enables its extension with extra functionalities, novel to traditional OAI-Publisher technology.

I. INTRODUCTION

In the Digital Library (DL) context, the term Repository generally refers to “containers of digital objects”, namely technologies for maintaining a store of digital resources; Fedora [1] and DSpace [2] are examples of well-known Repository technologies. OAI-PMH compliant Repositories (OAI-Repositories) are Repositories endowed with an OAI-Publisher component, which is a software component implementing the OAI-PMH protocol standard interface [3].

By adopting this standard, Repositories enable access to their digital resources as if they were OAI-Items, i.e. entities adhering to the OAI-Item Model [4] (see Figure 1). According to the model, an OAI-Item represents the existence of a real-world resource, be it digital or physical, and exposes its description, given in terms of a number of metadata records. Such records conform to specific metadata formats, one of which must be Dublin Core [5].

Fig. 1. OAI-PMH Model

OAI-Publishers enable bulk extraction, namely harvesting, of OAI-Items w.r.t. one of the metadata formats exposed by the Repository metadata store. Typically, OAI-Publishers are built around such a store, not expecting changes in the exported metadata formats, and expecting only rare changes in the OAI-Sets to be exported. As such, the technology required for their implementation is quite ad hoc, entailing only rare human maintenance interventions.

The DRIVER Repository Infrastructure is a dynamic distributed environment, where several Service-oriented DL applications can coexist. An application is here intended as a dynamic pool of Services orchestrated by the application logic in order to deliver the functionality expected by the application user community. Service instances can join or leave the infrastructure at any time (register or unregister, in a peer-to-peer fashion) and be part of one or more applications. In such a dynamic environment, applications can share content (e.g. metadata records), which they either harvested from OAI-Repositories or natively created in the infrastructure and placed in a highly distributed storage space, supported by special Store Services; the “union” of their distributed running instances forms a virtual store, namely the Information Space, populated and accessible by all DL applications in the infrastructure.

Currently, the largest DL application running over the DRIVER Infrastructure¹ is called DRIVER European Information Space. The application includes a set of Aggregator Service instances, capable of harvesting records from valid OAI-Publishers of European Digital Repositories. The harvesting activity collects all records from OAI-Repositories in Europe, regardless of the kind of metadata records they yield and the size of their content. When a Repository is willing to join the Information Space, it registers to the infrastructure by providing the information required for its harvesting, i.e. geographical location, URL, metadata formats, and OAI-Sets. The application logic orchestrates (i) the Aggregator Service instances in order to harvest all metadata formats available from the Repository; and (ii) the Store Service instances in order to allocate the storage space required to host the incoming records.

The DRIVER OAI-Publisher Service must implement an OAI-PMH interface over the virtual store, i.e. the DRIVER Information Space, in order to provide the available records to whoever is willing to access them. Accordingly, it has to be devised to automatically adapt to the potential changes determined by the harvesting of new Repositories, so as to always provide OAI-PMH responses coherent with the actual content of the Information Space. In this paper we present how this can be done in the dynamic environment of the DRIVER infrastructure and show how powerful variations of traditional OAI-Publisher technology can be added to this implementation.

¹ Largest in terms of resources involved, currently more than 350,000 records.

A. Outline

In Section II we first describe the notion of Digital Library applications built on top of the DRIVER Repository Infrastructure, while in Section III we describe how the virtual DRIVER Information Space is maintained by the infrastructure. In Section IV we explain how self-configurable and powerful OAI-Publisher Services can be devised to operate over such an Information Space, and in Section V we report on the reference implementation. Finally, in Section VI we summarize the work done so far in this direction and outline possible future avenues.

II. DRIVER REPOSITORY INFRASTRUCTURE

Nowadays some DL user communities have changed their requirements, by posing no constraints on time, space, and functionalities. DL applications range from temporary work spaces to elaborate collaborative environments (e.g. DLs endowed with peer-review mechanisms) and very large distributed-federated object stores, whose functionality may change very frequently over time. In order to face these new requirements we envisage the shift from DL applications based on standalone Repository technology to a loosely coupled Service-oriented architecture, which offers greater reusability and helps to reduce development and maintenance costs.

The aim of Repository Infrastructures is to enable the construction and maintenance of DL applications with higher orders of sustainability. Infrastructures exploit the Service-oriented architecture in order to provide distribution and sharing of content and functionality, i.e. Services offering efficient content storage/access and business logic. Organizations building their DL applications on top of an Infrastructure can profit from reusing the Services offered by other Organizations and add (and therefore themselves share) their own Services when those available are not enough for their needs.

A real running Repository Infrastructure has been implemented in the DRIVER project [6]. The project has been funded by the European Commission with the goal of contributing to the implementation of the future knowledge infrastructure of the European Research Area. The project concluded in November 2007 and focused on (i) the production and maintenance of a Repository Infrastructure and (ii) the construction of the first DL application on top of it. The application, named European Information Space, aims at harvesting Open Access content from all European OAI-Repositories into the Repository Infrastructure so as to make it accessible as a whole to academics and students worldwide through advanced user functionalities. Other DL applications have been subsequently built, exploiting the Repository Infrastructure populated by the European Information Space application.

The DRIVER Infrastructure consists of three layers of running Service instances: Enabling, Data, and Application Layers.

Fig. 2. Driver infrastructure

A. Enabling Layer

The layer includes instances of Information Service, Manager Service, and Authentication and Authorization Service. These accomplish core tasks typical of “volatile” infrastructures, such as registering, unregistering, authenticating, and authorizing Service instances residing at different nodes on the Internet. Differently from similar SOA-oriented architectures, an application is not here intended as a specific set of dedicated and interconnected Service instances. Apart from User Interfaces, applications can use any of the Service instances which are available at a certain moment in time and whose functionality is required to accomplish a certain operation. To this aim, the layer provides resource discovery and orchestration logic mechanisms. The former enable discovery of Service instances satisfying certain functional conditions; searches are based on the Service profile information provided at Service instance registration time. The latter perform application activities by applying discovery techniques. For example, a user running a query entails a number of orchestration logic activities. The query can be sent by the UI to any of the Search Service instances available (part of the Data Layer). To this aim, the logic discovers the “best” instance available (e.g. the least loaded, the IP-closest) and sends the query to it. The same holds for the Search Service instance, which needs to discover the Index Service instances capable and available to run the query, decide which to use (query optimization), and then send them the query. In such a scenario, for example, Index Service instances may be added or removed at any time, without the need to reconfigure the available Search Services.

B. Data Layer

The layer builds on top of the Enabling Layer and includes instances of Store Service, Index Service, Search Service, Collection Service, Aggregator Service, and OAI-Publisher Service. As such, the Data Layer can be considered an application with a specific orchestration logic.

Fig. 3. Index Factory Services

Store Service and Index Service instances, also called factory services, can return new empty stores and indexes to the consuming applications. Such entities are called Data Structures and are registered to the infrastructure with a descriptive profile. In practice, there is a distinction between a Store Service and the Store Data Structures (DSs) it manages: consuming Services may ask a Store Service instance to generate a new Store DS for their use, as well as to put records into or retrieve records from the Store DS. Similarly, Index Service instances can be asked to create Index DSs, feed them with metadata records, or query them to search for records. The same Index and Store Service instances can therefore serve different DL applications, and the DSs created in the context of one application can be reused in the context of another, i.e. Store DSs and Index DSs are shared. This is the case for Search Service instances, which resolve queries by discovering the Index DSs capable of answering the query in the shortest amount of time. After identifying the candidate Index DSs, they need to interact with the responsible Index Services in order to send the query, merge the results, and send the answer to the consumer, e.g. the User Interface.

Aggregator Service instances can harvest records from OAI-Repositories, clean and enrich the records according to special rewriting rules (to be provided by an administrator), and store them into a given Store DS. Aggregator Services can also map records of an input metadata format into another, possibly proprietary, metadata format, thus generating records native to the specific DL application involved.

The Data Layer orchestration logic is in charge of keeping the Information Space consistent and robust. It therefore reacts to operations such as Repository registering and unregistering to and from the infrastructure. For example, the registration of a Repository tells the infrastructure what metadata formats are to be harvested from the Repository. The logic assigns the Repository to a discovered Aggregator Service instance (currently it chooses the one in the same geographical region as the Repository), together with information on which Store DSs should be used to save the harvested records. If needed, the logic also creates the Store DSs required to host the incoming records: it first discovers the Store Service instances available to do so, creates the Store DSs, then informs, i.e. dynamically configures, the Aggregator Service involved on where to divert the harvested records. Furthermore, any new record should be inserted into at least one Index DS, in order to be reachable through searches. The logic ensures this by sending the incoming records to the appropriate Index DS and creating new Index DSs when necessary. Finally, the logic ensures robustness by managing replication of Store and Index DSs, and space optimization by removing obsolete DSs from the infrastructure.
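The registration step described above can be sketched as follows. This is only a minimal illustration, not the DRIVER implementation: the function name `register_repository`, the dictionary-based profiles, the generated Store DS names, and the fallback to the first available aggregator when none shares the Repository's region are all assumptions made for the example.

```python
def register_repository(repo, aggregators, store_ds_by_format):
    """Pick an Aggregator Service and one Store DS per metadata format (hypothetical)."""
    if not aggregators:
        raise ValueError("no Aggregator Service instance available")
    # Prefer an aggregator in the same geographical region as the Repository;
    # falling back to any aggregator is an assumption of this sketch.
    same_region = [a for a in aggregators if a["region"] == repo["region"]]
    aggregator = (same_region or aggregators)[0]
    targets = {}
    for fmt in repo["formats"]:
        # Create a Store DS for the format only if none exists yet.
        if fmt not in store_ds_by_format:
            store_ds_by_format[fmt] = f"store-ds-{len(store_ds_by_format) + 1}"
        targets[fmt] = store_ds_by_format[fmt]
    # The returned plan is what would be used to configure the aggregator.
    return {"aggregator": aggregator["name"], "targets": targets}


aggregators = [{"name": "agg-de", "region": "DE"}, {"name": "agg-it", "region": "IT"}]
existing_stores = {"oai_dc": "store-ds-1"}
plan = register_repository(
    {"name": "repo-pisa", "region": "IT", "formats": ["oai_dc", "dmf"]},
    aggregators,
    existing_stores,
)
```

In this toy run the Repository is assigned to the aggregator of its own region, the existing Store DS is reused for `oai_dc`, and a new one is created for `dmf`.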

C. Application Layer

The largest application running in the layer is the European Information Space application. It builds on top of the Data Layer and Enabling Layer by adding new Service typologies, among which are User Profiling Services, Community Services, Recommendation Services, and User Interfaces. Its administrators operate a number of Aggregator Service instances to aggregate Open Access content of Repositories from countries including the Netherlands, Germany, France, Belgium, Italy, Ireland, Poland, Spain, and the UK. Currently, the application has registered and harvested from 64 OAI-Repositories, reaching the threshold of about 300,000 Open Access OAI-Items from all over Europe.

III. DRIVER INFORMATION SPACE

In principle, one Index DS and one Store DS must contain records of the same metadata format, but not necessarily originating from the same Repository. Such a choice depends on the orchestration logic, which can apply different storage and indexing policies exploiting the factory Service instances at hand. In particular, each Store and Index DS specifies in its registration profile the kind of metadata format it supports. More specifically, the infrastructure envisages special system Data Structures, called MDFormat DSs, created whenever a new metadata format is introduced into the system; this event can be implicit, when harvesting from a Repository records matching a format that was never harvested before, or explicit, when an Aggregator Service is used to generate records of a DL application-specific metadata format. In both cases, a new MDFormat DS is created and its profile registered to the system, specifying the name of the format, the relative XML schema, and other relevant information not detailed here.

The records harvested from the OAI-Repositories become part of a virtual Information Space (i.e. the union of all Store DSs), which virtually hosts so-called DRIVER Objects. One DRIVER Object groups all records relative to one OAI-Item, plus extra records possibly needed by the DL application responsible for harvesting. For example, in the case of the European Information Space application, DRIVER Objects contain the original metadata records, plus a further record in DRIVER Metadata Format (DMF). DMF records are intended as metadata descriptions of the OAI-Item as a whole, and thus include Dublin Core fields (resource description), fields containing keywords extracted from the full text (resource synthesis), provenance-related fields (resource origin, e.g. Repository name, institution, geographical location, etc.), and others. DMF records are uniform in content, meaning that the values of their fields are derived from the records of the relative OAI-Item but chosen from well-defined DMF field domains.

As specified above, harvested records are kept in Store DSs dedicated to a given metadata format. Accordingly, records of the same OAI-Item, virtually belonging to the same DRIVER Object, are contained in different Store DSs. In DRIVER, records are uniquely identified in the Information Space by a triple: OAI-Item identifier (unique by definition in the context of the originating Repository), Repository identifier (unique and created by the infrastructure at Repository registration time), and Store DS identifier (unique and created by the infrastructure at DS registration time). Accordingly, DRIVER Objects are virtually identified by the pair of Repository identifier and OAI-Item identifier.

Access Services enable random access to one metadata record, given its record identifier (OAIid, RepId, StoreId). The Service discovers the location of the Store Service in charge of the Store DS identified by StoreId and sends it a request for retrieving the given record.
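The identifier scheme above can be illustrated with a small sketch. The class name `RecordId`, its field names, and the helper `group_by_object` are hypothetical and chosen only for this example; the sketch shows how records of one OAI-Item kept in different Store DSs share the same DRIVER Object identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordId:
    """Record identifier triple (OAIid, RepId, StoreId); names are illustrative."""
    oai_id: str    # unique within the originating Repository
    rep_id: str    # assigned at Repository registration time
    store_id: str  # assigned at Store DS registration time

    @property
    def object_id(self):
        # A DRIVER Object is identified by (Repository identifier, OAI-Item identifier).
        return (self.rep_id, self.oai_id)


def group_by_object(record_ids):
    """Group record identifiers by the DRIVER Object they virtually belong to."""
    objects = {}
    for rid in record_ids:
        objects.setdefault(rid.object_id, []).append(rid)
    return objects


ids = [
    RecordId("oai:x:1", "rep-7", "store-dc"),
    RecordId("oai:x:1", "rep-7", "store-dmf"),
    RecordId("oai:x:2", "rep-7", "store-dc"),
]
objects = group_by_object(ids)
```

Here the first two records live in different Store DSs but fall under the same DRIVER Object, while the third belongs to a second object.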

IV. DRIVER OAI-PUBLISHER SERVICE

The DRIVER OAI-Publisher Service is designed to support an OAI-PMH interface over the dynamic pool of Services and DSs available to the infrastructure. The Service exploits the information available in the Service and DS profiles of the infrastructure to self-configure its internals and accomplish its tasks with no need for human administration and maintenance. In particular, DRIVER OAI-Items are uniquely identified by the DRIVER Object identifier and contain all metadata records available to the DRIVER Object; OAI-Sets are intended as the Repositories harvested so far. The OAI-PMH methods are implemented as follows:

Fig. 4. OAI Publisher

• Identify: the method returns the description of the “DRIVER Information Space Virtual Repository”;

• ListMetadataFormats: the method discovers all MDFormat DSs registered to the infrastructure and constructs the appropriate response;

• GetRecord: the method expects an OAI-Item identifier and returns the relative record; to this aim, the Service discovers the first available Access Service and sends it a random access request; it then returns the record to the caller;

• ListSets: the method answers with the list of Repositories harvested so far;

• ListRecords: the method expects a metadata format and, optionally, a from-to condition on dates and/or an OAI-Set; it returns a list of records with the relative resumption token, according to the standard specification. Two different implementations of the method are possible. The first exploits the Data Layer as it is, by discovering all Store DSs containing records of the given format and, when an OAI-Set is given, originating from the corresponding Repository; given the Store DSs, the method sends the relative Store Services a request for retrieving the records, merges the results, and delivers them to the caller. The second exploits instead special Index DSs, created on purpose to efficiently answer such calls; the Data Layer orchestration logic maintains one Index DS for each format in the system, enabling record searches by record identifier, by date range, and by Repository of origin. In this scenario, the OAI-Publisher Service searches for the appropriate Index DS and efficiently answers the call; note that the method GetRecord could exploit such an Index DS too.

• ListIdentifiers: the same as ListRecords, but returns only the record identifiers.
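The first ListRecords strategy (select Store DSs by format, filter by OAI-Set and date range, merge the results) can be sketched over in-memory data. The `StoreDS` class, the record dictionaries, and the function names are assumptions made for illustration; discovery through the Enabling Layer, remote Store Service calls, and resumption tokens are deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class StoreDS:
    """Toy stand-in for a Store DS profile plus its content (hypothetical)."""
    store_id: str
    md_format: str
    records: list = field(default_factory=list)  # dicts: rep_id, oai_id, datestamp


def list_records(stores, md_format, set_spec=None, from_=None, until=None):
    """Merge records of one format from all matching Store DSs."""
    selected = []
    for ds in stores:
        if ds.md_format != md_format:
            continue  # each Store DS holds a single metadata format
        for rec in ds.records:
            if set_spec is not None and rec["rep_id"] != set_spec:
                continue  # OAI-Sets are the harvested Repositories
            if from_ is not None and rec["datestamp"] < from_:
                continue
            if until is not None and rec["datestamp"] > until:
                continue
            selected.append(rec)
    # A stable order is needed before results can be paged with resumption tokens.
    return sorted(selected, key=lambda r: (r["datestamp"], r["rep_id"], r["oai_id"]))


def list_identifiers(stores, md_format, **conditions):
    """Same selection as list_records, returning only the identifiers."""
    return [(r["rep_id"], r["oai_id"]) for r in list_records(stores, md_format, **conditions)]


stores = [
    StoreDS("s1", "oai_dc", [
        {"rep_id": "rep-1", "oai_id": "a", "datestamp": date(2008, 1, 10)},
        {"rep_id": "rep-2", "oai_id": "b", "datestamp": date(2008, 2, 1)},
    ]),
    StoreDS("s2", "oai_dc", [
        {"rep_id": "rep-1", "oai_id": "c", "datestamp": date(2008, 3, 5)},
    ]),
    StoreDS("s3", "dmf", [
        {"rep_id": "rep-1", "oai_id": "a", "datestamp": date(2008, 1, 10)},
    ]),
]
by_set = list_identifiers(stores, "oai_dc", set_spec="rep-1")
recent = list_identifiers(stores, "oai_dc", from_=date(2008, 2, 1))
```

The Index DS based strategy would replace the nested loops with a single query by format, date range, and Repository of origin.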

A. OAI-Publisher extensions

DRIVER aims at experimenting with two extensions of the OAI-Publisher Service. User and application consumers will be able to configure the OAI-Publisher Service from the outside, in order to define special OAI-Sets or enable record harvesting according to different metadata formats. More specifically, the extensions regard:

1) Dynamic format configuration: the Service accepts format-to-format mappings (rewriting rules), say from format A onto format B. This operation adds to the MDFormat DS of B the dependency on A and registers a new MDFormat DS for B in the Enabling Layer, if B does not exist. The method ListRecords, when called for records of format B, will return all records in Store DSs of format B, plus all records from Store DSs of format A, after applying the mapping. This feature can therefore be used to permit consumers to harvest formats not present in the Information Space or to enlarge the number of records matching an existing format. It will use the same techniques used by the Aggregator Services for cleaning and enriching the harvested metadata records. The rewriting rules are composed using a transformation language specifically tailored for metadata records. The OAI-Publisher will exploit this tool in order to provide transparent mapping between formats, when possible.

2) Dynamic OAI-Set configuration: the Service accepts user-defined Collections [7], i.e. named and fielded queries over a given format, and builds and maintains the corresponding OAI-Sets; harvesting w.r.t. such OAI-Sets restricts the attention to the records matching the Collection query at the time of harvesting; the OAI-PMH ListSets call to this Service returns the names of the Repositories, plus the names of the Collections submitted so far.
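The first extension can be illustrated with a short sketch in which a registered mapping from format A to format B is applied on the fly when records of format B are requested. Plain Python functions stand in for the rewriting rules, since the transformation language is not detailed here; the names `dc_to_dmf`, `list_records`, and the record fields are hypothetical.

```python
def dc_to_dmf(record):
    """Hypothetical rewriting rule from an oai_dc-like record to a dmf-like one."""
    return {"id": record["id"], "title": record["title"].strip(), "format": "dmf"}


def list_records(stores_by_format, mappings, target_format):
    """Return native records of target_format plus mapped records of source formats."""
    result = list(stores_by_format.get(target_format, []))
    for (source, target), rule in mappings.items():
        if target == target_format:
            # Records of the source format are rewritten at request time.
            result.extend(rule(rec) for rec in stores_by_format.get(source, []))
    return result


stores_by_format = {
    "dmf": [{"id": "1", "title": "Native", "format": "dmf"}],
    "oai_dc": [{"id": "2", "title": " Mapped ", "format": "oai_dc"}],
}
mappings = {("oai_dc", "dmf"): dc_to_dmf}
dmf_records = list_records(stores_by_format, mappings, "dmf")
```

Requesting `dmf` yields the native record plus the rewritten `oai_dc` record, whereas requesting `oai_dc` is unaffected by the mapping.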

B. Incremental harvesting

The OAI-PMH protocol specifies a mechanism for continuous synchronization called “incremental harvesting”. Consumers of OAI-Items (harvesters) are able to retrieve only recent additions to the source repository by specifying a date range. Modifications of already existing metadata records are gracefully propagated, as each modification updates the date stamp associated with the OAI-Item while the OAI-Item identifier is preserved. However, deleted record handling proves to be a problematic area.

According to the OAI-PMH standard, OAI-Publishers may implement one of three levels of support for deleted records, announcing it in the OAI-PMH Identify response:

• no: the repository does not maintain information about deletions. A repository that indicates this level of support must not reveal a deleted status in any response.

• persistent: the repository maintains information about deletions with no time limit. A repository that indicates this level of support must persistently keep track of the full history of deletions and consistently reveal the status of a deleted record over time.

• transient: the repository does not guarantee that a list of deletions is maintained persistently or consistently. A repository that indicates this level of support may reveal a deleted status for records.

If the source OAI-Publisher supports deleted record handling, the records provided by the OAI-PMH ListRecords method may be tagged with a deleted status label, informing the OAI harvester that it should remove the record from its store.
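A harvester-side sketch of incremental harvesting with deleted records follows. The request arguments `verb`, `metadataPrefix`, `from`, and `set` are those defined by OAI-PMH; the base URL, the function names, and the simplified record dictionaries (with a boolean `deleted` flag instead of the XML header status) are assumptions made for the example.

```python
from urllib.parse import urlencode


def incremental_request(base_url, metadata_prefix, from_date, set_spec=None):
    """Build a ListRecords request restricted to records changed since from_date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "from": from_date}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def apply_harvest(local_store, harvested):
    """Update a local store (dict keyed by identifier) with harvested records."""
    for rec in harvested:
        if rec.get("deleted"):
            # A deleted status tells the harvester to drop its copy.
            local_store.pop(rec["id"], None)
        else:
            # New or modified records overwrite the previous version.
            local_store[rec["id"]] = rec
    return local_store


url = incremental_request("http://example.org/oai", "oai_dc", "2008-01-01")
store = {"a": {"id": "a", "title": "old"}, "b": {"id": "b", "title": "kept"}}
apply_harvest(store, [
    {"id": "a", "deleted": True},
    {"id": "c", "title": "new"},
])
```

After the update the store no longer holds record `a`, keeps `b` untouched, and contains the newly harvested `c`.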

If an OAI-Publisher does not support deleted record handling, already harvested records may refer to nonexistent items. Harvesters aware of that fact may decide to periodically refresh the whole repository from the source, in order to minimize the probability of stale data accumulation. The DRIVER harvester can be configured to combine full and incremental harvesting strategies, according to performance-related decisions specific to each source repository.

When the DRIVER harvester doing a full refresh delivers the records to a store, the Store Service² may compute the difference between the current content and the incoming records, in order to reconstruct the information relative to the deleted records. However, since such an operation may be impractical for large data sets, the system can be configured to skip this last-resort strategy. In this case the OAI-Publisher exposing the infrastructure’s virtual store contents needs to be able to keep track of the inability to implement the full level of deleted record handling support, and to announce the real level of compliance through the OAI-PMH Identify response of the exported view.
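The difference computation mentioned above amounts to a set difference between the identifiers currently stored and those delivered by the full refresh. The sketch below is an assumption about how this could look, including the hypothetical `max_size` parameter that models the configuration option to skip the computation for large data sets.

```python
def reconstruct_deletions(current_ids, refreshed_ids, max_size=None):
    """Return identifiers present in the store but absent from a full refresh.

    Returns None when the store exceeds max_size, meaning the reconstruction
    was skipped and deletions remain unknown (hypothetical convention).
    """
    current = set(current_ids)
    if max_size is not None and len(current) > max_size:
        return None
    return current - set(refreshed_ids)


deleted = reconstruct_deletions({"a", "b", "c"}, {"a", "c", "d"})
skipped = reconstruct_deletions({"a", "b", "c"}, {"a"}, max_size=2)
```

A `None` result is the case in which the OAI-Publisher cannot claim full deleted record support for the affected records.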

The level of deleted record support is thus highly dependent upon the dynamic nature of the virtual store and the subset of the records to be exported. Moreover, since the announced deleted record support level must be shared by all the OAI-Sets exported by the same OAI-PMH entry point, multiple DRIVER OAI-Publisher instances can be configured so that each exposes a partition of the virtual store.
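One possible way to derive the announced level and the partitioning is sketched below. The conservative rule used here (an entry point announces the weakest level among the Repositories it exports, under the ordering no < transient < persistent) is an assumption of this example, not a rule stated by the protocol or by the DRIVER design; the function names are hypothetical as well.

```python
# Ordering from weakest to strongest support; the ordering itself is an assumption.
LEVELS = ["no", "transient", "persistent"]


def announced_level(source_levels):
    """Weakest deleted record support level among the exported sources."""
    return min(source_levels, key=LEVELS.index)


def partition_by_level(repo_levels):
    """Group Repositories by level, one group per OAI-Publisher instance."""
    parts = {}
    for repo, level in repo_levels.items():
        parts.setdefault(level, []).append(repo)
    return parts


repo_levels = {"rep-1": "persistent", "rep-2": "transient", "rep-3": "persistent"}
single_entry_point = announced_level(repo_levels.values())
partitions = partition_by_level(repo_levels)
```

With a single entry point all three Repositories would be exported under the weaker level, whereas two publisher instances, one per partition, could each announce the level their sources actually support.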

A similar situation may arise with the different levels of date granularity support.

V. REFERENCE IMPLEMENTATION

Our current implementation of the OAI-Publisher Service is based on the DRIVER Testbed infrastructure accessible at http://testbed.driver.research-infrastructures.eu.³

Fig. 5. DRIVER Infrastructure repositories

The DRIVER Infrastructure contains data from 117 European repositories (see Figure 5),⁴ containing more than 350,000 items describing textual documents in over 20 languages authored by more than 200,000 authors.⁵

² Depending on the specific Store Service implementation.

³ Currently an alpha release, to be released definitively in June 2008.

⁴ An up-to-date map can be seen at http://admin1.driver.research-infrastructures.eu/IS/RepositoryMap.

⁵ The data cannot be precise because of the lack of quality control on the source.

The sizes of source repositories vary between very small ones and relatively large ones (85,000 items), with an average size of around 8,000 items.

For each repository we keep the originally harvested metadata format, enriched with provenance-related information, along with native DRIVER Metadata Format (DMF) records, obtained from the original metadata.

All the metadata is exported through several redundant OAI-Publisher instances, which expose each repository as a single OAI-Set of the whole DRIVER Information Space.

VI. CONCLUSIONS AND FUTURE ISSUES

In this work we presented OAI-Publisher technology as needed in the context of highly dynamic Repository Infrastructures. In particular, the case of the DRIVER Infrastructure OAI-Publisher Service was presented. Currently the Infrastructure offers an OAI-Publisher Service capable of self-adapting its configuration so as to return responses consistent with the dynamic and continuous status changes of the DRIVER Information Space. Future steps will extend the OAI-Publisher with dynamic format and OAI-Set configuration interfaces as described above.

The DRIVER Project has been re-financed in the 7th FP so as to extend the Information Space with Compound Objects. Compound Objects are sets of digital objects related by relationships representing the nature of their association; for example, a “Research Package Object” might consist of a set of publications, e.g. PDF or DOC files, interconnected by relationships named “refersTo”, and a set of raw data files linked to such publications through relationships named “experimentalData”. Compound Object graphs will be exposed both through the OAI-ORE protocol [8] and through the OAI-PMH protocol using an MPEG-21 DIDL payload [9].

ACKNOWLEDGMENT

This work is partially supported by the Information Society Technologies (IST) Program of the European Commission as part of the DRIVER project (Project no. IST-034047).

REFERENCES

[1] C. Lagoze, S. Payette, E. Shin, and C. Wilper, “Fedora: An Architecture for Complex Objects and their Relationships,” Journal of Digital Libraries, Special Issue on Complex Objects, 2005.

[2] R. Tansley, M. Bass, and M. Smith, “DSpace as an Open Archival Information System: Current Status and Future Directions,” in Research and Advanced Technology for Digital Libraries, 7th European Conference, ECDL 2003, Trondheim, Norway, August 17-22, 2003, Proceedings, ser. Lecture Notes in Computer Science, T. Koch and I. Sølvberg, Eds. Springer-Verlag, 2003, pp. 446–460.

[3] C. Lagoze and H. Van de Sompel, “The open archives initiative: building a low-barrier interoperability framework,” in Proceedings of the first ACM/IEEE-CS Joint Conference on Digital Libraries. ACM Press, 2001, pp. 54–62.

[4] C. Lagoze and H. Van de Sompel, “The OAI Protocol for Metadata Harvesting,” http://www.openarchives.org/OAI/openarchivesprotocol.html.

[5] “Dublin Core Metadata Initiative,” http://dublincore.org.

[6] “DRIVER Digital Repository Infrastructure Vision for European Research,” http://www-driver-repository.eu.

[7] L. Candela, D. Castelli, and P. Pagano, “A Service for Supporting Virtual Views of Large Heterogeneous Digital Libraries,” in 7th European Conference on Research and Advanced Technology for Digital Libraries, ECDL 2003, ser. Lecture Notes in Computer Science, T. Koch and I. Sølvberg, Eds. Trondheim, Norway: Springer-Verlag, August 2003, pp. 362–373.

[8] C. Lagoze and H. Van de Sompel, “Object Reuse and Exchange (ORE),” http://www.openarchives.org/ore/.

[9] H. Van de Sompel, M. L. Nelson, C. Lagoze, and S. Warner, “Resource harvesting within the OAI-PMH framework,” in D-Lib Magazine, Volume 10 Number 12, ISSN 1082-9873, December 2004. [Online]. Available: http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html
