Librella, a Self Evolving Digital Library Based on P2P … · Web viewGrid applications typically...



Enhancing Infrastructure for OAI

A Proposal to the

Andrew W. Mellon Foundation

by the

Digital Library Research Group
Old Dominion University

Kurt Maly, Michael L. Nelson, and Mohammad Zubair
Old Dominion University

Norfolk VA 23529
{maly,mln,zubair}@cs.odu.edu

And

Herbert Van de Sompel
Los Alamos National Laboratory

Los Alamos NM 87545
[email protected]

Cover letter from ODU Research Foundation: http://www.cs.odu.edu/~maly/mellon/Foundation Letter.pdf

Support letter by ODU President Roseanne Runte: http://www.cs.odu.edu/~maly/mellon/runtesupletter.pdf

IRS letter for ODU Research Foundation: http://www.cs.odu.edu/~maly/mellon/IRS Letter.pdf

Board of Trustees for ODU Research Foundation: http://www.cs.odu.edu/~maly/mellon/Board Members.doc


Enhancing Infrastructure for OAI

1 Introduction

Digital libraries (DLs) provide an infrastructure for publishing and managing content so that it can be discovered easily and effectively. Digital libraries are being viewed as a means of dissolving inequities in access to scientific information for researchers and students alike. A number of digital libraries currently exist in the commercial world as well as in the government and educational domains. However, there is no federated service that provides a unified interface to all these libraries, which we believe is necessary for faster dissemination. The biggest obstacle to building a federated service is that many digital libraries use different, non-interoperable technologies. One major effort that addresses interoperability is the Open Archives Initiative (OAI) framework, which facilitates the discovery of content stored in distributed archives. The OAI framework supports data providers (archives) and service providers. Service providers develop value-added services based on the information collected from cooperating archives. These value-added services could take the form of a federated search engine like Arc [Liu01]. A typical data provider would be a digital library with its own set of publishing tools and policies, free of any constraints on how it implements its services. In this work, we propose to enhance the two key infrastructure components of the OAI framework by building:

(a) a Digital Library Grid to support a high performance OAI federated search service
(b) an Apache module, mod_oai, to enable OAI for the general Web community.

We now describe the two parts of the project in detail, along with their motivation, benefits, and relationship to Mellon's Research in Information Technology program.

2 Background

2.1 OAI

The Open Archives Initiative (OAI) is an international effort focused on furthering the interoperability of DLs through the use of "metadata harvesting". Many previous DL interoperability projects focused on "distributed searching" as the method for federating different DLs into a single service. While feasible for small numbers of nodes (e.g., < 20), large-scale distributed searching has proven difficult in an Internet environment for large numbers of nodes (e.g., > 100).

The OAI released version 2.0 of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) in July 2002. The major contribution of the OAI-PMH is that it defines a common format for metadata exchange that is independent of the underlying database. The OAI-PMH defines six “verbs”, or URL-formatted requests, together with their XML-formatted responses. While the default metadata format in OAI-PMH is Dublin Core (www.dublincore.org), any XML-expressible metadata format is allowed (even encouraged). For example, the following OAI-PMH request yields the response in Figure 1:

http://naca.larc.nasa.gov/oai2.0/?verb=ListRecords&metadataPrefix=oai_dc


Figure 1. An XML-Formatted OAI-PMH Response
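Because an OAI-PMH request is an ordinary URL with a verb and a few arguments, the request above can be assembled programmatically. The following is a minimal sketch in Python rather than part of any OAI software; the function name build_request is our own for illustration, and the verb list is taken from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode

# The six request types ("verbs") defined by OAI-PMH 2.0.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL (hypothetical helper, for illustration)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # The verb comes first, followed by the verb-specific arguments.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


url = build_request(
    "http://naca.larc.nasa.gov/oai2.0/",
    "ListRecords",
    metadataPrefix="oai_dc",
)
print(url)
# http://naca.larc.nasa.gov/oai2.0/?verb=ListRecords&metadataPrefix=oai_dc
```

Rejecting unknown verbs on the client side mirrors the small, fixed request vocabulary of the protocol.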

The OAI-PMH retreats from the model of distributed searching, and attempts far less technical specification than previous DL interoperability projects. As a result of this decreased scope, the OAI-PMH is proving to be a more flexible and resilient basis for interoperability: a sort of "RISC" (reduced instruction set computer) model for DL interoperability. The OAI-PMH defines only a generic bulk metadata transport protocol, and leaves other features to be borrowed from other technologies or implemented as independent services.

Key to understanding the philosophy of the OAI-PMH is understanding the separation of responsibilities between the "service provider" (SP) and the "data provider" (DP). In practice, an SP and a DP can reside in the same entity, but it is important to understand the distinction. A DP is a repository, or "archive": simply a collection of metadata records (which may or may not point to corresponding full-text documents). The word “archive” in OAI-PMH is a remnant of its e-print origins; it does not carry the technical connotation of preservation commitment that archivists reserve for the word. An SP provides value-added services (e.g., searching, browsing) on the metadata extracted from one or more DPs. Figure 2 illustrates a simple SP/DP model, where users can choose between two SPs, one of which harvests from two DPs and another that harvests from three DPs. The SPs are free to define their own services, presentation, and interfaces tailored to their user base. These services could be complementary or competitive.

It is important to stress that the OAI-PMH is DL middleware: it is never directly exposed to users. The OAI-PMH defines the interaction between SPs and DPs. A DL can be both an SP and a DP (in fact, historically this is often true); however, the separation of these functions allows some organizations to focus only on being "publishers" (i.e., filling their DP) and others to focus on the development of value-added services targeted to their specific customer base. The OAI-PMH is quite simple, and its utility derives just as much from what is missing as from what is explicitly defined. Although the OAI originated in academic e-print distribution projects, it is applicable to a variety of applications, including the structured exposure of the "hidden web" or "deep web" [Ber01].

Figure 2. Relationship of Service Providers and Data Providers

2.2 Arc

Arc originates from the Universal Preprint Service (UPS) prototype [Van00], which was developed as a proof-of-concept for various DL technologies, including the feasibility of constructing a cross-archive searching service. UPS became the foundation of the Santa Fe convention that in turn led to the Open Archives Initiative. Once the OAI metadata harvesting protocol stabilized, it was possible to realize the vision of UPS in Arc, with a higher performance search capability and with its contents kept up to date through OAI-PMH based harvesting.

Arc was initially released as an experimental service to investigate issues in metadata harvesting [Liu01]. It immediately attracted interest, since at that time it was the only vehicle to demonstrate the potential and promise of OAI-PMH. Arc was also the first service to demonstrate the re-exporting of metadata records; such services are now known as “aggregators” in OAI-PMH parlance. Arc provided an example that other service providers, such as OAIster [Hag03] and the NASA Technical Report Server [Nel03], would follow. As new data providers appeared, they often asked to be added to Arc for demonstration purposes; by continuously integrating new data providers, the software was made stable and fault-tolerant. It has become a valuable tradition that Arc tries to keep track of as many OAI data providers as possible; so far Arc has harvested 6.4M metadata records from 165 data providers. Since there is no centralized registration in the OAI framework, this number is far from complete. We are continuously working to discover and include additional data providers in Arc.

Originally conceived more as a tour de force, Arc has become useful in two ways. First, it helps new data providers achieve OAI-PMH compliance by giving them feedback on implementation errors that we discover during harvesting. Second, it is becoming the ‘Google’ of the OAI world. This comes at a cost, however: performance is degrading as the collection approaches 10 million records while the service is maintained on a shoestring. In 2003, we had approximately 320K unique visitors (Figure 3), despite slowly degrading performance.


Figure 3. Arc statistics for Year 2003

2.3 Institutional Adoption of OAI-PMH

The mod_oai project is just one software project in the burgeoning OAI-PMH community. A number of open-source OAI-PMH data provider software projects exist, such as Kepler (kepler.cs.odu.edu), a lightweight, personal data provider, as well as institutional-level data providers such as Eprints (www.eprints.org), DSpace (www.dspace.org), CDSWare (cdsware.cern.ch), and Fedora (www.fedora.info). Open-source service provider projects are emerging as well, such as Arc (arc.cs.odu.edu), which has been used by, among others, Emory University, University of Pennsylvania, and the University of Santiago (Chile) to rapidly build OAI-PMH-based digital libraries. OAI-PMH has also attracted attention from software vendors and publishers; organizations with OAI-PMH resources include OCLC, Ex Libris, Ingenta, Elsevier, Institute of Physics (IOP), and the American Physical Society.

With these components in place, more ambitious projects are being considered that use OAI-PMH as an enabling technology. One example is the DARE (Digital Academic Repositories) project in The Netherlands, which is using OAI-PMH as the core technology for enhancing its international research reputation. Another example is the Digital Library Grid, described in the first part of this proposal. While the mod_oai project focuses on increasing the number of data providers by providing an extension to the popular Apache http server, the Digital Library Grid aims to increase the quality of service providers by using grid computing nodes for dramatic performance increases in harvesting and indexing. The Arc project is already experiencing bottlenecks in harvesting and indexing the existing OAI-PMH corpus; the success of the mod_oai project will necessitate the kind of performance increases the Digital Library Grid project is investigating.

2.4 The Grid

The Grid is an emerging technology for infrastructure that enables the integrated, collaborative use of high-end computers, networks, and databases owned by multiple organizations. Grid applications typically involve large amounts of computing and data that require secure resource sharing across organizational boundaries, for which the current Internet infrastructure is not adequate [GloA, GloB, Fos01]. The National Science Foundation is funding NPACI and NCSA to build a National Technology Grid. Both initiatives are deploying the Globus Toolkit across a wide range of facilities and resources, including supercomputing centers, research labs, and college and university campuses. In Europe, the main project in this area is the DataGrid project, an international project for shared cost research and technological development. This European project has six main partners (CERN, CNRS, ESRIN, INFN, NIKHEF, PPARC) and fifteen associated partners. Recently, there has been interest in using the Grid for managing large data sets by creating and storing descriptive metadata, which is used for discovery [Sin03].

The Globus Toolkit is the core software package used for building grid applications and programming tools. The latest release of the Globus Toolkit is based on the Open Grid Services Infrastructure (OGSI) specification. The OGSI defines mechanisms for creating, managing, and exchanging information among entities called Grid services. Succinctly, a Grid service is a Web service that conforms to a set of conventions (interfaces and behaviors) that define how a client interacts with a Grid service.

3 Purpose and Expected Outcomes

The combined outcomes of the mod_oai and digital library grid portions of this project will have the net effect of integrating the currently disparate digital library (e.g., OAI-PMH) and general Web (e.g., Google) communities. Despite sharing a common toolset (http, TCP/IP, etc.), there is not nearly enough interaction between the two communities. mod_oai will greatly increase the number of people that will be able to export their metadata (and resources) via OAI-PMH. Encouraging a switch from the current, resource intensive Web harvesting model to the more efficient OAI-PMH harvesting model will also greatly decrease the load on web servers, decrease the amount of repetitive traffic on the network and increase the “freshness” of harvested resources.

Once this Web/DL integration takes place, providing a unified interface to all these libraries that is as efficient as Google would be useful to a wide audience. Google does an incredible job of providing discovery services for the ‘shallow’ web to the general public; we envision a sustainable, free discovery service of similar quality for students and researchers for parts of the ‘deep’ web [Ber01]. The parts of the deep web we refer to in this vision are digital libraries and collections that expose their metadata using OAI-PMH. A high performance federated search service that exploits the resources of a Grid will make available a large amount of information that is distributed amongst heterogeneous digital libraries. A search user will be able to access a research paper, a preprint, a technical report, an image of a renowned painting, or a musical performance in a few seconds from thousands of libraries scattered all over the world.


The digital library grid audience consists, on the one hand, of the students and researchers who want fast access, through one simple search interface, to the vast information stored in digital libraries that is largely inaccessible to traditional web crawlers. On the other hand, we also provide incentives for digital libraries (and their publishing authors) to participate in our digital library grid by providing rapid dissemination of their material to a global audience beyond the specific user groups that support their collections. In a sense, we are seeking to realize the full cycle of information flow: an ultimate union catalogue, material that can be published anywhere and anytime, material that will be disseminated, and readers who provide feedback through annotation.

4 Project Rationale

If mod_oai is adopted, the installed base of OAI-PMH repositories will grow significantly, and OAI-PMH enabled web robots could be much more efficient than current web robots, which traverse an entire web site to locate new and updated web pages. Using the OAI-PMH, the new and updated pages would be immediately visible. Eliminating the unnecessary accesses to unchanged pages would result in quicker harvesting for web robots as well as significantly reduced network traffic for web sites. Exposing the deep web, which is estimated to be up to 550 times the size of the web currently visible to robots [Ber01], would greatly benefit search engines and their customers by increasing the corpus of web pages discoverable through services such as Google.

Assuming that a rapid increase (e.g., several orders of magnitude) in the adoption of OAI-PMH occurs, we then have a different problem: how to efficiently discover, harvest, and index the burgeoning OAI-PMH corpus. We propose to take advantage of the Open Archives Initiative, a number of open source software systems such as Arc, Kepler, and DP9 [Liu01, Mal01, Liu02], and the various grids such as the NSF Grid and the Access Grid to produce a sustainable system that eventually scales beyond the level of what Google covers now. Currently our research group at Old Dominion University provides a federation service, Arc, pro bono publico. Since harvesting, indexing, and searching all run on the same server, performance is close to becoming too slow, and reliability is low. We expect the number of OAI-compliant collections to grow steadily, and a model has to be found that provides good performance, is sustainable, and does not rely on the good will of one group alone. The straightforward solution of applying Google technology is beyond the reach of any university (Google advertises a solution that costs 17 cents per record indexed) and has sustainability issues.

We propose to distribute the cost of publishing to collection builders (data providers) and the cost of harvesting and indexing to existing grid nodes, leaving only the cost of maintaining the federated search service to one institution (service provider), thus making the system more sustainable. Since grid nodes by definition have unused capacity, no new hardware needs to be acquired and we can, in essence, piggyback the onus of maintaining the infrastructure on the efforts to maintain the grid. The second advantage of this approach is availability of the service. The current Arc runs on a single processor without any redundancy. In the new approach, we plan to use hardware redundancy by exploiting Grid technology.

For searching, we plan to exploit parallelism by partitioning the indices amongst a cluster of PCs. A user query will be executed in parallel across these partitions, resulting in high performance. To support parallel indexing and searching, we will extend the open source Apache Jakarta Lucene search engine.
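The partitioned search described above can be illustrated with a small sketch. It is written in Python with a toy in-memory inverted index, not with Lucene, and every name in it (build_index, search_partition, parallel_search, the sample records) is hypothetical. Each partition is searched independently and the partial result lists are merged into one ranking; in the proposed system each partition would sit on its own cluster node, whereas the threads here only stand in for that.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def build_index(records):
    """Build a toy inverted index: term -> {record id: term frequency}."""
    index = {}
    for record_id, text in records.items():
        for term in text.lower().split():
            postings = index.setdefault(term, {})
            postings[record_id] = postings.get(record_id, 0) + 1
    return index


def search_partition(index, query):
    """Score the records of one partition by summed term frequency."""
    scores = {}
    for term in query.lower().split():
        for record_id, freq in index.get(term, {}).items():
            scores[record_id] = scores.get(record_id, 0) + freq
    return list(scores.items())


def parallel_search(partitions, query, k=3):
    """Run the query on every partition concurrently and merge the top k hits."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = pool.map(lambda idx: search_partition(idx, query), partitions)
        hits = [hit for part in partial for hit in part]
    # Highest score first; ties are broken by record id for a stable order.
    return heapq.nsmallest(k, hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical metadata records, split across three index partitions.
partitions = [
    build_index({"a1": "grid computing for digital libraries"}),
    build_index({"b1": "metadata harvesting protocol",
                 "b2": "digital library grid grid"}),
    build_index({"c1": "apache web server module"}),
]
results = parallel_search(partitions, "grid library")
print(results)  # [('b2', 3), ('a1', 1)]
```

Since each partition can be scored without reference to the others, only the final merge is sequential, and it touches at most the hits returned by each partition.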


5 Proposed Work

5.1 Digital Library Grid

The key concept of our architecture (Figure 4) is the separation of low-latency and high-latency tasks in the process that culminates in a fast search service for all users, while paying attention to distributing the cost of maintaining the overall system. The two groups of users distinguished in Figure 4 are the creators of content and the readers who wish to discover the content they need. Since the fundamental assumption of our architecture is OAI-PMH compliance, we provide the Kepler tools in addition to the existing publication tools of the various OAI-compliant digital libraries. Kepler is self-installing software that provides the individual user (within a community) with an instant ‘pocket’ digital library that is OAI-compliant yet can be used anywhere, anytime, for instance on a laptop while traveling. We list DSpace as one of the more traditional digital library environments, for which we will provide a plug-in to make it OAI harvestable. The important factor is to reach as large an audience of content creators as possible. We make it simple for individuals and digital libraries to become OAI-compliant; in return, they have to provide metadata for their objects, which in turn will provide great precision in the searches users make of the entire federation.

Figure 4. Architecture for Digital Library Grid

Figure 4 labels: Types of Nodes on a Digital Library Grid (Federated Service Node, Data Provider Node, Harvester/Indexing Node); User/Search; DSpace/Kepler; Publisher/Archivelet; Harvest Metadata; Harvest Indexes; Search Cluster.

Most of the new software to be developed will make the harvesting and exposing of metadata work over the grid. In the past, OAI-PMH simply worked over http, making a direct connection between the data provider and the service provider. Now the harvesting nodes will have to harvest the data providers through the grid. As is well known, searching indices is an embarrassingly parallel problem and can easily be decomposed onto several processors. The problems we will have to solve are how to get the indices (and metadata) from the harvester nodes to the search cluster (Figure 4) and how to keep them consistent with the regularly updated indices (and metadata) on the harvester nodes. The parallel search software will not in itself break new ground, but we have to pay particular attention to the fact that search results must be presented with the lowest possible latency, independent of the breadth of the search.

As part of this project we propose to build a testbed that will use 3 grid nodes to perform the high-latency tasks of harvesting and indexing from 3 data providers. We will use the grid also to transmit these indices and metadata to a small cluster (3 nodes) of search engines each of which will be working on one or more indices it receives from the harvesting nodes. We will develop the software tools to:

Adapt existing OAI-PMH harvesting (Arc) and Lucene indexing software to the grid

Deploy a cluster to do parallel, high performance search based on the Lucene engine

Develop software support to move indices and metadata between low and high latency nodes.

Based on the results from the testbed on a small set of nodes, we will do a scalability study to identify bottlenecks when deploying on a large scale (more than 100 Grid and cluster nodes). We will adjust our design to address the issues we uncover in the study. As part of the project we will also develop plans for a phased roll-out of such a system. The targeted group of collections and harvesters will be existing grid nodes, which are mostly universities and laboratories. However, once these organizations have jump-started the digital library grid, the grid itself will be an incentive for other organizations to join and disseminate their information.

We will study and explore solutions to the following issues:

Plug-ins for existing OAI data providers to make them grid enabled

Adaptation of existing publishing tools (DSpace, Kepler) to work on the grid

Policies for who and what can be harvested from a grid enabled data provider

Enforcement of policies specified by the data provider

Ability to express for what purpose harvested (meta) data will be used, under what restrictions, and by whom.

As part of exploring the use of Grid technology for our project, we have already set up an independent Grid on two nodes using the GT3 toolkit, the latest toolkit based on the OGSI specification. We have successfully implemented a small subset of OAI-PMH as a grid service.

This element of the proposal is an exploratory project that will demonstrate the feasibility of a Google-like engine based on Grid technology for OAI-compliant digital libraries.

5.2 mod_oai – Getting OAI for Free

We further propose the development of an Apache module, mod_oai, which will allow for the easy proliferation and adoption of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Apache is an open-source web server (www.apache.org) that is used by 63% -- approximately 27 million -- of the websites in the world [Net03]. While the OAI-PMH has made a huge impact in the field of digital libraries (DLs), it has yet to make an impact in the general web community despite recent studies that have shown that OAI-PMH is applicable for a variety of purposes [Nel02, Van03]. While the total number of OAI-PMH sites is not known, we believe it to be around 500. The Apache web server defines an extensible module format to allow specific functionality to be incorporated directly into the web server. Building an Apache module that “automatically does” OAI-PMH would make the power and flexibility of OAI-PMH available to the web community at large.

The OAI-PMH is a simple protocol that defines six “verbs” to facilitate the incremental harvesting of “metadata”, or more generally, XML-expressible content. Typically, an OAI-PMH repository is attached to an existing digital library, database, or some other pre-existing content management system. While this allows for the harvesting of the “deep” or “hidden” web, commercial web robots (e.g. Google) do not yet implement the OAI-PMH. This is partially due to the fact that the OAI-PMH community is not yet large enough to appear on their radar.

5.2.1 Development

mod_oai will enable sites that have URLs similar to:

http://www.getty.edu/

to automatically process OAI-PMH requests, such as:

http://www.getty.edu/?verb=Identify
http://www.getty.edu/?verb=ListRecords&from=2003-05-29&metadataPrefix=oai_dc

The OAI-PMH requests would be made on all the files that can be “seen” by the Apache web server. To the extent possible, Dublin Core metadata will be extracted from the file and used when the metadataPrefix “oai_dc” is used. A separate metadataPrefix, “oai_file”, will be defined to allow the entire file to be transmitted via OAI-PMH using Base64 encoding. This will enable sites such as Google to harvest full-text documents via the OAI-PMH and index them. The presence of mod_oai will not interfere with any other application-level OAI-PMH implementations that might be running on the same server. For example, mod_oai would not interfere with an eprints.org OAI-PMH repository at:

http://www.getty.edu/collections/perl/oai.pl?verb=ListMetadataFormats
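The idea behind the “oai_file” format, carrying a whole file as Base64 text inside an XML response, can be sketched as follows. This is only an illustration in Python under our own assumptions: the element name file and the attribute identifier are hypothetical placeholders, since the actual “oai_file” schema is not defined here.

```python
import base64
import xml.etree.ElementTree as ET


def encode_file_record(identifier, content):
    """Wrap raw file bytes in an XML element as Base64 text."""
    # "file" and "identifier" are hypothetical names, not a defined schema.
    record = ET.Element("file", {"identifier": identifier})
    record.text = base64.b64encode(content).decode("ascii")
    return ET.tostring(record, encoding="unicode")


def decode_file_record(xml_text):
    """Recover the identifier and the original bytes from the XML element."""
    record = ET.fromstring(xml_text)
    return record.get("identifier"), base64.b64decode(record.text or "")


original = b"<html><body>Hello</body></html>"
xml_text = encode_file_record("http://www.getty.edu/index.html", original)
identifier, content = decode_file_record(xml_text)
```

Base64 keeps arbitrary binary content (images, PDF files) safe inside an XML document, at the cost of roughly one third more bytes on the wire.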

To understand how mod_oai would work, first consider how a regular web robot such as Google works. Consider a website of 100 pages, where 5 of the pages are updated weekly. Assume that the website is harvested weekly, both by a Google robot and by an OAI-PMH harvester through the mod_oai interface. Assume the mod_oai interface is configured to distribute 10 documents, batched together, per connection. Assuming smart web robots that perform conditional HTTP GETs (http status code 304) based on last modified dates, the robot will not download more files than it needs to, but it will have to query every individual web page at the web site to determine its date of last modification. Figure 5 illustrates this model.

But the OAI-PMH model saves the considerable overhead of establishing TCP/IP and HTTP connections for documents that have not changed. Instead of having to ask each of the 100 files if their modification date has changed, the OAI-PMH harvester asks the mod_oai interface which files have changed, and the mod_oai interface only responds with the files that meet the criteria (Figure 6).
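The difference between the two models can be made concrete with a small simulation of the 100-page example. The sketch below is written in Python; the file names and dates are invented for illustration, and the single harvester request assumes that the five changed files fit in one batch of 10.

```python
from datetime import date

# Hypothetical site: 100 files, 5 of which changed after the last harvest.
site = {f"/page{i}.html": date(2003, 5, 1) for i in range(100)}
for i in range(5):
    site[f"/page{i}.html"] = date(2003, 6, 2)


def robot_crawl(site, since):
    """Conditional-GET model: one request per file, changed or not."""
    requests = 0
    changed = []
    for path, modified in site.items():
        requests += 1  # each file must be asked individually
        if modified >= since:
            changed.append(path)  # unchanged files would answer 304
    return requests, changed


def oai_harvest(site, since):
    """Datestamp-selective model: one request, filtered on the server side."""
    changed = [path for path, modified in site.items() if modified >= since]
    return 1, changed


last_harvest = date(2003, 5, 29)
robot_requests, robot_files = robot_crawl(site, last_harvest)
oai_requests, oai_files = oai_harvest(site, last_harvest)
print(robot_requests, oai_requests, len(oai_files))  # 100 1 5
```

Both approaches end up with the same five files; they differ only in how many requests are needed to find out which files those are.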


Figure 5. Unnecessary Harvesting with Web Robots

Figure 6. Incremental Harvesting with mod_oai

Given the parameters stated above, Table 1 shows the relative load placed by each method. If the web site is larger, say 1000 or 10000 files, the unnecessary network traffic avoided with mod_oai would be even greater. Even if web sites updated their content (or added new content) more rapidly, the mod_oai approach would still reduce the number of connections by a factor of the number of files batched together in the response (in this example, by a factor of 10). Not only would this reduce the network load for the robots and web sites, it would also allow for much quicker harvesting of updates and thus more up-to-date web indices.

                     Initial Harvest     Weekly Updates
Google Robot         100 connections     100 connections
OAI-PMH Harvester    10 connections      1 connection

Table 1. Network and Server Load By Harvesting Type
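The entries in Table 1 follow from simple arithmetic: a robot needs one connection per file, while a harvester needs one connection per batch, rounded up. The Python sketch below reproduces the table for the 100-file site and extends it to larger sites; the function names are ours, and applying the same 5% weekly update rate to the larger sites is an assumption made only for illustration.

```python
import math


def robot_connections(total_files):
    """A web robot opens one connection per file, changed or not."""
    return total_files


def harvester_connections(files_to_send, batch_size=10):
    """An OAI-PMH harvester needs one connection per batch of files."""
    return math.ceil(files_to_send / batch_size)


for total in (100, 1000, 10000):
    changed = total * 5 // 100  # assumed 5% weekly update rate
    print(
        total,
        robot_connections(total),        # robot, initial harvest
        harvester_connections(total),    # harvester, initial harvest
        robot_connections(total),        # robot, weekly update
        harvester_connections(changed),  # harvester, weekly update
    )
```

For the 100-file site this gives 100 and 10 connections for the initial harvest and 100 and 1 for the weekly update, matching Table 1.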

5.3 Prospective User Base

This DL Grid proposal is an exploratory project that will demonstrate the feasibility of a Google-like engine based on Grid technology for OAI-compliant digital libraries. As part of the project we will also develop plans for a phased roll-out of such a system. The targeted group of collections and harvesters will be existing grid nodes, which are mostly universities and laboratories. However, once these organizations have jump-started the digital library grid, the grid itself will be an incentive for other organizations to join and disseminate their information. To achieve maximum dissemination and adoption, we will develop both mod_oai and DL Grid under the GNU General Public License (GPL) and distribute them through sourceforge.net, a popular channel for distributing open source software.

We will also register the mod_oai module with modules.apache.org, as well as prominently feature it on the OAI website (www.openarchives.org). We also intend to document and demonstrate mod_oai at these (or similar) conferences, which are the canonical events for their respective communities:

Presentation, Digital Library Federation (DLF) Spring Forum, New Orleans, LA, April 2004 (www.diglib.org/forums.htm)

Presentation, Coalition for Networked Information (CNI) Spring Task Force Meeting, April 2004 (www.cni.org/tfms/2003a.spring/)

Poster, Thirteenth International World Wide Web Conference (WWW 2004), New York NY, May 2004 (www.www2004.org)

Poster, Joint ACM/IEEE Conference on Digital Libraries (JCDL 2004), Tucson AZ, June 2004 (www.jcdl2004.org)

Presentation, Digital Library Federation (DLF) Fall Forum, Baltimore MD, October 2004 (www.diglib.org/forums.htm)

If mod_oai is successful within the Apache community, we will investigate developing similar functionality for other web servers, such as Microsoft IIS, which represents approximately 21% of the known webservers [Net03].

We believe that sufficient interest exists within the OAI-PMH community for tools to further ease the creation of additional DPs. Since the two principal investigators on this proposal are 50% of the OAI-PMH governing body, we are in a unique position to encourage adoption within the OAI-PMH community. To encourage adoption of mod_oai (and OAI-PMH) in the general web community, we intend to emphasize the network traffic that can be eliminated with incremental harvesting, as was illustrated in Table 1.

6 Relationship to Mellon’s Information Technology Program

The proposed work will lead to better ways of managing information infrastructure for research, teaching, and learning. It will enhance the infrastructure for OAI, thereby making a large body of resources, now scattered across various collections, accessible to researchers, students, and the general public.

7 Project Management

The work will be jointly coordinated by Kurt Maly, Michael Nelson, Herbert Van de Sompel, and Mohammad Zubair. The digital library grid part will be managed by Kurt Maly and Mohammad Zubair, and the mod_oai part will be managed by Michael Nelson and Herbert Van de Sompel. A detailed budget is included for Old Dominion’s expenses; Los Alamos’ expenses will be covered under an existing project.

8 Tasks and Timelines

Schedule (tasks charted against project months 3, 6, 9, 12, 15, and 18):

Grid DL
  Software
    GT3 Based Grid Testbed of 9 nodes
    Distributed Harvesting on Grid
    Parallelizing Lucene Indexing
    Software for Moving Indices
    High-performance, Parallel Search
    Testing & Evaluation
  Study - Issues
    OAI as Grid Service, OAI DP Plug-in
    Harvesting and Use Policies for DP
    Publishing Tools, Exposure to Web-crawlers
    Roll-out Plan

mod_oai
  Code development
  Metadata specification
  Small-scale testing (~ 1k files)
  Large-scale testing (> 1M files)
  Dissemination

9 ODU Digital Library Background

ODU is actively involved in digital library research, with a particular focus on building interoperable digital libraries (http://dlib.cs.odu.edu/). The ODU group has three major digital library grants from the NSF, along with digital library grants from NASA Langley Research Center, Los Alamos National Laboratory, and Sandia National Laboratories. We developed the first OAI-compliant service provider, Arc (http://arc.cs.odu.edu). Arc is currently harvesting around 200 collections and contains over 6 million records. Besides Arc, ODU has worked with Phillips Air Force Research Laboratory (AFRL), Los Alamos National Laboratory, and NASA Langley Research Center in building a Technical Report Interchange (TRI) federation of the report collections available at the three organizations in their native digital libraries. Recently, ODU demonstrated how OAI could be used for building Kepler, a framework for individual publishers, and projects for the NSDL and a new OAI-based NCSTRL. It also has an NSF-funded project on a peer-to-peer based digital library and a federated digital library for the physics community. The latter exploits the rich metadata available from some of the contributors and creates specialized services such as reference linking, search by equations, search by similar authors and subjects, and annotations. ODU also has several active projects in the areas of complex digital objects, digital preservation, log analysis, and recommendation systems.

10 Intellectual Property

We will develop all software under the GNU General Public License (GPL) and distribute it through sourceforge.net, a popular channel for distributing open source software, in accordance with the Andrew W. Mellon Foundation’s intellectual property policy.

11 References

[Berg01] Bergman, “The Deep Web: Surfacing Hidden Value”, Journal of Electronic Publishing, 7(1), http://www.press.umich.edu/jep/07-01/bergman.html

[Fos01] The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001.

[GloA] http://www.globus.org/research/papers/anatomy.pdf

[GloB] http://www.globus.org/research/papers/Final_OGSI_Specification_V1.0.pdf

[Hag03] Hagedorn, K. “OAIster: a no `dead ends’ OAI service provider”, Library Hi Tech, 21(2), 2003, pp. 170-181.

[Lag01] Lagoze, C. & Van de Sompel, H. (2001). The Open Archives Initiative: Building a low-barrier interoperability framework. Proceedings of the First ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA.

[Lag02] Lagoze, Van de Sompel, Nelson & Warner, “The Open Archives Initiative Protocol for Metadata Harvesting”, http://www.openarchives.org/OAI/openarchivesprotocol.html

[Liu01] Liu, X., Maly, K., Zubair, M. and Nelson, M.L. Arc: An OAI Service Provider for Cross Archive Searching, Proceedings of the First ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA, June 24-28, 2001, pp. 65-66.

[Liu02] Liu, X., Maly, K., Zubair, M. and Nelson, M. DP9 – An OAI Gateway Service for Web Crawlers. Submitted to the Second ACM/IEEE Joint Conference on Digital Libraries, Portland, Oregon, July 14-18, 2002.

[Mal01] Maly, K., Zubair, M. and Liu, X. Kepler: An OAI Data/Service Provider for the Individual. D-Lib Magazine, 7(4), 2001.

[Nel02] Nelson, M. “Service Providers: Future Perspectives”, 2nd OAI Workshop, CERN 2002, http://agenda.cern.ch/fullAgenda.php?ida=a02333

[Nel03] Nelson, M., Rocker, J. and Harrison, T. “OAI and NASA Scientific and Technical Information”, Library Hi Tech, 21(2), 2003, pp. 140-150.

[Net03] Netcraft Web Server Survey Archives, http://news.netcraft.com/archives/web_server_survey.html

[Sin03] A Metadata Catalog Service for Data Intensive Applications. G. Singh, S. Bharathi, A. Chervenak, E. Deelman, C. Kesselman, M. Manohar, S. Patil, L. Pearlman. To appear in Proceedings of Supercomputing 2003 (SC2003), November 2003.

[Van00] Van de Sompel, H., Krichel, T., Nelson, M. L., Hochstenbach, P., Lyapunov, V. M., Maly, K., Zubair, M., Kholief, M., Liu, X. & O' Connell, H. (2000). The UPS Prototype: An Experimental End-user Service across E-print Archives. D-Lib Magazine, 6(2).

[Van03] Van de Sompel, Young & Hickey, “Using the OAI-PMH…Differently”, D-Lib Magazine, 9(7/8), http://www.dlib.org/dlib/july03/young/07young.html

12 Description of Old Dominion University

Located in historic Norfolk, Va., the 188 acres of the Old Dominion University campus stretch from the Elizabeth River to the Lafayette River. Although situated in a metropolitan setting, the University offers a small-college look and feel, with tree-lined walkways, a mix of old and new buildings, and colorful gardens and ponds. Founded in 1930 as a division of the College of William and Mary, Old Dominion has come into its own over the years and is now one of only 100 public universities with a Carnegie/Doctoral Research-Extensive distinction (http://www.odu.edu/).

The Digital Library Research Group is housed in the Computer Science Department (http://www.cs.odu.edu/), which is part of the College of Sciences (http://web.odu.edu/webroot/orgs/sci/colsciences.nsf/pages/sciences).
