Dapsys08 dl on_grid


Transcript of Dapsys08 dl on_grid

Page 1: Dapsys08 dl on_grid

Cluj Napoca, 28 August 2008

2008 IEEE International Conference on Intelligent Computer Communication and Processing

Digital Libraries Workshop

Towards a GRID-Based Digital Library Management System.

Gheorghe Sebestyén-Pál (1), Doina Banciu (2), Tünde Bálint (1), Bogdan Moscaiuc (1), and Ágnes Sebestyén-Pál (1)

(1) Technical University of Cluj-Napoca

(2) ICI Bucharest

Page 2: Dapsys08 dl on_grid

Debrecen, 3-5 September 2008, DAPSYS'08: 7th International Conference on Distributed and Parallel Systems

Content

- Classical vs. Digital Libraries
- Recent research on Digital Libraries (DL)
- Main issues and requirements for DLs
- An ontology-based DL model
- Grid-enabled DL
- Implementation considerations of a pilot DL
- Experiments
- Conclusions

Page 3: Dapsys08 dl on_grid

Classical vs. Digital Libraries

- Classical library: a repository of knowledge organized mainly on paper
- Digital library
  - Not only a digitized version of a classical library
  - Adds a new set of functionalities and services (e.g. access control, resource management and allocation, complex search and processing services)
  - A data exchange and cooperation environment
  - DLs are becoming digital content management systems
  - Incorporates a wide variety of formats and data types (text, audio, video, multi-document complex digital objects)
  - Uses a variety of communication and data-exchange protocols and standards

Page 4: Dapsys08 dl on_grid

IT and Communication technologies involved in the implementation of digital libraries

http://mapageweb.umontreal.ca/turner/meta/english/metamap.html

Page 5: Dapsys08 dl on_grid

Goals for modern DLs

- DELOS project’s vision: “to enable any person to access all human knowledge anytime and anywhere, in a friendly, multi-modal, efficient, and effective way, by overcoming barriers of distance, language, and culture and by using multiple Internet-connected devices”
- DL: a knowledge repository and an information exchange infrastructure that allows:
  - data generation, processing and seamless access to relevant information, regardless of the geographic distribution of hardware resources, databases or persons.

Page 6: Dapsys08 dl on_grid

Research in digital libraries

- DELOS Network of Excellence
  - Goals: to define and implement digital libraries on new computing and communication technologies
  - Achievements: definition of functional and architectural requirements for DL implementation
- BRICKS project
  - Goals: to design a user- and service-oriented space for sharing knowledge and resources in a multi-cultural heritage
  - Achievements: definition of a digital library architecture for a very broad and heterogeneous user community; automatic indexing and annotation functionalities
- OpenDLib project
  - Goal: development of a software toolkit for generating dedicated DLs
  - Achievements: tools for content harvesting from existing resources
- Fedora, DSpace: open-source software for DLs
- Lucene: open-source search engine

Page 7: Dapsys08 dl on_grid

Research in digital libraries (cont.)

- Diligent project (part of EGEE project)
  - Goal: the use of GRID infrastructure for DL implementation
  - Achievements: a new vision of the DL concept:
    - DL = a dynamic digital content repository and management system dedicated to a purpose (e.g. a project, an art collection, an academic course)
    - Definition of generic DL services mapped on GRID services
    - DLs dedicated to different domains, with powerful processing capabilities
- SINRED project (National Excellency project)
  - Goal: development of a national framework for DLs specialized in technical sciences and research
  - Achievements: evaluation of requirements, evaluation of existing software, infrastructure development, DL model definition, implementation of a pilot DL
- SIPADOC project (National research program)
  - Goal: reevaluation of the national patrimony through DLs
  - Achievements: evaluation of digitizing tools

Page 8: Dapsys08 dl on_grid

Key issues in DL implementation

- Architectural issues:
  - Distributed nature of storage, processing and access resources
  - Scalability, flexibility, interoperability
- Functional requirements:
  - Core functions: storage, indexing and annotation, data search, content retrieval, user management
  - Content organization should reflect semantic connections
- Processing facilities
  - Data processing services, specialized for different fields
  - Pattern search and recognition
- QoS issues
  - Restricted time to obtain relevant information
  - Reasonable time for complex data processing
- User and access control management
  - Virtual organizations
  - Role-based access

Page 9: Dapsys08 dl on_grid

DL = Essence & Metadata Management

[Diagram. Content types: text, audio, video. Blocks: digital content generation and harvesting; management of essence; automatic feature (metadata) extraction; metadata management; cataloging, indexing, annotation; access and visualization; cataloging information system.]

Page 10: Dapsys08 dl on_grid

An ontology-based Digital Library approach

- Ontology: concepts and relations together with a reasoning engine
- Ontology for technical and scientific domains
- Main concepts:
  - Digital objects: association of content, metadata and procedures
    - Examples: articles, technical reports, prospects, PhD theses, patents
  - Digital collections
    - Set of digital objects structured for a given goal/purpose or based on a given criterion
    - Examples: articles of an author, documents of a domain
  - Events
    - Conferences, workshops, seminars
  - Processes
    - Projects
    - Courses
  - Virtual organizations
    - Roles
    - Users
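
The digital object and digital collection concepts above can be sketched as plain data structures. This is a minimal illustration in Python, assuming that procedures are simple callables attached to an object. The class and field names (`DigitalObject`, `DigitalCollection`, `by_criterion`) and the sample objects are hypothetical and are not taken from the authors' ontology.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DigitalObject:
    """A digital object: content plus metadata plus procedures."""
    identifier: str
    content: bytes
    metadata: Dict[str, str] = field(default_factory=dict)
    # Procedures are modeled as named callables applied to the content.
    procedures: Dict[str, Callable[[bytes], object]] = field(default_factory=dict)

    def run(self, name: str) -> object:
        """Apply one of the object's procedures to its own content."""
        return self.procedures[name](self.content)


@dataclass
class DigitalCollection:
    """A set of digital objects grouped for a purpose or by a criterion."""
    name: str
    objects: List[DigitalObject] = field(default_factory=list)

    @classmethod
    def by_criterion(cls, name, objects, criterion):
        """Build a collection from the objects that satisfy `criterion`."""
        return cls(name, [o for o in objects if criterion(o)])


# Hypothetical sample data, for illustration only.
article = DigitalObject(
    identifier="obj-1",
    content=b"grid based digital library",
    metadata={"type": "article", "creator": "A. Author"},
    procedures={"word_count": lambda data: len(data.split())},
)
report = DigitalObject(
    identifier="obj-2",
    content=b"technical report",
    metadata={"type": "technical report", "creator": "B. Author"},
)

# "Articles of an author": a collection defined by a criterion.
by_author = DigitalCollection.by_criterion(
    "articles of A. Author",
    [article, report],
    lambda o: o.metadata.get("creator") == "A. Author",
)
```

Events, processes and virtual organizations could be modeled the same way; they are left out to keep the sketch short.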

Page 11: Dapsys08 dl on_grid

Grid-enabled digital library services

- Why DLs on GRID infrastructure?
  - Huge volume of documents/digital objects
  - Concurrent access and multiple search engines (see Google)
  - Multimedia streaming
  - Automatic indexing and annotation
  - Complex processing requires prohibitive time
  - User management through virtual organizations
  - Job distribution facilities offered by GRID

Page 12: Dapsys08 dl on_grid

DL functions mapped on GRID services

[Diagram. Layers: Digital Library; GRID Services; computing, storage and communication resources. Labels: collections management; catalog and metadata management; digital objects management; users' management; data visualization; virtual organizations management; resource management; task distribution; processing; data distribution and replication; data processing.]

Page 13: Dapsys08 dl on_grid

Experiments

- Two approaches:
  - DL implementation on Alchemi GRID (Microsoft .NET-based)
    - Job distribution at thread level
    - Explicit GRID programming
    - Experiments with multimedia streaming (multimedia content distribution)
  - DL implementation on Condor GRID (open source)
    - Job distribution at task level
    - Job and data distribution is transparent to the DL application (distribution is made through separate scripts)
    - Experiments with “key-word search” in the whole DL content
      - The execution time decreased as the number of executor computers increased
      - For more than 5 executors, the scheduling and communication time is comparable with the execution time
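
A task-level distribution of a key-word search, as described for the Condor experiment, might look roughly like the sketch below. It is only an illustration: a local thread pool stands in for the grid executors, and the function names and the toy corpus are assumptions, not the authors' scripts or data.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def split_into_partitions(doc_ids: List[str], n: int) -> List[List[str]]:
    """Deal document ids round-robin into n partitions (one per executor)."""
    return [doc_ids[i::n] for i in range(n)]


def search_partition(docs: Dict[str, str], ids: List[str], keyword: str) -> List[str]:
    """The task run by one executor: scan its share of the documents."""
    needle = keyword.lower()
    return [doc_id for doc_id in ids if needle in docs[doc_id].lower()]


def distributed_search(docs: Dict[str, str], keyword: str, executors: int) -> List[str]:
    """Run one search task per partition and merge the partial results."""
    partitions = split_into_partitions(sorted(docs), executors)
    # The thread pool is a local stand-in for remote grid executors.
    with ThreadPoolExecutor(max_workers=executors) as pool:
        futures = [pool.submit(search_partition, docs, part, keyword)
                   for part in partitions]
        hits = [doc_id for f in futures for doc_id in f.result()]
    return sorted(hits)


# Hypothetical toy corpus, not the pilot library's content.
corpus = {
    "d1": "Grid middleware for digital libraries",
    "d2": "Ontology-based metadata",
    "d3": "Scheduling jobs on a GRID",
    "d4": "Dublin Core elements",
}
print(distributed_search(corpus, "grid", executors=3))
```

Each partition corresponds to one job. On a real grid, the per-job scheduling and data transfer cost is what eventually offsets the gain from adding executors.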

Page 14: Dapsys08 dl on_grid

A pilot implementation of a Digital library framework developed with GRID support

- Goal: implementation of a digital content storage and retrieval system dedicated to educational and scientific activities (courses, projects, etc.)
- Main requirements:
  - A DL adaptable for a given purpose/goal
  - Access controlled and restricted with virtual organizations
  - Ontology-based approach (concepts, relations, semantic search)
  - Advanced search procedures
  - GRID-enabled full-text search services, for better response time
  - Access through Internet browsers
- The result: a distributed digital library application, which allows:
  - Management of digital objects (upload, storage, indexing, metadata creation)
  - Management of collections
  - Management of users and virtual organizations
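
The pilot details state that metadata is generated according to Dublin Core. As a rough sketch of what metadata creation for one uploaded object could produce, the following Python builds a record from the fifteen elements of the Dublin Core Metadata Element Set. The `metadata` wrapper element, the function name, and the `type` and `language` values are assumptions made for illustration; the pilot's actual record format is not described in the slides.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields):
    """Build a <metadata> element from (element_name, value) pairs.

    Elements may repeat (e.g. several creators). The <metadata> wrapper
    is a hypothetical container, not a format prescribed by the slides.
    """
    root = ET.Element("metadata")
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root


# Title, creators and year come from the title slide; the rest is illustrative.
record = dublin_core_record([
    ("title", "Towards a GRID-Based Digital Library Management System"),
    ("creator", "Gheorghe Sebestyén-Pál"),
    ("creator", "Doina Banciu"),
    ("date", "2008"),
    ("type", "Text"),
    ("language", "en"),
])
print(ET.tostring(record, encoding="unicode"))
```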

Page 15: Dapsys08 dl on_grid

Pilot DL details: (www.bib-dig.utcluj.ro)

- Management of digital objects
  - Upload of digital documents
  - Annotation, metadata generation according to Dublin Core
  - Distributed storage of data
- Management of collections
  - Define a new collection
  - Attach new documents to an existing collection
  - Associate access rights to a collection
- Management of users and virtual organizations
  - Define new users and new virtual organizations
  - Define roles
  - Associate roles to users and collections
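
The user, role and collection operations listed above suggest a role-based access check along the following lines. This is a hedged sketch, not the pilot's implementation: the `VirtualOrganization` class, its method names, the permission strings and the sample users are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class VirtualOrganization:
    """Hypothetical model: a VO holds users, roles and collection rights."""
    name: str
    # role name -> permissions the role grants, e.g. {"read", "upload"}
    roles: Dict[str, Set[str]] = field(default_factory=dict)
    # user name -> roles held by that user inside this VO
    user_roles: Dict[str, Set[str]] = field(default_factory=dict)
    # (role, collection) pairs: which roles apply to which collection
    collection_roles: Set[Tuple[str, str]] = field(default_factory=set)

    def define_role(self, role: str, permissions: Set[str]) -> None:
        self.roles[role] = set(permissions)

    def assign_role(self, user: str, role: str) -> None:
        if role not in self.roles:
            raise KeyError(f"undefined role: {role}")
        self.user_roles.setdefault(user, set()).add(role)

    def attach_role(self, role: str, collection: str) -> None:
        if role not in self.roles:
            raise KeyError(f"undefined role: {role}")
        self.collection_roles.add((role, collection))

    def is_allowed(self, user: str, action: str, collection: str) -> bool:
        """True if one of the user's roles is attached to the collection
        and grants the requested action."""
        return any(
            (role, collection) in self.collection_roles
            and action in self.roles[role]
            for role in self.user_roles.get(user, set())
        )


# Hypothetical example: a course with one teacher and one student.
vo = VirtualOrganization("course-staff")
vo.define_role("teacher", {"read", "upload"})
vo.define_role("student", {"read"})
vo.assign_role("alice", "teacher")
vo.assign_role("bob", "student")
vo.attach_role("teacher", "lecture-notes")
vo.attach_role("student", "lecture-notes")
```

With this setup, `vo.is_allowed("bob", "upload", "lecture-notes")` is false, while the same call for `"alice"` is true.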

Page 16: Dapsys08 dl on_grid

Snapshots of the DL application’s interface

bib-dig.utcluj.ro

Page 17: Dapsys08 dl on_grid

Snapshots of the DL application’s interface

Page 18: Dapsys08 dl on_grid

Search techniques in DLs

- Through key-word or index search:
  - Database techniques
- Through semantic Information Retrieval:
  - Semantic graph with documents and concepts
- Through non-semantic Information Retrieval:
  - Naive Bayes Algorithm
    - Probabilistic approach
    - Based on probabilistic similarity between documents
  - Topic-Based Vector Space Model Algorithm
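
To make the probabilistic approach concrete, here is a textbook multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The slides do not say how the authors applied Naive Bayes in the pilot, so this shows only the general technique; the class name, labels and training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # label -> number of documents
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def log_score(self, text, label):
        """log P(label) plus the sum of log P(word | label) over the text."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        label_total = sum(self.word_counts[label].values())
        denom = label_total + len(self.vocabulary)
        for word in text.lower().split():
            # Words never seen with this label get the smoothed minimum.
            score += math.log((self.word_counts[label][word] + 1) / denom)
        return score

    def classify(self, text):
        """Return the label with the highest log score."""
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Made-up training data, for illustration only.
nb = NaiveBayes()
nb.train("grid scheduling executor job", "grid")
nb.train("condor job distribution grid", "grid")
nb.train("ontology concept relation metadata", "semantics")
nb.train("semantic search ontology graph", "semantics")

print(nb.classify("job scheduling on a grid"))
print(nb.classify("ontology metadata"))
```

Working in log space avoids numeric underflow when documents are long, which is why the scores are sums of logarithms rather than products of probabilities.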

Page 19: Dapsys08 dl on_grid

Experimental results

[Chart: Execution time vs. number of executor nodes. X axis: Nodes (1 to 5). Y axis: Time (s), 0 to 8000. Series: search execution time; scheduling and communication time (case 1 and case 2); total time (case 1 and case 2).]
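
The chart legend suggests that total time is the sum of search execution time and scheduling and communication time. A very simple model of that trade-off is sketched below, assuming the search work splits evenly across nodes and the overhead grows linearly with the node count. Both assumptions and all numbers are illustrative only; they are not the measured values from the chart.

```python
import math


def total_time(search_time_one_node, overhead_per_node, nodes):
    """Return (search, overhead, total) under the simple model:
    search work divides evenly, overhead grows linearly with nodes."""
    search = search_time_one_node / nodes
    overhead = overhead_per_node * nodes
    return search, overhead, search + overhead


def crossover_nodes(search_time_one_node, overhead_per_node):
    """Node count at which overhead equals search time in this model:
    T1 / n = c * n, hence n = sqrt(T1 / c)."""
    return math.sqrt(search_time_one_node / overhead_per_node)


# Made-up inputs chosen only to show the crossover, not measured data.
T1, C = 100.0, 4.0
for n in range(1, 8):
    search, overhead, total = total_time(T1, C, n)
    print(f"nodes={n} search={search:.1f} overhead={overhead:.1f} total={total:.1f}")
print("crossover at", crossover_nodes(T1, C), "nodes")
```

Beyond the crossover point, adding nodes mostly adds overhead, which is consistent with the qualitative observation reported for the key-word search experiment.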

Page 20: Dapsys08 dl on_grid

Experiments

Page 21: Dapsys08 dl on_grid

Conclusions

- DLs are complex content management systems that extend the functionalities of classical libraries:
  - Semantic organization of a wide variety of information formats
  - Multiple search and data retrieval techniques (including full-text and semantic search):
    - Key-word full-text search
    - Semantic search
    - Statistical and probabilistic retrieval and classification
  - Access control to distributed and remote data
- DLs are data exchange and cooperation environments
  - Useful for remote and cooperative work
- DLs must include powerful search and data retrieval engines
- GRID infrastructures may be a feasible support in the implementation of DLs
  - For more efficient parallel search, classification or automatic annotation

Page 22: Dapsys08 dl on_grid

Cluj Napoca, 28 August 2008

2008 IEEE International Conference on Intelligent Computer Communication and Processing

Digital Libraries Workshop

Thank you for your attention

Questions?