PSU/Villanova/VT Discussion Virginia Tech’s Digital Library Research Laboratory Jan. 10, 2005 -- PSU Edward A. Fox, [email protected] Virginia Tech, Blacksburg, VA 24061 USA http://fox.cs.vt.edu/talks/ http://fox.cs.vt.edu/cv.htm


PSU/Villanova/VT Discussion

Virginia Tech’s Digital Library Research Laboratory

Jan. 10, 2005 -- PSU

Edward A. Fox, [email protected]

Virginia Tech, Blacksburg, VA 24061 USA

http://fox.cs.vt.edu/talks/

http://fox.cs.vt.edu/cv.htm

Acknowledgements (Selected)

• Sponsors: ACM, Adobe, AOL, CAPES, CNI, CONACyT, DFG, IBM, Microsoft, NASA, NDLTD, NLM, NSF (IIS-9986089, 0086227, 0080748, 0325579; ITR-0325579; DUE-0121679, 0136690, 0121741, 0333601), OCLC, SOLINET, SUN, SURA, UNESCO, US Dept. Ed. (FIPSE), VTLS

Acknowledgements: Faculty, Staff

• Lillian Cassel, Debra Dudley, Roger Ehrich, Joanne Eustis, Weiguo Fan, James Flanagan, C. Lee Giles, Eberhard Hilf, John Impagliazzo, Filip Jagodzinski, Rohit Kelapure, Neill Kipp, Douglas Knight, Deborah Knox, Aaron Krowne, Alberto Laender, Gail McMillan, Claudia Medeiros, Manuel Perez, Naren Ramakrishnan, Layne Watson, …

Acknowledgements: Students

• Pavel Calado, Yuxin Chen, Fernando Das Neves, Shahrooz Feizabadi, Robert France, Marcos Goncalves, Nithiwat Kampanya, S.H. Kim, Aaron Krowne, Bing Liu, Ming Luo, Paul Mather, Saverio Perugini, Unni Ravindranathan, Ryan Richardson, Rao Shen, Ohm Sornil, Hussein Suleman, Ricardo Torres, Wensi Xi, Xiaoyan Yu, Baoping Zhang, Qinwei Zhu, …

Stepping Stones & Pathways:

Improving retrieval by chains of relationships between document topics

Fernando Das-Neves, Virginia Tech DLRL

A Little Experiment

(Compare a simple query with a longer version that explicitly includes stepping stones)

• “Literary Style in Sherlock Holmes stories”

• Note: Numbers are total relevant web pages in top 20 Google results for the query made up of terms on either end of the link.

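[Figure: the direct query linking “Sherlock Holmes” and “Literary Style” returns 2 relevant documents; the pathway through the stepping stones “Conan Doyle” and “Victorian Novel” has links labeled with 4, 5, 20, and 5 relevant documents.]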

Another Example

• “What is the Relationship between Data Mining and Recommender Systems?”

• Naïve Results: There are many matches that are possible answers.

• Discussion: But many of the pages with co-occurrences give no real information about the requested relationship.

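[Figure: the direct link between “Recommender Systems” and “Data Mining” compared with pathways through the stepping stones “Social Networks”, “Collaborative Filtering”, and “Machine Learning”; links are labeled with the number of relevant documents: 7, 10, 10, 9, 11, 15.]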

An Alternative Interpretation of a Query in IR:

• A query represents two related, separable concepts.

• Objective: Retrieve a sequence of documents that support a valid set of chains of relationships between the two concepts.

• Input: a query representing two concepts.

• Output: two groups of documents + a set of stepping stones (document groups, i.e., clusters) connecting the topics by pathways (relations among clusters).

Type of Questions Matching Alternative Interpretation

• Ill-defined questions, with non-enumerated answers:
– “How or why is X related to Y?”
– “What is the X of Y?”

• Even if queries of the form “give me something about X” lead to relevant docs, making the relations explicit (as our semi-automatic method does) can increase the quantity and quality of information in the query result.

Why is this useful?

• Questions of this type are common.
– For example, such questions often occur during research studies.
– These occur often in educational settings, e.g., for homework.
– These occur often in workplace settings, requiring gathering and relating of information.

• Handling of this type of question by current systems often is inadequate.

How to Build Stepping Stones and Pathways?

• Our approach involves a belief network, to combine content+structure in document similarity calculation, including citation and co-citation similarities.

• Find two relevant document sets, each related to one of the two original sub-queries.

• Find a diverse set of strong candidates, each connecting the two subsets, but as different as possible from other candidates.

• Create stepping stones by finding similar documents to those candidates; keep the clusters that are heavily cited, or whose documents are highly correlated (in all aspects).

• Repeat the process, finding a new stepping stone in between each pair of clusters that are weakly related, until the pathway length is too long, or the similarity is sufficient.
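The first steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: plain bag-of-words cosine similarity stands in for the belief network (no citation or co-citation evidence is used), and the names `top_k`, `stepping_stone_candidates`, the `diversity` weight, and the toy corpus are all hypothetical, not part of the system described in the talk.

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    """Bag-of-words term counts (assumption: stands in for the belief network)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, docs, k):
    """Ids of the k documents most similar to one sub-query (zero matches dropped)."""
    q = vectorize(query)
    ranked = sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True)
    return [d for d in ranked[:k] if cosine(q, docs[d]) > 0]

def stepping_stone_candidates(docs, set_a, set_b, n, diversity=0.5):
    """Greedily pick up to n documents that connect both sets yet differ from each other."""
    def link(d):
        # A candidate is only as strong as its weaker connection to the two sets.
        to_a = max(cosine(docs[d], docs[a]) for a in set_a)
        to_b = max(cosine(docs[d], docs[b]) for b in set_b)
        return min(to_a, to_b)

    pool = [d for d in docs if d not in set_a and d not in set_b and link(d) > 0]
    chosen = []
    while pool and len(chosen) < n:
        def score(d):
            # Penalize similarity to candidates that were already chosen.
            redundancy = max((cosine(docs[d], docs[c]) for c in chosen), default=0.0)
            return link(d) - diversity * redundancy
        best = max(pool, key=score)
        chosen.append(best)
        pool.remove(best)
    return chosen

# Toy corpus (hypothetical) echoing the Sherlock Holmes example.
corpus = {
    "d1": "sherlock holmes detective stories",
    "d2": "literary style prose analysis",
    "d3": "conan doyle wrote sherlock holmes in a victorian literary style",
    "d4": "cooking recipes pasta",
}
docs = {doc_id: vectorize(text) for doc_id, text in corpus.items()}
set_a = top_k("sherlock holmes", docs, 1)
set_b = top_k("literary style", docs, 1)
stones = stepping_stone_candidates(docs, set_a, set_b, n=2)
```

Each chosen candidate would then be expanded into a cluster of similar documents, and the procedure repeated between weakly related clusters, as the last two bullets describe.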

Streams, Structures, Spaces, Scenarios, and Societies (5S): A Formal Digital Library Framework and Its Applications

Marcos André Gonçalves

Doctoral defense

Virginia Tech, Blacksburg, VA 24061 USA

Informal 5S Definition: DLs are complex systems that

• help satisfy info needs of users (societies)

• provide info services (scenarios)

• organize info in usable ways (structures)

• present info in usable ways (spaces)

• communicate info with users (streams)

5Ss

Ss | Examples | Objectives
Streams | Text; video; audio; image | Describes properties of the DL content such as encoding and language for textual material or particular forms of multimedia data
Structures | Collection; catalog; hypertext; document; metadata | Specifies organizational aspects of the DL content
Spaces | Measure; measurable, topological, vector, probabilistic | Defines logical and presentational views of several DL components
Scenarios | Searching, browsing, recommending | Details the behavior of DL services
Societies | Service managers, learners, teachers, etc. | Defines service managers, responsible for running DL services; actors, that use those services

Hypotheses

• A formal theory for DLs can be built based on 5S.

• The formalization can serve as a basis for modeling and building high-quality DLs.

5S Framework and DL Development (Gonçalves)

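[Figure: the development phases Requirements, Analysis, Design, Implementation, and Test, annotated with 5S artifacts and tools. Labels: 5S; 5SL; OO Classes; Workflow; Components; DL Evaluation; 5SGraph; 5SLGen; Formal Theory/Metamodel; DL XML Log.]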

5SLGen: Automatic DL Generation

[Figure: 5SLGen workflow. Labels: 5S Meta Model; 5SLGraph; DL Expert; DL Designer; 5SL DL Model; 5SLGen; component pool (ODLSearch, ODLBrowse, ODLRate, ODLReview, …); Tailored DL Services; Practitioner; Researcher; Teacher. Phases: Requirements (1), Analysis (2), Design (3), Implementation (4).]

Research Questions

1. Can we formally elaborate 5S?

2. How can we use 5S to formally describe digital libraries?

3. What are the fundamental relationships among the Ss and high-level DL concepts?

4. How can we allow digital librarians to easily express those relationships?

5. Which are the fundamental quality properties of a DL? Can we use the formalized DL framework to characterize those properties?

6. Where in the life cycle of digital libraries can key aspects of quality be measured and how?

Outline

• Motivation: the problem
– Hypotheses and research questions

• Part 1: Theory
– 5S: introduction, formal definitions
– The formal ontology

• Part 2: Tools/Applications
– Language
– Visualization
– Generation
– Logging

• Part 3: Quality

• Conclusions, Future Work

5S and DL formal definitions and compositions (April 2004 TOIS)

[Figure: dependency graph of the formal definitions, with definition numbers as shown on the slide: relation (d.1), function (d.2), sequence (d.3), tuple (d.4), language (d.5), graph (d.6), grammar (d.7), streams (d.9), structures (d.10), event (d.10), measurable (d.12), measure (d.13), probability (d.14), vector (d.15), and topological (d.16) spaces, spaces (d.18), state (d.18), scenarios (d.21), services (d.22), transmission (d.23), societies (d.24), structural metadata specification (d.25), descriptive metadata specification (d.26), structured stream (d.29), digital object (d.30), collection (d.31), metadata catalog (d.32), repository (d.33), indexing service (d.34), searching service (d.35), hypertext (d.36), browsing service (d.37), digital library (minimal) (d.38).]

Digital Library Formal Ontology

[Figure: ontology graph grouping concepts under Streams (text, audio, image, video), Structures, Spaces (Measurable, Measure, Pr, Top, Vec, Metric), Scenarios, and Societies. Other node labels: do, ms, R, C, DMc, Ic, Se, Sc, e, SM, Ac, op. Relation labels: describes, stores, is_version_of/cites/links_to, extends, reuses, executes, participates_in, recipient, runs, inherits_from/includes, association, uses, employs, produces, belongs_to, contains, is_a, precedes, happens_before, redefines, invokes.]

Composition of key infrastructure services

[Figure: composition graph with edges labeled p and e. Service labels: Acquiring, Authoring, Digitizing, Submitting, Describing, Cataloguing, Indexing, Linking. Other labels: universal collection, C, Ic, DMC, Hypertext, doi, mskj, describes.]

Composition of additional services

[Figure: composition graph with edges labeled p and e. Group labels: Information Satisfaction Services (fundamental, composite); Infrastructure Services (Add_Value). Service labels: Searching, Browsing, Requesting, Recommending, Filtering, Binding, Visualizing, Expanding query, Rating, Training, Indexing. Other labels: Society, actor, query, anchor, handle, user model/expr, query/category, query’, classCt, IC, transformer, biuk, spj, and sets of digital objects and ratings.]

Ontology: Taxonomy of Services

• Infrastructure Services

– Repository-Building, Creational: Acquiring, Authoring, Cataloging, Crawling (focused), Describing, Digitizing, Harvesting, Submitting

– Repository-Building, Preservational: Conserving, Converting, Copying/Replicating, Translating (format)

– Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Linking, Logging, Measuring, Rating, Reviewing (peer), Surveying, Training (classifier), Translating, Visualizing

• Information Satisfaction Services: Binding, Browsing, Customizing, Disseminating, Expanding (query), Filtering, Recommending, Requesting, Searching

5SL: a DL Modeling language

• Domain-specific languages
– Address a particular class of problems by offering specific abstractions and notations for the domain at hand
– Advantages: domain-specific analysis, program management, visualization, testing, maintenance, modeling, and rapid prototyping.

• XML-based realization of 5S
– Interoperability
– Use of many standard sub-languages (e.g., MIME types, XML Schemas, UML notations)

Overview of 5SGraph

[Screenshot: the 5SGraph interface, with a workspace (instance model) and a structured toolbox (metamodel).]

5SGen – Version 2: ODL, Services, Scenarios

[Figure: 5SGen generation pipeline, with stages numbered as on the slide:

(1) 5SL-Societies Model
(2) XPath/JDOM Transform
(3) XMI: Class Model
(4) Xmi2Java
(5) Java Classes Model
(6) 5SL-Scenario Model
(7) XPath/JDOM Transform
(8) StateChart Model
(9) Scenario Synthesis
(10) Deterministic FSM
(11) SMC
(12) Java Finite State Machine Class Controller
(13) JSP User Interface View

Other labels: Component Pool (ODLSearch, ODLBrowse, …, each with Java wrapping and import); DL Designer (binds); Generated DL Services.]
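To make the “Deterministic FSM” stage more concrete, here is a minimal sketch of a scenario expressed as a deterministic finite state machine. It is written in Python for brevity, whereas the slide indicates that 5SGen emits Java controller classes via SMC; the class `ScenarioFSM`, the state names, and the event names are all hypothetical and are not taken from 5SGen's actual output.

```python
class ScenarioFSM:
    """Deterministic FSM: each (state, event) pair has at most one next state."""

    def __init__(self, start, transitions):
        self.state = start
        self.transitions = transitions  # {(state, event): next_state}

    def fire(self, event):
        """Apply one event, or fail loudly if the scenario does not allow it."""
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"event {event!r} not allowed in state {self.state!r}")
        self.state = self.transitions[key]
        return self.state

# Hypothetical searching scenario: query, view results, open an item, go back.
SEARCH_TRANSITIONS = {
    ("idle", "submit_query"): "searching",
    ("searching", "results_ready"): "showing_results",
    ("showing_results", "select_item"): "showing_item",
    ("showing_item", "back"): "showing_results",
    ("showing_results", "submit_query"): "searching",
}

fsm = ScenarioFSM("idle", SEARCH_TRANSITIONS)
trace = [fsm.fire(e) for e in ["submit_query", "results_ready", "select_item"]]
```

Keeping the transition table as plain data is one way a generator could derive a controller from a scenario model, since the table can be produced mechanically from a state chart.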

The XML Log Format

[Figure: element tree of the log format. Element labels: Log; Transaction; SessionId; MachineInfo; Statement; Timestamp; SessionInfo; RegisterInfo; Event; ErrorInfo; Action; StatusInfo; Search; Browse; Update; StoreSysInfo; SearchBy; QueryString; Catalog; Collection; PresentationInfo; Timeout.]
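As a rough illustration of what one log entry could look like, the sketch below builds a single search event with Python's standard `xml.etree.ElementTree`. The element names come from the slide, but the nesting order, the choice of which elements to include, and the sample values are assumptions, since the slide's tree layout did not survive extraction.

```python
import xml.etree.ElementTree as ET

def search_log_entry(session_id, timestamp, collection, query):
    """Build one <Log> tree for a search event (nesting is an assumption)."""
    log = ET.Element("Log")
    transaction = ET.SubElement(log, "Transaction")
    ET.SubElement(transaction, "SessionId").text = session_id
    ET.SubElement(transaction, "Timestamp").text = timestamp
    statement = ET.SubElement(transaction, "Statement")
    event = ET.SubElement(statement, "Event")
    action = ET.SubElement(event, "Action")
    search = ET.SubElement(action, "Search")
    ET.SubElement(search, "Collection").text = collection
    ET.SubElement(search, "QueryString").text = query
    return log

# Hypothetical sample values.
entry = search_log_entry("s-001", "2005-01-10T09:30:00", "CITIDEL", "digital library")
xml_text = ET.tostring(entry, encoding="unicode")
```

A structured entry like this can be queried by element path, which is the practical advantage of an XML log over free-form text lines.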

Quality and the Information Life Cycle

[Figure: the information life cycle annotated with quality dimensions. Phases: Creation, Distribution, Seeking, Utilization. Activities: Authoring, Modifying, Describing, Organizing, Indexing, Storing, Archiving, Networking, Accessing, Filtering, Searching, Browsing, Recommending, Retention, Mining, Discard. States: Active, Semi-Active, Inactive. Quality dimensions: Accuracy, Completeness, Conformance, Similarity, Pertinence, Relevance, Significance, Timeliness, Accessibility, Preservability.]

Rao Shen’s Preliminary Exam: Hypothesis and Research Questions

• The 5S framework provides effective solutions to DL integration.

– Formally define the DL integration problem?
– Guide integration of domain-focused DLs?
  • How to formally model such domain-specific DLs?
  • How to integrate formally defined DL models into a union DL model?
  • How to use the union DL model to help design and implement high-quality integrated DLs?
– Assess the integration?

Related Work

[Concept map, shown in two builds. Node labels: DL interoperability approach; intermediary-based; mapping-based; mediator; wrapper; agent; two architectures (federation, union archiving); hybrid mapper; composite mapper; schema mapping; SemInt; LSD. Relation labels: consists of; use; used in; has an example; interrelated with. The second build adds the nodes GA (relation: trained by) and DL integration formalization (relation: based on).]

Formal Definition of DL Integration

• $DL_i = (R_i, DM_i, Serv_i, Soc_i)$, $1 \le i \le n$

– $R_i$ is a network-accessible repository
– $DM_i$ is a set of metadata catalogs for all collections
– $Serv_i$ is a set of services
– $Soc_i$ is a society

• UnionRep

• UnionCat

• UnionServices

• UnionSociety
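The four-part tuple and its union counterparts can be sketched as a small data structure. This is a deliberately naive illustration: the class `DL`, the function `union_dl`, the `dlN:` identifier prefix, and the sample records are all hypothetical, and a real union catalog would also need the schema mapping that this sketch skips.

```python
from dataclasses import dataclass

@dataclass
class DL:
    repository: dict  # R_i: object id -> digital object
    catalogs: dict    # DM_i: object id -> metadata record
    services: set     # Serv_i
    society: set      # Soc_i

def union_dl(dls):
    """Combine member DLs component by component (no metadata mapping applied)."""
    union = DL({}, {}, set(), set())
    for i, dl in enumerate(dls, start=1):
        # Prefix ids with the source DL so objects from different DLs never collide.
        for oid, obj in dl.repository.items():
            union.repository[f"dl{i}:{oid}"] = obj
        for oid, record in dl.catalogs.items():
            union.catalogs[f"dl{i}:{oid}"] = record
        union.services |= dl.services
        union.society |= dl.society
    return union

# Hypothetical member DLs.
dl1 = DL({"a": "pottery photo"}, {"a": {"title": "Pottery"}},
         {"Searching"}, {"archaeologists"})
dl2 = DL({"a": "bone record"}, {"a": {"title": "Bone"}},
         {"Browsing"}, {"general public"})
union = union_dl([dl1, dl2])
```

Here `union.repository`, `union.catalogs`, `union.services`, and `union.society` play the roles of UnionRep, UnionCat, UnionServices, and UnionSociety respectively.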

Architecture of a Union DL

[Figure: two DLs (DL1, DL2) with repositories (Repository1, Repository2), catalogs (Catalog1, Catalog2), services (Searching Service, Browsing Service), and societies (archaeologists, General Public) combine into a Union DL with a Union Repository, a Union Catalog, a Union Society (Archaeologists, General Public), and a Union Service (Harvesting, Mapping, Searching, Browsing, Clustering, Visualization).]

Example of Union Service: CitiViz

CitiViz: A Visual User Interface to the CITIDEL System

ECDL 2004, Bath, England, September 2004

Nithiwat Kampanya, Rao Shen, Seonho Kim, Chris North, and

Edward A. [email protected] http://fox.cs.vt.edu

A Minimal DL in the 5S Framework

[Figure: layered construction from the 5Ss (Streams, Structures, Spaces, Scenarios, Societies) through Structured Stream, Structural Metadata Specification, Descriptive Metadata Specification, hypertext, and services (indexing, browsing, searching) to Digital Object, Collection, Metadata Catalog, Repository, and Minimal DL.]

A Minimal ArchDL in the 5S Framework

[Figure: the same construction specialized to archaeology, with the labels SpaTemOrg, StraDia, Arch Descriptive Metadata specification, ArchDO, ArchObj, ArchColl, Arch Metadata catalog, ArchDColl, ArchDR, and Minimal ArchDL.]

[Figure: ETANA-DL architecture. Labels: 5SGraph; 5S Archaeology MetaModel; ArchDL Expert; ArchDL Designer; Structure Sub-model; Scenario Sub-model; ETANA-DL Union Services Descriptions (Harvesting, Mapping, Searching, Browsing); VN Metadata Format; HD Metadata Format; ETANA-DL Metadata Format; Mapping Tool; Wrapper4VN; Wrapper4HD; XOAI; VN Catalog; HD Catalog; Union Catalog; Index; Inverted Files; Services DB; Browse DB; Browse Service; Search Service; Other ETANA-DL Services; Web Interface; 5SGen; Component Pool (Browsing, …).]

Computing and Information Technology Interactive Digital Educational Library (CITIDEL)

• Domain: computing / information technology

• Genre: one-stop-shopping for teachers & learners: courseware (CSTC, JERIC), leading DLs (ACM, IEEE-CS, DB&LP, CiteSeer), PlanetMath.org, NCSTRL (technical reports), …

• Submission & Collection: sub/partner collections www.citidel.org

www.CITIDEL.org

• Led by Virginia Tech, with co-PIs:
– Fox (director, DL systems)
– Lee (history)
– Perez (user interface, Spanish support)
– Students: Ryan Richardson, Kate McDevitt, Jon Pryor, Baoping Zhang

• Partners
– College of New Jersey (Knox)
– Hofstra (Impagliazzo)
– Villanova (Cassel)
– Penn State (Giles)

Digital library architecture for local and interoperable CITIDEL services

[Figure: three layers. PORTALS: Educators, Administrators, Learners. SERVICES: Multilingual Searching, Revising, Annotating, Filtering, Browsing, Administering. REPOSITORIES: Annotations, Filtering Profiles, User Profiles, Union Metadata. An OAI Data Harvester and an OAI Data Provider connect to Remote and Peer Digital Libraries (e.g., NSDL-CIS).]

CITIDEL Technology Features

• Component architecture (Open Digital Library)
– Re-use and compose re-deployable digital library components.

• Built using open standards & technologies
– OAI: used to collect DL resources and for DL interoperability
– XSL and XML: interface rendering with multilingual, community-based translation of screens and content (Spanish, …)
– Perl: component integration

• ESSEX: search engine functionality
– Very fast, utilizing in-memory processing
– Includes snapshots for persistence

• Multi-scheming (Aaron Krowne, now at Emory U. Library)
– Integrates multiple classifications / views through maps, closure

• Extensions: clustering, visualization, personalization, …
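The OAI item above refers to harvesting metadata over OAI-PMH. As a rough sketch of what a harvester does, the Python below builds a `ListRecords` request and extracts record identifiers from a response. The verb, the `metadataPrefix`, `set`, and `from` arguments, and the namespace are from the OAI-PMH 2.0 protocol; the base URL, the sample response, and the helper names are hypothetical, and CITIDEL's own harvester (integrated with Perl, per the slide) is not shown here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date  # selective harvesting: only newer records
    return base_url + "?" + urlencode(params)

def record_identifiers(response_xml):
    """Extract the identifier of every record header in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [h.findtext(OAI_NS + "identifier") for h in root.iter(OAI_NS + "header")]

# Hypothetical repository and a trimmed, hand-written response for illustration.
url = list_records_url("http://example.org/oai", from_date="2005-01-01")
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""
ids = record_identifiers(SAMPLE)
```

A full harvester would also follow `resumptionToken` elements to page through large result sets, which this sketch omits.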

Cluster Search Results from CITIDEL

Cluster NDLTD-Computing

CITIDEL + PIPE

• Adds interaction personalization to CITIDEL

• Automatically handles multi-modal conversion to cell phone, PDA, etc.

• Can be adapted to any digital data set; it requires only an XML file of the content with its hierarchy maintained.

Naren Ramakrishnan and Saverio Perugini (U. Dayton)

OCKHAM Library Network (NSDL)

[Figure: the OCKHAM Library Network within NSDL. Labels: NSDL; NSDL Services; OCKHAM Services; Library Services; OCKHAM Library Network; Teachers; Learners; Librarians.]

OCKHAM (Ming Luo)

• Simplicity (à la Occam’s razor)

• Support by Mellon and DLF

• Four main ideas:
1. Components
2. Lightweight protocols
3. Open reference models (e.g., 5S, OAIS)
4. Community perspective and involvement

• Funded by NSF in NSDL, with P2P, with Emory, Notre Dame, Oregon State, …

OCKHAM Proposed Services

• Alerting

• Browsing

• Cataloging

• Conversion

• OAI – Z39.50

• Pathfinding

• Registry

• (plus others such as from adapted ODL)

A Digital Library Case Study

• Domain: graduate education, research

• Genre: ETDs = electronic theses & dissertations

• Submission: http://etd.vt.edu

• Collection: http://www.theses.org

Project: Networked Digital Library of Theses & Dissertations (NDLTD) http://www.ndltd.org (supported by Ming Luo)

OCLC SRU Interface => Dr. A.K. Tyagi

ETD Union Search Mirror Site in China (CALIS)(http://ndltd.calis.edu.cn – popular site!)

LOCKSS Extensions: Bing Liu, Xiaoyu Zhang, Ji-Sun Kim

• Lots of copies keep stuff safe

• Stanford (Vicky Reich)

• Initial focus on lower levels, journals

• Shift to OAI, esp. for ETDs

• Collab with Emory (Martin Halbert)
– NDIIPP: AmericanSouth, MetaArchive
– Help deploy and adapt, apply in other contexts

• Another registry

• Set of publisher manifests (information providers)

• Set of storage systems (archival storage)

open digital library

[Figure: digital objects (programs, documents, images, videos) shown as bit streams, wrapped as Open Archives (OA) and linked by PMH and XPMH connections.]

Hussein Suleman (Cape Town, S. Africa)

ETD DL for the Networked Digital Library of Theses and Dissertations (www.ndltd.org)

[Figure: the same collection of bit-stream objects (programs, documents, images, videos), with four of them marked as ETD-1 through ETD-4.]

Example Open Digital Library

[Figure: ETD collections connect over PMH links to the components ODLUnion, ODLSearch, ODLBrowse, and ODLRecent (with Filter steps), which provide the Search, Browse, Recent, and Union functions behind a user interface for students and researchers.]

Open Digital Library Deployments

• NDLTD (www.ndltd.org)

• Computer Science Teaching Center (www.cstc.org)

• Computing and Information Technology Interactive Digital Educational Library (www.citidel.org)

• Open Archives Distributed (NSF, DFG) – enhancements to PhysNet

• OCKHAM

• Open to others through DL-in-a-box

Interest-based User Grouping Model for Collaborative Filtering in Digital Libraries

7th ICADL 2004

Shanghai, P.R. China

Dec. 15, 2004

Edward A. Fox, Seonho Kim

Virginia Tech, Blacksburg, VA 24061 USA

Some Other Students/Projects

• Wensi Xi: Matrices, reinforcement, clusters (Microsoft)

• Paul Mather: mod/sim of large DLs on clusters; characterization: uses, files (NASA)

• Ming Luo: personalization aided by demographics

• Ryan Richardson: CLIR with concept maps

• Xiaoyan Yu: Stepping Stones and Pathways (NSF, Fernando Das Neves completed & returned to Argentina)

• Baoping Zhang: Physics and classification (NSF, DFG)

• Several: TREC with GP

• New projects:
– Superimposed information w. PSU (NSF NSDL)
– Quality and metasearch and structure w. Emory (IMLS)

• …

Conclusion

• Many DL/IR: areas, projects, students

• Theory

• Architecture

• Modeling and simulation

• Systems development and testing to: validate above, demonstrate innovations

• Users, interfaces, visualization, usability