William H. Mischo University of Illinois at Urbana--Champaign

The Illinois Digital Library Initiative: Processing and Access Issues for Full-Text Journals May 27, 1998 Pennsylvania State University William H. Mischo University of Illinois at Urbana-- Champaign Grainger Engineering Library Information Center



Page 1: William H. Mischo     University of Illinois at Urbana--Champaign

The Illinois Digital Library Initiative:

Processing and Access Issues for Full-Text Journals

May 27, 1998

Pennsylvania State University

William H. Mischo University of Illinois at Urbana--Champaign

Grainger Engineering Library Information Center

Page 2: William H. Mischo     University of Illinois at Urbana--Champaign

Overview

• Testbed Goals & Mission.

• Testbed Issues.

• Testbed Technologies.

• SGML Processing Methodology.

• Accomplishments.

• Transaction Log Analysis.

• Federation Tests & Distributed Repository Model.

• Future Foci.

• What We Have Learned.

• Questions.

Page 3: William H. Mischo     University of Illinois at Urbana--Champaign

“The Business of a University is Information…The Production and

Dissemination of Information is the Work of the University.”

• Tom Everhart, President, California Institute of Technology

Page 4: William H. Mischo     University of Illinois at Urbana--Champaign

Digital Library Initiative Program

• Funded by National Science Foundation (NSF), DARPA, and NASA.

• Awarded grants to 6 universities (and partners), September 1994--August 1998.

• The 6: Illinois, Michigan, Stanford, Berkeley, Carnegie Mellon, Santa Barbara.

• Each project: $4 million over 4 years.

• Illinois: Testbed, Research, Evaluation, Web Software.

Page 5: William H. Mischo     University of Illinois at Urbana--Champaign

Scholarship, Publishing, Libraries

• Changing Paradigm: Authors, Publishers, Libraries, A & I Services.

• Scholarly Publishing Issues (We Pay Twice).

• Publisher Costs (85% for First Copy).

• Idea of Universities as Publishers.

• Users’ Information Seeking Behavior (personal collection, colleagues, e-mail, Web, Library).

• Archiving Issues (Depository idea GB, Canada)

• Role of the Library (Function as well as Place).

Page 6: William H. Mischo     University of Illinois at Urbana--Champaign

Scholarship

• “The normal mode of scientific growth is exponential…(we are) entering a period of crisis marked by rapidly increasing concern over problems of manpower, literature, and expenditure that demand solution by reorganization.”

– Derek de Solla Price, 1986.

• Year and Number of Journals:

– 1665: 1

– 1932: 6,000

– 1981: 96,000

– 1996: 165,000

• Avg. Price of U.S. Periodical rose 155%, 1986-96.

Page 7: William H. Mischo     University of Illinois at Urbana--Champaign

Testbed Goals & Objectives

• Construct Large-Scale, Multipublisher, SGML-Based Full-Text Testbed.

• Investigate Processing, Indexing, Normalization, Retrieval and Rendering.

• Study End-User Searching Behavior and Needs.

• Look at One-Stop-Shopping Retrieval Models (Integration of Services).

• Identify Models for Effective Retrieval in Electronic Full-Text Publishing Environment.

Page 8: William H. Mischo     University of Illinois at Urbana--Champaign

Testbed: 54 Journals, 39K Articles

All items in SGML & 2/3 in PDF

• American Institute of Physics--APL, JAP, RSI

– 12,000 articles, 1995--, weekly updates.

• American Physical Society--PRL

– 8,800 articles, 1995--, weekly updates.

• ASCE Journals (25 titles)

– 5,000 articles, 1995--.

• IEE Proceedings and Electronics Letters

– 7,400 articles, 1993--.

• IEEE Computer Society (14 titles)

– 5,000 articles, 1996--.

Page 9: William H. Mischo     University of Illinois at Urbana--Champaign

Issues

• Toward the Holy Grail of Smart Document.

• Top Menu Integration and Cross-Resource Links.

• Searching over Full-Text of Journals vs. Abstract & Index Service Database.

• Full-Text Display (Mathematics Rendering: SGML, HTML, PDF, XML, Math ML, TeX).

• Web-Based Problems & Connectivity.

• Breadth and Depth of Collections.

• User Response.

Page 10: William H. Mischo     University of Illinois at Urbana--Champaign

Testbed Technologies

• Open Text (HPUX) Search Engine / LiveLink Web.

• Item Metadata for Normalization and Short-Entry Display.

• TCP/IP and HTTP for Full-Text, DCOM DLLs for A&I Links, Java Applets (Wordwheels).

• SGML rendering via Panorama.

• Custom Processing Programs on NT and Unix Platforms (Visual Basic, C++, Perl).

• Microsoft IIS (Web Retrieval, ASP for Links and Top Menu, Authentication w/ Bluestem).

Page 11: William H. Mischo     University of Illinois at Urbana--Champaign
Page 12: William H. Mischo     University of Illinois at Urbana--Champaign

Accomplishments (Overview)

• Distributed Repository Model (within Testbed & with AIP).

• Process & Retrieve from Multiple Publishers & Heterogeneous DTDs.

• Use of Aliasing (Normalization) for Cross-Repository Access from Single Client Search Argument.

• Item Metadata Definition.

• Dynamic Linking of Resources and Proxy A&I Service Access from / to Testbed.

• Focused User Studies.
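
The aliasing (normalization) bullet above can be illustrated with a minimal sketch. The element names in the table below are hypothetical placeholders, not the actual tags from the publishers' DTDs; the only point shown is that one normalized field name from a single client search argument can be expanded into the element name each repository uses.

```python
# Hypothetical alias table: normalized field -> element name per publisher DTD.
# The tag names are illustrative assumptions, not the real Testbed DTD elements.
ALIASES = {
    "title": {"AIP": "TI", "APS": "TITLE", "ASCE": "ATL", "IEE": "ARTTITLE"},
    "author": {"AIP": "AU", "APS": "AUTHOR", "ASCE": "AUT", "IEE": "ARTAUTH"},
}


def expand_query(field, term, publishers=None):
    """Translate one client search argument into per-repository (tag, term) pairs."""
    mapping = ALIASES[field]
    targets = publishers if publishers is not None else sorted(mapping)
    return {pub: (mapping[pub], term) for pub in targets}


# One search argument fans out to every repository under its own tag name.
queries = expand_query("title", "quantum well")
for pub, (tag, term) in queries.items():
    print(f"{pub}: search <{tag}> for {term!r}")
```

Passing a `publishers` list restricts the expansion to a subset of repositories, which mirrors retrieval over a subset of the Testbed.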

Page 13: William H. Mischo     University of Illinois at Urbana--Champaign

UIUC DLI Testbed Architectures Under Investigation

[Architecture diagram. Labels: Repositories (SGML, PDF); Metadata; Indexes; Gateways; Clients; Testbed Links to A & I Services and Other Full Text; publishers IEE, IEEE CS, APS, ASCE, AIP; sites Urbana and New York; HTTP, Java, ASP; LiveLink; Authentication, Authorization.]

Page 14: William H. Mischo     University of Illinois at Urbana--Champaign

DeLIver Features

• Retrieval over Subset of Repositories.

• Forward (Citation) & Backward (Bibliography) Links to Testbed.

• Links to INSPEC, Compendex, Current Contents from Items & Bibliography.

• Ovid INSPEC/Compendex Proxy.

• Integration with Other Library Resources.

• Web-Kerberos Based Authentication.

• Capability of Digital Signing.

• User Transaction Logs.

Page 15: William H. Mischo     University of Illinois at Urbana--Champaign

Toplevel Menu Transactions (Total 19738)

Resource            Transactions
Compendex           2927
Online Catalog      6552
Wilson              496
Curr Serials        1656
Call Nos            298
New Books           324
Grainger Hm Pg      745
Faculty Interest    200
Comments            49
Ref Coll            1677
First Search        698
Reserves            380
DeLIver             519
ASTI                685
Sci Citation        498
CCON                446
PsychLit            92
INSPEC              793
Help Starting       186
FAQ                 90
News                54

Page 16: William H. Mischo     University of Illinois at Urbana--Champaign

Transaction Logs (1)

4035 total end-user sessions (September through May).

3023 end-user sessions where searches were performed.

Top Bar              # Sessions    Total #
About DeLIver        427           536
Browse (all)         1585          2277
Browse Only          1012
Help                 175           190
Quicktips            189           245
Download Software    1001          1086
Other Resources      230           289

Page 17: William H. Mischo     University of Illinois at Urbana--Champaign

Transaction Logs (2)

4035 total end-user sessions (September through May).

3023 end-user sessions where searches were performed.

Search Fields               # Sessions    Total #
Keyword                     2083          6090
Abstract                    194           747
Article Title               368           976
Article Author              377           926
All Author                  185           468
Citations                   39            74
Body of Article             76            336
Figure Caption              26            60
Table Caption               9             12
Journal Title               218           530
Title, Headings, Caption    118           358

Page 18: William H. Mischo     University of Illinois at Urbana--Champaign

Transaction Logs (3)

4035 total end-user sessions (September through May).

3023 end-user sessions where searches were performed.

Average Length of Search: 727 seconds

Searching Characteristics      # Sessions    Total #
Display Full-Text              2079          4267
PDFs                           842           10104
SGMs                           1516          4660
Extended Citation              578           2212
Boolean Operators              856           5773
ANDs                           682
ORs                            204           668
NOTs                           30            79
KWIC Display                   389           780
Links to Inspec/Compendex      261           404
Multiword Search Arguments     1848          6134

Page 19: William H. Mischo     University of Illinois at Urbana--Champaign

Transaction Logs (4)

4035 total end-user sessions (September through May).

3023 end-user sessions where searches were performed

Publisher Choices # Sessions Total #

All Publishers 2535 9185

AIP 65 238

APS 33 84

ASCE 96 247

IEE 38 98

Page 20: William H. Mischo     University of Illinois at Urbana--Champaign

Transaction Logs (5)

4035 total end-user sessions (September through May).

3023 end-user sessions where searches were performed.

Points:

Not much use of Help or Quicktips;

a lot of Browsing but < 50% of search sessions;

Not jumping to A&I Services from DeLIver;

mostly Keyword Searching, also fair amount of Author, Article Title, Journal Title;

much more Display Full-Text than Extended Citation (why?);

25% of sessions use Boolean operators;

Multiword Search Arguments (complex terms, not single words) being entered;

Linking to INSPEC/Compendex in 20% of sessions;

predominantly All Publishers being searched.
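
The shares quoted in these points can be recomputed from the session counts on the preceding Transaction Logs slides. The sketch below is only a worked calculation. The slides do not state which denominator was used, so both the search-session and the total-session base are shown, and the results may differ from the rounded figures in the bullets.

```python
# Session counts copied from the Transaction Logs slides.
SEARCH_SESSIONS = 3023  # sessions in which searches were performed
TOTAL_SESSIONS = 4035   # all end-user sessions (3023 + 1012 browse-only)

feature_sessions = {
    "Help": 175,
    "Quicktips": 189,
    "Browse (all)": 1585,
    "Keyword": 2083,
    "Display Full-Text": 2079,
    "Extended Citation": 578,
    "Boolean Operators": 856,
    "Links to Inspec/Compendex": 261,
    "All Publishers": 2535,
}


def share(count, denominator):
    """Percentage of sessions in which a feature was used."""
    return 100.0 * count / denominator


# Assumption: each feature is compared against both possible denominators.
shares = {
    name: (share(n, SEARCH_SESSIONS), share(n, TOTAL_SESSIONS))
    for name, n in feature_sessions.items()
}

for name, (of_search, of_total) in shares.items():
    print(f"{name:28s} {of_search:5.1f}% of search sessions, "
          f"{of_total:5.1f}% of all sessions")
```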

Page 21: William H. Mischo     University of Illinois at Urbana--Champaign

Testbed User Authentication

• Approach:

– Authenticate Once per Session / Authorize per Use

• Current Mechanism:

– On 1st Request, User Referred to Bluestem Script

– Upon Bluestem Authentication:

• Authorization Record Written to SQL Database

• Cookie Set Which Points to that Record

• Need to Fix Redirection Problem with MS IE

• Need to Extend Outside Cookie-Setting Domain
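
The authenticate-once, authorize-per-use mechanism on this slide can be sketched roughly as follows. This is not the Testbed's implementation: the external Bluestem step is assumed to have already succeeded, an in-memory SQLite table stands in for the SQL database, and the table name, column names, and session lifetime are all hypothetical.

```python
import secrets
import sqlite3
import time

# In-memory stand-in for the SQL database that holds authorization records.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE auth_record (token TEXT PRIMARY KEY, user_id TEXT, expires REAL)"
)

SESSION_SECONDS = 8 * 3600  # assumed session lifetime, not from the slides


def record_authentication(user_id, now=None):
    """Run once per session, after the external authentication step succeeds.

    Writes the authorization record and returns the cookie value pointing to it.
    """
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    db.execute(
        "INSERT INTO auth_record VALUES (?, ?, ?)",
        (token, user_id, now + SESSION_SECONDS),
    )
    db.commit()
    return token


def authorize(cookie_token, now=None):
    """Run on every request: return the user id if the cookie points to a
    live record, otherwise None (the caller would redirect to authentication)."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT user_id, expires FROM auth_record WHERE token = ?",
        (cookie_token,),
    ).fetchone()
    if row is None or row[1] < now:
        return None
    return row[0]


cookie = record_authentication("jdoe")
print(authorize(cookie))          # known cookie: resolves to the user id
print(authorize("forged-value"))  # unknown cookie: no record found
```

Because the cookie carries only an opaque pointer, the authorization decision stays on the server and can be re-evaluated on each use.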

Page 22: William H. Mischo     University of Illinois at Urbana--Champaign

Future Work

• Implementation of Distributed Repository Model.

• Expand Breadth of Testbed (Loading Locally and Linking to other Repositories).

• Use of Digital Object Identifiers and other Standards.

• Rendering via HTML 4.0 & CSS, XML & XSL.

• Adding Dynamic Retrieval Mechanisms (Wordwheels, Co-Occurrence Matrices).

• Expand Simultaneous Search Mechanisms.

• Expanded User Studies.

Page 23: William H. Mischo     University of Illinois at Urbana--Champaign

SGML vs. HTML vs. XML

• SGML:

– Supports Powerful Indexing, Search & Retrieval

– But Client, Delivery, & Rendering Issues Remain

• HTML:

– Ubiquitous; Rendering Has Become More Robust

– But Remains Presentation Oriented, Less Semantic

• XML:

– Subset Retains SGML Features of Primary Interest

– But XML Is New, Untested, Under-Supported

Page 24: William H. Mischo     University of Illinois at Urbana--Champaign

Converting DLI Testbed to XML

• XML Differences from SGML:

– No SHORTREF (Tag Minimization)

– Tags Are Case Sensitive

– Restrictions on Entities, Attributes, Link Mechanisms

– Empty Tags Handled Differently

• Math ML vs. ISO 12083 Math

– Math ML a Major Departure -- Adds Semantics

– Focus on Java / ActiveX for Initial Deployment; Long-Term Success May Hinge on XSL / DSSSL

• ‘Content-Markup’ requires XSL, Dynamic HTML functionality
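
Some of the XML differences listed on this slide can be checked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` to test whether small fragments are well-formed XML. The element names (`art`, `ti`, `p`, `fig`) are hypothetical and not taken from any Testbed DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments; SGML with a suitable DTD could accept all of them,
# while XML accepts only those with explicit, case-matched tags.
samples = {
    "explicit end tags": "<art><ti>Title</ti><p>Text</p></art>",
    "omitted end tag (tag minimization)": "<art><ti>Title<p>Text</p></art>",
    "mismatched case": "<art><TI>Title</ti></art>",
    "empty element, XML syntax": "<art><fig id='f1'/></art>",
    "empty element, SGML syntax": "<art><fig id='f1'></art>",
}

for label, markup in samples.items():
    print(f"{label}: {is_well_formed(markup)}")
```

A conversion pass would therefore need to expand minimized tags, normalize tag case, and rewrite empty elements before the documents could be served as XML.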

Page 25: William H. Mischo     University of Illinois at Urbana--Champaign

CSS, XSL, DSSSL

• CSS1 & CSS2 Have Added:

– Overlapping Glyphs, Absolute & Relative Positioning

– Downloadable Fonts (Platform, Browser Variable)

– Styling by Attributes, 2 Levels of Hierarchy

• XSL, DSSSL, DSSSL-O:

– XSL Uses XML Notation, Is Extensible (ECMAScript)

– Allows More Extensive Manipulation In Formatting

• Supports Re-arrangement, Navigator Frames, etc.

– Not Yet Implemented in Production Browsers

Page 26: William H. Mischo     University of Illinois at Urbana--Champaign

What We Have Learned (1)

• Power of SGML for Indexing & Retrieval.

• Problems with rendering mathematics--SGML, TeX, HTML, XML, Math ML.

• Depth and breadth of collection (TULIP/ Red Sage Syndrome; note use of Ovid client).

• Local Processing Implications.

• Metadata needs and robustness of Distributed Model.

Page 27: William H. Mischo     University of Illinois at Urbana--Champaign

What We Have Learned (2)

• Efficacy of Full-Text (stand-alone, integrated with A & I, part of TOC Service).

• The Idea of a Digital Library in the Digital Chaos--the role of the Gateway and Linking of Resources.

• Changing roles of Authors, Publishers, A & I Services, Libraries.

• These Technologies Will Transfer to the Web (CSS I & II, HTML 4.0, Dynamic HTML, XML).