The Illinois Digital Library Initiative:
Processing and Access Issues for Full-Text Journals
May 27, 1998
Pennsylvania State University
William H. Mischo
University of Illinois at Urbana-Champaign
Grainger Engineering Library Information Center
Overview
• Testbed Goals & Mission.
• Testbed Issues.
• Testbed Technologies.
• SGML Processing Methodology.
• Accomplishments.
• Transaction Log Analysis.
• Federation Tests & Distributed Repository Model.
• Future Foci.
• What We Have Learned.
• Questions.
“The Business of a University is Information…The Production and
Dissemination of Information is the Work of the University.”
• Tom Everhart, President, California Institute of Technology
Digital Library Initiative Program
• Funded by National Science Foundation (NSF), DARPA, and NASA.
• Awarded grants to 6 universities (and partners), September 1994--August 1998.
• The 6: Illinois, Michigan, Stanford, Berkeley, Carnegie Mellon, Santa Barbara.
• Each project: $4 million over 4 years.
• Illinois: Testbed, Research, Evaluation, Web Software.
Scholarship, Publishing, Libraries
• Changing Paradigm: Authors, Publishers, Libraries, A & I Services.
• Scholarly Publishing Issues (We Pay Twice).
• Publisher Costs (85% for First Copy).
• Idea of Universities as Publishers.
• Users’ Information Seeking Behavior (personal collection, colleagues, e-mail, Web, Library).
• Archiving Issues (Depository idea GB, Canada)
• Role of the Library (Function as well as Place).
Scholarship
• “The normal mode of scientific growth is exponential…(we are) entering a period of crisis marked by rapidly increasing concern over problems of manpower, literature, and expenditure that demand solution by reorganization.”
– Derek de Solla Price, 1986.
• Year and Number of Journals:
– 1665: 1
– 1932: 6,000
– 1981: 96,000
– 1996: 165,000
• Avg. Price of U.S. Periodical rose 155%, 1986-96.
Testbed Goals & Objectives
• Construct Large-Scale, Multipublisher, SGML-Based Full-Text Testbed.
• Investigate Processing, Indexing, Normalization, Retrieval and Rendering.
• Study End-User Searching Behavior and Needs.
• Look at One-Stop-Shopping Retrieval Models (Integration of Services).
• Identify Models for Effective Retrieval in Electronic Full-Text Publishing Environment.
Testbed: 54 Journals, 39K Articles
All items in SGML & 2/3 in PDF
• American Institute of Physics--APL, JAP, RSI
– 12,000 articles, 1995--, weekly updates.
• American Physical Society--PRL
– 8,800 articles, 1995--, weekly updates.
• ASCE Journals (25 titles)
– 5,000 articles, 1995--.
• IEE Proceedings and Electronics Letters
– 7,400 articles, 1993--.
• IEEE Computer Society (14 titles)
– 5,000 articles, 1996--.
Issues
• Toward the Holy Grail of the Smart Document.
• Top Menu Integration and Cross-Resource Links.
• Searching over Full-Text of Journals vs. Abstract & Index Service Database.
• Full-Text Display (Mathematics Rendering: SGML, HTML, PDF, XML, MathML, TeX).
• Web-Based Problems & Connectivity.
• Breadth and Depth of Collections.
• User Response.
Testbed Technologies
• Open Text (HPUX) Search Engine / LiveLink Web.
• Item Metadata for Normalization and Short-Entry Display.
• TCP/IP and HTTP for Full-Text, DCOM DLLs for A&I Links, Java Applets (Wordwheels).
• SGML rendering via Panorama.
• Custom Processing Programs on NT and Unix Platforms (Visual Basic, C++, Perl).
• Microsoft IIS (Web Retrieval, ASP for Links and Top Menu, Authentication w/ Bluestem).
Accomplishments (Overview)
• Distributed Repository Model (within Testbed & with AIP).
• Process & Retrieve from Multiple Publishers & Heterogeneous DTDs.
• Use of Aliasing (Normalization) for Cross-Repository Access from Single Client Search Argument.
• Item Metadata Definition.
• Dynamic Linking of Resources and Proxy A&I Service Access from / to Testbed.
• Focused User Studies.
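
The aliasing (normalization) idea above can be illustrated with a minimal sketch: one normalized field name from the client is mapped to whatever element name each publisher's DTD uses. The element names and the mapping table below are hypothetical, since the slides do not give the actual DTD tags, and the sketch makes no claim about how the Testbed's search engine was actually configured.

```python
# Hypothetical alias table: normalized field -> element name per publisher DTD.
# None of these tag names come from the slides; they are placeholders.
ALIASES = {
    "article_title": {"AIP": "ATL", "APS": "TITLE", "ASCE": "ARTTITLE"},
    "abstract": {"AIP": "ABS", "APS": "ABSTRACT", "ASCE": "ABST"},
}


def expand_query(field, term, repositories):
    """Turn one normalized (field, term) search into per-repository searches.

    Returns a dict mapping repository name to a (tag, term) pair.
    Repositories with no alias for the field are skipped.
    """
    if field not in ALIASES:
        raise KeyError(f"unknown normalized field: {field}")
    queries = {}
    for repo in repositories:
        tag = ALIASES[field].get(repo)
        if tag is not None:
            queries[repo] = (tag, term)
    return queries


if __name__ == "__main__":
    # A single client search argument fans out to each repository's own tag.
    print(expand_query("abstract", "laser", ["AIP", "APS", "IEE"]))
```

The point of the design is that the client only ever sees the normalized field names; adding a publisher means adding a column to the alias table rather than changing the client.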
UIUC DLI Testbed Architectures Under Investigation
[Architecture diagram. Components: Repositories (SGML, PDF); Metadata; Indexes; Gateways; Clients. Publisher repositories: IEE, IEEE CS, APS, ASCE, AIP. Sites: Urbana, New York. Access via HTTP, Java, ASP, and LiveLink, with Authentication and Authorization. Testbed links to A & I Services and other full text.]
DeLIver Features
• Retrieval over Subset of Repositories.
• Forward (Citation) & Backward (Bibliography) Links to Testbed.
• Links to INSPEC, Compendex, Current Contents from Items & Bibliography.
• Ovid INSPEC/Compendex Proxy.
• Integration with Other Library Resources.
• Web-Kerberos Based Authentication.
• Capability of Digital Signing.
• User Transaction Logs.
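
Forward (citation) links can be derived from backward (bibliography) links by inverting them. The sketch below assumes each Testbed item has an identifier and a list of identifiers it cites; the identifiers and the function name are hypothetical and are not taken from the DeLIver implementation.

```python
from collections import defaultdict


def forward_links(bibliographies):
    """Invert backward (bibliography) links into forward (cited-by) links.

    bibliographies maps an item id to the list of item ids it cites.
    Only cited items that are themselves in the collection get an entry,
    since links can only point at items the testbed actually holds.
    """
    cited_by = defaultdict(list)
    for citing, references in bibliographies.items():
        for cited in references:
            if cited in bibliographies:
                cited_by[cited].append(citing)
    return {item: sorted(citers) for item, citers in cited_by.items()}


if __name__ == "__main__":
    # "x9" is cited but not held locally, so it gets no forward links.
    bibs = {"a1": ["a2", "x9"], "a2": [], "a3": ["a2", "a1"]}
    print(forward_links(bibs))
```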
Toplevel Menu Transactions (Total 19738)
Compendex        2927    Online Catalog     6552    Wilson         496
Curr Serials     1656    Call Nos            298    New Books      324
Grainger Hm Pg    745    Faculty Interest    200    Comments        49
Ref Coll         1677    First Search        698    Reserves       380
DeLIver           519    ASTI                685    Sci Citation   498
CCON              446    PsychLit             92    INSPEC         793
Help Starting     186    FAQ                  90    News            54
Transaction Logs (1)
4035 total end-user sessions (September through May).
3023 end-user sessions where searches were performed.
Top Bar              # Sessions   Total #
About DeLIver        427          536
Browse (all)         1585         2277
Browse Only          1012
Help                 175          190
Quicktips            189          245
Download Software    1001         1086
Other Resources      230          289
Transaction Logs (2)
4035 total end-user sessions (September through May).
3023 end-user sessions where searches were performed.
Search Fields              # Sessions   Total #
Keyword                    2083         6090
Abstract                   194          747
Article Title              368          976
Article Author             377          926
All Author                 185          468
Citations                  39           74
Body of Article            76           336
Figure Caption             26           60
Table Caption              9            12
Journal Title              218          530
Title, Headings, Caption   118          358
Transaction Logs (3)
4035 total end-user sessions (September through May).
3023 end-user sessions where searches were performed.
Searching Characteristics     # Sessions   Total #
Average Length of Search      727 seconds
Display Full-Text             2079         4267
PDFs                          842          10104
SGMs                          1516         4660
Extended Citation             578          2212
Boolean Operators             856          5773
ANDs                          682
ORs                           204          668
NOTs                          30           79
KWIC Display                  389          780
Links to INSPEC/Compendex     261          404
Multiword Search Arguments    1848         6134
Transaction Logs (4)
4035 total end-user sessions (September through May).
3023 end-user sessions where searches were performed.
Publisher Choices    # Sessions   Total #
All Publishers       2535         9185
AIP                  65           238
APS                  33           84
ASCE                 96           247
IEE                  38           98
Transaction Logs (5)
4035 total end-user sessions (September through May).
3023 end-user sessions where searches were performed.
Points:
• Not much use of Help or Quicktips.
• A lot of Browsing, but < 50% of search sessions.
• Not jumping to A&I Services from DeLIver.
• Mostly Keyword Searching; also a fair amount of Author, Article Title, Journal Title.
• Much more Display Full-Text than Extended Citation (why?).
• 25% of sessions use Boolean operators.
• Multiword Search Arguments (complex terms, not single words) being entered.
• Linking to INSPEC/Compendex in 20% of sessions.
• Predominantly All Publishers being searched.
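
Several of these points can be put in proportion by dividing the session counts from the transaction-log tables by the session totals. The sketch below is only a worked calculation on figures quoted in the slides; the choice of denominator for each row (all 4035 sessions for top-bar features, the 3023 search sessions for search features) is an assumption, since the slides do not say which base their percentages use.

```python
TOTAL_SESSIONS = 4035   # all end-user sessions, from the slides
SEARCH_SESSIONS = 3023  # sessions in which a search was performed


def share(count, base):
    """Return count as a percentage of base, rounded to one decimal place."""
    return round(100.0 * count / base, 1)


# (label, session count from the slides, assumed denominator)
ROWS = [
    ("Browse Only", 1012, TOTAL_SESSIONS),
    ("Help", 175, TOTAL_SESSIONS),
    ("Quicktips", 189, TOTAL_SESSIONS),
    ("Keyword search", 2083, SEARCH_SESSIONS),
    ("All Publishers", 2535, SEARCH_SESSIONS),
]

if __name__ == "__main__":
    for label, count, base in ROWS:
        print(f"{label:<16} {count:>5} / {base} = {share(count, base)}%")
```

Note that Browse Only (1012) plus the search sessions (3023) accounts for all 4035 sessions.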
Testbed User Authentication
• Approach:
– Authenticate Once per Session / Authorize per Use
• Current Mechanism:
– On 1st Request, User Referred to Bluestem Script
– Upon Bluestem Authentication:
  • Authorization Record Written to SQL Database
  • Cookie Set Which Points to that Record
• Need to Fix Redirection Problem with MS IE
• Need to Extend Outside Cookie-Setting Domain
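
The "authenticate once per session, authorize per use" pattern can be sketched roughly as follows. This is an illustration only: it uses an in-memory SQLite database in place of the Testbed's SQL database, treats the Bluestem login as already completed, and all table, column, and function names are hypothetical.

```python
import secrets
import sqlite3


def open_store():
    """Create an in-memory stand-in for the authorization database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE auth_record ("
        "token TEXT PRIMARY KEY, user_id TEXT NOT NULL, repositories TEXT NOT NULL)"
    )
    return db


def record_authentication(db, user_id, repositories):
    """Called once per session, after the external login has succeeded.

    Writes an authorization record and returns the opaque token that
    would be placed in the cookie pointing to that record.
    """
    token = secrets.token_hex(16)
    db.execute(
        "INSERT INTO auth_record VALUES (?, ?, ?)",
        (token, user_id, ",".join(sorted(repositories))),
    )
    db.commit()
    return token


def authorize(db, cookie_token, repository):
    """Called on every use: look up the record the cookie points to."""
    if cookie_token is None:
        return False  # no cookie yet: the user would be referred to the login script
    row = db.execute(
        "SELECT repositories FROM auth_record WHERE token = ?", (cookie_token,)
    ).fetchone()
    return row is not None and repository in row[0].split(",")


if __name__ == "__main__":
    db = open_store()
    cookie = record_authentication(db, "user1", {"AIP", "APS"})
    print(authorize(db, cookie, "AIP"), authorize(db, cookie, "ASCE"))
```

Because the cookie carries only a pointer to the record, access rights can be changed or revoked in the database without touching the client.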
Future Work
• Implementation of Distributed Repository Model.
• Expand Breadth of Testbed (Loading Locally and Linking to other Repositories).
• Use of Digital Object Identifiers and other Standards.
• Rendering via HTML 4.0 & CSS, XML & XSL.
• Adding Dynamic Retrieval Mechanisms (Wordwheels, Co-Occurrence Matrices).
• Expand Simultaneous Search Mechanisms.
• Expanded User Studies.
SGML vs. HTML vs. XML
• SGML:
– Supports Powerful Indexing, Search & Retrieval
– But Client, Delivery, & Rendering Issues Remain
• HTML:
– Ubiquitous; Rendering Has Become More Robust
– But Remains Presentation Oriented, Less Semantic
• XML:
– Subset Retains SGML Features of Primary Interest
– But XML Is New, Untested, Under-Supported
Converting DLI Testbed to XML
• XML Differences from SGML:
– No SHORTREF (Tag Minimization)
– Tags Are Case Sensitive
– Restrictions on Entities, Attributes, Link Mechanisms
– Empty Tags Handled Differently
• MathML vs. ISO 12083 Math
– MathML a Major Departure -- Adds Semantics
– Focus on Java / ActiveX for Initial Deployment; Long-Term Success May Hinge on XSL / DSSSL
  • ‘Content-Markup’ requires XSL, Dynamic HTML functionality
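
Two of the XML differences listed above, case-sensitive tags and the handling of empty tags, can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the element names (doc, title, graphic) are made up for illustration and are not taken from the Testbed DTDs.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Tag names are case sensitive: <Title> is not closed by </title>.
MIXED_CASE = "<doc><Title>Lasers</title></doc>"
SAME_CASE = "<doc><title>Lasers</title></doc>"

# An empty element must be closed explicitly, e.g. with the <graphic/> form;
# a bare <graphic> start tag with no end tag is not well-formed XML.
BARE_EMPTY = "<doc><graphic></doc>"
CLOSED_EMPTY = "<doc><graphic/></doc>"

if __name__ == "__main__":
    for sample in (MIXED_CASE, SAME_CASE, BARE_EMPTY, CLOSED_EMPTY):
        print(is_well_formed(sample), sample)
```

SGML documents that rely on tag minimization or mixed-case tags would fail checks like these until they are normalized during conversion.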
CSS, XSL, DSSSL
• CSS1 & CSS2 Have Added:
– Overlapping Glyphs, Absolute & Relative Positioning
– Downloadable Fonts (Platform, Browser Variable)
– Styling by Attributes, 2 Levels of Hierarchy
• XSL, DSSSL, DSSSL-O:
– XSL Uses XML Notation, Is Extensible (ECMAScript)
– Allows More Extensive Manipulation In Formatting
  • Supports Re-arrangement, Navigator Frames, etc.
– Not Yet Implemented in Production Browsers
What We Have Learned (1)
• Power of SGML for Indexing & Retrieval.
• Problems with rendering mathematics--SGML, TeX, HTML, XML, MathML.
• Depth and breadth of collection (TULIP / Red Sage Syndrome; note use of Ovid client).
• Local Processing Implications.
• Metadata needs and robustness of Distributed Model.
What We Have Learned (2)
• Efficacy of Full-Text (stand-alone, integrated with A & I, part of TOC Service).
• The Idea of a Digital Library in the Digital Chaos--the role of the Gateway and Linking of Resources.
• Changing roles of Authors, Publishers, A & I Services, Libraries.
• These Technologies Will Transfer to the Web (CSS I & II, HTML 4.0, Dynamic HTML, XML).