A/WWW Enterprises

Introduction to CNIDR’s Isearch

Archie Warnock
warnock@clark.net

Who is MCNC/CNIDR?

MCNC = Microelectronics Center of North Carolina

CNIDR = Clearinghouse for Networked Information Discovery and Retrieval
  Originally funded by NSF to coordinate and produce network information tools
  Now developing public domain and commercial search/retrieval tools

What is Isearch?

Isearch is the successor to freeWAIS

Isearch is a sophisticated full-text search and retrieval system

Isearch is a component of Isite, an implementation of the NISO standard protocol Z39.50 for information search and retrieval

ftp://ftp.cnidr.org/pub/NIDR.tools/Isearch

http://vinca.cnidr.org/software/Isearch/Isearch.html

Terminology - I

Client/server - an architecture to allow communications between programs, possibly on different computers

Protocol - the communication “language” used by client and server programs

HTTP - the protocol used by WWW clients and servers

CGI (Common Gateway Interface) - mechanism for running server-side programs, e.g. to process WWW forms

Terminology - II

Query - user-supplied search criteria

Full-text search - word-based search of all the text in a document

Fielded search - word-based search of text within only certain fields in a document

Z39.50 - a standard protocol for network-based document search and retrieval
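
The difference between full-text and fielded search can be illustrated with a minimal Python sketch. The record layout, the field names and the `search` helper are assumptions made for illustration only; they are not part of Isearch.

```python
import re

# Hypothetical documents: each maps a field name to its text.
DOCS = [
    {"title": "Matrix norms", "author": "Smith", "abstract": "Bounds on eigenvalues."},
    {"title": "Eigenvalue bounds", "author": "Jones", "abstract": "A matrix inequality."},
]

def words(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def search(docs, term, field=None):
    """Return the positions of the documents that contain term.

    With field=None every field is searched (full-text search);
    otherwise only the named field is searched (fielded search).
    """
    term = term.lower()
    hits = []
    for i, doc in enumerate(docs):
        texts = doc.values() if field is None else [doc.get(field, "")]
        if any(term in words(t) for t in texts):
            hits.append(i)
    return hits

full_text_hits = search(DOCS, "matrix")        # searches all fields
title_hits = search(DOCS, "matrix", "title")   # searches titles only
```

Here the full-text search finds both documents, while the fielded search on the title finds only the first.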

System Components - I

Iindex, the Text Indexer - builds a searchable version of the document collection
  Implements fast word-based searching
  Document parser - recognizes start/end of individual documents
  Field parser - recognizes start/end of fields within individual documents
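
The roles of the document parser, field parser and index can be sketched in a few lines of Python. This is not how Iindex is implemented: the input layout (blank-line separated documents, `name: value` lines) and all function names are assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict

RAW = """title: Matrix norms
author: Smith

title: Eigenvalue bounds
author: Jones
"""

def parse_documents(raw):
    """Document parser: assume a blank line separates documents."""
    return [block for block in raw.strip().split("\n\n") if block.strip()]

def parse_fields(document):
    """Field parser: assume each line has the form 'name: value'."""
    fields = {}
    for line in document.splitlines():
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields

def build_index(raw):
    """Map each (field, word) pair to the set of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, document in enumerate(parse_documents(raw)):
        for field, text in parse_fields(document).items():
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[(field, word)].add(doc_id)
    return index

index = build_index(RAW)
```

Once the index exists, a word lookup is a dictionary access rather than a scan of every document, which is the general idea behind fast word-based searching.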

System Components - II

Isearch, the Search engine - searches a document collection based on user-supplied query
  Command line search
    Primarily used for testing
  WWW gateway (using CGI)
    End-user interface using forms
  Z39.50 gateway
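
As a rough illustration of the form-based gateway, the Python sketch below turns a submitted form into field/term pairs that a search engine could consume. The form field names and the `form_to_query` helper are hypothetical; how the actual gateway passes the query to Isearch is not described here.

```python
from urllib.parse import parse_qs

def form_to_query(query_string):
    """Turn a submitted form's query string into (field, term) pairs.

    The form field names (title, author, ...) are hypothetical.
    """
    pairs = []
    for field, values in parse_qs(query_string).items():
        for value in values:
            for term in value.split():
                pairs.append((field, term.lower()))
    return pairs

# In a CGI program the web server supplies this string in the
# QUERY_STRING environment variable when the form uses GET.
example = form_to_query("title=matrix+norms&author=Smith")
```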

Isearch Capabilities

Fast full-text search
  US AIDS Patent Collection - can search ~250,000 patents in < 1 second

Fielded search
  Can restrict searches to title, author, abstract, other fields

Relevance ranking
  Search “hits” are assigned scores & sorted
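
Relevance ranking can be illustrated with a small Python sketch. The scoring formula Isearch uses is not given here, so a plain count of query-term occurrences stands in for it; the documents and function names are likewise assumptions.

```python
import re
from collections import Counter

# Hypothetical document collection: name -> text.
DOCS = {
    "doc1": "matrix norms and matrix inequalities",
    "doc2": "a note on eigenvalues",
    "doc3": "the matrix exponential",
}

def score(text, terms):
    """Score a document by how often the query terms occur in it."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(counts[t] for t in terms)

def ranked_search(docs, query):
    """Return (name, score) hits, highest score first.

    Documents that match no query term are dropped.
    """
    terms = query.lower().split()
    hits = [(name, score(text, terms)) for name, text in docs.items()]
    hits = [hit for hit in hits if hit[1] > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

results = ranked_search(DOCS, "matrix")
```

In this example `doc1` ranks above `doc3` because it mentions the query term twice, and `doc2` is not returned at all.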

Isearch Capabilities

Word truncation
  Search for “matri*” matches “matrix” and “matrices”

Boolean functions
  AND, OR and ANDNOT combinations of different fields

Customized presentation of results

Phrase searching (coming soon)
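
Truncation and the Boolean operators map naturally onto prefix matching and set operations. The Python sketch below assumes a toy index from (field, word) pairs to sets of document numbers; the index contents and the `lookup` helper are illustrative and do not reflect Isearch's query syntax or internals.

```python
# Hypothetical index: (field, word) -> set of document numbers.
INDEX = {
    ("title", "matrix"): {0, 2},
    ("title", "matrices"): {1},
    ("title", "norms"): {0},
    ("author", "smith"): {0, 1},
    ("author", "jones"): {2},
}

def lookup(index, field, pattern):
    """Return documents whose field contains a word matching pattern.

    A trailing '*' requests truncation: 'matri*' matches any word
    that begins with 'matri'.
    """
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        matches = [docs for (f, w), docs in index.items()
                   if f == field and w.startswith(prefix)]
        return set().union(*matches)
    return set(index.get((field, pattern), set()))

title_hits = lookup(INDEX, "title", "matri*")   # matrix and matrices
by_smith = lookup(INDEX, "author", "smith")

both = title_hits & by_smith                          # AND
either = by_smith | lookup(INDEX, "author", "jones")  # OR
not_smith = title_hits - by_smith                     # ANDNOT
```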

Isearch Customization

What’s needed to customize Isearch?
  Isearch is written in C++
  Documents are C++ objects - data & procedures
    Already have SGML & HTML, among others
  Object technology allows code reusability, customizing only where differences from existing objects occur

Isearch Customization

What’s needed to make arbitrary documents searchable?
  Code to parse documents
  Code to parse fields
  Code to build brief and full result records

Yes, it requires programming
  But, many of these are derived from existing procedures
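
The idea of deriving a new document type from an existing one, overriding only what differs, can be sketched as follows. Isearch itself is written in C++; Python is used here only for brevity, and the class and method names are hypothetical rather than Isearch's actual class hierarchy.

```python
import re

class DocType:
    """Hypothetical base document type: parsing plus presentation."""

    def parse_documents(self, raw):
        # Default: a blank line separates documents.
        return [d for d in raw.strip().split("\n\n") if d.strip()]

    def parse_fields(self, document):
        # Default: the whole document is a single field.
        return {"text": document}

    def brief_record(self, fields):
        # Default brief record: the first 40 characters of the first field.
        return next(iter(fields.values()))[:40]

    def full_record(self, fields):
        return "\n".join(f"{name}: {value}" for name, value in fields.items())


class TaggedDocType(DocType):
    """Derived type: only field parsing and the brief record differ."""

    def parse_fields(self, document):
        # Fields are written as <NAME>value</NAME>.
        return dict(re.findall(r"<(\w+)>([^<]*)</\1>", document))

    def brief_record(self, fields):
        return f"{fields.get('TITLE', '')} / {fields.get('AUTHOR', '')}"


doctype = TaggedDocType()
raw = "<TITLE>Matrix norms</TITLE><AUTHOR>Smith</AUTHOR>"
fields = doctype.parse_fields(doctype.parse_documents(raw)[0])
brief = doctype.brief_record(fields)
full = doctype.full_record(fields)   # inherited unchanged from DocType
```

Document splitting and the full record come from the base class; the derived class supplies only the two procedures that differ.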

Customization Example - Linear Algebra

Inputs
  SGML-tagged bibliographic records
  TeX preprints

Requirements
  Field searching on title, author, abstract
  Full-text search of preprints
  WWW-based interface

Customization Example - Linear Algebra

End products
  HTML-tagged “brief records” - title, author and links to full bibliographic records and preprints
  HTML formatted bibliographic records for display in WWW browser
  Preprints for display or retrieval to local storage

Customization Example - Linear Algebra

Sample Bibliographic Record

<BB>
<AID>####</AID>
<VOL>##</VOL>
<ISS>##</ISS>
<ATL>Title text</ATL>
<AUG> <AU>Author Name</AU> </AUG>
<ABS>Abstract text</ABS>
<PPX>Preprint.filename</PPX>
<PGR>###-###</PGR>
</BB>
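
To show how such a record might be turned into an HTML brief record, here is a minimal Python sketch. It extracts fields from the sample record with regular expressions, which is adequate for this flat example but is not a real SGML parser. The `field` and `brief_record` helpers, the `base_url` value and the link layout are all assumptions.

```python
import re
from html import escape

# The sample record, with its placeholder values left as they are.
RECORD = (
    "<BB><AID>####</AID><VOL>##</VOL><ISS>##</ISS>"
    "<ATL>Title text</ATL><AUG> <AU>Author Name</AU> </AUG>"
    "<ABS>Abstract text</ABS><PPX>Preprint.filename</PPX>"
    "<PGR>###-###</PGR></BB>"
)

def field(record, tag):
    """Return the text of every <tag>...</tag> element in the record."""
    return re.findall(rf"<{tag}>(.*?)</{tag}>", record, flags=re.S)

def brief_record(record, base_url="/records"):
    """Build an HTML brief record: title, authors and two links.

    base_url and the link layout are assumptions for illustration.
    """
    article_id = field(record, "AID")[0]
    title = escape(field(record, "ATL")[0])
    authors = escape(", ".join(field(record, "AU")))
    preprint = field(record, "PPX")[0]
    return (
        f'<li><a href="{base_url}/{article_id}.html">{title}</a>, {authors} '
        f'(<a href="{base_url}/{preprint}">preprint</a>)</li>'
    )

html = brief_record(RECORD)
```

The title links to the full bibliographic record and the second link points at the preprint file named in the PPX field.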

Customization Example - Linear Algebra

Isearch Modifications
  ~1 week coding and testing, mostly in developing presentation customizations
  Additional work to develop ingest and on-the-fly formatting scripts, code deployment at ESI
  Now have basic code to handle SGML documents using Elsevier DTD