XTF in Depth Powerful Search and Display for Electronic Text Martin Haye California Digital Library January 2009 presentation at University of Sydney

Page 1:

XTF in Depth

Powerful Search and Display for Electronic Text

Martin Haye, California Digital Library

January 2009 presentation at University of Sydney

Page 2:

XTF in Depth

Part 1:
• What is XTF and how does it compare?
• Who is using it?
• What needs does it address?
• New features in 2.1
• Design and data flow
• Adapting Lucene and Saxon
• Planned improvements

Part 2: Interactive demos

Page 3:

XTF in 5 minutes
• eXtensible Text Framework
• Search and display technology from CDL
• Open-source Java framework
• Powerful and highly configurable
• All about rapid prototyping, fast deployment, and incremental improvement
• XML + full-text search
• Also indexes PDF, HTML, Word
    • Excel and PowerPoint coming soon

Page 4:

XTF in 5 minutes
• Search: query power and speed of Lucene, plus:
    • search results shown in context
    • keyword search, facets, spelling, lots more
• View: processing power of Saxon, plus:
    • large-file optimizations, hit markup
• Configure and customize exclusively in XSLT
• Flexible, overlapping collections
• Mature, tightly integrated, well documented
• In use at CDL and many other places

Page 5:

What XTF is not
• It is not a content management system
    • Creation (conversion, scanning, manual)
    • Ingest / administration
    • Editing
    • Preservation
• Not built for remote administration
• Not a true XML database
    • but close
• Not Google
    • Google: one interface to vast grab-bag of data
    • XTF: crafted interfaces to high-quality data sets

Page 6:

How does XTF compare?

[Chart: vertical axis "Turn-key / easy", horizontal axis "Customizable / Powerful"; systems plotted: Greenstone*, XTF 2.0, XTF 2.1, Solr*]

* caveat: based on my limited experience with Greenstone and Solr

Page 7:

Online Archive of California (screenshot not preserved)

Page 8:

eScholarship Editions (screenshot not preserved)

Page 9:

calisphere (screenshot not preserved)

Page 10:

Mark Twain Project Online (screenshot not preserved)

Page 11:

UC Berkeley (screenshot not preserved)

Page 12:

University of Sydney (screenshot not preserved)

Page 13:

Encyclopedia of Chicago (screenshot not preserved)

Page 14:

Indiana University: Newton (screenshot not preserved)

Page 15:

Indiana University: Swinburne (screenshot not preserved)

Page 16:

Sweden (screenshot not preserved)

Page 17:

Brazil (screenshot not preserved)

Page 18:

Italy (screenshot not preserved)

Page 19:

Needs

Let’s look at four needs that XTF was created to address:
• Diverse data
• Open software
• Rapid deployment
• Community involvement

Page 20:

Needs: 1. Diverse data

Our collections: many and diverse
• eScholarship (TEI, PDF)
    • UC Press monographs (a text may be > 10 MB)
    • 25,000 scholarly articles in PDF
• Mark Twain
    • Hand-crafted critical edition (TEI + MODS)
• OAC: finding aids, images, books, manuscripts
    • Japanese American Relocation Digital Archives
    • TEI, EAD, MODS
• Book scanning projects (Google, Internet Archive)
    • Thousands of scanned books (PDF + DC)
    • Millions of Melvyl catalog records (MARC)

Page 21:

Needs: 2. Open software

Digital Publishing Products
• “Black box” (no control over fixes & features)
• Often not standards-based
• Tech companies have short lifespans
• Support often spotty
• Data can be held hostage, or even lost
• $$$$$

Page 22:

Needs: 3. Rapid deployment
• New collections arriving
• Users don't want to wait a year for access
• Many “what if” and “wouldn't it be cool” requests from our staff
• Java programmers are expensive
• Look & feel goes stale quickly
• Barrage of feature requests

Page 23:

Needs: 4. Community involvement
• We want to share the load
• For XTF 2.1, we asked the XTF community to vote for features they wanted
• At CDL we try to align our development to the needs of the community
• Result: Everybody benefits

Page 24:

New and improved in 2.1
• Faceted browse
• Search flexibility
• Bookbag
• Spelling correction
• Similar items
• OAI-PMH

Page 25:

Faceted browse
• Previously, implementing faceted browse required lots of XSLT programming.
• Hierarchical facets: even harder
• Required us to deeply refactor the stylesheets, but now it’s simple to add new facets.

Page 26:

Faceted browse

Page 27:

Faceted browse

Page 28:

Hierarchical facets

Page 29:

Hierarchical facets

Page 30:

Search flexibility
• Keyword search: single box (now default). Internally, searches multiple fields.
• Advanced search: explicitly fill in constraints for various fields
• Freeform search (new): text-based field specifiers, AND, OR, parentheses, etc.

Page 31:

Keyword search

Page 32:

Advanced search

Page 33:

Freeform search

Page 34:

OAI-PMH
• This fit nicely into XTF’s architecture
• Simple but conforming implementation

Page 35:

Bookbag
• Refactored the AJAX to use YUI (Yahoo User Interface widgets)
• Still session-based
• Now supports emailing the bookbag

Page 36:

Bookbag

Page 37:

Bookbag

Page 38:

Bookbag

Page 39:

Spelling correction
• Unicode bug fixes
• On by default and fully integrated

Page 40:

Spelling correction

Page 41:

Spelling correction

Page 42:

Similar items
• Allows user to see “more like this”
• Improved AJAX integration
• On by default; no configuration needed

Page 43:

Similar items

Page 44:

Similar items

Page 45:

Other changes in XTF 2.1
• Built-in NLM “Blue”, TEI P5, MS Word support (still support TEI P4, EAD, PDF, HTML, text)
• Valid XHTML output
• RawQuery servlet to provide a query back-end to a (e.g. Ruby) front-end or mash-up.
• Bug fixes and minor changes (many reported/requested by users)

Page 46:

Wiki documentation (screenshot not preserved)

Page 47:

Wiki documentation (screenshot not preserved)

Page 48:

Design philosophy
• Adaptation through programming
    • XTF is still about building what you want using a set of powerful tools
• But now:
    • Stylesheets are more modular
    • Build interfaces faster using honed widgets
    • Prettier UI to start with

Page 49:

XTF is open, standards based
• Based on free, open-source tools:
    • Java SDK 1.5+
    • Lucene 2.1 full-text search toolkit
    • Saxon 8.9 XSLT processor
• Unicode support throughout
• XTF itself is open-source (BSD license)
• No native code – pure Java and XSLT 2.0
• Runs on Windows, Solaris, Linux, MacOS
• Drops right in to Tomcat or Resin
• Lots of user-fixable documentation

Page 50:

Modular
• Use crossQuery servlet to search, dynaXML to display and navigate. Deploy one or both.
• Stylesheets govern flow of data – no Java programming required
• Easy to add features incrementally
• 100% configurable “look and feel”
• Skin & slice: one system can have several interfaces and multiple “brands”
• Collection subsetting driven by meta-data

Page 51:

Why XSLT?
• XSLT is a natural fit for XML
    • Powerful, dynamic language
    • Incredibly high-quality, free processor (Saxon)
• Why not Java/Struts?
    • Poor for rapid prototyping, steep learning curve
• Why not Ruby?
    • Not necessarily a good match for XML data
    • Can be too clever by half
    • But a smart mash-up might be cool...

Page 52:

Indexing Process

Page 53:

Indexing
• Input filters adapt to many doc types
    • Any XML doc type
    • PDF, MS Word, plain text, untidy HTML
• XTF is agnostic regarding:
    • Document identifiers
    • Filesystem organization
        • Uses document selector stylesheet to identify and classify documents in filesystem
    • Meta-data storage
• Incremental indexing
    • Simply update filesystem then run indexer.

Page 54:

crossQuery servlet

Page 55:

Flexible Search/Display
• One query, many collections
    • XTF enables “Virtual collections”
• Output filters for various result views
    • e.g. simple vs. advanced search form, results in brief vs. long format, etc.
• Query parsers for different search interfaces
• Interface to other query protocols
    • SRU and OAI-PMH already implemented
    • Should be easy to adapt other queries:
        • Very extensive set of query operators
        • Flexible query composition
• Faceted browse

Page 56:

Query Power
• Many operators
    • AND, OR, NEAR, NOT, phrase, range, wildcard
    • Or-Near, multi-field AND, “more like this”
• Arbitrarily complex queries
    • Combine full-text search with meta-data
    • Unusual queries like: "dynamic duo" near "red phone"
• Structure-aware searching
    • e.g. search only headings, or only bibliographies
    • But must pre-define which structures to search

Page 57:

More Power
• Fixed-length snippets
    • Highlight the hit and just the hit
• Sort by relevance, or any meta-data fields
• Spelling correction
• No penalty for huge documents
    • XTF “lazily” pulls in only those parts used by a particular request (e.g. show just Chapter 1)
• Scalable
    • Proven with 10 million records / 14 GB of data
    • but beyond that, Solr looks better
• Authentication: IP lists, LDAP, or external

Page 58:

dynaXML servlet

Page 59:

Adapting Lucene and Saxon
• Adapting Lucene
    • Chunking, flattening, hit marking, stop-words, setting limits, insensitivity, special queries, faceted browsing, spelling correction
• Adapting Saxon
    • Lazy trees, misc. extensions

Page 60:

Adapting Lucene: Chunking
• Why
    • Lucene's proximity searches perform best on small documents
    • Small chunks enable efficient generation of 80-character “snippet” surrounding each hit
• How
    • XTF breaks text blocks into 200-word chunks
    • Chunks overlap to detect a hit starting in one and ending in the next.
    • Each chunk carries structural info, plus pointer to location in XML doc.
    • Only first chunk carries meta-data for doc
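
A minimal sketch of overlapping chunking, not XTF's actual code. The 200-word chunk size comes from the slide; the 20-word overlap, the function name, and the use of a word offset as the "pointer" back into the document are assumptions made for illustration.

```python
def overlapping_chunks(words, size=200, overlap=20):
    """Split a word list into chunks of `size` words.

    Consecutive chunks share `overlap` words, so a short phrase that
    crosses a chunk boundary is still fully contained in one chunk.
    Each chunk is returned with the word offset where it starts
    (a stand-in for a pointer into the source document).
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


words = [f"w{i}" for i in range(450)]
chunks = overlapping_chunks(words)
print([(start, len(chunk)) for start, chunk in chunks])
```

The overlap only needs to be as long as the longest phrase or proximity window that should be matched across a boundary.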

Page 61:

Adapting Lucene: Flattening XML
• XSLT prefilter flattens XML structure
    • Series of text blocks
    • Block tagged with structural info for search
    • Prefilter can boost or suppress sections
    • Fine control over proximity matching
• Prefilter gathers/marks meta-data
    • Can come from within the document, from an XML doc in filesystem, or fetched from a URL.
    • Synthesize meta-data (e.g. sort fields, facets)

Page 62:

Adapting Lucene: Hit Marking
• Marking search hits in context
    • Lucene doesn't pinpoint location of hits, only gives a score per-document
    • Custom enhancements to Lucene's “span” logic score and locate each hit.
    • dynaXML dynamically adds ranked hits to original XML doc, then sends to XSLT formatter.
    • crossQuery forms a snippet around and highlights each hit.
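
A rough sketch of forming a fixed-width snippet around a located hit. This is an illustration only: the function name, the bracket markers, and the centering rule are assumptions, and the real crossQuery works on indexed chunks rather than a plain string. The 80-character width is the figure mentioned on the chunking slide.

```python
def make_snippet(text, hit_start, hit_end, width=80, mark=("[", "]")):
    """Return about `width` characters of context with the hit marked.

    `hit_start` and `hit_end` are character offsets of the hit in `text`.
    The window is centered on the hit and shifted when the hit sits
    near the start or end of the text.
    """
    hit_len = hit_end - hit_start
    window = max(width, hit_len)
    pad = max(0, width - hit_len)
    left = max(0, hit_start - pad // 2)
    right = min(len(text), left + window)
    left = max(0, right - window)  # use spare room on the left near the end
    return (text[left:hit_start] + mark[0] + text[hit_start:hit_end]
            + mark[1] + text[hit_end:right])


text = "x" * 100 + "moon" + "y" * 100
snippet = make_snippet(text, 100, 104)
print(snippet)
```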

Page 63:

Adapting Lucene: Stop-words
• Robust, efficient stop-word handling
    • “the, a, an, it, on...”
    • People do use them, and expect corresponding results.
    • Lucene normally ignores stop-words, for speed.
    • XTF quietly joins stop-words to adjacent words, forming “n-grams”
    • Example: “man on the moon” -> man-on on-the the-moon
    • Queries are internally rewritten to search for n-grams automatically.
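
The slide's example can be reproduced with a few lines. This is a sketch of the idea only: the stop-word set is the partial list from the slide, the function name is made up, and the sketch returns just the joined terms (whether plain words are also indexed alongside them is not stated here).

```python
STOP_WORDS = {"the", "a", "an", "it", "on"}  # partial list from the slide


def stopword_bigrams(tokens, stop_words=STOP_WORDS):
    """Join each adjacent pair of tokens in which at least one is a stop-word."""
    joined = []
    for left, right in zip(tokens, tokens[1:]):
        if left in stop_words or right in stop_words:
            joined.append(f"{left}-{right}")
    return joined


print(stopword_bigrams("man on the moon".split()))
# ['man-on', 'on-the', 'the-moon']
```

Applying the same function to the query text at search time is what lets a phrase containing stop-words match without a separate posting list for each very common word.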

Page 64:

Adapting Lucene: Setting Limits
• Limits on aberrant queries
    • Adjustable limits on number of terms matched by range or wildcard queries
    • N-grams naturally make most queries efficient
    • Configurable limits on amount of “work” performed by a single query.
• Numeric range query
    • Avoids term expansion
    • Efficiently filters very granular data, e.g. timestamps: 2006-11-14:12:46:03.77
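
One way to picture a range filter that avoids term expansion is to keep the numeric values sorted and binary-search the two endpoints, instead of enumerating every distinct term in the range. The class below is a hypothetical illustration of that idea, not a description of how XTF implements it.

```python
from bisect import bisect_left, bisect_right


class NumericIndex:
    """Sorted (value, doc_id) pairs supporting inclusive range lookups."""

    def __init__(self, pairs):
        self._pairs = sorted(pairs)
        self._values = [value for value, _ in self._pairs]

    def in_range(self, low, high):
        """Return doc ids whose value v satisfies low <= v <= high."""
        lo = bisect_left(self._values, low)
        hi = bisect_right(self._values, high)
        return [doc_id for _, doc_id in self._pairs[lo:hi]]


index = NumericIndex([(9, "d"), (5, "b"), (1, "a"), (5, "c")])
print(index.in_range(5, 9))
```

The cost of locating the range depends on the logarithm of the number of values, not on how many distinct timestamps fall inside it.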

Page 65:

Adapting Lucene: Insensitivity
• Accent/diacritic marks
    • Many users can't or don't know how to type them
    • XTF indexer uses configurable map to remove accents
    • crossQuery maps query terms
• Plural
    • Convenient for “cat” to match “cats” also
    • Configurable map of plural to singular used at index and query time
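
A small sketch of map-driven normalization applied identically at index time and query time. The map contents and the function name are invented for illustration, and the lowercasing step is an added assumption (the slide covers only accents and plurals); XTF reads its maps from configuration.

```python
# Hypothetical, tiny stand-ins for the configurable maps.
ACCENT_MAP = str.maketrans({"é": "e", "è": "e", "ü": "u", "ñ": "n", "ç": "c"})
PLURAL_MAP = {"cats": "cat", "cafes": "cafe"}


def normalize_term(term):
    """Fold case and accents, then map a known plural to its singular."""
    folded = term.lower().translate(ACCENT_MAP)
    return PLURAL_MAP.get(folded, folded)


print(normalize_term("Cafés"), normalize_term("Cats"), normalize_term("cat"))
```

Because both the indexer and the query path call the same function, "cat" and "Cats" end up as the same indexed term.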

Page 66:

Adapting Lucene: Special Queries
• OR-NEAR
    • Standard OR query doesn't use proximity
    • OR-NEAR: if words nearby, score is boosted
• Multi-field AND
    • All terms must be present, in any field.
    • Essential for certain keyword searches: against all enemies clarke (matches against title and author)
• More like this
    • Auto-calculates “interesting” terms in meta-data
    • Creates OR-NEAR query to find similar docs
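
The multi-field AND rule can be sketched as a simple predicate over one record. The record layout and function name are hypothetical, and this ignores scoring entirely; it only shows why the slide's example query matches when its terms are split across title and author.

```python
def multi_field_and(doc, query):
    """True if every query term occurs in at least one field of `doc`."""
    words = set()
    for value in doc.values():
        words.update(value.lower().split())
    return all(term in words for term in query.lower().split())


# Illustrative record: terms are split between the two fields.
record = {"title": "Against All Enemies", "author": "Richard A. Clarke"}
print(multi_field_and(record, "against all enemies clarke"))
print(multi_field_and(record, "against all enemies smith"))
```

A per-field AND would reject the first query, since neither the title nor the author field contains all four terms on its own.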

Page 67:

Adapting Lucene: Faceted Browsing
• Draws facet term list from Lucene index
• Each facet cached in-memory
• Counts per group created dynamically
• Special mini-language to sort/select (esp. useful for hierarchical facets)
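
A sketch of computing per-group counts on the fly for a result set, with roll-up through a hierarchy. The "::" level separator, the field name, and the record layout are assumptions for illustration; the select/sort mini-language mentioned on the slide is not modeled.

```python
from collections import Counter


def facet_counts(hits, field, sep="::"):
    """Count hits per facet value, also crediting every ancestor level."""
    counts = Counter()
    for doc in hits:
        value = doc.get(field)
        if not value:
            continue  # this hit has no value for the facet
        parts = value.split(sep)
        for depth in range(1, len(parts) + 1):
            counts[sep.join(parts[:depth])] += 1
    return counts


hits = [
    {"subject": "History::Ancient"},
    {"subject": "History::Modern"},
    {"subject": "Science"},
    {},
]
counts = facet_counts(hits, "subject")
print(counts["History"], counts["History::Ancient"], counts["Science"])
```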

Page 68:

Adapting Lucene: Spelling Correction
• Any standard dictionary won't match place and proper names
• Idea: use the index as source of suggestions
• XTF searches words within edit distance 2
• Candidates ranked by weighted score:
    • Edit distance (transpositions discounted)
    • Frequency of use in the index
    • Double-metaphone match
• Multi-word correction uses pair frequencies
• On test data, 80% right suggestion
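
A simplified sketch of index-driven suggestion. The weights, the log-frequency term, and the function names are assumptions, and the double-metaphone and pair-frequency parts are left out. Transpositions are "discounted" here by using optimal string alignment distance, where swapping two adjacent letters costs one edit instead of two.

```python
import math


def osa_distance(a, b):
    """Edit distance in which an adjacent transposition counts as one edit."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def suggest(word, index_freq, max_dist=2, dist_weight=1.0, freq_weight=1.0):
    """Pick the best-scoring index word within `max_dist` edits, or None."""
    best, best_score = None, float("-inf")
    for cand, freq in index_freq.items():
        if cand == word:
            continue
        dist = osa_distance(word, cand)
        if dist > max_dist:
            continue
        score = freq_weight * math.log1p(freq) - dist_weight * dist
        if score > best_score:
            best, best_score = cand, score
    return best


index_freq = {"sydney": 40, "sidney": 3, "disney": 10}  # made-up counts
print(suggest("sydny", index_freq))
```

Scanning the whole vocabulary is fine for a demonstration; a production index would need a faster way to enumerate candidates within two edits.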

Page 69:

Adapting Saxon: Lazy Trees
• The need: display small parts of large (> 10 MB) XML documents
• Solution: create a binary, random-access version of each document
    • XSL keys calculated once and stored
    • Only elements accessed by a given request are loaded from disk
    • Care must be taken in stylesheets
    • Profile mode is useful for optimization

Page 70:

Adapting Saxon: Extensions
• More complete SQL database connection
• Ability to call external tools
    • Automatic XML conversion in/out
    • Timeout enforcement
• File utilities
    • Check file existence
    • Get file length and timestamp
• Session data
    • Key/value pairs
    • Value can be XML or plain string

Page 71:

The future
• XTF 2.2:
    • Better out-of-box for large EADs
    • Fixes for incremental indexing; other bug fixes
    • Specify any number of sub-dirs to index
    • Possible TEI P5 refactoring
    • Background auto-warming of new index
    • Support for indexing PowerPoint and Excel files
• Further out:
    • A page-turner for scanned texts and converted PDFs
    • Pop-up image/PDF page snippets
    • And of course, features suggested by users

Page 72:

Demos

I’ll demonstrate the features we talked about on several different XTF sites “out in the wild.”

Page 73:

Fin
• Project: xtf.sourceforge.net
• Docs: xtf.wiki.sourceforge.net
• Discuss: groups.google.com/group/xtf-user
• This talk: xtf.sourceforge.net/talks/2009-01-23.ppt
• Me: [email protected]