Post on 20-Dec-2015
Alexandria Digital Earth ProtoType
The ADEPTDigital Library Architecture
JAMES FREW & GREG JANÉE
Alexandria Digital Library ProjectUniversity of California, Santa Barbara
Alexandria Digital Earth ProtoType
2Frew • Cornell DL seminar • 2002-04-23
ADEPT Introduction
Alexandria Digital Earth ProtoType (ADEPT) is: Distributed digital library for geo-referenced information Services supporting DL federation and interoperation Large geospatial collections
Goal: an Internet “library’” layer Organization Persistence Accessibility Scalability
– Lots of collections– Big collections small collections– Heterogeneous contents
Alexandria Digital Earth ProtoType
3Frew • Cornell DL seminar • 2002-04-23
Outline
Core library architecture Metadata interoperability Other features
Query translation Collection discovery
Alexandria Digital Earth ProtoType
5Frew • Cornell DL seminar • 2002-04-23
Architectural Elements
Item structured descriptions (“reports”) contents (optional)
Library set of collections client (public) services
Collection set of items library (internal) services
= a distributed catalog system
Alexandria Digital Earth ProtoType
6Frew • Cornell DL seminar • 2002-04-23
The Big Picture
library(middleware server)
client
item item item item
library(middleware server)
proxy collection
collection
collection
ADEPT
Alexandria Digital Earth ProtoType
7Frew • Cornell DL seminar • 2002-04-23
Role of the Middleware
collection collection
logical view
item item item
collection discoveryservice
client
functional view
• local access point• standard services• access control• thin client support• distributed search• brokering of queries & results• proxying of collections & items• creation & organiza- tion of “local” collections
middlewaremiddleware
RDBMS
Z39.50
“local”
thesaurus/vocabulary
Alexandria Digital Earth ProtoType
8Frew • Cornell DL seminar • 2002-04-23
Middleware Interfaces
Clients
Configuration {collection-id} Collection(collection-id) report Query(query) query-id Results(query-id) {holding-id} Metadata(collection-id, holding-id, view) report
Libraries
Collection report Query(query, accumulator) query-thread Metadata(holding-id, view) report
Collections
Alexandria Digital Earth ProtoType
9Frew • Cornell DL seminar • 2002-04-23
Metadata Reports
Collection Metadata that applies to entire collection
Bucket Item’s bucket metadata
Scan Brief (“1-line”) subset of bucket report
Full All the item’s metadata, in whatever format
Browse URL(s) reduced-resolution graphics
Access URL(s) content (if available)
Alexandria Digital Earth ProtoType
10Frew • Cornell DL seminar • 2002-04-23
A Complete System: The Boxology
HTTP transport
SDLIP proxy, other clientsweb browser
generic DB driver query translator
MIDDLEWARE
CLIENT
SERVER
web intermediary/XMLHTML converter
configurationfile
core functionalityaccess control (service- and collection-level)
query fan-out & results mergingquery result rankingresult set caching
access controlmechanisms
rankingmethods
client-side services (Java classes)
server-side interface (Java interfaces)
RDBMS
JDBC
configuration files,Python scripts
RMI transport
proxy driver
HTTP
HTTP
XML
XML
group driver
thesauri
OR
paradigmlibrary
Z39.50 driver
Alexandria Digital Earth ProtoType
12Frew • Cornell DL seminar • 2002-04-23
ADEPT’s Interoperability Problem
Distributed, heterogeneous collections locally, autonomously created and managed
Minimize impact on collection providers allow use of native metadata
Uniform client services common high-level interface across collections discover and exploit collection-specific interfaces
Assumptions items have metadata items have sufficient, “good” metadata i.e., this is a metadata interoperability problem
Alexandria Digital Earth ProtoType
13Frew • Cornell DL seminar • 2002-04-23
Bucket Concept
Abstract metadata category Strongly typed Well-defined search semantics
– query terms– query operators
Explicitly mapped from source metadata– (FGDC, 1.3, “Time period of content”, “2001-09-08”)
Bucket-level search uniform across all collections
– e.g.: search all collections for items whose Originator bucket contains the phrase “geological survey”
Alexandria Digital Earth ProtoType
14Frew • Cornell DL seminar • 2002-04-23
Bucket Properties
name Coverage date
semantic definition The time period to which the item is relevant.
data type (strictly observed) calendar date or range of calendar dates
syntactic representation (strictly observed) ISO 8601
Alexandria Digital Earth ProtoType
15Frew • Cornell DL seminar • 2002-04-23
What is a bucket? (2/2)
Source metadata is mapped to buckets buckets hold not just simple values
– “2001-09-08” but rather, explicit representations of mappings
– (FGDC, 1.3, “Time period of content”, “2001-09-08”) may have multiple values per bucket
Bucket definition includes search semantics query terms
– ISO 8601 date range query operators
– contains, overlaps, is-contained-in Some semantics are fuzzy to accommodate multiple
implementations
Alexandria Digital Earth ProtoType
16Frew • Cornell DL seminar • 2002-04-23
Collection-level aggregation
Collection-level metadata describes buckets supported by the collection item-level metadata mappings statistical overviews
– item counts– spatiotemporal coverage histograms
Example (de-XML-ized) in collection foo, the Originator bucket is supported and the
following item fields are mapped to it:– (FGDC, 1.1/8.1, “Citation/Originator”) [973 items]– (USGS DOQ, PRODUCER, “Producer”) [973 items]– (DC, Creator, “Creator”) [1249 items]– unknown [6 items]
Alexandria Digital Earth ProtoType
17Frew • Cornell DL seminar • 2002-04-23
Searching collections
Bucket-level uniform across all collections example
– search all collections for items whose Originator bucket contains the phrase “geological survey”
Field-level collection-specific but discovery and invocation mechanisms are uniform functionally equivalent to searching the entire bucket plus
additional constraint example
– search collection foo for items whose FGDC 1.1/8.1 field within the Originator bucket contains the phrase…
Alexandria Digital Earth ProtoType
18Frew • Cornell DL seminar • 2002-04-23
Bucket types (1/7)
6 bucket types: spatial, temporal, hierarchical, textual, qualified textual, numeric
Type captures the portion of the bucket definition that has functional implications data type & syntactic representation query terms query operators
Complete bucket definition name semantic definition bucket type
Alexandria Digital Earth ProtoType
19Frew • Cornell DL seminar • 2002-04-23
Bucket types (2/7)
Spatial data type: any of several types of geometric regions defined
in WGS84 latitude/longitude coordinates syntax: defined by ADEPT query terms: WGS84 box or polygon operators: contains, overlaps, is-contained-in example query:
– <spatial-constraint bucket=“geographic-location” operator=“overlaps”> <box north=“37.5” south=“30.0” east=“-110” west=“-140”/></spatial-constraint>
Alexandria Digital Earth ProtoType
20Frew • Cornell DL seminar • 2002-04-23
Bucket types (3/7)
Temporal data type: calendar date or range of calendar dates syntax: ISO 8601 query term: range of calendar dates operators: contains, overlaps, is-contained-in example query:
– <temporal-constraint bucket=“coverage-date” operator=“contains” from=“1970-01-01” to=“1979-12-31”/>
Alexandria Digital Earth ProtoType
21Frew • Cornell DL seminar • 2002-04-23
Bucket types (4/7)
Hierarchical data type: term drawn from a controlled vocabulary
(thesaurus, etc.) one-to-one relationship between hierarchical buckets and
vocabularies query term: vocabulary term operator: is-a example query:
– <hierarchical-constraint bucket=“feature-type” operator=“is-a” vocabulary=“ADL Feature Type Thesaurus” term=“populated place”/>
Alexandria Digital Earth ProtoType
22Frew • Cornell DL seminar • 2002-04-23
Bucket types (5/7)
Textual data type: text query term: text operators: contains-all-words, contains-any-words, contains-
phrase example query:
– <textual-constraint bucket=“subject-related-text” operator=“contains-all-words” text=“orthophotograph”/>
Alexandria Digital Earth ProtoType
23Frew • Cornell DL seminar • 2002-04-23
Bucket types (6/7)
Qualified textual data type: text with optional associated namespace query term: same query operator: matches example query:
– <qualified-textual-constraint bucket=“identifier” operator=“matches” text=“90-70002-34-5” namespace=“ISBN”/>
Alexandria Digital Earth ProtoType
24Frew • Cornell DL seminar • 2002-04-23
Bucket types (7/7)
Numeric data type: real number query term: real number query operators: standard relational operators example query:
– <numeric-constraint bucket=“minimum-feature-size” operator=“less-than” value=“1.0” unit=“meters”/>
Alexandria Digital Earth ProtoType
25Frew • Cornell DL seminar • 2002-04-23
Bucket types vs. buckets
Bucket types are defined architecturally Buckets in use are defined by collections and items
need standard buckets, defined conventionally, to support cross-collection uniformity
ADL core buckets simple; universal; easily & broadly populated; useful
Bucket descriptions in the following slides: bucket type semantic definition effective treatment of multiple values in searching comparison to Dublin Core
Alexandria Digital Earth ProtoType
26Frew • Cornell DL seminar • 2002-04-23
ADL core buckets (1/6)
Subject-related text Title Assigned term
Originator Geographic location Coverage date Object type Feature type Format Identifier
Alexandria Digital Earth ProtoType
27Frew • Cornell DL seminar • 2002-04-23
ADL core buckets (2/6)
Subject-related text type: textual description: text indicative of item’s subject
– not necessarily from controlled vocabularies superset of Title and Assigned term multiple values: concatenated compare: DC.Subject
Title type: textual description: item’s title subset of Subject-related text multiple values: concatenated compare: DC.Title
Alexandria Digital Earth ProtoType
28Frew • Cornell DL seminar • 2002-04-23
ADL core buckets (3/6)
Assigned term type: textual description: subject-related terms from controlled
vocabularies subset of Subject-related text multiple values: concatenated compare: qualified DC.Subject
Originator type: textual description: names of entities related to item’s origin multiple values: concatenated compare: DC.Creator + DC.Publisher
Alexandria Digital Earth ProtoType
29Frew • Cornell DL seminar • 2002-04-23
ADL core buckets (4/6)
Geographic location type: spatial description: subset of Earth’s surface related to item multiple values: union compare: DC.Coverage.Spatial
Coverage date type: temporal description: calendar dates related to item multiple values: union compare: DC.Coverage.Temporal
Alexandria Digital Earth ProtoType
30Frew • Cornell DL seminar • 2002-04-23
ADL core buckets (5/6)
Object type type: hierarchical vocabulary: ADL Object Type Thesaurus
– (image, map, thesis, sound recording, etc.) multiple values: unioned compare: DC.Type
Feature type type: hierarchical vocabulary: ADL Feature Type Thesaurus
– (river, mountain, park, city, etc.) multiple values: unioned compare: none
Alexandria Digital Earth ProtoType
31Frew • Cornell DL seminar • 2002-04-23
ADL core buckets (6/6)
Format type: hierarchical vocabulary: ADL Object Format Thesaurus
– (loosely based on MIME) multiple values: union compare: DC.Format
Identifier type: qualified textual description: unique identifiers for item multiple values: treated separately compare: DC.Identifier
Alexandria Digital Earth ProtoType
32Frew • Cornell DL seminar • 2002-04-23
Bucket Summary
Strongly typed, abstract metadata category, with defined search semantics, to which source metadata is mapped.
Supports discovery/search across distributed, heterogeneous collections that use metadata structures of their choosing.
Supports “drill-down” searching of item-level metadata elements.
Alexandria Digital Earth ProtoType
33Frew • Cornell DL seminar • 2002-04-23
Challenges
Metadata is like life: refuses to follow the rules unknown semantics inconsistent typing/syntax unknown or unidentifiable sources poor/inconsistent quality proliferation of overlapping vocabularies ...
Reality check: Dublin Core won adapt buckets to qualified Dublin Core incorporate fallback mechanism or polymorphism
– e.g, treat fields as thesauri/controlled vocabularies or as text
Alexandria Digital Earth ProtoType
35Frew • Cornell DL seminar • 2002-04-23
ADEPT query language
Domain collection of items item = (unique ID, field, …) field = (name, value) bucket = (name, union or concatenation of fields)
Queries atomic constraint: (attribute name, operator, target)
– return items that have at least 1 value for the attribute,for which at least one value matches the target
arbitrary Boolean combinations– AND, OR, AND NOT
Alexandria Digital Earth ProtoType
36Frew • Cornell DL seminar • 2002-04-23
The problem
Algorithmically translate ADEPT queries to SQL accommodate all possible SQL implementations configurable by mere mortals generate “reasonable” SQL make up for DB deficiencies
– stupid things like order of tables & conditions matter– incorporate optimizer hints and directives
Alexandria Digital Earth ProtoType
37Frew • Cornell DL seminar • 2002-04-23
Approach
Python-based translator ~1500 lines
Extensible “paradigms” describe atomic translation techniques 15 paradigms Each paradigm ~100 lines (50 Python code, 20 assertions,
30 documentation) Intrinsic & explicit combination rules
unify; then JOIN; then self-JOIN; etc. Configuration file describes:
buckets, fields, paradigms, paradigm configuration Boolean override rules misc: external identifier table, optimizer clauses
Alexandria Digital Earth ProtoType
39Frew • Cornell DL seminar • 2002-04-23
The problem
Distributed queries: necessary evil necessary to achieve scalability
– performance– autonomy
introduce scalability, performance, and reliability problems
Amelioration strategies increase server performance/reliability
– replication, DIENST connectivity regions turn into offline problem
– Web search engines, OAI harvesting model identify relevant collections to query (ADEPT)
– analogous to Web search engine
Alexandria Digital Earth ProtoType
40Frew • Cornell DL seminar • 2002-04-23
Approach
Build on collection-level metadata spatial & temporal density histograms item counts per collection categorization schemes
Upload periodically to central server Use Euler histograms to support range queries
Alexandria Digital Earth ProtoType
41Frew • Cornell DL seminar • 2002-04-23
Challenges
Relevance not necessarily Boolean worldwide, petabyte, 1cm resolution database
vs.world map drawn on napkin
weight by resolution or minimum feature size– but sometimes you want the napkin
The JOIN problem statistics computed independently
Text overviews STARTS?