New Generation Database Systems: XML Databases

IS 257 – Fall 2006 2006.11.28- SLIDE 1 New Generation Database Systems: XML Databases University of California, Berkeley School of Information IS 257: Database Management

Page 1: New Generation Database Systems: XML Databases

IS 257 – Fall 2006 2006.11.28- SLIDE 1

New Generation Database Systems: XML Databases

University of California, Berkeley School of Information

IS 257: Database Management

IS 257 – Fall 2006 2006.11.28- SLIDE 2

Lecture Outline

• XML and RDBMS
• Native XML Databases

IS 257 – Fall 2006 2006.11.28- SLIDE 3

Lecture Outline

• XML and DBMS
• Native XML Databases

IS 257 – Fall 2006 2006.11.28- SLIDE 4

Standards: XML/SQL

• As part of SQL3, an extension called XML/SQL (formally SQL/XML, ISO/IEC 9075-14) is being created to provide a mapping from XML to DBMS
• The (draft) standard is very complex, but the ideas are actually pretty simple
• Suppose we have a table called EMPLOYEE that has columns EMPNO, FIRSTNAME, LASTNAME, BIRTHDATE, SALARY

IS 257 – Fall 2006 2006.11.28- SLIDE 5

Standards: XML/SQL

• That table can be mapped to:

<EMPLOYEE>
  <row>
    <EMPNO>000020</EMPNO>
    <FIRSTNAME>John</FIRSTNAME>
    <LASTNAME>Smith</LASTNAME>
    <BIRTHDATE>1955-08-21</BIRTHDATE>
    <SALARY>52300.00</SALARY>
  </row>
  <row> … etc. …
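
The table-to-XML idea can be tried out in a few lines of Python. This is only a sketch of the flat mapping shown on the slide, not the standard's full rules: the function name table_to_xml and the in-memory row list are assumptions made for illustration, and no XML Schema is generated.

```python
import xml.etree.ElementTree as ET


def table_to_xml(table_name, columns, rows):
    """Wrap each row in a <row> element with one child element per column."""
    root = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_el, column).text = str(value)
    return root


# Column names and the sample row come from the EMPLOYEE example.
columns = ["EMPNO", "FIRSTNAME", "LASTNAME", "BIRTHDATE", "SALARY"]
rows = [("000020", "John", "Smith", "1955-08-21", "52300.00")]

employee = table_to_xml("EMPLOYEE", columns, rows)
xml_text = ET.tostring(employee, encoding="unicode")
print(xml_text)
```

Each further tuple in rows would simply become another <row> element under <EMPLOYEE>.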

IS 257 – Fall 2006 2006.11.28- SLIDE 6

Standards: XML/SQL

• In addition, the standard says that XML Schemas must be generated for each table, and it also allows relations to be managed by nesting records from tables in the XML.
• Variants of this are incorporated into the latest versions of ORACLE
• But what if you want to deal with more complex XML schemas (beyond “flat” structures)?

IS 257 – Fall 2006 2006.11.28- SLIDE 7

The following slides are adapted from:

XML to Relational Database Mapping
Bhavin Kansara

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 8

Introduction

• XML/relational mapping means data transformation between XML and relational data models
• XML documents can be transformed to relational data models or vice versa.
• The mapping method is the way the mapping is done

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 9

XML

• XML: Extensible Markup Language
• Documents have tags giving extra information about sections of the document
  – E.g. <title> XML </title>
  – <slide> Introduction </slide>
• XML has emerged as the standard for representing and exchanging data on the World Wide Web.
• The growing number of XML documents creates a need to store and query them efficiently.

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 10

XML vs. HTML

• HTML tags describe how to render things on the screen, while XML tags describe what things are.
• HTML tags are designed for the interaction between humans and computers, while XML tags are designed for the interactions between two computers.
• Unlike HTML tags, XML tags tell you what the data means, rather than how to display it

<name>
  <first> abc </first>
  <middle> xyz </middle>
  <last> def </last>
</name>

<html>
  <head><title>Title of page</title></head>
  <body>
    abc <br>
    xyz <br>
    def <br>
  </body>
</html>

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 11

XML Technologies

• Schema Languages
  – DTDs
  – XML Schemas
• Query Languages
  – XPath
  – XQuery
  – XSLT
• Programming APIs
  – DOM
  – SAX

<bib>
{
  for $b in doc("http://bstore1.example.com/bib.xml")/bib/book
  where $b/publisher = "Addison-Wesley" and $b/@year > 1991
  return
    <book year="{ $b/@year }">
      { $b/title }
    </book>
}
</bib>

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="simple.xsl"?>
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description> two of our famous Belgian Waffles </description>
    <calories>650</calories>
  </food>
</breakfast_menu>

Slide from Bhavin Kansara
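
To make the difference between the two programming APIs concrete, here is a small sketch using Python's standard library (xml.dom.minidom for DOM, xml.sax for SAX). The menu string is a shortened copy of the breakfast menu example, and the NameCollector class is a name invented for this illustration. DOM loads the whole document into a tree and then navigates it; SAX streams through the document and calls back on each event.

```python
import xml.sax
from xml.dom import minidom

MENU = """<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <calories>650</calories>
  </food>
</breakfast_menu>"""

# DOM: parse everything into an in-memory tree, then look nodes up.
dom = minidom.parseString(MENU)
dom_names = [node.firstChild.data for node in dom.getElementsByTagName("name")]


class NameCollector(xml.sax.ContentHandler):
    """SAX: collect the text of every <name> element as events stream past."""

    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def startElement(self, tag, attrs):
        if tag == "name":
            self.in_name = True
            self.names.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so append rather than assign.
        if self.in_name:
            self.names[-1] += content

    def endElement(self, tag):
        if tag == "name":
            self.in_name = False


collector = NameCollector()
xml.sax.parseString(MENU.encode("utf-8"), collector)
sax_names = collector.names

print(dom_names, sax_names)
```

The DOM version is shorter, but the SAX version never holds the whole document in memory, which matters for large files.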

IS 257 – Fall 2006 2006.11.28- SLIDE 12

DTD ( Document Type Definition )

• DTD stands for Document Type Definition
• The purpose of a Document Type Definition is to define the legal building blocks of an XML document.
• It formally defines the relationships between the various elements that form the document.
• A DTD allows computers to check that each component of a document occurs in a valid place within the document.

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 13

DTD ( Document Type Definition )

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 14

XML vs. Relational Database

CUSTOMER
Name    Age
ABC     30
XYZ     40

<customers>
  <custRec>
    <Name type="String">ABC</Name>
    <Age type="Integer">30</Age>
  </custRec>
  <custRec>
    <Name type="String">XYZ</Name>
    <Age type="Integer">40</Age>
  </custRec>
</customers>

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 15

XML vs. Relational Database

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 16

XML vs. Relational Database

<!ELEMENT note (to+, from, header, message*, #PCDATA)>

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 17

XML vs. Relational Database

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 18

When XML representation is not beneficial

• When downstream processing of the data is relational
• When the highest possible performance is required
• When any normalized data components have value outside the XML representation or the data need not be retained in XML form to have value
• When the data is naturally tabular

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 19

When XML representation is beneficial

• When schema is volatile
• When data is inherently hierarchical in nature
• When data represents business objects in which the component parts do not make sense when removed from the context of that business object
• When applications have sparse attributes
• When low-volume data is highly structured

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 20

XML-to-Relational mapping

• Schema mapping
  – A database schema is generated from an XML schema or DTD for the storage of XML documents.
• Data mapping
  – Shreds an input XML document into relational tuples and inserts them into the relational database whose schema is generated in the schema mapping phase

Slide from Bhavin Kansara
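
The two phases can be sketched with Python's built-in sqlite3 and ElementTree modules. Several things here are assumptions for the sake of a runnable example: the customer document, the table name custRec, and above all the fact that the CREATE TABLE statement is written by hand, whereas a real mapping tool would derive it from the DTD or XML schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = """<customers>
  <custRec><Name>ABC</Name><Age>30</Age></custRec>
  <custRec><Name>XYZ</Name><Age>40</Age></custRec>
</customers>"""

conn = sqlite3.connect(":memory:")

# Schema mapping: hand-written here as a stand-in for a schema
# generated from the DTD.
conn.execute(
    "CREATE TABLE custRec (id INTEGER PRIMARY KEY, Name TEXT, Age INTEGER)"
)

# Data mapping: shred each <custRec> element into one tuple.
root = ET.fromstring(DOC)
for rec in root.findall("custRec"):
    conn.execute(
        "INSERT INTO custRec (Name, Age) VALUES (?, ?)",
        (rec.findtext("Name"), int(rec.findtext("Age"))),
    )
conn.commit()

rows = conn.execute("SELECT Name, Age FROM custRec ORDER BY id").fetchall()
print(rows)
```

Once shredded, the data can be queried with ordinary SQL, which is the main attraction of this approach.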

IS 257 – Fall 2006 2006.11.28- SLIDE 21

Schema Mapping

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 22

Simplifying DTD

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 23

DTD graph

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 24

Inlined DTD graph

• Given a DTD graph, a node is inlinable if and only if it has exactly one incoming edge and that edge is a normal edge.

Slide from Bhavin Kansara
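
The inlinable test is easy to state in code. In the sketch below the graph is a list of (parent, child, kind) edges. The element names are made up, and the label "star" for the non-normal edge kind (repeated children) is an assumption, since the definition above only names normal edges.

```python
from collections import defaultdict


def inlinable_nodes(edges):
    """Return the nodes with exactly one incoming edge, that edge being normal."""
    incoming = defaultdict(list)
    for parent, child, kind in edges:
        incoming[child].append(kind)
    return {node for node, kinds in incoming.items() if kinds == ["normal"]}


# Hypothetical DTD graph: (parent, child, edge kind).
edges = [
    ("book", "title", "normal"),
    ("book", "author", "star"),
    ("author", "name", "normal"),
    ("article", "author", "star"),
]

result = inlinable_nodes(edges)
print(sorted(result))
```

Here title and name are inlinable, so they would become columns of their parent's table. The author node has two incoming edges and neither is normal, so it would get a table of its own.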

IS 257 – Fall 2006 2006.11.28- SLIDE 25

Inlined DTD graph

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 26

Generated Database Schema

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 27

Data Mapping

• An XML file is used to insert data into the generated database schema
• A parser is used to fetch data from the XML file.

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 28

Summary

• Simplify DTD
• Create DTD graph from simplified DTD
• Create inlined DTD graph from DTD graph
• Use inlined DTD graph to generate database schema
• Insert values from XML file into generated tables

Slide from Bhavin Kansara

IS 257 – Fall 2006 2006.11.28- SLIDE 29

Issues

• So, we can convert the XML to a relational database, but can we then export as an XML document?
  – This is equally challenging
• But MOSTLY involves just re-joining the tables
• How do you store and put back the wrapping tags for sets of subelements?
• Since the decomposition of the DTD was approximate, the output MAY not be identical to the input
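
Here is a rough sketch of the export direction, again with sqlite3 and ElementTree. The book and author tables and their contents are invented for the example. Note where the wrapping-tag problem shows up: the <authors> wrapper is not stored in any table, so the exporter has to hard-code it.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE author (id INTEGER PRIMARY KEY, book_id INTEGER, name TEXT);
INSERT INTO book VALUES (1, 'XML Basics');
INSERT INTO author VALUES (1, 1, 'A. Writer'), (2, 1, 'B. Editor');
""")

# Re-join the shredded tables, one result row per (book, author) pair.
query = """SELECT book.id, book.title, author.name
           FROM book LEFT JOIN author ON author.book_id = book.id
           ORDER BY book.id, author.id"""

root = ET.Element("books")
wrappers = {}
for book_id, title, name in conn.execute(query):
    if book_id not in wrappers:
        book_el = ET.SubElement(root, "book")
        ET.SubElement(book_el, "title").text = title
        # The <authors> wrapper exists in no table; it is hard-coded here.
        wrappers[book_id] = ET.SubElement(book_el, "authors")
    if name is not None:
        ET.SubElement(wrappers[book_id], "author").text = name

exported = ET.tostring(root, encoding="unicode")
print(exported)
```

Whitespace, comments and the original order of sibling elements are also lost unless they are stored explicitly, which is one reason the output may differ from the input.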

IS 257 – Fall 2006 2006.11.28- SLIDE 30

Lecture Outline

• XML and RDBMS
• Native XML Databases

IS 257 – Fall 2006 2006.11.28- SLIDE 31

Native XML Database (NXD)

• Native XML databases have an XML-based internal model
  – That is, their fundamental unit of storage is XML
• However, different native XML databases differ in what they consider the fundamental unit of storage
  – Document vs element or segment
• And how that information or its subelements are accessed, indexed and queried
  – E.g., SQL vs. XQuery or a special query language

IS 257 – Fall 2006 2006.11.28- SLIDE 32

Database Systems supporting XQuery

• The following database systems offer XQuery support:
  – Native XML Databases:
    • Berkeley DB XML
    • eXist
    • MarkLogic
    • Software AG Tamino
    • Raining Data TigerLogic
    • Documentum xDb (X-Hive/DB)
  – Relational Databases (also support SQL):
    • IBM DB2
    • Microsoft SQL Server
    • Oracle

IS 257 – Fall 2006 2006.11.28- SLIDE 33

Anatomy of a Native XML database

• The next set of slides (available on the class web site) come from George Feinberg of Sleepycat Software
  – Sleepycat is now part of Oracle

IS 257 – Fall 2006 2006.11.28- SLIDE 34

Further comments on NXD

• Native XML databases are most often used for storing “document-centric” XML documents
  – I.e. the unit of retrieval would typically be the entire document and not a particular node or subelement
• This supports query languages like XQuery
  – Able to ask for “all documents where the third chapter contains a page that has a boldfaced word”
  – Very difficult to do that kind of query in SQL
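
To get a feel for why this kind of query suits a tree model, here is a sketch in Python. It uses the small XPath subset built into ElementTree rather than a real XQuery engine, and both the two sample documents and the element names (book, chapter, page, b) are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents: only doc1 has bold text in its third chapter.
DOCS = {
    "doc1": "<book><chapter/><chapter/>"
            "<chapter><page>plain <b>bold</b></page></chapter></book>",
    "doc2": "<book><chapter><page><b>early</b></page></chapter><chapter/>"
            "<chapter><page>plain</page></chapter></book>",
}


def has_bold_in_third_chapter(xml_text):
    """True if the third <chapter> contains a <page> with a <b> descendant."""
    root = ET.fromstring(xml_text)
    return root.find("chapter[3]//page//b") is not None


matches = [doc_id for doc_id, text in DOCS.items()
           if has_bold_in_third_chapter(text)]
print(matches)
```

The whole condition is one path expression. After shredding into tables, the same question would need joins across chapter, page and markup tables plus some stored notion of element order.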

IS 257 – Fall 2006 2006.11.28- SLIDE 35

XML-Based IR - Cheshire II

• I thought I would take a little time to talk about how the Cheshire system (that I have been working on for nearly 20 years) uses XML, since it has some similarities (and many differences) to XML database systems
• Cheshire II (and Cheshire 3) are document-centric and involve parsing the XML for the purposes of indexing (and sometimes for retrieval of partial documents)

IS 257 – Fall 2006 2006.11.28- SLIDE 36

Cheshire II SGML/XML Support

• Underlying native format for all data is SGML or XML
• The DTD defines the file format for each file
• Full SGML/XML parsing
• SGML/XML Format Configuration Files define the database
• USMARC DTD and MARC to SGML conversion (and back again)
• Access to full-text via special SGML/XML tags

IS 257 – Fall 2006 2006.11.28- SLIDE 37

SGML/XML Support

• Example XML record for a DL document

<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>756</ID>
<ENTRY>June 12, 1996</ENTRY>
<DATE>June 1996</DATE>
<TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
<ORGANIZATION>University of California</ORGANIZATION>
<TYPE>report</TYPE>
<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PROJECT>SNEP</PROJECT>
<SERIES>Vol 3</SERIES>
<PAGES>40</PAGES>
<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>

IS 257 – Fall 2006 2006.11.28- SLIDE 38

SGML Support

• Example SGML/MARC Record

<USMARC Material="BK" ID="00000003">
<leader><LRL>00722</LRL><RecStat>n</RecStat>
<RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount>
<SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel>
<DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader>
<Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry>
<VarFlds>
<VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005>
<Fld008>830810 1983 nyu eng u</Fld008></VarCFlds>
<VarDFlds>
<NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010>
<Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035>
<Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode>
<MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty>
<Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles>
<EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250>
<Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt>
<PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc>
<Series></Series>
<Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes>
<SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...

IS 257 – Fall 2006 2006.11.28- SLIDE 39

SGML Support

• Mini-TREC document…

<DOC>
<DOCNO>FT931-3566</DOCNO>
<PROFILE>_AN-DCPCCAA3FT</PROFILE>
<DATE>930316</DATE>
<HEADLINE>FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda</HEADLINE>
<BYLINE> By ROBERT GRAHAM</BYLINE>
<TEXT>OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties...</TEXT>
<XX> …

IS 257 – Fall 2006 2006.11.28- SLIDE 40

…Companies:-</XX>
<CO>Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale.</CO>
<XX>Countries:-</XX>
<CN>ITZ Italy, EC.</CN>
<XX>Industries:-</XX>
<IN>P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC.</IN>
<XX>Types:-</XX> …

IS 257 – Fall 2006 2006.11.28- SLIDE 41

<TP>CMMT Comment &amp; Analysis.
GOVT Legal issues.
</TP>
<PUB>The Financial Times
</PUB>
<PAGE>
London Page 4
</PAGE>
</DOC>

IS 257 – Fall 2006 2006.11.28- SLIDE 42

SGML/XML Support

• Configuration files for the Server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

IS 257 – Fall 2006 2006.11.28- SLIDE 43

Cheshire Configuration Files

<!-- ******************************************************************* -->
<!-- ************************* TREC INTERACTIVE TEST DB **************** -->
<!-- ******************************************************************* -->
<!-- This is the config file for the Cheshire II TREC interactive Database -->
<DBCONFIG>
<DBENV>/projects/is240/GroupX/indexes </DBENV>

<!-- -->
<!-- TREC TEST DATABASE FILEDEF -->
<!-- -->

<!-- The Interactive TREC Financial Times datafile -->
<FILEDEF TYPE=SGML>
<DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH>

<!-- filetag is the "shorthand" name of the file -->
<FILETAG> trec </FILETAG>

<!-- filename is the full path name of the main data directory -->
<FILENAME> /projects/is240/ft </FILENAME>
<CONTINCLUDE> /projects/is240/ft.CONT </CONTINCLUDE>

<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD>

<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL> ft.assoc </ASSOCFIL>

<!-- history is the full path name of the file's history file -->
<HISTORY> cheshire_index/TESTDATA.history </HISTORY>
…

IS 257 – Fall 2006 2006.11.28- SLIDE 44

<!-- The following are the index definitions for the file -->
<INDEXES>

<!-- ******************************************************************* -->
<!-- ************************* DOC NO. ********************************* -->
<!-- ******************************************************************* -->
<!-- The following provides document number access. -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE PRIMARYKEY=IGNORE>
<INDXNAME> cheshire_index/trec.docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXMAP><USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 2 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>DOCNO </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>
…

IS 257 – Fall 2006 2006.11.28- SLIDE 45

<!-- ******************************************************************* -->
<!-- ************************* TOPIC *********************************** -->
<!-- ******************************************************************* -->
<!-- The following is the primary index for probabilistic searches -->
<!-- It includes headlines, datelines, bylines, and full text -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> cheshire_index/trec.topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP><USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP><USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
…
<STOPLIST> cheshire_index/topicstoplist </STOPLIST>
<INDXKEY>
<TAGSPEC>
<FTAG>HEADLINE </FTAG>
<FTAG>DATELINE </FTAG>
<FTAG>BYLINE </FTAG>
<FTAG>TEXT </FTAG>
</TAGSPEC>
</INDXKEY>
</INDEXDEF>

IS 257 – Fall 2006 2006.11.28- SLIDE 46

Cluster Definitions

<!-- ************************* CLUSTER ********************************* -->
<!-- *********************** DEFINITIONS ******************************* -->
<CLUSTER>
<clusname> classcluster </clusname>
<cluskey normal=CLASSCLUS>
  <tagspec><FTAG>FLD950 </FTAG> <s> ^a </s></tagspec>
</cluskey>
<stoplist> /usr3/cheshire2/data2/clasclusstoplist </stoplist>
<clusmap>
  <from> <tagspec>
    <ftag>FLD245</ftag><s>^[ab]</s>
    <ftag>FLD440</ftag><s>^a</s>
    <ftag>FLD490</ftag><s>^a</s>
    <ftag>FLD830</ftag><s>^a</s>
    <ftag>FLD740</ftag><s>^a</s>
  </tagspec></from>
  <to> <tagspec> <ftag>titles</ftag> </tagspec></to>
  <from> <tagspec> <ftag>FLD6..</ftag><s>^[abcdxyz]</s> </tagspec></from>
  <to> <tagspec> <ftag>subjects</ftag> </tagspec></to>
  <summarize> <maxnum> 5 </maxnum>
    <tagspec> <ftag>subjsum</ftag></tagspec>
  </summarize>
</clusmap>
</CLUSTER>

IS 257 – Fall 2006 2006.11.28- SLIDE 47

Component Definitions

<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> TESTDATA/COMPONENT_DB1 </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>mainenty </FTAG> <FTAG>titles </FTAG> </TAGSPEC></COMPSTARTTAG>
<COMPENDTAG> <TAGSPEC><FTAG>Fld300 </FTAG></TAGSPEC></COMPENDTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> TESTDATA/comp1index1.author…
</INDEXDEF>
</COMPONENTDEF>
</COMPONENTS>

IS 257 – Fall 2006 2006.11.28- SLIDE 48

Result Formatting (Display)

<DISPOPTIONS>KEEP_ENTITIES</DISPOPTIONS>
<DISPLAY>
  <FORMAT NAME="B" OID="1.2.840.10003.5.105" DEFAULT>
    <convert function="TAGSET-G">
      <clusmap>
        <from> <tagspec> <ftag>DOCNO</ftag> </tagspec></from>
        <to> <tagspec> <ftag>28</ftag> </tagspec></to>
        <from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from>
        <to> <tagspec> <ftag>5</ftag> </tagspec></to>
      </clusmap>
    </convert>
  </FORMAT>
</DISPLAY>

IS 257 – Fall 2006 2006.11.28- SLIDE 49

Indexing

• Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching.
  – SGML may include address of full-text for indexing
• New indexes can be easily added, or old ones deleted
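
To illustrate what a keyword index with term-frequency postings looks like, here is a toy sketch in Python. It is not Cheshire's implementation: the postings live in an in-memory dict rather than Berkeley DB, there is no stemming or stoplist, and the two records and the function name build_index are made up for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Two hypothetical records in the style of the TREC documents.
RECORDS = [
    "<DOC><DOCNO>D1</DOCNO><HEADLINE>Italy corruption scandal</HEADLINE>"
    "<TEXT>Corruption inquiry widens</TEXT></DOC>",
    "<DOC><DOCNO>D2</DOCNO><HEADLINE>Oil prices rise</HEADLINE>"
    "<TEXT>Oil and gas</TEXT></DOC>",
]


def build_index(records, tags):
    """Map each term to {docno: term frequency} over the chosen tags."""
    postings = defaultdict(dict)
    for record in records:
        root = ET.fromstring(record)
        docno = root.findtext("DOCNO")
        words = []
        for tag in tags:
            for element in root.iter(tag):
                text = (element.text or "").lower()
                words.extend(re.findall(r"[a-z0-9]+", text))
        for term, frequency in Counter(words).items():
            postings[term][docno] = frequency
    return postings


index = build_index(RECORDS, ["HEADLINE", "TEXT"])
print(index["corruption"], index["oil"])
```

The list of tags plays the role of the FTAG entries in an index definition: changing it changes which fields feed the index.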

IS 257 – Fall 2006 2006.11.28- SLIDE 50

Database Storage

• All data stored as SGML/XML flat text files plus optional linked full-text files
• File format is defined through SGML/XML DTD (also flat text file)
• “Associator” files provide indexed direct access to each record in SGML/XML files.
  – Contain offset and record length for each “record”
  – Associators can be built to index any conformant document in a directory sub-tree
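
The offset-and-length idea is simple enough to sketch in Python. This is an illustration of the concept only, not the actual associator file format: the entries are kept in a Python list, records are found by scanning for a closing tag, and the function names are invented.

```python
import io


def build_associator(data, end_tag):
    """Return a list of (offset, length) pairs, one per record in data."""
    entries = []
    start = 0
    while True:
        end = data.find(end_tag, start)
        if end == -1:
            break
        end += len(end_tag)
        # Skip whitespace between the previous record and this one.
        while start < end and data[start:start + 1].isspace():
            start += 1
        entries.append((start, end - start))
        start = end
    return entries


def fetch_record(stream, associator, record_number):
    """Seek straight to one record instead of scanning the whole file."""
    offset, length = associator[record_number]
    stream.seek(offset)
    return stream.read(length)


data = b"<DOC><DOCNO>D1</DOCNO></DOC>\n<DOC><DOCNO>D2</DOCNO></DOC>\n"
associator = build_associator(data, b"</DOC>")
second = fetch_record(io.BytesIO(data), associator, 1)
print(associator, second)
```

Because each entry is a fixed-size pair of numbers, record N can be located without parsing any of the records before it.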

IS 257 – Fall 2006 2006.11.28- SLIDE 51

Database Storage

[Diagram labels: Config File, DTD File, SGML/XML File, Associator Files, History File, Page Data File, Cluster File, Index Files, Postings File, Prox Data File, Remote RDBMS]

IS 257 – Fall 2006 2006.11.28- SLIDE 52

Client/Server Architecture

• Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support
• Client Supports:
  – Programmable (Tcl/Tk – Python soon) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML & MARC formatting
• Combined Client/Server CGI scripting via WebCheshire

IS 257 – Fall 2006 2006.11.28- SLIDE 53

Z39.50 Overview

[Diagram labels: UI, Map Query, Map Results, Internet, Search Engine]