Post on 19-Mar-2016
IS 257 – Fall 2006 2006.11.28- SLIDE 1
New Generation Database Systems: XML Databases
University of California, Berkeley School of Information
IS 257: Database Management
Lecture Outline
• XML and RDBMS
• Native XML Databases
Standards: XML/SQL
• As part of SQL3, an extension providing a mapping from XML to DBMS is being created, called XML/SQL (the published standard is named SQL/XML)
• The (draft) standard is very complex, but the ideas are actually pretty simple
• Suppose we have a table called EMPLOYEE that has columns EMPNO, FIRSTNAME, LASTNAME, BIRTHDATE, SALARY
Standards: XML/SQL
• That table can be mapped to:
<EMPLOYEE>
  <row>
    <EMPNO>000020</EMPNO>
    <FIRSTNAME>John</FIRSTNAME>
    <LASTNAME>Smith</LASTNAME>
    <BIRTHDATE>1955-08-21</BIRTHDATE>
    <SALARY>52300.00</SALARY>
  </row>
  <row> … etc. …
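The table-to-XML mapping shown on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the mapping algorithm defined by the standard: the function name `table_to_xml` is hypothetical, and the single sample row is the one from the slide.

```python
import xml.etree.ElementTree as ET

# Column names of the EMPLOYEE table from the slide.
COLUMNS = ["EMPNO", "FIRSTNAME", "LASTNAME", "BIRTHDATE", "SALARY"]


def table_to_xml(table_name, columns, rows):
    """Map a table to XML: one <row> per record, one child element per column."""
    root = ET.Element(table_name)
    for record in rows:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, record):
            ET.SubElement(row, column).text = str(value)
    return root


# The sample row shown on the slide.
rows = [("000020", "John", "Smith", "1955-08-21", "52300.00")]
employee_xml = ET.tostring(table_to_xml("EMPLOYEE", COLUMNS, rows), encoding="unicode")
print(employee_xml)
```

A real implementation would also have to generate an XML Schema for the table and deal with NULL values and data types, which this sketch ignores.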
Standards: XML/SQL
• In addition the standard says that XML Schemas must be generated for each table, and also allows relations to be managed by nesting records from tables in the XML.
• Variants of this are incorporated into the latest versions of ORACLE
• But what if you want to deal with more complex XML schemas (beyond “flat” structures)?
The following slides are adapted from:
XML to Relational Database Mapping
Bhavin Kansara
Slide from Bhavin Kansara
Introduction
• XML/relational mapping means data transformation between XML and relational data models
• XML documents can be transformed to relational data models or vice versa.
• The mapping method is the way the mapping is done
XML
• XML: Extensible Markup Language
• Documents have tags giving extra information about sections of the document
  – E.g. <title> XML </title>
  – <slide> Introduction </slide>
• XML has emerged as the standard for representing and exchanging data on the World Wide Web.
• The increasing number of XML documents creates a need to store and query them efficiently.
XML vs. HTML
• HTML tags describe how to render things on the screen, while XML tags describe what things are.
• HTML tags are designed for the interaction between humans and computers, while XML tags are designed for the interactions between two computers.
• Unlike HTML, XML tags tell you what the data means, rather than how to display it
<name>
  <first> abc </first>
  <middle> xyz </middle>
  <last> def </last>
</name>
<html>
  <head><title>Title of page</title></head>
  <body>abc <br>xyz <br>def <br></body>
</html>
XML Technologies
• Schema Languages: DTDs, XML Schemas
• Query Languages: XPath, XQuery, XSLT
• Programming APIs: DOM, SAX
<bib>
{
  for $b in doc("http://bstore1.example.com/bib.xml")/bib/book
  where $b/publisher = "Addison-Wesley" and $b/@year > 1991
  return
    <book year="{ $b/@year }">
      { $b/title }
    </book>
}
</bib>
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="simple.xsl"?>
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description> two of our famous Belgian Waffles </description>
    <calories>650</calories>
  </food>
</breakfast_menu>
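To make the XQuery example above more concrete, here is a rough Python equivalent using the DOM-like `xml.etree.ElementTree` API. It is a sketch under assumptions: the contents of `BIB_XML` are invented sample data (the real bib.xml is not shown in the slides), and the function name `recent_books` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample data standing in for bib.xml.
BIB_XML = """
<bib>
  <book year="1994"><title>Book A</title><publisher>Addison-Wesley</publisher></book>
  <book year="1990"><title>Book B</title><publisher>Addison-Wesley</publisher></book>
  <book year="2000"><title>Book C</title><publisher>Other Press</publisher></book>
</bib>
"""


def recent_books(bib_root, publisher, after_year):
    """Mimic the FLWOR query: filter books, return <book year=...><title/></book>."""
    result = ET.Element("bib")
    for book in bib_root.findall("book"):                      # for $b in .../bib/book
        same_publisher = book.findtext("publisher") == publisher
        if same_publisher and int(book.get("year")) > after_year:  # where ...
            out = ET.SubElement(result, "book", year=book.get("year"))
            out.append(book.find("title"))                     # return { $b/title }
    return result


result = recent_books(ET.fromstring(BIB_XML), "Addison-Wesley", 1991)
result_xml = ET.tostring(result, encoding="unicode")
print(result_xml)
```

The XQuery version states the same filter declaratively in three lines, which is the point of having a dedicated query language.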
DTD ( Document Type Definition )
• DTD stands for Document Type Definition
• The purpose of a Document Type Definition is to define the legal building blocks of an XML document.
• It formally defines the relationships between the various elements that form the documents.
• A DTD allows computers to check that each component of a document occurs in a valid place within the document.
XML vs. Relational Database
CUSTOMER
Name  Age
ABC   30
XYZ   40
<customers>
  <custRec>
    <custName type="String">ABC</custName>
    <custAge type="Integer">30</custAge>
  </custRec>
  <custRec>
    <custName type="String">XYZ</custName>
    <custAge type="Integer">40</custAge>
  </custRec>
</customers>
XML vs. Relational Database
<!ELEMENT note (to+, from, header, message*, #PCDATA)>
When XML representation is not beneficial
• When downstream processing of the data is relational
• When the highest possible performance is required
• When any normalized data components have value outside the XML representation or the data need not be retained in XML form to have value
• When the data is naturally tabular
When XML representation is beneficial
• When schema is volatile
• When data is inherently hierarchical in nature
• When data represents business objects in which the component parts do not make sense when removed from the context of that business object
• When applications have sparse attributes
• When low-volume data is highly structured
XML-to-Relational mapping
• Schema mapping
  Database schema is generated from an XML schema or DTD for the storage of XML documents.
• Data mapping
  Shreds an input XML document into relational tuples and inserts them into the relational database whose schema is generated in the schema mapping phase
[Figure: Schema Mapping]
[Figure: Simplifying DTD]
[Figure: DTD graph]
Inlined DTD graph
• Given a DTD graph, a node is inlinable if and only if it has exactly one incoming edge and that edge is a normal edge.
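The inlinable-node rule can be checked mechanically once the DTD graph is written down as a list of edges. The sketch below assumes a simple edge representation, `(parent, child, kind)` with `kind` being `"normal"` or `"star"` (for repeated children); both that encoding and the small example graph are assumptions made for illustration, since the original figure is not reproduced here.

```python
from collections import defaultdict


def inlinable_nodes(edges):
    """Return nodes with exactly one incoming edge, where that edge is normal.

    `edges` is an iterable of (parent, child, kind) tuples; `kind` is
    "normal" or "star" (an assumed encoding of the DTD graph).
    """
    incoming = defaultdict(list)
    for _parent, child, kind in edges:
        incoming[child].append(kind)
    return {node for node, kinds in incoming.items() if kinds == ["normal"]}


# Hypothetical DTD graph: books and articles share authors and years.
example_edges = [
    ("book", "title", "normal"),
    ("book", "author", "star"),
    ("article", "author", "star"),
    ("author", "name", "normal"),
    ("book", "year", "normal"),
    ("article", "year", "normal"),
]
inlinable = inlinable_nodes(example_edges)
print(sorted(inlinable))
```

In this example `title` and `name` can be folded into their parent's table, while `author` (star edges) and `year` (two parents) cannot.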
[Figure: Inlined DTD graph]
[Figure: Generated Database Schema]
Data Mapping
• XML file is used to insert data into generated database schema
• Parser is used to fetch data from XML file.
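A minimal sketch of the data-mapping step, assuming the schema-mapping phase has already produced a single `customer` table: a parser reads the XML and each record is "shredded" into one tuple. The table layout, the element names `custRec`, `custName` and `custAge`, and the function name `shred_customers` are assumptions for illustration; SQLite is used only because it ships with Python.

```python
import sqlite3
import xml.etree.ElementTree as ET

CUSTOMERS_XML = """
<customers>
  <custRec><custName>ABC</custName><custAge>30</custAge></custRec>
  <custRec><custName>XYZ</custName><custAge>40</custAge></custRec>
</customers>
"""


def shred_customers(xml_text, connection):
    """Parse the XML and insert one relational tuple per <custRec>."""
    # Assumed output of the schema-mapping phase.
    connection.execute("CREATE TABLE customer (name TEXT, age INTEGER)")
    root = ET.fromstring(xml_text)
    tuples = [
        (rec.findtext("custName"), int(rec.findtext("custAge")))
        for rec in root.findall("custRec")
    ]
    connection.executemany("INSERT INTO customer (name, age) VALUES (?, ?)", tuples)
    return len(tuples)


connection = sqlite3.connect(":memory:")
inserted = shred_customers(CUSTOMERS_XML, connection)
stored_rows = connection.execute(
    "SELECT name, age FROM customer ORDER BY name"
).fetchall()
print(stored_rows)
```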
Summary
• Simplify DTD
• Create DTD graph from simplified DTD
• Create inlined DTD graph from DTD graph
• Use inlined DTD graph to generate database schema
• Insert values from XML file into generated tables
Issues
• So, we can convert the XML to a relational database, but can we then export as an XML document?
  – This is equally challenging
    • But MOSTLY involves just re-joining the tables
    • How do you store and put back the wrapping tags for sets of subelements?
    • Since the decomposition of the DTD was approximate, the output MAY not be identical to the input
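The export direction can be sketched as follows. The two-table schema (`book`, `author`), the wrapper element `<authors>` and the function name `export_books` are all hypothetical; the sketch looks up child rows per parent rather than issuing one SQL join, which keeps it short. Note that the `<authors>` wrapper tag has to be supplied by the export code, because nothing in the tables records it.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_books(connection):
    """Rebuild nested XML from a parent table and a child table."""
    root = ET.Element("books")
    books = connection.execute("SELECT id, title FROM book ORDER BY id").fetchall()
    for book_id, title in books:
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "title").text = title
        # Wrapper tag for the set of subelements: not stored in any table.
        authors = ET.SubElement(book, "authors")
        query = "SELECT name FROM author WHERE book_id = ? ORDER BY position"
        for (name,) in connection.execute(query, (book_id,)):
            ET.SubElement(authors, "author").text = name
    return root


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)")
connection.execute("CREATE TABLE author (book_id INTEGER, position INTEGER, name TEXT)")
connection.execute("INSERT INTO book VALUES (1, 'T1')")
connection.executemany(
    "INSERT INTO author VALUES (?, ?, ?)", [(1, 2, "B"), (1, 1, "A")]
)
exported_xml = ET.tostring(export_books(connection), encoding="unicode")
print(exported_xml)
```

The `position` column is what preserves the original order of the authors; without something like it, document order is lost in the relational form.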
Native XML Database (NXD)
• Native XML databases have an XML-based internal model
  – That is, their fundamental unit of storage is XML
• However, different native XML databases differ in what they consider the fundamental unit of storage
  – Document vs element or segment
• And how that information or its subelements are accessed, indexed and queried
  – E.g., SQL vs. XQuery or a special query language
Database Systems supporting XQuery
• The following database systems offer XQuery support:
  – Native XML Databases:
    • Berkeley DB XML
    • eXist
    • MarkLogic
    • Software AG Tamino
    • Raining Data TigerLogic
    • Documentum xDb (X-Hive/DB)
  – Relational Databases (also support SQL):
    • IBM DB2
    • Microsoft SQL Server
    • Oracle
Anatomy of a Native XML database
• The next set of slides (available on the class web site) come from George Feinberg of Sleepycat Software
  – Sleepycat is now part of Oracle
Further comments on NXD
• Native XML databases are most often used for storing “document-centric” XML documents
  – I.e. the unit of retrieval would typically be the entire document and not a particular node or subelement
• This supports query languages like XQuery
  – Able to ask for “all documents where the third chapter contains a page that has a boldfaced word”
  – Very difficult to do that kind of query in SQL
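To illustrate why this kind of query suits a document-centric store, here is a small Python sketch that evaluates the "third chapter contains a page with a boldfaced word" condition directly on the document tree. The element names `chapter`, `page` and `b`, the two sample documents, and the function name are assumptions; a native XML database would express the same condition as a path expression in its query language.

```python
import xml.etree.ElementTree as ET


def third_chapter_has_bold_page(document_root):
    """True if the third <chapter> child contains a <page> with a <b> inside."""
    chapters = document_root.findall("chapter")
    if len(chapters) < 3:
        return False
    return any(
        page.find(".//b") is not None for page in chapters[2].iter("page")
    )


# Invented sample documents.
DOCUMENTS = {
    "doc1": "<book><chapter/><chapter/>"
            "<chapter><page>plain <b>bold</b></page></chapter></book>",
    "doc2": "<book><chapter><page><b>bold</b></page></chapter><chapter/>"
            "<chapter><page>plain</page></chapter></book>",
}
matches = [
    name
    for name, text in DOCUMENTS.items()
    if third_chapter_has_bold_page(ET.fromstring(text))
]
print(matches)
```

The condition depends on element order ("third chapter") and nesting depth, both of which a relational decomposition would have to store and join explicitly.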
XML-Based IR - Cheshire II
• I thought I would take a little time to talk about how the Cheshire system (that I have been working on for nearly 20 years) uses XML, since it has some similarities (and many differences) to XML database systems
• Cheshire II (and Cheshire 3) are document-centric and involve parsing the XML for the purposes of indexing (and sometimes for retrieval of partial documents)
Cheshire II SGML/XML Support
• Underlying native format for all data is SGML or XML
• The DTD defines the file format for each file
• Full SGML/XML parsing
• SGML/XML Format Configuration Files define the database
• USMARC DTD and MARC to SGML conversion (and back again)
• Access to full-text via special SGML/XML tags
SGML/XML Support
• Example XML record for a DL document
<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>756</ID>
<ENTRY>June 12, 1996</ENTRY>
<DATE>June 1996</DATE>
<TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>
<ORGANIZATION>University of California</ORGANIZATION>
<TYPE>report</TYPE>
<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PROJECT>SNEP</PROJECT>
<SERIES>Vol 3</SERIES>
<PAGES>40</PAGES>
<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>
SGML Support
• Example SGML/MARC Record
<USMARC Material="BK" ID="00000003">
<leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec>
<EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader>
<Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry>
<VarFlds>
<VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds>
<VarDFlds>
<NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode>
<MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty>
<Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles>
<EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt>
<PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc>
<Series></Series>
<Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes>
<SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...
SGML Support
• Mini-TREC document…
<DOC>
<DOCNO>FT931-3566</DOCNO>
<PROFILE>_AN-DCPCCAA3FT</PROFILE>
<DATE>930316</DATE>
<HEADLINE>FT 16 MAR 93 / Italy's Corruption Scandal: Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda</HEADLINE>
<BYLINE> By ROBERT GRAHAM</BYLINE>
<TEXT>OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties...</TEXT>
<XX> …
…Companies:-</XX>
<CO>Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale.</CO>
<XX>Countries:-</XX>
<CN>ITZ Italy, EC.</CN>
<XX>Industries:-</XX>
<IN>P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC.</IN>
<XX>Types:-</XX> …
…
<TP>CMMT Comment & Analysis.
GOVT Legal issues.
</TP>
<PUB>The Financial Times
</PUB>
<PAGE>
London Page 4
</PAGE>
</DOC>
SGML/XML Support
• Configuration files for the Server are also SGML/XML:
  – They include tags describing all of the data files and indexes for the database.
  – They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
Cheshire Configuration Files
<!-- ******************************************************************* -->
<!-- ************************* TREC INTERACTIVE TEST DB **************** -->
<!-- ******************************************************************* -->
<!-- This is the config file for the Cheshire II TREC interactive Database -->
<DBCONFIG>
<DBENV>/projects/is240/GroupX/indexes </DBENV>
<!-- -->
<!-- TREC TEST DATABASE FILEDEF -->
<!-- -->
<!-- The Interactive TREC Financial Times datafile -->
<FILEDEF TYPE=SGML>
<DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH>
<!-- filetag is the "shorthand" name of the file -->
<FILETAG> trec </FILETAG>
<!-- filename is the full path name of the main data directory -->
<FILENAME> /projects/is240/ft </FILENAME>
<CONTINCLUDE> /projects/is240/ft.CONT </CONTINCLUDE>
<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD>
<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL> ft.assoc </ASSOCFIL>
<!-- history is the full path name of the file's history file -->
<HISTORY> cheshire_index/TESTDATA.history </HISTORY>
…
<!-- The following are the index definitions for the file -->
<INDEXES>
<!-- ******************************************************************* -->
<!-- ************************* DOC NO. ********************************* -->
<!-- ******************************************************************* -->
<!-- The following provides document number access. -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE PRIMARYKEY=IGNORE>
<INDXNAME> cheshire_index/trec.docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXMAP><USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 2 </struct> </INDXMAP>
<INDXMAP><USE> 12 </USE><struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>DOCNO </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>
…
<!-- ******************************************************************* -->
<!-- ************************* TOPIC *********************************** -->
<!-- ******************************************************************* -->
<!-- The following is the primary index for probabilistic searches -->
<!-- It includes headlines, datelines, bylines, and full text -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> cheshire_index/trec.topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP><USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP><USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
…
<STOPLIST> cheshire_index/topicstoplist </STOPLIST>
<INDXKEY><TAGSPEC><FTAG>HEADLINE </FTAG><FTAG>DATELINE </FTAG><FTAG>BYLINE </FTAG><FTAG>TEXT </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>
Cluster Definitions
<!-- ************************* CLUSTER ********************************* -->
<!-- *********************** DEFINITIONS ******************************* -->
<CLUSTER>
<clusname> classcluster </clusname>
<cluskey normal=CLASSCLUS>
  <tagspec><FTAG>FLD950 </FTAG> <s> ^a </s></tagspec>
</cluskey>
<stoplist> /usr3/cheshire2/data2/clasclusstoplist </stoplist>
<clusmap>
  <from> <tagspec>
    <ftag>FLD245</ftag><s>^[ab]</s>
    <ftag>FLD440</ftag><s>^a</s>
    <ftag>FLD490</ftag><s>^a</s>
    <ftag>FLD830</ftag><s>^a</s>
    <ftag>FLD740</ftag><s>^a</s>
  </tagspec></from>
  <to> <tagspec> <ftag>titles</ftag> </tagspec></to>
  <from> <tagspec> <ftag>FLD6..</ftag><s>^[abcdxyz]</s> </tagspec></from>
  <to> <tagspec> <ftag>subjects</ftag> </tagspec></to>
  <summarize> <maxnum> 5 </maxnum>
    <tagspec> <ftag>subjsum</ftag></tagspec></summarize>
</clusmap>
</CLUSTER>
Component Definitions
<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> TESTDATA/COMPONENT_DB1 </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>mainenty </FTAG> <FTAG>titles </FTAG> </TAGSPEC></COMPSTARTTAG>
<COMPENDTAG> <TAGSPEC><FTAG>Fld300 </FTAG></TAGSPEC></COMPENDTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> TESTDATA/comp1index1.author…
</INDEXDEF>
</COMPONENTDEF>
</COMPONENTS>
Result Formatting (Display)
<DISPOPTIONS>KEEP_ENTITIES</DISPOPTIONS>
<DISPLAY>
<FORMAT NAME="B" OID="1.2.840.10003.5.105" DEFAULT>
  <convert function="TAGSET-G">
    <clusmap>
      <from> <tagspec> <ftag>DOCNO</ftag> </tagspec></from>
      <to> <tagspec> <ftag>28</ftag> </tagspec></to>
      <from> <tagspec> <ftag>#DOCID#</ftag> </tagspec></from>
      <to> <tagspec> <ftag>5</ftag> </tagspec></to>
    </clusmap>
  </convert>
</FORMAT>
</DISPLAY>
Indexing
• Any SGML/XML tagged field or attribute can be indexed:
  – B-Tree and Hash access via Berkeley DB (Sleepycat)
  – Stemming, keyword, exact keys and “special keys”
  – Mapping from any Z39.50 Attribute combination to a specific index
  – Underlying postings information includes term frequency for probabilistic searching.
  – SGML may include address of full-text for indexing
• New indexes can be easily added, or old ones deleted
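The idea of keyword extraction from tagged fields with term-frequency postings can be sketched as below. This is not Cheshire's implementation: it keeps the postings in a plain in-memory dictionary instead of Berkeley DB, does no stemming or stoplist filtering, and the function name `build_postings` and the two sample records are invented for illustration.

```python
import re
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET


def build_postings(records, tags):
    """Index the text of the given tags: term -> {record_id: term frequency}."""
    postings = defaultdict(dict)
    for record_id, xml_text in records.items():
        root = ET.fromstring(xml_text)
        counts = Counter()
        for tag in tags:
            for element in root.iter(tag):
                text = "".join(element.itertext()).lower()
                counts.update(re.findall(r"[a-z0-9]+", text))
        for term, frequency in counts.items():
            postings[term][record_id] = frequency
    return postings


# Invented sample records using tag names from the TREC example.
records = {
    1: "<DOC><HEADLINE>Italy scandal</HEADLINE><TEXT>The scandal widens</TEXT></DOC>",
    2: "<DOC><HEADLINE>Oil prices</HEADLINE><TEXT>Oil and gas</TEXT></DOC>",
}
postings = build_postings(records, tags=["HEADLINE", "TEXT"])
print(postings["scandal"], postings["oil"])
```

The list of tags plays the role of the `<FTAG>` entries in an index definition: changing it changes which fields feed the index.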
Database Storage
• All data stored as SGML/XML flat text files plus optional linked full-text files
• File format is defined through SGML/XML DTD (also flat text file)
• “Associator” files provide indexed direct access to each record in SGML/XML files.
  – Contain offset and record length for each “record”
  – Associators can be built to index any conformant document in a directory sub-tree
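The associator idea (an offset and a length per record, giving direct access into a flat file) can be illustrated with the following sketch. The actual on-disk format of Cheshire's associator files is not described in these slides, so the in-memory list of `(offset, length)` pairs, the `<DOC>` record delimiter, and the function names are assumptions.

```python
import io

START_TAG = b"<DOC>"
END_TAG = b"</DOC>"


def build_associator(data):
    """Return a list of (offset, length) pairs, one per <DOC> record in `data`."""
    entries = []
    position = 0
    while True:
        start = data.find(START_TAG, position)
        if start == -1:
            break
        end = data.find(END_TAG, start)
        if end == -1:
            break
        end += len(END_TAG)
        entries.append((start, end - start))
        position = end
    return entries


def fetch_record(stream, entries, record_number):
    """Seek straight to a record using its associator entry."""
    offset, length = entries[record_number]
    stream.seek(offset)
    return stream.read(length)


data = b"<DOC><DOCNO>A</DOCNO></DOC>\n<DOC><DOCNO>B</DOCNO></DOC>\n"
entries = build_associator(data)
second_record = fetch_record(io.BytesIO(data), entries, 1)
print(entries, second_record)
```

With such a table, retrieving record N costs one seek and one read, no matter how large the flat file is.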
Database Storage
[Diagram components: Associator File, Page Data File, SGML/XML File, History File, DTD File, Cluster File, Postings File, Index Files, Remote RDBMS, Config File, Prox Data File]
Client/Server Architecture
• Server Supports:
  – Database storage
  – Indexing
  – Z39.50 access to local data
  – Boolean and Probabilistic Searching
  – Relevance Feedback
  – External SQL database support
• Client Supports:
  – Programmable (Tcl/Tk – Python soon) Graphical User Interface
  – Z39.50 access to remote servers
  – SGML & MARC formatting
• Combined Client/Server CGI scripting via WebCheshire
Z39.50 Overview
[Diagram components: UI, Map Query, Internet, Map Results, Search Engine]