Native XML Databases for Information Systems Chris Wallace XQuery workshop April 2006.

Native XML Databases for Information Systems

Chris Wallace

XQuery workshop

April 2006

Chris Wallace, UWE, Bristol

Exploring the design space

• Native XML database (NXD)
  – Storing, querying and updating XML documents without mapping into relations
  – Schema-free
  – Trees are to NXD what tables are to RDBMS
  – Tables are trees
• Information Systems
  – Focus on semi-structured data (mixture of simple data items, text and complex nested structures)
  – Searching, derived data, visualisation
  – Process support
  – Large problem space variously supported by spreadsheets, Word documents, ad-hoc databases, increasingly web-integrated data
• “design as a conversation with the materials in the situation” (Schön)

Solution: eXist Native XML Database

• eXist
  – Open source Java
  – European team of developers led by Wolfgang Meier
  – Under development for several years, mature except for documentation
• Supports
  – XQuery
  – XUpdate
  – XSLT
  – Free-text searching
  – XQuery extensions to allow complete applications to be developed
• Documents (files) are organised in collections (folders) in a file store
  – XML documents stored in an efficient B+ tree structure with indexes
  – Non-XML resources (XQuery, CSS, JPEG, etc.) can be stored as binary
• Deployable in different ways
  – Embedded in a Java application
  – Part of a Cocoon pipeline
  – As web application in Apache/Tomcat
  – With embedded Jetty HTTP server
• Multiple interfaces
  – REST (to Java servlet)
  – SOAP
  – XML-RPC

Sample Implementations

• Family photos and history
  – Integration of meta-data on family photos with family history (births, deaths and marriages) and Google Earth
• FOLD
  – modules, programmes, scheme operations, staff, organisational structures, events
• Other demos on the eXist demo site

FOLD – Faculty OnLine Data

• Operations at student level (2000 in CEMS) supported by central systems (student records, finance)
• FOLD Scope – teaching and assessment management and organisational knowledge
  – Modules [450] and their specification
  – Programmes (Courses) [100] and their structures
  – Operations – Runs, Coursework, exams
  – Staff (300+)
  – Organisational structure (100)
  – Events
• Information currently distributed over Word documents, spreadsheets, Access databases, SQL database, flat text files, LDAP
• Aims
  – To support distributed data ownership
  – To provide a web of data within and between systems
  – To support organisational processes
  – To improve data veracity

FOLD Entity Types

Entity Type              | Identifier                       | No of instances | Document Map            | No of documents / year | Document Type
Module Specification     | ModuleCode/Version               | 450             | one each                | 450 (40)               | complex structure
Module Run               | ModuleCode/Year/Runno            | 460             | one per field/year      | 6                      | table
Module assessments       | ModuleCode/Year/Runno/ElementNo  | 800             | one per field/year      | 6                      | table
Examination              | ModuleCode/Year/Exam             | 420             | one per year            | 1                      |
Student numbers          | Date/ModuleCode                  | 450 * 4         | one per date            | 5                      | table
Award types              | PrimaryAward                     | 8               | one only                | 1                      | simple structure
Programmes               | ProgrammeCode/Year               | 100             | one per year            | 1                      | table
Programme Structure      | ProgrammeCode/Pathway/Version    | 110             | one each                | 110 (20)               | complex structure
Organisational structure | GroupName                        | 100             | several per major group | 60                     | simple structure
Events                   | EventGroup/EventID               | 300             | all events in a group   | 50                     | simple structure
Staff                    | Name                             | 400             | per responsibility      | 5                      | table with reps
Training                 | Name/Course                      | 200             | one only                | 1                      | table
Training Courses         | Course                           | 40              | one only                | 1                      | table
UCAS Key words           | UCASCode/Keyword                 | 4000            | one only                | 1                      | table
UWE calendar             | Date                             | 365             | one only                | 1                      | table
SuggestedHours           | Level                            | 5               | one only                | 1                      | simple structure
Entity Type metadata     | DatasetName                      | 20              | one only                | 1                      | table
System Configuration     | Faculty                          | 1               | one only                | 1                      | table

FOLD current stats

• Code
  – XQuery - 3000
  – XSLT - 3000
  – XSD - 300 (one schema)
  – CSS - 200
  – PHP - 10 (vcal)
• Pages
  – about 25 user
  – Only 1 admin as yet
• Information System development
  – CW (4 months)
  – Placement Student (8 months)
  – Phase allocation:
    • Project (20%)
    • Code (20%)
    • Data – gathering, conversion, cleaning (60%)

FOLD - Modules and Programmes

[UML class diagram: classes, attributes and role names as recoverable from the figure]

• Module
  – moduleCode : String
• Module Specification
  – version : Year
  – faculty : Faculty
  – field : Field
  – title : String
  – credits : CreditsType
  – level : LevelType
  – syllabus : RestrictedHTML
  – readingStrategy : RestrictedHTML
• Programme
  – programmeCode : String
  – ucasCode : String [0..1]
• ProgrammeStructure
  – version : Year
• Stage
• OptionGroup
  – id : String
  – comment : String [0..1]
  – minCredits : int
  – maxCredits : int
• Core
• Option
• Module Combination
  – comment : String
  – Note on the diagram: this is a boolean expression such as (m1 and m2 and (m4 or (m5 and m6)))
• Learning Outcome
  – assessed in Comp A : boolean
  – assessed in Comp B : boolean
  – specification : RestrictedHTML
  – outcomeType : Learning Outcome
• Reading item
  – Book
    • authors : String
    • title : String
    • year : String
    • source : String
  – WebSite
    • url : URL
    • text : String
• Association role names: definition, structure, core, optional, pre-requisite, co-requisite, expression, Excluded
• Multiplicities used: 1..1, 0..1, 1..*, 1..* {ordered}
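
The diagram's note describes a pre-requisite as a boolean expression over modules, such as (m1 and m2 and (m4 or (m5 and m6))). A minimal sketch of how such an expression could be held as an XML tree and evaluated with a recursive XQuery function follows. The element names allOf, anyOf and module and the function local:satisfied are hypothetical, chosen for illustration; they are not taken from FOLD.

```xquery
xquery version "1.0";

(: Hypothetical representation: allOf = "and", anyOf = "or", module = a leaf. :)
(: Returns true if the module codes in $passed satisfy the expression tree.   :)
declare function local:satisfied($expr as element(), $passed as xs:string*) as xs:boolean {
  if ($expr/self::module) then
    string($expr) = $passed
  else if ($expr/self::allOf) then
    every $e in $expr/* satisfies local:satisfied($e, $passed)
  else if ($expr/self::anyOf) then
    some $e in $expr/* satisfies local:satisfied($e, $passed)
  else
    false()
};

(: (m1 and m2 and (m4 or (m5 and m6))) :)
let $expr :=
  <allOf>
    <module>m1</module>
    <module>m2</module>
    <anyOf>
      <module>m4</module>
      <allOf><module>m5</module><module>m6</module></allOf>
    </anyOf>
  </allOf>
return (
  local:satisfied($expr, ("m1", "m2", "m4")),   (: true :)
  local:satisfied($expr, ("m1", "m2", "m5"))    (: false: m6 is missing :)
)
```

Holding the expression as a nested tree keeps it inside the same document as the rest of the specification, which a flat foreign-key table could not do directly.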

Areas for attention

• Conceptual Modelling
  – Identifiers
  – Relationships and links
  – Versioning
• Logical Modelling (in XML)
  – Element/attribute
  – Views
  – Validation
• Physical layer (in NXD)
  – Structuring documents and collections
  – Mapping to editors
  – Responsibilities
• Programming
  – Functional allocation between tiers
  – Views and constructed elements
  – Integrity
  – XQuery programming
• User interface
  – Editing
  – Long transactions
• Development Process
  – Case Tool requirements
• Scope of application of NXD

Conceptual Modelling

• Conventional normalised data model
  – EAR ++
    • Entity (not XML entities like &amp;)
    • Attribute (multi-valued)
    • Relationships
      – Association
      – Composition
  – Object Orientation?
    • methods are mainly getters (of derived values)
    • Inheritance only useful in the schema domain
    • Instance inheritance more useful in IS
  – Expressivity Problems
    • Identifiers
    • Order of parts
    • Verbosity
• ? Conceptual Scope
  – Edit trails, versioning, activity tracking
• Generality problem
  – Roles as Attributes
    • <ModuleLeader>Stewart Green</ModuleLeader>
  – Roles as Entities
    • <role><title>Module Leader</title><person>Stewart Green</person></role>
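
The generality trade-off between the two role styles shows up in the queries they allow. A small sketch, assuming a hypothetical ModuleRun wrapper element and an invented second role for illustration:

```xquery
xquery version "1.0";

(: Hypothetical sample data: the same fact recorded in the two styles. :)
let $asAttribute :=
  <ModuleRun>
    <ModuleLeader>Stewart Green</ModuleLeader>
  </ModuleRun>
let $asEntity :=
  <ModuleRun>
    <role><title>Module Leader</title><person>Stewart Green</person></role>
    <role><title>Internal Moderator</title><person>A N Other</person></role>
  </ModuleRun>
return
  <result>
    <specific>{ data($asAttribute/ModuleLeader) }</specific>
    <generic>{ data($asEntity/role[title = "Module Leader"]/person) }</generic>
    <allRoles>{
      for $r in $asEntity/role
      return <held by="{ $r/person }">{ data($r/title) }</held>
    }</allRoles>
  </result>
```

With roles as attributes the role name is fixed in the path, so each new role needs a new element name and a new query. With roles as entities a single query can list every role, at the cost of a more verbose document.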

Identifiers

• Principle adopted – use naturally occurring identifiers wherever possible
  – Persons : “Chris Wallace”
  – Rooms : “3P14”
• Yes
  – Reduces gap between Real World (RW) domain and system
  – Names in minutes of meetings, on spreadsheets are readable
• No
  – Duplicates
    • Duplicates not tolerable in the RW either, resolved through RW negotiation within a RW namespace e.g. the Faculty
    • Mergers generate duplicates
  – Aliases
  – Not all entities have unique domain identifiers
    • Gives rise to confusion in the problem domain and should be resolved there
• Po
  – All names need namespace – “Chris Wallace” at CEMS at UWE
  – Need to replace multiple naming conventions with a single naming scheme (e.g. initials)
  – URNs and semantic web

Conceptual to Logical

• Attributes v elements
• Relationships
• Integrity
• Views

Attributes v elements

• E.g.
  – <Module code="UFIEKG-20-3" level="3">…
  – <Module><ModuleCode>UFIEKG-20-3</ModuleCode>
• What criteria to use?
  – Attributes as ‘meta’ is vague
  – FOLD uses only elements

Relationships

• Implementing Relationships
  – One – Many
    • RDBMS – primary key on the One side becomes foreign key on the Many side
    • NXD – choose which side on the basis of complexity and responsibility
      – Sequence (modules in a stage)
      – Complex (pre-requisite expression)
  – Many-Many
    • RDBMS – intersection table
    • NXD – as for one-many, or either side as appropriate – e.g. Groups and subgroups
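
As an illustration of holding a relationship on the "one" side, the sketch below stores the ordered list of module codes inside a stage and navigates it in both directions. The element names (Modules, StageNo, Core) and the sample data are assumptions for illustration, not the FOLD schema.

```xquery
xquery version "1.0";

(: Hypothetical data: the programme structure owns the ordered module list. :)
let $specs :=
  <Modules>
    <Module><ModuleCode>M1</ModuleCode><Title>Databases</Title></Module>
    <Module><ModuleCode>M2</ModuleCode><Title>Web Programming</Title></Module>
  </Modules>
let $programme :=
  <ProgrammeStructure>
    <Stage>
      <StageNo>1</StageNo>
      <Core><ModuleCode>M2</ModuleCode><ModuleCode>M1</ModuleCode></Core>
    </Stage>
  </ProgrammeStructure>
return
  <result>
    <modulesInStage>{
      (: forward: follow the codes in document order, so the sequence is preserved :)
      for $code in $programme/Stage[StageNo = "1"]//ModuleCode
      return $specs/Module[ModuleCode = $code]/Title
    }</modulesInStage>
    <stagesUsingM1>{
      (: inverse: no foreign key on the module side is needed :)
      data($programme/Stage[.//ModuleCode = "M1"]/StageNo)
    }</stagesUsingM1>
  </result>
```

Because the list lives in the stage, its order is simply document order, which a foreign key on the "many" side would have to encode with an extra sequence column.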

Integrity

• Structural integrity
  – Schema validation too weak and too restrictive
  – NXD stores any well-formed XML
• Referential Integrity
  – RDBMS – ‘eager’
    • data not allowed in unless valid, updates maintain integrity
    • integrity failures transient, repair outside database
  – NXD – ‘lazy’
    • store the data and provide on-demand or on-trigger validation
    • Integrity failures can be persisted (XLinkit) and repair is inside database
• Identifier Uniqueness
  – XML ids only checked within a document
  – NXD stores all XML nodes with internal identifiers
• For Information Systems, veracity of the model is what’s important
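
A minimal sketch of what "lazy" checking might look like: the data is already stored, and a query run on demand reports dangling references and duplicate identifiers as an XML report that could itself be stored. The sample data and the report element names are hypothetical.

```xquery
xquery version "1.0";

(: Hypothetical stored data, inlined here so the query is self-contained. :)
let $specs :=
  <Modules>
    <Module><ModuleCode>M1</ModuleCode></Module>
    <Module><ModuleCode>M2</ModuleCode></Module>
    <Module><ModuleCode>M2</ModuleCode></Module>
  </Modules>
let $runs :=
  <ModuleRuns>
    <ModuleRun><ModuleCode>M1</ModuleCode><Year>2006</Year><RunNo>1</RunNo></ModuleRun>
    <ModuleRun><ModuleCode>M9</ModuleCode><Year>2006</Year><RunNo>1</RunNo></ModuleRun>
  </ModuleRuns>
return
  <IntegrityReport>{
    (: referential check: every run must name an existing specification :)
    for $run in $runs/ModuleRun
    where empty($specs/Module[ModuleCode = $run/ModuleCode])
    return <DanglingReference>{ $run/ModuleCode, $run/Year, $run/RunNo }</DanglingReference>
  }{
    (: uniqueness check across elements, which XML ids alone would not cover :)
    for $code in distinct-values($specs/Module/ModuleCode)
    where count($specs/Module[ModuleCode = $code]) > 1
    return <DuplicateIdentifier>{ $code }</DuplicateIdentifier>
  }</IntegrityReport>
```

On this sample the report contains one DanglingReference (for M9) and one DuplicateIdentifier (for M2).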

Logical to Physical layers

• What criteria to use in allocation of logical units to the physical layer:
  – Documents – a physical aggregation of entity instances
  – Collections – a physical aggregation of documents
• Examples
  – Module Specification [moduleCode]
    • Module Spec is an Entity
    • Each Module Spec is a Document
  – Module Run [moduleCode/year/runNo]
    • Module Run is an Entity
    • Set of Module Runs for a Field is a Document
• Issues
  – Schemas needed per entity, not per document
  – Principle: No concepts modelled in the physical layer
  – Use physical layer for responsibility, access rights?
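
One way to keep concepts out of the physical layer is to reach entities only through accessor functions, so that callers never depend on how instances are grouped into documents. The sketch below assumes that approach; the inline $local:runs variable stands in for a stored document, and the function name and data are hypothetical.

```xquery
xquery version "1.0";

(: Hypothetical physical layout: all runs for one field and year in one document. :)
(: In a database the variable would be replaced by a lookup in a collection.      :)
declare variable $local:runs :=
  <ModuleRuns>
    <ModuleRun>
      <ModuleCode>M1</ModuleCode><Year>2006</Year><RunNo>1</RunNo>
      <ModuleLeader>Stewart Green</ModuleLeader>
    </ModuleRun>
    <ModuleRun>
      <ModuleCode>M2</ModuleCode><Year>2006</Year><RunNo>1</RunNo>
      <ModuleLeader>Chris Wallace</ModuleLeader>
    </ModuleRun>
  </ModuleRuns>;

(: Callers ask for an entity by its identifier, not by document location. :)
declare function local:moduleRun($moduleCode as xs:string,
                                 $year as xs:string,
                                 $runNo as xs:string) as element(ModuleRun)? {
  $local:runs/ModuleRun[ModuleCode = $moduleCode][Year = $year][RunNo = $runNo]
};

local:moduleRun("M1", "2006", "1")/ModuleLeader
```

If the runs were later split into one document per module, only the body of the accessor would change.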

Programming issues

• Tier design
• Views and constructed elements
• XQuery programming

Tier design

• Allocation of functionality to tiers
  – Initially nearly all XQuery generating HTML
  – As work matured, code moved into function libraries and XSLT
  – XQuery for request input, sessions, selection of nodes, computation of views for
  – XSLT to generate interface for
  – CSS to style

Views

• Views arise from the need for de-normalisation for presentation
  – Coursework Element
    • As a simple element
      – Key : moduleCode/Year/runNo/elementNo
      – Data: due date
    • As an extended de-normalised element
      – SuggestedHours (computed from Hours table)
      – Late date (computed from UWE calendar)
      – Weightings (extracted from relevant specification)
      – Module Leader (extracted from Module Run)
• Views as intermediate structures
  – From low level functions
  – For output to XSL
  – Constructed elements in XQuery use copy (losing reference, so can't update through a constructed element)
• View caching for efficiency
  – Triggers can invoke cache renewal
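
The point about constructed elements being copies can be seen with the node identity operator `is`. The sketch uses invented sample data:

```xquery
xquery version "1.0";

let $mod  := <Module><ModuleCode>M1</ModuleCode><Title>Databases</Title></Module>
(: The constructor copies $mod/Title into the new view element. :)
let $view := <CourseworkElement>{ $mod/Title }</CourseworkElement>
return (
  $view/Title = $mod/Title,    (: true: the values are equal :)
  $view/Title is $mod/Title    (: false: the view holds a different node :)
)
```

Since the view's Title is a new node, an update applied to it would not reach the stored original, which is why updates have to go back to the source nodes by identifier.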

    (: Builds the de-normalised CourseworkElement view for one coursework element. :)
    declare function fold:courseworkElement($moduleCode, $year, $runNo, $elementNo) {
      let $mod := fold:moduleSpecification($moduleCode, $year),
          $run := fold:moduleRun($moduleCode, $year, $runNo),
          $elementRun := fold:elementRun($moduleCode, $year, $runNo, 'B', $elementNo),
          $elementSpec := $mod/Assessment/FirstAttempt/Components/ComponentB/Element[position() = $elementNo],
          $dueDate := $elementRun/DueDate,
          $returnDate := fold:workingDays($dueDate, 20),
          (: weight of this element within the whole module :)
          $componentWeight := $mod/Assessment/Weighting/ComponentWeightB,
          $weightInComponent := data($elementSpec/Weight),
          $weightInModule := round($weightInComponent * $componentWeight div 100),
          (: suggested hours: this element's share of the module's study load :)
          $load := fold:load($mod/Level),
          $hrs := round(data($mod/UWERating) div data($load/Credits)
                        * $weightInModule div 100 * data($load/Hours))
      return
        <CourseworkElement>
          <ModuleCode>{$moduleCode}</ModuleCode>
          {$mod/Title}
          <RunNo>{$runNo}</RunNo>
          {$run/ModuleLeader}
          {$run/InternalModerator}
          {$run/ExternalExaminer}
          <Component>CW</Component>
          <ElementNo>{$elementNo}</ElementNo>
          {$elementSpec/Description}
          <SuggestedHours>{$hrs}</SuggestedHours>
          <WeightInComponent>{$weightInComponent}</WeightInComponent>
          <WeightInModule>{$weightInModule}</WeightInModule>
          <DueDate>{data($dueDate)}</DueDate>
          <ReturnDate>{data($returnDate)}</ReturnDate>
        </CourseworkElement>
    };

Integrity

• Unlike RDBMS, integrity checks not inherent in database
  – Structural (schema validation)
  – Referential integrity
  – Business rules
• Policies
  – Restrictive – allow in only data which has satisfied integrity constraints
    • Unitary view of data – model must be consistent at all times
  – Permissive – allow in un-validated data with on-demand validation and reconciliation
    • Pluralist view – model will probably never be consistent but have to work with this
• On-demand validation
  – Structure via eXist validation
  – Referential (via explicit coding)
  – Extensive business rules
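
A business rule can be checked on demand in the same way as a referential constraint. The sketch below assumes a rule that a module's component weightings should total 100. The rule itself, the ComponentWeightA element and the sample data are assumptions made for illustration; only ComponentWeightB appears in the FOLD code.

```xquery
xquery version "1.0";

(: Hypothetical data and rule: component weightings should add up to 100. :)
let $modules :=
  <Modules>
    <Module>
      <ModuleCode>M1</ModuleCode>
      <Assessment><Weighting>
        <ComponentWeightA>50</ComponentWeightA>
        <ComponentWeightB>50</ComponentWeightB>
      </Weighting></Assessment>
    </Module>
    <Module>
      <ModuleCode>M2</ModuleCode>
      <Assessment><Weighting>
        <ComponentWeightA>60</ComponentWeightA>
        <ComponentWeightB>50</ComponentWeightB>
      </Weighting></Assessment>
    </Module>
  </Modules>
return
  <BusinessRuleReport>{
    for $m in $modules/Module
    let $total := sum($m/Assessment/Weighting/*)
    where $total != 100
    return <WeightingError>{ $m/ModuleCode }<Total>{ $total }</Total></WeightingError>
  }</BusinessRuleReport>
```

Under the permissive policy the offending module stays in the database, and the report gives its owner something to reconcile against.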

XQuery programming

• Functional style yields good clean code
• But it's not OO!
• Need to rethink some algorithms
• Strict data typing needs explicit conversion
• Schema not missed
• XPath 2.0 in XQuery, XPath 1.0 in XSLT (Xalan) causes confusion
• Fast and responsive
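
A short illustration of the typing point, using invented values. Schema-less element content is untyped and is converted automatically in arithmetic, whereas a value that is already an xs:string (a request parameter, for instance) has to be cast explicitly:

```xquery
xquery version "1.0";

let $spec := <Element><Weight>40</Weight></Element>
let $componentWeight := <ComponentWeightB>50</ComponentWeightB>
let $elementNo := "3"   (: stands in for a string-valued request parameter :)
return (
  (: untyped content is atomised and treated as xs:double in arithmetic :)
  round(data($spec/Weight) * data($componentWeight) div 100),
  (: $elementNo + 1 would raise a type error, so the cast is required :)
  xs:integer($elementNo) + 1
)
```

The same applies to predicates such as Element[position() = $elementNo], where a string-valued parameter needs converting before the comparison.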

User Interface

• Table structured Document editing
  – Allows maintenance using familiar spreadsheet tools (Excel 2003 + Add-in)
  – Schema is induced by Excel
  – Accommodations
    • Multi-valued fields as concatenated values
      – XPath join and tokenise functions
      – Embedded separator problem (a name with ‘,’ as a legitimate character)
      – Defeats conventional indexing but eXist supports full text indexing
    • Optional elements increase table width
    • Formatting choices not maintained (e.g. column widths, freeze-window location)
  – WebDAV to provide Web Folder access (still not functioning)
• Structured Document editing
  – Allows maintenance with Word without a schema
    • With difficulty – no schema awareness
  – Use InfoPath to create desktop form based on schema
    • Need to redo if schema changes
  – Document editors (Arbortext, XMetaL..) - expensive
• In-situ updates
  – With XQuery-generated forms and update
  – With XForms using Orbeon (open-source XForms server)
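
The multi-valued field accommodation can be sketched with the standard tokenize and string-join functions. The Staff row, the Responsibilities element and the choice of ";" as separator are assumptions for illustration. A separator that cannot occur in a name is one way round the embedded ‘,’ problem.

```xquery
xquery version "1.0";

(: Hypothetical spreadsheet row: several values concatenated in one cell. :)
let $row :=
  <Staff>
    <Name>Chris Wallace</Name>
    <Responsibilities>Module Leader; Award Leader</Responsibilities>
  </Staff>
(: split on the separator and trim surrounding spaces :)
let $values :=
  for $v in tokenize($row/Responsibilities, ";")
  return normalize-space($v)
return
  <Staff>{
    $row/Name,
    for $v in $values return <Responsibility>{ $v }</Responsibility>,
    (: the reverse step, for writing the values back into a single cell :)
    <Joined>{ string-join($values, "; ") }</Joined>
  }</Staff>
```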

Development Tools

• eXist Java Client provides basic tools
  – Syntax-aware editor
  – Query execution
  – User and database management
• XMLSpy
• Any text editor
• Model-driven development
  – Conceptual Model -> Logical Model -> Physical Model
  – Rose, QSEE ?

Development Process

• Co-development of Information System structure (code and schemas) and content (documents)
• Support schema migration and refactoring (using XQuery/XSLT)
• Slide from prototype to production
• Pluses and minuses of user enthusiasm
• Go for ‘low-hanging fruit’
• Pay attention to the learning process
  – XQuery, XSLT are non-trivial languages because deeply unlike Java/PHP
• Project management via steering group, discussion boards but needs forceful lead developer
• Reflection forced by presentations and workshops
• Is Agile IS development different to Agile Software development?

Characteristics of good fit ?

• FOLD
  – Low update rate / medium access rate
  – High document complexity
  – Document-centric ownership
  – Navigational interface
  – Integration with central systems (via XML interfaces?)