
Data Quality of Native XML Databases in the Healthcare Domain

Henry Addico

As XML data is being widely adopted as a data and object exchange format for both structured and semi structured data, the need for quality control and measurement is only to be expected. This can be attributed to the increase in the need for data quality metrics in traditional databases over the past decade. Traditional models provide constraint mechanisms and features to control quality defects, but unfortunately these methods are not foolproof.

This report reviews work on data quality in both the database and management research areas. The review includes (i) an exploration of the notion of data quality, its definitions, metrics, control and improvement in data and information sets, and (ii) an investigation of the techniques used in traditional databases such as relational and object databases, where most focus and resources have been directed.

In spite of the wide adoption of XML data since its inception, the exploration not only shows a huge gap between research on data quality in relational databases and in XML databases, but also shows how very little support database systems provide in giving a measure of the quality of the data they hold. This motivates the need to formalise mechanisms and techniques for embedding data quality control and metrics into XML data sets. It also presents the viability of a process based approach to data quality measurement with suitable techniques, applicable in dynamic decision environments with multidimensional data and heterogeneous sources. This will involve modelling the interdependencies and categories of the attributes of data quality, generally referred to as data quality dimensions, and the adoption of a formal means such as process algebra, fuzzy logic or other appropriate approaches. The attempt is contextualised using the healthcare domain as it bears all the required characteristics.

Categories and Subject Descriptors: H.2.0 [Database Management]: Security, integrity and protection; K.6.4 [System Management]: Quality assurance; J.3 [LIFE AND MEDICAL SCIENCES]: Medical information systems

General Terms: Healthcare, HL7, NXD

Additional Key Words and Phrases: data quality, health care, native XML databases

1. INTRODUCTION

This research is motivated by three key issues or trends. First, the adoption of XML (eXtensible Markup Language) as a data representation, presentation and exchange format by domains with semistructured data, such as the health care domain. Second, the transition from paper based health records to centralised electronic health records to improve the quality and availability of patient health data nationally. And lastly, the measurement, control and improvement of data and information quality in information sets. These issues, introduced briefly below, present the need for an integrated formal approach between healthcare datasets, information models such as HL7 and openEHR, and data and information quality metrics.

There is high interest and excitement in industry over XML. This is shown by the number of emerging tools and related products and its incorporation into very important standards in domains such as health care. Why are traditional databases and models still heavily used in the healthcare domain in spite of (i) the fact that healthcare data bears most of the characteristics of semistructured data as identified in [Abiteboul 1997], and (ii) the issue of semistructuredness arising from the integration of healthcare data, especially when disparate data sources, say GP surgeries, do not use a generic or standard schema for data storage? Could it be scepticism towards the use of XML databases, as most initial implementations lacked the essential DBMS features needed by the industrial community? There has been immense research and commercial activity in XML databases since [Widom 1999] highlighted possible directions of XML database research. Prospectively, domains that require database management systems (DBMS) supporting flexible structure and high dimensional data, the healthcare domain for example, should be adopting XML databases as they are more appropriate than traditional or legacy DBMS. This research began with a general review of XML and XML databases, which is presented in section 2.

Information quality (IQ) or data quality (DQ) is another issue which has become a critical concern of organisations and an active research area in Management Information Systems (MIS) and Data Warehousing, as the availability of information alone is no longer a strategic advantage [Lee et al. 2002; Oliveira et al. 2005; Motro and Rakov 1997]. Whilst the database community uses the term DQ and is more concerned with formal approaches involving the models, languages and algorithms developed in its field, the management community uses IQ (information is transformed data) and is interested in the abstract demographics, process influences and data cycle (from collection to actual use) of data quality. The management community approach is useful when trying to understand the DQ concept, whilst the database approach is directed towards automation of its results with the aim of eliminating data quality problems from datasets. The management community, however, goes beyond the intrinsic quality attributes and is generally interested in complete frameworks, generally referred to as Total Data Quality Management (TDQM), to improve data quality across domains. Despite a significant increase in IQ research to meet the needs of organisations, only a few ad hoc techniques are available for measuring, analyzing and improving data quality [Lee et al. 2002]. Ensuring that data in dynamic decision environments is fit for its purpose is therefore still difficult [Shankaranarayan et al. 2003]. Nevertheless the decision maker or user of these databases requires the data to measure up in order to confidently make quality decisions.

The recent attempts to build a centralized electronic health record (EHR) are in line with the above two issues, as they require the use of heterogeneous databases and widespread adoption of enabling technologies that facilitate data sharing and access from multiple data sources. Aside from the introduction of the EHR, data quality in health care has been of concern over the past decade and is still a key area within the healthcare domain receiving considerable attention, as it is crucial to effective health care [Leonidas Orfanidis 2004].

This work begins with a review of work on XML and XML databases in section 2, as XDBMS unfurl features suitable for managing data and information quality, especially in the healthcare domain. The review continues with data and information quality in the database and management research areas in section 3. Section 4 explores the context for the research by looking at recent trends in health care towards the development of a centralized EHR, particularly in the United Kingdom, sparingly making reference to other national implementations. This review of XML and XML databases, data and information quality, and the EHR in the healthcare domain provides the grounding to conclude by presenting possible research directions on a process based approach to incorporating data quality into XML databases in section 5.

2. XML AND XML DATABASES

XML (eXtensible Mark-up Language), a subset of Standard Generalized Markup Language (SGML) with an extensible vocabulary, is now a widely used data and object exchange format for structured and semi structured data as well as for the representation of tree structured data [Jiang et al. 2002; Rys et al. 2005; Shui et al. 2005; Widom 1999; MacKenzie 1998]. It has been fast emerging as the dominant standard for representing data on the World Wide Web since the inception of its specification in 1996 by the World Wide Web Consortium (W3C) [Shanmugasundaram et al. 1999; MacKenzie 1998]. Unlike HTML, another subset of SGML, which serves the task of describing how to display a data item, XML describes the data itself [Shanmugasundaram et al. 1999; Widom 1999; MacKenzie 1998]. However trivial one might think this property, its usefulness should not be underestimated. It enables applications to interpret data in multiple ways, filter documents based on their content or even restructure data on the fly. Furthermore it provides a natural way to separate information content from presentation, allowing multiple views via the application of the eXtensible Stylesheet Language (XSL) specifications [Widom 1999].

XML is generally presented in documents consisting of ordered or unordered elements that are textually structured by tags with optional attributes (key value pairs). Elements can be semi structured by allowing unstructured free text between the start and end tags as in figure 1. Elements can be nested within other elements, but the document must have a particular element, called the root element, that embodies all the other elements. A Document Type Definition (DTD) describes the structure of the elements, their associated attributes and constraints [Achard et al. 2001]. The DTD can be expressed more elaborately as an XML document called an XML schema. A well formed (syntactically correct) XML document is said to be valid with respect to a DTD or XML schema if it conforms to the logical structures defined in the DTD or XML schema.
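As a minimal illustration of the distinction between well formedness and validity, the sketch below parses a small hypothetical patient fragment and validates it against an equally hypothetical DTD using the lxml library; the element names are illustrative only and are not drawn from any healthcare standard.

```python
from io import StringIO
from lxml import etree

# Hypothetical patient fragment; element names are illustrative, not from HL7 or openEHR.
doc = etree.parse(StringIO(
    "<patient id='p1'><name>Jane Doe</name><note>Seen in clinic, stable.</note></patient>"))

# A toy DTD describing the expected logical structure of the fragment.
dtd = etree.DTD(StringIO(
    "<!ELEMENT patient (name, note)>"
    "<!ATTLIST patient id CDATA #REQUIRED>"
    "<!ELEMENT name (#PCDATA)>"
    "<!ELEMENT note (#PCDATA)>"))

# Parsing succeeded, so the document is well formed; validity is a separate check.
print(dtd.validate(doc))   # True only if the document conforms to the DTD
```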

The following features of XML not only account for its fast adoption and rapid development but also influence the techniques adopted for efficient storage and querying [Kim et al. 2002; Kader 2003]:

—Document-Centric
  —Large amount of mixed unstructured and structured data, which is useful for semi-structured or flexible data applications [figure 1].
  —Supports element and content order.
  —Readily human readable.
  —100% round tripping is essential (see 2.2) [Feinberg 2004]. For example, in the health care domain reconstruction of a patient EHR is important.
—Data-Centric: an alternative to the above document-centric feature
  —Structured or very little mixed content of unstructured data [figure 2].
  —Element and content order is insignificant.
  —Best consumed by machine.
  —100% round tripping is not essential (see 2.2). For example, one might only need the cost of a flight.
—Provision of a family of technologies like XREF linking, XPointer, XPath, XQuery, XSL, AXML, SGML, SOAP, RDF etc.
—self-documenting format that describes structure and field names as well as specific values
—verbose meta data: each value text has a syntactic or semantic meta data
—yields higher storage costs
—Best consumed by machine and promotes interoperability between systems
—numerous schema languages with varying features for value, structural and reference constraints, e.g. XML Schema, RELAX NG, Schematron etc.
—platform-independent

Fig. 1. A document-centric XML element

Fig. 2. A data-centric XML example
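Figures 1 and 2 are not reproduced in this transcript. Purely as illustrative stand-ins for the two flavours listed above, the fragments below (with hypothetical element names, not taken from the original figures) contrast a document-centric fragment, which mixes free text with markup, with a data-centric fragment, which is fully structured.

```python
# Hypothetical document-centric fragment: free text interleaved with markup,
# where the order of text and elements carries meaning.
document_centric = """
<clinical_note>
  Patient reviewed today. <finding>BP 140/90</finding> noted; advised to
  continue current medication and return in <duration>two weeks</duration>.
</clinical_note>
"""

# Hypothetical data-centric fragment: fully structured, no mixed content,
# and the order of sibling elements is largely insignificant.
data_centric = """
<prescription>
  <drug>Amoxicillin</drug>
  <dose unit="mg">500</dose>
  <frequency per_day="3"/>
</prescription>
"""
```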

The document centric and data centric features are generally treated as alternatives. The interests of this research lie in local or distributed XML document stores with traditional database features like efficient storage, querying of data, transactions, indexing etc., to be precise an XML document management system (XDBMS). The next section defines XML databases and explores their subcomponents in its subsections.

2.1 Anatomy of an XML Database

An XML database is defined informally by [Cohen et al. 1999] as a list of XML documents matching a given Document Type Definition (DTD) or schema. This is comparable to the relational model, which contains lists of tuples conforming to relation schemes [Cohen et al. 1999]. However, as an XML document can be without a schema, it is more appropriate to define an XML database simply as a list of XML documents [Moro et al. 2005].

Systems for managing XML data are either specialized systems designed exclusively for XML [Rys et al. 2005] or extended ("universal") databases designed to manage XML data amongst others. The first is referred to as a Native XML database (NXD) and the latter an XML enabled database (XED). While the NXD is not entirely new from the ground up, as many of its techniques are adaptations and adoptions from semi structured databases, the XED was a relentless effort and warranted need to tap into the 36 years of investment [Shanmugasundaram et al. 1999] in relational databases, object databases and other traditional databases. The key difference between XED and NXD is how they employ the traditional database concepts. An NXD has a (logical) model for its fundamental unit (the XML document) and stores and retrieves documents according to this model. It can however employ an underlying physical storage model based on a relational, hierarchical, or object-oriented database, or use a proprietary storage format such as indexed, compressed files. The XED on the other hand uses the traditional database in its entirety and only provides XML output via transformation of the results. Most implementations of XED and NXD provide more than just efficient storage; they also offer the essential features of a database management system (DBMS) like appropriate querying interfaces, security, multiple user access, transactional support etc. This paper will refer to XED and NXD of this calibre and hence will use XEDBMS and NXDBMS respectively. There are quite a number of XEDBMS like Berkeley DB XML [Burd and Staken 2005], DB2 9 Express-C No Charge PureXML Hybrid Data Server [IBM 2006] etc., and numerous implementations of NXDBMS like eXist [Pehcevski et al. 2005], 4suite [Olson 2000], Sedna [Aznauryan et al. 2006], Xindice [Gabillon 2004; Sattler et al. 2005], TIMBER [Jagadish et al. 2002], Natix [Fiebig et al. 2002] etc.

An attempt to include quality metrics in an NXD requires a thorough understanding of its architecture and interfaces. A general exploration of the essential components of multi tier databases, architecturally structured into levels with the aim of providing physical and logical data independence, could lead to fruitful possibilities. This exploration begins with the storage and data models in sections 2.2 and 2.3. Next, in section 2.4, the exploration continues with indexing, including indexing support for annotation and meta data. Querying interfaces and transactions follow in section 2.5 and section 2.6. Distributed XML database concepts and how they are currently achieved conclude the anatomy exploration in section 2.7, as they are a basic requirement in the healthcare domain.


2.2 Storage models

The approach to storing XML documents can be either intact as whole documents (document-centric) or following a shredding scheme (data-centric) [Feinberg 2004]. The storage can be in off-the-shelf traditional database management systems [Jiang et al. 2002; Fiebig et al. 2002] or in a proprietary storage format. The intact approach involves mapping the document into large object fields or files as whole documents, with an overflow mechanism for very large documents, whilst in the shredded case the contents of the documents are mapped onto an elaborate schema of a traditional database, taking advantage of the presence of a DTD or XML schema [Jiang et al. 2002].

From a document centric perspective the storage of XML documents as singular data items is ideal for handling whole document queries, but manipulation of fragments will require parsing of the document each time. However, from a data centric (non-intact model) view where documents are broken down into smaller chunks, handling parts of the document is more efficient but the extraction of whole XML documents is negatively impacted.

The document-centric and data-centric approaches can be compared from the perspective of their granularity. The intact storage approach has a granularity of addressability [Feinberg 2004] of a whole XML document, except for the special case of immutable documents where interior addresses or offsets can be used. However there is the problem of how to address target documents. The two primary mechanisms are by some unique name or generated document id, or by query (XPath or XQuery). Querying over large collections of intact documents will require indexing, which can be done when parsing documents. The intact approach is useful in applications where collections comprise relatively small documents which tend to be processed as a whole and where 100% round tripping (the decomposition of documents into smaller chunks and their re-composition when evaluating whole document queries is referred to as round tripping) is required. The non-intact approach on the other hand has addressability and accessibility at subdocument level, typically element or node level [Feinberg 2004]. However, concurrency granularity may or may not be finer than the document level. It allows the ability to reference, modify and retrieve an element, partial document or other objects within a document. It also provides more efficient querying, without whole document parsing. There are challenges like the degree of round tripping, as decomposed storage results in loss or change of information, such as reordering of attributes and changes in XML declarations. This is due to the fact that there is not a 1:1 mapping from the XML info-set to the bytes in a document [Feinberg 2004].

Model-mapping, a typical implementation of the non-intact approach, stores the logical form of the XML document instead of the actual XML document. This logical model, generally referred to as the data model, is reviewed in the next section. The mapping is either edge oriented or node oriented, and a fixed schema is used for all XML documents without the assistance of a DTD [Jiang et al. 2002]. An example, XParent, is characterized by a four table schema. It maintains both label paths (sequences of element tags) and data paths (sequences of elements) in two separate but related tables.
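The sketch below gives a deliberately simplified, two-table flavour of such path-oriented shredding; the real XParent scheme uses four tables, and the table and column names here are assumptions for illustration only.

```python
import sqlite3

# Simplified path-oriented shredding in the spirit of XParent [Jiang et al. 2002]:
# one table of label paths, one table of shredded values pointing at them.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE LabelPath(path_id INTEGER PRIMARY KEY, path TEXT);    -- e.g. /patient/name
CREATE TABLE Data(node_id INTEGER PRIMARY KEY, path_id INTEGER,
                  ordinal INTEGER, value TEXT);
""")
con.execute("INSERT INTO LabelPath VALUES (1, '/patient/name')")
con.execute("INSERT INTO Data VALUES (10, 1, 1, 'Jane Doe')")

# Roughly what a simple path query such as /patient/name shreds into.
rows = con.execute("""
    SELECT d.value FROM Data d JOIN LabelPath p ON d.path_id = p.path_id
    WHERE p.path = '/patient/name'
""").fetchall()
print(rows)   # [('Jane Doe',)]
```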

There is still a dilemma or conflict when mixing the concepts of a single document being a database and a set of XML documents being a database. This is due to the document and data centric features of XML. This needs to be resolved as it might be useful to use both views simultaneously [Widom 1999]. Nevertheless most of the storage and data models seem to adopt the data-centric path, with only a few considering the document-centric feature. This is not favourable to this research, as the EHR is an aggregational information model so that the document-centric approach will be more appropriate. In traditional databases the data storage is abstracted away from the data model, providing what is referred to as physical data independence. Most implementations of NXD (especially the intact approaches) however lack this feature and directly store the data model to disk.

2.3 XML data models

Above the storage model there is a need for a logical one, referred to as the data model [Feinberg 2004; Widom 1999]. This model can sometimes be closely tied to the querying language. A typical and popular example is DOM, which is tied to XQuery. DOM is object oriented, document-centric and forms the basis of most data models. It simply maps the document and its contents to basic and complex data structures [Goldman et al. 1999]. Each data item is modelled as a node; some nodes have names and values, and may have siblings or children, forming a tree structure. Generally the models take the form of a forest of unranked, node-labelled trees, one tree per document [Moro et al. 2005; Kacholia et al. 2005]. A node corresponds to an element, attribute or a value, and the edges represent element (parent)-sub element (child) or element-value relationships [Moro et al. 2005]. XPath, another popular example, models XML documents as an ordered tree using 7 types of nodes (root, element, text, attribute, namespace, processing instruction and comment) [Jiang et al. 2002]. TaDOM, a third example, is an extension of DOM by [Haustein et al. 2005]. It separates attribute nodes into attribute roots for each element, and the attribute values become text nodes. The Lore project's schema-less, self-describing semi structured data model, the Object Exchange Model (OEM), was ported to an elegant XML data model where every data item is modelled as an element pair of a unique element identifier (eid) and a value, which can be an atomic text or complex value. The complex value bears the element tag name, an ordered list of attribute name or atomic value pairs, crosslink sub elements and normal sub element types.

The following are some of the classical features, together with examples of data models or query languages that implement them, where any exist:

—data typing support: Lore data model, XQuery
—element and attribute ordering: Lore
—inclusion of white space between elements, comments, processing instructions: e.g. XPath
—semantic view representing IDREF as cross link edges and foreign key relationships: Lore
—temporal support
—concurrency support: taDOM
—metadata and annotations management


There is not as yet a data model providing all of these features, and some of them are yet to be included in any XML database model.

2.4 Indexing

In order to exploit the content and structure of XML documents, several numbering schemes (index structure mechanisms) are normally embedded in the tree representation model [Moro et al. 2005]. Absolute address index approaches like position-based indexing and path based indexing by [Sacks-Davis et al. 1997] are not very efficient as updates cause an expensive re-computation, which is undesirable in applications with high insert and update frequency [Kha et al. 2001]. [Kha et al. 2001] present an indexing scheme called Relative Region Coordinate, which extends the Absolute Region Coordinate (ARC) used to locate the content of structured documents and which reduces the number of computations when the value of a leaf node changes. As this scheme has the special property of only affecting the nodes in its region and on the right hand side (a storage scheme called block sub tree), it reduces the input-output (I/O) necessary to perform updates. There are also indexing schemes for the structure of the document, data centric in nature as opposed to the former, which is document centric. Examples are the interval based schemes [Li and Moon 2001], where labels are based on the traversal order of the document and its size, and the prefix labelling scheme, where nodes inherit their parent label as a prefix. The above indexing schemes also suffer undesirable computation and modification to other records when multiple elements, referred to as segments, are inserted or updated. [Catania et al. 2005] present a lazy approach to XML update where no other modifications are made to existing records. The writers also present data structures and a structural join algorithm based on segments. The DeweyID indexing scheme by [Haustein et al. 2005] is another which identifies tree nodes, avoids re-labelling even under bulk insertions and deletions and allows the derivation of all ancestor node identities (IDs) without accessing the actual data.
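A minimal sketch of the prefix-labelling idea behind such DeweyID-style schemes is shown below (the dotted label format is assumed for illustration); because an ancestor's label is a prefix of its descendants' labels, all ancestor IDs can be derived from a node's label alone, without touching the stored data.

```python
# DeweyID-style prefix labels: "1.3.2" is the second child of the third child of the root.
def ancestors(dewey_id: str) -> list[str]:
    """All ancestor labels, derived purely from the label itself."""
    parts = dewey_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]

def is_ancestor(a: str, b: str) -> bool:
    """True if the node labelled a is an ancestor of the node labelled b."""
    return b.startswith(a + ".")

print(ancestors("1.3.2.5"))           # ['1', '1.3', '1.3.2']
print(is_ancestor("1.3", "1.3.2.5"))  # True
```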

As most work on quality measurement assumes the need for meta data, there is also a need to provide meta data indexing facilities. In XML databases the meta data can be applied directly to elements or stored as views of the XML data. The work of [Cho et al. 2006] investigates meta data indexing along each location step in XPath queries. The writers derive a full meta data index, which maintains a meta data level for all elements, and an inheritance meta data index, which maintains meta data only for elements for which the meta data is specifically defined. Indexing is such an important aspect as it not only affects storage and query efficiency but also holds the key to effective use of meta data for data quality measurement.

2.5 Querying XML Databases

The usefulness and potential of a database is highly dependent on providing optimal query languages and interfaces to store and retrieve data elements. Querying XML data has quite complex requirements, most of which are still unknown and will be until a significant number of data-intensive applications are built [Widom 1999]. Most text searches performed on XML related data seem to be keyword based. An identified problem in this scenario is the efficiency of extracting this information [Kacholia et al. 2005]. One approach is Backward Expanding search, which unfortunately performs poorly when a keyword occurs commonly or when some nodes have a high degree [Kacholia et al. 2005]. A Bidirectional Search algorithm improves backward expanding search by allowing forward search from potential roots towards leaves [Kacholia et al. 2005].

The query language needs to support structural expressiveness, rich qualifiers and recursiveness with optimal query evaluation, providing numerous interfaces via the family of XML technologies. It must also exploit the presence of a DTD or XML schema, the document-centric or data-centric features of XML, and meta data, and allow fuzzy restriction and query ranking. A language like XPath, superseded by XQuery, employs "tree pattern" queries to select nodes based on their structural characteristics [Moro et al. 2005]. For example the query:

//article/[/author[@last="DeWitt"]]//proceedings[@conf="VLDB"]

requests all articles with author last name DeWitt that have appeared in the VLDB conference [Moro et al. 2005]. The writers [Moro et al. 2005; Amer-Yahia et al. 2005] explain that the query consists of two parts.

—@last="DeWitt" and @conf="VLDB" are value based, as they select elements according to their values, i.e. content based.

—//article[/author]//proceedings is structurally related, as it imposes structural restrictions (e.g. a proceedings element must exist under the article, with at least one child author).
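As a small, hedged illustration of how these two kinds of condition combine, the sketch below evaluates a similar path expression over a toy bibliography with lxml; the document content and the slightly regularised XPath syntax are assumptions, not taken from the cited papers.

```python
from lxml import etree

# Toy bibliography; only loosely mirrors the example query in the text.
doc = etree.XML("""
<bib>
  <article>
    <author last="DeWitt"/>
    <proceedings conf="VLDB"/>
  </article>
  <article>
    <author last="Smith"/>
    <proceedings conf="ICDE"/>
  </article>
</bib>""")

# Structural conditions: an article must have an author child and a proceedings descendant;
# value based conditions: @last and @conf restrict the attribute content.
hits = doc.xpath('//article[author[@last="DeWitt"]]//proceedings[@conf="VLDB"]')
print(len(hits))   # 1: only the first article satisfies both kinds of condition
```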

The value based conditions can be efficiently evaluated with traditional indexing schemes, whilst the structural conditions are quite a challenge. Set based algorithms and holistic processing techniques have been developed in recent research work to outperform the conventional methods [Moro et al. 2005]. The conventional method mostly involved query decomposition using binary operators, after which some optimization is applied to produce an efficient query plan [Moro et al. 2005]. The holistic approach however requires some pre-processing on the data, or on both the data and the query, but with incremental global matching ability [Moro et al. 2005]. Most query languages express order sensitive queries, which are particularly important in document centric views, through the ability to address parts of a document in a navigational fashion, based on the sequence of nodes in the document tree [Vagena et al. 2004]. View mechanisms and dynamic structural summaries may be useful for query processing and rewriting, keyword and proximity search [Widom 1999]. [Amer-Yahia et al. 2005] propose a scoring and ranking method based on both structure and content instead of the term frequency (tf) and inverse document frequency (idf) techniques from information retrieval, which are approximations for relevance ranking. Languages that support these features are declarative in nature. W3QL is a database-inspired query language designed for the web around 1995. It is a declarative query language just like the Structured Query Language (SQL) and applies database techniques to search hypertext and semi structured data, employing existing indexes as access paths [Konopnicki and Shmueli 2005]. Lorel is another declarative language, designed initially for semi structured data [Abiteboul et al. 1997] and then migrated to XML. It seems to be the most flexible, with a lot of the features highlighted above. Other languages include XML-QL, a value based language with structural recursion, using patterns and allowing not only node selection but transformations via the notion of variable binding [Deutsch et al. 1998].


OQL and Quilt follow a functional paradigm, integrating features from the other declarative languages [Fan and Simeon 2003; Chamberlin et al. 2001].

There is also a need for query languages or approaches which hide away the complexity, without compromising the complex grammar features, by providing graphical, form based and other interfaces for naive users. Equix is a typical form based query language which automatically generates result documents and their associated DTD without user intervention [Cohen et al. 1999]. This was an attempt to provide a graphical interface better than the graphically interfaced query language XMAS, which facilitates user-friendly query formulation and is very useful for optimization purposes [Baru et al. 1999]. XML-GL is another such graphically oriented language [Ceri et al. 1999].

As with extensions to SQL, like the work of [Parssian and Yeoh 2006] in relational databases, there might be a need to extend one of the numerous XML query languages to cater for data quality queries. This forms a very small proportion of the work on querying, with XPath and XQuery being the most popular; as this area is being very actively researched, a detailed exploration is outside the scope of this review.

2.6 Transaction Models in XML Databases

Research in the XML database field has moved its focus from single user to fully fledged multi user, multi access databases, with collaboration on XML documents via the Concurrent Versions System (CVS) [Dekeyser et al. 2004] being a thing of the past. As this has been studied extensively in the traditional database systems context, native XML databases must measure up. Processing interactions with XML documents in multi-user environments requires a guarded transactional context assuring the well-known ACID properties [Haustein and Harder 2003]. The work of [Feinberg 2004] on transaction management is a typical example bearing these properties.

The structural properties of XML documents present some more challenges with locking scheme propagation with respect to authorisation [Finance et al. 2005]. [Dekeyser et al. 2004] argue about the inadequacy of table locking, predicate locking and hierarchical locking schemes (adaptations of traditional database locking schemes) mainly used in the shredded approach to XML data storage. They then propose two new locking schemes with two scheduling algorithms based on path locks, which are tightly coupled to the document instance.

This area seems to have the least research focus. It is however critically important when it comes to integration and heterogeneous database sets. Most of the key features of transactions depend heavily on the extent of integration between the query and security models.

2.7 Distributed XML databases

Distributed data management is an important research area for the database community. Most prominent are the data warehouses of traditional databases. A data warehouse is generally understood as an integrated and time-varying, often historical, collection of data from multiple, heterogeneous, autonomous and distributed information sources. It is primarily used in strategic decision making by means of online analytical processing (OLAP) techniques [Lechtenborger and Vossen 2003]. A data warehouse (DW) is different from the conventional DBMS in that it operates on aggregated data from disparate sources for analysis purposes, or to answer complex ad hoc queries which the disparate data sources cannot answer in isolation. A DW uses materialised views to provide faster and more efficient answers to real world queries.

The introduction of standards like WSDL, UDDI and SOAP, using HTTP and XML, especially AXML, has driven a shift from client-server architectures to service based approaches [Bilykh et al. 2003]. [Abiteboul et al. 2006] conceptualise an XML DW by utilising work from ActiveXML (AXML), a language which is based on the concept of embedding service calls inside XML documents. These service calls could be calls to other databases or even to tasks or process agents in a localised or distributed environment. It is also quite different from application based peer to peer approaches, as in [Boniface and Wilken 2005], which use mediators to ensure interoperability, as this is solved by an information model like the HL7 CDA for example. This is discussed in more depth in section 4.3.3. This DW is of particular interest as it is a research goal to incorporate quality metrics into such a distributed virtual environment.

3. DATA AND INFORMATION QUALITY

The determination of the quality of resources, products, systems, processes etc. has always been essential to mankind. Being currently in an information age where data is a precious resource, its quality measurement is needed as corporations, governmental organizations, educational institutions, health institutions and research communities maintain and use data intensively. The generation of knowledge to support critical decisions based on these collections of data brings up the long-standing question of "is this data of high quality?". Several attempts have been made to answer this question in two main research communities: database management and information management. Whilst the database community uses the term DQ and is more concerned with formal approaches involving models, languages and algorithms developed in its field, the management community uses IQ, as its interests lie in the abstract demographics, process influences and the impact of the data cycle (from collection to actual use) on quality. The next section defines DQ and IQ and investigates possible interrelations.

3.1 Defining data quality and Information quality

Data quality, as defined in the American National Dictionary for Information Systems, concerns the correctness, timeliness, accuracy and completeness that make data appropriate for use. A more elaborate definition from ISO 8402 (Quality Management and Quality Assurance Vocabulary) reflects data quality as:

The totality of characteristics of an entity that bears on its ability to satisfy stated and implied needs [Abate et al. 1998].

This means that a measurable level of quality is reached if a specification (conformity to a set of rules or facts) which reflects the intended use (utility) is satisfied. The writer [Abate et al. 1998] argues that conformity and utility are sufficient quality indicators, and this ties up with other traditional definitions like "fitness for use", "fitness for purpose", "customer satisfaction" or "conformance to requirements". The above definitions are simple extensions of the definition of quality, and information can be substituted to give an appropriate definition of IQ. Many academics use IQ and DQ interchangeably [Bovee et al. 2003]. IQ is however contextual, as information is contextualised data [Stvilia et al. ]. On the other hand, others believe information quality becomes an issue after data items have been assembled and interpreted as information. This stems from another key difference between data and information: the presence or absence of relationships between terms (numbers and words) [Pohl 2000]. So as data is transformed into information, data quality and its attributes (dimensions) are transformed to information quality [Ehikioya 1999]. The notion of the product-based perspective called data quality and the service-based perspective called information quality, as presented in [Price and Shanks 2004], is much preferred. The product-based perspective focuses on the design and internal view of the information system, while the service-based perspective focuses on the information consumer's response to their task based interactions with the information system (IS). This implies that the processes involved in producing and providing the service are as important as the data itself when determining information quality, and motivates the development of a process dependent formal model to measure information quality. Even though information and data quality can be used interchangeably, the above perspectives and definitions show their differences. The following section looks at the attributes of quality called data quality dimensions.

3.2 Dimensions of IQ

Generally, assessment of data quality is based on attributes and characteristics that capture specific facets of quality [Scannapieco et al. 2005]. These facets, generally referred to as dimensions of quality, differ with the quality definition (intrinsically or extrinsically defined) or the quality model used in the assessment (theoretical, system-based, product-based, or empirical). The next section looks into these various assessments. There have been several attempts [Wand and Wang 1996; Motro and Rakov 1997; Gertz and Schmitt 1998; Pipino et al. 2002; Ballou et al. 1998] to define and generate lists of these dimensions, as the classical approaches seem to outline new and synonymous variations of existing dimensions. Instead of a long list of these dimensions and their definitions, which would be far from exhaustive, it seems more appropriate to present a smaller set of the dimensions identified by [Leonidas Orfanidis 2004; Gendron 2000] in the health care domain.

—Accessibility. EHR data should be available to authorized users including patients with special needs, care providers, mobile users, emergency services, and members of integrated care teams. Access should be easy to use and fast for both care professionals and patients. Data should be accessible from wherever they are needed, in appropriate forms and amounts. Privacy from unauthorized users should be strictly maintained, and overriding of authorization constraints should be recorded with documented reasons.
—Usability. EHRs should be accessible in different data formats and from different kinds of hardware and networks, to ensure interoperability between different systems. Data held within an EHR should be organized (including chronologically) and presented for ease of retrieval.
—Security and confidentiality. EHRs must be secure and confidential. Patients should be allowed to check who has access to their data and in what circumstances.
—Provenance. EHRs should show the source and the context of data, linked to metadata about the provenance of data. This is really to ensure believability.
—Data validation. The status of the EHR data should be described by metadata, for example, to indicate if data are pending, and times of entry and retention. Patients should be allowed to check the validity of their EHR data.
—Integrity. Data accreditation standards should be established for new data, and inconsistency and duplication should be removed.
—Accuracy and timeliness. The content of an EHR should be as near real-time as possible. Thus, data should be timely, in that they relate to the present. Temporality is also essential, as the EHR needs to cover the lifetime of patients.
—Completeness. The existence of further data should also be indicated, possibly with links to other data.
—Consistency. There should be consistency between data items drawn from multiple sources. EHRs should comply with the existing relevant standards, such as security, data protection, and communication standards (HL7).

The above itemisation of a few of the data quality needs in the health care domain from [Leonidas Orfanidis 2004] uses data quality dimensions defined elsewhere in the literature (the dimension names lead each item). Reviews like [Naumann and Rolker 2000] of the empirical and theoretical experiments which generate these dimensions have shown their limitations. These limitations are not surprising, as [Abate et al. 1998] identified the following problems of quality assessment using such dimensions. First of all, generating an exhaustive list of attributes may be difficult and unverifiable. Secondly, interdependencies of attributes make it difficult to define a minimal orthogonal set of these attributes. Last but not least, this results in the isolation of quality attributes, hindering the identification of systematic data quality problems. The grouping of attributes to identify specific quality problems is a more comprehensive solution which will ease the identification of systematic problems. Most categorisation attempts [Naumann and Rolker 2000; Strong et al. 1997; Price and Shanks 2004; Hinrichs 2000] to find a more comprehensive solution have not yet yielded a desirable solution. One of these attempts which has gained some attention is the categorization from [Richard et al. 1994]. The writers grouped together the dimensions whose quality problems precipitate in other dimensions. They are as follows:

—Intrinsic (Accuracy, Objectivity, Believability and Reputation): a lack of process, or weakness in the current process, for creating data values that correspond to the actual or true values. This implies that the information has quality in its own right.
—Contextual (Value-Added, Relevancy, Timeliness, Completeness and Appropriate Amount of Data): a lack of process, or weakness in the current process, for producing data pertinent to the tasks of the user. It highlights the requirement that IQ be considered within the context of the task at hand.
—Representational (Interpretability, Ease of Understanding, Representational Consistency and Concise Representation): a lack of process, or weakness in the current process, for supplying data that is intelligible and clear. This emphasizes the need for the storage systems to provide access to this information in an interpretable way, easy to understand and manipulate, with concise and consistent representation.
—Accessible (Accessibility and Access Security): a lack of process, or weakness in the current process, for providing readily available and obtainable data. The data must be accessible but secure.

There is no evidence of the benefits of the categories, with doubts about their usefulness when improving the data quality of data sets [Lee et al. 2002]. Dimensional analysis is still under active research and there is as yet no agreed consensus. A resolution of the dimensions' interdependencies and synonymous variations is most needed, as they play a very important part in the data quality assessment processes discussed in the next section.

3.3 Data Quality assessments

The simplest approach to data quality assessment, as identified by [Mandl and Porter 1999], involves the quantification of quality indicators (for example completeness and good communication). These indicators are actually the quality dimensions described above, which the research community has been defining over the past decade. Several other attempts however seem to follow the [Arts et al. 2001] model with the following seven steps:

—describe the objectives of the entry
—determine the data items that need to be checked
—define data quality aspects
—select the methods for quality measurement
—determine criteria
—perform quality measurement
—quantify the measured data quality

Attempts following the above model have resulted in a paradigm of total quality management frameworks for data quality. These frameworks have been empirical, theoretical, system oriented, and ontological or domain oriented. [Bovee et al. 2003] developed a conceptual framework with all the essential dimensions or attributes for assessing IQ.

The AIMQ project [Lee et al. 2002] developed an overall model with accompanying assessment instruments for measuring IQ and comparing these measurements to benchmarks and across stakeholders. Firstly, their model consists of a 2 x 2 framework defining what IQ means to consumers and managers. The framework is made of four quadrants, depending on whether information is considered a product or a service and whether the assessment is formal or customer driven. Secondly, there is a questionnaire for measuring IQ along the dimensions relevant to consumers and managers. The third component consists of analysis techniques for interpreting the assessments against the benchmarks of different stakeholders.

Page 15: Data Quality of Native XML Databases in the Healthcare Domain · Data Quality of Native XML Databases in the Healthcare Domain Henry Addico As XML data is being widely adopted as

Data Quality of Native XML Databases in the Healthcare Domain · 15

Formal methods for measuring data quality in the relational world have been considered, generally focusing on completeness [Davidson et al. 2004; Scannapieco and Batini 2004; Parssian et al. 2004]. The closest formal approach was the attempt by [Ehikioya 1999] using fuzzy logic. The writer stated O = FT(D), where O is the output (information), FT is the transformational unit and D is the data to be processed. This implies that the quality of the output information depends on the transformation function and the quality of the data unit. The writer then adopts a quality measuring spectrum showing the level of satisfaction of a dimension, as each affects the overall quality directly. A confidence measure of 0 means low and 1 means high. As there will be infinite points on the information spectrum, fuzzy set and logic theory will be most suitable. Even though this is similar to assigning a probability ranging from 0 to 1, the writer believes fuzziness is more flexible.
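The cited work does not spell out how per-dimension confidences are combined, so the sketch below simply assumes a min-based fuzzy conjunction over hypothetical dimension scores to illustrate the idea of a [0, 1] quality spectrum.

```python
# Assumed aggregation: a fuzzy "and" (minimum) over per-dimension confidences in [0, 1],
# where 0 means low satisfaction of a dimension and 1 means high.
def overall_quality(dimension_scores: dict[str, float]) -> float:
    return min(dimension_scores.values()) if dimension_scores else 0.0

scores = {"accuracy": 0.9, "completeness": 0.7, "timeliness": 0.4}
print(overall_quality(scores))   # 0.4: the weakest dimension dominates the result
```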

According to [Wand and Wang 1996], to design information systems that deliver high-quality data, the notion of data quality must be well understood. An ontologically based approach to defining data may be the ticket for success in real-world systems. More recently [Milano et al. 2005] have been looking at identifying and possibly correcting data quality problems using an ontological approach called Ontology-based XML Cleaning (OXC). It formalises the problem of defining data quality metrics based on an ontological model, as DTDs and XML Schema are not expressive enough to capture additional knowledge and constraints. Furthermore, they lack the formal semantics needed to allow automated reasoning. Using the DTD and schema as a basis, an ontology capturing domain knowledge is designed by a domain expert, including any additional knowledge that has been left out of the conceptual design due to limits of the schema language or bad design. A mapping between the ontology and the schema is also defined. Together they form a reference world against which data quality dimensions can be defined. Data quality improvement can also be applied. The result of applying the OXC methodology makes an XML document not only schema valid but also ontology valid. The paper managed to define data quality completeness in terms of value completeness, leaf completeness and parent-child completeness.

[Shankaranarayan et al. 2003] discuss a modelling scheme for managing data quality via data quality dimensions using the Information Product map (IPMAP). IPMAP fills the void and extends the Information Management System (IMS), which is used in computing the quality of the final product but does not quite measure up to total quality management [Shankaranarayan et al. 2003]. They compare manufacturing an IP to manufacturing a physical product (PP), with stages such as raw material, storage, assembly, processing, rework and packing. Both IP and PP may be outsourced to an external agency or organisation which uses different standards and computing resources. IPs with similar properties and data inputs can be manufactured on the same production processes or a subset of them. Quality at source and continuous improvement have been successfully applied in managing data quality [Shankaranarayan et al. 2003]. There is a need to trace a quality problem in an IP to the manufacturing stages that may have caused it and to predict the processes' impact on data quality. The paper introduces an IP framework, which it describes as a set of modelling constructs to systematically represent the manufacture of an IP, called the IPMAP. It then employs a meta data approach which includes data quality dimensions. The IPMAP helps in visualising the distribution of the data sources and also the flow of the data elements and the sequences by which data elements are processed to create an IP. It also enhances the understanding of the processes and the business units in the overall production. However, it does not provide functional computability of the data quality during the process.

3.4 Data quality metrics

Statisticians were the first to investigate data quality problems, considering data duplicates in the 1960s. This was followed by data quality control, analogous to the quality control and improvement of physical products, in the 1980s. It is only at the beginning of the 1990s that measuring quality in datasets, databases and warehouses was considered [Scannapieco et al. 2005]. In order to understand data quality fully, the research community has identified a number of characteristics that capture specific facets of quality, referred to as data quality dimensions. These dimensions can be objective or subjective in nature and their measurement can vary with the granularity of the dataset, semantically and syntactically. The next few sections look at the dimensions which have received most attention.

3.4.1 Accuracy. There are quite a few definitions of accuracy: (i) inaccuracy implies that the Information System (IS) represents a Real World (RW) state different from the one that should have been represented; (ii) whether the data available are the true values (correctness, precision, accuracy or validity); (iii) the degree of correctness and precision with which real world data of interest to an application domain are represented in an information system.

The value based accuracy of data is the distance between a value v and a value v' which is considered correct. Value based accuracy can be syntactic or semantic. Syntactic accuracy is measured by comparison functions that evaluate the distance between v and v'. Semantic accuracy however goes beyond syntactic correctness, considering correctness in terms of fact and truth. Semantic accuracy normally requires object identification. Ideally, value accuracy in XML data sets will be similar. The literature also considers coarser accuracy metrics computed over a column, a relation or a whole database. There is a further notion related to accuracy, duplication, when an object is stored more than once in a data source. Primary key and integrity constraints are normally useful in this instance unless non natural keys are employed. Measures of accuracy coarser than individual values typically use the ratio between the number of accurate values and the total number of values [Scannapieco et al. 2005].
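A minimal sketch of such a measurement is given below, assuming a normalised edit similarity from the standard library as the comparison function and a hypothetical reference column; the final ratio is the coarser, column-level accuracy just described.

```python
import difflib

# Value-level syntactic accuracy: similarity between a stored value v and a
# reference value v' above a threshold counts the value as accurate.
def syntactically_accurate(v: str, v_ref: str, threshold: float = 0.9) -> bool:
    return difflib.SequenceMatcher(None, v, v_ref).ratio() >= threshold

stored    = ["Jhon", "Mary", "Alice"]     # hypothetical stored column
reference = ["John", "Mary", "Alice"]     # assumed correct reference values
accurate = sum(syntactically_accurate(v, r) for v, r in zip(stored, reference))
print(accurate / len(stored))             # column-level accuracy ratio, here 2/3
```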

3.4.2 Completeness. Completeness is defined as the extent to which data are of sufficient breadth, depth and scope for the task at hand [Wang and Madnick 1989]. [Pipino et al. 2002] identify schema completeness as the degree to which entities and attributes are not missing from the schema (database-level granularity), column completeness as a function of the missing values in a column of a table (column or attribute level granularity) and population completeness as the amount of missing values with respect to a reference population (value granularity: value, tuple, attribute and relation completeness). The above notions fit the relational model, where null values are present, but for the XML data model, where nulls are non-existent, the concept of a reference relation is introduced. Given a relation r, the reference relation ref(r) is the relation containing all tuples that satisfy the relational schema of r. However, the reference relations are not always available, and their cardinality, which is time dependent, must then be used. Completeness is then expressed as

Completeness = cardinality(r) / cardinality(ref(r))    (1)

It must be noted that in a model with null values the presence of a null value generally indicates a missing value. However, in order to characterise completeness there is a need to understand why a value is missing: a value can be unknown, non-existent or of unknown existence.
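A minimal sketch of the completeness ratio of equation (1), applied to a small XML fragment, might look as follows. The element names, the reference population of patient identifiers and the attribute-level check are illustrative assumptions.

```python
# Sketch: completeness of an XML patient list against a reference
# population, following equation (1).
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<patients>
  <patient nhsNumber="111"><allergy>penicillin</allergy></patient>
  <patient nhsNumber="222"/>
  <patient nhsNumber="333"><allergy/></patient>
</patients>
""")

# Reference population: the patient identifiers that should be present.
reference_population = {"111", "222", "333", "444"}

present = {p.get("nhsNumber") for p in doc.findall("patient")}
population_completeness = len(present & reference_population) / len(reference_population)

# Attribute-level completeness: fraction of patients with a non-empty <allergy> element.
with_allergy = sum(1 for p in doc.findall("patient")
                   if (a := p.find("allergy")) is not None and (a.text or "").strip())
column_completeness = with_allergy / len(doc.findall("patient"))

print(population_completeness)  # 3/4 = 0.75
print(column_completeness)      # 1/3, approximately 0.33
```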

3.4.3 Time-related: Currency, timeliness, and volatility. Data values are either stable or time-variable. Even where data values are stable, their time of collection and transformation are relevant, hence the need for time-related dimensions. The time-related dimensions are interdependent not only on each other but also on dimensions like completeness and accuracy, and their measurement depends on the availability of time-related metadata. Currency is a measure of how promptly data are updated and is measured with respect to last-updated metadata. Timeliness is the currency of data relative to a specific task, usage or point in time (meeting a deadline), so data can be current but late. Volatility is a measure of the frequency with which data vary in time. [Ballou and Pazer 2003] define currency as

Currency = Age + (DeliveryTime − InputTime)    (2)

where Age measures how old the data unit is when received, DeliveryTime is when the information is delivered and InputTime is when the data was obtained. If volatility is defined as the length of time for which data remain valid, timeliness can be defined as

Timeliness = max(0, 1 − Currency/Volatility)    (3)
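Assuming that last-updated and delivery timestamps are available as metadata, equations (2) and (3) can be computed as sketched below; the timestamps and the 30-day volatility are illustrative values only.

```python
# Sketch: currency and timeliness following equations (2) and (3).
from datetime import datetime, timedelta

def currency(age: timedelta, delivery_time: datetime, input_time: datetime) -> timedelta:
    """Currency = Age + (DeliveryTime - InputTime)."""
    return age + (delivery_time - input_time)

def timeliness(curr: timedelta, volatility: timedelta) -> float:
    """Timeliness = max(0, 1 - Currency / Volatility)."""
    return max(0.0, 1.0 - curr / volatility)

if __name__ == "__main__":
    input_time    = datetime(2006, 9, 1)   # when the observation entered the system
    delivery_time = datetime(2006, 9, 11)  # when it was delivered to the clinician
    age           = timedelta(days=2)      # how old it already was on receipt

    curr = currency(age, delivery_time, input_time)         # 12 days
    print(timeliness(curr, volatility=timedelta(days=30)))  # 1 - 12/30 = 0.6
```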

3.4.4 Consistency. This dimension captures the violation of semantic rules defined over a set of data units [Scannapieco et al. 2005]. In relational theory, integrity constraints are instantiations of such semantic rules. Consistency is problematic mainly in environments where the semantics cannot be fully expressed in the schema, owing to restrictions of the schema language or to the diversity and sheer number of constraints, known and unknown. Relational theory provides inter-relational and intra-relational constraints (multiple attribute or domain constraints). To express such semantics in non-relational environments the theoretical editing model can be employed.
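As a small illustration, a semantic rule of this kind can be checked directly over data units and summarised as a ratio; the rule (discharge must not precede admission) and the episode records below are assumptions made for the example, not drawn from the cited work.

```python
# Sketch: a consistency check as a semantic rule over data units.
from datetime import date

episodes = [
    {"id": "e1", "admitted": date(2006, 3, 1), "discharged": date(2006, 3, 5)},
    {"id": "e2", "admitted": date(2006, 4, 10), "discharged": date(2006, 4, 2)},  # violates the rule
]

def consistent(ep) -> bool:
    """Semantic rule: discharge date must be on or after admission date."""
    return ep["discharged"] >= ep["admitted"]

violations = [ep["id"] for ep in episodes if not consistent(ep)]
consistency = 1 - len(violations) / len(episodes)
print(violations, consistency)  # ['e2'] 0.5
```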

3.5 Data quality control in Databases

The terms information quality and data quality have been used to describe mismatches between the view of the world provided by an information or database system and the true state of the world [Parssian et al. 2004]. In the relational model an information product can be viewed as the result of a sequence of relational algebra operations, so the quality (accuracy and completeness) of the derived data is a function of the quality attributes of the underlying data in the base tables and of the operations applied. This inspired [Parssian et al. 2004] to develop output quality metrics for the relational operators, i.e. selection, projection and join.
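The flavour of such propagation can be conveyed with a deliberately simplified sketch: assuming tuple accuracy is independent of the selection predicate and of the join keys, output accuracy can be estimated from the input profiles. This is an illustrative simplification, not the actual derivation of [Parssian et al. 2004].

```python
# Sketch: naive propagation of tuple-level accuracy through relational
# operators, under an independence assumption.

def select_accuracy(input_accuracy: float) -> float:
    """Selection on attributes assumed error-free keeps the accuracy profile."""
    return input_accuracy

def join_accuracy(left_accuracy: float, right_accuracy: float) -> float:
    """A joined tuple is treated as accurate only if both contributing tuples are."""
    return left_accuracy * right_accuracy

# Example: patient table 95% accurate, lab-results table 90% accurate.
print(select_accuracy(0.95))      # 0.95
print(join_accuracy(0.95, 0.90))  # 0.855
```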

[Mazeika and Bohlen 2006] use a string proximity graph to capture the properties of proper-noun databases containing misspellings. This is typically useful in proper-noun databases, such as name and address databases, where dictionary-based techniques are not adequate. The approach combines techniques which give an efficient statistical approximation of the selectivity for a given string and edit distance with a more precise computation of the centre and border of hyper-spherical clusters of misspellings. QSQL is an extension of SQL which allows the inclusion of quality metrics in user queries to multiple sources in order to select the data source that meets the quality requirements of the user. It however assumes the pre-assessment of the quality dimensions using sampling techniques, with the results stored in meta tables; the examples presented did not consider the granularity of the assessments, as the metadata were all table oriented. The work of [Ballou et al. 2006] also estimates the quality of IPs as the output produced by some combination of relational algebra operations applied to data from base tables. [Ballou et al. 2006] assume that the quality of the base tables is not known at design time and cannot be known with precision owing to the dynamic nature of real-world databases. The measure of the quality of an IP is the number of acceptable data units found divided by the total number of data units. The unacceptable data units are estimated from defects in data sampled from the base tables, with the data units defined at a level of granularity sufficient to produce other data units via algebraic operations.
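A minimal sketch of such a sample-based estimate, in which the fraction of acceptable units in a random sample of base-table rows stands in for the quality of a derived IP, is shown below; the sample size, the acceptability predicate and the table contents are illustrative assumptions.

```python
# Sketch: sample-based estimation of information-product quality as the
# fraction of acceptable data units, in the spirit of [Ballou et al. 2006].
import random

base_table = [{"nhs_number": f"{i:03d}", "dob": None if i % 7 == 0 else "1970-01-01"}
              for i in range(1, 501)]

def acceptable(row) -> bool:
    """A data unit is acceptable here if its date of birth is recorded."""
    return row["dob"] is not None

def estimated_quality(table, sample_size=100, seed=1) -> float:
    random.seed(seed)
    sample = random.sample(table, sample_size)
    return sum(acceptable(r) for r in sample) / sample_size

print(estimated_quality(base_table))  # roughly 1 - 1/7, i.e. about 0.86
```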

Data quality measurement still remains an open problem. There is a lot of work based on a few dimensions in isolated contexts. A true measure will have to incorporate all the dimensions most relevant to the domain in context.

4. DATA IN THE HEALTHCARE DOMAIN

Health and clinical professionals meet the needs of patients by drawing on the knowledge accumulated by medicine over 5000 years. Clinical practice is a knowledge-based business, requiring clinicians to use vast amounts of information to make critical decisions during patient care. About a third of doctors' time is spent recording and synthesising information from personal and professional communication. Yet most of the information doctors use during clinical practice is locked up in paper-based records and the knowledge reproduced by rote [Smith 1996]. During personal and professional communication with patients and other clinicians many questions are raised and much information is shared. Managing this information, amongst the other problems listed below, has been the focus of most medical information research.

—Primarily, there is a lack of explicit structure. Many of these documents are still textual narrations, and irrelevant search queries can be reduced by applying some meaningful structure to clinical documents [Schweiger et al. 2002]: data quality requirements like usability, consistency etc.

—There is also the problem of the flexibility of a document's structure. Clinical data requires flexibility in terms of free-textual descriptions, different structural levels and even individual structures, so the document must not restrict content [Schweiger et al. 2002]: requirement of an appropriate storage system.

—The emergence of new complex data types and their impact on the perception of the old types, in particular their inter-relationships [MacKenzie 1998]: requirement of an appropriate storage system.


—Huge amounts of data, where the typical pattern of data growth is exponential [Leonidas Orfanidis 2004; MacKenzie 1998]. Managing this knowledge, with its high rate of change, is critical to good clinical practice [Smith 1996]: a combination of data quality requirements like appropriate amount of data and an appropriate storage system.

—Diverse and fast-growing knowledge and information sets: requirement of an appropriate storage system and information model.

—Data storage is normally distributed and data duplication is the norm [MacKenzie 1998]. Health care processes involve caregivers in highly mobile functions; care happens at the GP, at the secondary care hospital, in surgery, at the patient's bed and home, in ambulances etc. [Bilykh et al. 2003]

—Storage formats are as heterogeneous as the actual data [MacKenzie 1998]: a combination of data quality requirements like completeness of data and an appropriate storage system.

—The indexing facilities provided are inadequate [MacKenzie 1998]: requirement of an appropriate storage system.

—Non-integrated databases across hospitals [Lederman 2005]: requirement of an appropriate distributed storage system.

—Widespread use of paper records [Lederman 2005].

—Poor data security [Lederman 2005; Bilykh et al. 2003], with a growing range of people requiring access to clinical information, for example clinical librarians in the information process of evidence-based medicine [Jerome et al. 2001]: data quality requirements like accessibility, availability, security, consent, reuse etc.

—Poor process management of consent to the disclosure of patient data [Lederman 2005]: data quality requirements like confidentiality.

—Scale and integration of heterogeneous data like clinical images (X-rays, MRI, PET scans) and speech, to mention a few: requirement of an appropriate storage system.

—Failure to understand consumers' needs [Riain and Helfert 2005]: data quality requirements like integrity of patients.

—Poorly defined information production processes coupled with an unidentified product life cycle [Riain and Helfert 2005]: requirement of an appropriate information model.

—Lack of an information product manager (IPM) [Riain and Helfert 2005], which is crucial if analysis corresponding to quality treatment is the prime query problem [Pedersen and Jensen 1998; Leonidas Orfanidis 2004]: requirement of an appropriate storage system and information model with the integration of quality control.

The development of electronic records and the use of the right storage medium, such as XML databases, will alleviate or reduce the extent of most of the aforementioned problems apart from the data quality related ones. It will also help build research databases to enhance patient care delivery. Many national healthcare sectors are reforming existing policies, technology and frameworks with the long-term goal of keeping a complete record of a patient's care. This complete record will be available to all concerned and is described as complete because it will capture all data generated from the process of care, spanning a patient's life from cradle to grave. This presents a fundamental change to how health professionals manage patient care data. The next few sections summarise its implementation in the UK.

4.1 Introduction: Current trends of the Electronic record for Patient care

The idea of computerizing the patient care record has been around since the early 1960s, when hospitals first started using computers [Grimson et al. 2000]. Its use was initially focused on financial processes and therefore only kept basic data about the patient [Canning 2004]. As hospitals and laboratories became more computerized, test results became available in computerized format and were consequently integrated with the basic demographic data [Grimson et al. 2000].

The electronic record has over the years had a variety of names: Computerized Medical Record (CMR), Computerized Patient Record (CPR) and Electronic Medical Record (EMR), to mention a few. More recently, however, Electronic Patient Record (EPR) and Electronic Health Record (EHR) have been globally used and defined. The EPR describes a record of periodic care provided mainly by one institution, whilst the EHR is a longitudinal record of a patient's health and healthcare from cradle to grave [Department of Health 1998]. The EHR has been embraced and is being actively used in most national healthcare domains [Department of Health 1998; Sherman 2001; Raoul et al. 2005; Riain and Helfert 2005].

The computerisation of the NHS and its patient records started under the National Programme for IT (NPfIT), now run by the new Department of Health agency NHS Connecting for Health (CfH). The National Programme for IT plans to connect over 30,000 GPs in England to almost 300 hospitals and give patients access to their personal health and care information. An estimate of about 8 billion transactions being handled each year by 2005 has been made [Canning 2004]. This will totally transform the way the NHS works by enabling service-wide sharing of electronic patient records [Sugden et al. 2006]. NHS CfH, which was launched in April 2005, is responsible for delivering the national IT programme along with business-critical systems. The programme started with the implementation of the N3 national networking service followed by the NHS Care Record Service functionality. This was in two parts: firstly the national procurement, which included the Spine, electronic booking and the electronic transfer of prescriptions, and secondly the "Cluster" level procurement, which included patient administration systems (PAS), electronic ordering and browsing of tests, picture archiving and communication systems, as well as clinical decision support systems.

The rest of this section investigates the current state of the above systems and services, in order to gain an overall understanding of the developments made by the programme towards the creation of the national patient care record, by discussing the elements in the following order [CfH 2005; Canning 2004; Sugden et al. 2006].

—NHS Care Record Service
—Choose and Book
—Electronic Prescription Service (EPS)
—N3 national network for the NHS
—NHSmail national email and directory service


—Picture Archiving and Communications Systems (PACS)

This will help in discussing the challenges affecting the programme's successful development, particularly as the implementation of such large-scale health service IT projects in a few other nations' health sectors has proved difficult [Hendy et al. 2005].

4.2 The Programme

4.2.1 NHS Care Record Service. The NHS Care Record Service (NCRS) is central to the strategy for the creation of the unified electronic record [Moogan 2006]. The electronic record which will be the output of this service will be the basis for the implementation of the other services and systems mentioned above. It will enable the booking of appointments based on the patient's choice (the Choose and Book service), the automatic transfer of complete records between General Practitioners when a patient changes address, provide instant access to patient medical records when needed for emergency care, improve health by giving people access to this information, and improve care through better safety and outcomes by ensuring that information and knowledge are available when needed. It will replace the old Patient Administration System (PAS) [Hendy et al. 2005].

The NCRS is composed of the following two elements, together with a third patient access feature [CfH 2006]. Firstly the Detailed Care Record, which comprises data generated during episodes of care in a particular institution combined with data from other organisations providing care to the same patient. These organisations include NHS acute trusts, mental health trusts, general practices and their wider primary care team (pharmacy, community nursing team, dental surgery, opticians, social care team and so on). Secondly the Summary Care Record, which is less complex and contains information to facilitate cross-organisation patient care. This needs to be carefully controlled so that it serves its purpose as the summary of the detailed record. It should contain aspects like major diagnoses, current and regular prescriptions, allergies, adverse reactions etc. which are deemed significant, whilst still keeping its complexity to a minimum. The summary record will have components maintained or obtained from general practices, maintained summaries of the patient record, discharge summaries from hospitals, mental health trusts etc. Lastly the access feature, HealthSpace, which not only ensures the availability of summaries of the electronic record to the patient via the secure HealthSpace website but also allows patients to add comments, treatment preference notes, important facts like religion, a record of self-medication, and updates to details about their weight, blood pressure and so on. HealthSpace will eventually replace the NHS Direct website [Cross 2006a]. A typical example from the central Hampshire electronic health record pilot project shows the typical elements of the electronic record as follows.

—General practice systems
  —General practice clinical record (coded items)
  —General practice prescription record

—Hospital systems
  —Inpatient record (hospital episodes statistics extract)
  —Discharge letter
  —Pathology requests
  —Radiology requests
  —Inpatient drug prescriptions and administrations
  —Outpatient attendances
  —Accident and emergency department attendances and discharge letter
  —Maternity discharge letter
  —Waiting list details

—NHS Direct
  —Call summary
  —Advice given

—Ambulance
  —Patient details
  —Observation details
  —Intervention details

—Social services
  —Client details
  —Residential care record
  —Non-residential care record

The following illustration of a patient record, taken from [Moogan 2006], should help give a better picture. Mrs Ross's general practitioner records all consultations with Mrs Ross, as does the practice nurse; together these form part of the general practice component of Mrs Ross's Detailed Care Record. So when she is referred to and sees a diabetologist after being diagnosed with diabetes mellitus, a junior doctor and separately a nurse in her local hospital can, with Mrs Ross's consent, see the detailed care record contributed by the hospital and portions of the general practitioner's contribution, with a supporting knowledge base offering decision support and guidance, whilst a record of the access trail is kept progressively. Mrs Ross then sees her pharmacist to get her medication, her optician to have a check for diabetic eye disease and her podiatrist for foot care. Each will be able to see their own records and, if she consents, important entries in other parts of her detailed record that will help deliver the best care. A few weeks later Mrs Ross visits her hospital diabetic clinic. They record, among other things, her new blood test, change her medication, and note that she has developed the first signs of diabetic eye disease. These are all significant enough to be in her summary record. She will have access to part of this information through HealthSpace and can even record things like her blood pressure, her weight changes etc.

A summary of the patient's EHR will be held in a national database, known as the 'Spine', ensuring that particularly vital information is always accessible. The more in-depth record will be kept locally at the patient's main GP practice or hospital.

Issues of data confidentiality are very important and a lot of measures have been taken; this is discussed further in the next section, where it is more appropriate. The NCRS will start in the middle of 2006 and will develop iteratively in improvement cycles until 2010 [Moogan 2006; CfH 2006]. This summer (2006) every household in England should receive a leaflet explaining the NHS plan to make their healthcare records accessible electronically [Cross 2006a].


4.2.2 Choose and Book. The Choose and Book system allows a General Practitioner to refer a patient for elective treatment at another centre via a computer system, enabling the General Practitioner to perform this booking directly from the surgery [Cross 2006b; Canning 2004; CfH 2006]. In the above illustration Mrs Ross will thus have a choice of which diabetologist to see as well as when to book the initial appointment. The patient will have a choice of up to four hospitals or clinics [CfH 2006; Coombes 2006]. The information detailing the services which local centres offer is compiled by the Department of Health (DoH) and will include performance statistics like waiting times, rates of methicillin-resistant infection and cancelled operations. However the commissioning is left to local primary care trusts, placing no restriction on whether the commissions go to NHS hospitals or independent treatment centres [Coombes 2006]. The Choose and Book system will effectively change the access rights to the patient EHR, so that a clinician from the elective treatment centre will have authorisation to all or sections of the record. Questions arise such as "will this involve the transfer of sections or all of the record between the two treatment centres?", "how secure will the transfer between the two centres be?" and "will patient confidentiality be maintained?". The answers will largely depend on the structure of the data warehouse.

4.2.3 N3 national network for the NHS. N3, the new network replacing NHSnet (a private NHS network), will run nationally. It is a world-class networking service vital to the creation, delivery and support of the national IT programme. It has sufficient secure connectivity and broadband capacity to meet the current and future networking needs of the NHS [?]. It will serve both as the broadband network for the Spine and as a telephony network, and it will offer rapid transmission of X-ray films and other images [CfH 2006]. There will be virtual private networks to allow transmission between GPs and trusts [Hendy et al. 2005]. NHS Connecting for Health delegates the responsibility for integrating and managing the service to an appointed N3 Service Provider (N3SP), with British Telecom being the main service provider.

Implementation of N3 began in April 2004. The planned timeline was to connect at least 6,000 sites in the first year, i.e. by 31 March 2005, and then 6,000 in each of the following two years, with all 18,000 sites in the NHS estimated to be connected by 31 March 2007. By August 2005 over 10,000 sites had received their N3 connections, including more than 75 per cent of GP practices [?]. The total number of sites in England connected to N3 by 17th February was 13,559, with 1,148 in Scotland [N3 2006]. Commendable progress has been made, with 98% of general practices connected and essential components of the records service in place since the end of February 2006 [Cross 2006a]. A critical look at technological needs like wireless systems, voice-enabled devices, handwriting recognition devices etc. and their support is, however, out of the scope of this report. Nevertheless the implications and effects on the implementation of the EHR are well understood and a list of these is presented in section 4.3.1.

4.2.4 Electronic Prescription Service. The Electronic Prescription Service (EPS) will allow the transmission of a patient's prescription from the General Practitioner to a pharmacy of the patient's choice [CfH 2006]. The Electronic Prescription Service will streamline the current time-consuming process for dealing with repeat prescriptions by alleviating the need for patients to visit their GP just to collect a repeat prescription. It will provide a more efficient way of dealing with the 1.3 million prescriptions currently being issued. It will be implemented in two phases. The first stage, which allows existing prescription forms to be printed with a unique barcode, had been rolled out at the time of writing this report [NCFH 2006b]; the patient can present this form at a pharmacy, which retrieves the record using the barcode. The second stage will totally replace the paper-based prescription with an electronic one [NCFH 2006a]. Electronic prescriptions will include an electronic signature of the prescribing health professional, and access to the prescription will be controlled by smartcards. In the longer term, the EPS will be integrated with the NHS care record system [NCFH 2006b]. This service started in 2005 and is expected to roll out fully by the end of 2007 [CfH 2006; NCFH 2006b].

4.2.5 Picture Archiving and Communications Systems (PACS). PACS will enable the storage and transmission of X-ray and scan data in electronic format, linked with the patient's detailed record [CfH 2006]. PACS technology allows for a near film-less process, with all the flexibility of digital systems. This eliminates the costs associated with hard film and releases valuable space currently used for storage. PACS will deal with a wide range of specialties, including radiotherapy, CT, MRI, nuclear medicine, angiography, cardiology, fluoroscopy, ultrasound, dental and symptomatic mammography. In due course NHS PACS will be tightly integrated with the NHS Care Record Service (CRS) described above, removing the traditional barrier between images and other patient records and providing a unified source for the clinical electronic record of a patient.

PACS will be delivered at NHS locations including strategic health authorities, acute trusts and any location where pictures of a medical nature are required for the purposes of NHS diagnosis or treatment, such as military hospitals, the homes of NHS radiologists and specialists, and new diagnostic treatment centres [NCFH 2005]. It will be available nationally during 2007 [CfH 2006].

4.3 Challenges of the Electronic Health Record and its quality

4.3.1 Technological Issues. Most interaction between clinicians comprises narrative (free text). Narrative contains more information than isolated or coded words. Most electronic records, however, rely on structured data entry, when fundamentally health care data is not structured but rather semi-structured. The semi-structured nature of healthcare data motivates the consideration of XML databases, XML now being a widely used data and object exchange format for both structured and semi-structured data [Shanmugasundaram et al. 1999; Widom 1999; MacKenzie 1998]. However, database container solutions which support the required interoperability are still under intense research.

Making sure that the data is in electronic format requires heavy investment in wireless, voice-enabled, handwriting recognition and touch screen software and hardware, to mention a few. Handwriting, for example, is automatic, but for most people entering data into a computer by typing is not. Handwriting also potentially allows more thought to be focused on the diagnosis and the management of a patient's illness. However, tools that will stimulate cognitive reasoning, such as differential diagnosis, prompting, reminders, mnemonics, algorithms, references, risk calculators, decision trees, and best-evidence resources, are difficult to develop [Walsh 2004]. Speech is another, easier route to data entry [Zue 1999]:

Speech is natural, we know how to speak before knowing how to read and write. Speech is also efficient: most people can speak about five times faster than they type and probably ten times faster than they can write. And speech is flexible: we do not have to touch or see anything to carry on a conversation.

The overhead in terms of investment is huge in both time and capital. The success of electronic health care records is so dependent on such technological advancements that any solution without these technologies will not match expectations. The shortcomings have an adverse effect on controlling quality defects introduced into the data while clinicians deliver care to patients.

4.3.2 Confidentiality and Patient Acceptance. Access to the electronic record will only be available to trained NHS professionals, who will require a smart card with a security chip together with a Personal Identification Number (PIN) [CfH 2006]. Access to the electronic record will be role based: each user will have a role and will also belong to a group which is associated with specific privileges and rights to specific sections of the electronic record. An access trail is recorded and this is monitored by a privacy officer [CfH 2006]. These measures are restrictive but unfortunately not foolproof, as card and identity fraud still prevails over similar systems for personal banking. This raises an alarm which patients cannot ignore, and they are therefore not generally convinced about the safety of their records. The availability of the information over the internet is another concern which affects the implementation with regard to gaining patients' trust; the use of virtual private networks and encryption of the data, even with an algorithm which involves the patient's unique NHS number, is not adequate. However, patients might accept that the confidentiality and security of their data is far better than when the data was kept in pieces, incomplete, and with very little control over which NHS personnel had access to it, as the security is far better than that previously used to control both the paper-based record and the partial electronic record; there is evidently better quality of security, audits and fraud detection [CfH 2006]. Generally patients are often upset to discover the sharing of their records, and the inclusion of the social care record decreases public confidence, as only 23% of people would be willing for their NHS records to be shared with social care staff [Cross 2006a]. Measures like informing patients of their right to opt out and providing a sealed envelope service to control parts of specific sections should help. Patients can limit their participation, allowing access for emergency use only, partial access to the summary of the records, or no summary care record at all. Patients can also limit which NHS professionals have access to their record, based on grouped roles or individual roles [Moogan 2006]. This challenge is of most interest as it is a symptom of the data quality dimensional category of accessibility discussed earlier. Any attempt to deal with this category will require a process model integrated with a suitable security model for the health domain. It will have to provide both assessment and control, as it is one issue for the model to provide outstanding quality in terms of accessibility and another to provide a measure of its level.


4.3.3 Standardization. The lack of standardization in the keeping of electronic records, both within and between local primary care trusts, seems to hamper the progress of the programme. A key requirement for making an interoperable electronic record is preserving its meaning and protecting its confidentiality and sensitivity across systems. This is unachievable without appropriately defined standards [Cross 2006a; Kalra and Ingram 2006; Blobel 2006]. As [Dolin 1997] argues,

Data can be nested to varying degrees (e.g. a data table storing laboratory results must accommodate urine cultures growing one or more than one organism, each with its own set of antibiotic sensitivities). Data can be highly interrelated (e.g. a provider may wish to specify that a patient's renal insufficiency is due both to diabetes mellitus and to hypertension, and is also related to the patient's polyuria and malaise). Data can be heterogeneous (e.g. test results can be strictly numeric, alpha-numeric, or composed of digital images and signals) ... a computerized health record must be able to accommodate unforeseen data.

The development of a generic cross-institutional record architecture and infrastructure started with the Synapses and SynEx research projects, an attempt to develop a generic approach to applying internet technologies to viewing and sharing healthcare records, integrated with existing health care computing environments [Grimson et al. 2001; Joachim Bergmann et al. 2006]. Synapses and SynEx built upon the considerable research efforts by the database community into federated (collections of autonomous, heterogeneous databases to which integrated access is required) or interoperable database systems. Synapses aimed to equip clients with the ability to request complete or incomplete EHCRs from connected information systems referred to as feeder systems. The European Union's Telematics Framework Programmes have supported research over the past decade to create standards like the Good European Health Record (GEHR), building on standards like the CEN standards and ENV 13606 [Kalra and Ingram 2006].

The European pre-standard ENV 13606:2000 (Electronic Healthcare Record Communication), which grew out of the earlier standard ENV 12265, is a message-based standard for the exchange of EHRs [Christophilopoulos 2005; Grimson et al. 2001]. Its revision in 2001 took into consideration the adoption of techniques from other standards like OpenEHR.

OpenEHR was initiated under an EU research programme (the Good European Health Record, continued as the Good Electronic Health Record). Unlike ENV 13606:2000, it followed a multi-level methodology as opposed to a single-level one [Christophilopoulos 2005]. The first level is a generic reference model for healthcare containing few classes (e.g. role, act, entity, participation etc.), ensuring stability over time, whilst the other level considers healthcare application concepts modelled as "archetypes" (the key concept of OpenEHR). Archetypes are reusable elements which facilitate interoperability and re-use and which have the capability of evolving with medical knowledge and practice [Christophilopoulos 2005; Rector et al. 2003].

Medical Markup Language (MML) is another set of standards, developed in 1995 to allow the exchange of medical data between different medical information providers by the "Electronic Health Record Research Group", a special interest group of the Japan Association for Medical Informatics. It was based on Standard Generalized Markup Language (SGML) at its inception and was later ported to XML (eXtensible Markup Language) [Guo et al. 2004]. The recent version 3.0, however, is based on the HL7 Clinical Document Architecture (CDA).

The Patient Record Architecture (PRA) is described by [Boyer and Alschuler 2000]:

The semantics of PRA is drawn from the HL7 RIM [Boyer and Alschuler 2000]. The HL7 Patient Record Architecture (PRA) is a document representation standard designed to support the delivery and documentation of patient care. A PRA is a defined and persistent object. It is a multimedia object which can include text, images and sounds.

One key feature of HL7 is that it allows refinement of models by specification restrictions, and its vocabulary is replaceable [Glover 2005]. The HL7 standard version three (HL7 v3), which includes the Reference Information Model (RIM) and the Data Type specification (both ANSI standards), spans all healthcare domains. Unlike the previous versions, which provided syntactic interoperability (exchange of information between systems without any guarantee of consistent meaning across the systems), HL7 v3 aims at semantic interoperability (which ensures unambiguous meaning across systems) through its support for data types. This version does not quite measure up, as stakeholders expect computable semantic interoperability (CSI); this is a limitation, as XML cannot support CSI [Jones and Mead 2005]. HL7 Version 3 includes a formal methodology for binding the common structures to domain-specific concept codes. This enables the separation of common structures from domain-specific terminologies, such as the vocabularies used in the Systematized Nomenclature of Medicine (SNOMED), Digital Imaging and Communications in Medicine (DICOM), the Medical Dictionary for Regulatory Activities (MedDRA), Minimum Information About a Microarray Experiment (MIAME)/Micro Array and Gene Expression (MAGE) and Logical Observation Identifiers, Names and Codes (LOINC) [Jones and Mead 2005].

The problems highlighted above stem from data quality, a problem which has gained considerable attention from the research community. The healthcare domain, and most other domains reliant on data, have considered the effect of poor data quality; its measurement and control remain difficult and an open problem. The best standards for the EHR will have to incorporate data quality control if the problems are to be nipped in the bud. Any developed system which does not incorporate the aforementioned standards will only result in a less future-proof solution.

4.3.4 Scale and rising patient expectations. The National Health Service in England alone handles 1 million admissions and 37 million outpatient attendances per annum, requiring high quality and efficient communication between 2,500 hospitals and 10,000 general practices [Moogan 2006]; this scale puts a huge amount of stress and pressure on the implementation of the programme, considering the risks of migration and integration with the legacy systems gradually developed over time. Expectations of the benefits of the electronic record are so high that they may become an obstacle to accelerating its development, especially with frequent implementation failures and the fact that health data growth is exponential [MacKenzie 1998].

Health care in any national setting is also quite a complex enterprise [Goldschmidt 2005]. It is associated with immense, boundless knowledge, which is overwhelming, and its integration is a critical need [Kalra and Ingram 2006]. The fact that a centralized approach is being adopted is not future proof considering the sheer scale. A distributed approach like that of [Abiteboul et al. 2004] would be far better, as the data would be kept at the location where it is most frequently used, so that GPs would hold the data of their patients and could securely retrieve other aspects of the record from secondary care if required. All that is required is appropriate references in the record to its other segments.

4.3.5 Other Factors. [Midgley 2005] believes that the effect of handling health care information technology as a compulsory element of political processes is not a positive one. New campaigns have been launched to persuade clinicians to support the multimillion pound computerisation programme of health care in England. This was a result of the underperforming launch of elements of the programme, in particular the Choose and Book system [Cross 2006b]. Even though the survey conducted in January by Medix shows scepticism on the side of NHS professionals (68% regarded the performance of the programme as poor), the fact that 59% of GPs and 66% of other doctors still believe electronic records will improve patient care should drive the programme to succeed [Cross 2006b]. The requirement to limit health costs and to maximise resource utilisation cannot be ignored, especially considering the extent of NHS expenditure. The effect of Government policies controlling and maintaining evidence-based and quality-assured care cannot be ignored either, as their integration into the process is paramount [Kalra and Ingram 2006].

Communication between NHS CfH and the NHS care centres is poor, with a lack of clarity about future developments [Hendy et al. 2005]. The financial circumstances of some trusts are also reported to slow the transition to electronic records. Major problems have arisen during the delivery of releases, as the legacy IT systems which most trusts have procured over a long period of time, under long-term contracts, are not compatible; trusts might need new contracts to have these systems replaced. The timing of changing these systems is of concern, as it will have a negative impact on the development of the electronic record. International research has highlighted the clinical, ethical and technical requirements needed to effect the transition to electronic records [Kalra and Ingram 2006]. It is too early to predict whether the NHS programme for the electronic record will succeed or fail [Cross 2006a]. However, as the world looks to the EHR, with standards like HL7, openEHR etc. emerging and converging, there is a glimpse of success ahead. According to [Goldschmidt 2005], by 2020, based on present projections, approximately 50% of health care practitioners will be using some form of a functional EHR. However it needs vast investment, and massive expenditure must be made in the short term while most benefits can only be realized in the long term. Despite long-standing claims, and data from recent studies, there is still relatively little real-world evidence that widespread adoption of electronic records will save money overall.

The introduction of the EHR will eliminate most of the data problems in the healthcare domain identified in section 4, leaving the data quality related ones. However, as EHR models are aggregational in nature, they aggravate the quality problems: apart from carrying these problems and difficulties along, they also compound their severity. Nevertheless the expectations of quality held by users of this new information model will only rise. This is why an attempt to address these quality problems is paramount in this new setting.

5. RESEARCH DIRECTIONS

Relational modelling and data management have gained the most ground. They involve the breaking of complex units into typed atomic units; relations are then created between these units to mimic the initial conceptual complex units. This is normally not enough to support most online decisions and requires the generation of views for different purposes which readily answer the questions of interest (a data warehouse with materialised views). Data warehouses serve these purposes as they are designed to efficiently store data and answer online queries. This is typical of the healthcare domain, and the proposed solution for the centralised EHR in the UK is a distributed data warehouse solution.

Data warehouse management processes employ means to manage both data integration and its quality. This, however, is inadequate for determining a measure of service-based IQ, as it considers tasks and transformations performed mainly by system users like the programmer, designer etc. Hence the results do not reflect the right level or measure of data quality from the perspective of other stakeholders. The process of data collection, organisation for storage, reorganisation, processing, reinterpretation and summarisation for presentation and for other data applications forms the process of manufacture of information, and users play the roles of:

—Data producers; people or groups who generate data during the data production process.

—Data custodians; people who provide and manage computing resources for storing and processing data and who carry responsibility for the security of the data.

—Data consumers; people or groups who use the data: those who utilize, aggregate and integrate it.

—Data managers; people responsible for managing data quality.

The quality of data is affected by this data flow within a domain, and a true measure will need to consider the influence of the above processes and the effect of these roles on data quality systematically. Apart from [Hinrichs 2000] and the IP approach of [Davidson et al. 2003] (oriented to cross-organisation and cross-department settings), there is still no well established process model for managing data quality. Most assessments, including those which have later been applied in a domain like health care, focus on cross-domain data quality concepts and reuse, despite the fact that data quality is domain dependent [Gendron et al. 2004]. There has been no attempt as yet to restrict the solution to the healthcare domain first and tackle the abstraction later; this, on the other hand, might lead to fruitful possibilities.

The assessment and measurement of healthcare data quality needs to go beyond the quantification of the several quality dimensions of relevance to the domain. It requires an integrated performance measurement approach for service quality and clinical effectiveness. This can only be achieved with appropriate information system architectures which are process oriented, providing meta-information about the processes and their quality effects via appropriate indicators; these will provide a better basis for automating most aspects of the quality management process, including monitoring and improving healthcare processes. It will also enhance the tracking, management and audit of relationships within the healthcare network, as more detailed metadata can be captured from participants and service groups [Helfert et al. 2005].

A process-based approach will require tight integration of the workflow and the information model in a particular domain. The emergence of new information models for complex multidimensional data like healthcare data, coupled with the adoption of XML as a data exchange and storage format, presents new opportunities to transform the management of such data, whether localised or distributed. Such models include the Clinical Document Architecture (CDA) of HL7 version 3, OpenEHR archetypes and the XML DTD for health care by ASTM E31.25 (a subcommittee of the ASTM healthcare informatics committee), amongst others. These attempts to provide standard and complete models aggregating the disparate health information about patients mark a transition, comparable to the time after Codd's [Codd 1970] introduction of relational algebra: the period of building the essential models which are the backbone of today's legacy traditional DBMSs, a time when integrity constraints were thought to be enough to control the quality of input data into databases and when metadata and data quality issues were ignored [Stephens 2003]. However, the concept of data quality and the need to incorporate quality metrics in databases is now well understood [Motro and Rakov 1997]. The depth of incorporation of these metrics into the above information models is unfortunately quite poor and needs rigorous consideration.

Instead of following the data decomposition approach of breaking complex units down into atomic types, an attempt to manage the above information models as whole complex data types is much preferred. This will, however, require a standard means of transforming these information models, using some sort of object modelling construct, into a data model. This should be cross-domain, with an appropriate language. As this is a direct transformation, it should result in a mapping of information quality expectations from the model into data quality policy constraints. This will require the adoption of a metadata model to manage the data quality control.

It follows from this review that the information production process is a combination of data manipulation queries involving selections, insertions and deletions. Assume that an aggregation of such operations is informally referred to as a process. Tasks performed in a domain will be based on a process that may or may not be allowed to alter data. Typically, processes need to be identifiable so that only a capable user is allowed to run them or use them via some access control policy. They should, in particular, be able to define the impact they have on the data quality dimensions or their categorisation. An aggregation model of this kind, like the information models mentioned in the last paragraph, has a lower granularity. However this level of granularity is more appropriate when automatic generation of semantic metadata is needed, as it becomes possible to consider the tasks involved in the workflows or practices of the domain together with all their associated data.

The nested nature of these information models forms a tree structure similar to that of an XML document. XML databases offer most of the features required to manage these information models and the process approach. Their use will enhance the exploitation of the structural properties of an EHR, the implementation of flexible policy files, the making of references to other policy files, and the ability to tie processes to sub-trees or particular access paths.
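As a rough illustration of tying quality constraints to sub-trees of an EHR document, the fragment below attaches dimension thresholds to XPath-addressed regions of an XML record; the policy vocabulary, element names and paths are invented for the example and do not correspond to HL7 CDA, OpenEHR or any other standard discussed above.

```python
# Sketch: a data quality policy scoped to XML sub-trees via XPath-like paths.
import xml.etree.ElementTree as ET

policy = {
    # path of the governed sub-tree : quality constraints for that region
    "./summary/allergies":   {"completeness_min": 1.0, "accessibility": "clinician"},
    "./detailed/medication": {"completeness_min": 0.9, "accessibility": "prescriber"},
}

record = ET.fromstring("""
<ehr>
  <summary><allergies><item>penicillin</item></allergies></summary>
  <detailed><medication/></detailed>
</ehr>
""")

def completeness(subtree) -> float:
    """Toy measure: 1.0 if the sub-tree has any non-empty leaf content."""
    return 1.0 if any((e.text or "").strip() for e in subtree.iter()) else 0.0

for path, constraints in policy.items():
    subtree = record.find(path)
    ok = subtree is not None and completeness(subtree) >= constraints["completeness_min"]
    print(path, "meets policy" if ok else "violates policy")
```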

An attempt to implement a relational database which manages data, accuracy and lineage (i.e. data and some of its quality attributes) by [Widom 2005] is a key motivation for this research. However this work will differ in that it will not involve the creation of a new XML database system from scratch, but will rather derive ways to incorporate data quality measurement into a distributed XML database environment as described in section 2.7. This work will also go beyond accuracy and lineage, considering as many healthcare-related dimensions as possible, in particular one of the non-objective quality attributes, accessibility.

This research aims to

—Define the notion of processes formally for data quality measurement. This will include the exploration of process finiteness per domain and of process behaviour and its effect on data quality. Even though the generation of an exhaustive process basis for a particular domain will be difficult, it is critical, as the existence of unknown processes affects the overall quality and influences the ability to trace quality defects. This will require a formal construct like a process calculus and the application of fuzzy logic instead of simple ratios and probability theory.

—Define operators for the creation of new processes, the division of processes into smaller units and the combination of processes, with their data quality effects in context.

—Formalise a data quality policy scheme under which each process will need to operate. This will provide a means of specifying constraints which are implementable in a distributed environment. It will cater for all relevant dimensions and will follow a tiered approach structured around the categories of dimensions.

—Incorporate accessibility in particular, as it has had very little focus in research despite being essential in the EHR setting. An adaptation of the security policy model of [Anderson 1996], which is comparable to Bell-LaPadula (a military security policy) and Clark-Wilson (a banking security policy), with an appropriate process-based path indexing, is needed.

—Refine the policy model, resolving redundancy issues surrounding processes with equal, child or ancestor access paths. This will also involve investigating the administration of data quality policy files in a distributed environment, especially one following the peer-to-peer architecture mentioned in section 2.7.

—Test the model using a dataset from a hospital about the obesity of patients.

—Derive a domain-independent solution which will be more appropriate and reusable.

REFERENCES

Abate, M. L., Diegert, K. V., and Allen, H. W. 1998. A Hierarchical Approach to Improving Data Quality. Data Quality Journal 4, 1 (September), 365–9.

Abiteboul, S. 1997. Querying semi-structured data. In ICDT. 1–18.

Abiteboul, S., Alexe, B., Benjelloun, O., Cautis, B., Fundulaki, I., Milo, T., and Sahuguet, A. 2004. An Electronic Patient Record "on Steroids": Distributed, Peer-to-Peer, Secure and Privacy-conscious. In VLDB. 1273–1276.

Abiteboul, S., Manolescu, I., and Taropa, E. 2006. A Framework for Distributed XML Data Management. In EDBT, Y. E. Ioannidis, M. H. Scholl, J. W. Schmidt, F. Matthes, M. Hatzopoulos, K. Bohm, A. Kemper, T. Grust, and C. Bohm, Eds. Lecture Notes in Computer Science, vol. 3896. Springer, 1049–1058.

Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. L. 1997. The Lorel Query Language for Semistructured Data. International Journal on Digital Libraries 1, 1, 68–88.

Achard, F., Vaysseix, G., and Barillot, E. 2001. XML, Bioinformatics and Data Integration. Bioinformatics Review 17, 2, 115–125.

Amer-Yahia, S., Koudas, N., Marian, A., Srivastava, D., and Toman, D. 2005. Structure And Content Scoring For XML. In VLDB '05: Proceedings Of The 31st International Conference On Very Large Data Bases. VLDB Endowment, Secaucus, NJ, USA, 361–372.

Anderson, R. 1996. A Security Policy Model for Clinical Information Systems. BMA Report ISBN 0-7279-1048-5, British Medical Association.

Arts, D., de Keizer, N., and de Jonge, E. 2001. Data Quality Measurement and Assurance in Medical Registries. In MEDINFO 2001: Proceedings of the 10th World Congress on Medical Informatics. IOS Press, IMIA, London, 404.

Aznauryan, N. A., Kuznetsov, S. D., Novak, L. G., and Grinev, M. N. 2006. SLS: A numbering scheme for large XML documents. Programming and Computer Software 32, 1, 8–18.

Ballou, D., Wang, R., Pazer, H., and Tayi, G. K. 1998. Modeling Information Manufacturing Systems to Determine Information Product Quality. Management Science 44, 4, 462–484.

Ballou, D. P., Chengalur-Smith, I. N., and Wang, R. Y. 2006. Sample-Based Quality Estimation of Query Results in Relational Database Environments. IEEE Transactions on Knowledge and Data Engineering 18, 5, 639–650.

Ballou, D. P. and Pazer, H. L. 2003. Modeling Completeness Versus Consistency Tradeoffs in Information Decision Contexts. IEEE Transactions on Knowledge and Data Engineering 15, 1, 240–243.

Baru, C., Chu, V., Gupta, A., Ludascher, B., Marciano, R., Papakonstantinou, Y., and Velikhov, P. 1999. XML-based information mediation for digital libraries. In DL '99: Proceedings of the fourth ACM conference on Digital libraries. ACM Press, New York, NY, USA, 214–215.

Bilykh, I., Bychkov, Y., Dahlem, D., Jahnke, J. H., McCallum, G., Obry, C., Onabajo, A., and Kuziemsky, C. 2003. Can GRID Services Provide Answers to the Challenges of National Health Information Sharing? In CASCON '03: Proceedings of the 2003 conference of the Centre for Advanced Studies on Collaborative research. IBM Press, 39–53.

Blobel, B. G. 2006. Advanced EHR Architectures–Promises or Reality. Methods Inf Med. 45, 1,95–101.

Boniface, M. and Wilken, P. 2005. ARTEMIS: Towards a Secure Interoperability Infrastructurefor Healthcare Information Systems. In HEALTHGRID.

Bovee, M., Srivastava, R. P., and Mak, B. 2003. A Conceptual Framework and Belief-functionApproach to Assessing Overall Information Quality. International Journal of Intelligent Sys-tems 18, 1 (January), 51–74.

Boyer, S. and Alschuler, S. 2000. HL7 Patient Record Architecture Update. XMLEu-rope2000 http://www.gca.org/papers/xmleurope2000/papers, accessed 05-2005, 5.

Burd, G. and Staken, K. 2005. Use a Native XML Database for Your XML Data. XMLJournal May edition, http://xml.sys–con.com/read/90126.htm.

Canning, C. SPRING 2004. The Relevance of the National Programme for Information Technol-ogy to Ophthalmology. FOCUS 29, 2.

Catania, B., Ooi, B. C., Wang, W., and Wang, X. 2005. Lazy XML Updates: Laziness as aVirtue, of Update and Structural Join Efficiency. In SIGMOD ’05: Proceedings of the 2005ACM SIGMOD international conference on Management of data. ACM Press, New York, NY,USA, 515–526.

Ceri, S., Comai, S., Damiani, E., Fraternali, P., Paraboschi, S., and Tanca, L. 1999. XML-GL: A Graphical Language for Querying and Restructuring XML Documents. In SistemiEvoluti per Basi di Dati. 151–165.

CfH. 2005. Delivering IT for a modern, efficient NHS. Tech. rep., Connecting for Health. May.

End of Year Review, October 2006.

Page 33: Data Quality of Native XML Databases in the Healthcare Domain · Data Quality of Native XML Databases in the Healthcare Domain Henry Addico As XML data is being widely adopted as

Data Quality of Native XML Databases in the Healthcare Domain · 33

CfH. accessed 2006. Health Care Records Whenver and Whereever You Need Them. Tech. rep.,NHS Connecting For Health. accessed May.

Chamberlin, D., Robie, J., and Florescu, D. 2001. Quilt: An XML Query Language forHeterogeneous Data Sources. Lecture Notes in Computer Science 1997, 1.

Cho, S., Koudas, N., and Srivastava, D. 2006. Meta-data Indexing for XPath Location Steps.In SIGMOD Conference. 455–466.

Christophilopoulos, E. 2005. ARTEMIS (IST-1-002103-STP): A Semantic Web Service-basedP2P Infrastructure for the Interoperability of Medical Information Systems. InnoFire MedicalCooperation Network Newsletter .

Codd, E. F. 1970. A Relational Model of Data for Large Shared Data Banks. Communicationsof ACM 13, 6, 377–387.

Cohen, S., Kanza, Y., Kogan, Y. A., Nutt, W., Sagiv, Y., and Serebrenik, A. 1999. EquiXEasy Querying in XML Databases. In ACM International Workshop on the Web and Databases(WebDB’99). 43–48.

Coombes, R. 2006. Patients Get Four Choices for NHS Treatments. BMJ 332, 7532, 8.

Cross, M. 2006a. Keeping the NHS Electronic Spine on Track. BMJ 332, 7542, 656–658.

Cross, M. 2006b. New Campaign to Encourage use of National IT Programme Begins.BMJ 332, 7534, 139–a–.

Davidson, B., Lee, Y. W., and Wang, R. Y. 2003. ”Developing Data Production Maps: MeetingPatient Discharge Submission Requirements. International Journal of Healthcare Technologyand Management, 6, .2, 87–103.

Davidson, I., Grover, A., Satyanarayana, A., and Tayi, G. K. 2004. A General Approach toIncorporate Data Quality Matrices into Data Mining Algorithms. In KDD ’04: Proceedings ofthe tenth ACM SIGKDD international conference on Knowledge discovery and data mining.ACM Press, New York, NY, USA, 794–798.

Dekeyser, S., Hidders, J., and Paredaens, J. 2004. A Transaction Model for XML Databases.World Wide Web 7, 1, 29–57.

Department of Health, N. E. 1998. Information for Health: An In-formation Strategy for the Modern NHS 1998-2005, series A1103,.http://www.dh.gov.uk/PublicationsAndStatistics/Publications,Crown copyright Publica-tionsPolicyAndGuidance, 123 p.

Deutsch, A., Fernandez, M., Florescu, D., Levy, A., and Suciu, D. 1998. XMLQL: A QueryLanguage for XML. In WWW The Query Language Workshop (QL). Cambridge, MA.

Dolin, R. Jan 1997-Feb 1997. Outcome Analysis: Considerations for an Electronic Health Record.MD Computing. 14, 1, 50–6.

Ehikioya, S. A. 1999. A characterization of information quality using fuzzy logic. In Fuzzy In-formation Processing Society NAFIPS. 18th International Conference of the North American.635–639.

Fan, W. and Simeon, J. 2003. Integrity Constraints for XML. J. Comput. Syst. Sci. 66, 1,254–291.

Feinberg, G. 2004. Anatomy of a Native XML Database. In XML 2004 Conference And Exibition.SchemaSoft.

Fiebig, T., Helmer, S., Kanne, C.-C., Moerkotte, G., Neumann, J., Schiele, R., and West-mann, T. 2002. Anatomy of a Native XML Base Management System. The VLDB Jour-nal 11, 4, 292–314.

Finance, B., Medjdoub, S., and Pucheral, P. 2005. The Case For Access Control on XML Rela-tionships. In CIKM ’05: Proceedings of the 14th ACM international conference on Informationand knowledge management. ACM Press, New York, NY, USA, 107–114.

Gabillon, A. 2004. An Authorization Model for XML Databases. In SWS ’04: Proceedings ofthe 2004 workshop on Secure web service. ACM Press, New York, NY, USA, 16–28.

Gendron, M., Shanks, G., and Alampi, J. 2004. Next Steps in Understanding InformationQuality and Its Effect on Decision Making and Organizational Effectiveness. In 2004 IFIP In-ternational Conference on Decision Support Systems, G. Widmeyer, Ed. PRATO, TUSCANY.

End of Year Review, October 2006.

Page 34: Data Quality of Native XML Databases in the Healthcare Domain · Data Quality of Native XML Databases in the Healthcare Domain Henry Addico As XML data is being widely adopted as

34 · Henry Addico

Gendron, M. S. 2000. Data quality in the Healthcare Industry. Ph.D. thesis, State University ofNew York at Albany.

Gertz, M. and Schmitt, I. 1998. Data Integration Techniques based on Data Quality Aspects. In3rd National Workshop on Federated Databases. Magdeburg, Germany Shaker Verlag. ISBN:3-8265-4522-2.

Glover, H. 2005. An Introduction to HL7 Version 3 and the NPfIT Message ImplementationManual. In HL7 UK Conference: HL7 and its key role in NPfIT and Existing Systems Inte-gration, H. U. Conference, Ed.

Goldman, R., McHugh, J., and Widom, J. 1999. From Semistructured Data to XML: Migratingthe Lore Data Model and Query Language. In Proceedings of the 2nd International Workshopon the Web and Databases (WebDB ’99). Philadelphia, Pennsylvania.

Goldschmidt, P. G. 2005. HIT and MIS: Implications of Health Information Technology andMedical Information Systems. Communications of the ACM 48, 10 (October), 69–74.

Grimson, J., Grimson, W., and Hasselbring, W. 2000. The SI challenge in Health Care.Communications of the ACM 43, 6, 48–55.

Grimson, J., Stephens, G., Jung, B., Grimson, W., Berry, D., and Pardon, S. 2001. SharingHealth-Care Records over the Internet. IEEE Internet Computing 5, 3, 49–58.

Guo, J., Takada, A., Tanaka, K., Sato, J., Suzuki, M., Suzuki, T., Nakashima, Y., Araki, K.,and Yoshihara2, H. 2004. The development of MML (Medical Markup Language) version 3.0as a medical document exchange format for HL7 messages. Journal of Medical Systems 28, 6(December).

Haustein, M. P. and Harder, T. 2003. Advances in Databases and Information Systems. 978-3-540-20047-5, vol. 2798/2003. Springer Berlin / Heidelberg, Chapter taDOM: A TailoredSynchronization Concept with Tunable Lock Granularity for the DOM API, 88–102.

Haustein, M. P., Harder, T., Mathis, C., and Wagner, M. 2005. Deweyids - The Key toFine-Grained Management of XML Documents. In 20th Brasilian Symposium On Databases.85–99.

Helfert, M., Henry, P., Leist, S., and Zellner, G. 2005. Healthcare performance indicators:Preview of frameworks and an approach for healthcare process-development. In Khalid S.Soliman, K.S. (ed): Information Management in Modern Enterprise: Issues & Solutions -Proceedings of The 2005 International Business Information Management Conference. ISBN:0-9753393-3-8. Lisbon, Portugal, 371–378.

Hendy, J., Reeves, B. C., Fulop, N., Hutchings, A., and Masseria, C. 2005. Challenges toImplementing the National Programme for Information Technology (NPfIT): A QualitativeStudy. BMJ 331, 7512, 331–336.

Hinrichs, H. 2000. CLIQ - Intelligent Data Quality Management. In Fourth International BalticWorkshop on Databases and Information Systems: Doctoral Consortium. Vilnius, Lithuania.

IBM. accessed july 2006. DB2 9 for Linux UNIX and Windows pureXML and storage compression.IBM Software http://www-306.ibm.com/software/data/db2/9/, website.

Jagadish, H. V., Al-Khalifa, S., Chapman, A., Lakshmanan, L. V. S., Nierman, A., Pa-parizos, S., Patel, J. M., Srivastava, D., Wiwatwattana, N., Wu, Y., and Yu, C. 2002.TIMBER: A Native XML database. The VLDB Journal 11, 4, 274–291.

Jerome, R. N., Giuse, N. B., Gish, K. W., Sathe, N. A., and Dietrich, M. S. 2001. Informa-tion needs of clinical teams: analysis of questions received by the Clinical Informatics ConsultService. Bulletin Medical Library Association 89, 2 (April), 177–185.

Jiang, H., Lu, H., Wang, W., and Yu, J. X. 2002. Path Materialization Revisited: An EfficientStorage Model for XML Data. In CRPITS ’02: Proceedings Of The Thirteenth AustralasianConference On Database Technologies. Australian Computer Society, Inc., Darlinghurst, Aus-tralia, Australia, 85–94.

Joachim Bergmann, Oliver J. Bott, D. P. P., Hau, R., Joachim Bergmann, Oliver J. Bott,D. P. P., and Haux, R. 2006. An e-consent-based shared EHR system architecture for inte-grated healthcare networks. International Journal of Medical Informatics 2305, 7.

Jones, T. M. and Mead, C. N. 2005. The Architecture of Sharing. Healthcare Informatics online,35–40.

End of Year Review, October 2006.

Page 35: Data Quality of Native XML Databases in the Healthcare Domain · Data Quality of Native XML Databases in the Healthcare Domain Henry Addico As XML data is being widely adopted as

Data Quality of Native XML Databases in the Healthcare Domain · 35

Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H.2005. Bidirectional Expansion For Keyword Search On Graph Databases. In VLDB ’05: Pro-ceedings Of The 31st International Conference On Very Large Data Bases. VLDB Endowment,Secaucus, NJ, USA, 505–516.

Kader, Y. A. 2003. An Enhanced Data Model and Query Algebra for Partially Structured XMLDatabase. Tech. Rep. CS-03-08, Department of Computer Science, University of Sheffield.

Kalra, D. and Ingram, D. 2006. Information Technology Solutions for Healthcare. Springer-Verlag London Ltd, Chapter Electronic health records, 5–102.

Kha, D. D., Yoshikawa, M., and Uemura, S. 2001. An XML Indexing Structure with RelativeRegion Coordinate. In ICDE ’01: Proceedings of the 17th International Conference on DataEngineering. IEEE Computer Society, Washington, DC, USA, 313.

Kim, S. W., Shin, P. S., Kim, Y. H., Lee, J., and Lim, H. C. 2002. A Data Model and Algebrafor Document-Centric XML Document. In ICOIN ’02: Revised Papers from the InternationalConference on Information Networking, Wireless Communications Technologies and NetworkApplications-Part II. Springer-Verlag, London, UK, 714–723.

Konopnicki, D. and Shmueli, O. 2005. Database-inspired Search. In VLDB ’05: Proceedings OfThe 31st International Conference On Very Large Data Bases. VLDB Endowment, Secaucus,NJ,USA, 2–12.

Lechtenborger, J. and Vossen, G. 2003. Multidimensional Normal Forms for Data WarehouseDesign. Inf. Syst. 28, 5, 415–434.

Lederman, R. 2005. Managing Hospital Databases: Can Large Hospitals Really Protect PatientData. Health Infromatics Journal 13, 3, 201–210.

Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. 2002. AIMQ: A Methodology forInformation Quality Assessment. Inf. Manage. 40, 2, 133–146.

Leonidas Orfanidis, Panagiotis D. Bamidis, B. E. 2004. Data Quality Issues In ElectronicHealth Records: An Adaptation Framework For The Greek Health System. Health InformaticsJournal 10, 1, 23–36.

Li, Q. and Moon, B. 2001. Indexing and Querying XML Data for Regular Path Expressions.In VLDB ’01: Proceedings of the 27th International Conference on Very Large Data Bases.Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 361–370.

MacKenzie, D. 1998. New Language Could Meld the Web Into a Seamless Database. Sci-ence 280, 5371, 1840–1841.

Mandl, K. and Porter, S. 1999. Data Quality and the Electronic Medical Record: A Rolefor Direct Parental Data Entry. In American Medical Informatics Association Three YearCumulative Symposium Proceedings.

Mazeika, A. and Bohlen, M. H. 2006. Cleansing Databases of Misspelled Proper Nouns. InCleanDB.

Midgley, A. K. 2005. ”Choose and book” Does not Solve any Problems. BMJ 331, 7511, 294–b–.

Milano, D., Scannapieco, M., and Catarci, T. 2005. Using Ontologies for XML Data Cleaning.In OTM Workshop on Inter-organizational Systems and Interoperability of Enterprise Softwareand Applications (MIOS+INTEROP).

Moogan, P. 2006. The Clinical development of the NHS care record service.http://www.connectingforhealth.nhs.uk/crbd/docs/, Connecting for health. accessed on05/2006.

Moro, M. M., Vagena, Z., and Tsotras, V. J. 2005. Tree-pattern Queries on a LightweightXML Processor. In VLDB ’05: Proceedings of the 31st international conference on Very largedata bases. VLDB Endowment, Secaucus, NJ, USA, 205–216.

Motro, A. and Rakov, I. 1997. Not All Answers Are Equally Good: Estimating the Quality ofDatabase Answers. 1–21.

N3. 2006. DeliveryUpdate. N3 bulletin News 17, 1.

Naumann, F. and Rolker, C. 2000. Assessment Methods for Information Quality Criteria. InIQ. 148–162.

NCFH. 2005. Communications Toolkit 4 -PACS. Tech. Rep. 3234, NHS connecting for health,http://www.connectingforhealth.nhs.uk/publications/toolkitaugust05/. August.

End of Year Review, October 2006.

Page 36: Data Quality of Native XML Databases in the Healthcare Domain · Data Quality of Native XML Databases in the Healthcare Domain Henry Addico As XML data is being widely adopted as

36 · Henry Addico

NCFH. 2006a. Electronic Prescription Service passes one million mark. Tech. rep., NHS Connect-ing for Health. May. accessed May 2006.

NCFH. 2006b. Strategic Health Authority Communications Toolkit - ElectronicPrescription Service (EPS). Tech. Rep. 2091, NHS connecting for health,http://www.connectingforhealth.nhs.uk/publications/toolkitaugust05/. January.

Oliveira, P., Rodrigues, F., and Henriques, P. 2005. A Formal Definition of Data QualityProblems. In IQ. MIT.

Olson, M. 2000. 4Suite an open-source platform for XML and RDF processing. 4suite.org accessedjuly 2006, http://4suite.org/index.xhtml.

Parssian, A., Sarkar, S., and Jacob, V. S. 2004. Assessing Data Quality for InformationProducts: Impact of Selection, Projection, and Cartesian Product. Management Science 50, 7,967–982.

Parssian, A. and Yeoh, W. 2006. QSQL: An Extension to SQL for Queries on InformationQuality. In 1st Australasian Workshop on Information Quality (AusIQ 2006). University ofSouth Australia, Adelaide, Australia.

Pedersen, T. B. and Jensen, C. S. 1998. Research Issues in Clinical Data Warehousing. InIn Proceedings of the Tenth International Conference on Statistical and Scientific DatabaseManagement. IEEE Computer Society, 43–52.

Pehcevski, J., Thom, J. A., and Vercoustre, A.-M. 2005. Hybrid XML Retrieval: CombiningInformation Retrieval and a Native XML Database. Inf. Retr. 8, 4, 571–600.

Pipino, L. L., Lee, Y. W., and Wang, R. Y. 2002. Data quality assessment. Commun.ACM 45, 4, 211–218.

Pohl, J. 2000. Transition from Data to Information. Tech. rep., Collaborative Agent DesignResearch Center. November.

Price, R. J. and Shanks, G. 2004. A Semiotic Information Quality Framework. In IFIP WG8.3International Conference on Decision Support Systems (DSS2004). IFIP WG8.3 InternationalConference on Decision Support Systems (DSS2004), 658–672.

Raoul, K., Euloge, T., and Roland, M. 2005. Designing and Implementing an Electronic HealthRecord System in Primary Care Practice in Sub-Saharan Africa: A Case Study from Cameroon.Informatics in Primary Care, Volume 13, Number 3, November 2005, pp. 179-186(8) 13, 3(November), 179–186(8).

Rector, A., Rogers, J., Taweel, A., Ingram, D., Kalra, D., Milan, J., Singleton, P.,Gaizauskas, R., Hepple, M., Scott, D., and Power, R. 2003. CLEF Joining up Health-care with Clinical and Post-Genomic Research. Clef industrial forum, CLEF, Sheffield.

Riain, C. O. and Helfert, M. 2005. An Evaluation of Data Quality Related Problem Patternsin Healthcare Information Systems. In IADIS Virtual Multi Conference on Computer Scienceand Information Systems. Vol. single. 189,193.

Richard, W., Strong, D., and Guarascio, L. 1994. An Empirical Investigation of Data QualityDimensions: A Data Consumer’s Perspective. Tech. rep., MIT TDQM Research Program, 50Memorial Drive, Cambridge, Ma. 02139.

Rys, M., Chamberlin, D., and Florescu, D. 2005. XML and Relational Database Manage-ment Systems: The Inside Story. In SIGMOD ’05: Proceedings of the 2005 ACM SIGMODInternational Conference on Management of Data. ACM Press, New York, NY, USA, 945–947.

Sacks-Davis, R., Dao, T., Thom, J. A., and Zobel, J. 1997. Indexing Documents for Querieson Structure, Content and Attributes. In Proceedings of International Symposium on DigitalMedia Information Base (DMIB). 236–245.

Sattler, K.-U., Geist, I., and Schallehn, E. 2005. Concept-based querying in mediator systems.The VLDB Journal 14, 1, 97–111.

Scannapieco, M. and Batini, C. 2004. Completeness in the Relational Model. A ComprehensiveFramework. In 9th International Conference on Information Quality.

Scannapieco, M., Missier, P., and Batini, C. 2005. Data Quality at a Glance. Datenbank-Spektrum 14, 1–23.

Schweiger, R., Hoelzer, S., Altmann, U., Rieger, J., and Dudeck, J. 2002. Plug-and-PlayXML:. Journal of the American Medical Informatics Association 9, 1, 37–48.

End of Year Review, October 2006.

Page 37: Data Quality of Native XML Databases in the Healthcare Domain · Data Quality of Native XML Databases in the Healthcare Domain Henry Addico As XML data is being widely adopted as

Data Quality of Native XML Databases in the Healthcare Domain · 37

Shankaranarayan, G., Ziad, M., and Wang, R. Y. 2003. Managing Data Quality in DynamicDecision Environments: An Information Product Approach. Journal of Database Manage-ment 14, 4, 14–32.

Shanmugasundaram, J., Tufte, K., Zhang, C., He, G., Dewitt, D. J., and Naughton, J. F.1999. Relational Databases For Querying XML Documents: Limitations And Opportunities.In VLDB. Morgan Kaufmann, 302–314.

Sherman, G. 2001. Toward electronic health records. Office of Health and Information HighwayHealth Canada. http://www.hc-sc.gc.ca,2005 .

Shui, W., Lam, F., Fisher, D. K., and Wong, R. K. 2005. Querying and Maintaining Or-dered XML Data using Relational Databases. In Sixteenth Australasian Database Conference(ADC2005), H. E. Williams and G. Dobbie, Eds. CRPIT, vol. 39. ACS, Newcastle, Australia,85–94.

Smith, R. 1996. What Clinical Information do Doctors Need? British Medical Journal 313, 7064(October), 1062–8.

Stephens, R. T. 2003. Metadata and XML: Will History Repeat Itself? Columns7553, RTodd.com, http://www.dmreview.com/article sub.cfm?articleId=7553. October. vis-ited:20/08/2006.

Strong, D. M., Lee, Y. W., and Wang, R. Y. 1997. Data Quality In Context. Commun.ACM 40, 5, 103–110.

Stvilia, B., Gasser, L., Twidale, M. B., and Smith, L. C. A Framework for InformationQuality Assessment. http://www.isrl.uiuc.edu/ ∼gasser/papers/stvilia IQFramework.pdf.

Sugden, B., Wilson, R., and Cornford, J. 2006. Re-configuring the health supplier market:Changing relationships in the primary care supplier market in England. Technical Report SeriesCS-TR-951, Newcastle upon Tyne: University of Newcastle upon Tyne:Computing Science.

Vagena, Z., Moro, M. M., and Tsotras, V. J. 2004. Efficient Processing of XML ContainmentQueries Using Partition-Based Schemes. In IDEAS. 161–170.

Walsh, S. H. 2004. The Clinician’s Perspective on Electronic Health Records and How They CanAffect Patient Care. BMJ 328, 7449, 1184–1187.

Wand, Y. and Wang, R. Y. 1996. Anchoring Data Quality Dimensions in Ontological Founda-tions. Communications ACM 39, 11, 86–95.

Wang, Y. R. and Madnick, S. E. 1989. The Inter-Database Instance Identification Problemin Integrating Autonomous Systems. In Proceedings of the Fifth International Conferenceon Data Engineering, February 6-10, 1989, Los Angeles, California, USA. IEEE ComputerSociety, 46–55.

Widom, J. 1999. Data Management For XML Research Directions. IEEE Data EngineeringBulletin, Special Issue On XML 22, 3 (September), 44–52.

Widom, J. 2005. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. InConference on Innovative Data Systems Research. 262–276.

Zue, V. 1999. Talking with your computer. Scientific American 40.

End of Year Review, October 2006.