
PhyloInformatics 7: 1-66 - 2005

Relational Database Design and Implementation for Biodiversity Informatics

Paul J. Morris

The Academy of Natural Sciences

1900 Ben Franklin Parkway, Philadelphia, PA 19103 USA

Received: 28 October 2004 - Accepted: 12 January 2005

Abstract

The complexity of natural history collection information and similar information within the scope of biodiversity informatics poses significant challenges for effective long term stewardship of that information in electronic form. This paper discusses the principles of good relational database design, how to apply those principles in the practical implementation of databases, and examines how good database design is essential for long term stewardship of biodiversity information. Good design and implementation principles are illustrated with examples from the realm of biodiversity information, including an examination of the costs and benefits of different ways of storing hierarchical information in relational databases. This paper also discusses typical problems present in legacy data, how they are characteristic of efforts to handle complex information in simple databases, and methods for handling those data during data migration.

Introduction

The data associated with natural history collection materials are inherently complex. Management of these data in paper form has produced a variety of documents such as catalogs, specimen labels, accession books, stations books, map files, field note files, and card indices. The simple appearance of the data found in any one of these documents (such as the columns for identification, collection locality, date collected, and donor in a handwritten catalog ledger book) masks the inherent complexity of the information. The appearance of simplicity overlying highly complex information provides significant challenges for the management of natural history collection information (and other systematic and biodiversity information) in electronic form. These challenges include management of legacy data produced during the history of capture of natural history collection information into database management systems of increasing sophistication and complexity.

In this document, I discuss some of the issues involved in handling complex biodiversity information, approaches to the stewardship of such information in electronic form, and some of the tradeoffs between different approaches. I focus on the very well understood concepts of relational database design and implementation. Relational[1] databases have a strong (mathematical) theoretical foundation (Codd, 1970; Chen, 1976), and a wide range of database software products is available for implementing relational databases.

[1] Object theory offers the possibility of handling much of the complexity of biodiversity information in object oriented databases in a much more effective manner than in relational databases, but object oriented and object-relational database software is much less mature and much less standard than relational database software. Data stored in a relational DBMS are currently much less likely to become trapped in a dead end with no possibility of support than data in an object oriented DBMS.

Figure 1. Typical paths followed by biodiversity information. The cylinder represents storage of information in electronic form in a database.

The effective management of biodiversity information involves many competing priorities (Figure 1). The most important priorities include long term data stewardship, efficient data capture (e.g. Beccaloni et al., 2003), creating high quality information, and effective use of limited resources. Biodiversity information storage systems are usually created and maintained in a setting of limited resources. The most appropriate design for a database to support long term stewardship of biodiversity information may not be a complex highly normalized database well fitted to the complexity of the information, but rather may be a simpler design that focuses on the most important information. This is not to say that database design is not important. Good database design is vitally important for stewardship of biodiversity information. In the context of limited resources, good design includes a careful focus on what information is most important, allowing programming and database administration to best support that information.

Database Life Cycle

As natural history collections data have been captured from paper sources (such as century old handwritten ledgers) and have accumulated in electronic databases, the natural history museum community has observed that electronic data need much more upkeep than paper records (e.g. National Research Council, 2002 p.62-63). Every few years we find that we need to move our electronic data to some new database system. These migrations are usually driven by changes imposed upon us by the rapidly changing landscape of operating systems and software. Maintaining a long obsolete computer running a long unsupported operating system as the only means we have to work with data that reside in a long unsupported database program with a custom front end written in a language that nobody writes code for anymore is not a desirable situation. Rewriting an entire collections database system from scratch every few years is also not a desirable situation. The computer science folks who think about databases have developed a conceptual approach to avoiding getting stuck in such unpleasant situations – the database life cycle (Elmasri and Navathe, 1994). The database life cycle recognizes that database management systems change over time and that accumulated data and user interfaces for accessing those data need to be migrated into new systems over time. Inherent in the database life cycle is the insight that steps taken in the process of developing a database substantially impact the ease of future migrations.

A textbook list (e.g. Connoly et al., 1996) of stages in the database life cycle runs something like this: plan, design, implement, load legacy data, test, operational maintenance, repeat. In slightly more detail, these steps are:

1. Plan (planning, analysis, requirements collection).

2. Design (conceptual database design, leading to information model; physical database design [including system architecture]; user interface design).

3. Implement (database implementation, user interface implementation).

4. Load legacy data (clean legacy data, transform legacy data, load legacy data).

5. Test (test implementation).

6. Put the database into production use and perform operational maintenance.

7. Repeat this cycle (probably every ten years or so).

Being a visual animal, I have drawn a diagram to represent the database life cycle (Figure 2). Our expectation of databases should not be that we capture a large quantity of data and are done, but rather that we will need to cycle those data through the stages of the database life cycle many times.

In this paper, I will focus on a few parts of the database life cycle: the conceptual and logical design of a database, physical design, implementation of the database design, implementation of the user interface for the database, and some issues for the migration of data from an existing legacy database to a new design. I will provide examples from the context of natural history collections information. Plan ahead. Good design involves not just solving the task at hand, but planning for long term stewardship of your data.

Levels and architecture

A requirements analysis for a database system often considers the network architecture of the system. The difference between software that runs on a single workstation and software that runs on a server and is accessed by clients across a network is a familiar concept to most users of collections information. In some cases, a database for a collection running on a single workstation accessed by a single user provides a perfectly adequate solution for the needs of a collection, provided that the workstation is treated as a server with an uninterruptible power supply, backup devices, and other means to maintain the integrity of the database. Any computer running a database should be treated as a server, with all the supporting infrastructure not needed for the average workstation. In other cases, multiple users are capturing and retrieving data at once (either locally or globally), and a database system capable of running on a server and being accessed by multiple clients over a network is necessary to support the needs of a collection or project.

Figure 2. The Database Life Cycle

It is, however, more helpful for an understanding of database design to think about the software architecture; that is, to think of the functional layers involved in a database system. At the bottom level is the DBMS (database management system [see glossary, p.64]), the software that runs the database and stores the data (layered below this is the operating system and its filesystem, but we can ignore these for now). Layered above the DBMS is your actual database table or schema layer. Above this may be various code and network transport layers, and finally, at the top, the user interface through which people enter and retrieve data (Figure 29). Some database software packages allow easy separation of these layers; others are monolithic, combining database, code, and front end in a single file. A database system that can be separated into layers can have advantages, such as multiple user interfaces in multiple languages over a single data source. Even for monolithic database systems, however, it is helpful to think conceptually of the table structures you will use to store the data, the code that you will use to help maintain the integrity of the data (or to enforce business rules), and the user interface as distinct components, distinct components that have their own places in the design and implementation phases of the database life cycle.

Relational Database Design

Why spend time on design? The answer is simple:

Poor Design + Time = Garbage

As more and more data are entered into a poorly designed database over time, and as existing data are edited, more and more errors and inconsistencies will accumulate in the database. This may result in both entirely false and misleading data accumulating in the database, or it may result in the accumulation of vast numbers of inconsistencies that will need to be cleaned up before the data can be usefully migrated into another database or linked to other datasets. A single extremely careful user working with a dataset for just a few years may be capable of maintaining clean data, but as soon as multiple users or more than a couple of years are involved, errors and inconsistencies will begin to creep into a poorly designed database.

Thinking about database design is useful for both building better database systems and for understanding some of the problems that exist in legacy data, especially those entered into older database systems. Museum databases that began development in the 1970s and early 1980s prior to the proliferation of effective software for building relational databases were often written with single table (flat file) designs. These legacy databases retain artifacts of several characteristic field structures that were the result of careful design efforts to both reduce the storage space needed by the database and to handle one to many relationships between collection objects and concepts such as identifications.

Information modeling

The heart of conceptual database design is information modeling. Information modeling has its basis in set algebra, and can be approached in an extremely complex and mathematical fashion. Underlying this complexity, however, are two core concepts: atomization and reduction of redundant information. Atomization means placing only one instance of a single concept in a single field in the database. Reduction of redundant information means organizing a database so that a single text string representing a single piece of information (such as the place name Democratic Republic of the Congo) occurs in only a single row of the database. This one row is then related to other information (such as localities within the DRC) rather than each row containing a redundant copy of the country name.

As information modeling has a firm basis in set theory and a rich technical literature, it is usually introduced using technical terms. This technical vocabulary includes terms that describe how well a database design applies the core concepts of atomization and reduction of redundant information (first normal form, second normal form, third normal form, etc.). I agree with Hernandez (2003) that this vocabulary does not make the best introduction to information modeling[2] and, for the beginner, masks the important underlying concepts. I will thus describe some of this vocabulary only after examining the underlying principles.

[2] I do, however, disagree with Hernandez' entirely free form approach to database design.

Atomization

1) Place only one concept in each field.

Legacy data often contain a single field for taxon name, sometimes with the author and year also included in this field. Consider the taxon name Palaeozygopleura hamiltoniae (HALL, 1868). If this name is placed as a string in a single field "Palaeozygopleura hamiltoniae (Hall, 1868)", it becomes extremely difficult to pull the components of the name apart to, say, display the species name in italics and the author in small caps in an html document: <em>Palaeozygopleura hamiltoniae</em> (H<font size=-2>ALL</font>, 1868), or to associate them with the appropriate tags in an XML document. It likewise is much harder to match the search criteria Genus=Loxonema and Trivial=hamiltoniae to this string than if the components of the name are separated into different fields. A taxon name table containing fields for Generic name, Subgeneric name, Trivial Epithet, Authorship, Publication year, and Parentheses is capable of handling most identifications better than a single text field. However, there are lots more complexities – subspecies, varieties, forms, cf., near, questionable generic placements, questionable identifications, hybrids, and so forth, each of which may need its own field to effectively handle the wide range of different variations of taxon names that can be used as identifications of collection objects. If a primary purpose of the data set is nomenclatural, then substantial thought needs to be placed into this complexity. If the primary purpose of the data set is to record information associated with collection objects, then recording the name used and indicators of uncertainty of identification are the most important concepts.
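As a rough sketch of this sort of atomization (in SQL; the table and column names are illustrative assumptions, not a schema from this paper), such a taxon name table might hold each component of the name in its own field, from which a formatted string can be reassembled when needed:

    CREATE TABLE taxon_name (
        taxon_name_id    INTEGER PRIMARY KEY,   -- surrogate key
        generic_name     VARCHAR(255) NOT NULL, -- e.g. Palaeozygopleura
        subgeneric_name  VARCHAR(255),
        trivial_epithet  VARCHAR(255),          -- e.g. hamiltoniae
        authorship       VARCHAR(255),          -- e.g. Hall
        publication_year CHAR(4),               -- e.g. 1868
        in_parentheses   BOOLEAN DEFAULT FALSE  -- authorship cited in parentheses
    );

    -- Atomized components can be reassembled for display, for example as HTML:
    SELECT '<em>' || generic_name || ' ' || trivial_epithet || '</em> ('
           || authorship || ', ' || publication_year || ')'
    FROM taxon_name;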

2) Avoid lists of items in a field.

Legacy data often contain lists of items in a single field. For example, a remarks field may contain multiple remarks made at different times by different people, or a geographic distribution field may contain a list of geographic place names. For example, a geographic distribution field might contain the list of values "New York; New Jersey; Virginia; North Carolina". If only one person has maintained the data set for only a few years, and they have been very careful, the delimiter ";" will separate all instances of geographic regions in each string. However, you are quite likely to find that variant delimiters such as ",", " ", ":", "'", or "|" have crept into the data.

Lists of data in a single field are a common legacy solution to the basic information modeling concept that one instance of one sort of data (say a species name) can be related to many other instances of another sort of data. A species can be distributed in many geographic regions, or a collection object can have many identifications, or a locality can have many collections made from it. If the system you have for storing data is restricted to a single table (as in many early database systems used in the Natural History Museum community), then you have two options for capturing such information. You can repeat fields in the table (a field for current identification and another field for previous identification), or you can list repeated values in a single field (hopefully separated by a consistent delimiter).
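In a system not restricted to a single table, the same one to many relationship can instead be captured as one row per value in a related table. A minimal sketch (table and column names are assumptions for illustration):

    CREATE TABLE species (
        species_id INTEGER PRIMARY KEY,
        name       VARCHAR(255)
    );

    CREATE TABLE species_distribution (
        species_id INTEGER NOT NULL REFERENCES species (species_id),
        region     VARCHAR(255) NOT NULL
    );

    -- The packed value "New York; New Jersey" becomes two rows:
    INSERT INTO species_distribution (species_id, region) VALUES (1, 'New York');
    INSERT INTO species_distribution (species_id, region) VALUES (1, 'New Jersey');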

Reducing Redundant Information

The most serious enemy of clean data in long-lived database systems is redundant copies of information. Consider a locality table containing fields for country, primary division (province/state), secondary division (county/parish), and named place (municipality/city). The table will contain multiple rows with the same value for each of these fields, since multiple localities can occur in the vicinity of one named place. The problem is that multiple different text strings represent the same concept and different strings may be entered in different rows to record the same information. For example, Philadelphia, Phil., City of Philadelphia, Philladelphia, and Philly are all variations on the name of a particular named place. Each makes sense when written on a specimen label in the context of other information (such as country and state), as when viewed as a single locality record. However, finding all the specimens that come from this place in a database that contains all of these variations is not an easy task. The Academy ichthyology collection uses a legacy Muse database with this structure (a single table for locality information), and it contains some 16 different forms of "Philadelphia, PA, USA" stored in atomized named place, state, and country fields. It is not a trivial task to search this database on locality information and be sure you have located all relevant records. Likewise, migration of these data into a more normalized database requires extensive cleanup of the data and is not simply a matter of moving the data into new tables and fields.

The core problem is that simple flat tables can easily have more than one row containing the same value. The goal of normalization is to design tables that enable users to link to an existing row rather than to enter a new row containing a duplicate of information already in the database.

Figure 3. Design of a flat locality table (top) with fields for country and primary division compared with a pair of related tables that are able to link multiple states to one country without creating redundant entries for the name of that country. The notation and concepts involved in these Entity-Relationship diagrams are explained below.

Contemplate two designs (Figure 3) for holding a country and a primary division (a state, province, or other immediate subdivision of a country): one holding country and primary division fields (with redundant information in a single locality table), the other normalizing them into country and primary division tables and creating a relationship between countries and states.

Rows in the single flat table, given time, will accumulate discrepancies between the name of a country used in one row and a different text string used to represent the same country in other rows. The problem arises from the redundant entry of the Country name when users are unaware of existing values when they enter data and are freely able to enter any text string in the relevant field. Data in a flat file locality table might look something like those in Table 1:

Table 1. A flat locality table.

Locality id | Country       | Primary Division
300         | USA           | Montana
301         | USA           | Pennsylvania
302         | USA           | New York
303         | United States | Massachusetts

Examination of the values in individual rows, such as "USA, Montana", or "United States, Massachusetts" makes sense and is easily intelligible. Trying to ask questions of this table, however, is a problem. How many states are there in the "USA"? The table can't provide a correct answer to this question unless we know that "USA" and "United States" both occur in the table and that they both mean the same thing.

The same information stored cleanly in two related tables might look something like those in Table 2:

Table 2. Separating Table 1 into two related tables, one for country, the other for primary division (state/province/etc.).

Country id | Name
300        | USA
301        | Uganda

Primary Division id | fk_c_country_id | Primary Division
300                 | 300             | Montana
301                 | 300             | Pennsylvania
302                 | 300             | New York
303                 | 300             | Massachusetts

Here there is a table for countries that holds one row for USA, together with a numeric Country_id, which is a behind the scenes database way for us to find the row in the table containing "USA" (a surrogate numeric primary key, of which I will say more later). The database can follow the country_id field over to a primary division table, where it is recorded in the fk_c_country_id field (a foreign key, of which I will also say more later). To find the primary divisions within USA, the database can look at the Country_id for USA (300), and then find all the rows in the primary division table that have a fk_c_country_id of 300. Likewise, the database can follow these keys in the opposite direction, and find the country for Massachusetts by looking up its fk_c_country_id in the country_id field in the country table.
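In SQL, following these keys is a join between the two tables. A sketch using the fields from Table 2 (assuming the tables are named country and primary_division):

    -- Find the primary divisions within USA:
    SELECT pd.primary_division
    FROM country c
    JOIN primary_division pd ON pd.fk_c_country_id = c.country_id
    WHERE c.name = 'USA';

    -- Follow the keys in the opposite direction to find the country
    -- for Massachusetts:
    SELECT c.name
    FROM primary_division pd
    JOIN country c ON c.country_id = pd.fk_c_country_id
    WHERE pd.primary_division = 'Massachusetts';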

Moving country out to a separate table also allows storage of just one copy of other pieces of information associated with a country (its northernmost and southernmost bounds or its start and end dates, for example). Countries have attributes (names, dates, geographic areas, etc.) that shouldn't need to be repeated each time a country is mentioned. This is a central idea in relational database design – avoid repeating the same information in more than one row of a table.

It is possible to code a variety of user interfaces over either of these designs, including, for example, one with a picklist for country and a text box for state (as in Figure 4). Over either design it is possible to enforce, in the user interface, a rule that data entry personnel may only pick an existing country from the list. It is possible to use code in the user interface to enforce a rule that prevents users from entering Pennsylvania as a state in the USA and then separately entering Pennsylvania as a state in the United States. Likewise, with either design it is possible to code a user interface to enforce other rules such as constraining primary divisions to those known to be subdivisions of the selected country (so that Pennsylvania is not recorded as a subdivision of Albania).

By designing the database with two related tables, it is possible to enforce these rules at the database level. Normal data entry personnel may be granted (at the database level) rights to select information from the country table, but not to change it. Higher level curatorial personnel may be granted rights to alter the list of countries in the country table. By separating out the country into a separate table and restricting access rights to that table in the database, the structure of the database can be used to turn the country table into an authority file and enforce a controlled vocabulary for entry of country names. Regardless of the user interface, normal data entry personnel may only link Pennsylvania as a state in USA. Note that there is nothing inherent in the normalized country/primary division tables themselves that prevents users who are able to edit the controlled vocabulary in the Country Table from entering redundant rows such as those below in Table 3. Fundamentally, the users of a database are responsible for the quality of the data in that database. Good design can only assist them in maintaining data quality. Good design alone cannot ensure data quality.
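A sketch of such database-level rights in standard SQL GRANT syntax (the role names data_entry and curatorial are assumptions for illustration):

    -- Data entry staff may read, but not change, the controlled vocabulary:
    GRANT SELECT ON country TO data_entry;

    -- Curatorial staff maintain the authority file:
    GRANT SELECT, INSERT, UPDATE ON country TO curatorial;

    -- Data entry staff can still create and edit primary division records:
    GRANT SELECT, INSERT, UPDATE ON primary_division TO data_entry;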

It is possible to enforce the rules above at the user interface level in a flat file. This enforcement could use existing values in the country field to populate a pick list of country names from which the normal data entry user may only select a value and may not enter new values. Since this rule is only enforced by the programming in the user interface, it could be circumvented by users. More importantly, such a business rule embedded in the user interface alone can easily be forgotten and omitted when data are migrated from one database system to another.


Figure 4. Example data entry form using a picklist control together with a text entry control.

Table 3. Country and primary division tables showing a pair of redundant Country values.

Country id | Name
500        | USA
501        | United States

Primary Division id | fk_c_country_id | Primary Division
300                 | 500             | Montana
301                 | 500             | Pennsylvania
302                 | 500             | New York
303                 | 501             | Massachusetts


Normalized tables allow you to more easily embed rules in the database (such as restricting access to the country table to highly competent users with a large stake in the quality of the data) that make it harder for users to degrade the quality of the data over time. While poor design ensures low quality data, good design alone does not ensure high quality data.

Good design thus involves careful consideration of conceptual and logical design, physical implementation of that conceptual design in a database, and good user interface design, with all else following from good conceptual design.

Entity-Relationship modeling

Understanding the concepts to be stored in the database is at the heart of good database design (Teorey, 1994; Elmasri and Navathe, 1994). The conceptual design phase of the database life cycle should produce a result known as an information model (Bruce, 1992). An information model consists of written documentation of concepts to be stored in the database, their relationships to each other, and a diagram showing those concepts and their relationships (an Entity-Relationship or E-R diagram). A number of information models for the biodiversity informatics community exist (e.g. Blum, 1996a; 1996b; Berendsohn et al., 1999; Morris, 2000; Pyle 2004); most are derived at least in part from the concepts in the ASC model (ASC, 1992). Information models define entities, list attributes for those entities, and relate entities to each other. Entities and attributes can be loosely thought of as tables and fields. Figure 5 is a diagram of a locality entity with attributes for a mysterious localityid, and attributes for country and primary division. As in the example above, this entity can be implemented as a table with localityid, country, and primary division fields (Table 4).

Entity-relationship diagrams come in a variety of flavors (e.g. Teorey, 1994). The Chen (1976) format for drawing E-R diagrams uses little rectangles for entities and hangs oval balloons off of them for attributes. This format (as in the distribution region entity shown on the right in Figure 6 below) is very useful for scribbling out drafts of E-R diagrams on paper or blackboard. Most CASE (Computer Aided Software Engineering) tools for working with databases, however, use variants of the IDEF1X format, as in the locality entity above (produced with the open source tool Druid [Carboni et al., 2004]) and the collection object entity on the left in Figure 6 (produced with the proprietary tool xCase [Resolution Ltd., 1998]), or the relationship diagram tool in MS Access. Variants of the IDEF1X format (see Bruce, 1992) draw entities as rectangles and list attributes for the entity within the rectangle.

Not all attributes are created equal. The diagrams in Figures 5 and 6 list attributes that have "ID" appended to the end of their names (localityid, countryid, collection_objectid, intDistributionRegionID). These are primary keys. The form of this notation varies from one E-R diagram format to another, being the letters PK, or an underline, or bold font for the name of the primary key attribute. A primary key can be thought of as a field that contains unique values that let you identify a particular row in a table. A country name field could be the primary key for a country table, or, as in the examples here, a surrogate numeric field could be used as the primary key.

To give one more example of the relationship between entities as abstract concepts in an E-R model and tables in a database, the tblDistributionRegion entity shown in Chen notation in Figure 6 could be implemented as a table, as in Table 5, with a field for its primary key attribute, intDistributionRegionID, and a second field for the region name attribute vchrRegionName. This example is a portion of the structure of the table that holds geographic distribution area names in a BioLink database (additional fields hold the relationship between regions, allowing Pennsylvania to be nested as a geographic region within the United States nested within North America, and so on).


Table 4. Example locality data.

Locality id | Country | Primary Division
300         | USA     | Montana
301         | USA     | Pennsylvania


Table 5. A portion of a BioLink (CSIRO, 2001) tblDistributionRegion table.

intDistributionRegionID | vchrRegionName
15                      | Australia
16                      | Queensland
17                      | Uganda
18                      | Pennsylvania

The key point to think about when designing databases is that things in the real world can be thought of in general terms as entities with attributes, and that information about these concepts can be stored in the tables and fields of a relational database. In a further step, things in the real world can be thought of as objects with properties that can do things (methods), and these concepts can be mapped in an object model (using an object modeling framework such as UML) that can be implemented with an object oriented language such as Java. If you are programming an interface to a relational database in an object oriented language, you will need to think about how the concepts stored in your database relate to the objects manipulated in your code. Entity-Relationship modeling produces the critical documentation needed to understand the concepts that a particular relational database was designed to store.

Primary key

Primary keys are the means by which we locate a single row in a table. The value for a primary key must be unique to each row. The primary key in one row must have a different value from the primary key of every other row in the table. This property of uniqueness is best enforced by the database applying a unique index to the primary key.

A primary key need not be a single attribute. A primary key can be a single attribute containing real data (generic name), a group of several attributes (generic name, trivial epithet, authorship), or a single attribute containing a surrogate key (name_id). In general, I recommend the use of surrogate numeric primary keys for biodiversity informatics information, because we are too seldom able to be certain that other potential primary keys (candidate keys) will actually have unique values in real data.

A surrogate numeric primary key is an attribute that takes as values numbers that have no meaning outside the database. Each row contains a unique number that lets us identify that particular row. A table of species names could have generic epithet and trivial epithet fields that together make a primary key, or a single species_id field could be used as the key to the table, with each row having a different arbitrary number stored in the species_id field. The values for species_id have no meaning outside the database, and indeed should be hidden from the users of the database by the user interface. A typical way of implementing a surrogate key is as a field containing an automatically incrementing integer that takes only unique values, doesn't take null values, and doesn't take blank values. It is also possible to use a character field containing a globally unique identifier or a cryptographic hash that has a high probability of being globally unique as a surrogate key, potentially increasing the ease with which different data sets can be combined.

Figure 5. Part of a flat locality entity. An implementation with example data is shown above in Table 4.

Figure 6. Comparison between entity and attributes as depicted in a typical CASE tool E-R diagram in a variant of the IDEF1X format (left) and in the Chen format (right, which is more useful for pencil and paper modeling). The E-R diagrams found in this paper have variously been drawn with the CASE tools xCase and Druid or the diagram editor DiA.
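A minimal sketch of such an auto-incrementing surrogate key (PostgreSQL-flavored SQL; SERIAL is one common idiom for this, and the table name is an assumption):

    CREATE TABLE species_name (
        species_id      SERIAL PRIMARY KEY,  -- unique, not null, auto-incrementing
        generic_epithet VARCHAR(255) NOT NULL,
        trivial_epithet VARCHAR(255) NOT NULL
    );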

The purpose of a surrogate key is to provide a unique identifier for a row in a table, a unique identifier that has meaning only internally within the database. Exposing a surrogate key to the users of the database may result in their mistakenly assigning a meaning to that key outside of the database. The ANSP malacology and invertebrate paleontology collections were for a while printing a primary key of their master collection object table (a field called serial number) on specimen labels along with the catalog number of the specimen, and some of these serial numbers have been copied by scientists using the collection and have even made it into print under the rational but mistaken belief that they were catalog numbers. For example, Petuch (1989, p.94) cites the number ANSP 1133 for the paratype of Malea springi, which actually has the catalog number ANSP 54004, but has both this catalog number and the serial number 00001133 printed on a computer generated label. Another place where surrogate numeric keys are easily exposed to users and have the potential of taking on a broader meaning is in Internet databases. An Internet request for a record in a database is quite likely to request that record through its primary key. A URL with an http get request that contains the value for a surrogate key directly exposes the surrogate key to the world. For example, the URL http://erato.acnatsci.org/wasp/search.php?species=12563 uses the value of a surrogate key in a manner that users can copy from their web browsers and email to each other, or that can be crawled and stored by search engines, broadening its scope far beyond simply being an arbitrary row identifier within the database.

Surrogate keys come with risks, most notably that, without other rules being enforced, they will allow duplicate rows, identical in all attributes except the surrogate primary key, to enter the table (country 284, USA; country 526, USA). A real attribute used as a primary key will force all rows in the table to contain unique values (USA). Consider catalog numbers. If a table contains information about collection objects within one catalog number series, catalog number would seem a logical choice for a primary key. A single catalog number series should, in theory, contain only one catalog number per collection object. Real collections data, however, do not usually conform to theory. It is not unusual to find that 1% or more of the catalog numbers in an older catalog series are duplicates. That is, real duplicates, where the same catalog number was assigned to two or more different collection objects, not simply transcription errors in data capture. Before the catalog number can be used as the primary key for a table, or a unique index can be applied to a catalog number field, duplicate values need to be identified and resolved. Resolving duplicate catalog numbers is a non-trivial task that involves locating and handling the specimens involved. It is even possible for a collection to contain real immutable duplicate catalog numbers if the same catalog number was assigned to two different type specimens and these duplicate numbers have been published. Real collections data, having accumulated over the last couple hundred years, often contain these sorts of unexpected inconsistencies. It is these sorts of problematic data and the limits on our resources to fully clean data to fit theoretical expectations that make me recommend the use of surrogate keys as primary keys in most tables in collections databases.
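A hedged sketch of the duplicate check this implies, and of the unique index that can only be applied once duplicates are resolved (table and column names assumed):

    -- Find catalog numbers assigned to more than one collection object:
    SELECT catalog_no, COUNT(*) AS occurrences
    FROM collection_object
    GROUP BY catalog_no
    HAVING COUNT(*) > 1;

    -- Only after such duplicates are resolved can a unique index succeed:
    CREATE UNIQUE INDEX idx_catalog_no ON collection_object (catalog_no);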

Taxon names are another case where a surrogate key is important. At first glance, a table holding species names could use the generic name, trivial epithet, and authorship fields as a primary key. The problem is, there are homonyms and other such historical oddities to be found in lists of taxon names. Indeed, as Gary Rosenberg has been saying for some years, you need to know the original genus, species epithet, subspecies epithet, varietal epithet (or trivial epithet and rank of creation), authorship, year of publication, page, plate and figure to uniquely distinguish names of Mollusks (there being homonyms described by the same author in the same publication in different figures).

Normalize appropriately for your problem and resources

When building an information model, it is very easy to get carried away and expand the model to cover in great elaboration each tiny facet of every piece of information that might be related to the concept at hand. In some situations (e.g. the POSC model or the ABCD schema) where the goal is to elaborate all of the details of a complex set of concepts, this is very appropriate. However, when the goal is to produce a functional database constructed by a single individual or a small programming team, the model can easily become so elaborate as to hinder the production of the software needed to reach the desired goal. This is the real art of database design (and object modeling): knowing when to stop. Normalization is very important, but you must remember that the ultimate goal is a usable system for the storage and retrieval of information.

In the database design process, the information model is a tool to help the design and programming team understand the nature of the information to be stored in the database, not an end in itself. Information models assist in communication between the people who are specifying what the database needs to do (people who talk in the language of systematics and collections management) and the programmers and database developers who are building the database (and who speak wholly different languages). Information models are also vital documentation when it comes time to migrate the data and user interface years later in the life cycle of the database.

Example: Identifications of Collection Objects

Consider the issue of handling identifications that have been applied to collection objects. The simplest way of handling this information is to place a single identification field (or set of atomized genus_&_higher, species, authorship, year, and parentheses fields) into a collection object table. This approach can handle only a single identification per collection object, unless each collection object is allowed more than one entry in the collection object table (producing duplicate catalog numbers in the table for each collection object with more than one identification). In many sorts of collections, a collection object tends to accumulate many identifications over time. A structure capable of holding only one identification per collection object poses a problem.

A standard early approach to the problem of more than one identification to a single collection object was a single table with current and previous identification fields. The collection objects table shown in Figure 7 is a fragment of a typical legacy non-normal table containing one field for current identification and one for previous identification. This example also includes a surrogate numeric key and fields to hold one identifier and one date identified.

Figure 7. A non-normal collection object entity.

One table with fields for current and previous identification allows rules that restrict each collection object to one record in the collection object table (such as a unique index on catalog number), but only allows for two identifications per collection object. In some collections this is not a huge problem, whereas in others this structure would force a significant information loss[3]. A tray of fossils or a herbarium sheet may each contain a long history of annotations and changes in identification produced by different people at different times. The table with one set of fields for current identification, another for previous identification, and one field each for identifier and date identified suffers another problem – there is no necessary link between the identifications, the identifier, and the date identified. The database is agnostic as to whether the identifier was the person who made the current identification, the previous identification, or some other identification. It is also agnostic as to whether the date identified is connected to the identifier. Without carefully enforced rules in the user interface, the date identified could reflect the date of some random previous identification, the identifier could be the person who made the current identification, and the previous identification could be the oldest identification of the collection object, or these fields could hold some other arbitrary combination of information, with no way for the user to tell. We clearly need a better structure.

[3] I chose such a flat structure, with 6 fields for current identification and 6 fields for original identification, for a database for data capture on the entomology collections at ANSP. It allowed construction of a more efficient data entry interface than a better normalized structure. Insect type specimens seem to very seldom have the complex identification histories typical of other sorts of collections.

Figure 8. Moving identifications to a related entity.

We can allow multiple identifications for each collection object by adding a second table to hold identifications and linking that table to the collection object table (Figure 8). These two tables for collection object and identification can hold multiple identifications for each collection object if we include a field in the identification table that contains values from the primary key of the collection object table. This foreign key is used to link collection object records with identification records (shown by the "Crow's Foot" symbol in the figure). One naming convention for foreign keys uses the name of the primary key that is being referenced (collection_object_id) and prefixes it with c_ (for copy, thus c_collection_object_id for the foreign key). If, as in Figure 8, the identification table holds a foreign key pointing to collection objects, and a set of fields to hold a taxon name, then each collection object can have many identifications.

This pair of tables (Collection objects and Identifications, Figure 8) still has lots of problems. We don't have any way of knowing which identification is the most recent one. In addition, the taxon name fields will contain multiple duplicate values, so, for example, correcting a misspelling in a taxon name will require updating every row in the identification table holding that taxon name. Conceptually, each collection object can have multiple identifications, but each taxon name used in an identification can be applied to many collection objects. What we really want is a many to many relationship between taxon names and collection objects (Figure 9). Relational databases can not handle many to many relationships directly, but they can by interpolating a table into the middle of the relationship – an associative entity. The concepts collection object – identification – taxon name are a good example of an associative entity (identification) breaking up a many to many relationship (between collection objects and taxon names). Each collection object can have many taxon names applied to it, each taxon name can be applied to many collection objects, and these applications of taxon names to collection objects occur through an identification.
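A minimal sketch of this associative entity in SQL (reusing the taxon_name table sketched earlier; all table and column names are illustrative assumptions, following the c_ foreign key naming convention described above):

    CREATE TABLE collection_object (
        collection_object_id INTEGER PRIMARY KEY,
        catalog_no           VARCHAR(50)
    );

    -- The associative entity: each row applies one taxon name to one
    -- collection object, breaking up the many to many relationship.
    CREATE TABLE identification (
        identification_id      INTEGER PRIMARY KEY,
        c_collection_object_id INTEGER NOT NULL
            REFERENCES collection_object (collection_object_id),
        c_taxon_name_id        INTEGER NOT NULL
            REFERENCES taxon_name (taxon_name_id),
        date_identified        DATE  -- may be unknown for legacy data
    );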

In Figure 9, the identification entity is an associative entity that breaks up the many to many relationship between species names and collection objects. The identification entity contains foreign keys pointing to both the collection object and species name entities. Each collection object can have many identifications, each identification involves one and only one species name. Each species name can be used in many identifications, and each identification applies to one and only one collection object.

This set of entities (taxon name, identification [the associative entity], and collection object) also allows us to easily track the most recent identification by adding a date identified field to the identification table. In many cases with legacy data, it may not be possible to determine the date on which an identification was made, so adding a field to flag the current identification out of a set of identifications for a specimen may be necessary as well. Note that adding a flag to track the current identification requires business rules that will need to be implemented in the code associated with the database. These business rules may specify that only one identification for a single collection object is allowed to be the current identification, and that the identification flagged as the current identification must have either no date or must have the most recent date for any identification of that collection object. An alternative, suggested by an anonymous reviewer, is to include a link to the sole current identification in the collection object table. (That is, to include a foreign key fk_current_identification_id in collection_objects, which is thus able to link a collection object to one and only one current identification. This is a very appropriate structure, and lets business rules focus on making sure that this current identification is indeed the current identification.)
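A sketch of the reviewer's alternative (the column name fk_current_identification_id comes from the text; the ALTER TABLE form and the type are assumptions, building on the tables sketched above):

    -- Link each collection object to its sole current identification:
    ALTER TABLE collection_object
        ADD COLUMN fk_current_identification_id INTEGER
            REFERENCES identification (identification_id);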

This identification associative entity sitting between taxon names and collection objects contains an attribute to hold the name of the person who made the identification. This field will contain many duplicate values as some people make many identifications within a collection. The proper way to bring this concept to third normal form is to move identifiers off to a generalized person table, and to make the identification entity a ternary associative entity linking species names, collection objects, and identifiers (Figure 10). People may play multiple roles in the database (and may be a subtype of a generalized agent entity), so a convention for indicating the role of the person in the identification is to add the role name to the end of the foreign key. Thus, the foreign key linking people to identifications could be called c_person_id_identifier. In another entity, say handling the concept of preparations, a foreign key linking to the people entity might be called c_person_id_preparator.

Figure 9. Using an associative entity (identifications) to link taxon names to collection objects, splitting the many to many relationship between collection objects and identifications.

The set of concepts taxon name, identification (as three way associative entity), identifier, and collection object describes a way of handling the identifications of collection objects in third normal form. Person names, collection objects, and taxon names are all capable of being stored without redundant repetition of information. Placing identifiers in a separate People entity, however, requires further thought in the context of natural history collections. Legacy data will contain multiple similar entries (G. Rosenberg; Rosenberg, G.; G Rosenberg; Rosenberg; G.D. Rosenberg), all of which may or may not refer to the same person. Combining all of these legacy entries into a normalized person table risks introducing errors of interpretation into the data. In addition, adding a generic people table and linking it to identifiers adds additional complexity and coding overhead to the database. People is one area of the database where you need to think very carefully about the costs and benefits of a highly normalized design (Figure 11). Cleaning legacy data, the additional interface complexity, and the additional code required to implement a generic person as an identifier, along with the risk of propagation of incorrect inferences, may well outweigh the benefits of being able to handle identifiers in a generic people entity. Good, well normalized design is critical to be able to properly handle the existence of multiple identifications for a collection object, but normalizing the names of identifiers may lie outside the scope of the critical core information that a natural history museum has the resources to properly care for, or be beyond the scope of the critical information needed to complete a grant funded project. Knowing when to stop elaborating the information model is an important aspect of good database design.

Example extended: questionable identifications

How does one handle data such as the identification "Palaeozygopleura hamiltoniae (HALL, 1868) ?" that contains an indication of uncertainty as to the accuracy of the determination? If the question mark is stored as part of the taxon name (either in a single taxon name string field, or as an atomized field in a taxon name table), then you can expect your list of distinct taxon names to include duplicate entries for "Palaeozygopleura hamiltoniae (HALL, 1868)" and for "Palaeozygopleura hamiltoniae (HALL, 1868) ?". This is clearly an undesirable duplication of information.

Thinking through the nature of the uncertainty in this case, the uncertainty is an attribute of a particular identification (this specimen may be a member of this species), rather than an attribute of a taxon name (though a species name can incorporate uncertain generic placement: e.g. Loxonema? hamiltoniae, with this generic uncertainty being an attribute of at least some worker's use of the name). But, since uncertainty in identification is a concept belonging to an identification, it is best included as an attribute in an identification associative entity (Figure 11).

Figure 10. Normalized handling of identifications and identifiers. Identifications is an associative entity relating Collection objects, species names and people.

Figure 11. Normalized handling of identifications with denormalized handling of the people who performed the identifications (allowing multiple entries in identification containing the name of a single identifier).
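One hedged way to realize this is a qualifier attribute on the identification entity sketched earlier (the column name and example values are assumptions, not from the paper):

    -- Record uncertainty as an attribute of the identification, not the name:
    ALTER TABLE identification
        ADD COLUMN identification_qualifier VARCHAR(20);  -- e.g. '?', 'cf.', 'near'

    -- The taxon name itself stays unqualified, so the list of distinct
    -- taxon names no longer accumulates entries differing only by a '?'.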

Vocabulary

Information modeling has a widely used technical terminology to describe the extent to which data conform to the mathematical ideals of normalization. One commonly encountered part of this vocabulary is the phrase "normal form". The term first normal form means, in essence, that a database has only one concept placed in each field and no repeating information within one row, that is, no repeating fields and no repeating values in a field. Fields containing the value "1863, 1865, 1885" (repeating values) or the value "Paleozygopleura hamiltoniae Hall" (more than one concept), or the fields Current_identification and Previous_identification (repeating fields), are example violations of first normal form. In second normal form, primary keys do not contain redundant information, but other fields may. That is, two different rows of a table may not contain the same values in their primary key fields in second normal form. For example, a collection object table containing a field for catalog number serving as primary key would not be able to contain more than one row for a single catalog number for the table to be in second normal form. We do not expect a table of collection objects to contain information about the same collection object in two different rows. Second normal form is necessary for rational function of a relational database. For catalog number to be the primary key of the collection object table, a unique index would be required to force each row in the table to have a unique value for catalog number. In third normal form, there is no redundant information in any fields except for foreign keys. A third normal form treatment of geographic names would produce one and only one row containing the value "Philadelphia", and one and only one row containing the value "Pennsylvania".

To make normal forms a little clearer, let's work through some examples. Table 6 is a fragment of a hypothetical flat file database. Table 6 is not in first normal form. It contains three different kinds of problems that prevent it from being in first normal form (as well as other problems related to higher normal forms). First, the Catalog_number and identification fields are not atomic. Each contains more than one concept. Catalog_number contains the acronym of a repository and a catalog number. The identification fields both contain a species name, rather than separate fields for components of that name (generic name, specific epithet, etc.). Second, identification and previous identification are repeating fields. Each of these contains the same concept (an identification). Third, preparations contains a series of repeating values.

So, what transformations of the data do we need to do to bring Table 6 into first normal form? First, we must atomize, that is, split up fields until one and only one concept is contained in each field. In Table 7, Catalog_number has been split into repository and catalog_no; identification and previous identification have been split into generic name and specific epithet fields. Note that this splitting is easy to do in the design phase of a novel database but may require substantial work if existing data need to be parsed into new fields.


Table 6. A table not in first normal form.

Catalog_number  Identification  Previous identification  Preparations
ANSP 641455     Lunatia pilla   Natica clausa             Shell, alcohol
ANSP 815325     Velutina nana   Velutina velutina         Shell

Table 7. Catalog number and identification fields from Table 6 atomized so that each field now contains only one concept.

Repository  Catalog_no  Id_genus  Id_sp  P_id_gen  P_id_sp   Preparations
ANSP        641455      Lunatia   pilla  Natica    clausa    Shell, alcohol
ANSP        815325      Velutina  nana   Velutina  velutina  Shell



Table 7 still isn't in first normal form. The previous and current identifications are held in repeating fields. To bring the table to first normal form we need to remove these repeating fields to a separate table. To link a row in our table out to rows that we move to another table we need to identify the primary key for our table. In this case, Repository and Catalog_no together form the primary key. That is, we need to know both Repository and Catalog number in order to find a particular row. We can now build an identification table containing genus and trivial name fields, a field to identify if an identification is previous or current, and the repository and catalog_no as foreign keys to point back to our original table. We could, as an alternative, add a surrogate numeric primary key to our original table and carry this field as a foreign key to our identifications table. With an identification table, we can normalize the repeating identification fields from our original table as shown in Table 8. Our data still aren't in first normal form, as the preparations field contains a list (repeating information) of preparation types.

Table 8. Current and previous identification fields from Tables 6 and 7 split out into a separate table. This pair of tables allows any number of previous identifications for a particular collections object. Note that Repository and Catalog_no together form the primary key of the first table (they could be replaced by a single surrogate numeric key).

Repository (PK)  Catalog_no (PK)  Preparations
ANSP             641455           Shell, alcohol
ANSP             815325           Shell

Repository  Catalog_no  Id_genus  Id_sp     ID_order
ANSP        641455      Lunatia   pilla     Current
ANSP        641455      Natica    clausa    Previous
ANSP        815325      Velutina  nana      Current
ANSP        815325      Velutina  velutina  Previous

Much as we did with the repeating identification fields, we can split the repeating information in the preparations field out into a separate table, bringing with it the key fields from our original table. Splitting data out of a repeating field into another table is more complicated than splitting out a pair of repeating fields if you are working with legacy data (rather than thinking about a design from scratch). To split out data from a field that holds repeating values you will need to identify the delimiter used to separate values in the repeating field (a comma in this example), write a parser to walk through each row in the table, split the values found in the repeating field on their delimiters, and then write these values into the new table. Repeating values that have been entered by hand are seldom clean. Different delimiters may be used in different rows (comma or semicolon), delimiters may be missing (shell alcohol), spacing around delimiters may vary (shell,alcohol, frozen), the delimiter might be a data value in some rows (alcohol, formalin fixed; frozen, unfixed), and so on. Parsing a field containing repeating values therefore can't be done blindly. You will need to assess the results and fix exceptions (probably by hand); a sketch of this kind of split in SQL follows Table 9 below. Once this parsing is complete (Table 9), we have a set of three tables (collection object, identification, preparation) in first normal form.

Table 9. Information in Table 6 brought into first normal form by splitting it into three tables.

Repository  Catalog_no
ANSP        641455
ANSP        815325

Repository  Catalog_no  Id_genus  Id_sp     ID_order
ANSP        641455      Lunatia   pilla     Current
ANSP        641455      Natica    clausa    Previous
ANSP        815325      Velutina  nana      Current
ANSP        815325      Velutina  velutina  Previous

Repository  Catalog_no  Preparations
ANSP        641455      Shell
ANSP        641455      Alcohol
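For simple cases, the splitting of a repeating field can be sketched in SQL itself rather than in an external script. The statements below are a minimal sketch in MySQL syntax, assuming hypothetical table and column names for this example (a flat collection_object table holding the un-split preparations column, and a new preparation table to receive one value per row); SUBSTRING_INDEX(SUBSTRING_INDEX(str, ',', n), ',', -1) extracts the nth value from a comma delimited list. Real legacy data, with their inconsistent delimiters, would still need assessment and hand correction afterwards.

-- first value in the list (or the whole value if there is no comma)
INSERT INTO preparation (repository, catalog_no, preparation)
    SELECT repository, catalog_no,
           TRIM(SUBSTRING_INDEX(preparations, ',', 1))
    FROM collection_object
    WHERE preparations <> '';

-- second value: only rows containing at least one comma have one
INSERT INTO preparation (repository, catalog_no, preparation)
    SELECT repository, catalog_no,
           TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(preparations, ',', 2), ',', -1))
    FROM collection_object
    WHERE preparations LIKE '%,%';

A third value would repeat the pattern with ',', 3 and a WHERE preparations LIKE '%,%,%' criterion, and so on up to the largest number of values found in any row.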

Non-atomic data and problems with first normal form are relatively common in legacy biodiversity and collections data (handling of these issues is discussed in the data migration section below). Problems with second normal form are not particularly common in legacy data, probably because unique key values are necessary for a relational database to function. Second normal form can be a significant issue when designing a database from scratch and in flat file databases, especially those developed from spreadsheets. In second normal form, each row in a table holds a unique value for the primary key of that table. A collection object table that is not in second normal form can hold more than one row for a single collection object. In considering second normal form, we need to start thinking about keys. In the database design process we may consider candidate keys – fields that could potentially serve as keys to uniquely identify rows in a table. In a collections object table, what information do we need to know to find the row that contains information about a particular collection object? Consider Table 10. Table 10 is not in second normal form. It contains 4 rows with information about a particular collections object. A reasonable candidate for the primary key in a collections object table is the combination of Repository and Catalog number. In Table 10 these fields do not contain unique values. To uniquely identify a row in Table 10 we probably need to include all the fields in the table in a key.

Table 10. A collections object table with repeating rows for the candidate key Repository + Catalog_no.

Repository  Catalog_no  Id_genus  Id_sp   ID_order  Preparation
ANSP        641455      Lunatia   pilla   Current   Shell
ANSP        641455      Lunatia   pilla   Current   Alcohol
ANSP        641455      Natica    clausa  Previous  Shell
ANSP        641455      Natica    clausa  Previous  Alcohol

If we examine Table 10 more carefully we can see that it contains two independent pieces of information about a collections object. The information about the preparation is independent of the information about identifications. In formal terms, one key should determine all the other fields in a table. In Table 10, repository + catalog number + preparation are independent of repository + catalog number + id_genus + id_sp + id_order. This independence gives us a hint on how to bring Table 10 into second normal form. We need to split the independent repeating information out into additional tables so that the multiple preparations per collection object and the multiple identifications per collection object are handled as relationships out to other tables rather than as repeating rows in the collections object table (Table 11).

Table 11. Bringing Table 10 into second normal form by splitting the repeating rows of preparation and identification out to separate tables.

Repository  Catalog_no
ANSP        641455

Repository  Catalog_no  Preparation
ANSP        641455      Shell
ANSP        641455      Alcohol

Repository  Catalog_no  Id_genus  Id_sp   ID_order
ANSP        641455      Lunatia   pilla   Current
ANSP        641455      Natica    clausa  Previous

By splitting the information associated with preparations out of the collection object table into a preparation table and information about identifications out to an identifications table (Table 11) we can bring the information in Table 10 into second normal form. Repository and Catalog number now uniquely determine a row in the collections object table (which in our limited example here now contains no other information). Carrying the key fields (repository + catalog_no) as foreign keys out to the preparation and identification tables allows us to link the information about preparations and identifications back to the collections object. Table 11 is thus now holding the information from Table 10 in second normal form. Instead of using repository + catalog_no as the primary key to the collections object table, we could use a surrogate numeric primary key (coll_obj_ID in Table 12), and carry this surrogate key as a foreign key into the related tables.

Table 11 has still not brought the information into third normal form. The identification table will contain repeating values for id_genus and id_species – a particular taxon name can be applied in more than one identification. This is a straightforward matter of pulling taxon names out to a separate table to allow a many to many relationship between collections objects and taxon names through an identification associative entity (Table 12). Note that both Repository and Preparations could also be brought out to separate tables to remove redundant non-key entries. In this case, this is probably best accomplished by using the text value of Repository (and of Preparations) as the key, and letting a repository table act to control the allowed values for repository that can be entered into the collections object table (rather than using a surrogate numeric key and having to follow that out to the repository table any time you wanted to know the repository of a collections object). Herein lies much of the art of information modeling – knowing when to stop.

Table 12. Bringing Table 11 into third normal form by splitting the repeating values of taxon names in identifications out into a separate table.

Repository  Catalog_no  Coll_obj_ID
ANSP        641455      100

Coll_obj_ID  Preparations
100          Shell
100          Alcohol

Coll_obj_ID  C_taxon_ID  ID_order
100          1           Current
100          2           Previous

Taxon_ID  Id_genus  Id_sp
1         Lunatia   pilla
2         Natica    clausa

Producing an information model.

An information model is a detailed description of the concepts to be stored in a database (see, for example, Bruce, 1992). An information model should be sufficiently detailed for a programmer to use it to construct the back end data storage structures of the database and the code to support the business rules used to maintain the quality of the data. A formal information model should consist of at least three components: an Entity-Relationship diagram, a description of relationship cardinalities, and a detailed description of each entity and each attribute. The latter should include a description of the scope and nature of the data to be held in each attribute.

Relationship cardinalities are text descriptions of the relationships between entities. They consist of a list of sentences, one sentence for each of the two directions in which a relationship can be read. For example, the relationship between species names and identifications in the E-R diagram in Figure 11 could be documented as follows:

Each species name is used in zero or more identifications.

Each identification uses one and only one species name.

The text documentation for each entity and attribute explains to a programmer the scope of the entity and its attributes. The documentation should include particular attention to limits on valid content for the attributes and business rules that govern the allowed content of attributes, especially rules that govern related content spread among several attributes. For example, the documentation of the date attribute of the species names entity in Figure 11 above might define it as being a variable length character string of up to 5 characters holding a four digit year greater than 1757 and less than or equal to the current year. Another rule might say that if the authorship string for a newly entered record already exists in the database and the date is outside the range of the earliest or latest year present for that authorship string, then the data entry system should raise a warning message. Another rule might prohibit the use of a species name in an identification if the date on a species name is more recent than the year of a date identified. This is a rule that could be enforced either in the user interface or in a before insert trigger in the database.
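Such a before insert trigger might be sketched roughly as follows. This is a minimal sketch in modern MySQL syntax (triggers postdate the MySQL versions current when this paper was written); the species_name and identification table and column names are hypothetical stand-ins for whatever the information model actually defines, date_identified is assumed to be a date field, and client delimiter handling is omitted.

CREATE TRIGGER identification_date_check
    BEFORE INSERT ON identification
    FOR EACH ROW
BEGIN
    DECLARE name_year INT;
    -- look up the year of publication of the species name being used
    SELECT year INTO name_year
        FROM species_name
        WHERE species_name_id = NEW.c_species_name_id;
    -- fail if the species name is more recent than the identification
    IF name_year > YEAR(NEW.date_identified) THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'Species name postdates the identification date';
    END IF;
END;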

Properly populated with descriptions of entities and attributes, many CASE tools are capable of generating text and diagrams to document a database, as well as SQL (Structured Query Language) code to generate the table structures for the database, with very little additional effort beyond that needed to design the database.

Example: PH core tables

As an example of an information model, I will describe the core components of the Academy's botanical collection, PH (Philadelphia Herbarium), type specimen database. This database was specifically designed for capturing data off of herbarium sheets of type specimens. The database itself is in MS Access and is much more complex than these core tables suggest. In particular, the database includes tables for handling geographic information in a more normalized form than is shown here.


The summary E-R diagram of core entities for the PH type database is shown in Figure 12. The core entity of the model is the Herbarium sheet; a row in the Herbarium sheet table represents a single herbarium sheet with one or more plant specimens attached to it. Herbarium sheets are being digitally imaged, and the database includes metadata about those images. Herbarium sheets have various sorts of annotations attached and written on them concerning the specimens attached to the sheets. Annotations can include original label data, subsequent identifications, and various comments by workers who have examined the sheet. Annotations can include taxon names, including discussion of the type status of a specimen. Figure 12 shows the entities (and key fields) used to represent this core information about a herbarium sheet.

Figure 12. Core tables in the PH type database.

We can describe each of the relationships between the entities in the E-R diagram in Figure 12 with a pair of sentences describing the relationship cardinalities. These sentences carry the same information as the crow's foot notations on the E-R diagram, but in a more readily intelligible form. To borrow language from the object oriented programming world, they state how many instances of an entity may be related to how many instances of another entity, that is, how many rows in one table may be related to rows of another table by matching rows containing the same values for primary key (in one table) and foreign key (in the other table). The text description of relationship cardinalities can also carry additional information that a particular CASE tool may not include in its notation, such as a limit of an instance of one entity being related to one to three instances of another entity.

Relationship cardinalities:

Each Herbarium Sheet contains zero to many Specimens.

Each Specimen is on one and only one Herbarium Sheet.

Each Specimen has zero to many Annotations.

Each Annotation applies to one and only one Specimen.

Each Herbarium Sheet has zero to many Images.

Each Image is of one and only one Herbarium Sheet.

Each Annotation uses one and only one Taxon Name.

Each Taxon Name is used in zero to many Annotations.

Each Annotation remarks on zero to one Type Status.

Each Type Status is found in one and only one Annotation.

Each Type Status applies to one and only one Taxon Name.

Each Taxon Name has zero to many Type Statuses.

Each Taxon Name is the child of one and only one Higher Taxon.

Each Higher Taxon contains zero to many Taxon Names.

Each Higher Taxon is the child of zero or one Higher Taxon.

Each Higher Taxon is the parent of zero to many Higher Taxa.

The E-R diagram in Figure 12 describes only the core entities of the model in the briefest terms. Each entity needs to be fleshed out with a text description, attributes, and descriptions of those attributes. Figure 13 is a fragment of a larger E-R diagram with more detailed entity information for the Herbarium sheet entity. Figure 13 includes the name and data type of each attribute in the Herbarium sheet entity. The Herbarium sheet entity itself contains very little information. All of the biologically interesting information about a Herbarium sheet (identifications, provenance, etc.) is stored out in related tables.


Figure 13. Fragment of PH core tables E-R diagram showing Herbarium sheet entity with all attributes listed.

Entity-relationship diagrams are still only big picture summaries of the data. The bulk of an information model lies in the entity documentation. Examine Figure 13. Herbarium sheet has an attribute called Name, and another called Date. From the E-R diagram itself, we don't know enough about what sort of information these fields might hold. As the Date field has a data type of timestamp, we could guess that it represents a timestamp generated when a row is entered into the herbarium sheet entity, but without further documentation, we can't know whether this is correct or not. The names of the attributes Name and Date are legacies of an earlier phase in the design of this database; better names for these attributes would be “Created by” and “Date created”. Entity documentation is needed to explain what these attributes are, what sort of information they should hold, and what business rules should be applied to maintain the integrity and validity of that information. Entity documentation for one entity in this model, the Herbarium sheet, follows (in Appendix A) as an example of a suitable level of detail for entity documentation. A definition, the domain of valid values, business rules, and example values all help describe the nature of the information intended to go into a table that implements this entity, and can assist in physical design of the database, design of the user interface, and in future migrations of the data (Figure 1).

Physical design

An information model is a conceptual design for a database. It describes the concepts to be stored in the database. Implementation of a database from an information model involves converting that conceptual design into a physical design, a plan for actually implementing the database in code. Large portions of the information model translate very easily into instructions for building tables. Other portions of an information model require more thought: for example, should a particular business rule be implemented as a trigger, as a stored procedure, or as code in the user interface?

The vast majority of relational database software developed since the mid 1990s uses some variant of the language SQL as the primary means for manipulating the database and the information stored within the database (the clearest introduction I have encountered to SQL is Celko, 1995b). Database server software packages (e.g. MS SQLServer, PostgreSQL, MySQL) allow direct entry of SQL statements through a command line client. However, most database software also provides for some form of graphical front end that can hide the SQL from the user (such as MS Access over the MS Jet engine, or PGAccess over PostgreSQL, or OpenOffice.org, Rekall, Gnome-db, or Knoda over PostgreSQL or MySQL). Other database software, notably Filemaker, does not natively use SQL (this is no longer true in Filemaker 7, which has a script step for running SQL queries). Likewise, CASE tools allow users to design, implement, modify, and reverse engineer databases through a graphical user interface, without the need to write SQL code. While SQL is the language of relational databases, it is quite possible to design, implement, and use relational databases without writing SQL code by hand.

Even if you aren't going to write SQL yourself to manipulate data, it is very helpful to think in terms of SQL. When you want to ask a question of your data, consider what query you would write to answer that question, then think about how to implement that query in your database software. This should help lead you to the desired result set. Note that phrase: result set. Set is an important word. SQL is a set based language. Tables with their rows and columns may look like a spreadsheet. SQL, however, operates not on individual rows but on sets. Set thinking is the key to working with relational databases.


Basic SQL syntax

SQL queries serve two distinctly different purposes. Data definition queries allow you to create structures for holding your data. Data definition queries define tables, fields, indices, stored procedures, and triggers. On the other hand, data manipulation queries allow you to add, edit, and view data. In particular, SELECT queries retrieve data from the database.

Data definition queries can be used to create new tables and alter existing tables. A CREATE TABLE statement simply provides the information needed to create a table, such as a table name, a list of field names, types for each field, constraints to apply to each field, and fields to index. Queries to create a very simple collection object table and to add an index to its catalog number field are shown below (in MySQL syntax, see DuBois, 2003; DuBois et al., 2004). Here I have followed a good form for readability, placing SQL commands in upper case, user supplied names for database elements in lower case, spacing the statements out over several lines, and indenting lines to improve clarity.

CREATE TABLE collection_object (
    collection_object_id INT NOT NULL
        PRIMARY KEY AUTO_INCREMENT,
    acronym CHAR(4) NOT NULL DEFAULT "ANSP",
    catalog_number CHAR(10) NOT NULL
);

CREATE INDEX catalog_number
    ON collection_object (catalog_number);

The create table query above will create a table for the collection object entity shown in Figure 14, and the create index query that follows it will index the catalog number field. SQL has a very English-like syntax. SQL uses a small set of commands such as Create, Select, Update, and Delete. These commands have a simple, easily understood syntax, yet can be extremely flexible, powerful, and complex.

Data placed in a table based on the entity in Figure 14 might look like those in Table 13:

Table 13. Rows in a collection object table.

collection_object_id  acronym  catalog_number
300                   ANSP     34000
301                   ANSP     34001
302                   ANSP     28342
303                   ANSP     100382

SQL comes in a series of subtly different dialects. There are standards for SQL [ANSI X3.135-1986 was the first; most vendors support some subset of SQL-92 or SQL-99, while SQL:2003 is the latest standard (ISO/IEC, 2003; Eisenberg et al., 2003)], and most implementations are quite similar. However, each DBMS implements a subtly different set of features and its own extensions of the standard. A SQL statement in the PostgreSQL dialect to create a table based on the collection object entity in Figure 14 is similar, but not quite identical, to the SQL in the MySQL dialect above:

CREATE TABLE collection_object (
    collection_object_id SERIAL NOT NULL
        UNIQUE PRIMARY KEY,
    acronym VARCHAR(4) NOT NULL DEFAULT 'ANSP',
    catalog_number VARCHAR(10) NOT NULL
);

CREATE INDEX catalog_number
    ON collection_object (catalog_number);

Figure 14. A collection object entity with a few attributes.

Most of the time, you will not actually write data definition queries. In DBMS systems like MS Access and Filemaker there are handy graphical tools for creating and editing table structures. SQL server databases such as MySQL, PostgreSQL, and MS SQLServer have command line interfaces that let you issue data definition queries, but they also have graphical tools that allow creation and editing of table structures without worrying about data definition query syntax. For complex databases, it is best to create and maintain the database design in a separate CASE tool (such as xCase, or Druid, both used to produce E-R diagrams shown herein, or any of a wide range of other commercial and open source CASE tools). Database CASE tools typically have a graphical user interface for design, tools for checking the integrity of the design, and the ability to convert the design to a set of data definition queries. Using a CASE tool, one designs the database, then connects to a data source, and then has the CASE tool issue the data definition queries to build the database. Documentation of the database design can be printed from the CASE tool. Subsequent changes to the database design can be made in the CASE tool and then applied to the database itself.

The workhorse for most database applications is data retrieval. In SQL this is accomplished using the SELECT statement. Select statements can specify the desired fields and the criteria to limit the results returned by a query. MS Access has a very useful graphical query designer. The familiar queries you build with this designer by dragging fields from tables onto the query design and then adding criteria to limit the result sets are just SELECT queries (indeed, it is possible to change the query designer over to SQL view and see the SQL statement you have built with the designer). For those from the Filemaker world, SELECT queries are like designing a layout with the desired fields on it, then changing over to find view, adding criteria to limit the find, and then running the find to show your result set. Here is a simple select statement to list the species in the genus Chicoreus present in a taxonomic dictionary file:

SELECT generic_epithet, trivial_epithet
    FROM taxon_name
    WHERE generic_epithet = "Chicoreus";

This SQL query will return a result set of information – all of the generic and trivial names present in the taxon_name table where the generic name is Chicoreus. Remember that the important word here is “set” (Figure 15). SQL is a set based language. You should think of this query returning a single set of information rather than an iterated list of rows from the source table. Set based thinking is quite different from the iterative thinking common to most programming languages. Behind the scenes, the DBMS may be walking through rows in the table, looking up values in indexes, and using all sorts of interesting creative programming features that are generally of no concern to the user of the database. SQL provides a standard interface on top of the details of exactly how the DBMS is extracting data, an interface that allows you to easily think about sets of information rather than worrying about how to get that information out of its storage structures.

SELECT queries can ask sophisticated questions about aggregates of data. The simplest form of these is a query that returns all the distinct values in a field. This sort of query is extremely useful for examining messy legacy data.

The query below will return a list of the unique values for country and primary_division (state/province) from a locality table, sorted in alphabetic order.

SELECT DISTINCT country, primary_division
    FROM locality_table
    ORDER BY country, primary_division;

In legacy data, a query like this will usually return an interesting list of variations on the spelling and abbreviation of both country names and states. In the MS Access query designer, a property of the query will let you convert a SELECT query into a SELECT DISTINCT query, or you can switch the query designer to SQL view and add the word DISTINCT to the SQL statement. Filemaker allows you to limit options in a picklist to distinct values from a field, but doesn't (as of version 6.1) have a facility for selecting and displaying distinct values in a field other than in a picklist.


Figure 15. Selecting a set.


Working through an example: Extracting identifications

SELECT queries are not limited to a single table. You can ask questions of data across multiple tables at once. The usual way of doing this is to follow a relationship joining one table to another. Thus, in our information model for an identification that has a table for taxon names, another for collection objects, and an associative entity to relate the two in identifications (Figure 11), we can create a query that starts in the collection object table and joins the identification table to it by following the primary key to foreign key based relationship. The query then follows another relationship out to the taxon name table. This join from collection objects to identifications to taxon names provides a list of the identifications for each collection object. Given a catalog number, we can obtain a list of related identifications.

SELECT generic_higher, trivial, author,
        year, parentheses, questionable,
        identifier, date_identified, catalog_number
    FROM collections_object
    LEFT JOIN identification
        ON collection_object_id = c_collection_object_id
    LEFT JOIN taxon_name
        ON c_taxon_id = taxon_id
    WHERE catalog_number = "34000";

Because SQL is a set based language, if there is one collection object with the catalog number 34000 (Table 14) which has three identifications (Tables 15 and 16), this query will return a result set with three rows (Table 17):

Table 14. A collection_object table.

collection_object_id  catalog_number
55253325              34000

Table 15. An identification table.

c_collection_object_id  c_taxon_id  date_identified
55253325                23131       1902/--/--
55253325                13144       1986/--/--
55253325                43441       1998/05/--

Table 16. A taxon_name table.

taxon_id  Generic_higher  trivial
23131     Murex           sp.
13144     Murex           ramosus
43441     Murex           bicornis

Table 17. Selected result set of joined rows from collection_object, identification, and taxon_name.

Generic_higher  trivial   ...  date_identified  catalog_number
Murex           sp.       ...  1902/--/--       34000
Murex           ramosus   ...  1986/--/--       34000
Murex           bicornis  ...  1998/05/--       34000

The collection object table contains only one row with a catalog number of 34000, but the set produced by joining identifications to collection objects contains three rows with the catalog number 34000. SQL is returning sets of information, not rows from tables in the database.

We could order this result set by the date that the collection object was identified, or by a current identification flag, or both (assuming the format of the date_identified field allows for easy sorting in chronological order):

SELECT generic_higher, trivial, author,
        year, parentheses, questionable,
        identifier, date_identified, catalog_number
    FROM collections_object
    LEFT JOIN identification
        ON collection_object_id = c_collection_object_id
    LEFT JOIN taxon_name
        ON c_taxon_id = taxon_id
    WHERE catalog_number = "34000"
    ORDER BY current_identification, date_identified;

Entity-Relationship diagrams show relationships connecting entities. These relationships are implemented in a database as joins between tables. Joins can be much more fluid than implied by an E-R diagram.

SELECT DISTINCT collections_object.catalog_number
    FROM taxon
    LEFT JOIN identification
        ON taxon_id = c_taxon_id
    LEFT JOIN collections_object
        ON c_collections_object_id = collections_object_id
    WHERE taxon.taxon_name = "Chicoreus ramosus";

The query above is straightforward; it returns one row for each catalog number where the object has an identification of Chicoreus ramosus. We can also write a query to follow the same join in the opposite direction. Starting with the criterion set on the taxon table, the query below follows the joins back to the collections_object table to see a selected set of catalog numbers.

SELECT collections_object.catalog_number,
        taxon.taxon_name
    FROM collections_object
    LEFT JOIN identification
        ON collections_object_id = c_collections_object_id
    LEFT JOIN taxon
        ON c_taxon_id = taxon_id;

Following a relationship like this from the many side to the one side takes a little more thinking about. The query above will return a result set with one row for each taxon name that is used in an identification, and, if a collection object has more than one identification, its catalog number will appear in more than one row. This is the normal behavior of a query across a join that represents a many to one relationship. The result set will be inflated to include one row for each selected row on the many side of the relationship, with duplicate values for the selected columns on the other side of the relationship. This also is why the previous query was a SELECT DISTINCT query. If it had simply been a select query and there were specimens with more than one identification of “Chicoreus ramosus”, the catalog numbers for those specimens would be duplicated in the result set. Think of queries as returning result sets rather than rows from database tables.

Thinking in sets rather than rows is evident when you perform update queries to alter data already in the database. In a programming language, you would think of iterating through each row in a table, checking to see if that row matched the criteria for an update, and then applying an update to that row if it did. You can think of an SQL update query as simply selecting the set of records that match your criteria and applying the update to that set as a whole (Figure 16, top).

UPDATE species_dictionary
    SET genus = "Chicoreus"
    WHERE genus = "Chicoresu";

Figure 16. An SQL update statement should be thought of as acting on an entire result set at once (top), rather than walking through each row in the table, as might be implemented in an iterative programming language (bottom).

Nulls and tri-valued logic

Boolean logic with its operations on true and false is at least vaguely familiar to most of us. SQL throws in an added twist. It uses tri-valued logic. SQL expressions may be true, false, or null. A field may contain a null value. A null is different from an empty string or a zero. A character field intended to hold generic names could potentially contain “Silurus”, or “Chicoreus”, or “Palaeozygopleura”, or “” (an empty string), or NULL as valid values. An integer field could hold 1, or 5, or 1024, or -1, or 0, or NULL. Nulls make the most sense in the context of numeric fields or date fields. Suppose you want to use a real number field to hold a measurement of a specimen, say maximum shell height in a gastropod. Storing the number in a real number field will make it easy for you to calculate sums, means, and perform other mathematical operations on this field. You are left with a problem, however, when you don't know what value to put in that field. Suppose the specimen in front of you is a slug (with no shell to measure). What value do you place in the shell height field? Zero might make sense, but won't produce sensible results for some sorts of calculations. A negative number, or more broadly a number outside the range of expected valid values (such as 99 for year in a two digit date field in a database designed in the 1960s) that you could use to exclude out of range values before performing your calculation? Your perception of the scope of valid values might not match that of users of the system (as when the 1960s data survived to 1999). In our example of values for shell height, if someone decides that hyperstrophic gastropods should have negative values of shell height, as they coil up the axis of coiling instead of down it like normal orthostrophic gastropods, the values -1 and 0 would no longer fall outside the scope of valid shell heights. Null is the SQL solution to this problem. Nulls don't behave as numbers. Nulls allow you to flag records for which there is no sensible in range value to place in a field. Nulls make slightly less sense in character fields where you can allow explicit values such as “Not Applicable”, “Unknown”, or “Not examined” that let you explicitly record the reason that a value was not entered in the field. The difficulty in this case is in maintaining the same value for the same concept over time, preventing “Not Applicable” from being entered by some users and “N/A” by others and “n/a” and “” by others. Code to help users consistently enter “Not Applicable” or “Unknown” can be embedded in the user interface, but fundamentally, ensuring consistent data entry in this form is a matter of careful user training, quality control procedures, and detailed documentation.

Nulls make for interesting complications when it comes time to query the database. We normally think of expressions in programs as following some set of rules to evaluate as either true or false. Most programming languages have some construct that lets us take an action if some condition is met: IF some expression is true THEN do something. The expression (left(genus,4) <> "Silu") would sensibly seem to evaluate to true for all cases where the first four characters of the genus field are not "Silu". Not so in an SQL database. Nulls propagate. If an expression contains a null, the null will propagate to make the result of the whole expression null. If the value of genus in some row is null, the expression left(NULL,4) <> "Silu" will evaluate to null, not to true or false. Thus the statement select generic, trivial from taxon_name where (left(generic,4) <> "Silu") will not return the expected result set (it will not include rows where generic is NULL). Nulls are handled with a function, such as IsNull(), which can take a null and return a true or false result. Our query needs an additional term: select generic, trivial from taxon_name where ((left(generic,4) <> "Silu") or IsNull(generic));
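IsNull() is the MS Access form of this test; in standard SQL the same check is written with the IS NULL predicate. A minimal sketch, run against the taxon_name table used in the examples above (here in MySQL syntax, where LEFT() is available):

SELECT generic, trivial
    FROM taxon_name
    WHERE LEFT(generic, 4) <> 'Silu'
        OR generic IS NULL;
-- without the IS NULL term, rows where generic is null
-- silently drop out of the result set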

Maintaining integrity

In a spreadsheet or a flat file database, deleting a record is a simple matter of removing a single row. In a relational database, removing records and changing the links between records in related tables becomes much more complex. A relational database needs to maintain database integrity. An important part of maintaining integrity is knowing what to do with related records when you delete a record on one side of a join. Consider a scenario: you are cataloging a collection object and you enter data about it into a database (identification, locality, catalog number, kind of object, etc.). You then realize that you entered the data for this object yesterday, and you are creating a duplicate record that you want to delete. How far does the delete go? You no doubt want to get rid of the duplicate record in the collection object table and the identifications attached to this record, but you don't want to keep following the links out to the authority file for taxon names and delete the names of any taxa used in identifications. If you delete a collections object you do not want to leave orphan identifications floating around in the database unlinked to any collections object. These identifications (carrying a foreign key for a collections object that doesn't exist) can show up in subsequent queries and have the potential to become linked to new collections objects (silently adding incorrect identifications to them as they are created). Such orphan records, which retain links to no longer existent records in other tables, violate the relational integrity of the database.


When you delete a record, you may or may not want to follow joins (relationships) out to related tables to delete related records in those tables. Descriptions of relationships themselves do not provide clear guidance on how far deletions should propagate through the database and how they should be handled to maintain relational integrity. If a collection object is deleted, it makes sense to delete the identifications attached to that object, but not the taxon names used in those identifications, as they are probably used to identify other collection objects. If, in the other direction, a taxon name is deleted, the existence of any identifications that use that taxon name almost certainly means that the delete should fail and the name should be retained. An operation such as merging a record containing a correctly spelled taxon name with a record containing an incorrectly spelled copy of the same name should correct any links to the incorrect spelling prior to deleting that record.

Relational integrity is best enforced with constraints, triggers, and code enforcing rules at the database level (supported by error handling in the user interface). Some database languages support foreign key constraints. It is possible to join two tables by including a column in one table that contains values that match the values in the primary key of another table. It is also possible to explicitly enforce foreign key constraints on this column. Including a foreign key constraint in a table definition will require that values entered in the foreign key column match values found in the related primary key. Foreign key constraints can also include cascading deletes. Deleting a row in one table can cascade out to related tables with foreign key constraints linked to the first table. A foreign key constraint on the c_collections_object_id field of an identification table could cause deletes from the related collections object table to cascade out and remove related rows from the identification table. Support for such deletion of related rows varies between database systems.
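The constraint just described might be declared roughly as follows; this is a minimal sketch reusing the hypothetical table and column names from the queries above (in MySQL, the tables would need to use a storage engine, such as InnoDB, that enforces foreign keys):

CREATE TABLE identification (
    identification_id INT NOT NULL PRIMARY KEY,
    c_collections_object_id INT NOT NULL,
    c_taxon_id INT NOT NULL,
    date_identified CHAR(10),
    -- values entered here must match an existing collections object;
    -- deleting that collections object deletes its identifications
    FOREIGN KEY (c_collections_object_id)
        REFERENCES collections_object (collections_object_id)
        ON DELETE CASCADE
);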

Triggers are blocks of code in a database that are executed when particular actions are performed. An on delete trigger is a block of code tied to a table in a database that can fire when a record is deleted from that table. An on delete trigger for a collections object could, like a foreign key constraint, delete related records in an identification table. Triggers, unlike constraints, can contain complex logic and can do more than simply affect related rows. An on delete trigger for a taxon name table could check for related records in an identification table and cause the delete operation to fail if any related records exist. An on insert or on update trigger can include complex format checking and business rule checking code, and, as we will see later, triggers can be very helpful in maintaining the integrity of hierarchical information (trees) stored in a database.
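A sketch of such a protective on delete trigger, again in modern MySQL syntax with the hypothetical identification and taxon_name tables used in earlier examples (client delimiter handling omitted):

CREATE TRIGGER taxon_name_delete_check
    BEFORE DELETE ON taxon_name
    FOR EACH ROW
BEGIN
    -- fail the delete if any identification still uses this taxon name
    IF EXISTS (SELECT 1 FROM identification
               WHERE c_taxon_id = OLD.taxon_id) THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'Taxon name is used in an identification';
    END IF;
END;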

Triggers, foreign keys, and other operations executed on the database server do have a downside: they involve the processing of code, and thus reduce the speed of database operations. In many cases (where you are concerned about the integrity of the data), you will want to support these operations somewhere – either in user interface code, in a middle layer of business logic code, or as code embedded in the database. Embedding rules to support the integrity of the data in the database (through triggers and constraints) can be an effective way of ensuring that all users and clients that attach to the database have to follow the same business rules. Triggers can also simplify client development by reducing the number of operations the client must perform to maintain integrity of the data.

User rights & Security

Another important element of maintaining data quality is control over who has access to a database. Limits on who is able to add data and who is able to alter data are essential. Unrestricted database access to all and sundry is an invitation to unusable data. At a minimum, guests should have select only access to public parts of the database, data entry personnel should have limited select and update (and perhaps delete) rights to parts of the database, a limited set of skilled users may be granted update access to tables housing controlled vocabularies, and only system administrators should have rights to add users or alter user privileges. Particular business functions (such as collection managers filling loans, curators approving loans, or a registrar approving accessions) may also require restrictions to limit these operations on the database to only the correct users. User rights are best implemented at the database level. Database systems include native methods for restricting user rights. You should implement rights at this level, rather than trying to implement a separate privilege system in the user interface. You will probably want to mirror the database privileges in the front end (for example, hiding administrative menu options from lower level users), but you should not rely on code in the front end of a database to restrict the ability of users to carry out particular operations. If a database front end can connect a user to a database backend with a high privilege level, the potential exists for users to skip the front end and connect directly to the database with a high privilege level (see Morris 2001 for an example of a server wide security risk introduced by a design that implemented user access control in the client).
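In most SQL databases these tiers of access are expressed with GRANT statements. A minimal sketch, with hypothetical account names and the table names used in earlier examples:

-- guests may only read the public data
GRANT SELECT ON collections_object TO guest;
-- data entry personnel may read, add, and change records
GRANT SELECT, INSERT, UPDATE ON collections_object TO data_entry;
GRANT SELECT, INSERT, UPDATE ON identification TO data_entry;
-- only a skilled subset of users may maintain the controlled vocabulary
GRANT SELECT, INSERT, UPDATE ON taxon_name TO authority_editor;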

Implementing as joins & implementing as views

In many database systems, a set of joins can be stored as a view of a database. A view can be treated much like a table. Users can query a view and get a result set back. Views can have access rights granted by the access privilege system. Some views will accept update queries and alter the data in the tables that lie behind them. Views are particularly valuable as a tool for restricting a class of users to a subset of possible actions on a subset of the database and enforcing these restrictions at the database level. A user can be granted no rights at all to a set of tables, but given select access to a view that shows a subset of information from those tables. An account that updates a web database by querying a master database might be granted select only access to a view that limits it to just the information needed to update the web dataset (such as a flat view of Darwin Core [Schwartz, 2003; Blum and Wieczorek, 2004] information). Given the complex joins and very complex structure of biodiversity information, views are probably not practical ways to restrict data entry privileges for most biodiversity databases. Views may, however, be an appropriate means of limiting guest access to a read only view of the data.
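A sketch of this approach, reusing the example tables above (the view name and the guest account are hypothetical):

-- a flat, read only view joining the normalized tables
CREATE VIEW public_identifications AS
    SELECT catalog_number, generic_higher, trivial, date_identified
    FROM collections_object
    LEFT JOIN identification
        ON collection_object_id = c_collection_object_id
    LEFT JOIN taxon_name
        ON c_taxon_id = taxon_id;

-- guests can query the view, but hold no rights on the underlying tables
GRANT SELECT ON public_identifications TO guest;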

Interface design

Simultaneously with the conceptual and physical design of the back end of a database, you should be creating a design for the user interface to access the data. Existing user interface screens for a legacy database, paper and pencil designs of new screens, and mockups in database systems with easy form design tools such as Filemaker and MS Access are of use in interface design. I feel that the most important aspect of interface design for databases is to fit the interface to the workflow, abstracting the user interface away from the underlying complex data structures and fitting it to the tasks that users perform with the data. A typical user interface problem is to place the user interface too close to the data by creating one data entry screen for each table in the database. In anything other than a very simple database, having the interface too close to the data ends up in a bewildering profusion of pop up windows that leave users entirely confused about where they are in data entry and how the current open window relates to the task at hand.

Figure 17. A picklist control for entering taxon names.

Consider the control in Figure 17. It allows a user to select a taxon name (say, to provide an identification of a collection object) off of a picklist. This control would probably allow the user to start typing the taxon name in the control to jump to the relevant part of a very long picklist. A picklist like this is a very seductive form element in many situations. It can allow a data entry person to make fewer keystrokes and mouse gestures to enter a particular item of information than by filling in a set of fields. It can mask substantial complexity in the underlying database (the taxon name might be built from 12 fields or so, and the control might be bound to a field holding a surrogate numeric key representing a particular combination). By having users pick values off of a list you can enforce a controlled vocabulary and can avoid the entry of misspelled taxon names and other complex vocabulary. Picklists, however, have a serious danger. If a data entry person selects the wrong taxon name when entering an identification from the picklist above, there is no way for anyone to find that a mistake has been made without having someone return to the original source for the information and compare the record against that source (Figure 18). In contrast, a misspelled taxon name is usually easy to locate (by comparison with a controlled list of taxon names). If data is entered as text, simple misspellings can be found, identified, and fixed. Avoid picklists as sole sources of information.

Figure 18. A picklist being used as the sole source of locality information.

One option to avoid the risk of unfindable errors is to entirely avoid the use of picklists in data entry. Simply exchanging picklists for text entry controls on forms will result in the loss of the advantages of picklist controls: more rapid data entry and, more importantly, a controlled vocabulary. It is possible to maintain authority control and use text controls by writing code behind a text entry control that will enforce a controlled vocabulary by querying an authority file using the information entered in the text control and throwing an error (and presenting the user with an error message) if no match is found in the controlled vocabulary in the authority file. This alternative can work well for single word entries such as generic names, where it is faster to simply type a name than it is to open a long picklist, move to the correct location on the list, and select a value. Replacing a picklist with a controlled text box, however, is not a good choice for complex formatted information such as locality descriptions.

Another option to avoid the risk of unfindable errors is to couple a picklist with a text control (Figure 19). A collecting event could be linked to a locality through a picklist of localities, coupled with a redundant text field to enter a named place. The data entry person needs to make more than one mistake to create an unfindable error. To make an unfindable error, the data entry person needs to select the wrong value from the picklist, enter the wrong value in the text box, and have the incorrect text box value match the incorrect choice from the picklist (an error that is still quite conceivable, for example if the data entry person looks at the label for one specimen while typing in information about another specimen). The text box can hold a terminal piece of information that can be correlated with the information in the picklist, or a redundant piece of information that must match a value on the picklist. A picklist of species names and a text box for the trivial epithet allow an error to be raised whenever the trivial epithet in the text box does not match the species name selected on the picklist. Note that the value in the text box need not be stored as a field in the database if the quality control rules embedded in the database require it to match the picklist. Alternately, the values can be stored and used to flag records for later review in the quality control process.

Figure 19. A picklist and a text box used in combination to capture and check locality information. Step 1, the user selects a locality from the picklist. Step 2, the database looks up higher level geographic information. Step 3, the user enters the place name associated with the locality. Step 4, the database checks that the named place entered by the user is the correct named place for the locality they selected off the picklist.
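The check in step 4 of Figure 19 could be implemented with a query along these lines (a minimal sketch; the locality table, its columns, and the literal values standing in for the user's picklist choice and typed entry are hypothetical):

SELECT COUNT(*)
    FROM locality
    WHERE locality_id = 123
      AND named_place = 'Philadelphia';
-- 123 stands for the locality chosen from the picklist and
-- 'Philadelphia' for the named place typed into the text box;
-- a count of zero means the two do not match, and the form
-- should raise an error for the data entry person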

Design your forms to function without the need for lifting hands off the keyboard. Data entry should not require the user to touch the mouse. Moving to the next control, pressing a button, moving to the next record, opening a picklist, and duplicating information from the previous record are all operations that can be done from the keyboard. Human interface design is a discipline in its own right, and I won't say more about it here.

Practical Implementation

Be Pragmatic

Most natural history collections operate in an environment of highly limited resources. Properly planning, designing, and implementing a database system following all of the details of some of the information models that have been produced for the community (e.g. Morris 2000) is a task beyond the resources of most collections. A reasonable estimate for a 50 to 100 table database system includes about 500-1000 stored procedures, more than 100,000 lines of user interface code, one year of design, two or more years of programming, a development team including a database programmer, database administrator, user interface programmer, user interface designer, quality control specialist, and a technical writer, all running to some $1,000,000 in costs. Clearly, natural history collections that are developing their own database systems (rather than using external vendors or adopting community based tools such as BioLink [CSIRO, 2001] or Specify) must make compromises. These compromises should involve selecting the most important elements of their collections information, spending the most design, data cleanup, and programming effort on those pieces of information, and then omitting the least critical information or storing it in less than third normal form data structures.

A possible candidate for storage in less than ideal form is the generalized Agent concept that can hold persons and institutions that can be linked to publications as authors, linked to collection objects as preparators, collectors, identifiers, and annotators, and linked to transactions as donors, recipients, packers, authorizers, shippers, and so forth. For example, given the focus on collection objects, using Agents as authors of publications (through an authorship list associative entity) may introduce substantial complications in user interface design, code to maintain data integrity, and the handling of existing legacy data that produce costs far in excess of the utility gained from proper third normal form handling of the concept of authors. Conversely, a database system designed specifically to handle bibliographic information requires very clean handling of the concept of Authors in order to be able to produce bibliographic citations in multiple different formats (at a minimum, the author last name and initials need to be atomized in an author table and they need to be related to publications through an authorship list associative entity). Abandoning third normal form (or higher) in parts of the database is not a bad thing for natural history collections, so long as the decisions to use lower normal forms are clearly thought out and restricted to the least important parts of the data.

I chose the example of Agents as a possible target for reduced database complexity deliberately. Some institutions and users will immediately object that a generalized Agent related to transactions and collection objects is of critical importance to their data. Perfect. This is precisely the approach I am advocating. Identify the most important parts of your data, and put your time, effort, design, programming, and data manipulation into making sure that your database system is capable of cleanly handling those most critical areas. Identify the concepts that are not of critical importance and minimize the design complexity you allow them to introduce into your database (recognizing that problems will accumulate in the quality of these data). In a setting of limited resources, we are seldom in a situation where we can build systems to store all of


the highly complex information associated with collections in optimum form. This fact does not, however, excuse us from identifying the most important information and applying the best solutions we can to the stewardship of that information.

Approaches to management of date information

Dates in collection data are generally problematic as they are often known only to a level of precision less than a single day. Dates may be known to the day, or in some cases to the time of day, but often they are known only to the month, or to the year, or to the decade. In some cases, dates are known to be prior to a particular date (e.g. the date donated may be known but the date collected may not, other than that it is sometime prior to the date donated). In other cases dates are known to be within a range (e.g. between 1932-June-12 and 1932-July-15⁴); in yet others they are known to be within a set of ranges (e.g. collected in the summers of 1852 to 1855). Designing database structures to be able to effectively store, retrieve, sort, and validate this range of possible forms for dates is not simple (Table 18).

Using a single field with a native date data type to hold collections date information is generally a poor idea, as date data types require each date to be specified to the precision of one day (or finer). Simply storing dates as arbitrary text strings is flexible enough to encompass the wide variety of formats that may be encountered, but storing dates as arbitrary strings does not ensure that the values added are valid dates, in a consistent format, are sortable, or even searchable.

Storage of dates effectively requires the implementation of an indeterminate or arbitrary precision date range data type supported by code.

⁴ There is an international standard date and time format, ISO 8601, which specifies standard numeric representations for dates, date ranges, repeating intervals and durations. ISO 8601 dates include notations like 19 for an indeterminate date within a century, 1925-03 for a month, 1860-11-05 for a day, and 1932-06-12/1932-07-15 for a range of dates.

An arbitrary precision date data type can be implemented most simply by using a text field and enforcing a format on the data allowed into that field (by binding a picture statement or format expression to the control used for data entry into that field or to the validation rules for the field). A format like "9999-Aaa-99 TO 9999-Aaa-99" can force data to be entered in a fixed standard order and form. Similar format checks can be imposed with regular expressions. Regular expressions are an extremely powerful tool for recognizing patterns, found in an expanding number of languages (Perl, PHP, and MySQL all include support for regular expressions). A regular expression for the date format above (with the range portion optional) looks like this: /^[0-9]{4}-[A-Z][a-z]{2}-[0-9]{2}( TO [0-9]{4}-[A-Z][a-z]{2}-[0-9]{2})?$/. A regular expression for an ISO date or date range looks like this: /^[0-9]{2,4}(-[0-9]{2}(-[0-9]{2})?)?(\/[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?)?$/. Note that simple patterns like these still do not test whether the dates entered are valid.
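As a sketch of how such a pattern check might be exercised inside the database, MySQL's REGEXP operator can test a candidate string against the date range format above (the expression returns 1 for a match and 0 otherwise; this validates form only, not whether the dates are real):

SELECT "1932-Jun-12 TO 1932-Jul-15"
   REGEXP "^[0-9]{4}-[A-Z][a-z]{2}-[0-9]{2}( TO [0-9]{4}-[A-Z][a-z]{2}-[0-9]{2})?$"
   AS is_valid_format;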

Another date storage possibility is to use a set of fields to hold start year, end year, start month, end month, start day, and end day. A set of such numeric fields can be sorted and searched more easily than a text date range field, but needs careful planning of what values are placed in the day fields for dates for which only the month is known, and other handling of indeterminate precision.

From a purely theoretical standpoint, using a pair of native date data type fields to hold start day and end day is the best way to hold indeterminate date and date range information (as 1860 translates to the range 1860-01-01 to 1860-12-31). Native date data types have native recognition of valid and invalid values, sorting functions, and search functions. Implementing dates with a native date data type avoids the need to write code to support validation, sorting, and other things that are needed for dates to work. Practical implementation of dates using a native date data type, however, would not work well as just a pair of date fields exposed for text entry on the user interface. Rather than simply typing "1860", the data entry person would need to stop, think, type 1860-01-01, move to the end date field, then hopefully remember the last day of the year correctly and enter it.


Efficient and accurate data entry would require a user interface with code capable of accepting "1860" as a valid entry and storing it as an appropriate date range. Printing is also an issue – the output we would expect on a label would be "1860", not "1860-01-01 to 1860-12-31", for cases where the date was only known to a year, with a range only printing when the range was the originally known value. An option to handle this problem is to use a pair of date fields for searching and a text field for verbatim data and printing, though this solution introduces redundancy and possible accumulation of errors in the data.
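A minimal sketch of this option, assuming a hypothetical collecting_event table: the verbatim field is printed on labels exactly as recorded, while the date pair supports sorting and range searches.

CREATE TABLE collecting_event (
   event_id                INTEGER NOT NULL PRIMARY KEY,
   date_collected_verbatim VARCHAR(50),  -- printed as recorded, e.g. "1860"
   date_collected_start    DATE,         -- "1860" stored as 1860-01-01
   date_collected_end      DATE          -- and 1860-12-31
);

-- Find events that could have occurred in June 1860 (ranges overlap):
SELECT event_id FROM collecting_event
WHERE date_collected_start <= "1860-06-30"
  AND date_collected_end >= "1860-06-01";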

Multiple start and end points (such as summers of several years) are probably rare enough values to hold in a separate text date qualifier field. A free text date qualifier field, separate from a means of storing a date range, as a means for handling these exceptional records would preserve the data, but introduces a reduction of precision in searches (as effective searches could only operate on the range end points, not the free text qualifier). Properly handling events that occur within multiple date ranges requires a separate entity to hold date information. The added code and interface complexity needed to support such an entity is probably an unnecessary burden to add for most collections data.

Handling hierarchical information

Hierarchies are pervasive in biodiversity informatics. Taxon names, geopolitical entities, chronostratigraphic units, and collection objects are all inherently hierarchical concepts – they can all be represented as trees of information. The taxonomic hierarchy is very familiar (e.g. a Family contains Subfamilies which contain Genera). Collection objects are intuitively hierarchical in some disciplines. For example, in invertebrate paleontology a bulk sample may be cataloged and later split, with part of the split being sorted into lots. Individual specimens may be split from these lots. These specimens may be composed of parts (paired valves of bivalves), which can have thin sections made from them. In addition, derived objects (such as casts) can be made from specimens or their parts. All of these different collection objects may be assigned their own catalog numbers but are still related by a series of lot splits and preparation steps to the original bulk sample. A bird skin is not so obviously hierarchical, but skin, skeleton, and frozen tissue samples from the same bird may be stored separately in a collection.

Some types of database systems are better at handling hierarchical information than others. Relational databases do not have easy and intuitive ways of storing hierarchies in a form that is both easy to access and maintains the integrity of the data. Older hierarchical database systems were designed around business hierarchies and natively understood hierarchies. Object oriented databases can apply some of the basic properties of extensible objects to easily handle hierarchical information.


Table 18. Comparison of some ways to store date information

Fields                 Data type               Issues
Single date field      date                    Native sorting, searching, and validation.
                                               Unable to store date ranges; will introduce
                                               false precision into data.
Single date field      character               Can sort on start date; can handle single dates
                                               and date ranges easily. Needs minimum of pattern
                                               or format applied to entry data; requires code
                                               to test date validity.
Start date and end     two character fields,   Able to handle date ranges and arbitrary
date fields            6 character fields, or  precision dates. Straightforward to search and
                       6 integer fields        sort. Requires some code for validation.
Start date and end     two date fields         Native sorting and validation. Straightforward
date fields                                    to search. Able to handle date ranges and
                                               arbitrary precision. Requires carefully designed
                                               user interface with supporting code for
                                               efficient data entry.
Date table containing  date field, numeric     Handles single dates, multiple non-continuous
date field and         fields, character       dates, and date ranges. Needs complex user
start/end field        field, or character     interface and supporting code.
                       fields


Object extensions are available for some relational database management systems and can be used to store hierarchical information more readily than in relational systems with only a standard set of SQL data types.

There are several different ways to store hierarchical information in a relational database. None of these are ideal for all situations. I will discuss the costs and benefits of three different structures for holding hierarchical information in a relational database: flattening a tree into a denormalized table, edge representation of trees, and a tree visitation algorithm.

Denormalized table

A typical legacy structure for the storage of higher taxonomy is a single table containing separate fields for Genus, Family, Order and other higher ranks (Table 19). (Note that order is a reserved word in SQL [as in the ORDER BY clause of a SELECT statement] and most SQL dialects will not allow reserved words to be used as field or table names. We thus need to use some variant such as T_Order as the actual field name.) Placing a hierarchy in a single flat table suffers from all the risks associated with non normal structures (Table 20).

Placing higher taxa in a flat table like Table 19 does allow for very easy extraction of the higher classification of a taxon in a single query, as in the following examples. A flat file is often the best structure to use for a read only copy of a database used to power a searchable website. Asking for the family to which a particular genus belongs is very simple:

SELECT family FROM higher_taxonomy
WHERE genus = "Chicoreus";

Likewise, asking for the higher classification of a particular species is very straightforward:

SELECT class, t_order, family
FROM higher_taxonomy
LEFT JOIN taxon_name
  ON higher_taxonomy.genus = taxon_name.genus
WHERE taxon_name_id = 352;
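The queries above assume a flat table along the following lines (a sketch, with field names inferred from Table 19 and the examples; T_Order avoids the reserved word ORDER):

CREATE TABLE higher_taxonomy (
   class     VARCHAR(40),
   t_order   VARCHAR(40),
   family    VARCHAR(40),
   subfamily VARCHAR(40),
   genus     VARCHAR(40) NOT NULL PRIMARY KEY  -- one row per genus
);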

Edge Representation

Hierarchical information is typically described in an information model using an entity with a one to many link to itself (Figure 20). For example, a taxon entity with a relationship where a taxon can be the child of zero or one higher taxon and a parent taxon can have zero to many child taxa. (Parent and child are used here in the sense of the computer science description of trees, where a parent node in the tree can be linked to several child nodes, rather than in any genealogical or evolutionary sense.)

Taxonomic hierarchies are nested sets and can readily be stored in tree data structures. Thinking of the classification of animals as a tree, Kingdom Animalia is the root node of the tree. The immediate child nodes under the root might be the thirty some phyla of animals, each with their own subphylum, superclass, or class children.


Table 19. Higher taxa in a denormalized flat file table

Class       T_Order          Family     Sub Family  Genus
Gastropoda  Caenogastropoda  Muricidae  Muricinae   Murex
Gastropoda  Caenogastropoda  Muricidae  Muricinae   Chicoreus
Gastropoda  Caenogastropoda  Muricidae  Muricinae   Hexaplex

Table 20. Examples of problems with a hierarchy placed in a single flat file.

Class       T_Order          Family     Sub Family  Genus
Gastropoda  Caenogastropoda  Muricidae  Muricinae   Murex
Gastropod   Caenogastropoda  Muricidae  Muricinae   Chicoreus
Gastropoda  Neogastropoda    Muricinae  Muricidae   Hexaplex

Figure 20. A taxon entity with a join to itself. Each taxon has zero or one higher taxon. Each taxon has zero to many lower taxa.


Animalia could thus be the parent node of the phylum Mollusca. Following the branches of the tree down to lower taxa, the terminal nodes or leaves of the tree would be species and subspecies. Taxonomic hierarchies readily translate into trees, and trees are very familiar data structures in computer science.

Storage of a higher classification as a tree is typically implemented using a table structure that holds an edge representation of the tree hierarchy. In an edge representation, a row in the table has a field for the current node in the tree and another field that contains a pointer to the current node's parent. The simplest case is a table with two fields, say a higher taxon table containing a field for taxon name and another field for parent taxon name.

CREATE TABLE higher_taxon (
   taxon_name   CHAR(40) NOT NULL PRIMARY KEY,
   higher_taxon CHAR(40) NOT NULL
);

Table 21. An edge representation of a tree

Taxon_name (PK)  Higher_taxon
Gastropoda       [root]
Caenogastropoda  Gastropoda
Muricidae        Caenogastropoda
Chicoreus        Muricidae
Murex            Muricidae

In this implementation (Table 21) you can follow the parent – child links recursively to find the higher classification for a genus, or to find the genera placed within an order. However, this implementation requires recursive queries to move beyond the immediate parent or child of a node. Given a genus, you can easily find its immediate parent and its immediate parent's parent.

SELECT t1.taxon_name, t2.taxon_name, t2.higher_taxon
FROM higher_taxon AS t1
LEFT JOIN higher_taxon AS t2
  ON t1.higher_taxon = t2.taxon_name
WHERE t1.taxon_name = "Chicoreus";

The query above will return a result set "Chicoreus", "Muricidae", "Caenogastropoda" from the data in Table 21. Unfortunately, unless the tree is constrained to always have a fixed number of ranks between every genus and the root of the tree (and the entry point for a query is always a generic name), it is not possible to set up a single query that will always return the higher classification for a given generic name. The only way to effectively query this table is with program code that recursively issues a series of SQL statements to walk up (or down) the tree by following the higher_taxon to taxon_name links, that is, by recursively traversing the edge representation of the tree. Such code could be either implemented as a stored procedure in the database or higher up within the user interface.

By using the taxon_name as the primary key, we impose a constraint that will help maintain the integrity of the tree, that is, each taxon name is allowed to occur at only one place in the tree. We can't place the genus Chicoreus into the family Muricidae and also place it in the family Turridae. Forcing each taxon name to be a unique entry prevents the placement of anastomoses in the tree. More than just this constraint is needed, however, to maintain a clean representation of a taxonomic hierarchy in this edge representation. It is possible to store infinite loops by linking the higher_taxon of one row to a taxon name that links back to it. For example (Table 22), if the genus Murex is in the Muricidae, and the higher taxon for Muricidae is set to Murex, an infinite loop is established where Murex becomes a higher taxon of itself, and Murex is not linked to the root of the tree.

Table 22. An error in the hierarchy.

Taxon_name (PK)  Higher_taxon
Gastropoda       [root]
Caenogastropoda  Gastropoda
Muricidae        Murex
Chicoreus        Muricidae
Murex            Muricidae

Maintenance of a table containing an edge representation of a large tree is difficult. It is easy for program freezing errors such as the infinite loop in Table 22 to be inadvertently created unless there is code testing each change to the table for validity.

The simple taxon_name, higher_taxon structure has another problem: How do you print the family to which a specimen belongs on its label? A solution is to add a rank column to the table (Table 23).


Selecting the family for a genus then becomes a case of recursively following the taxon_name – higher_taxon links back to a taxon name that has a rank of Family. The pseudocode below (Code Example 1) illustrates (in an overly simplistic way) how ranks could work by embedding the logic in a code layer sitting above the database. It is also possible to implement rank handling at the database level in a database system that supports stored procedures.

Table 23. Edge representation with ranks.

Taxon_name (PK)  Higher_taxon     Rank
Gastropoda       [root]           Class
Caenogastropoda  Gastropoda       Order
Muricidae        Caenogastropoda  Family
Chicoreus        Muricidae        Genus
Murex            Muricidae        Genus

An edge representation of a tree can be stored as in the examples above using a taxon name as a primary key. It is also possible to use a numeric surrogate key and to store the recursive foreign key for the parent as a number (e.g. c_taxon_id_parent). If you are updating a hierarchical taxonomic dictionary through a user interface with a code layer between the user and the table structures, using the surrogate key values in the recursive foreign key to identify the parent taxon of the current taxon is probably a good choice. If the hierarchical taxonomic dictionary is being maintained by someone who is editing the file directly, then using the taxon name as the foreign key value is probably a better choice. It is much easier for a knowledgeable person to check a list of names for consistency than a list of numeric links. In either case, a table containing an edge representation of a tree will require supporting code, either in the form of a consistency check that can be run after direct editing of the table or in the form of a user interface that allows only legal changes to the table. Without supporting code, the database itself is not able to ensure that all rows are placed somewhere in the tree (rather than being unlinked nodes or members of infinite loops), and that the tree is a consistent branching tree without anastomoses.


Code Example 1.

// Pseudocode to illustrate repeated queries on an edge
// representation table to walk up the tree from a current
// node to a higher taxon node at the rank of family.
// Note: handles some but not all error conditions.
$root_marker = "[Root]";       // value of higher_taxon of root node
$max_traverse = 1000;          // larger than any leaf to root path
$rank = "";                    // holds rank of parent nodes
$targetName = $initialTarget;  // start with some taxon
$counter = 0;                  // counter to trap for infinite loops
while (($rank <> "Family")                 // desired result
       and ($counter < $max_traverse)      // errors: infinite loop/no parent
       and ($targetName <> $root_marker))  // error: reached root
{
   $counter++;  // increment counter
   $sql = "SELECT taxon_name, rank, higher_taxon
           FROM higher_taxon
           WHERE taxon_name = '$targetName'";
   $results = getresults($connection, $sql);
   if (numberofrows($results) == 0)
   {
      // error condition: no parent found for current node
      $counter = $max_traverse;  // exploit infinite loop trap
   } else {
      // a parent exists for the current node
      $row = getrow($results);
      $currentname = getcolumn($row, "taxon_name");
      $rank = getcolumn($row, "rank");
      $targetName = getcolumn($row, "higher_taxon");
   } // endif
} // wend


If a field for rank is included, code can also check for rank placement errors such as the inclusion of a class within a genus.
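For example, a consistency check run after direct editing of the table might include queries like the following (a sketch using the field names of Table 21; detecting loops that span more than one row still requires recursive code):

-- Taxa whose listed parent does not exist anywhere in the table
-- (unlinked subtrees):
SELECT c.taxon_name, c.higher_taxon
FROM higher_taxon AS c
LEFT JOIN higher_taxon AS p
  ON c.higher_taxon = p.taxon_name
WHERE p.taxon_name IS NULL
  AND c.higher_taxon <> "[root]";

-- Taxa that are their own parent (one step infinite loops):
SELECT taxon_name FROM higher_taxon
WHERE taxon_name = higher_taxon;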

If you wish to know the family placement for a species, code in this form will be able to look it up but will require multiple queries to do so. If you wish to look up the entire higher classification for a taxon, code in this form will require as many queries to find the classification as there are steps in that classification from the current taxon to the root of the tree.

To solve the unknown number of multiple queries problem in an edge representation of a tree, it is possible to store the path from the current node to the root of the tree redundantly in a parentage field (Table 24). Adding this redundancy requires supporting code, otherwise the data are certain to become inconsistent. A typical example of a redundant parentage string can be found in BioLink's taxonomic hierarchy (CSIRO, 2001). BioLink stores core taxon name information in a table called tblBiota. This table has a surrogate primary key intBiotaID, a field to hold the full taxon name vchrFullName, a recursive foreign key field containing a pointer to another row in tblBiota intParentID, and a field that contains a backslash delimited list of all the intParentID values from the current node up to the root node for the tree vchrParentage. BioLink also contains other fields and sets of related tables to store additional information about the current taxon name.

Storing the path from the current node to the root of the tree (top of the hierarchy) in a parentage field allows straightforward selection of the taxonomic hierarchy of any particular taxon (Code Example 2, below). One query will extract the taxon name and its parentage string, the parentage string can then be split on its delimiter, and this list of primary key values can be assembled in a query (e.g. where intBiotaID = 1 or intBiotaID = 2 or ... order by vchrParentage) that will return the higher classification of the taxon. Note that if the field used to store the parentage is defined as a string field (with the ability to add an index and rapidly sort values by parentage), it will have a length limit in most database systems (usually around 255 characters) and in some systems may only sort on the first fifty or so characters in the string. Thus some database systems will impose a limit on how deep a tree (how many nodes in a path to root) can be stored in a rapidly sortable parentage string.
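Using the data in Table 24, the parentage string /1/2/3/4 for Chicoreus splits into the keys 1, 2, 3, and 4, and the assembled query (a sketch; an IN list is equivalent to the chain of ors described above) returns the higher classification:

SELECT vchrFullName FROM tblBiota
WHERE intBiotaID IN (1, 2, 3, 4)
ORDER BY vchrParentage;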

Tree Visitation

A different method for storing hierarchical structures in a database is the tree visitation algorithm (Celko, 1995a). This method works by traversing the tree and storing the step at which each node is entered in one field and, in another field, the step at which each node is last exited. A table that holds a tree structure includes two fields (t_left and t_right [note that "left" and "right" are usually reserved words in SQL and can't be used as field names in many database systems]). To store a tree using these two fields, set a counter to zero, then traverse the tree, starting from the root. Each time you enter a node, increment the counter by one and store the counter value in the t_left field. Each time you leave a node for the last time, increment the counter by one and store the counter value in the t_right field. You traverse the tree in inorder⁵, visiting a node, then the first of its children, then the first of its children, passing down until you reach a leaf node, then go back to the leaf's parent and visit its next child.

Table 25 holds the classification for 6 taxa.

⁵ As opposed to a preorder or postorder tree traversal.


Table 24. Storing the path from the current node to the root of the tree redundantly in a parentage field; example shows a subset of fields from BioLink's tblBiota (CSIRO, 2001).

intBiotaID (PK)  vchrFullName     intParentID  vchrParentage
1                Gastropoda                    /1
2                Caenogastropoda  1            /1/2
3                Muricidae        2            /1/2/3
4                Chicoreus        3            /1/2/3/4
5                Murex            3            /1/2/3/5


The tree traversal used to set the left and right values is shown below in Figure 21. The value of the counter on entry into a node is shown in yellow; this is the value stored in the left field for that node. The value of the counter on the last exit from a node is shown in red; this is the value stored in the right field for that node.

We start at the root of the tree, Gastropoda. The counter is incremented to one, and the left value for Gastropoda is set to 1. Gastropoda has one child, Caenogastropoda; we increment the counter, enter that node, and set its left value to two. We proceed down the tree in this manner until we reach a leaf, Murex. On entering Murex we increment the counter and set Murex's left to 5. Since Murex has no children, we leave it, incrementing the counter and setting its right value to 6. We move back to the parent node for Murex, Muricinae. Muricinae has a child we haven't visited yet, so we move down into it, Chicoreus.


Code Example 2.

// Pseudocode to illustrate a query on an edge representation table containing
// a parentage string. Uses field names found in BioLink table tblBiota.
// Note: does not handle any error conditions.
// Note: does not include code to place higher taxa in correct order.
$targetName = $initialTarget; // name of the taxon whose parentage you want to find
$sql = "SELECT vchrParentage FROM tblBiota
        WHERE vchrFullName = '$targetName'";   // get parentage string
$results = getresults($connection, $sql);     // run query on database connection
$row = getrow($results);                      // get the result row from the query
$parentage = getcolumn($row, "vchrParentage"); // get the parentage string
// the parentage string is a list of ids separated by backslashes
@array_parentage = split($parentage, "\");     // split the parentage string into ids
$sql = "SELECT vchrFullName FROM tblBiota WHERE ";
$first = TRUE;
for ($x = 1; $x < rowsin(@array_parentage); $x++)
{
   // for each id in the parentage, add it to the query that
   // will get the names of the higher taxa
   if ($first == FALSE) { $sql = $sql + " or "; }
   // Note: enclosing integer values in quotes in sql queries is usually not
   // required, but is a good programing practice to help prevent sql injection
   // attacks
   $sql = $sql + " intBiotaID = '" + @array_parentage[$x] + "'";
   $first = FALSE;
}
$results = getresults($connection, $sql); // now run assembled query to get names
for ($x = 1; $x < rowsin($results); $x++)
{
   $row = getrow($results);
   @array_taxa[$x] = getcolumn($row, "vchrFullName");
}

Table 25. A tree stored in a table using a left right tree visitation algorithm.

TaxonID (PK)  Taxon Name       Left  Right
1             Gastropoda       1     12
2             Caenogastropoda  2     11
3             Muricidae        3     10
4             Muricinae        4     9
5             Murex            5     6
6             Chicoreus        7     8

Figure 21. Representing the classification of Murex and Chicoreus as a tree and using a tree visitation algorithm to map the nodes in the tree. Values placed in the left field in yellow; values placed in the right field in red. Data as in Table 25.


Entering Chicoreus we increment the counter and set Chicoreus' left to 7. Since Chicoreus has no children, we leave it, incrementing the counter and setting its right value to 8. We step back to Chicoreus' parent, Muricinae. Since Muricinae has no more children, we leave it for the last time, incrementing the counter and storing its value to Muricinae's right. We keep walking back up the tree, finally getting back to the root after having visited each node in the tree.

In comparison to an edge representation, the tree visitation algorithm has some very useful properties for selecting data from the database. The query SELECT taxon_name FROM treetable WHERE t_left = 1 will return the root of the tree. The query SELECT t_right/2 FROM treetable WHERE t_left = 1 will return the number of nodes in the tree. Selecting the higher classification for a particular node is also fast and easy:

SELECT taxon_name FROM treetable
WHERE t_left < 7 AND t_right > 8
ORDER BY t_left;

The query above will return the higher classification for Chicoreus in the example above. We don't need to know the number of steps to the root, make recursive queries, store redundant information, or worry about any of the other problematic features of denormalized or edge representation implementations of hierarchies. Selecting all of the children of a particular node, or all of the leaf nodes below a particular node, are also very easy queries.

SELECT taxon_name FROM treetable
WHERE t_left > 3 AND t_right < 10
  AND t_right - t_left = 1
ORDER BY t_left;

As grand as the tree visitation algorithm is for extracting data from a database, it has its own problems. Except for a very small tree, it will be extremely hard to create and maintain the left/right values by hand. You will probably need code to construct the left/right values by traversing through an edge representation of the tree. You will also need a user interface supported by code to insert nodes into the tree, delete nodes, and move nodes. This code base probably won't be much more complex than that needed for robust support of an edge representation of a tree, but you will need to have such code, while you might be able to get away without writing it for an edge representation.

The biggest difficulty with the tree visitation algorithm is editing the tree. If you wish to add a single node (say the genus Hexaplex within the Muricidae in the example above), you will need to change left and right values for many, most, or even all of the rows in the database, as shown in the sketch below. In order to obtain a fast retrieval speed on queries that include the left and right fields in select criteria, you will probably need to index those fields. An update to most rows in the table will essentially mean rebuilding the index for each of these columns, even if you only changed one node in the tree. In a single user database, this will make edits and inserts on the tree very slow. The situation will be even worse in a multi-user database. Updating many rows in a table that contains a tree will effectively lock the entire table from access by other users while the update is proceeding. If the tree is large and frequently edited, table locks will make the database effectively unusable.
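For example, inserting Hexaplex as a new last child of Muricinae (left 4, right 9 in Table 25) requires shifting every node to the right of the insertion point before the new row can be added (a sketch; the insertion point 9 is Muricinae's old right value):

UPDATE treetable SET t_right = t_right + 2 WHERE t_right >= 9;
UPDATE treetable SET t_left = t_left + 2 WHERE t_left >= 9;
INSERT INTO treetable (taxonid, taxon_name, t_left, t_right)
VALUES (7, "Hexaplex", 9, 10);

In this small example four of the six existing rows must be updated; in a large tree an insertion near the root can touch nearly every row.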

Unless a table contains very few rows, it will need to have indexes for columns that are used to find rows in the table. Without indexes, retrieval of data can be quite slow. Efficient extraction of information from a table that uses tree visitation will mean adding indexes to the left and right fields.

CREATE UNIQUE INDEX t_left ON treetable(t_left);
CREATE UNIQUE INDEX t_right ON treetable(t_right);

When a row is added to the table, the values in the left and right fields will now be indexed (with the details of how the index is created, stored, and used depending on the DBMS). In general, indexes will substantially increase the storage requirements for a database. Creating an index will take some amount of time, and an index will increase the time needed to insert a row into a table (Figure 22). Appropriate indexes dramatically reduce the time required for a select query to return a result set.


Indexing

An index allows a DBMS to very rapidly look up rows from a table. An appropriate choice of fields to index is a very important aspect of database implementation, and careful tuning of indexes is an important database administration task for a large database. Being able to reduce the time for a database operation from 10 seconds to 2 seconds can reduce the person years of effort in a large data capture operation and make the difference between completion of a project being feasible or infeasible. Small improvements in performance, such as reducing the time for a query to complete from 2 seconds to less than one second, will result in a large increase in efficiency in large scale (hundreds of thousands of records) data capture.

The details of how a multi-user DBMS prevents one user from altering the data that is being edited by another user, and how it prevents one user from seeing inconsistent data from a partially completed update by another user, are also an important performance concern for database implementation and administration.

An update to a single row will not interfere with other queries that are trying to select data from other rows if the table uses row level locking. If, however, the table applies locking at the level of database pages (chunks of data that the database is storing together as units behind the scenes), then an update that affects one row may lock out queries that are trying to access data from other rows (Figure 23). Some databases apply table level locking and will lock out all users until an update has completed. In some cases, locks can produce deadlocks where two users both lock each other out of rows that they need to access to complete a query.

Because of these locking issues, a tree visitation algorithm is a poor choice for storing large trees that are frequently edited in a multi-user database. It is, however, a good choice for some situations, such as storage of many small trees (using tree_number, t_left, and t_right fields) as might be used to track collection objects, or for searchable web databases that are not edited by their users and are infrequently updated from a master data source.


Figure 22. An index can greatly increase the speed at which data is retrieved from a database, but can slow the rate of data insertion, as both the table and its indexes need to be updated.

Figure 23. Page locks. Many database systems store data internally in pages, and an operation that locks a row in a page may also lock all other queries trying to access any other row on the same page.


The tree visitation algorithm is particularly useful for searchable web databases, as it eliminates the need for multiple recursive queries to find data as in edge representations (a query on joined tables in MySQL in the example below will find the children of all taxa that start "Mure").

SELECT a.taxon_name, b.taxon_name, b.rank
FROM treetable AS a
LEFT JOIN treetable AS b
  ON b.t_left > a.t_left AND b.t_right < a.t_right
WHERE a.taxon_name LIKE "Mure%"
ORDER BY a.taxon_name, b.t_left;

Because there are serious performance and code complexity tradeoffs between various ways of implementing the storage of hierarchical information in a relational database, choosing an appropriate tree representation for the situation at hand is quite important. For some situations, a tree visitation algorithm is an ideal way to handle hierarchical information, whereas in others it is a very poor choice. Since single point alterations to the tree result in large portions of the table holding the hierarchy being changed, tree visitation is a very poor choice for holding large taxonomic hierarchies that are frequently edited. However, its fast, efficient extraction of data makes it a very good choice for situations, such as web databases, where users are seldom or never altering data, or where many small distinct trees are being held in one table.

The storage of trees is a very logical place to consider supporting data integrity with stored procedures. A table holding an edge representation of a tree is not easy to maintain by hand. Even skilled users can introduce records that create infinite loops and unlinked subtrees (anastomoses, except those produced by misspellings, can be prevented by enforcing a unique index on child names). Murex could be linked to Muricinae, which could be linked to Muricidae, which could be linked by mistake back to Murex to create an infinite loop. Murex could be linked to Muricinae, which might by mistake not be linked to a higher taxon, creating an unlinked subtree. An edge representation could be supported by an on insert/update trigger. This trigger could check to see if each newly inserted or updated parent has a match to an existing child (select count(*) from taxon where child = %newparent) – preventing the insertion of unlinked subtrees. If an on insert or on update trigger finds that a business rule would be violated by the requested change, it can fail and return an error message to the calling program. In most circumstances, such an error message should propagate back through the user interface to the user in a form that tells them what corrective action to take ("You can't insert Muricinae:Murex yet, you have to link Muricinae to the correct higher taxon first") and in a form that will tell a programmer exactly where the origin of the error was ("snaildb_trigger_error_257").
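A minimal sketch of such a trigger, using the higher_taxon table of Table 21 and MySQL 5.5+ trigger syntax (the table name, root marker, and error text are assumptions following the examples above):

DELIMITER //
CREATE TRIGGER higher_taxon_check_parent
BEFORE INSERT ON higher_taxon
FOR EACH ROW
BEGIN
   DECLARE parent_count INT;
   IF NEW.higher_taxon <> "[root]" THEN
      -- the new row's parent must already exist as a child in the tree
      SELECT COUNT(*) INTO parent_count
      FROM higher_taxon
      WHERE taxon_name = NEW.higher_taxon;
      IF parent_count = 0 THEN
         SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'snaildb_trigger_error_257: parent not yet linked into tree';
      END IF;
   END IF;
END//
DELIMITER ;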

An on insert and on update trigger on a table holding an edge representation of a tree could also check that the parent of each newly inserted or updated row links to root without encountering the child – preventing infinite loops. Checking each inserted or changed record for a parent that is linked up to the root of the tree would have a higher cost, as it would require a recursive series of queries to trace an unknown number of parent-child links back to the root of the tree. Decisions on how to store trees and what code to place where to support the storage of those trees can have a significant impact on the performance of a database. Slow updates may be tolerable in a taxonomic dictionary that serves as a rarely edited authority file for a specimen data set, but be intolerable for regular edits to a database that is compiling a tree of taxonomic names and their synonymies.

Triggers are also a very good place to maintain redundant data that has been cached in the database to reduce calculation or lookup time. On insert and on update triggers can, for example, maintain parentage strings in tables that store trees (where they will need to recurse the children of a node when that node is linked to a new parent, as this change will need to propagate down through all the parentage strings of the child nodes).

Hierarchical information is widespread in collection data. Therefore, in the process of beginning the design phase of a database life cycle, it is worth considering a wider range of DBMS options than just relational databases.


Some other database systems are better than relational databases at native handling of trees and hierarchical information. Object oriented database systems should be considered, as they can include data types and data structures that are ideal for the storage and retrieval of hierarchical information. However, at the present time, using an object DBMS rather than a standard relational DBMS has costs, particularly poor portability and reduced availability of support. Relational databases have a solid, well understood mathematical foundation (Codd, 1970), have widespread industry support for SQL, are not hard to move data in and out of, and have a wide range of implementations by many vendors, including the open source community. Object databases (and hybrid object support in relational databases) are still none of these (thus this paper deals almost entirely with relational databases).

Data Stewardship

We often state to groups on tours through natural history collections that "our specimens are of no value without their associated data". We are well used to thinking of stewardship of our collection objects and their associated paper records, but we must also think carefully about the stewardship of electronic data associated with those specimens. I will divide data stewardship into two concepts: 1) Data security, that is, maintaining the data in an environment where they are safe from malicious damage and accidental loss. Data security encompasses network security, administration, pre-planning, and risk mitigation. 2) Data quality control, that is, procedures to assist the users of the database in maintaining clean and meaningful data. Data quality control is of particular importance during data migration, as we are often performing large scale transformations of collections records, and thus need to carefully plan to prevent both the loss of information and the addition of new erroneous information. Data stewardship involves the short term data security tasks of system administration and quality control and the long term perspective of planning for movement of the data through multiple database lifecycles.

Data Security

Providing carefully planned multiple layers of security has become an essential part of maintaining the integrity of electronic data. Any computer connected to the Internet is a target for attack, and any computer is a target for theft. Simply placing a computer behind a firewall is not, by any means, an adequate defense for it or its data. Security requires paranoia, but not just any old paranoia. Security requires a broad thinking paranoia. Computer data security, as in the design of cryptographic systems, depends on the strength of the weakest link in the system. Ferguson and Schneier (2003), discussing cryptographic system design, provide a very good visual analogy. A secure system is like the stockade walls of a fort, but, in the virtual world of computers and data, it is very easy to become focused on a single post in that stockade, trying to build that single post to an infinite height while not bothering to build any of the rest of the stockade. An attacker will, of course, just walk around this single tall post. Another good analogy for computer security is that of defense in depth (e.g. Bechtel, 2003). Think of security as involving many layers, all working together to prevent any single point of entry from giving easy access to all resources in the system.

In the collection management community there is a widespread awareness of the importance of preplanning for disaster response. Computer security is little different. Consider a plausible scenario: what do you do if you discover that your web database server is also hosting a child pornography ftp server? If you are part of a large institution, you may have an incident response team you can call on to deal with the situation. In a small institution, you might be the most skilled and trained person available. Preplanning and establishing an incident response capability (e.g. a team composed of people from around the institution with computing and legal skills who can plan appropriate responses, compile contact lists, assess the security of the institution's computing resources, and respond to computer incidents) is becoming


a standard practice in information technology.

One of the routine threats for any computer connected to the Internet is attack from automated software seeking to exploit known vulnerabilities in the operating system, server services, or applications. It is therefore important to be aware of newly discovered vulnerabilities in the software that you are using in your network and to apply security patches as appropriate (and not necessarily immediately on release of a patch; Beattie et al., 2002). Vulnerabilities can also be exploited by malicious hackers who are interested in attacking you in particular, but it is much more likely that your computers will be targets simply because they are potential resources, without any particular selection of you as a target. Maintaining defenses against such attacks is a necessary component of systems administration for any computer connected to the Internet.

Threat analysis

Computer and network security covers a vast array of complex issues. Just as in other areas of collection management, assessing the risks to biodiversity and collections information is a necessary precursor to effective allocation of resources to address those risks. Threat analysis is a comprehensive review and ranking of the risks associated with computer infrastructure and electronic data. A threat analysis of natural history collection data housed within an institution will probably suggest that the highest risks are as follows. Internal and external security threats exist for biodiversity information and its supporting infrastructure. The two greatest threats are probably equipment theft and non-targeted hacking attacks that seek to use machines as resources. An additional severe risk is silent data corruption (or malicious change) creeping into databases and not being noticed for years. Risks also exist for certain rare species. The release of collecting locality information for rare and endangered species may be a threat to those species in the wild. Public distribution of information about your institution's holdings of valuable specimens may make you a target for theft. In addition, for some collections, access to some locality information might be restricted by collecting agreements with landowners and may have contractual limits on its distribution. A threat analysis should suggest to you where resources should be focused to maintain the security of biodiversity data.

Michael Wojcik put it nicely in a post to Bugtraq: "What you need is a weighted threat model, so you can address threats in an appropriate order. (The exact metric is debatable, but it should probably combine attack probability, likely degree of damage, and at a lesser weight the effort of implementing defense. And, of course, where it's trivial to protect against a threat, it's worth adding that protection even if the threat is unlikely.)" (Wojcik, 2004)

Implementing Security

If you go to discuss database security with your information technology support people, and they tell you not to worry because the machine is behind the firewall, you should worry. Modern network security, like navigation, should never rely on any single method. As the warning on marine charts goes, "The prudent navigator will not rely solely on any single aid to navigation"; the core idea in modern computer network security is defense in depth. Security for your data should include a UPS, regular backup, offsite backup, testing backup integrity and the ability to restore from backup, physical access control, need-limited access to resources on the network, appropriate network topology (including firewalls), encryption of sensitive network traffic (e.g. use ssh rather than telnet), applying current security patches to software, running only necessary services on all machines, and having an incident response capability and disaster recovery plan in place.

Other than thinking of defense in depth, the most important part of network security is implementing a rational plan, given the available resources, that is based upon a threat analysis. It is effectively impossible to secure a computer network against attacks from a government or a major corporation. All that network security is able to do is to raise the barrier of how difficult it will be to penetrate your network (either electronically or physically) and increase the resources


that an intruder would need to obtain or alter your data. Deciding what resources to expend and where to put effort to secure your information should be based on an analysis of the threats to the data and your computer infrastructure.

While electronic penetration risks do exist for collections data (such as extraction of endangered species locality data by illicit collectors, or a malicious intruder placing a child pornography ftp site on your web server), the greatest and most immediate risks probably relate to physical security and local access control. Theft of computers and computer components such as hard drives is a very real risk. The spread of inhumane "human resources" practices from the corporate world creates a risk that the very people most involved in the integrity of collections data may seek to damage those data (in practice this is approached much the same way as the risks posed by mistakes such as accidentally deleting files: by applying access control, backup, and data integrity checking schemes). After a careful threat analysis, three priorities will probably stand out: control on physical access to computers (especially servers), applying access control so that people only have read/write/delete access to the electronic resources they need, and a well designed backup scheme (including backup verification and testing of data restoration).

Hardware

Any machine on which a database runs should be treated as a server. If that machine loses power and shuts down, the database may be left in an inconsistent state and become corrupt. DBMS software usually stores some portion of the changes being made to a database in memory. If a machine suddenly fails, only portions of the information needed to make a consistent change to the data may have been written to disk. Power failures can easily cause data corruption. A minimum requirement for a machine hosting a database (but not a machine that is simply connecting to a database server as a client) is an uninterruptible power supply connected to the server with a power monitoring card and with software that is capable of shutting down services (such as the DBMS) in the event of a prolonged power failure. In a brief power outage, the battery in the UPS provides power to keep the server running. As the battery starts to run down, it can signal the server that time is running out, and software on the server can shut down the applications on the server and shut down the server itself. Another level of protection that may be added to a server is to use a set of hard disks configured as a redundant RAID array (e.g. level 5 RAID). RAID arrays are capable of storing data redundantly across several hard disks. In the event that one hard disk in the array fails, copies of the data stored on it are also held on other disks in the array, and, when the failed drive is replaced with a new one, the redundant data are simply copied back into place.

Backup

Any threat analysis of biodiversity data will end up placing backups at or near the top of the risk mitigation list. A well planned backup scheme is of vital importance for a collection database. A good plan arises directly out of a threat analysis. The rate of change of information in a database and the acceptable level of loss of that information will strongly influence the backup plan. Each database has different rates of change and different backup needs, and thus requires a plan appropriate for its needs. Regardless of how good a backup plan is in place, no records related to collection objects should be held solely in electronic form. Paper copies of the data (preferably labels, plus printed catalog pages) are an essential measure to ensure the long term security of the data. In the event of a catastrophic loss of electronic data, paper records provide an insurance that the information associated with collection objects will not be lost.

Backups of electronic data should include regular on-site backups, regular off-site transport of backups, making copies on durable media, and remote off-site exchanges. The frequency and details of the backup scheme will depend on the rate of data entry and the threat analysis assessment of how much data you are willing to re-enter in the event of failure of the database. A large international project with a high rate of change might mandate


the ability to restore up to the moment of failure. Such a project might use a backup scheme using daily full backups, hourly backups of incremental changes since the last full backup, and a transaction log that allows restoration of data from the last backup to the most recent entry in the transaction log, requiring the use of a SQL server DBMS capable of this form of robust backup. (A transaction log records every update, insert, and delete made to the database and allows recovery from the last incremental backup to the moment of failure.) Such backups might be written directly onto a tape device by the DBMS, or they might be written to the hard disk of another server. This scheme might be coupled with monthly burns of the backups to CD, or monthly archiving of a digital backup tape, and monthly shipment of copies of these backups to remote sites. At the other end of the spectrum, a rarely changed collection database might get one annual burn to CD along with distribution of backup copies to an offsite backup store and perhaps a remote site.

Database backups are different from normal filesystem backups. In a filesystem backup, a file on disk is simply copied to another location (such as onto a backup tape). A database that is left open and running, however, needs to have backup copies made from within the database software itself. If a DBMS has a database open during a filesystem backup, a server backup process that attempts to create backup copies of the database files by copying them from disk to another location will most likely only be able to produce a corrupt, inconsistent copy of the database. Database management software typically keeps recent changes to the data in memory, and does not place the database files on disk into a consistent state until the database is closed and the software is shut down. In some circumstances, it is appropriate to shut down the DBMS and make filesystem backups of the closed database files. Backups of data held on a database server that is running all the time, however, need to be planned and implemented both from within the database management software itself (e.g. storing backup copies of the database files to disk) and from whatever process is managing backups on the server (e.g. copying those backup files to tape archives). It is also important not to blindly trust the backup process. Data verification steps should be included in the backup scheme to make sure that valid, accurate copies of the data are being backed up (minor errors in copying database files can result in corrupt and unusable backup copies).
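As an illustration, in MS SQL Server a backup made from within the DBMS and a basic verification of the backup file look like the following (a Transact-SQL sketch; the database name and path are examples):

BACKUP DATABASE collection_db
   TO DISK = 'D:\Backups\collection_db_full.bak' WITH INIT;
RESTORE VERIFYONLY
   FROM DISK = 'D:\Backups\collection_db_full.bak';

The file written by the DBMS can then be safely picked up by the normal filesystem or tape backup process.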

Offsite storage of backup copies allows recovery in the case of local disaster, such as a fire destroying a server room and damaging both servers and backup tapes. Remote storage of backup copies (e.g. two museums on different continents making arrangements for annual exchange of backup copies of data) could provide valuable documentation of lost holdings to the scientific community in the event of a large regional disaster.

In developing and maintaining a backup scheme it is essential to go through the process of restoring those backups. Testing and practicing restoration ensures that when you do have to restore from backup you know what to do and know that your recovery plan includes all of the pieces needed to get you back to a functional restored database. "Unfortunately servers do fail, and many an administrator has experienced problems during the restore process because he or she had not practiced restoring databases or did not fully understand the restore process" (Linsenbardt and Stigler, 1999 p.272).

With many database systems, both the data in the database and the system or master database are required to restore a database. In MySQL, user access rights are stored in the MySQL database, and a filesystem backup of one database on a MySQL server will not include user rights. MS SQL Server likewise suggests creating a backup of the system database after each major change to included databases, as well as making regular backups of each database. The PostgreSQL documentation suggests making a filesystem backup of only the entire cluster of databases on a server and not trying to identify the files needed to make a complete filesystem backup of a single database. "This will not work because the information contained in these files contains only half the truth. The

43

Page 44: Relational Database Design and Implementation for ...planet.botany.uwc.ac.za/nisl/BCB_BIM_honours... · Relational Database Design and Implementation for Biodiversity Informatics

PhyloInformatics 7: 44-66 - 2005

other half is in the commit log filespg_clog/*, which contain the commit statusof all transactions. A table file is only usablewith this information.” (PostgreSQL GlobalDevelopment Group, 2003). Databasebackups can be quite complex, and testingthe recovery process is important to ensurethat all the components needed to restore afunctional database are included in thebackup scheme.

Access Control

Any person with high level access to a database can do substantial damage, either accidentally or on purpose, whether with a simple command such as DROP DATABASE or more subtly through writing a script that gradually degrades the data in a database. No purely electronic measures can protect data from the actions of highly privileged users. Database users should therefore be granted only the minimum privileges they need to do their work. This principle of minimum privilege is a central concept in computer security (Hoglund & McGraw, 2004). Users of a biodiversity database may include guests, data entry personnel, curators, programmers, and system administrators. Guests should be granted read (select) access only, and only to portions of the database. Low level data entry personnel need to be able to enter data, but should be unable to edit controlled vocabularies (such as lists of valid generic names), and probably should not be able to create or view transactions involving collection objects, such as acquisitions and loans. Higher level users may need rights to alter controlled vocabularies, but only system administrators should have the ability to grant access rights or create new users. Database management systems include, to varying degrees of granularity, the ability to grant users rights to particular operations on particular objects in a database. Many support some form of the SQL command GRANT rights TO user ON resource. Most access control is best implemented by simply using the access control measures present in the database system, rather than coding access control as part of the business rules of a user interface to the database. Restriction of access to single records in the database (row level access control), however, usually needs to be implemented in higher layers.
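A sketch of how such graduated rights might be granted, using MySQL syntax with hypothetical table and user names (a real implementation would enumerate each relevant table and user):

-- Guests: read only access, and only to selected tables.
GRANT SELECT ON collectiondb.collection_object TO 'guest'@'localhost';
-- Data entry personnel: may read and enter specimen data, but get
-- no rights at all on transaction tables such as loans.
GRANT SELECT, INSERT ON collectiondb.collection_object TO 'dataentry'@'localhost';
-- Only the system administrator holds the right to grant rights.
GRANT ALL ON collectiondb.* TO 'admin'@'localhost' WITH GRANT OPTION;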

Physical access control is also important. If a database server is placed in some readily accessible space, a passerby might shut it down improperly, causing database corruption; unplug it during a disk write operation, causing physical failure of a hard disk; or simply steal it. Servers are best maintained in spaces with access limited to knowledgeable IT personnel, preferably a server room with climate control, computer-safe fire suppression systems, tightly restricted access, and video monitoring.

System Administration

Creating new user accounts, deactivating old user accounts, and other such maintenance of user access rights are tasks that fall to a database administrator. A few years ago, such maintenance of rights on the local machine, managing backups, and managing server loads were the principal tasks of a system administrator. Today, defense of the local network has become an absolutely essential system administration task. A key part of that defense is maintaining software with current security patches. Security vulnerabilities are continuously brought to light by the computer security research community. Long experience with commercial vendors unresponsive to anything other than public disclosure has led to widespread adoption of an ethical standard practice in the security community. An ethical security researcher is expected to notify software vendors of newly discovered vulnerabilities, then provide a 30 day grace period for the vendor to fix the vulnerability, followed by public release of details of the vulnerability. Immediate public release of vulnerabilities is considered unethical, as it does not provide software vendors with any opportunity to protect people using their software. Waiting an indefinite period of time for a vendor to patch their software is also considered unethical, as this leaves the entire user community vulnerable without knowing it (it being the height of hubris for a security researcher to assume that they are the only person capable of finding a particular vulnerability). A 30 day window from vendor notification to public release is thus considered a reasonable ethical compromise that best protects the interests of the software user community. Some families of vulnerabilities (such as C strcpy buffer overflows and SQL injection) have proven to be very widespread and relatively easy to locate in both open and closed source software. A general perception among security researchers (e.g. Hoglund and McGraw, 2004 p.9) is that many exploits are known in closed circles among malicious and criminal programmers (leaving all computer systems open to attack by skilled individuals), and that public disclosure of vulnerabilities by security researchers leads to vendors closing security holes and a general increase in overall Internet and computer security. These standard practices of the security community mean that it is vitally important to keep the software on any computer connected to the Internet up to date with current patches from the vendor of each software package on the system.

Installing a patch, however, may break some existing function in your system. In an ideal setting, patches are first installed and tested on a separate testing server (or local network test bed) and then rolled out to production systems. In most limited resource settings (which also tend to lack the requirement of continuous availability), patching involves a balance between the risks of leaving your system unpatched for several days and the risks of a patch taking down key components of your system.

Other now essential components of system administration include network activity monitoring and local network design. Network activity monitoring is important for evaluating external threats, identifying compromised machines in the internal network, and identifying internal users who are misusing the network, as well as for the classical role of simply managing normal traffic flow on the network. Possible network components (Figure 24) include a border router with a firewall limiting traffic into and out of the network, a network address translation router sitting between an internal network using private IP address space and the Internet, firewall software running on each server limiting accessible ports on that machine, and a honeypot connected to the border router.

Honeypots are an interesting technique for system activity monitoring. A honeypot is a machine used only as bait for attackers. Any request a honeypot receives is interpreted as either a scan for vulnerabilities or a direct attack. Such requests can be logged and used to evaluate external probes of the network and identify compromised machines on the internal network. Requests logged by a honeypot can also be used to update the border router's rules to exclude any network access from portions of the Internet. Honeypots can also be set up as machines left open with known vulnerabilities in order to study hacker behavior, but such use may raise ethical and legal issues in ways that using a honeypot simply to identify potential attackers does not. Monitoring system activity is, as noted above, a system administration task that is an important component of the defense in depth of a network.

Example: SQL Injection Attacks

One particularly serious risk for a database that is made available for search over the Internet is SQL injection attacks (Anley, 2002; Smith, 2002). If a database is providing a back end for a web search interface, it is possible for any person on the Internet to inject malicious queries into the database using the web interface. In some cases (such as execution of MS SQL Server stored procedures that execute shell commands), it is possible for an attacker not only to alter data in the database, but also to execute arbitrary commands on the server with the privileges of the database software. That is, it may be possible for an attacker to take complete control of a server that is providing a database for search on the web.

Most web searchable databases are set up by placing the database in a SQL server (such as MySQL, MS SQL Server, or PostgreSQL) and writing code to produce the HTML for the web search and results pages in a scripting language (such as PHP, ASP, or CGI scripts in Perl). A user will fill in a form with search criteria. This form will then be submitted to a program in the scripting language that will take the criteria provided in the form submission, create a SQL query out of them, submit that query to the underlying database, receive a result set back, format that result set as a web page, and return the resulting HTML document to the user. If this system has not been set up with security in mind, a malicious user may be able to alter data in the database or even take control of the server on which the database is running. The attack is very simple. An attacker can fill in a field on a form (such as a genus field) with criteria such as '; drop database; which could be assembled into a query by the scripting language as “SELECT genus, trivial FROM taxon WHERE genus like ' '; drop database; ' ”, a series of three SQL queries that it will then pass on to the database. The database might interpret the first command as a valid select statement, return a result set, see the semicolon as a separator for the next query, execute that by dropping (that is, deleting) the database, see the second semicolon as another separator, and return an error message for the third query made up of just a single quotation mark.

Figure 24. A possible network topology with a web accessible database readily accessible to the Internet, and a master database behind more layers of security. The web accessible copy of a database can be denormalized and heavily indexed relative to the master database.

Defense against SQL injection attacks involves following two basic principles of network security. First, never trust any information provided to you by users, especially users over the Internet. Second, allow users only the minimum access rights they need. All code that runs behind a web interface should sanitize anything that might be provided to it by a user (including hidden values added to a form, as the source HTML of the form is available to the user, and they are free to alter it to make the submission say anything they desire). This sanitization should take the form of disallowing all characters except those known to be valid (rather than disallowing a set of known attacks). The code for the example above might include a sanitize routine that contains a command that strips everything except the letters A to Z (in upper and lower case) from whatever the user provided as the criteria for genus. A regular expression pattern for everything outside this valid range of characters is /[^A-Za-z]/, that is, match anything not (^) in the range ([ ]) A-Z or a-z. An example (in the web scripting language PHP) of a function that uses such a regular expression to sanitize the content of variables holding information provided over the web that might contain generic and trivial names is shown below.

function sanitize() {
    global $genus, $trivial;
    // strip anything that is not a letter from the user supplied values
    $genus = preg_replace("/[^A-Za-z]/", "", $genus);
    $trivial = preg_replace("/[^a-z]/", "", $trivial);
}

This function uses a regular expression match that examines the variable $genus and replaces any characters that are not in the range A-Z or the range a-z with blank strings. Thus an attack supplying a value for $genus of “'; drop database;” would be sanitized to the innocuous search criterion “dropdatabase”. Note that the sanitize function explicitly lists allowed values and removes everything else, rather than just dropping the known dangerous values of semicolon and single quote. An attack may evade “exclude the bad” filters by encoding attack characters so that they don't appear bad to the filter, but are decoded and act somewhere beyond the filter. A single quote might be URL encoded as %27, and would pass unchanged through a filter that only excludes matches to the ; and ' characters. Always limit user input to known good characters, and be wary when the set of allowed values extends beyond [A-Za-z0-9]. Tight control of user provided values is substantially more difficult when Unicode (multi-byte characters) and characters outside the basic ASCII character set are involved.

It is also important to limit user rights to the absolute minimum necessary. If your scripting language program has an embedded username and password to allow it to connect to your database and retrieve information, you should set up a username and password explicitly for this task alone, and grant this user only the minimum select privileges to the database. The details of this will vary substantially from one DBMS to another. In MySQL, the following commands might be appropriate:

GRANT SELECT ON webdb.webtable 
    TO phpuser@localhost 
    IDENTIFIED BY "plaintextpassword";

or, to be explicit with the MySQL privilege tables6:

INSERT INTO user (host, user, password, select_priv, insert_priv, update_priv, delete_priv) 
    VALUES ("localhost", "phpuser", password("plaintextpassword"), "N", "N", "N", "N");
INSERT INTO db (db, user, select_priv, insert_priv, update_priv, delete_priv) 
    VALUES ("webdb", "phpuser", "N", "N", "N", "N");
INSERT INTO tables_priv (host, db, user, table_name, table_priv) 
    VALUES ("localhost", "webdb", "phpuser", "webtable", "Select");
-- changes made directly to the privilege tables do not take effect
-- until the privileges are reloaded:
FLUSH PRIVILEGES;

With privileges restricted to select only on the table you are serving up on the web, even if an attacker obtains the username and password that you are using to allow the scripting language to access the database (or if they are able to apply a SQL injection attack), they will be limited in the kinds of further attacks they can perform on your system. If, however, an attacker can obtain a valid database password and username, they may be able to exploit security flaws in the DBMS to escalate their privileges. Hardcoding a username / password combination in a web script, in a web scripting function placed off the web tree, or in a configuration file off the web tree is generally necessary to allow web pages to access a database. Any hardcoded authentication credential provides a possible target for an internal or external attacker. Through an error in file naming or web server configuration, the source code of a web script placed on the publicly accessible web tree may become visible to outside users, exposing any authentication credentials hard coded in that file. Users on the local file system may be able to read files placed outside of the web tree or be able to escalate their privileges to read configuration files with embedded passwords. It is thus very important to limit the rights of any username that is hardcoded outside of the database to the absolute minimum privileges needed. In a DBMS that includes stored procedures, you should explicitly deny the web user the rights to run stored procedures (especially in the case of MS SQL Server, which comes with built in stored procedures that can execute shell commands, meaning that any user who is able to execute stored procedures can access anything on the server visible to the user account that the SQL Server is running under, which ought to be an account with privileges restricted to the minimum needed for the operation of the server). Alternately, run all web operations through stored procedures, and grant the web user rights for those stored procedures and nothing else (tight checking of variables passed to stored procedures and denial of rights to do anything other than execute a small set of stored procedures can assist in preventing SQL injection attacks).

6 You should empty out the history files for MySQL or your shell if you issue a command that embeds a password in it in either a MySQL query or a shell command, as history files are potentially accessible by other users.

Even if the details of the discussion above on SQL injection attacks seem obscure, as they probably will if you haven't worked with web databases and scripting languages, I hope the basic principle of defense in depth has come across. The layers of firewall (to restrict remote access to the server to port 80 [http]), sanitizing all information submitted with a form before doing anything with it, limiting the web user's privileges on the SQL server, and limiting the SQL server's own privileges all work together to reduce the ability of attackers to penetrate your system.

Computer and network security is a constantly changing and evolving field. It is, moreover, an important field for all of us who are running client-server databases, serving information out of databases on the web, or, indeed, simply using computers connected to the Internet. Anyone with responsibility for data stored on a computer connected to the Internet should at least be working to learn something about network security practices and issues. Monitoring network security lists (such as securityfocus.com's bugtraq list or US-CERT's technical cyber security alerts) provides awareness of current issues, pointers to papers on security and incident response, and opportunities to learn about how software can be exploited.

Maintaining Data Quality

The quality of your data is important. Incorrect data may become permanently associated with a specimen and might in the future be used to draw incorrect biological conclusions. While good database design and good user interface design assist in maintaining the quality of your data over time, they are not enough. The day to day processes of entering and modifying data must also include quality control procedures that include roles for everyone involved with the database.

Quality Control

Quality control on data entry covers a series of questions: Are literal data captured correctly? Are inferences correctly made from the literal data? Are data correctly captured into the database, with information being entered into the correct fields in the correct form? Are links to resources being made correctly (e.g. image files of specimens)? Are the data and inferences actually correct? Some of these questions can be answered by a proofreader, others by queries run by a database administrator, whereas others require the expert knowledge of a taxonomist. Tools for quality control include both components built into the database and procedures for people to follow in interacting with the database.


At the database level, controls can be added to prevent some kinds of data entry errors. At the most basic level, field types in a database can limit the scope of valid entries. Integer fields will throw an error if a data entry person tries to enter a string. Date fields require a valid date. Most data in biodiversity databases, however, go into string fields that are very loosely constrained. Sometimes fields holding string data can be tightly constrained at the database level, as in fields that are declared as enumerations (limited in their field definition to a small set of values). In some cases, atomization can help with control. A string field for specimen count might be split into an integer field to hold the count, an enumerated field to hold the kind of objects being counted, and a qualifier field. Constraints in the database can test the content of one field in a record against others. A constraint could, for example, force users to supply the source of a previous number if they provide a previous number. Code in triggers that fire on inserts and updates can force the data entered in a field to conform to a defined pattern, such as “####-##-##” for dates in ISO format. Similar constraints on the scope of data that will be accepted as valid can be applied at the user interface level. While these constraints will trap some data entry errors (some typographic errors and some data entry into incorrect fields), the scope of valid input for a database field will always be larger than the scope of correct input. Mistakes will occur on data entry. Incorrect data will be entered into the database regardless of the controls placed on data entry. To have any assurance that the data entered into a biodiversity database are accurate, quality control measures beyond simply controlling data entry and trusting data entry personnel are necessary.
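The sketch below illustrates these kinds of controls with hypothetical field names; ENUM is MySQL syntax, and support for CHECK constraints and triggers varies between database management systems.

CREATE TABLE collection_object (
    collection_object_id INT NOT NULL PRIMARY KEY,
    -- an integer field will reject non-numeric entries
    specimen_count INT,
    -- an enumeration limits entries to a small set of values
    count_kind ENUM ('specimens', 'lots', 'valves'),
    count_qualifier VARCHAR(10),
    previous_number VARCHAR(50),
    previous_number_source VARCHAR(255),
    -- a previous number must be accompanied by its source
    CHECK (previous_number IS NULL
           OR previous_number_source IS NOT NULL)
);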

The database administrator can conduct periodic reviews of large numbers of records, or tools can be built into the database to allow privileged users to easily review large blocks of records. The key to such bulk review of records is to look for outliers. An easy way to find outliers is to sort columns and look at the top and bottom records to find obvious out of range values, such as dates entered in text fields.

SELECT TOP 10 named_place 
    FROM collecting_event 
    ORDER BY named_place;

SELECT TOP 10 named_place 
    FROM collecting_event 
    ORDER BY named_place DESC;

The idea of a review of the top and bottom of alphabetized lists can be extended to a broader statistical review of field content for identification and examination of outliers. Correlation of information between fields is a rich source of tests to examine for errors in data entry. Author and year combinations where the year is far different from other years associated with the same authorship string are good candidates for review, as are outlying collector – collecting event date combinations. Comparison of a table of bounding latitudes and longitudes for countries and primary divisions with the content of country, primary division, latitude, and longitude fields can be used either as a control on data entry or as a tool for examination of outliers in subsequent review.
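For example, given a hypothetical table of bounding coordinates for countries, a query along the following lines would flag locality records whose coordinates fall outside the bounds recorded for their country:

-- find localities whose coordinates fall outside the bounding
-- box recorded for their country (table names are hypothetical)
SELECT l.locality_id, l.country, l.latitude, l.longitude
    FROM locality l, country_bounds b
    WHERE l.country = b.country
      AND (l.latitude NOT BETWEEN b.min_latitude AND b.max_latitude
           OR l.longitude NOT BETWEEN b.min_longitude AND b.max_longitude);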

Ultimately, quality control rests on review of individual records by knowledgeable persons. This review is an inherent part of the use of biological collections by systematists who examine specimens, make new identifications, comment on locality information, and create new labels or annotations. In the context of biological collections, these annotations are a living part of an active working collection. In other contexts, knowledge of the quality of records is important for assessing the suitability of the data for some analysis.

The simplest tool for recording the status of a record is a field holding the status of that record. A slightly more complex scheme holds the current state of the record, who placed the record into that state, and a time stamp for that action. Any simple scheme like this (who last updated, date last updated, etc.), which records status information in the same table as the data, suffers all the problems of any sort of flat file handling of relational data. A database record can have many actions taken on it by many different people. A more general solution to activity tracking holds the activity information in a separate table, which can hold records of what actions were taken by what agents at what times. If a project needs to credit the actions taken by participants in the project (how many species occurrence records each participant has created, for example), then a separate activity tracking table is essential. In some parts of some data sets (e.g. identifications in a collection), credit for actions is inherent in the data; in others it is not. A good general starting place for recording the status of records is to time stamp everything. Record who entered a record and when, who reviewed it and when, and who modified it, how, and when. Knowing who did what when can have much more utility than simply crediting work. In some data entry tasks, an error can easily be repeated over a set of adjacent records entered by one data entry person, analogous to an incorrect catalog number being copied to the top of the catalog number column on a new page in a handwritten ledger. If records are creator and creation time stamped, records that were entered immediately before and after a suspect record can be easily examined for similar or identical errors. Time stamping also allows for ready calculation of data entry and review rates and presentation of rate and review quality statistics to data entry personnel.
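A sketch of such a separate activity tracking table (field names here are hypothetical):

CREATE TABLE record_activity (
    record_activity_id INT NOT NULL PRIMARY KEY,
    table_name VARCHAR(64) NOT NULL,     -- which table holds the record
    record_id INT NOT NULL,              -- key of the record acted upon
    action VARCHAR(32) NOT NULL,         -- e.g. entered, reviewed, modified
    agent VARCHAR(64) NOT NULL,          -- who took the action
    action_date TIMESTAMP NOT NULL       -- when the action was taken
);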

Maintenance of controlled vocabularies should be tightly regulated. If regular data entry personnel can edit controlled vocabularies, they will cease to be controlled. Dictionary information (used to generate picklists or to check for valid input) can be stored in tables to which data entry personnel are granted select only access, while a restricted class of higher level users is granted update, insert, and delete access. This control can be mirrored in the user interface, with the higher level users given additional options for dictionary editing. If dictionary tables are hard to correctly maintain (such as tables holding edge representations of trees through parent-child links), even advanced and highly skilled users will make mistakes, so integrity checks (e.g. triggers that check that the parent value of a record correctly links it to the rest of the tree) should be built into the back end of the database (as skilled users may well bypass the user interface).
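In MySQL syntax, with hypothetical table and user names, this division of rights over a dictionary table might look like:

-- data entry personnel may read the dictionary but not change it
GRANT SELECT ON collectiondb.taxon_dictionary TO 'dataentry'@'localhost';
-- a restricted class of higher level users maintains the vocabulary
GRANT SELECT, INSERT, UPDATE, DELETE ON collectiondb.taxon_dictionary TO 'curator'@'localhost';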

Quality control is a process that involves multiple people. A well designed data entry interface cannot ensure the quality of data. Well trained data entry personnel cannot ensure the quality of data. Bulk overview of unusual records by a database administrator cannot ensure the quality of data. Random review (or statistical sampling) of newly created records by the supervisors of data entry personnel cannot ensure data quality. Review of records by specialists cannot ensure data quality. Only bringing all of these people together into a quality control process provides an effective means of quality control.

Separation of original data and inferences

Natural history collection databases contain information about several sorts of things. First, they contain data about specimens: where they were collected, their identifications, and who identified them. Second, they contain inferences about these data – inferences such as geographic coordinates added to georeference data that did not originally include coordinates. Third, they contain metadata about these subsequent inferences, and fourth, they contain transaction data concerning the source, ownership, permitting, movement, and disposition of specimens.

Efforts to capture text information about collection objects into databases have often involved making inferences about the original data and adding information (such as current country names) to aid in the retrieval of data. Clearly distinguishing the original data from subsequent inferences is usually straightforward for a specialist examining a collection object. Herbarium sheets usually contain many generations of annotations, mammal and bird skins usually have multiple attached tags, and molluscan dry collections and fossils usually have multiple sets of labels within a tray. Examining these paper records of the annotation history of a specimen usually makes it clear what data are original and what are subsequent inferences. In contrast, database records, unless they include carefully planned metadata, often present only a single view of the information associated with a specimen – not clearly indicating which portions of that data are original and which are subsequent inferences.


It is important wherever possible to store an invariant copy of the original data associated with a collection object and to include metadata fields in a database to indicate the presence and nature of inferences. For example, a project to capture verbatim records from a handwritten catalog can use these data to produce a normalized database of the collection, but should also store a copy of the original verbatim records in a simple flat table. These verbatim records are thus readily available for examination by someone wondering about an apparent problem in the data related to a collection object. Likewise, it is important to include some field structure to store the source of coordinates produced in a georeferencing project – and to allow for the storage and identification of original coordinates; a sketch of such a field structure is shown below. Most important, I feel, is that any inferences that are printed onto paper records associated with specimens need to be marked as inferences. As older paper records degrade over time, newer paper records have the potential of being considered copies of the data on those original records unless subsequent inferences included upon them are clearly marked as inferences (in the simplest case, by enclosing inferences in square brackets).
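A minimal sketch of field structure for distinguishing original coordinates from georeferencing inferences, with hypothetical field names:

CREATE TABLE georeference (
    georeference_id INT NOT NULL PRIMARY KEY,
    locality_id INT NOT NULL,
    latitude DECIMAL(9,6),
    longitude DECIMAL(9,6),
    -- marks whether the coordinates are original or inferred
    coordinate_source VARCHAR(50),  -- e.g. 'original field notes' or
                                    -- 'inferred: georeferencing project'
    determined_by VARCHAR(64),      -- metadata about the inference
    determined_date DATE
);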

Error amplification

Large complex data sets that are used for analyses by many users, who often lack a clear understanding of data quality, are susceptible to error amplification. Natural history collection data sets have historically been used by specialists examining small numbers of records (and their related specimens). These specialists are usually highly aware of the variability in quality of specimen data and routinely supply corrections to identifications and provenance data. Consider, in contrast, the potential for large data sets of collection information to be linked to produce, for example, range maps for species distributions based on catalog data. Consider such a range map incorporating erroneous data points that expand the apparent range of a species outside of its true range. Now consider a worker using these aggregate data as an aid to identifying a specimen of another species, taken from outside the range of the first species, then mistakenly identifying their specimen as a member of the first species, and then depositing the specimen in a museum (which catalogs it and adds its incorrect data to the aggregate view of the range of the first species). Data sets that incorporate errors (or inadequate metadata) from which inferences can be made, and to which new data can be added based on those inferences, are subject to error amplification. Error amplification is a known issue in molecular sequence databases such as GenBank (Pennisi, 1999; Jeong and Chen, 2001), where incorrect annotation of one sequence can lead to propagation of the error as new sequences are interpreted based on the incorrect annotation and, in turn, propagate the error as they are added to the database. Error amplification is an issue that we must be extremely careful with as we begin large scale analyses of linked collections databases.

Data Migration and Cleanup

Legacy data

Biodiversity informatics data sets often include legacy data. Many natural history collections began data entry into early database systems in the 1970s and have gone through multiple cycles through different database systems. Biological data sets that we encounter were often compiled with old database tools that didn't effectively allow the construction of complex relational databases. Individual scientists with no training in database design often construct flat data sets in spreadsheets to manage results from their research and information related to their research program. Thus, legacy data often include a typical set of inventive approaches to handling relational information in flat files. Understanding the design problems that the authors of these datasets were trying to solve can be a great help in understanding what appear to be peculiarities in legacy data. Legacy data often contain non-atomic data, such as single taxon name fields, and non-normal data, such as lists of values in a single field.

In broad terms, legacy data present several classes of challenges. At one level, simply porting the data to a new DBMS may not be trivial. Older DBMS systems may not have an option to simply export all fields as text, and newer databases may not have import filters for those older systems. Consequently, you may need to learn the details of data export or data storage in the older system to extract data in a usable form. In other cases, export to a newer DBMS format may be trivial. At another level, legacy data often contain complex, poorly documented codings. Understanding the data that you are porting is very important, as information can be lost or altered if codings are not properly interpreted during data migration. At another level, data contain errors. Any data set can contain data entry errors. Flat file legacy data sets tend to contain errors that make export to normal data structures difficult without extensive cleanup of the data. Non-atomic fields can contain complex variant combinations of their components, making parsing difficult. Fields that contain repeated information (e.g. Country, State, PrimaryDivision) will contain many misspellings and variant forms of what should be identical rows. Cleanup of large datasets containing these sorts of issues may be assisted by several sorts of tools. A mapping table with one field containing a list of all the unique values found in one field in the legacy database and another field containing corrected values can be used to create a cleaned intermediate table, which can then be transformed into a more normal structure. Scripting languages with pattern matching capabilities can be used to compare data in a mapping table with dictionary files and flag exact matches, soundex matches, and similarities for human examination (see the sketch below). Pattern matching can also be used to parse out data in a non-atomic field, with some fields parsing cleanly with one pattern, others with another, others with yet another, leaving some small set of exceptions to be fixed by a human. Data cleanup is a complex task that ultimately needs input from highly knowledgeable specialists in the data, but, with some thought to data structures and algorithms, much of the process can be automated. Examples of problems in data arising from non-normal storage of information and techniques for approaching their migration are given below. A final issue in data migration is coding a new front end to replace the old user interface. To most users of the database, this will be seen as the migration – replacing one database with another.
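As an illustration of flagging near matches for human examination, MySQL's SOUNDEX function could be used to compare legacy values in a mapping table against a dictionary table (table and field names here are hypothetical):

-- flag rows where a legacy value sounds like, but is not identical
-- to, a dictionary value, as candidates for human review
SELECT m.legacy_value, d.valid_value
    FROM mapping_table m, dictionary d
    WHERE SOUNDEX(m.legacy_value) = SOUNDEX(d.valid_value)
      AND m.legacy_value <> d.valid_value;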

Documentation Problems

A common legacy problem during data migration is poor documentation of the original database design. Consider a database containing a set of three fields for original genus, original specific epithet, and original subspecies epithet, plus another three fields for current genus, current specific epithet, and current subspecies epithet, plus a single field for author and year of publication (Figure 25).

Unless this database was designed and compiled by a single individual who is still available to explain the data structures, the only way to tell whether author and year apply to the current identification or the original identification will be to sample a large number of rows and look up authorships of the species names in a reliable source. In a case such as this, you might well find that the author field had been put to different purposes by different data entry people.

A similar problem exists in another typical design: fields for genus, specific epithet, subspecific epithet, author, and year. The problem here is more subtle – values in the author field for subspecies might apply to the species name or to the subspecies name, and untangling errors will require lots of work or queries against authoritative nomenclators. Such a problem might also have produced incorrect printed labels, say, if in this example the genus, specific epithet, and author were always printed on labels, but sometimes the author was filled in with the author of the subspecific epithet.

Figure 25. An example legacy collection object table illustrating typical issues in legacy data.

These sorts of issues arise not from problems in the design of the underlying database, but from its documentation, its user interface, and the documentation of data entry practices and the training provided for data entry personnel. In particular, changes in the user interface over time, secondary to weak documentation that did not explain the business rules of the database clearly enough to the programmers of replacement interfaces, can result in inconsistent data. Given the complexity of nomenclatural and collections associated information, these problems might be more likely to be present in complex databases designed and coded by computer programmers (with strong knowledge of database design) than in simpler, less well designed databases produced by systematists (with a very detailed knowledge of the complexity of biological information).

I will now discuss specific examples of handling issues in legacy data. I have selected typical issues that can be found in legacy data related to the design of the legacy database. These problems may not reflect poor design in the legacy database. The data structures may have been constrained by the limitations of an earlier DBMS, the data may have been captured before database design principles were understood, or the database may have been designed for a different purpose.

Data entered in wrong field

A common problem in legacy data is values that are clearly outside the range of appropriate values for a field, such as “03-12-1860” as a value for donor. Values such as this commonly creep into databases from accidental entry into the wrong field on a data entry interface, such as the date donated being typed into the donor field.

A very useful approach to identifying values that were entered into the wrong field is to sort all the distinct values in each field with a case sensitive alphabetic sort. Values that are far out of range will often appear at the top or bottom of such lists (in an ASCII sort, values beginning with numbers appear before values beginning with letters, and values beginning with upper case letters appear before those beginning with lower case letters). The database can then be searched for records that contain the strange out of range values, and these records will often contain several fields that have had their values displaced. This displacement may reflect the organization of fields on the current data entry layout, or it may reflect some older data entry screen.

Records from databases that date back to the 1970s or earlier should be examined very carefully when values that appear to have been entered into the wrong field are found. These errors may reflect blind edits applied to the database by row number, or other early data editing methods that required knowledge of the actual space occupied by a particular value in a field. Some early database editing methods had the potential to cause an edit to overwrite nearby fields. Thus, when checking values in very old data sets, examine all fields in a suspect record, as well as records that may have been stored in proximity to that record.

Regular expressions can also be used to search for out of range values existing in fields. These include simple patterns such as /^[0-9]{4}$/, which will match a four digit year, or more elaborate patterns such as /^(1[789]|20)[0-9]{2}$/, which will match four digit years from 1700-2099.
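In a DBMS with regular expression support, such patterns can be applied directly in queries; for example, in MySQL, with a hypothetical donor field:

-- find donor values that contain digits and thus may be dates
-- accidentally entered into the wrong field
SELECT collection_object_id, donor
    FROM collection_object
    WHERE donor REGEXP '[0-9]';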

A rather more subtle problem can occur when two text fields that can contain similar values sit next to each other on the data entry interface. Species (specific epithet) and subspecies (subspecific epithet) fields both contain very similar information and, indeed, can contain identical values.

Atomization problems

Sometimes legacy datasets include multiple concepts placed together in a single field. A common offender is the use of two fields to hold author, year of publication, and parentheses. Rows make plain sense for combinations that use the original genus (and don't parenthesize the author), but end up with odd looking data for changed combinations (Table 26).

Table 26. Parentheses included in author.

Generic name       Trivial epithet   Author   Year of Publication
Palaeozygopleura   hamiltoniae       (Hall    1868)

The set of fields in Table 26 works perfectly well in a report that always prints taxon name + author + “,” + year, and can be relatively straightforwardly scripted to produce reports that print taxon name + author + closing parenthesis if the author starts with a parenthesis, but it makes for lots of complications for other operations (such as producing a list of unique authors) or for asking questions of the database. Splitting out parentheses from these fields and storing them in a separate field is straightforward, as shown in the following five SQL statements (which add a column to hold a flag indicating whether parentheses should be applied, populate this column, update the author and year fields to remove parentheses, and check for exceptions).

ALTER TABLE names ADD COLUMN paren BOOLEAN DEFAULT FALSE;

UPDATE names SET paren = TRUE 
    WHERE LEFT(author,1) = "(";

UPDATE names 
    SET author = RIGHT(author, LENGTH(author)-1) 
    WHERE paren = TRUE;

UPDATE names 
    SET year = LEFT(year, LENGTH(year)-1) 
    WHERE paren = TRUE AND RIGHT(year,1) = ")";

SELECT names_id, author, year FROM names 
    WHERE paren = FALSE 
      AND (INSTR(author,"(") > 0 OR INSTR(author,")") > 0 
           OR INSTR(year,"(") > 0 OR INSTR(year,")") > 0);

Another common offender in legacy data is a field holding non-atomized taxon names (Table 27), including author and year.

Table 27. A non-atomic taxon name field holding a species name with its author, year of publication, and parentheses indicating that it is a changed combination.

Taxon Name
Palaeozygopleura hamiltoniae (Hall, 1868)

In manipulating such non-atomic fields you will need to parse the string content of the single field into multiple fields (Table 28). If the names are all simple binomials without author and year, then splitting such a species name into generic name and specific epithet is simple. Otherwise, the problem can be very complex.

Table 28. The name from Table 27 atomized into generic name, specific epithet, author, and year of publication fields, plus a boolean field (Paren) used to indicate whether parentheses should be applied to the author or to the author and year.

Generic name       Specific epithet   Author   Year   Paren
Palaeozygopleura   hamiltoniae        Hall     1868   TRUE

In some cases you will just need to perform a particular simple operation once and won't need to write any code. In other cases you will want to write code to repeat the operation several times, to make the operation feasible, or to document the changes you applied to the data.

One approach to splitting a non-atomic field is to write a parser that loops through each row in the dataset, extracts the string content of the non-atomic field of interest from one row, splits the string into components (based on a separator such as a space, or by looping through the string one character at a time), and examines the content of each part in an effort to decide which parts of the string map onto which concepts. Pseudocode for such a parser can be expressed as:

run query (SELECT TaxonName FROM sourceTable)
for each row in result set
    whole bunch of code to try to identify 
        and split parts of name
    run query (INSERT INTO parsedTable (genus, subgenus, ...))
    log errors in splitting row
next row

Follow this with manual examination of the results for problems, then fix any bugs in the parser, and then run the parser again. Eventually a point of diminishing returns will be reached and you will need to handle some rows individually by hand.

Writing a parser in this form, especially for complex data such as species names, can be a very complex and frustrating task. There are, however, approaches other than brute force parsing to handling non-atomic fields. For long term projects, where data from many different data sources might need to be parsed, a machine learning algorithm of some form might be appropriate. For one shot data migration problems, exploiting some of the interesting and useful tools for pattern recognition in the programmer's toolkit might be of help. There are, for example, regular expressions. Regular expressions are a tool for recognizing patterns, found in an expanding number of languages (Perl, PHP, and MySQL all include support for regular expressions).

An algorithm and pseudocode for a parser that exploits regular expressions to match and split known taxon name patterns might look something like the following:

1) Compile by hand a list of patterns that match different forms of taxon names:
       /^[A-Z]{1}[a-z]*$/ 
           maps to field: Generic
       /^[A-Z]{1}[a-z]* [a-z]+$/ 
           maps to fields: Generic, Trivial; split on " "
2) Store these patterns in a list.
3) for each pattern in list
       run query (SELECT taxonName WHERE taxonName matches pattern)
       for each row in results
           split name into parts following known pattern
           run query (INSERT INTO parsedTable (genus, subgenus, ...))
       next row
   next pattern
4) By hand, examine rows that didn't have matching patterns, write new patterns, and run the parser again.
5) Enter the last few unmatched taxon names by hand.

Normalization Problems: Multiple values in one field

Legacy datasets very often contain fields that hold multiple repeated pieces of the same sort of information. These fields are usually remains of efforts to store one to many relationships in the flat file format allowed by early database systems. The goal in parsing these fields is to split them into their components and to place these components into separate rows in a new table, which is joined to the first in a many to one relationship.

If users have been consistent in the delimiter used to separate multiple repeated components within the field (e.g. a semicolon as a separator in “R1; R2”), then writing a parser is quite straightforward. Twenty five to thirty year old data, however, can often contain inconsistencies, and you will need to carefully examine the content of these fields to make sure that the content is consistent. This is especially true if the text content is complex and some of the text could contain the character used as the delimiter.

Parsing consistent data in this form is usually a simple exercise in splitting the string on the delimiter (very easy in languages such as Perl that have a split command, but requiring a little more coding in others). Pseudocode for such a parser may look something like this:

run query (SELECT Catalog, DictionaryRemarks FROM originalTable)
for each row in result set
    split DictionaryRemarks on the delimiter ";" into a list of remarks
    for each split remark in list
        run query (INSERT INTO parsedTable (Catalog, Remark) VALUES ...)
    next split
    log errors found when splitting row
next row

The parser above could split a field into rows in a new table as shown in Figure 26.

Figure 26. Splitting a non-atomic DictionaryRemarks field containing multiple instances of DictionaryRemarks separated by a semicolon delimiter into multiple rows in a separate normalized table.

Normalization Problems: Duplicate values with misspellings

If the structures that data are stored in are not in at least third normal form, they will allow a field to contain duplicate values across multiple rows. A locality table that contains fields for country and state will contain duplicate values for country and state for as many localities as exist in a single state. Over time, new values entered by different people, imports from external data sources, and changes to the database will result in these repeated values not quite matching each other. Pennsylvania, PA, and Penn. might all be entries in a State/Province/Primary Division field. Likewise, these repeated values can often include misspellings.

In some cases, these subtle differences and misspellings pass unnoticed, such as when a database is used primarily for printing specimen labels and retrieving information about single specimens. At other times, these not quite duplicate values create serious problems. An example of this comes from our attempt at ANSP to migrate the Ichthyology department database from Muse into BioLink. Muse uses a set of tables to house information about specimens, localities, and transactions. It includes a locality table in second normal form (the key field for locality number contains unique values) that has fields for country, for state or province, and for named place (which typically contains duplicate values). In the ANSP fish data, there were several different spellings of Pennsylvania and Philadelphia, with at least 12 different combinations of variants of United States, Pennsylvania, and Philadelphia. Since BioLink uses a single geographic hierarchy and uses a graphical tree browser to navigate through this hierarchy, it was not possible to import the ANSP fish data into BioLink and simply use it. Extensive cleanup of the data would have been required before the data would have been usable within BioLink.

Cleanup of misspellings, differences in punctuation, and other such near duplicate values is simple but time consuming for large data sets. The key components are: first, maintain a copy of the existing legacy data; second, select a set of distinct values for a field or several fields into a separate table; third, map each distinct value onto the correct value to use in its place; and fourth, build a new table containing the corrected values and links to the original literal values.

SELECT DISTINCT country, primary_division 
    FROM locality 
    ORDER BY country, primary_division;

or, to select and place in a new table:

CREATE TABLE country_primary_map_table 
    SELECT DISTINCT country, primary_division, 
           "" AS map_country, "" AS map_primary_division 
    FROM locality 
    ORDER BY country, primary_division;

Data in the mapping table produced by the query above might look like those in Table 29. Country and primary_division contain original data; map_country and map_primary_division need to be filled in.

Table 29. A table for mapping geographic names found in legacy data to correct values.

country   primary_division   map_country   map_primary_division
USA       PA
USA       Pennsylvania
US        Pennsylvana

The process of filling in the mapping values in this table can involve both queries and direct inspection of each row by someone. This examination will probably involve one person to examine each row, and others to help resolve problematic rows. It may also involve queries to make changes to lots of rows at once, especially in cases where two fields (such as country and primary division) are being examined together. An example of a query to set the map values for multiple different variants of a country name is shown below. Changes to the country mapping field could also be applied by selecting the distinct set of values for country alone, compiling mappings for them, and using an intermediate table to compile a list of unique primary divisions and countries. Indeed, you will probably want to work iteratively down the fields of a hierarchy.

UPDATE country_primary_map_table 
    SET map_country = "United States" 
    WHERE country = "USA" 
       OR country = "US" 
       OR country = "United States" 
       OR country = "United States of America";

Ultimately, each row in the mapping table will need to be examined by a person and the mapping values will need to be filled in, as in Table 30.

Table 30. A table for mapping geographic names found in legacy data to correct values, with the mapping values filled in.

country   primary_division   map_country     map_primary_division
USA       PA                 United States   Pennsylvania
USA       Pennsylvania       United States   Pennsylvania
US        Pennsylvana        United States   Pennsylvania
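The final step described above – building a new table containing the corrected values and links to the original literal values – might then be sketched in MySQL as follows, using the table and field names from the examples above:

-- build a cleaned locality table by joining the legacy table
-- to the completed mapping table, retaining the verbatim values
CREATE TABLE locality_clean 
    SELECT l.locality_id, 
           m.map_country AS country, 
           m.map_primary_division AS primary_division, 
           l.country AS verbatim_country, 
           l.primary_division AS verbatim_primary_division 
    FROM locality l, country_primary_map_table m 
    WHERE l.country = m.country 
      AND l.primary_division = m.primary_division;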

Planning for future migrations

The database life cycle is a clear reminder that the data in your database will need to be migrated to a new database some time in the future. Planning ahead in database design can help ease these future migrations. The two most important aspects of database design that will aid in future migration are good relational design and clear documentation of the database. Clear documentation of the database is unambiguously valuable. In ten years there will be questions about the nature of the data in a database, and the people who created the database will either be unavailable or will simply not remember all the details. Good relational design, however, has tradeoffs.

More complex database designs require more complex user interface code than simple database designs. Migrating a simple database (with all its contained garbage) into a new simple database is quite straightforward. Migrating a complex database will require some substantive work to replace the user interface (and any code embedded in the back end) with a working user interface in the new system. Migrating a simple database to a new complex database will require substantial time cleaning up problems in the data as well as writing code for a new user interface. This tradeoff is illustrated in Figure 27.

The architecture of your code can affect the ease of migration. Some database management systems are monolithic (Figure 28): data, code, and user interface are all integral parts of the same database. Other database systems can be decomposed into distinct layers – the user interface can be separated from the data, and the data can easily be accessed through other means.

Figure 27. Migration costs from databases of various complexity into a complex normalized database. Migration into a flat or simple database will have relatively low costs, but will have the risk of substantial quality loss in the data over the long term as data are added and edited.

Figure 28. Architecture of a monolithic database management system.

A monolithic database system offers very limited options for either code reuse or migration. Migration requires either upgrading to a newer version of the same DBMS (and then resolving any bugs introduced into the code by changes made by the DBMS vendor), or exporting the data and importing it into an entirely new database with an entirely new user interface. In contrast, a layered architecture can allow the DBMS used to store the data to be separated from user interface components, meaning that it may be possible to upgrade a DBMS or migrate to a new DBMS without having to rewrite the user interface from scratch, or indeed without making major changes to the user interface.

Many relational database management systems are written to allow easy separation of the code into layers (Figure 29). A database server, responsible for maintaining the integrity of the data, can be separated from the front end clients that access the data in that server. The server stores the data in a central location, while clients access it (from the local machine or over a network) to view, add, and alter data. The server handles all of the storage of the data; clients display the data through a user interface. Multiple users spread out over the world can work on the same database at the same time, even altering and viewing the same records. Different clients can connect to the same database. Clients may be a single platform application, a cross platform application, or multiple different applications (such as a data entry client and a web database server update script). A client may be a program that a user installs on their workstation, containing network transport and business rules for data entry built into a fancy graphical user interface. A client could be a web server application that allows remote users to query the database and view result sets through their web browsers. A design that separates a database system into layers has advantages over a monolithic database.
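
One practical expression of this separation is at the level of database accounts: each class of client connects to the server under an account holding only the rights it needs. A minimal sketch, assuming a MySQL server; the account, host, and database names are hypothetical.

    -- A data entry client account that can view, add, and alter data.
    GRANT SELECT, INSERT, UPDATE ON biodiversity.* TO 'dataentry'@'%' IDENTIFIED BY 'entry_password';
    -- A web query application account restricted to read-only access.
    GRANT SELECT ON biodiversity.* TO 'webquery'@'webserver.example.edu' IDENTIFIED BY 'web_password';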


Figure 29. Layers in a database system. A database with server side code can focus on maintaining the integrity of the data, while the user interface can be separated and connect either locally to the server or remotely over a network (using various network transport layers). A stable and well defined application programming interface can allow multiple clients of different sorts to connect to the back end of the database and allow both client and server code to be altered independently.


Layers can help in multi-platform support (clients can be written in a cross platform language or compiled in different versions for different operating systems), allow scaling and distribution of clients (Figure 30), and aid in migration (allowing independent migration of backend and user interface components through a common Application Programming Interface).

Although object oriented programming and object thinking is a very natural match for the complex hierarchies of biodiversity data, I have not discussed object oriented programming here. One aspect of object oriented thinking, however, can be of substantial help in informing design decisions about where in a complex database system business logic code should go. This aspect is encapsulation. Encapsulation allows an object to hide (that is, encapsulate) all of its internal storage structures and operations from the world. Encapsulation limits the interactions of the world with the information stored in an instance of an object to a narrow window of methods. In object oriented programming, a code object that represents real world collection objects might store a catalog number somewhere inside it and allow other objects to find or change that catalog number only through invocations of methods of the object that allow the collection_object code object to apply business rules to the request and only fulfill it if it meets the requirements of those rules. Direct reading and writing of the catalog number would not be possible. The catalog number for a particular instance of a collection_object might be obtained with a call to Collection_object.get_catalog(). Conversely, the catalog number for a particular instance of a collection_object might be set with a call to Collection_object.set_catalog("452685"). The set_catalog() method is a block of code that can test whether the catalog number it has been provided is in a valid format, whether the number provided is in use elsewhere, whether the number provided falls within a valid range, or any other business rules that apply to the setting of the catalog number of a collection object. More complex logic could be embedded in methods such as collection_object.loan(a_loan).

Figure 30. BioLink (Windows) and OBIS Toolkit (Java, cross platform) interfaces to a BioLink database on a MS SQLServer, with a web copy of the database. The separation of user interface and database layers allowed the OBIS Indo-Pacific Mollusc project to use both BioLink's Windows client and a custom cross-platform Java application for global (US, France, Australia) distributed data entry.

We can think about the control of information going into a database as following a spectrum from uncontrolled to highly encapsulated. At the uncontrolled end of the spectrum, if raw SQL commands can be presented to a database, a collection_object table in a database could have its values changed with a command such as UPDATE collection_object SET catalog_number = '5435416'; a command that will probably have undesirable results if more than one record is present. At the other end of the spectrum, users might be locked out from direct access to the database tables and required to interact through stored procedures such as sp_collection_object_set_catalog(id, new_catalog), which might be invoked somewhere in the user interface code as sp_collection_object_set_catalog("64235003","42005"). At an intermediate level, direct SQL query access to the tables is allowed, but on insert and on update triggers on the collection_object table apply business rules to queries that attempt to alter catalog numbers of collection objects. Encapsulating the backend of a database and forcing clients to communicate with it through a fixed Application Programming Interface of stored procedures allows development and migration of the backend database and clients to follow independent paths.
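
To make the encapsulated end of this spectrum concrete, a stored procedure along these lines could carry the business rules for setting a catalog number. This is a minimal sketch in Transact-SQL (MS SQL Server) syntax; the table, column, and role names and the particular rules are hypothetical illustrations, not a schema taken from this paper.

    CREATE PROCEDURE sp_collection_object_set_catalog
        @id INT,
        @new_catalog VARCHAR(20)
    AS
    BEGIN
        -- Business rule: the catalog number must not be in use by another record.
        IF EXISTS (SELECT 1 FROM collection_object
                   WHERE catalog_number = @new_catalog
                     AND collection_object_id <> @id)
        BEGIN
            RAISERROR('Catalog number already in use.', 16, 1)
            RETURN
        END
        -- Business rule: the catalog number must fall within a valid range
        -- (a non-numeric value will fail the CAST, also rejecting the request).
        IF CAST(@new_catalog AS INT) NOT BETWEEN 1 AND 999999
        BEGIN
            RAISERROR('Catalog number out of valid range.', 16, 1)
            RETURN
        END
        -- All rules satisfied: alter exactly one record.
        UPDATE collection_object
        SET catalog_number = @new_catalog
        WHERE collection_object_id = @id
    END

Locking users out of the tables themselves is then a matter of rights: granting EXECUTE on the stored procedure (for example, GRANT EXECUTE ON sp_collection_object_set_catalog TO data_entry) while granting no rights on the collection_object table forces every change to a catalog number through the business rules in the procedure.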

Conclusion

Biological information is inherently complex. Management and long term stewardship of information related to biological diversity is a challenging task, indeed at times a daunting task. Stewardship of biodiversity information relies on good database design. The most important principles of good database design are to atomize information, to reduce redundant information, and to design for the task at hand. To atomize information, design tables so that each field contains only one concept. To reduce redundant information, design tables so that each row of each field contains a unique value. Management of database systems requires good day to day system administration. Database system administration needs to be informed by a threat analysis, and should employ the means of threat mitigation, such as regular backups, highlighted by that analysis. Stewardship of biodiversity information also requires the long term perspective of the database life cycle. Careful database design and documentation of that design are important not only in maintaining data integrity during the use of a database, but are also important factors in the ease of future migrations and in limiting the extent of data loss during those migrations (including reduction of the risk that inferences made about the data now will be taken at some future point to be original facts). Good database design is a necessary foundation, but the most important factor in maintaining the quality of data in a biodiversity database is a quality control process that involves people with a strong stake in the integrity of the data.

Acknowledgments

This document had its origins in a presentation I gave at the Use of Digital Technology in Museums workshop at SPNHC in 2003. I would like to thank Tim White for encouraging me to talk about the fundamentals of database design using examples from the domain of natural history collections information in that workshop. This paper has been substantially improved by comments from and discussions with Susan F. Morris, Kathleen Sprouffske, James Macklin, and an anonymous reviewer. Gary Rosenberg has provided many stimulating discussions over many years on the management and analysis of collections and biodiversity data. I would also like to thank the students in the Biodiversity Informatics Seminar for the REU program at ANSP in the summer of 2004 for many fruitful and pertinent discussions. The PH core tables example owes much to extensive discussions, design work, and coding with James Macklin. A large and generous community of open source programmers has produced a wide variety of powerful tools for working with data, including the database software MySQL and PostgreSQL, and the CASE tool Druid.


References

Anley, C. 2002. Advanced SQL Injection in SQL Server Applications. NGSSoftware Insight Security Research. [WWW PDF Document] URL http://www.nextgenss.com/papers/advanced_sql_injections.pdf

ASC [Association of Systematics Collections] 1992. An information model for biological collections. Report of the Biological Collections Data Standards Workshop, 8-24 August 1992. ASC, Washington DC. [Text available at: WWW URL http://palimpsest.stanford.edu/lex/datamodl.html ]

ANSI X3.135-1986. Database Language SQL.

Beattie, S., S. Arnold, C. Cowan, P. Wagle, C. White, and A. Shostack 2002. Timing the Application of Security Patches for Optimal Uptime. LISA XVI Proceedings pp. 101-110. [WWW Document] URL http://www.usenix.org/events/lisa02/tech/full_papers/beattie/beattie_html/

Beccaloni, G.W., M.J. Scoble, G.S. Robinson, A.C. Downton, and S.M. Lucas 2003. Computerizing Unit-Level Data in Natural History Card Archives. pp. 165-176 In M.J. Scoble ed. ENHSIN: The European Natural History Specimen Information Network. The Natural History Museum, London.

Bechtel, K. 2003. Anti-Virus Defence In Depth. InFocus 1687:1 [WWW Document] URL http://www.securityfocus.com/infocus/1687

Berendsohn, W.G., A. Anagnostopoulos, G. Hagedorn, J. Jakupovic, P.L. Nimis, B. Valdés, A. Güntsch, R.J. Pankhurst, and R.J. White 1999. A comprehensive reference model for biological collections and surveys. Taxon 48: 511-562. [Available at: WWW URL http://www.bgbm.org/biodivinf/docs/CollectionModel/ ]

Blum, S.D. 1996a. The MVZ Collections Information Model. Conceptual Model. University of California at Berkeley, Museum of Vertebrate Zoology. [WWW PDF Documents] URL http://www.mip.berkeley.edu/mvz/cis/mvzmodel.pdf and http://www.mip.berkeley.edu/mvz/cis/ORMfigs.pdf

Blum, S.D. 1996b. The MVZ Collections Information Model. Logical Model. University of California at Berkeley, Museum of Vertebrate Zoology. [WWW PDF Document] URL http://www.mip.berkeley.edu/mvz/cis/logical.pdf

Blum, S.D., and J. Wieczorek 2004. DiGIR-bound XML Schema proposal for Darwin Core Version 2 content model. [WWW XML Schema] URL http://www.digir.net/schema/conceptual/darwin/core/2.0/darwincoreWithDiGIRv1.3.xsd

Bruce, T.A. 1992. Designing quality databases with IDEF1X information models. Dorset House, New York. 547pp.

Carboni, A. et al. 2004. Druid, the database manager. [Computer software package distributed over the Internet] http://sourceforge.net/projects/druid [version 3.5, 2004 July 4]

Celko, J. 1995a. SQL for smarties. Morgan Kaufmann, San Francisco.

Celko, J. 1995b. Instant SQL Programming. WROX, Birmingham UK.

Chen, P.P-S. 1976. The Entity Relationship Model – Towards a unified view of data. ACM Transactions on Database Systems 1(1):9-36.

Codd, E.F. 1970. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6):377-387.

Connolly, T., C. Begg, and A. Strachan 1996. Database Systems: A practical approach to design, implementation and management. Addison Wesley, Harrow, UK.

CSIRO 2001. BioLink 1.0 [Computer software package distributed on CD ROM] CSIRO Publishing. [Version 2.0 distributed in 2003].

DuBois, P. 2003 [2nd ed.]. MySQL. Sams Publishing, Indiana.

DuBois, P. et al. 2004. Reference Manual for the 'MySQL Database System' [4.0.18] [info doc file, available from http://www.mysql.com] MySQL AB.

Eisenberg, A., J. Melton, K. Kulkarni, J-E Michels, and F. Zemke 2003. SQL:2003 Has been Published. SIGMOD Record 33(1):119-126.

Elmasri, R. and S.B. Navathe 1994. Fundamentals of Database Systems. Benjamin Cummings, Redwood City, CA.

Ferguson, N. and B. Schneier 2003. Practical Cryptography. John Wiley and Sons.

Hernandez, M.J. 2003. Database Design for Mere Mortals. Addison Wesley, Boston.

Hoglund, G. and G. McGraw 2004. Exploiting Software: How to Break Code. Addison Wesley, Boston. 512pp.

ISO/IEC 9075:2003. Information technology -- Database languages -- SQL --.

Jeong, S.S. and R. Chen 2001. Functional misassignment of genes. Nature Biotechnology 19:95.

Linsenbardt, M.A., and M.S. Stigler 1999. SQL SERVER 7 Administration. Osborne/McGraw Hill, Berkeley CA. 680pp.

Morris, P.J. 2000. A Data Model for Invertebrate Paleontological Collections Information. Paleontological Society Special Publications 10:105-108, 155-260.

Morris, P.J. 2001. Posting to [email protected]: Security Advisory for BioLink users. Date: 2001-03-02 15:36:40 -0500 [Archived at: http://listserv.nhm.ku.edu/cgi-bin/wa.exe?A2=ind0103&L=taxacom&D=1&O=D&F=&S=&P=1910]

National Research Council, Committee on the Preservation of Geoscience Data and Collections 2002. Geoscience Data and Collections: National resources in peril. The National Academies Press, Washington DC. 108pp.

Pennisi, E. 1999. Keeping Genome Databases Clean and Up to Date. Science 286:447-450.

Petuch, E.J. 1989. New species of Malea (Gastropoda Tonnidae) from the Pleistocene of Southern Florida. The Nautilus 103:92-95.

PostgreSQL Global Development Group 2003. PostgreSQL 7.4.2 Documentation: 22.2. File system level backup. [WWW Document] URL http://www.postgresql.org/docs/7.4/static/backup-file.html

Pyle, R.L. 2004. Taxonomer: A relational data model for handling information related to taxonomic research. Phyloinformatics 1:1-54.

Resolution Ltd. 1998. xCase Professional. [Computer software package distributed on CD ROM] Jerusalem, RESolution Ltd. [Version 4.0 distributed in 1998]

Schwartz, P.J. 2003. XML Schema describing the Darwin Core V2. [WWW XML Schema] URL http://digir.net/schema/conceptual/darwin/2003/1.0/darwin2.xsd

Smith, M. 2002. SQL Injection: Are your web applications vulnerable? SPI Labs, SPI Dynamics, Atlanta. [WWW PDF Document] URL http://www.spidynamics.com/papers/SQLInjectionWhitePaper.pdf

Teorey, T.J. 1994. Database Modeling & Design. Morgan Kaufmann, San Francisco.

Wojcik, M. 2004. Posting to [email protected]: RE: Suggestion: erase data posted to the Web. Date: 2004-07-08 07:59:09 -0700. [Archived at: http://www.securityfocus.com/archive/1/368351]


Glossary (of terms as used herein).

Accession number: number or alphanumeric identifier assigned to a set of collection objects that enter the care of an institution together. Used to track ownership and donor of material held by an institution. In some disciplines and collections, accession number is used to mean the identifier of a single collection object (catalog number is used for this concept here).

Catalog number: number or alphanumeric identifier assigned to a single collection object to identify that particular collection object within a collection. Widely referred to as an accession number in some disciplines.

Collection object: generic term to refer to specimens of all sorts found in collections; refers to the standard single unit handled within a collection. A collection object can be a single specimen, a lot of several specimens all sharing the same data, or a part of a specimen of a particular preparation type. Example collection objects are a mammal skin, a tray of mollusk shells, an alcohol lot of fish, a mammal skull, a bulk sample of fossils, an alcohol lot of insects from a light trap, a fossil slab, or a slide containing a gastropod radula. Collection objects can form a nested hierarchy. An alcohol lot of insects from a light trap could have a single insect removed, cataloged separately, and then part of a wing could be removed from that specimen, prepared as an SEM stub, and used to produce a published illustration. Each of these objects (including derived objects such as the SEM negative) can be treated as a collection object. Likewise a microscope slide containing many diatoms can be a collection object that contains other collection objects in the form of identified diatoms at particular x-y coordinates on the slide.

DBMS: Database Management System. Used here to refer to the software responsible for storage and retrieval of the data. Examples include MS Access, MySQL, PostgreSQL, MS SQL Server, and Filemaker. A database would typically be created using the DBMS and then have a front end written in a development environment provided by the DBMS (as in MS Access or Filemaker), or written as a separate front end in a separate programming language.

Incident Response Capability: A plan for the response to computer security incidents, usually involving an internal incident response team with preplanned contacts to law enforcement and external consulting sources to be activated in response to various computer incidents.

Specific epithet: The species word in a binomial species name or polynomial subspecific name. For example, palmarosae in Murex palmarosae and nodocarinatus in Hesperiturris nodocarinatus crassus are specific epithets.

Subspecific epithet: The subspecies word in the polynomial name of a subspecies or other polynomial. For example, crassus in Hesperiturris nodocarinatus crassus is a subspecific epithet.

Trivial epithet: The lowest rank word in a binomial species name or a polynomial subspecific name. The specific epithet palmarosae in the binomial species name Murex palmarosae, and the subspecific epithet crassus in the trinomial subspecies name Hesperiturris nodocarinatus crassus, are trivial epithets.


Appendix A: An example of Entity Documentation

Table: Herbarium Sheet

Definition: A Herbarium Sheet. Normally an approximately 11 X 17 inch piece of paper with one or more plants or parts of plants and labels attached.
Comment: Core table of the Herbarium types database. Some material in the herbarium collection is stored in other physical forms, e.g. envelopes, but the concept of a herbarium sheet with specimens that are annotated maps easily to these other forms.
See Also: Specimens, ExHerbariumSpecimenAssociation, Images

Field Summary

Field               SQL type      PrimaryKey  NotNull  Default          AutoIncrement
Herb Sheet ID       Integer       X           X        -                X
Name                Varchar(32)   -           X        [Current User]   -
Date                Timestamp     -           -        Now              -
Verified by         Varchar(32)   -           -        Null             -
Verification Date   Date          -           -        Null             -
CaptureNotes        Text          -           -        -                -
VerificationNotes   Text          -           -        -                -
CurrentCollection   Varchar(255)  -           -        -                -
Verified            Boolean       -           X        False            -
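
The field summary above can be read more or less directly as DDL. The following is a minimal sketch in generic SQL, not the project's actual DDL: quoted identifiers preserve the field names as documented, types such as TEXT and BOOLEAN vary by DBMS, the auto increment mechanism for Herb Sheet ID is left to the target DBMS, and the [Current User] and Now defaults would be supplied by the front end or by DBMS-specific defaults.

    CREATE TABLE "Herbarium Sheet" (
        "Herb Sheet ID"      INTEGER NOT NULL PRIMARY KEY,  -- surrogate key; auto increment in the target DBMS
        "Name"               VARCHAR(32) NOT NULL,          -- front end fills from Users: Full Name
        "Date"               TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        "Verified by"        VARCHAR(32) DEFAULT NULL,
        "Verification Date"  DATE DEFAULT NULL,
        "CaptureNotes"       TEXT,
        "VerificationNotes"  TEXT,
        "CurrentCollection"  VARCHAR(255),
        "Verified"           BOOLEAN NOT NULL DEFAULT FALSE
    );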

Fields

Name: Herb Sheet ID    Type: Integer
Definition: Surrogate numeric primary key for herbarium sheet.
Domain: Numeric Key
Business rules: Required, Automatic.
Example Value: 528

Name: Name    Type: Varchar(32)
Definition: Name of the person who captured the data on the herbarium sheet.
Domain: Alphabetic. Names of people in Users: Full Name.
Business rules: Required, Automatic. Fill from authority list in Users: Full Name, using current login to determine the identity of the current user when a record is created.

Name: Date    Type: Timestamp
Definition: Timestamp recording the date and time the herbarium sheet data were captured.
Domain: Timestamp
Business Rules: Required, Automatic. Auto generate when the herbarium sheet record is created, i.e. default to NOW().
Example Value: 20041005:17:43:35UTC
Comment: Timestamp format is DBMS dependent, and its display may be system dependent.

Name: Verified by    Type: Varchar(32)
Definition: Name of the person who verified the herbarium sheet data.
Domain: Alphabetic. Names of people from Users: Full Name.
Business Rules: Optional, Automatic. Default to Null. On verification of a record, fill with the value from Users: Full Name, using current login to identify the current user. The current user must have verification rights in order to verify a herbarium sheet and to enter a value in this field.
Example Value: Macklin, James

Name: Verification Date    Type: Date
Definition: Date the herbarium sheet data were verified. Should be a valid date since the inception of the database (2003 and later).
Domain: Date. Valid date greater than the inception date of the database (2003).
Business Rules: Optional, Automatic. Default to Null. On verification of a record, fill with the current date. The current user must have verification rights in order to verify a herbarium sheet and to enter a value in this field. Value must be more recent than the value found in Herbarium Sheet: Date (records can only be verified after they are created).
Example Value: 2004-08-12

Name: CaptureNotes    Type: Text
Definition: Notes concerning the capture of the data on the herbarium sheet made by the person capturing the data. May be questions about difficult to read text, or other reasons why the data related to this sheet should be examined by someone more experienced than the data entry person.
Domain: Free text memo.
Business rules: Manual, Optional.
Example Values: "Unable to read taxon name Q_____ ___ba", "Locality description hard to read, please check".

Name: VerificationNotes    Type: Text
Definition: Notes made by the verifier of the specimen. May be working notes by the verifier prior to verification of the record.
Domain: Free text memo.
Business Rules: Manual, Optional. May contain a value even if Verified by and Verification Date are null.
Example Values: "Taxon name is illegible", "Fixed locality description"

Name: CurrentCollection    Type: Varchar(255)
Definition: Collection in which the herbarium sheet is currently stored. Data entry is currently for the type collection, with some records being added for material in the general systematic collection that is imaged to provide virtual loans.
Domain: "PH systematic collection", "PH type collection".
Business Rules: Required, Semimanual; default to "PH type collection". Should data entry expand to routine capture of data from the systematic collection, change to Manual and allow users to set their own current default value.

Name: Verified    Type: Boolean
Definition: Flag to indicate whether the data associated with a herbarium sheet have been verified.
Domain: True, False.
Business rules: Required, Semiautomatic. Default to False. Set to True when the record is verified. Set the three fields Herbarium Sheet: Verified, Herbarium Sheet: Verified by, and Herbarium Sheet: Verification Date together. Only a user with rights to verify material can change the value of this field. Once a record has been verified, maintain these three verification stamps and do not allow subsequent re-verifications to alter this information.
