Standardizing Mansfeld's World Database of Agricultural and Horticultural Crops by Implementing a Concept-Based Data Model

Ram Narang and Helmut Knüpffer

Leibniz Institute of Plant Genetics and Crop Plant Research, D-06466 Gatersleben, Germany
[email protected]

Introduction

Integrating species-related information from multiple sources in federated information systems or web portals is complicated by the different taxonomic approaches the sources use. Many global and local taxonomic databases, among them ITIS and Species2000, provide information about species based on a single taxonomic view, in which information is attached to a single accepted (or preferred) name. Taxonomic opinions and standards vary with time, place and investigator, and depend on many factors such as the geographical range of study, the interpretation of collected specimens, the fossil record, morphology, genetics and molecular phylogeny. New classifications may arise from more detailed studies of specimens, the discovery of new taxonomic information, or the description of new species and groupings. Consequently, biological taxa often have multiple names, which in turn may have been applied to multiple taxon concepts. When data from diverse sources are combined in a single database or portal, these different standards need to be reconciled. In addition, the increasing use of DNA sequence comparison as a tool to analyse phylogenetic relationships is accelerating the rate of taxonomic revision, which is thus unlikely to stabilize in the foreseeable future. Therefore, the availability and implementation of a data model representing multiple, alternative taxonomic views is crucial for sound taxonomic information management.

The Berlin Data Model for Taxonomic Information

Various data models have been developed to represent multiple, alternative taxonomic views in taxonomic databases (cf. Kennedy et al. 2006), among them the Berlin Model (Berendsohn et al. 2003), which is based on the IOPI model. The Berlin Model allows alternative taxonomic concepts (potential taxa) to be used for species information. A number of projects, such as the Euro+Med PlantBase, AlgaTerra, MoReTax, the IOPI Global Plant Checklist, the Dendroflora of El Salvador and Med-Checklist, implemented the core of the Berlin Model as a taxonomic backbone for their databases and contributed to its continuous development and optimization (http://www.bgbm.org/biodivinf/docs/bgbm-model/). In addition, the Berlin Model underlies several tools for taxonomic data management tasks such as taxonomic revision, data import from external sources, data integrity checking and data publishing on the World Wide Web.

The core of the Berlin Model contains four central functional sections: (1) Taxon Names, (2) Potential Taxon (taxonomic concepts), (3) Facts and (4) References. Taxon names are the botanical names according to the International Code of Botanical Nomenclature (ICBN).

Taxonomy Module of the Mansfeld Database

Like many other global taxonomic checklists, the Mansfeld Database represents a single taxonomic view of nomenclatural information. It incorporates classifications that have gained broad acceptance in the taxonomic literature and among taxonomists working with the taxa concerned, and thus offers the opportunity to standardize scientific nomenclature and taxonomy for cultivated plant species. Alternative taxonomic views (indicated by qualifiers such as sensu, emend., etc.) are presently stored as part of the nomenclatural reference. Similarly, authors and bibliographical references are not yet atomized into individual attributes. These information items need to be parsed and mapped onto the entity-relationship model to allow a concept-based view of the taxon.
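The kind of parsing this requires can be sketched as follows. This is a minimal illustration, not the parser used for the Mansfeld Database: the qualifier list, the function name `split_concept` and the sample name are assumptions made up for the example, and real nomenclatural strings are far less regular.

```python
import re

# Qualifiers assumed to introduce an alternative taxonomic view
# (illustrative list only, not taken from the Mansfeld Database).
_QUALIFIER = re.compile(r"\s+(sensu|emend\.|amend\.)\s+", re.IGNORECASE)


def split_concept(entry: str):
    """Split 'name author [qualifier reference]' into three parts.

    Returns (name_with_author, qualifier, concept_reference); the last
    two are None when no qualifier is present.
    """
    match = _QUALIFIER.search(entry)
    if match is None:
        return entry.strip(), None, None
    name_part = entry[:match.start()].strip()
    reference = entry[match.end():].strip()
    return name_part, match.group(1).lower(), reference


# Hypothetical, invented name used only to show the mechanics.
print(split_concept("Examplea fictiva L. sensu Miller 1999"))
```

Entries without a recognised qualifier are returned unchanged, so they can be treated as names without an explicitly stated concept reference.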

Mansfeld’s World Database of Agricultural and Horticultural Crops

The Mansfeld Database (http://mansfeld.ipk-gatersleben.de) is an online database developed at IPK since 1998, initially as a contribution to the project “Federal Information System on Genetic Resources” (BIG, http://www.big-flora.de/). It reflects the contents of “Mansfeld’s Encyclopedia of Agricultural and Horticultural Crops” (Hanelt and IPK 2001) and contains information on ca. 6,100 crop plant species, excluding forestry and ornamental plants. Each species entry provides nomenclature and synonymy, common names in different languages, the distribution of the species in the wild and its regions of cultivation, uses, images and references, as well as the ancestral species and notes on phylogeny, variation and history. Originally developed under Microsoft Visual FoxPro, the Mansfeld Database has recently been migrated to the Oracle 10g database platform, and the procedures for the web interface were re-programmed.

Implementing the Multiple Taxonomic Concepts Model

As a first implementation step, the latest version of the Berlin Core Model, a database model under MS SQL Server, was migrated to Oracle 10g. All database procedures, functions and triggers that implement taxonomic logic were translated into their PL/SQL equivalents.

Nomenclatural and bibliographical data of the Mansfeld Database were atomised using Java programs. The parsed information was tagged and stored in an XML file. The resulting soft-schema XML file was read with JDOM and corrected manually (a time-consuming task) to produce a strict-schema XML file, which was used to populate the tables in the Taxon, Reference and Potential Taxon sections of the Berlin taxonomic model. After completion of the taxonomic core, the remaining information from the Mansfeld Database, such as textual information on geographical distribution and uses, was linked to the potential taxon as factual data. Finally, the web interface was re-programmed to fit the new data model.
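The step from a soft-schema to a strict-schema file can be pictured as a completeness check that separates ready records from those needing manual correction. The sketch below uses Python's standard `xml.etree.ElementTree` rather than the Java and JDOM tooling named above; the element names (`names`, `name`, `genus`, `epithet`, `author`, `year`) and the sample records are hypothetical and do not reproduce the project's actual schemas.

```python
import xml.etree.ElementTree as ET

# Fields a record must carry in the (hypothetical) strict schema.
REQUIRED = ("genus", "epithet", "author", "year")


def missing_fields(record):
    """Return the required child elements that are absent or empty."""
    return [tag for tag in REQUIRED
            if not (record.findtext(tag) or "").strip()]


def split_records(xml_text):
    """Separate complete records from those needing manual correction."""
    root = ET.fromstring(xml_text)
    strict, to_fix = [], []
    for record in root.findall("name"):
        gaps = missing_fields(record)
        if gaps:
            to_fix.append((record.get("id"), gaps))
        else:
            strict.append(record)
    return strict, to_fix


# Invented soft-schema input: record 2 has an empty author and no year.
SOFT = """<names>
  <name id="1"><genus>Examplea</genus><epithet>fictiva</epithet>
    <author>L.</author><year>1753</year></name>
  <name id="2"><genus>Examplea</genus><epithet>altera</epithet>
    <author></author></name>
</names>"""

strict, to_fix = split_records(SOFT)
print(len(strict), to_fix)
```

Only the records in `strict` would be passed on to populate the database tables; the entries in `to_fix` list which fields a curator still has to supply.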

[Figure labels: Name, Taxon Concept, Reference, Facts, Relation]

In the Berlin Model, the combination of a taxon name with a reference forms a taxonym (also called a potential taxon or taxon concept). An auxiliary section, Authors, assembles author teams for the nomenclatural references. Finally, the Facts component can be used to store any kind of factual information.
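The relationship between names, references, potential taxa and facts can be illustrated with a few plain data classes. This is only a sketch of the idea under simplifying assumptions: the class and field names are invented for the example and are not the Berlin Model's table or column names, and the sample name and citations are fictitious.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonName:
    full_name: str   # botanical name including author citation
    rank: str


@dataclass(frozen=True)
class Reference:
    citation: str


@dataclass(frozen=True)
class PotentialTaxon:
    """A name as used in one particular reference (a taxon concept)."""
    name: TaxonName
    sec: Reference   # the reference according to which the name is used


@dataclass(frozen=True)
class Fact:
    taxon: PotentialTaxon
    category: str
    text: str


def facts_for(facts, taxon):
    """Select the facts attached to one specific concept."""
    return [f for f in facts if f.taxon == taxon]


# One (fictitious) name used in two references gives two distinct concepts.
name = TaxonName("Examplea fictiva L.", "species")
view_a = PotentialTaxon(name, Reference("Author A 1990"))
view_b = PotentialTaxon(name, Reference("Author B 2005"))

facts = [
    Fact(view_a, "use", "grown as a vegetable"),
    Fact(view_b, "distribution", "cultivated in region X"),
]
print(view_a == view_b, [f.text for f in facts_for(facts, view_a)])
```

Because facts point to a potential taxon rather than to a bare name, information recorded under one taxonomic view is not silently merged with information recorded under another view of the same name.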

Basic data integrity rules in the Berlin Model are implemented at the level of tables, keys and relations within the database model. For example, the rule that every botanical name must have a rank can be enforced with a foreign key to the table defining the list of valid ranks. More complex rules and functions, e.g. to construct syntactically correct botanical names, are implemented using stored procedures and trigger functions. Triggers are functions executed automatically when certain database events occur. For example, one of the triggers automatically rebuilds an author team when one of its author names is changed.
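Both kinds of rule can be tried out in a small, self-contained sketch. It uses SQLite through Python's `sqlite3` module instead of MS SQL Server or Oracle, and all table, column and trigger names as well as the sample data are invented for the illustration; the actual Berlin Model schema and triggers differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE rank (rank_id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE taxon_name (
    name_id   INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL,
    rank_id   INTEGER NOT NULL REFERENCES rank(rank_id)
);
CREATE TABLE author (author_id INTEGER PRIMARY KEY, abbrev TEXT NOT NULL);
CREATE TABLE author_team (team_id INTEGER PRIMARY KEY, team_text TEXT);
CREATE TABLE team_member (
    team_id   INTEGER NOT NULL REFERENCES author_team(team_id),
    author_id INTEGER NOT NULL REFERENCES author(author_id)
);
-- Rebuild the cached team string whenever a member's abbreviation changes.
-- Member order is not controlled in this simplified sketch.
CREATE TRIGGER rebuild_team AFTER UPDATE OF abbrev ON author
BEGIN
    UPDATE author_team
       SET team_text = (SELECT group_concat(a.abbrev, ' & ')
                          FROM team_member m
                          JOIN author a ON a.author_id = m.author_id
                         WHERE m.team_id = author_team.team_id)
     WHERE team_id IN (SELECT team_id FROM team_member
                        WHERE author_id = NEW.author_id);
END;
"""


def demo():
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    con.executescript(SCHEMA)

    # Rule 1: a name must point to an existing rank.
    con.execute("INSERT INTO rank VALUES (1, 'species')")
    con.execute("INSERT INTO taxon_name VALUES (1, 'Examplea fictiva', 1)")
    try:
        con.execute("INSERT INTO taxon_name VALUES (2, 'Examplea altera', 99)")
        fk_rejected = False
    except sqlite3.IntegrityError:
        fk_rejected = True

    # Rule 2: changing an author abbreviation rebuilds the team string.
    con.execute("INSERT INTO author VALUES (1, 'Aut.')")
    con.execute("INSERT INTO author VALUES (2, 'Sec.')")
    con.execute("INSERT INTO author_team VALUES (1, 'Aut. & Sec.')")
    con.execute("INSERT INTO team_member VALUES (1, 1)")
    con.execute("INSERT INTO team_member VALUES (1, 2)")
    con.execute("UPDATE author SET abbrev = 'Auth.' WHERE author_id = 1")
    team_text = con.execute(
        "SELECT team_text FROM author_team WHERE team_id = 1").fetchone()[0]
    con.close()
    return fk_rejected, team_text


if __name__ == "__main__":
    print(demo())
```

The foreign key rejects the name with the unknown rank at insert time, while the trigger keeps the derived team string consistent without any action by the client application.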

Tables and columns of the taxonomy module of the Mansfeld Database, as shown in the schema diagram (PK = primary key, FK = foreign key, U = unique constraint; index markers omitted):

vnam_tax: anzeigen, id, taxon_id (FK1), vnam_id (FK2), vnam_neu, vnam_id_alt, name_orig, sprach_id, namtyp_id (FK3), pfl_teil, geogr_info, add_info, artikel, fuer_big, soi_id, ref_id, erstellt, erst_von, geaendert, geaend_von, chk, chk_von, löschen

volksnam: vnam_id (PK), name (U1), name_ansi, soi_id, ref_id, anzeigen, erstellt, erst_von, geaendert, geaend_von, bemerkung, original_n, chk, chk_von, löschen

taxa_soi: id (PK), taxon_id (FK1), soi_id (FK2), erstellt, erst_von, geaendert, geaend_von

autoren: autor_orig, autor_id, dubl_mit, autor_ges, autor_apn, autor_bas, problem, autor, autor_non, autor_api, bemerkung, erstellt, erst_von, geaendert, geaend_von, löschen

dubl_botnam: botnam_id, dublette, dubl_mit, taxon_id

soi: soi_id (PK), soi_big, name_d, name_e

taxrang: rang_id (PK), rang, mf_rang_kuerz, mf, anzeigen, sprach_id, taxlevel, kulturpflanze, reihenf, erstellt, erst_von, geaendert, geaend_von

pp_stat: ppstat_id (PK), pp_kuerzel (U1), bemerkung

gruppe: gruppen_id (PK), kuerzel, name, anzeigen

taxa: taxon_id (PK), botnam_id (FK1), bnam_id_alt, hightax_id, famtax_id, familie, artikel_id, highart_id, db_id, soi_id (FK2), ref_id, erstellt, erst_von, geaendert, geaend_von, löschen

syn_stat: synstat_id (PK), syn_symbol, syn_status, status_big, text, sortierung, bemerkung

publikat: publ_bphtl, publ_id, publ, bemerkung, erstellt, erst_von, geaendert, geaend_von

vnamtyp: namtyp_id (PK), name_d, name_e, bemerkung, anzeigen

botnam: botnam_id (PK), homonym, dublette, dubl_mit, löschen, soi_id (FK3), name, name_ansi, name_gz, autor_bas, autor, autor_non, autor_id, autor_ges, jahr, jahr_non, publ_id, publ, publ_band, publ_seite, publ_non, publ_add, nam_stat, alt_name, name_voll, autor_apn, autor_chk, publ_bph, publ_tl2, publ_chk, ref_id, tax_text, rang_id (FK1), gruppen_id (FK2), art_autor, original, bemerkung, anzeigen, fuer_big, erstellt, erst_von, geaendert, geaend_von

syno: id (PK), taxon_id (FK1), vtaxon_id, akztax_id, mf_artikel, text_tax, synstat_id (FK2), syn_oper, ppstat_id (FK3), syn_text, artikel_id, erstellt, erst_von, geaendert, geaend_von, anzeigen, fuer_big, löschen, bemerkung

The implementation of the Berlin Model in the Mansfeld Database facilitates standardisation and improves the quality of the taxonomic information by increasing its accuracy, resolution and interpretability. In addition, existing standard taxonomy management tools such as web editors can be adapted to work on the new concept-based Mansfeld Database model for updating the contents of the database. The extensive information on ca. 6,100 species of agricultural and horticultural crop plants will thus become more easily accessible to global biodiversity information portals.

[Figure: Implementation steps. Boxes: Mansfeld Database, XML soft schema, XML strict schema, conceptual database model; the transitions are numbered I, II and III.]

[Figure: Web screenshots of the Mansfeld Database before the transformation to the Berlin Model]

[Figure: Mansfeld Database – Taxonomy module]

[Figure: Concept-oriented database core]

[Figure: Entity-relationship model of the potential taxon. Entity labels: Taxon Rank, Taxon Name, Potential Taxon Name, Reference, Status Assignment, Assigned Status, Reference Title. Relationship labels: "is accepted name", "assigns accepted name", "is higher taxon in classification", "is classified", "gives status and other taxonomic information of".]

Outlook

The Encyclopedia of Life (http://www.eol.org), launched in 2007, is developing “species pages” for all known organisms, with content to be provided and edited by experts from all over the world using a wiki-like editor. Its initial content is being gathered from existing web resources. The rich information on more than 6,000 of the economically most important plant species documented in the Mansfeld Database was offered for inclusion at the EoL Plant Species Pages Meeting (St. Louis, Missouri, 31 October to 2 November 2007).

The Global Biodiversity Information Facility (http://www.gbif.org) aims to provide free access to biodiversity information on the web through standardised web services. GBIF has approached the Mansfeld Database developers to make the database’s ca. 38,000 common names of crop plant species in many languages available, as a starting point for an interface that would allow the world’s biodiversity data to be queried by common names as well as scientific names. Integrating the Mansfeld Database fully into GBIF would also make its rich crop species information accessible along with data from other providers of taxon-related data.

References

Berendsohn, W.G., M. Döring, M. Geoffroy, K. Glück, A. Güntsch, A. Hahn, W.-H. Kusber, J.L. Li, D. Röpert and F. Specht. 2003. The Berlin Model: a concept-based taxonomic information model. Pp. 15-26 in Berendsohn, W.G. (ed.), MoReTax. Handling Factual Information Linked to Taxonomic Concepts in Biology. Schriftenreihe für Vegetationskunde 39, Bonn.

Hanelt, P. and Institute of Plant Genetics and Crop Plant Research (eds). 2001. Mansfeld’s Encyclopedia of Agricultural and Horticultural Crops (Except Ornamentals). 6 vols. 1st Engl. ed. Springer, Berlin, Heidelberg, New York, etc. (LXX+3645 pp.)

Kennedy, J., R. Hyam, R. Kukla and T. Paterson. 2006. Standard data model representation for taxonomic information. OMICS. A Journal of Integrative Biology 10 (Special Issue on Data Standards), 220-230.