10 March 2004 Richard J. White – COMSC / BB Unit
Reliable knowledge discovery in a biodiversity Grid
Part 2: Litchi and ambiguous names
by Richard J. White
presented to the Biostatistics & Bioinformatics Unit, Cardiff
Wednesday 10 March 2004
Ambiguous nomenclature
Challenges in creating global biodiversity information systems by merging and linking databases:
ambiguities arise from the way scientific names refer to species
    for example, if two species are combined, one of the original names must be re-used to refer to the new concept
    conversely, when a species is divided into two, one part must retain the original name
A problem in Biodiversity Informatics
The way species are named may affect the reliability and usability of species information systems
Techniques to handle the problem semi-automatically can be developed
This problem and potential solutions may in some cases generalise to other naming schemes
Names for species
A new name is published by an author who thinks the species is new and therefore needs a name
Later, others may disagree and merge this species with another
    the older name is re-used to designate the merged species: same name, different meaning (broader circumscription)
Names for species
Alternatively, a species may be split in two; one of the new species gets a new name
    the older name is re-used to designate the other one: same name, different meaning (narrower circumscription)
Example
Locate sequence data for all species of Vicia
Some data may be listed under species of the obsolete genus Orobus
A name such as Vicia narbonensis might be regarded by some as just another name for Vicia faba
Example
You want to discover all there is to know about one species
It may be listed in different sources under different names
These examples show why taxonomists attach great importance to synonyms
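The lookup problem in these two examples can be sketched in a few lines of Python. This is only an illustration, not part of the presented system: the synonym table, the source names and the record identifiers are all hypothetical, and the table encodes just one of the possible taxonomic opinions (the one that treats Vicia narbonensis as another name for Vicia faba).

```python
# Hypothetical synonym table: maps each known name to the name that one
# particular treatment accepts. The contents are illustrative only.
ACCEPTED_NAME = {
    "Vicia faba": "Vicia faba",
    "Vicia narbonensis": "Vicia faba",  # only under the opinion that merges them
}

def names_for(query, accepted_name=ACCEPTED_NAME):
    """Return every name sharing an accepted name with `query`."""
    target = accepted_name.get(query, query)
    matches = sorted(n for n, acc in accepted_name.items() if acc == target)
    return matches or [query]  # unknown names are searched as given

def search(sources, query):
    """Collect records filed under the query name or any of its synonyms."""
    wanted = set(names_for(query))
    return [
        (source, name, record)
        for source, rows in sources.items()
        for name, record in rows
        if name in wanted
    ]

# Hypothetical sources listing the same species under different names.
SOURCES = {
    "source_1": [("Vicia faba", "record-1")],
    "source_2": [("Vicia narbonensis", "record-2"),
                 ("Caragana arborescens", "record-3")],
}

hits = search(SOURCES, "Vicia faba")
```

A search on the plain string "Vicia faba" would miss the record in the second source; expanding the query through the synonym table retrieves both.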
“Mr Linnaeus”
A web-based mock-up to explore aspects of the user interface of a system for interpreting “taxonomically intelligent links”
Prepared by Helen Bradbrook, an MSc student in the School of Plant Sciences at the University of Reading
Ambiguous nomenclature
The problems are inherent in the subjective nature of the species concept
    they cannot be removed by, for example, using numbers instead of names (unless a completely new name or number is invented every time the circumscription changes)
Some of these issues were addressed in the LITCHI project …
LITCHI Project
A rule-based tool for the detection and repair of conflicts and merging of data in taxonomic databases
Litchi
a BBSRC/EPSRC “Bioinformatics Initiative” project (with Reading)
using “conflicts” between species databases
    arising from ambiguous nomenclature
    but information is implicit in the lists of synonyms accompanying species names
rule-based (Prolog) definition, detection and resolution of conflicts
Project Staff
Suzanne Embury, Alex Gray, Andrew Jones, Iain Sutherland
Object and Knowledge-based Systems Group, Department of Computer Science, University of Wales, Cardiff, PO Box 916, Cardiff CF24 3XF
Frank Bisby, Sue Brandt
Centre for Plant Diversity and Systematics, School of Plant Sciences, The University of Reading, Reading RG6 6AS
John Robinson, Richard White
Biodiversity & Ecology Research Division, School of Biological Sciences, University of Southampton, Southampton SO16 7PX
Why is LITCHI needed?
Species names are the key to biodiversity information
Trend towards large biodiversity databases and global systems
Manual merging of taxonomic databases is very time-consuming
Users want to browse “seamlessly” from one web-site to another
Users want to assemble reliable data sets drawn from several sources, but information on naming “conflicts” is hard to find and checking for them is tedious
Example 1
Checklist A
Caragana arborescens Lam. [accepted name]
Caragana sibirica Medikus [synonym]
Checklist B
Caragana sibirica Medikus [accepted name]
Caragana arborescens Lam. [synonym]
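The kind of conflict shown in Example 1 can be found mechanically by comparing the status each checklist gives to a name. The following Python sketch uses the two Caragana checklists above; the function name and the tuple layout are assumptions made for illustration, and the logic is a plain comparison rather than the LITCHI rule set itself.

```python
# The two checklists from Example 1, as (full name, status) pairs.
CHECKLIST_A = [
    ("Caragana arborescens Lam.", "accepted name"),
    ("Caragana sibirica Medikus", "synonym"),
]
CHECKLIST_B = [
    ("Caragana sibirica Medikus", "accepted name"),
    ("Caragana arborescens Lam.", "synonym"),
]

def status_conflicts(list_a, list_b):
    """Return (name, status in A, status in B) where the two statuses differ."""
    status_b = dict(list_b)
    return sorted(
        (name, status, status_b[name])
        for name, status in list_a
        if name in status_b and status_b[name] != status
    )

for name, in_a, in_b in status_conflicts(CHECKLIST_A, CHECKLIST_B):
    print(f"{name}: {in_a} in A, {in_b} in B")
```

Both names are reported, since each is accepted in one checklist and a synonym in the other. Deciding which treatment to follow remains a taxonomic judgement.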
Example 2
Checklist A
Caesalpinia crista L. [accepted name]
Checklist B
Caesalpinia crista L. [accepted name]
Caesalpinia bonduc (L.) Roxb. [accepted name]
Caesalpinia crista L., p.p. [synonym]
Example 3
In the case of the species Cytisus scoparius
Treatment A will list it as Cytisus scoparius (synonym Sarothamnus scoparius)
Treatment B will list it as Sarothamnus scoparius (synonym Cytisus scoparius)
Treatment A recognises one genus, Cytisus
Treatment B recognises two genera, Cytisus and Sarothamnus

Treatment A                  Treatment B
Genus Cytisus                Genus Sarothamnus
    Cytisus scoparius            Sarothamnus scoparius
    Cytisus striatus             Sarothamnus striatus
                             Genus Cytisus
    Cytisus multiflorus          Cytisus multiflorus
    Cytisus praecox              Cytisus praecox
What we did
Formulated rules for integrity and conflict, first in English and then in definite clauses of logic
Translated these declarative rules to build and test a Prolog model
Devised and tested algorithms to detect and report conflicts
Devised and tested algorithms to manage the partially-automated correction of the conflicting elements
Built and operated a prototype software system
Integrity and conflict rules
How a scientific name should be composed (Rules of Nomenclature)
Rules for citing the assemblage of names and synonyms for one taxon
Rules of integrity and “concept relationships” (overlap etc.) between the taxa in a taxonomic treatment
Rules for detecting conflicts between treatments
Rules for classifying conflicts to determine the action to be taken
Testing the rules
Conflicts were detected in the ILDIS database by Rule 3, which states that a full name may not appear as both an accepted name and a synonym in the same checklist:
    $\neg\exists\,(n,a,l)\;\; \mathrm{accepted\_name}(n,a,\_,l,\_) \wedge \mathrm{synonym}(n,a,\_,l,\_)$
In Prolog form, this rule is expressed:
    litchi_rule3 :- accepted_name(N,A,_,L,_), synonym(N,A,_,L,_).
A detected conflict
The Prolog conflict detection engine reported:
    conflict(3:[Astragalus,variegatus]: [Freyn,&,Bornm,.]:combinedlist)
The conflict report includes the following information:
Astragalus variegatus Freyn & Bornm. (accepted name)
Astragalus sarypulensis B.Fedtsch. (synonym)

Astragalus rufescens Freyn (accepted name)
Astragalus variegatus Freyn & Bornm. (synonym)
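For readers who do not use Prolog, the check performed by Rule 3 can be approximated in Python as a set intersection. This is a sketch under simplifying assumptions: the five-argument predicates on the slides are reduced to the three fields the rule actually binds (name, author, checklist), and the `Entry` record and function name are invented for the illustration. The data are the four Astragalus entries from the conflict report.

```python
from collections import namedtuple

# Simplified record: only name (N), author (A) and checklist (L) are kept,
# since the other two predicate arguments are ignored by Rule 3.
Entry = namedtuple("Entry", "name author checklist")

accepted_names = [
    Entry("Astragalus variegatus", "Freyn & Bornm.", "combinedlist"),
    Entry("Astragalus rufescens", "Freyn", "combinedlist"),
]
synonyms = [
    Entry("Astragalus sarypulensis", "B.Fedtsch.", "combinedlist"),
    Entry("Astragalus variegatus", "Freyn & Bornm.", "combinedlist"),
]

def rule3_violations(accepted, syns):
    """Full names listed as both accepted name and synonym in one checklist."""
    return sorted(set(accepted) & set(syns))

for entry in rule3_violations(accepted_names, synonyms):
    print(f"conflict 3: {entry.name} {entry.author} in {entry.checklist}")
```

Because the checklist is part of each record, the same full name appearing as accepted in one checklist and as a synonym in a different one is not reported here; that situation is a conflict between treatments rather than a Rule 3 violation.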
Repairing violations
User may wish to look at context of violation to determine appropriate repair
Domain-specific knowledge can be applied to narrow down set of (taxonomically) valid repairs presented to the user
Implementing LITCHI: major aspects
Design of a suitable architecture
Development of a model for species checklists
Modelling taxonomic practice using constraints
Providing appropriate support to the editor in repairing constraint violations
Summary
We modelled the knowledge integrity rules in a taxonomic treatment.
The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon (examples later).
Practical uses include detecting and resolving taxonomic conflicts when merging or linking two databases.
Outcome of project
A prototype tool for merging checklists & checking integrity of individual checklists was implemented & is freely available (but scarcely usable)
We plan to extend this work:
    “re-implemented” production version
    dynamic linking (so-called “taxonomically intelligent links”)
Litchi 2
Solutions to the nomenclature challenges, including Litchi and its interaction with Spice, are being developed further in the new BBSRC “Biodiversity World” Grid demonstrator project and the EU “Species 2000 europa” and ENBI projects (involving the same parties)
Litchi 2
“Intelligent linking” aims to protect users from nomenclatural ambiguities and to explain them
Development of these techniques would be easier if we had an explicit representation of the overlaps between species in different databases
Such “cross-maps” can be constructed automatically using similar rules in the new Litchi version 2
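One way to picture a cross-map is as a table recording, for each taxon in one checklist, the taxa in another checklist with which it may overlap. The Python sketch below builds such a table for the Caesalpinia checklists of Example 2. It rests on a deliberately crude assumption that is not taken from Litchi: two taxa are treated as overlapping whenever their name sets (accepted name plus synonyms) share a full name, and the "p.p." qualifier is dropped for matching. The real rules are presumably richer.

```python
# Each taxon is keyed by its accepted name and carries the set of all names
# (accepted name plus synonyms) attached to it. "p.p." has been stripped
# from the synonym in Checklist B so that the names compare equal.
CHECKLIST_A = {
    "Caesalpinia crista L.": {"Caesalpinia crista L."},
}
CHECKLIST_B = {
    "Caesalpinia crista L.": {"Caesalpinia crista L."},
    "Caesalpinia bonduc (L.) Roxb.": {
        "Caesalpinia bonduc (L.) Roxb.",
        "Caesalpinia crista L.",
    },
}

def cross_map(list_a, list_b):
    """Map each taxon in A to the taxa in B sharing at least one name with it."""
    return {
        taxon_a: sorted(
            taxon_b for taxon_b, names_b in list_b.items() if names_a & names_b
        )
        for taxon_a, names_a in list_a.items()
    }

mapping = cross_map(CHECKLIST_A, CHECKLIST_B)
```

Here the single taxon in Checklist A maps to both taxa in Checklist B, which is the situation a "taxonomically intelligent link" would need to explain to the user instead of silently following the identical name.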
Future projects
Ambiguous nomenclature
    on-going programme of projects (already involving collaboration with staff here in COMSC)
    building tools such as Litchi to help bioinformaticians deal with ambiguous nomenclature
These techniques might be extended to other areas of bioinformatics where subjective identification and ambiguous nomenclature occur, such as the names of proteins (as suggested by Andrew Jones), genes, geographical areas, habitat types, etc.
An “intelligent” system
It would know about the synonymies and ambiguities existing in various data domains
It would help the user work with such data
It would contain a thesaurus, “knowledge-base” or “ontology”
An “intelligent” system
These are hard to construct by hand
Litchi shows how this might be done by supervised automatic procedures in the case of species names
We want to generalise these ideas and techniques to other data domains, maybe those that you are interested in