10 March 2004 Richard J. White – COMSC / BB Unit

Reliable knowledge discovery in a biodiversity Grid

Part 2: Litchi and ambiguous names

by Richard J. White

presented to the Biostatistics & Bioinformatics Unit, Cardiff

Wednesday 10 March 2004

Ambiguous nomenclature

Challenges in creating global biodiversity information systems by merging and linking databases:

Ambiguities arise from the way scientific names refer to species.

For example, if two species are combined, one of the original names must be re-used to refer to the new concept.

Conversely, when a species is divided into two, one part must retain the original name.

A problem in Biodiversity Informatics

The way species are named may affect the reliability and usability of species information systems.

Techniques to handle the problem semi-automatically can be developed.

This problem and potential solutions may in some cases generalise to other naming schemes.

Names for species

A new name is published by an author who thinks the species is new and therefore needs a name.

Later, others may disagree and merge this species with another. The older name is re-used to designate the merged species: same name, different meaning (broader circumscription).

Names for species

Alternatively, a species may be split in two; one of the new species gets a new name. The older name is re-used to designate the other one: same name, different meaning (narrower circumscription).
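The effect of merging and splitting on what a name means can be sketched in a few lines of Python. This is an illustration only, not part of the talk: the species names, the specimen identifiers and the helpers `merge` and `split` are hypothetical, and a circumscription is modelled, as a simplifying assumption, as the set of specimens that a name covers.

```python
# Hypothetical model: a treatment maps each accepted name to its
# circumscription, represented here as a frozenset of specimen identifiers.

def merge(treatment, older, newer):
    """Merge two species; the older name is re-used for the combined species."""
    result = dict(treatment)
    result[older] = treatment[older] | treatment[newer]
    del result[newer]
    return result


def split(treatment, name, new_name, moved):
    """Split a species; `name` is retained for the part that is left behind."""
    moved = frozenset(moved)
    result = dict(treatment)
    result[name] = treatment[name] - moved
    result[new_name] = moved
    return result


# Made-up starting treatment with two species.
original = {
    "Species a": frozenset({"s1", "s2"}),
    "Species b": frozenset({"s3"}),
}
merged = merge(original, "Species a", "Species b")
divided = split(original, "Species a", "Species c", {"s2"})

# The same name now has three different meanings.
print(sorted(original["Species a"]))  # ['s1', 's2']
print(sorted(merged["Species a"]))    # ['s1', 's2', 's3'] (broader)
print(sorted(divided["Species a"]))   # ['s1'] (narrower)
```

A record labelled only "Species a" does not say which of the three circumscriptions was meant, which is the ambiguity described above.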

Example

Locate sequence data for all species of Vicia

Some data may be listed under species of the obsolete genus Orobus

A name such as Vicia narbonensis might be regarded by some as just another name for Vicia faba

Example

You want to discover all there is to know about one species.

It may be listed in different sources under different names.

These examples show why taxonomists attach great importance to synonyms
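A minimal sketch of why synonyms matter when searching several sources. The synonymy table follows the opinion mentioned in the first example (Vicia narbonensis treated as another name for Vicia faba), adopted here purely for illustration; the records, source labels and function names are hypothetical.

```python
# Illustrative synonymy only: it adopts the opinion that Vicia narbonensis
# is just another name for Vicia faba. Not every taxonomist would agree.
SYNONYMS = {"Vicia faba": {"Vicia narbonensis"}}

# Hypothetical records: (source, name used by that source, record label).
RECORDS = [
    ("source 1", "Vicia faba", "record A"),
    ("source 2", "Vicia narbonensis", "record B"),
    ("source 2", "Vicia sativa", "record C"),
]


def all_names(accepted):
    """The accepted name together with every synonym known for it."""
    return {accepted} | SYNONYMS.get(accepted, set())


def find_records(accepted):
    """Records listed under the accepted name or any of its synonyms."""
    names = all_names(accepted)
    return [record for record in RECORDS if record[1] in names]


naive = [record for record in RECORDS if record[1] == "Vicia faba"]
expanded = find_records("Vicia faba")
print(len(naive), len(expanded))  # 1 2
```

The search on the accepted name alone misses the record filed under the synonym; expanding the query with the synonym list recovers it.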

(PDL cover)

(PDL page)

(ILDIS search results)

(ILDIS species page)

“Mr Linnaeus”

A web-based mock-up to explore aspects of the user interface of a system for interpreting “taxonomically intelligent links”.

Prepared by Helen Bradbrook, an MSc student in the School of Plant Sciences at the University of Reading.

Ambiguous nomenclature

The problems are inherent in the subjective nature of the species concept.

They cannot be removed by, for example, using numbers instead of names (unless a completely new name or number is invented every time the circumscription changes).

Some of these issues were addressed in the LITCHI project …

LITCHI Project

A rule-based tool for the detection and repair of conflicts and merging of data in taxonomic databases

Litchi

A BBSRC/EPSRC “Bioinformatics Initiative” project (with Reading)

Using “conflicts” between species databases arising from ambiguous nomenclature

But information is implicit in the lists of synonyms accompanying species names

Rule-based (Prolog) definition, detection and resolution of conflicts

Project Staff

Suzanne Embury, Alex Gray, Andrew Jones, Iain Sutherland
Object and Knowledge-based Systems Group, Department of Computer Science, University of Wales, Cardiff, PO Box 916, Cardiff CF24 3XF

Frank Bisby, Sue Brandt
Centre for Plant Diversity and Systematics, School of Plant Sciences, The University of Reading, Reading RG6 6AS

John Robinson, Richard White
Biodiversity & Ecology Research Division, School of Biological Sciences, University of Southampton, Southampton SO16 7PX

Why is LITCHI needed?

Species names are the key to biodiversity information.

Trend towards large biodiversity databases and global systems.

Manual merging of taxonomic databases is very time-consuming.

Users want to browse “seamlessly” from one web-site to another.

Users want to assemble reliable data sets drawn from several sources, but information on naming “conflicts” is hard to find and checking for them is tedious.

Example 1

Checklist A

Caragana arborescens Lam. [accepted name]
Caragana sibirica Medikus [synonym]

Checklist B

Caragana sibirica Medikus [accepted name]
Caragana arborescens Lam. [synonym]
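The disagreement between the two checklists in Example 1 can be found mechanically. The sketch below is an illustration in Python rather than the project's Prolog: the dictionary layout (accepted name mapped to its synonyms) and the function `status_conflicts` are assumptions made for this example.

```python
# Each checklist maps an accepted name to the set of its synonyms
# (an assumed layout; the names are those of Example 1).
checklist_a = {"Caragana arborescens Lam.": {"Caragana sibirica Medikus"}}
checklist_b = {"Caragana sibirica Medikus": {"Caragana arborescens Lam."}}


def status_conflicts(a, b):
    """Pairs (name, accepted_in_b): `name` is accepted in `a` but is listed
    in `b` as a synonym of `accepted_in_b`."""
    conflicts = []
    for accepted_b, synonyms_b in b.items():
        for name in synonyms_b:
            if name in a:
                conflicts.append((name, accepted_b))
    return sorted(conflicts)


print(status_conflicts(checklist_a, checklist_b))
# [('Caragana arborescens Lam.', 'Caragana sibirica Medikus')]
```

Running the check in the other direction reports the mirror-image pair, so a merging tool would need to ask an editor which name to accept.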

Example 2

Checklist A

Caesalpinia crista L. [accepted name]

Checklist B

Caesalpinia crista L. [accepted name]

Caesalpinia bonduc (L.) Roxb. [accepted name]
Caesalpinia crista L., p.p. [synonym]

Example 3

In the case of the species Cytisus scoparius:

Treatment A will list it as Cytisus scoparius (synonym Sarothamnus scoparius)

Treatment B will list it as Sarothamnus scoparius (synonym Cytisus scoparius)

Treatment A              Treatment B
Genus Cytisus            Genus Sarothamnus
Cytisus scoparius        Sarothamnus scoparius
Cytisus striatus         Sarothamnus striatus
                         Genus Cytisus
Cytisus multiflorus      Cytisus multiflorus
Cytisus praecox          Cytisus praecox

Treatment A recognises one genus, Cytisus.

Treatment B recognises two genera, Cytisus and Sarothamnus.

What we did

Formulated rules for integrity and conflict, first in English and then in definite clauses of logic.

Translated these declarative rules to build and test a Prolog model.

Devised and tested algorithms to detect and report conflicts.

Devised and tested algorithms to manage the partially-automated correction of the conflicting elements.

Built and operated a prototype software system.

Integrity and conflict rules

How a scientific name should be composed (Rules of Nomenclature)

Rules for citing the assemblage of names and synonyms for one taxon

Rules of integrity and “concept relationships” (overlap etc.) between the taxa in a taxonomic treatment

Rules for detecting conflicts between treatments

Rules for classifying conflicts to determine the action to be taken

Testing the rules

Conflicts were detected in the ILDIS database by Rule 3, which states that a full name may not appear as both an accepted name and a synonym in the same checklist. A conflict exists when, for some name n, author a and checklist l:

accepted_name(n,a,_,l,_) ∧ synonym(n,a,_,l,_)

In Prolog form, this rule is expressed:

litchi_rule3 :- accepted_name(N,A,_,L,_), synonym(N,A,_,L,_).

A detected conflict

The Prolog conflict detection engine reported:

conflict(3:[Astragalus,variegatus]:[Freyn,&,Bornm,.]:combinedlist)

The conflict report includes the following information:

Astragalus variegatus Freyn & Bornm. (accepted name)
Astragalus sarypulensis B.Fedtsch. (synonym)

Astragalus rufescens Freyn (accepted name)
Astragalus variegatus Freyn & Bornm. (synonym)
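For readers less familiar with Prolog, the same check can be sketched in Python. This is a re-expression for illustration, not the project's code: the two ignored arguments of the Prolog facts are simply dropped, each fact becomes a (name, author, checklist) tuple, and the function name `rule3_conflicts` and the printed format are assumptions.

```python
# Facts from the conflict report above, reduced to (name, author, checklist).
accepted_name = {
    ("Astragalus variegatus", "Freyn & Bornm.", "combinedlist"),
    ("Astragalus rufescens", "Freyn", "combinedlist"),
}
synonym = {
    ("Astragalus sarypulensis", "B.Fedtsch.", "combinedlist"),
    ("Astragalus variegatus", "Freyn & Bornm.", "combinedlist"),
}


def rule3_conflicts(accepted, synonyms):
    """Full names that are both an accepted name and a synonym in the same
    checklist. Intersecting the two sets matches name, author and checklist
    together, as the shared variables N, A and L do in the Prolog rule."""
    return sorted(accepted & synonyms)


for name, author, checklist in rule3_conflicts(accepted_name, synonym):
    print(f"conflict(3: {name} {author}: {checklist})")
# conflict(3: Astragalus variegatus Freyn & Bornm.: combinedlist)
```

Because the checklist identifier is part of each tuple, the same full name used as accepted in one checklist and as a synonym in a different one is not reported by this rule; that situation belongs to the rules for conflicts between treatments.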

Conflict display

Repairing violations

The user may wish to look at the context of a violation to determine the appropriate repair.

Domain-specific knowledge can be applied to narrow down the set of (taxonomically) valid repairs presented to the user.
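One way to picture the repair step is as a function that proposes candidate repairs for a Rule 3 violation and leaves the choice to the editor. The sketch below is hypothetical throughout: the two repair options, the checklist layout and the function names are assumptions made for illustration, and no taxonomic knowledge is encoded to prune the candidates as a real system would.

```python
def rule3_violations(checklist):
    """Names that are both an accepted name and a synonym in the checklist."""
    all_synonyms = set().union(*checklist.values())
    return sorted(set(checklist) & all_synonyms)


def propose_repairs(checklist, name):
    """Return {description: repaired copy} for a name that violates Rule 3.
    The input checklist is never modified."""
    repairs = {}
    for other, synonyms in checklist.items():
        if name not in synonyms:
            continue
        # Option 1: keep `name` accepted and remove it as a synonym of `other`.
        fixed = {k: set(v) for k, v in checklist.items()}
        fixed[other].discard(name)
        repairs[f"remove synonym under {other}"] = fixed
        # Option 2: sink `name`, with its own synonyms, into `other`.
        fixed = {k: set(v) for k, v in checklist.items() if k != name}
        fixed[other] |= checklist[name]
        repairs[f"merge into {other}"] = fixed
    return repairs


# Accepted name -> synonyms, using the detected Astragalus conflict.
checklist = {
    "Astragalus variegatus Freyn & Bornm.": {"Astragalus sarypulensis B.Fedtsch."},
    "Astragalus rufescens Freyn": {"Astragalus variegatus Freyn & Bornm."},
}
bad_name = rule3_violations(checklist)[0]
repairs = propose_repairs(checklist, bad_name)
for description, fixed in repairs.items():
    print(description, "->", rule3_violations(fixed))
```

Both candidates leave the checklist free of Rule 3 violations, but they express different taxonomic opinions, which is why the decision is left to the user.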

Conflict repair

Implementing LITCHI: major aspects

Design of a suitable architecture

Development of a model for species checklists

Modelling taxonomic practice using constraints

Providing appropriate support to the editor in repairing constraint violations

Summary

We modelled the knowledge integrity rules in a taxonomic treatment.

The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon (examples later).

Practical uses include detecting and resolving taxonomic conflicts when merging or linking two databases.

Outcome of project

A prototype tool for merging checklists & checking integrity of individual checklists was implemented & is freely available (but scarcely usable).

We plan to extend this work:

a “re-implemented” production version

dynamic linking (so-called “taxonomically intelligent links”)

Litchi 2

Solutions to the nomenclature challenges, including Litchi and its interaction with Spice, are being developed further in the course of the new BBSRC “Biodiversity World” Grid demonstrator project and the EU “Species 2000 europa” and ENBI projects (involving the same parties).

Litchi 2

“Intelligent linking” is intended to protect users from nomenclatural ambiguities and to explain them.

Development of these techniques would be easier if we had an explicit representation of the overlaps between species in different databases.

Such “cross-maps” can be constructed automatically using similar rules in the new Litchi version 2.
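A very rough idea of what a cross-map contains can be given with the two Caesalpinia checklists from Example 2. The sketch below pairs taxa whenever their name lists share a name. This matching criterion, the data layout and the function names are assumptions for illustration and are not the rules used by Litchi.

```python
def names_of(accepted, synonyms):
    """All names attached to one taxon: the accepted name and its synonyms."""
    return {accepted} | set(synonyms)


def cross_map(a, b):
    """Map each taxon in `a` to the taxa in `b` that share a name with it."""
    mapping = {}
    for accepted_a, synonyms_a in a.items():
        names_a = names_of(accepted_a, synonyms_a)
        mapping[accepted_a] = sorted(
            accepted_b
            for accepted_b, synonyms_b in b.items()
            if names_a & names_of(accepted_b, synonyms_b)
        )
    return mapping


# Accepted name -> synonyms, as in Example 2.
checklist_a = {"Caesalpinia crista L.": set()}
checklist_b = {
    "Caesalpinia crista L.": set(),
    # Listed in checklist B as a pro parte (p.p.) synonym.
    "Caesalpinia bonduc (L.) Roxb.": {"Caesalpinia crista L."},
}

print(cross_map(checklist_a, checklist_b))
# {'Caesalpinia crista L.': ['Caesalpinia bonduc (L.) Roxb.', 'Caesalpinia crista L.']}
```

The single species in Checklist A maps to two species in Checklist B, so a link followed from A to B on the name Caesalpinia crista alone could land on only part of the relevant information.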

Future projects

Ambiguous nomenclature:

an on-going programme of projects (already involving collaboration with staff here in COMSC)

building tools such as Litchi to help bioinformaticians deal with ambiguous nomenclature

These techniques might be extended to other areas of bioinformatics where subjective identification and ambiguous nomenclature occur, such as the names of proteins (as suggested by Andrew Jones), genes, geographical areas, habitat types, etc.

An “intelligent” system

It would know about the synonymies and ambiguities existing in various data domains.

It would help the user work with such data.

It would contain a thesaurus, “knowledge-base” or “ontology”.

An “intelligent” system

These are hard to construct by hand.

Litchi shows how this might be done by supervised automatic procedures in the case of species names.

We want to generalise these ideas and techniques to other data domains, maybe those that you are interested in.