Biodiversity Informatics and the Biodiversity Literature
Overview
• Progress over the last decade
– Organism occurrence data
– Taxonomic databases
• The next challenge
– Describing diversity
CAS USNM FMNH NHM MNHN Institutions
Tools and standards created in biodiversity informatics enable data to be aggregated from around the world.
Collection Databases
End User (anywhere): data.GBIF.org
San Francisco Washington Chicago London Paris
The Global Biodiversity Information Facility (GBIF) is the largest aggregator of organism occurrence data.
California Academy of Sciences
National Museum of Natural History
Field Museum of Natural History
The Natural History Museum
Muséum national d'Histoire naturelle
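As a rough illustration of how shared standards let an aggregator combine collection databases, the sketch below maps two differently shaped local records onto one common record type. The field names are loosely modeled on Darwin Core terms, which is an assumption (the slides do not name a specific standard), and the local column names, catalog numbers, and coordinates are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Occurrence:
    """Common record shape; field names are assumed, loosely after Darwin Core."""
    institution_code: str
    catalog_number: str
    scientific_name: str
    decimal_latitude: Optional[float]
    decimal_longitude: Optional[float]


def to_float(value) -> Optional[float]:
    """Convert a coordinate to float, treating missing or empty values as None."""
    if value is None or value == "":
        return None
    return float(value)


def normalize(row: dict, mapping: dict, institution_code: str) -> Occurrence:
    """Translate one local row into the common shape using a column mapping."""
    return Occurrence(
        institution_code=institution_code,
        catalog_number=str(row[mapping["catalog_number"]]),
        scientific_name=row[mapping["scientific_name"]].strip(),
        decimal_latitude=to_float(row.get(mapping["decimal_latitude"])),
        decimal_longitude=to_float(row.get(mapping["decimal_longitude"])),
    )


def aggregate(sources) -> list:
    """Combine rows from several (code, mapping, rows) sources into one list."""
    records = []
    for code, mapping, rows in sources:
        records.extend(normalize(row, mapping, code) for row in rows)
    return records


# Hypothetical local schemas and invented records, for illustration only.
CAS_MAPPING = {"catalog_number": "cat_no", "scientific_name": "taxon",
               "decimal_latitude": "lat", "decimal_longitude": "lon"}
FMNH_MAPPING = {"catalog_number": "id", "scientific_name": "name",
                "decimal_latitude": "latitude", "decimal_longitude": "longitude"}

SOURCES = [
    ("CAS", CAS_MAPPING, [
        {"cat_no": 101, "taxon": "Danio rerio ", "lat": "25.5", "lon": "85.1"},
    ]),
    ("FMNH", FMNH_MAPPING, [
        {"id": "F-7", "name": "Danio rerio", "latitude": "", "longitude": ""},
    ]),
]

RECORDS = aggregate(SOURCES)
```

The second record has no coordinates and keeps `None` for both, which is the kind of record that still needs georeferencing.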
Organism Occurrence Data
Distribution models
Remaining challenges with occurrence data
• Lots of digitization still to do
• Taxonomic identifications need to be updated
• Georeferencing still needs to be done
Relationship to literature:
• Specimens and observations are primary data
• Literature contains both reports of primary data and summarized data
• Large-scale digitization efforts in museums might (will) swamp the content in literature
Taxonomic Databases
Nomenclator Checklist valid/accepted taxa (plus synonyms)
Catalog of uses in taxonomic works
Index – all unique name-strings mapped to valid names/concepts
>20M
increasing density of names in relevant corpus
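The index idea above (unique name-strings mapped to valid names) can be sketched as a simple lookup table. This is a minimal sketch under stated assumptions, not how any particular taxonomic database is built: the `NameIndex` class and its normalization rule (collapse whitespace, ignore case) are hypothetical, and the example synonyms are chosen only for illustration.

```python
from typing import Optional


def normalize_name_string(name_string: str) -> str:
    """Collapse runs of whitespace and ignore case so variants compare equal."""
    return " ".join(name_string.split()).casefold()


class NameIndex:
    """Hypothetical index from name-strings to the valid (accepted) name."""

    def __init__(self) -> None:
        self._valid_for = {}

    def add(self, valid_name: str, synonyms=()) -> None:
        """Register a valid name together with the name-strings that map to it."""
        for name_string in (valid_name, *synonyms):
            self._valid_for[normalize_name_string(name_string)] = valid_name

    def resolve(self, name_string: str) -> Optional[str]:
        """Return the valid name for a name-string, or None if it is unknown."""
        return self._valid_for.get(normalize_name_string(name_string))


# Illustrative content only.
index = NameIndex()
index.add("Danio rerio", synonyms=["Brachydanio rerio", "Cyprinus rerio"])
```

For example, `index.resolve("brachydanio   RERIO")` returns `"Danio rerio"`, while an unregistered string returns `None`.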
Emergent consensus
• Philosophical/methodological debates
– Species concepts
  • Biological
  • Evolutionary
  • Phylogenetic
– Taxonomic definitions
  • Circumscription
  • Synonymized types
  • Set of specimens identified by taxon author
  • Tree- or lineage-based definition
Anchor name-usage to publication metadata; actual publication; enable validation
Name Usage Name
Citation (publication metadata)
begin end
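One way to picture a name usage anchored to a publication is as a small record that points at a citation and at a span of the published text. The sketch below assumes that "begin" and "end" are character offsets into the publication text, which the slide does not state; the class names, the `validate` helper, and the sample text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Publication metadata (placeholder fields for illustration)."""
    title: str
    year: int


@dataclass(frozen=True)
class NameUsage:
    """A name as used in one publication, anchored by assumed character offsets."""
    name: str
    citation: Citation
    begin: int  # offset of the first character of the name (assumption)
    end: int    # offset one past the last character of the name (assumption)


def validate(usage: NameUsage, publication_text: str) -> bool:
    """Check that the anchored span of the text really contains the name."""
    return publication_text[usage.begin:usage.end] == usage.name


# Invented text and citation, for illustration only.
TEXT = "The zebrafish, Danio rerio, is a model organism."
CITATION = Citation(title="Hypothetical publication", year=2000)

start = TEXT.index("Danio rerio")
USAGE = NameUsage("Danio rerio", CITATION, start, start + len("Danio rerio"))
SHIFTED = NameUsage("Danio rerio", CITATION, start + 1, start + 1 + len("Danio rerio"))
```

Here `validate(USAGE, TEXT)` succeeds and `validate(SHIFTED, TEXT)` fails, which is the sense in which anchoring to the actual publication enables validation.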
Remaining challenges with taxonomic data
• Taxa are concepts created in literature
• Physical instances of the same published work are “equivalent”
• Develop shared logical identifiers
• Reconciliation across “authoritative” databases; fewer “same as” records
Recap
• Taxonomic names are key to
– Information retrieval
– Information summary and grouping
– Publication metadata are critical to anchoring taxonomic concepts, and
– Providing the semantic touchstones for collaboration (critical)
• Occurrence data give us species distributions
– Direct relationship to literature is small
– But taxonomy is critical to integrating occurrence data, so the literature is still fundamental
What’s next?
What other classes of information remain in the literature … that could be extracted and structured to be really useful?
Genetic and genomic data?
…are not communicated or stored in the literature
Genetic/Genomic data
A Model Organism
Danio rerio
the zebrafish
Understanding the origins of species through structured descriptions of diversity
[Diagram labels: Genotype A, Genotype B (Genomic Diversity; mutation); Development; Phenotype A, Phenotype B (Morphological Diversity; evolution)]
Morphological variation across species is difficult to find and synthesize
Information retrieval from free-text is difficult
(Lundberg and Akama 2005)
Not computable across studies
What is an ontology?
• A set of well-defined terms and the logical relationships that hold between them
• Represents knowledge of a discipline
Teleost Anatomy Ontology terms and relationships
[Diagram: relationship types: is_a, part_of, develops_from. Terms: basihyal bone, ventral hyoid arch, basihyal cartilage, pharyngeal arch cartilage, basihyal element, replacement bone]
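The terms and relationship types named on this slide can be sketched as a small set of (subject, relation, object) triples that a program can query. The specific edges below are an assumption about how the diagram connects the terms, offered for illustration rather than as a copy of the ontology; the helper functions `related` and `reachable` are hypothetical.

```python
from collections import deque

# Assumed edges between the slide's terms; illustrative, not authoritative.
EDGES = [
    ("basihyal bone", "is_a", "replacement bone"),
    ("basihyal bone", "is_a", "basihyal element"),
    ("basihyal bone", "part_of", "ventral hyoid arch"),
    ("basihyal bone", "develops_from", "basihyal cartilage"),
    ("basihyal cartilage", "is_a", "pharyngeal arch cartilage"),
    ("basihyal cartilage", "is_a", "basihyal element"),
    ("basihyal cartilage", "part_of", "ventral hyoid arch"),
]


def related(term: str, relation: str) -> list:
    """Terms directly linked from `term` by `relation`, sorted by name."""
    return sorted(obj for subj, rel, obj in EDGES
                  if subj == term and rel == relation)


def reachable(term: str, relations: set) -> set:
    """All terms reachable from `term` by following only the given relations."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for subj, rel, obj in EDGES:
            if subj == current and rel in relations and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen


BONE_PARENTS = related("basihyal bone", "is_a")
BONE_LINEAGE = reachable("basihyal bone", {"is_a", "develops_from"})
```

Because the relationships are explicit, a query such as "everything the basihyal bone is, or develops from" becomes a graph traversal instead of a free-text search.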
Ontologies quickly become large and complex; guiding philosophy required
Dahdul et al., 2010, Systematic Biology
The Teleost Anatomy Ontology contains 3,039 terms, with >600 skeletal terms
Fig. 1, Washington et al., 2010
Translational medicine
Translation from model organisms to humans
Phenoscape II & Research Coordination Network (RCN)
• Extended to include other model organisms and taxonomic groups, e.g.:
– Amphibian Anatomy Ontology (AAO) – Blackburn, CAS
– Hymenoptera Anatomy Ontology (HAO) – Deans, NCSU
– Plant Ontology – Huala, Stanford
• NLP and term extraction (Hong Cui, Univ of Arizona)
What’s next?
• Description of biological phenomena
• Determining how best to do this will take time
• Top-down design, guided by functional demonstration
• Bottom-up curation of existing descriptions into structured knowledge through iteration