EOL Informatics Team - Open Grid ForumOGF+Feb+2008.pdf · EOL Informatics Team • 21st century...

EOL Informatics Team


The Encyclopedia of Life: A Web Page for Every Species

Jennifer M. Schopf
Systems Architect

Marine Biological Laboratory


The Problem

• 21st century biology: From Microscope to Macroscope
• Genome, protein databases led to revolution in biology
  – Data freely and openly available, assimilated from many sources
  – New applications in education, research, commerce, medicine
  – Biology fast becoming an information-driven science
• Databases for organism biology less well developed
  – Individual species banks exist, but all in different formats, styles
• General public shows avid interest in biological information
  – But hard to find, sort authoritative information in ever vaster Web
• Newly developing countries often lack access to literature and specimens


The Solution: Encyclopedia of Life

• A web page for each of Earth’s known species
  – Estimated to be 1.8 million validly known species
• Each page contains:
  – Introductory page for general public
    • Vetted by experts
  – Additional information for diverse user groups
    • Molecular & evolutionary biologists
    • Taxonomists
    • Horticulturists, bird watchers
    • Biodiversity-based industries (fisheries)
    • School children, teachers, citizen scientists


Outline

• Vision
• How have we implemented this
• Need for standards


Where did the idea come from?

• Many people have suggested an Encyclopedia of Life, e.g.
  – TDWG: Global Plant Species Information System (1990)
  – Dan Janzen: Species Pages (early 1990s)
  – OECD Megascience Forum: SpeciesBank (1999)
  – ALL Species: Encyclopedia of Life (2001)
  – Smithsonian/Telluride: Encyclopedia of Life (2002)
  – Rainer Froese: SpeciesBase (2005)
• Current EOL derives from a paper by E.O. Wilson in 2003


E O Wilson articulated a need for EOL in 2003


E.O. Wilson’s Vision

• Imagine an electronic page for each species of organism on Earth, available everywhere by single access on command. The page contains the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits. The page opens out directly or by linkage with other databases such as ARKive, Ecoport, and GenBank. It comprises a summary of everything known about the species’ genome, proteome, geographic distribution, phylogenetic position, habitat, ecological relationships, and, not least, its perceived practical importance for humanity.




TED 2007 Wish


EOL – Why Now?

• Technology can enable it
  – Linking of databases is now relatively simple
  – Aggregation assembles dynamic, fast, low-cost draft pages
  – Wiki allows vetting and continuous improvement by experts
• “Species name” is a field common to virtually all biological databases
  – So it can be used to link these databases together
• Many specialized databases to work with
  – Catalogue of Life, FishBase, AmphibiaWeb, Ocean Biogeographic Information System, Barcode of Life Database, GBIF
• Capable institutions interested in making it happen
• Foundations willing to fund initial phase
• Great community & user interest in the idea


How are we doing this?

• Content
• Aggregation
• Smarts


Content

• We don’t do content – our data providers do
  – Tree of Life web project
  – Species 2000 & ITIS Catalogue of Life
  – GBIF
  – Biopix
  – FishBase
• These provide us with databases (and sometimes Web services) from which we pull existing, expert-vetted data
• There are thousands of these databases, at varying levels of sophistication


More Content

• Another EOL group is scanning in literature
  – Post-copyright or with publisher’s permissions
• We also scan RSS feeds for literature links
  – Part of the uBio project (MBL)


Aggregation

• We don’t use just one of those on a page
• We take advantage of as many different sources as we can
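As a rough illustration of what combining many sources on one page could look like, the sketch below groups data objects from several providers under a shared species name. The provider names come from the Content slide; the record layout, field names and values are hypothetical placeholders, not EOL's actual code or real provider data.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative only.
records = [
    {"provider": "FishBase", "name": "Chromis chromis", "kind": "description", "value": "placeholder text"},
    {"provider": "GBIF", "name": "Chromis chromis", "kind": "map", "value": "placeholder map"},
    {"provider": "Biopix", "name": "Chromis chromis", "kind": "image", "value": "placeholder image"},
    {"provider": "GBIF", "name": "Limulus polyphemus", "kind": "map", "value": "placeholder map"},
]


def aggregate(records):
    """Group data objects from all providers under each species name."""
    pages = defaultdict(lambda: defaultdict(list))
    for rec in records:
        pages[rec["name"]][rec["kind"]].append((rec["provider"], rec["value"]))
    return pages


pages = aggregate(records)
# Collect every provider that contributed to one species page.
providers = sorted({p for items in pages["Chromis chromis"].values() for p, _ in items})
print(providers)
```

The species name is the only join key here, which is why the naming problems discussed later in the talk matter so much.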


Smarts

• Not yet – but soon…
  – Tagging
  – “If you like this species you might also like…”
  – Semantic data to enable more interesting searches


Contributions through WorkBench

• Software to help contribute, mobilise, visualize, analyze, annotate
• First release in 3 months
  – David Shorthouse overseeing
• Content authoring, association, review
• Communication
• Managing metadata – ontologies
  – Adding to a character ontology vs hierarchy matrix
• Personal and communal environments (myEOL)


Where we started

Aggregation intercepted / interpreted source code, database queries, RSS feeds and other APIs to collect data and merged them into new pages

The early uBio aggregation portal


Workflow for a Species Page

• Species pages group selects data partners
• Data partners & data objects registered within Fedora Commons after agreements are reached
• Metadata normalized
• Data enhanced with semantic & taxonomic intelligence
• Data from different sources brought together (aggregated)
  – Index: DataObjectID, PartnerID, Verbatim name, UnionID, Metadata
  – Matrix: atomised data
• Data available to the WorkBench, where it may be manipulated
• Selected and organized data passed to templates of species pages
• Page of information on a species visible
• Secretariat (or delegates)
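The Index fields named in the workflow (DataObjectID, PartnerID, Verbatim name, UnionID, Metadata) can be pictured as one record per data object. The sketch below is only an assumption about how such a record might be modelled; the class name, field types and sample values are hypothetical and not taken from the EOL implementation.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """One row of a hypothetical aggregation index."""
    data_object_id: str   # identifies the image, text, map, etc.
    partner_id: str       # which data partner supplied the object
    verbatim_name: str    # the name exactly as the partner recorded it
    union_id: int         # reconciliation group the name belongs to
    metadata: dict = field(default_factory=dict)


# Illustrative values only; the grouping reflects one possible opinion.
entries = [
    IndexEntry("obj-1", "partner-a", "Gorilla beringei", 1, {"type": "image"}),
    IndexEntry("obj-2", "partner-b", "Mountain gorilla", 1, {"type": "text"}),
    IndexEntry("obj-3", "partner-a", "Limulus polyphemus", 2, {"type": "map"}),
]


def objects_for_union(entries, union_id):
    """Return the data objects attached to one reconciliation group."""
    return [e.data_object_id for e in entries if e.union_id == union_id]


print(objects_for_union(entries, 1))
```

Keeping the verbatim name alongside the union identifier means the partner's original wording is never lost when names are grouped.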


Chromis: Species Page

Overview: 4 areas on each page:
1) Common and species name
  – New discovery (Jan ‘08)
2) Media Panel
  – Images, maps and videos
3) Classification viewer
4) Content
  – Accessed via the table of contents

http://www.eol.org/taxa/17290368


Green Anole


Burying Beetle


Home Page

Search box
Set of random species
Featured example
One out of 25 exemplars
Popular – most watched species

http://www.eol.org/


Standards

• I’m here to talk about the need for standards!


Standards in EOL

• Names (Indexing)
• Metadata
• Schemas (mapping)
• Exchange protocols
• Data formats
• Textual categorization
• APIs


Names are universal tags for species

Aapaleacea
Limulus polyphemus
Kiwa hirsuta
Osedax frankpressi
Kingia australis
Pieris japonica
Pieris rapae
Trypanosoma brucei
Homo sapiens


First Standard - Names

• Names offered a logical way to search for and index content
• Names annotate data objects
• All names and name surrogates annotate all data objects
• A compilation of all names ever used is the foundation of a universal index for biology
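One way to picture names annotating data objects is an inverted index from each name to the objects that carry it. The sketch below is a minimal illustration under that assumption; the object identifiers and the case-folding rule are hypothetical, not a description of the uBio or EOL index.

```python
from collections import defaultdict


def build_name_index(annotations):
    """Map each name (case-insensitively) to the data objects annotated with it."""
    index = defaultdict(set)
    for object_id, names in annotations.items():
        for name in names:
            index[name.casefold()].add(object_id)
    return index


# Hypothetical data objects annotated with scientific and common names.
annotations = {
    "obj-1": ["Pieris rapae", "Cabbage white"],
    "obj-2": ["Pieris japonica"],
    "obj-3": ["pieris rapae"],
}

index = build_name_index(annotations)
print(sorted(index["pieris rapae"]))
```

A lookup by any name, scientific or common, then returns every object annotated with it.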


However

• Names of organisms change over time

• E-biology has to deal with the cumulative changes


Same name is recorded differently

• Our automated names finding tools / search engines / indexes needed to be aware of all of them


Names are misspelled

Became…
Loligo pealeii
Loligo pealii
Loligo pealei
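A name-finding tool can tolerate spelling variants like these with approximate string matching. The sketch below uses Python's standard `difflib` for this; it is a stand-in technique, not the method used by the EOL or uBio tools, and treating "Loligo pealeii" as the reference spelling is an assumption made for the example, as is the 0.85 cutoff.

```python
import difflib

# Assumed list of reference spellings; the names come from the slides.
known_names = ["Loligo pealeii", "Limulus polyphemus", "Kiwa hirsuta"]


def closest_known_name(candidate, known=known_names, cutoff=0.85):
    """Return the most similar known name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for variant in ["Loligo pealii", "Loligo pealei", "Homo sapiens"]:
    print(variant, "->", closest_known_name(variant))
```

A high cutoff keeps unrelated names from being merged, at the cost of missing more heavily garbled spellings.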


Polysemes: One name, more than one meaning

Peranema– the fern

Peranema– the euglenid



Polysemes: One name, more than one meaning

Aotus trivirgatus
Aotus Illiger 1811
Aotus
Aotus Smith 1805
Aotus ericoides

Resolve with intelligent disambiguation: authority, species, contextual data

Contextual data: Primate, Monkey, Eyes, Food, Panama, Aotus nancymaae

Contextual data: legume, plant, flower, Mirbelieae, Australia, Aotus mollis


Many Names for One Species

Koko
Горилла
Guerilla
Eastern Lowland Gorilla
Gorilla graueri
Gorilla berengei
Gorilla beringei Matschie
Gorilla beringei mikenensis
King kong
Gorilla gorilla
Virunga
Gorila
Gorille
Mountain gorilla
大猩猩
ゴリラ


This is hard for more than just software builders

Libraries
Publishers
Museums
Federal Agencies
Search engines
Federated databases
Students and researchers

Red spotted newt
COML


Effect in software

One organism
– 11 common names
– 4 scientific names
– 4 maps!

The early adoption of Taxonomic Intelligence ensured OBIS was the first federated environment to address this


Disambiguation: Distinguishing Polysemes

• Rulesets that allowed automated tools to discriminate the euglenid from the fern
• Authority
  – Peranema Don (the fern) vs Peranema Dujardin (the euglenid)
• Co-occurrence of disambiguating terms – taxonomic context
  – Peranema and Pteridophyta vs Peranema and Euglenida
  – Peranema trichophorum, or Peranema, Anisonema and Urceolus
• Taxonomic Intelligence tools
  – Plain language dictionary filters
  – Ambiguity dictionaries

Peranima
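The two rule types on this slide, authority and co-occurring context terms, can be sketched as a small scoring function. Everything below is an illustrative assumption: the rule table layout, the weights and the function name are hypothetical, and only the Peranema terms themselves come from the slide.

```python
# Hypothetical ruleset: each sense lists an authority and context terms (lowercase).
RULES = {
    "Peranema": [
        {"sense": "fern", "authority": "Don", "context": {"pteridophyta"}},
        {"sense": "euglenid", "authority": "Dujardin",
         "context": {"euglenida", "peranema trichophorum", "anisonema", "urceolus"}},
    ],
}


def disambiguate(name, text):
    """Pick the sense supported by authority or context terms, or None if unclear."""
    lowered = text.casefold()
    scores = {}
    for rule in RULES.get(name, []):
        # One point per co-occurring context term found in the text.
        score = sum(term in lowered for term in rule["context"])
        # An explicit authority after the name is weighted more heavily (assumed weight).
        if f"{name} {rule['authority']}".casefold() in lowered:
            score += 2
        scores[rule["sense"]] = score
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None
    # A tie between senses is treated as unresolved.
    if list(scores.values()).count(scores[best]) > 1:
        return None
    return best


print(disambiguate("Peranema", "Peranema, Anisonema and Urceolus in one sample"))
print(disambiguate("Peranema", "Peranema is listed under Pteridophyta"))
```

Returning `None` on a tie or on missing evidence leaves the hard cases for a human or for a richer ruleset rather than guessing.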


Reconciliation

• Linking alternative names for the same organism

• A query initiated with any name expands to all names and unified data associated with each

• Using GBIF’s universal biodiversity data bus to normalize data from multiple sources

Union concept: a reconciliation group that holds all names and labels used (in someone’s opinion) for an entity (converts names to taxonomic concepts)

http://uio.mbl.edu/SOAPbrowser/index.php?func=name_detail&ubioID=12294
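Query expansion over a union concept can be sketched as a lookup from any name to its reconciliation group. The class below is a minimal, hypothetical model: the class and method names, the union identifier and the choice of which gorilla names share a group are all assumptions for illustration, not the uBio data model.

```python
class UnionIndex:
    """Hypothetical reconciliation index: every name maps to one union group."""

    def __init__(self):
        self._union_of = {}  # case-folded name -> union id
        self._names = {}     # union id -> set of names as recorded

    def add_group(self, union_id, names):
        group = self._names.setdefault(union_id, set())
        for name in names:
            group.add(name)
            self._union_of[name.casefold()] = union_id

    def expand(self, name):
        """Expand a query name to all names in its group (or itself if unknown)."""
        union_id = self._union_of.get(name.casefold())
        if union_id is None:
            return {name}
        return set(self._names[union_id])


unions = UnionIndex()
# Illustrative grouping only; a union reflects someone's taxonomic opinion.
unions.add_group(1, ["Gorilla beringei", "Mountain gorilla", "Gorila"])

print(sorted(unions.expand("mountain gorilla")))
```

A search started with any one of the names can then be run against every name in the group.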


Reconciliation 2

• Reconciliation has made all this possible
  – Standard names won’t happen, so we’ve got a way to group instead
• This is a first step towards GUIDs for species


Standards 2: Metadata

• What’s out there
  – ABCD
  – Dublin Core
  – Darwin Core – also part of TDWG


ABCD: Access to Biological Collections Data

• Part of Biodiversity Information Standards
  – TDWG (Taxonomic Databases Working Group)
  – We’re institutional partners
• Has over 200 elements
  – Not in use yet, too cumbersome

http://www.tdwg.org/activities/abcd/


Dublin Core

• 15 metadata elements:

1. Title
2. Creator
3. Subject
4. Description
5. Publisher
6. Contributor
7. Date
8. Type
9. Format
10. Identifier
11. Source
12. Language
13. Relation
14. Coverage
15. Rights
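To show how small a Dublin Core description can be, the sketch below serializes a few of the 15 elements as XML using the standard Dublin Core element namespace. The `record` wrapper element and the sample values are assumptions for illustration; only the element names and the species page URL come from the slides.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def dublin_core_xml(record):
    """Serialize a dict of Dublin Core elements; reject unknown element names."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # wrapper element name is an assumption
    for key, value in record.items():
        if key not in DC_ELEMENTS:
            raise KeyError(f"not a Dublin Core element: {key}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{key}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = dublin_core_xml({
    "title": "Chromis species page",  # illustrative value
    "identifier": "http://www.eol.org/taxa/17290368",
    "language": "en",
})
print(xml_text)
```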


Darwin Core

• Information about where and what a sample is
  – Collection type, institution, collector, etc.
  – Genus and species designation, plus vernacular and scientific names
  – Locality (continent, map coords, depth of water found, etc.)
• Also part of TDWG, some uptake

http://wiki.tdwg.org/twiki/bin/view/DarwinCore/DarwinCoreDraftStandard


Metadata

• What’s needed
  – Simple, accepted standards
• What’s next
  – We’re working with data providers to see what makes sense pragmatically


Standards 3: Schemas

• What’s out there
  – Primarily mappings to XML, part of metadata standards
• What we use
  – XML
• What’s needed
  – RDF for semantic use as well


Standards 4: Exchange protocols

• What’s out there
  – TDWG Access Protocol for Information Retrieval (TAPIR)
  – Distributed Generic Information Retrieval (DiGIR)
    • Open-source PHP-based package that translates columns of data in a database (MySQL, SQL Server, Oracle) into Darwin Core fields
  – Both look at data movement and data translation
  – Both need dedicated resources
• What we use
  – Neither of these is mature yet; we’re evaluating for next release
• What’s needed
  – Reliable, stable, implementable standards with broad appeal – what else?
  – Some groups (Nearctic Spider Database) are looking at things like Google Spreadsheets as an option
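The column-to-field translation that DiGIR performs can be pictured as a simple mapping table. The sketch below is written in Python rather than DiGIR's PHP and does not reproduce DiGIR's configuration format; the database column names are hypothetical, and the target names follow current Darwin Core terms, which may differ from the draft standard in use at the time of this talk.

```python
# Hypothetical provider columns mapped to Darwin Core term names.
COLUMN_TO_DWC = {
    "sci_name": "scientificName",
    "common_name": "vernacularName",
    "institution": "institutionCode",
    "collector": "recordedBy",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}


def to_darwin_core(row):
    """Rename mapped columns, dropping unmapped columns and empty values."""
    return {
        COLUMN_TO_DWC[col]: val
        for col, val in row.items()
        if col in COLUMN_TO_DWC and val is not None
    }


# Illustrative database row; "internal_id" has no Darwin Core mapping here.
row = {
    "sci_name": "Limulus polyphemus",
    "common_name": "Atlantic horseshoe crab",
    "institution": "XYZ",
    "lat": None,
    "internal_id": 42,
}
print(to_darwin_core(row))
```

Each provider would need its own mapping table, which is part of why both protocols need dedicated resources on the provider side.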


Standards 5: Formats for video, photos, etc.

• What’s out there
  – Everything!
• What we use
  – Mostly what we’re given
• What’s needed
  – Some basic choice to make our life easier
    • For example we made all movies into Flash
• What’s next
  – Coming soon – more media types!


Standards 6: Textual categorization

• What’s out there
  – Species Profile Model
  – http://wiki.tdwg.org/SPM
  – Started April 2007
• What we use
  – Nothing yet
• What’s needed
  – Wiki has no structure, so something soon please

http://wiki.tdwg.org/SPM


Standards 7: APIs

• What’s out there
  – Nothing
• What we use
  – We’ve home-grown all our own Web service APIs – SOAP and XML, not even WSDL
• What’s needed
  – Everything!


Standards – what’s next

• We’re working with the leaders in the field (esp. GBIF) to come to consensus
• TDWG standards body moves slowly
• We’ll put pragmatic solutions into place and replace with standards as we can


Status of EOL

• Site went live 1am Feb 26
  – Between 9am and 3pm served over 11.5 million hits! (note: hits not users ☺)


For more information

Jennifer Schopf– [email protected]

http://www.eol.org
