EOL Informatics Team - Open Grid Forum, OGF Feb 2008
EOL Informatics Team
The Encyclopedia of Life: A Web Page for Every Species
Jennifer M. Schopf, Systems Architect
Marine Biological Laboratory
The Problem
• 21st century biology: From Microscope to Macroscope
• Genome, protein databases led to revolution in biology
  – Data freely and openly available, assimilated from many sources
  – New applications in education, research, commerce, medicine
  – Biology fast becoming an information-driven science
• Databases for organism biology less well developed
  – Individual species banks exist, but all in different formats, styles
• General public shows avid interest in biological information
  – But hard to find, sort authoritative information in ever vaster Web
• Newly developing countries often lack access to literature and specimens
The Solution: Encyclopedia of Life
• A web page for each of Earth’s known species
  – Estimated to be 1.8 million validly known species
• Each page contains:
  – Introductory page for general public
    • Vetted by experts
  – Additional information for diverse user groups
    • Molecular & evolutionary biologists
    • Taxonomists
    • Horticulturists, bird watchers
    • Biodiversity-based industries (fisheries)
    • School children, teachers, citizen scientists
Outline
• Vision
• How we have implemented this
• Need for standards
Where did the idea come from?
• Many people have suggested an Encyclopedia of Life, e.g.
  – TDWG: Global Plant Species Information System (1990)
  – Dan Janzen: Species Pages (early 1990s)
  – OECD Megascience Forum: SpeciesBank (1999)
  – ALL Species: Encyclopedia of Life (2001)
  – Smithsonian/Telluride: Encyclopedia of Life (2002)
  – Rainer Froese: SpeciesBase (2005)
• Current EOL derives from a paper by E.O. Wilson in 2003
E.O. Wilson articulated a need for EOL in 2003
E.O. Wilson’s Vision
• Imagine an electronic page for each species of organism on Earth, available everywhere by single access on command. The page contains the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits. The page opens out directly or by linkage with other databases such as ARKive, Ecoport, and GenBank. It comprises a summary of everything known about the species’ genome, proteome, geographic distribution, phylogenetic position, habitat, ecological relationships, and, not least, its perceived practical importance for humanity.
TED 2007 Wish
EOL – Why Now?
• Technology can enable it
  – Linking of databases is now relatively simple
  – Aggregation assembles dynamic, fast, low-cost draft pages
  – Wiki allows vetting and continuous improvement by experts
• “Species name” is a field common to virtually all biological databases
  – So it can be used to link these databases together
• Many specialized databases to work with
  – Catalogue of Life, FishBase, AmphibiaWeb, Ocean Biogeographic Information System, Barcode of Life Database, GBIF
• Capable institutions interested in making it happen
• Foundations willing to fund initial phase
• Great community & user interest in the idea
How are we doing this?
• Content
• Aggregation
• Smarts
Content
• We don’t do content – our data providers do
  – Tree of Life web project
  – Species 2000 & ITIS Catalogue of Life
  – GBIF
  – Biopix
  – FishBase
• These provide us with databases (and sometimes Web services) from which to pull existing, expert-vetted data
• There are 1,000s of these databases of varying levels of sophistication
More Content
• Another EOL group is scanning in literature
  – Post-copyright or with publisher’s permissions
• We also scan RSS feeds for literature links
  – Part of the uBio project (MBL)
Aggregation
• We don’t use just one of those on a page
• We take advantage of as many different sources as we can
Smarts
• Not yet – but soon…
  – Tagging
  – “If you like this species you might also like…”
  – Semantic data to enable more interesting searches
Contributions through WorkBench
• Software to help contribute, mobilise, visualize, analyze, annotate
• First release in 3 months
  – David Shorthouse overseeing
• Content authoring, association, review
• Communication
• Managing metadata – ontologies
  – Adding to a character ontology vs hierarchy matrix
• Personal and communal environments (myEOL)
Where we started
The early uBio aggregation portal
• Aggregation intercepted / interpreted source code, database queries, RSS feeds and other APIs to collect data, then merged the results into new pages
Workflow for a Species Page
• Species pages group selects data partners
• Data partners & data objects registered within Fedora Commons after agreements are reached
• Metadata normalized
• Data enhanced with semantic & taxonomic intelligence
• Data from different sources brought together – aggregated
  – Index: DataObjectID, PartnerID, Verbatim name, UnionID, Metadata
  – Matrix: atomised data
• Data available to the WorkBench, where it may be manipulated
• Selected and organized data passed to templates of species pages
• Page of information on a species visible
Secretariat (or delegates)
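The index fields named in the workflow (DataObjectID, PartnerID, verbatim name, UnionID, metadata) can be pictured as one record per data object. The following is a minimal sketch only: the field types, the `IndexEntry` class, the `aggregate` helper and the sample IDs are all assumptions made for illustration, not EOL's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """One row of the aggregation index; field types are assumptions."""
    data_object_id: str
    partner_id: str
    verbatim_name: str   # name exactly as the partner supplied it
    union_id: str        # reconciliation group the name belongs to
    metadata: dict = field(default_factory=dict)


def aggregate(entries):
    """Group index entries from different partners by UnionID."""
    pages = defaultdict(list)
    for entry in entries:
        pages[entry.union_id].append(entry)
    return dict(pages)


# Hypothetical entries: two partners describe the same species.
entries = [
    IndexEntry("obj-1", "partner-a", "Loligo pealeii", "union-42", {"type": "image"}),
    IndexEntry("obj-2", "partner-b", "Loligo pealei", "union-42", {"type": "text"}),
    IndexEntry("obj-3", "partner-a", "Homo sapiens", "union-7"),
]
pages = aggregate(entries)
```

Grouping on the union ID rather than on the verbatim name is what lets objects filed under different spellings land on the same page.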
Chromis: Species Page
Overview: 4 areas on each page:
1) Common and species name
   - New discovery (Jan ‘08)
2) Media Panel
   - Images, maps and videos
3) Classification viewer
4) Content
   - Accessed via the table of contents
Green Anole
Burying Beetle
Home Page
• Search box
• Set of random species
• Featured example
  – One out of 25 Exemplars
• Popular – most watched species
Standards
• I’m here to talk about the need for standards!
Standards in EOL
• Names (Indexing)
• Metadata
• Schemas (mapping)
• Exchange protocols
• Data formats
• Textual categorization
• APIs
Names are universal tags for species
• Aapaleacea
• Limulus polyphemus
• Kiwa hirsuta
• Osedax frankpressi
• Kingia australis
• Pieris japonica
• Pieris rapae
• Trypanosoma brucei
• Homo sapiens
First Standard - Names
• Names offered a logical way to search for and index content
• Names annotate data objects
• All names and name surrogates annotate all data objects
• A compilation of all names ever used is the foundation of a universal index for biology
However
• Names of organisms change over time
• E-biology has to deal with the cumulative changes
Same name is recorded differently
• Our automated name-finding tools / search engines / indexes needed to be aware of all of them
Names are misspelled
Became…
• Loligo pealeii
• Loligo pealii
• Loligo pealei
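One way a name-finding tool might cope with spelling variants like these is approximate string matching against a list of known names. The sketch below uses Python's standard `difflib`. The lexicon, the choice of which spelling serves as the reference, and the 0.85 cutoff are assumptions for illustration, not the tools described in the talk.

```python
import difflib

# Small illustrative lexicon; which spelling is the reference is an assumption.
KNOWN_NAMES = ["Loligo pealeii", "Limulus polyphemus", "Homo sapiens"]


def match_name(candidate, known=KNOWN_NAMES, cutoff=0.85):
    """Return the closest known name, or None if nothing is similar enough."""
    hits = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return hits[0] if hits else None


# Both variant spellings from the slide resolve to the same reference name.
variants = [match_name("Loligo pealii"), match_name("Loligo pealei")]
```

A cutoff that is too low would start merging genuinely different names, so in practice it would need tuning against real data.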
Polysemes: One name, more than one meaning
• Peranema – the fern
• Peranema – the euglenid
Polysemes: One name, more than one meaning
• Aotus
  – Aotus Illiger 1811 (Aotus trivirgatus)
  – Aotus Smith 1805 (Aotus ericoides)
• Resolve with intelligent disambiguation: authority, species, contextual data
  – Contextual data: Primate, Monkey, Eyes, Food, Panama, Aotus nancymaae
  – Contextual data: legume, plant, flower, Mirbelieae, Australia, Aotus mollis
Many Names for One Species
• Koko
• Горилла
• Guerilla
• Eastern Lowland Gorilla
• Gorilla graueri
• Gorilla berengei
• Gorilla beringei Matschie
• Gorilla beringei mikenensis
• King kong
• Gorilla gorilla
• Virunga
• Gorila
• Gorille
• Mountain gorilla
• 大猩猩
• ゴリラ
This is hard for more than just software builders
• Libraries
• Publishers
• Museums
• Federal Agencies
• Search engines
• Federated databases
• Students and researchers
Red spotted newt
COML
Effect in software
One organism
– 11 common names
– 4 scientific names
– 4 maps!
The early adoption of Taxonomic Intelligence ensured OBIS was the first federated environment to address this
Disambiguation: Distinguishing Polysemes
• Rulesets that allowed automated tools to discriminate the euglenid from the fern
• Authority
  – Peranema Dons (the fern) vs Peranema Dujardin (the euglenid)
• Co-occurrence of disambiguating terms – taxonomic context
  – Peranema and Pteridophyta vs Peranema and Euglenida
  – Peranema trichophorum, or Peranema, Anisonema and Urceolus
• Taxonomic Intelligence tools
  – Plain language dictionary filters
  – Ambiguity dictionaries
Peranima
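The authority and co-occurrence rules above can be sketched as a small scoring function. This is an illustration under stated assumptions: the `AMBIGUITY` dictionary, its context terms beyond those on the slide, and the weight given to an authority match are all hypothetical, and the real Taxonomic Intelligence rulesets are not described in enough detail here to reproduce.

```python
# Hypothetical ambiguity dictionary: each homonym maps to its candidate
# meanings, with the authority and context terms that point to each one.
# Authority strings are copied as written on the slide.
AMBIGUITY = {
    "Peranema": {
        "fern": {"authority": "Dons", "context": {"pteridophyta", "fern"}},
        "euglenid": {
            "authority": "Dujardin",
            "context": {"euglenida", "trichophorum", "anisonema", "urceolus"},
        },
    }
}


def disambiguate(name, text):
    """Pick the meaning whose authority or context terms occur in the text."""
    senses = AMBIGUITY.get(name)
    if senses is None:
        return None  # not a known homonym
    words = {w.strip(".,;()").lower() for w in text.split()}
    scores = {}
    for sense, clues in senses.items():
        score = len(clues["context"] & words)
        if clues["authority"].lower() in words:
            score += 2  # weighting an explicit authority higher is an assumption
        scores[sense] = score
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For example, `disambiguate("Peranema", "Peranema, Anisonema and Urceolus")` selects the euglenid, while text with no disambiguating terms returns `None` rather than a guess.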
Reconciliation
• Linking alternative names for the same organism
• A query initiated with any name expands to all names and unified data associated with each
• Using GBIF’s universal biodiversity data bus to normalize data from multiple sources
Union concept: a reconciliation group that holds all names and labels used (in someone’s opinion) for an entity (converts names to taxonomic concepts)
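The union concept can be illustrated with a lookup that maps any member name to its group and then expands a query to every name in that group. A minimal sketch, assuming made-up union IDs and groupings built from names shown on earlier slides. It does not use GBIF's data bus or any real service.

```python
# Hypothetical reconciliation groups ("unions"): every name or label that,
# in someone's opinion, refers to the same entity. IDs are illustrative.
UNIONS = {
    "union-1": {"Loligo pealeii", "Loligo pealii", "Loligo pealei"},
    "union-2": {"Gorilla gorilla", "Gorila", "Gorille"},
}


def build_lookup(unions):
    """Map each case-folded name to the union ID that contains it."""
    lookup = {}
    for union_id, names in unions.items():
        for name in names:
            lookup[name.casefold()] = union_id
    return lookup


def expand(name, unions, lookup):
    """Expand a query on any one name to every name in its union."""
    union_id = lookup.get(name.casefold())
    return sorted(unions[union_id]) if union_id else [name]


LOOKUP = build_lookup(UNIONS)
all_names = expand("loligo pealii", UNIONS, LOOKUP)
```

A name that belongs to no union is passed through unchanged, so the query still runs with the name as given.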
Reconciliation 2
• Reconciliation has made all this possible
  – Standard names won’t happen, so we’ve got a way to group instead
• This is a first step towards GUIDs for species
Standards 2: Metadata
• What’s out there
  – ABCD
  – Dublin Core
  – Darwin Core – also part of TDWG
ABCD: Access to Biological Collections Data
• Part of Biodiversity Information Standards
  – TDWG (Taxonomic Databases Working Group)
  – We’re institutional partners
• Has over 200 elements
  – Not in use yet, too cumbersome
http://www.tdwg.org/activities/abcd/
Dublin Core
• 15 metadata elements:
1. Title
2. Creator
3. Subject
4. Description
5. Publisher
6. Contributor
7. Date
8. Type
9. Format
10. Identifier
11. Source
12. Language
13. Relation
14. Coverage
15. Rights
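To make the 15 elements concrete, here is a minimal sketch that checks a record against the element set and serializes it to XML using the standard Dublin Core element namespace. The `record` wrapper element, the `to_dublin_core_xml` helper and the sample values are assumptions for illustration, not an EOL format.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def to_dublin_core_xml(record):
    """Serialize a dict of Dublin Core elements to an XML string."""
    unknown = set(record) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # wrapper element name is an assumption
    for name, value in record.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical metadata for one image data object.
xml_text = to_dublin_core_xml({
    "title": "Limulus polyphemus",
    "type": "StillImage",
    "rights": "CC BY",
})
```

Every element is optional and repeatable in Dublin Core, which is part of why it is simple to adopt. This sketch only handles one value per element.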
Darwin Core
• Information about where and what a sample is
  – Collection type, institution, collector, etc.
  – Genus and species designation, plus vernacular and scientific names
  – Locality (continent, map coords, depth of water found, etc.)
• Also part of TDWG, some uptake
http://wiki.tdwg.org/twiki/bin/view/DarwinCore/DarwinCoreDraftStandard
Metadata
• What’s needed
  – Simple, accepted standards
• What’s next
  – We’re working with data providers to see what makes sense pragmatically
Standards 3: Schemas
• What’s out there
  – Primarily mappings to XML, part of metadata standards
• What we use
  – XML
• What’s needed
  – RDF for semantic use as well
Standards 4: Exchange protocols
• What’s out there
  – TDWG Access Protocol for Information Retrieval (TAPIR)
  – Distributed Generic Information Retrieval (DiGIR)
    • Open-source PHP-based package that translates columns of data in a database (MySQL, SQL Server, Oracle) into Darwin Core fields
  – Both look at data movement and data translation
  – Both need dedicated resources
• What we use
  – Neither of these is mature yet; we’re evaluating for next release
• What’s needed
  – Reliable, stable, implementable standards with broad appeal - what else?
  – Some groups (Nearctic Spider Database) are looking at things like Google Spreadsheets as an option
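The column-to-field translation described for DiGIR can be pictured as a simple rename step. This sketch is not DiGIR code (DiGIR is a PHP package). The provider column names are invented, and the Darwin Core term names follow the later published standard, so they are an assumption relative to the 2008 draft.

```python
# Hypothetical mapping from a provider's column names to Darwin Core terms.
# Term names follow later published Darwin Core and are an assumption here.
COLUMN_TO_DWC = {
    "sci_name": "scientificName",
    "common_name": "vernacularName",
    "inst": "institutionCode",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}


def to_darwin_core(row, mapping=COLUMN_TO_DWC):
    """Rename mapped columns to Darwin Core terms and drop unmapped ones."""
    return {mapping[col]: value for col, value in row.items() if col in mapping}


# A made-up database row with one column that has no Darwin Core mapping.
row = {"sci_name": "Kiwa hirsuta", "lat": -38.0, "notes": "internal only"}
record = to_darwin_core(row)
```

Keeping the mapping as data rather than code is what lets each provider keep its own schema while still publishing a common set of fields.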
Standards 5: Formats for video, photos, etc.
• What’s out there
  – Everything!
• What we use
  – Mostly what we’re given
• What’s needed
  – Some basic choice to make our life easier
    • For example, we made all movies into Flash
• What’s next
  – Coming soon – more media types!
Standards 6: Textual categorization
• What’s out there
  – Species Profile Model
  – http://wiki.tdwg.org/SPM
  – Started April 2007
• What we use
  – Nothing yet
• What’s needed
  – Wiki has no structure, so something soon please
Standards 7: APIs
• What’s out there
  – Nothing
• What we use
  – We’ve home-grown all our own Web service APIs – SOAP and XML, not even WSDL
• What’s needed
  – Everything!
Standards – what’s next
• We’re working with the leaders in the field (esp. GBIF) to come to consensus
• TDWG standards body moves slowly
• We’ll put pragmatic solutions into place and replace with standards as we can
Status of EOL
• Site went live 1am Feb 26
  – Between 9am and 3pm served over 11.5 million hits! (note: hits not users ☺)