EOL Informatics Team - Open Grid ForumOGF+Feb+2008.pdf · EOL Informatics Team • 21st century...

Page 1: EOL Informatics Team - Open Grid ForumOGF+Feb+2008.pdf · EOL Informatics Team • 21st century biology: From Microscope to Macroscope • Genome, protein databases led to revolution

EOL Informatics Team

Page 2

The Encyclopedia of Life: A Web Page for Every Species

Jennifer M. Schopf, Systems Architect

Marine Biological Laboratory

Page 3

The Problem

• 21st century biology: From Microscope to Macroscope
• Genome and protein databases led to a revolution in biology
  – Data freely and openly available, assimilated from many sources
  – New applications in education, research, commerce, medicine
  – Biology fast becoming an information-driven science
• Databases for organism biology less well developed
  – Individual species banks exist, but all in different formats and styles
• General public shows avid interest in biological information
  – But authoritative information is hard to find and sort in an ever vaster Web
• Newly developing countries often lack access to literature and specimens

Page 4

The Solution: Encyclopedia of Life

• A web page for each of Earth’s known species
  – Estimated to be 1.8 million validly known species
• Each page contains:
  – Introductory page for general public
    • Vetted by experts
  – Additional information for diverse user groups
    • Molecular & evolutionary biologists
    • Taxonomists
    • Horticulturists, bird watchers
    • Biodiversity-based industries (fisheries)
    • School children, teachers, citizen scientists

Page 5

Outline

• Vision
• How have we implemented this
• Need for standards

Page 6

Where did the idea come from?

• Many people have suggested an Encyclopedia of Life, e.g.
  – TDWG: Global Plant Species Information System (1990)
  – Dan Janzen: Species Pages (early 1990s)
  – OECD Megascience Forum: SpeciesBank (1999)
  – ALL Species: Encyclopedia of Life (2001)
  – Smithsonian/Telluride: Encyclopedia of Life (2002)
  – Rainer Froese: SpeciesBase (2005)
• Current EOL derives from a paper by E.O. Wilson in 2003

Page 7

E.O. Wilson articulated a need for EOL in 2003

Page 8

E.O. Wilson’s Vision

• Imagine an electronic page for each species of organism on Earth, available everywhere by single access on command. The page contains the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits. The page opens out directly or by linkage with other databases such as ARKive, Ecoport, and GenBank. It comprises a summary of everything known about the species’ genome, proteome, geographic distribution, phylogenetic position, habitat, ecological relationships, and, not least, its perceived practical importance for humanity.

Page 9

E.O. Wilson’s Vision

• Imagine an electronic page for each species of organism on Earth, available everywhere by single access on command. The page contains the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits. The page opens out directly or by linkage with other databases such as ARKive, Ecoport, and GenBank. It comprises a summary of everything known about the species’ genome, proteome, geographic distribution, phylogenetic position, habitat, ecological relationships, and, not least, its perceived practical importance for humanity.

Page 10

TED 2007 Wish

Page 11

EOL – Why Now?

• Technology can enable it
  – Linking of databases is now relatively simple
  – Aggregation assembles dynamic, fast, low-cost draft pages
  – Wiki allows vetting and continuous improvement by experts
• “Species name” is a field common to virtually all biological databases
  – So it can be used to link these databases together
• Many specialized databases to work with
  – Catalogue of Life, FishBase, AmphibiaWeb, Ocean Biogeographic Information System, Barcode of Life Database, GBIF
• Capable institutions interested in making it happen
• Foundations willing to fund initial phase
• Great community & user interest in the idea
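
The point about species names as a shared field can be sketched in a few lines of Python. This is a minimal illustration, not EOL's aggregation code: the function name `link_by_species_name`, the record layout, and the sample rows are all hypothetical; only the species names are taken from these slides.

```python
from collections import defaultdict


def link_by_species_name(*sources):
    """Group records from several sources under a normalized species name.

    Each source is a (source_name, rows) pair and every row is a dict with
    a 'species_name' key. This layout is an assumption made for the sketch.
    """
    linked = defaultdict(list)
    for source_name, rows in sources:
        for row in rows:
            # Collapse whitespace and ignore case so trivial variants still join.
            key = " ".join(row["species_name"].split()).lower()
            linked[key].append((source_name, row))
    return dict(linked)


# Hypothetical rows from two providers; field values are placeholders.
catalogue = [{"species_name": "Limulus polyphemus", "rank": "species"}]
images = [
    {"species_name": "limulus  polyphemus", "file": "placeholder-1.jpg"},
    {"species_name": "Kiwa hirsuta", "file": "placeholder-2.jpg"},
]

pages = link_by_species_name(("catalogue", catalogue), ("images", images))
for name, records in pages.items():
    print(name, [source for source, _ in records])
```

A join on the literal name string only works while every source spells and uses the name the same way, which is why the standards part of the talk spends so long on names.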

Page 12

How are we doing this?

• Content
• Aggregation
• Smarts

Page 13

Content

• We don’t do content – our data providers do
  – Tree of Life web project
  – Species 2000 & ITIS Catalogue of Life
  – GBIF
  – Biopix
  – FishBase
• These provide us with databases (and sometimes Web services) from which we pull existing, expert-vetted data
• There are thousands of these databases, of varying levels of sophistication

Page 14

More Content

• Another EOL group is scanning in literature
  – Post-copyright or with publisher’s permissions
• We also scan RSS feeds for literature links
  – Part of the uBio project (MBL)

Page 15

Aggregation

• We don’t use just one of those sources on a page
• We take advantage of as many different sources as we can

Page 16

Smarts

• Not yet – but soon…
  – Tagging
  – “If you like this species you might also like…”
  – Semantic data to enable more interesting searches

Page 17

Contributions through WorkBench

• Software to help contribute, mobilise, visualize, analyze, annotate
• First release in 3 months
  – David Shorthouse overseeing
• Content authoring, association, review
• Communication
• Managing metadata – ontologies
  – Adding to a character ontology vs hierarchy matrix
• Personal and communal environments (myEOL)

Page 18

Where we started

The early uBio aggregation portal

Aggregation intercepted and interpreted source code, database queries, RSS feeds and other APIs to collect data, then merged the results into new pages

Page 19

Workflow for a Species Page

• Species pages group selects data partners
• Data partners & data objects registered within Fedora Commons after agreements are reached
• Metadata normalized
• Data enhanced with semantic & taxonomic intelligence
• Data from different sources brought together – aggregated
  – Index: DataObjectID, PartnerID, Verbatim name, UnionID, Metadata
  – Matrix: atomised data
• Data available to the WorkBench, where it may be manipulated
• Selected and organized data passed to templates of species pages
• Page of information on a species visible
• Secretariat (or delegates)

Page 20

Chromis: Species Page

Overview: 4 areas on each page

1) Common and species name
   – New discovery (Jan ‘08)
2) Media Panel
   – Images, maps and videos
3) Classification viewer
4) Content
   – Accessed via the table of contents

Page 21

Green Anole

Page 22

Burying Beetle

Page 23

Home Page

• Search box
• Set of random species
• Featured example
  – One of our 25 Exemplars
• Popular – most watched species

Page 24

Standards

• I’m here to talk about the need for standards!

Page 25

Standards in EOL

• Names (Indexing)
• Metadata
• Schemas (mapping)
• Exchange protocols
• Data formats
• Textual categorization
• APIs

Page 26

Names are universal tags for species

Aapaleacea
Limulus polyphemus
Kiwa hirsuta
Osedax frankpressi
Kingia australis
Pieris japonica
Pieris rapae
Trypanosoma brucei
Homo sapiens

Page 27

First Standard - Names

• Names offered a logical way to search for and index content
• Names annotate data objects
• All names and name surrogates annotate all data objects
• A compilation of all names ever used is the foundation of a universal index for biology

Page 28

However

• Names of organisms change over time

• E-biology has to deal with the cumulative changes

Page 29

Same name is recorded differently

• Our automated name-finding tools, search engines and indexes needed to be aware of all the variants

Page 30

Names are misspelled

Became…

• Loligo pealeii
• Loligo pealii
• Loligo pealei
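
One common way to cope with spelling variants like these is approximate string matching against a list of known names. The sketch below uses Python's standard `difflib` module; it is not the tool the EOL or uBio teams used. The name list, the 0.85 cutoff, and the choice of "Loligo pealeii" as the reference spelling are assumptions made for the demonstration.

```python
import difflib

# Assumed reference list; a real index would hold millions of names.
KNOWN_NAMES = [
    "Loligo pealeii",
    "Limulus polyphemus",
    "Pieris rapae",
    "Pieris japonica",
]


def reference_spelling(name, known=KNOWN_NAMES, cutoff=0.85):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for variant in ("Loligo pealii", "Loligo pealei", "Homo sapiens"):
    print(variant, "->", reference_spelling(variant))
```

A high cutoff keeps genuinely different names apart, at the cost of missing badly mangled spellings; where to set it is a judgment call per data source.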

Page 31

Polysemes: One name, more than one meaning

• Peranema – the fern
• Peranema – the euglenid

Page 32

Polysemes: One name, more than one meaning

• Aotus
  – Aotus Illiger 1811: Aotus trivirgatus
  – Aotus Smith 1805: Aotus ericoides

Page 33

Polysemes: One name, more than one meaning

• Aotus
  – Aotus Illiger 1811: Aotus trivirgatus
  – Aotus Smith 1805: Aotus ericoides

Resolve with intelligent disambiguation: authority, species, contextual data

• Contextual data: Primate, Monkey, Eyes, Food, Panama, Aotus nancymaae
• Contextual data: legume, plant, flower, Mirbelieae, Australia, Aotus mollis

Page 34

Many Names for One Species

Koko
Горилла
Guerilla
Eastern Lowland Gorilla
Gorilla graueri
Gorilla berengei
Gorilla beringei Matschie
Gorilla beringei mikenensis
King kong
Gorilla gorilla
Virunga
Gorila
Gorille
Mountain gorilla
大猩猩
ゴリラ

Page 35

This is hard for more than just software builders

• Libraries
• Publishers
• Museums
• Federal Agencies
• Search engines
• Federated databases
• Students and researchers

Red spotted newt

COML

Page 36

Effect in software

One organism
  – 11 common names
  – 4 scientific names
  – 4 maps!

The early adoption of Taxonomic Intelligence ensured OBIS was the first federated environment to address this

Page 37

Disambiguation: Distinguishing Polysemes

• Rulesets that allowed automated tools to discriminate the euglenid from the fern
• Authority
  – Peranema Dons (the fern) vs Peranema Dujardin (the euglenid)
• Co-occurrence of disambiguating terms – taxonomic context
  – Peranema and Pteridophyta vs Peranema and Euglenida
  – Peranema trichophorum, or Peranema, Anisonema and Urceolus
• Taxonomic Intelligence tools
  – Plain language dictionary filters
  – Ambiguity dictionaries

Peranima
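
The co-occurrence rule can be illustrated with a small scoring function: count how many known context terms for each sense appear near the ambiguous name and pick the sense with the most hits. This is a sketch of the idea only, not the Taxonomic Intelligence ruleset itself. The term sets are seeded from the terms on this slide, and the names `SENSE_TERMS` and `disambiguate` are hypothetical.

```python
# Context terms per sense of "Peranema"; a real ruleset would be far larger.
SENSE_TERMS = {
    "fern": {"pteridophyta", "fern"},
    "euglenid": {"euglenida", "euglenid", "trichophorum", "anisonema", "urceolus"},
}


def disambiguate(text, senses=SENSE_TERMS):
    """Return the sense whose context terms co-occur most often, else None."""
    words = {w.strip(".,;:()").lower() for w in text.split()}
    scores = {sense: len(words & terms) for sense, terms in senses.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up) = ranked[0], ranked[1]
    # No evidence, or a tie between senses: refuse to guess.
    if best_score == 0 or best_score == runner_up:
        return None
    return best


print(disambiguate("Peranema and Pteridophyta in the herbarium"))
print(disambiguate("Peranema, Anisonema and Urceolus were observed"))
print(disambiguate("Peranema"))
```

Returning `None` on a tie is a deliberate choice in this sketch: an unresolved name can be flagged for an authority check, whereas a wrong guess would attach data to the wrong organism.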

Page 38

Reconciliation

• Linking alternative names for the same organism
• A query initiated with any name expands to all names and unified data associated with each
• Using GBIF’s universal biodiversity data bus to normalize data from multiple sources

Union concept: a reconciliation group that holds all names and labels used (in someone’s opinion) for an entity (converts names to taxonomic concepts)
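
A union concept can be modelled as a group identifier shared by every name in the group, so that a query on any one name expands to all of them. The sketch below is a hedged illustration in Python: the class `Reconciler`, its methods, and the identifier `union-0001` are hypothetical, and the grouping of the three gorilla names is just one possible opinion, as the slide itself notes.

```python
from collections import defaultdict


class Reconciler:
    """Minimal name-to-group index; a sketch, not EOL's implementation."""

    def __init__(self):
        self._union_of = {}                 # normalized name -> union ID
        self._names_in = defaultdict(set)   # union ID -> names as recorded

    @staticmethod
    def _normalize(name):
        return " ".join(name.split()).lower()

    def add_group(self, union_id, names):
        for name in names:
            self._union_of[self._normalize(name)] = union_id
            self._names_in[union_id].add(name)

    def expand(self, name):
        """Return every name reconciled with the given one (sorted)."""
        union_id = self._union_of.get(self._normalize(name))
        if union_id is None:
            return []
        return sorted(self._names_in[union_id])


reconciler = Reconciler()
reconciler.add_group(
    "union-0001",  # hypothetical identifier
    ["Gorilla beringei", "Gorilla berengei", "Mountain gorilla"],
)
print(reconciler.expand("mountain gorilla"))
```

Because the group, not any single name, carries the identity, misspellings and vernacular names can sit in the same group without anyone having to agree on a single standard name.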

Page 39

Reconciliation 2

• Reconciliation has made all this possible
  – Standard names won’t happen, so we’ve got a way to group instead
• This is a first step towards GUIDs for species

Page 40

Standards 2: Metadata

• What’s out there
  – ABCD
  – Dublin Core
  – Darwin Core – also part of TDWG

Page 41

ABCD: Access to Biological Collections Data

• Part of Biodiversity Information Standards
  – TDWG (Taxonomic Databases Working Group)
  – We’re institutional partners
• Has over 200 elements
  – Not in use yet, too cumbersome

http://www.tdwg.org/activities/abcd/

Page 42

Dublin Core

• 15 metadata elements:

1. Title
2. Creator
3. Subject
4. Description
5. Publisher
6. Contributor
7. Date
8. Type
9. Format
10. Identifier
11. Source
12. Language
13. Relation
14. Coverage
15. Rights
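
To show how small this element set is in practice, here is a minimal Python sketch that keeps only the 15 elements listed above and reports any other keys a provider supplies. The function name `to_dublin_core` and the sample record are hypothetical; this is not an official Dublin Core library or encoding.

```python
# The 15 elements from the slide, lowercased for matching.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_dublin_core(raw):
    """Split a provider record into Dublin Core fields and leftover keys."""
    record = {
        key.lower(): value
        for key, value in raw.items()
        if key.lower() in DUBLIN_CORE_ELEMENTS
    }
    leftover = sorted(key for key in raw if key.lower() not in DUBLIN_CORE_ELEMENTS)
    return record, leftover


# Hypothetical provider record; values are placeholders.
sample = {"Title": "Species page text", "Rights": "placeholder", "Habitat": "reef"}
record, leftover = to_dublin_core(sample)
print(record)
print(leftover)
```

The leftover keys are where biology-specific standards such as Darwin Core or ABCD come in, since the generic elements have no slot for them.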

Page 43

Darwin Core

• Information about where and what a sample is
  – Collection type, institution, collector, etc.
  – Genus and species designation, plus vernacular and scientific names
  – Locality (continent, map coords, depth of water found, etc.)
• Also part of TDWG, some uptake

http://wiki.tdwg.org/twiki/bin/view/DarwinCore/DarwinCoreDraftStandard

Page 44

Metadata

• What’s needed
  – Simple, accepted standards
• What’s next
  – We’re working with data providers to see what makes sense pragmatically

Page 45

Standards 3: Schemas

• What’s out there
  – Primarily mappings to XML, part of metadata standards
• What we use
  – XML
• What’s needed
  – RDF for semantic use as well

Page 46

Standards 4: Exchange protocols

• What’s out there
  – TDWG Access Protocol for Information Retrieval (TAPIR)
  – Distributed Generic Information Retrieval (DiGIR)
    • Open-source PHP-based package that translates columns of data in a database (MySQL, SQL Server, Oracle) into Darwin Core fields
  – Both look at data movement and data translation
  – Both need dedicated resources
• What we use
  – Neither of these is mature yet; we’re evaluating them for the next release
• What’s needed
  – Reliable, stable, implementable standards with broad appeal – what else?
  – Some groups (Nearctic Spider Database) are looking at things like Google spreadsheets as an option
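
The column-to-field translation that DiGIR is described as doing can be pictured as a lookup table applied row by row. The Python sketch below only illustrates that idea; it is not DiGIR code (DiGIR is a PHP package), and the column names, the field labels in `COLUMN_TO_FIELD`, and the sample row are all hypothetical rather than the official Darwin Core term list.

```python
# Hypothetical mapping from one provider's column names to
# Darwin Core-style field labels (labels are illustrative only).
COLUMN_TO_FIELD = {
    "sci_name": "ScientificName",
    "common": "VernacularName",
    "inst": "InstitutionCode",
    "cont": "Continent",
}


def translate_row(row, mapping=COLUMN_TO_FIELD):
    """Rename mapped columns; collect unmapped ones so nothing is silently lost."""
    translated, unmapped = {}, {}
    for column, value in row.items():
        if column in mapping:
            translated[mapping[column]] = value
        else:
            unmapped[column] = value
    return translated, unmapped


# Placeholder row as it might come out of a provider's database.
row = {"sci_name": "Pieris rapae", "cont": "Europe", "internal_id": 17}
translated, unmapped = translate_row(row)
print(translated)
print(unmapped)
```

Each provider needs its own mapping table, which is one reason such protocols "need dedicated resources" at every participating site.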

Page 47

Standards 5: Formats for video, photos, etc.

• What’s out there
  – Everything!
• What we use
  – Mostly what we’re given
• What’s needed
  – Some basic choices to make our life easier
    • For example, we converted all movies to Flash
• What’s next
  – Coming soon – more media types!

Page 48

Standards 6: Textual categorization

• What’s out there
  – Species profile model
  – http://wiki.tdwg.org/SPM
  – Started April 2007
• What we use
  – Nothing yet
• What’s needed
  – Wiki has no structure, so something soon please

Page 49

Standards 7: APIs

• What’s out there
  – Nothing
• What we use
  – We’ve home-grown all our own Web service APIs – SOAP and XML, not even WSDL
• What’s needed
  – Everything!

Page 50

Standards – what’s next

• We’re working with the leaders in the field (esp. GBIF) to come to consensus
• TDWG standards body moves slowly
• We’ll put pragmatic solutions into place and replace them with standards as we can

Page 51

Status of EOL

• Site went live 1am Feb 26
  – Between 9am and 3pm served over 11.5 million hits! (note: hits not users ☺)

Page 52

For more information

Jennifer Schopf– [email protected]

http://www.eol.org