Computer, what is the trajectory of the planet Seti Alpha 5?

The Future of Microalgal Taxonomy Anne Thessen, [email protected] David Patterson [email protected] (Data Conservancy, Life Sciences)


Scientist’s Dream

Computer, what is the trajectory of the planet Seti Alpha 5?

Taxonomist’s Dream

How many algal species can be found on this planet?

Taxonomist’s Dream

What species is this?


Setting the stage for a ‘big new biology’

• BIG = data-centric (like particle physics and astronomy)

• Characterized by data sharing via a virtual pool

• New = new skill sets, tools, cyber-infrastructure to exploit the data pool

• Data driven discovery as a new means of understanding

• GenBank as a model within the Life Sciences

Small science

Large number of providers with small amounts of data.

Small number of providers with lots of data.

Aa paleacea

Limulus polyphemus

Kiwa hirsuta

Osedax frankpressi

Kingia australis

Names

Pieris japonica

Pieris rapae

Trypanosoma brucei

Homo sapiens

Many names for one taxon

Didimosphenia geminata

Didymosphenia geminata


Rock snot

Didymo

Echinella geminata

Gomphonema geminatum

Gomphonema vulgare

Reconciliation Group

Didymosphenia geminata
Didimosphenia geminata
Didymo
Rock Snot
Echinella geminata
Gomphonema geminatum
Gomphonema vulgare

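A reconciliation group can be sketched as a lookup from every variant to one preferred name. This is a minimal illustration, not the Global Names implementation: the function names are hypothetical, and treating Didymosphenia geminata as the preferred name is an assumption based only on its position at the head of the group.

```python
# Minimal sketch of a reconciliation group (hypothetical names and structure).
# Assumption: the first name in the group is used as the preferred name.
RECONCILIATION_GROUPS = {
    "Didymosphenia geminata": [
        "Didimosphenia geminata",  # spelling variant from the slide
        "Didymo",                  # vernacular
        "Rock Snot",               # vernacular
        "Echinella geminata",      # other scientific names from the slide
        "Gomphonema geminatum",
        "Gomphonema vulgare",
    ],
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def build_index(groups: dict) -> dict:
    """Map every normalized variant (and the preferred name) to the preferred name."""
    index = {}
    for preferred, variants in groups.items():
        for name in [preferred, *variants]:
            index[normalize(name)] = preferred
    return index


INDEX = build_index(RECONCILIATION_GROUPS)


def reconcile(name: str):
    """Return the preferred name for a variant, or None if the name is unknown."""
    return INDEX.get(normalize(name))
```

With this index, data labelled "rock snot" and data labelled "Gomphonema vulgare" resolve to the same taxon and can be pooled.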

One name for many taxa

Cyclophora Castracane 1878: Cyclophora tenuis

Cyclophora Hübner 1822: Cyclophora porata

Contextual data

Diatom, Chloroplast, Frustule, Benthic, Marine

Disambiguate by authority, species, contextual data

Contextual data

Food, Moth, Wings, Exoskeleton, Caterpillar
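Disambiguation by contextual data can be sketched as counting how many context terms for each candidate appear in the surrounding text. The scoring rule, the `Candidate` class, and the function name are assumptions made for illustration; the context terms and authorities are the ones shown on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    authority: str
    context: frozenset  # terms expected near this sense of the name


# Two homonyms from the slides, with their contextual terms.
CANDIDATES = [
    Candidate("Cyclophora", "Castracane 1878",
              frozenset({"diatom", "chloroplast", "frustule", "benthic", "marine"})),
    Candidate("Cyclophora", "Hübner 1822",
              frozenset({"food", "moth", "wings", "exoskeleton", "caterpillar"})),
]


def disambiguate(name: str, text: str, candidates=CANDIDATES):
    """Pick the candidate whose context terms overlap most with the text.

    Returns None when there is no evidence or when the best score is tied.
    """
    words = {w.strip(".,;:()").lower() for w in text.split()}
    scored = [(len(c.context & words), c) for c in candidates if c.name == name]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if not scored or scored[0][0] == 0:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]
```

Returning None on a tie is a deliberate choice here: an unresolved homonym is easier to repair later than a wrong assignment.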

Global Names Architecture

Provider Services

DATA AND SERVICE CONSUMERS

DATA AND SERVICE PROVIDERS

EXPERTS

Consumer Services

GNA

Names-based cyberinfrastructure

• Managing names to manage biodiversity data
  - All names (scientific, vernacular, surrogate)
  - For all organisms
  - Many names for one species reconciled
  - One name for many species disambiguated

• Global Names Architecture - a virtual layer, using names services to link together distributed data

• Globalnames.org

• Micro*scope (microscope.mbl.edu) and Encyclopedia of Life (eol.org)

Legacy Data

• Narrative tradition in biology

• Too much for a human

• Can we get a machine to do the work?

• NLP!!!

Legacy Data

• Use NLP/machine learning to extract names and characters

• Hong Cui
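As a rough illustration of machine name extraction, the sketch below finds known names in narrative text by dictionary lookup. This is a much simpler stand-in for the NLP and machine-learning methods the slide refers to, not a description of them; the lexicon contents and the function name are hypothetical.

```python
import re


def find_names(text: str, lexicon: set) -> list:
    """Return every lexicon name found in the text, in order of appearance.

    Matching is case-insensitive and restricted to whole words. Longer names
    are tried first so "Didymosphenia geminata" wins over a shorter entry.
    """
    ordered = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [match.group(0) for match in pattern.finditer(text)]


# Hypothetical lexicon and sentence, for illustration only.
LEXICON = {"Didymosphenia geminata", "Rock snot", "Didymo", "Spirogyra"}
SENTENCE = "Blooms of Didymosphenia geminata, also called rock snot, were reported."
NAMES = find_names(SENTENCE, LEXICON)
```

A lookup like this only finds names it already knows, which is one reason learned extractors are attractive for large legacy literature.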

Legacy Data

• Spirogyra:chloroplasts:present

Legacy Data

• Spirogyra:chloroplasts:present:attribution
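The colon-separated statements above can be read as taxon, character, state, and an optional attribution. A minimal sketch of parsing them into a record follows; the field names and the `parse_statement` function are assumptions, as is the example attribution string used in the comment.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    taxon: str
    character: str
    state: str
    attribution: Optional[str] = None  # who or what source made the statement


def parse_statement(raw: str) -> Statement:
    """Parse 'taxon:character:state[:attribution]' into a Statement.

    The split is limited to three colons so an attribution that itself
    contains colons (for example a URL) stays in one piece.
    """
    parts = [part.strip() for part in raw.split(":", 3)]
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected taxon:character:state, got {raw!r}")
    return Statement(*parts)


# Example with a placeholder attribution (hypothetical, not from the slides):
# parse_statement("Spirogyra:chloroplasts:present:example source")
```

Keeping the attribution on the record means an extracted character state can always be traced back to the text it came from.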

Coffee Ontology

coffee is a drink
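The coffee example is a single subject-predicate-object triple. A minimal in-memory sketch of storing such triples and following "is a" links transitively is shown below; the `is_a` predicate spelling, the function names, and the extra espresso triple are assumptions added for illustration.

```python
# A tiny in-memory set of (subject, predicate, object) triples.
TRIPLES = {
    ("coffee", "is_a", "drink"),     # from the slide
    ("espresso", "is_a", "coffee"),  # hypothetical extra triple
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def is_a(triples, thing: str, category: str) -> bool:
    """Follow is_a links from thing and report whether category is reached."""
    seen = set()
    stack = [thing]
    while stack:
        current = stack.pop()
        if current == category:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(obj for (_, _, obj) in match(triples, s=current, p="is_a"))
    return False
```

Because the relation is followed transitively, a statement made about drinks can be applied to espresso without anyone having stated that link directly.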

Existing Ontology

Semantic Web

Data Discovery and Aggregation

Future Data

Triple Store

The New Workforce

• Informatics/computing training

• Modified workflows

• Importance of data management and preservation

In Summary

• Big New Biology is coming, taxonomy can benefit from being a part of it

• Existing data can be made machine-readable using information extraction algorithms

• Existing workflows can be modified to capture data close to the source

• Data can be shared using the semantic web

Acknowledgments

• Dima Mozzherin

• David Shorthouse

• Sayeed Choudhury

• Pete DeVries