Exploiting Diverse Sources of Scientific Data the vision, what has been achieved and what next…

e-SI Theme: Exploiting Diverse Sources of Scientific Data. Prof. Jessie Kennedy


Page 1: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

e-SI Theme: Exploiting Diverse Sources of Scientific Data

Exploiting Diverse Sources of Scientific Data the vision, what has been achieved and what next…

Prof. Jessie Kennedy

Page 2: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Science & Scientific Data

Science and Scientific Data are Complex…

Page 3: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Biochemistry

Climatology

Taxonomy

Meteorology

Nomenclature

Paleontology

Genomics

Proteomics

Hydrology

Morphology

Geology

Oceanography

Geography

Ecology

Page 4: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Biochemistry

Climatology

Taxonomy

Meteorology

Nomenclature

Paleontology

Genomics

Proteomics

Hydrology

Morphology

Geology

Oceanography

Ecology

Geography

Organism

Name

Taxon concept

Gene sequence

Pathway

Protein

Location

Temperature

Depth

Page 5: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Individual Scientist

Small Scientific Community

Large Scientific Community

Scientific Laboratory

Scientific Community: complex

Page 6: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

[Figure, shown four times on this slide: the disciplines Biochemistry, Climatology, Taxonomy, Meteorology, Nomenclature, Paleontology, Genomics, Proteomics, Hydrology, Morphology, Geology, Oceanography, Ecology and Geography, together with the data items Organism, Name, Taxon concept, Gene sequence, Pathway, Protein, Location, Temperature and Depth.]

Page 7: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Science & Scientific Data

• Are continually changing
– Conclusions become foundations for new hypotheses
– New experiments invalidate existing knowledge

• Knowledge is open to interpretation
– Different opinions

• World continually changing

[Figure: observation, hypothesis, experiment, conclusion]

Page 8: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Exploiting Diverse Sources of Scientific Data: the vision

To provide scientists with technological solutions to exploit the wealth and diversity of Scientific Data:
– Discovery
– Access
– Sharing
– Integration/Linking
– Analysis

Which would thereby improve the potential for new scientific discovery

Page 9: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Projects in most sciences:

ESG

Page 10: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

SEEK (Scientific Environment for Ecological Knowledge): Vision

• Research, develop, and capitalize upon advances in information technology to radically improve the type and scale of ecological science that can be addressed

– Scalable synthesis

Michener

Page 11: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Data Dispersion Challenges

• Data are massively dispersed
– Ecological field stations and research centers (100’s)
– Natural history museums and biocollection facilities (100’s)
– Agency data collections (10’s to 100’s)
– Individual scientists (1000’s)

– Maintenance must be local

Michener

Page 12: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Data Integration Challenges

• Data are heterogeneous
– Syntax (format)
– Schema (model)
– Semantics (meaning)

Jones

Page 13: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Ecological Modeling Challenges

• Analysis and modeling tools are:
– Specialized
– Disconnected
– Proprietary

• It is:
– Difficult to revise analyses
– Hard to document analyses
– Impossible to reliably publish models to share with colleagues
– Hard to re-use models and analyses from colleagues
– Difficult to use grid-computing for demanding computations
– Labor-intensive to manage data in popular analysis software

Michener

Page 14: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Exploiting Diverse Sources of Scientific Data: the approaches

Data Discovery/Access
• Metadata: to describe the data sets
• Ontologies: to define the terminology used
• Standardisation of formats: for the exchange of data
• Life Science Identifiers (LSIDs): to uniquely identify and resolve data objects
• Provenance of data: to record where the data has come from and what has happened to it en route
• GRID/Web technology: distributed data management
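As a minimal sketch of what "uniquely identify" can mean in practice: an LSID is a URN of the form urn:lsid:authority:namespace:object, optionally followed by a revision. The helper names below (`parse_lsid`, `same_object`) and the `example.org` identifiers are invented for illustration, and the case handling is an assumption rather than a statement of the specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts, rejecting anything malformed."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:5])
    )
    if not well_formed:
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    # Assumption: the authority is compared case-insensitively,
    # the remaining fields exactly as written.
    return LSID(authority.lower(), namespace, object_id, revision)


def same_object(a: str, b: str) -> bool:
    """True when two LSIDs name the same object, ignoring the revision."""
    return parse_lsid(a)[:3] == parse_lsid(b)[:3]


# Hypothetical identifiers for two revisions of one taxon concept record.
first = "urn:lsid:example.org:concepts:C1.3"
second = "URN:LSID:Example.org:concepts:C1.3:2"
print(same_object(first, second))
```

Ignoring the revision in `same_object` is a design choice for this sketch; a system that tracks provenance would probably compare revisions as well.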

Page 15: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Exploiting Diverse Sources of Scientific Data: the approaches

Data Integration/Linking
• Metadata: to know how to interpret the data sets
• Ontologies: to know how data in the data sets might be related; to aid automatic transformation of the data
• Standardisation of formats: to ease integration
• Life Science Identifiers (LSIDs): to know when 2 things are the same
• Workflows: to enable refinement and repetition of integration

Page 16: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Exploiting Diverse Sources of Scientific Data: the approaches

Data Analysis
• Metadata: to know how to interpret the data sets
• Ontologies: to know which analytical/transformation processes are appropriate
• Workflow Tools: to ease analytical processes; recording/reuse of analytical processes
• Provenance: recording the life history of data, to enable validation

Page 17: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Exploiting Diverse Sources of Scientific Data: the technologies

• Standardisation of formats
• Metadata
• Ontologies
• Life Science Identifiers (LSIDs)
• Provenance
• Workflow Tools
• GRID/Web technology

Page 19: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Meta Data: the vision

Meta data: "data about data", e.g. keywords, title, creator…

If scientists marked up their data with the agreed meta data it would be trivial to find highly relevant data (sub-)sets for analysis…

Meta-utopia….

Page 20: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Meta-utopia

A world of complete, reliable metadata. In meta-utopia,

Everyone uses the same language and means the same thing…

The guardians of epistemology have rationally mapped out a schema or hierarchy of ideas that everyone adheres to…

Scientists accurately describe their methods, processes and results, so anyone can do anything with it in the future…

Cory Doctorow

Page 21: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Meta Data: the approach

• Common language: XML Schemas to describe data/meta data
• Domain specific exchange schemas: explosion of these in every domain
– Exchanging data
– Archiving data
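To make the idea concrete, here is a small sketch of a metadata record expressed in XML and searched by keyword, using Python's standard `xml.etree.ElementTree` module. The element names are only loosely modelled on the kind of fields an exchange schema defines (title, creator, keywords); the record is not valid against any real schema, and the data set and person are invented.

```python
import xml.etree.ElementTree as ET


def build_record(title, creator, keywords):
    """Build a tiny, illustrative metadata record as an XML element tree."""
    dataset = ET.Element("dataset")
    ET.SubElement(dataset, "title").text = title
    ET.SubElement(dataset, "creator").text = creator
    keyword_set = ET.SubElement(dataset, "keywordSet")
    for keyword in keywords:
        ET.SubElement(keyword_set, "keyword").text = keyword
    return dataset


def matches(record, keyword):
    """Case-insensitive exact match against the record's keywords."""
    wanted = keyword.lower()
    return any((kw.text or "").lower() == wanted for kw in record.iter("keyword"))


# Hypothetical data set description.
record = build_record("Moss survey", "A. Researcher", ["mosses", "Germany"])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
print(matches(record, "Mosses"))
```

Exact keyword matching only works when everyone chooses the same words, which is the assumption the meta-utopia slide calls into question.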

Page 22: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Ecological Metadata Language

A look inside the meta-utopia of ecology

Page 23: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Identification: dataset elements

Page 24: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Identification: resource elements

Page 25: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Identification: party elements

Page 26: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Discovery: coverage elements

Geographic, Temporal, Taxonomic

Page 27: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Evaluation Level Information

Page 28: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Evaluation: Method Information

Page 29: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Evaluation: Project Information

L3

Page 30: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Access: Permissions Information

L4

Page 31: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Access: Physical Information

Page 32: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Access: Physical formatting details

Page 33: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Access: Distribution Information

L4

Page 34: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Integration Level Information

Page 35: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Integration Level: Attribute structure

Page 36: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Integration Level: attribute domains

Page 37: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Integration Level: attribute domains

Page 38: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

knb.ecoinformatics.org

Integration Level: measurementScale

Page 39: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Meta Data: the approach

• Common language: XML Schemas to describe data/meta data
• Domain specific exchange schemas: explosion of these in every domain
– Exchanging data
– Archiving data
• Turned into extensive specifications: difficult to know where to stop…

Page 40: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

but even this wasn’t enough…..

It’s not good enough to have meta-data, we need to know what the terms in the meta-data (schema or data values) mean.

Page 41: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Ontologies – the vision

If we understood the meaning of the schema and the terms used in the meta-data or databases we would be able to: find things more reliably, integrate things more easily, reason about what things are comparable….

because we have support for automatic inference
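A toy illustration of the kind of inference meant here, assuming a hand-written table of "is a kind of" links in place of a real ontology language or reasoner. All terms, links and record names below are invented.

```python
# Hypothetical "is a kind of" links, invented for illustration.
SUBCLASS_OF = {
    "Moss": "Plant",
    "Plant": "Organism",
    "Bacterium": "Organism",
}


def ancestors(term):
    """Return every broader term implied by following the links upward."""
    found = []
    while term in SUBCLASS_OF and SUBCLASS_OF[term] not in found:
        term = SUBCLASS_OF[term]
        found.append(term)
    return found


def is_a(term, broader):
    """True if `term` equals `broader` or is inferred to be a kind of it."""
    return term == broader or broader in ancestors(term)


def find(records, wanted):
    """Return names of records whose annotation is a kind of `wanted`."""
    return [name for name, term in records if is_a(term, wanted)]


# Data sets annotated with a single term each (made-up examples).
RECORDS = [("survey-1", "Moss"), ("culture-7", "Bacterium"), ("cores-3", "Sediment")]

print(find(RECORDS, "Organism"))
print(find(RECORDS, "Plant"))
```

A search for "Organism" finds the moss and bacterium records even though neither is annotated with that word, which is the benefit over plain keyword matching.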

Page 42: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Ontologies – the approach

• Common Language… OWL?
– RDF, OWL Lite, OWL DL, OWL Full…
• Domain specific ontologies or project specific?
– Map different ontologies
– Modularise the ontologies
• Reuse: build upper ontologies which domain ontologies extend or link to

Page 43: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Biodiversity Base Ontology

Page 44: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Core Layer

Page 45: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

BDI Core Taxon Name

Page 46: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

BDI Core Taxon Concept

Page 47: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

BDI Core BioSpecimen

Page 48: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

BDI Core BioObservation

Similar to…

Page 49: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

SEEK Observation ontology

Josh Madin

Page 50: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

An extension point for domain-specific terms

entity

Josh Madin

Page 51: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Characteristic

Josh Madin

Page 52: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

All the units, scales, indices, classifications, and lists used for ‘measuring’ a characteristic

Measurement standard

Similar to…

Josh Madin

Page 54: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

BDI Taxon Concept Ontology

…is really just a schema for representing

Page 55: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Biological Taxonomy

• Classify and name all organisms in the world
– So we can talk about them, experiment with them
– Do life science…
• The longest running attempt at building an ontology?
– Linnaeus' binomial system of nomenclature started in 1758
• An attempt to resolve a long standing problem in biology
– Many ways to classify things
– Understanding continually changes with new discoveries & technologies
– Classifications continually being redone
– New things defined, new definitions given for things in existence
• Lots of classifications over time
– Many compete at any one point in time

Page 56: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Taxonomic history of imaginary genus Aus L. 1758

• Linnaeus 1758: Aus L.1758; Aus aus L.1758
• Archer 1965: Aus L.1758; Aus aus L.1758; Aus bea Archer 1965
• Fry 1989: Aus L.1758; Aus aus L.1758; Aus bea Archer 1965; Aus cea BFry 1989
• Pyle 1990: Aus bea and Aus cea noted as invalid names and replaced with Aus beus and Aus ceus.
• Tucker 1991: Aus L.1758; Aus aus L.1758; Aus cea BFry 1989
• Pargiter 2003: Aus L.1758; Aus aus L. 1758; Aus ceus BFry 1989; Xus Pargiter 2003; Xus beus (Archer) Pargiter 2003

5 revisions of Aus, 1 name spelling change

Page 57: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

[Figure: the taxonomic history of imaginary genus Aus L. 1758, as on the previous page]

• 8 Names
• 2 genus
• 6 species

Page 58: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

N0 - Aus L.1758
C0.1 - Aus L.1758 sec. Linnaeus 1758
C0.2 - Aus L.1758 sec. Archer 1965
C0.3 - Aus L.1758 sec. Fry 1989
C0.4 - Aus L.1758 sec. Tucker 1991
C0.5 - Aus L.1758 sec. Pargiter 2003

N1 - Aus aus L.1758
C1.1 - Aus aus L.1758 sec. Linnaeus 1758
C1.2 - Aus aus L.1758 sec. Archer 1965
C1.3 - Aus aus L.1758 sec. Fry 1989
C1.4 - Aus aus L.1758 sec. Tucker 1991
C1.5 - Aus aus L.1758 sec. Pargiter 2003

N2 - Aus bea Archer 1965
C2.2 - Aus bea Archer 1965 sec. Archer 1965
C2.3 - Aus bea Archer 1965 sec. Fry 1989

N3 - Aus cea Fry 1989
C3.3 - Aus cea Fry 1989 sec. Fry 1989
C3.4 - Aus cea Fry 1989 sec. Tucker 1991

N4 - Aus beus Archer 1965

N5 - Aus ceus Fry 1989
C5.5 - Aus ceus Fry 1989 sec. Fry 1989

N6 - Xus beus Pargiter 2003
C6.5 - Xus beus Pargiter 2003 sec. Pargiter 2003

N7 - Xus Pargiter 2003
C7.5 - Xus Pargiter 2003 sec. Pargiter 2003

8 Names, 17 Concepts

Results in many concepts for each name

Page 59: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Possible interpretations of Aus aus L. 1758

Request data sets about Aus aus (N1): what’s returned?
• Original concept: C1.1
• Most recent concept: C1.5
• Preferred Authority (e.g. Fry 1989): C1.3
• Everything ever named N1: Union(C1.1,C1.2,C1.3,C1.4,C1.5)
• Best fit according to some matching algorithm: Best(C1.1,C1.2,C1.3,C1.4,C1.5)
• New concept containing only those features common to all concepts with the name N1: Intersection(C1.1,C1.2,C1.3,C1.4,C1.5)

Is it appropriate to link or merge data on this?
• Depends on the user’s purpose
• Level of precision required

[Figure: name N1 (Aus aus L.1758) linked to concepts C1.1, C1.2, C1.3, C1.4 and C1.5]
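The interpretations listed on this slide can be sketched as a small resolver. The concept identifiers and authorities come from the slides, but the member sets (`s1` to `s4`) are invented stand-ins for whatever specimens or features each concept circumscribes, and the function name `resolve` is hypothetical. The "best fit" option is left out because the slide does not say which matching algorithm is meant.

```python
# Concept id -> (authority, year, members). The members are made up.
CONCEPTS = {
    "C1.1": ("Linnaeus", 1758, {"s1", "s2"}),
    "C1.2": ("Archer", 1965, {"s1", "s2", "s3"}),
    "C1.3": ("Fry", 1989, {"s1", "s2", "s3", "s4"}),
    "C1.4": ("Tucker", 1991, {"s1", "s2", "s4"}),
    "C1.5": ("Pargiter", 2003, {"s1", "s2"}),
}


def resolve(strategy, authority=None):
    """Return the members selected for name N1 under one interpretation."""
    by_year = sorted(CONCEPTS.values(), key=lambda concept: concept[1])
    members = [concept[2] for concept in by_year]
    if strategy == "original":
        return set(members[0])
    if strategy == "most_recent":
        return set(members[-1])
    if strategy == "preferred":
        for name, _year, found in by_year:
            if name == authority:
                return set(found)
        raise KeyError(authority)
    if strategy == "union":
        return set().union(*members)
    if strategy == "intersection":
        return set.intersection(*members)
    raise ValueError(f"unknown strategy: {strategy}")


for strategy in ("original", "most_recent", "union", "intersection"):
    print(strategy, sorted(resolve(strategy)))
print("preferred", sorted(resolve("preferred", authority="Fry")))
```

With these invented members the union returns four items and the intersection two, which shows how much the answer to the same request can vary with the chosen interpretation.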

Page 60: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

[Figure: the 17 concepts (C0.1 to C7.5) laid out by revision, with the names N0 to N7 linked to them]

Classifications: synonymy relationships between concepts and names.

• In the literature taxonomists tell us names that are synonymous with their concepts
• Parent child relationships in 5 revisions
• Names for each of the concepts

Page 61: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

[Figure: the 17 concepts (C0.1 to C7.5) laid out by revision, with the names N0 to N7 linked to them]

Classifications: synonymy relationships between concepts and names.

Which can result in anything being returned for Aus aus by traversing the synonymy links
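A sketch of why traversing synonymy links can return almost anything: if name-to-concept links are followed without limit, a search that starts at one name spreads to every name and concept connected to it. The links below are invented for illustration and are not read off the figure.

```python
from collections import deque

# Hypothetical synonymy links between names (N) and concepts (C).
LINKS = [
    ("N1", "C1.3"),
    ("C1.3", "N2"),
    ("N2", "C2.2"),
    ("C2.2", "N4"),
    ("N4", "C6.5"),
]


def reachable(start, max_hops=None):
    """Breadth-first search: everything connected to `start` within `max_hops`."""
    graph = {}
    for left, right in LINKS:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if max_hops is not None and distance[node] >= max_hops:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance) - {start}


print(sorted(reachable("N1", max_hops=1)))
print(sorted(reachable("N1")))
```

Limiting the number of hops keeps the result narrow, but it is a blunt control; it says nothing about how the linked concepts actually relate to one another.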

Page 62: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

[Figure: the 17 concepts (C0.1 to C7.5) laid out by revision with the names N0 to N7, now with set relationships such as "=" drawn between concepts]

Classifications with set relationships between concepts.

What we need are the set relationships from concepts in a revision to earlier concepts, and name changes related to earlier names.

We can then build systems to return data fit for purpose
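As a rough sketch of what a set relationship between two concepts could look like in code, the function below compares the members of two concepts and names the relationship. The relationship labels and the member sets are assumptions made for this example, not terminology or data taken from the slides.

```python
def relationship(a, b):
    """Name the set relationship from concept `a` to concept `b`."""
    if a == b:
        return "congruent"
    if a > b:
        return "includes"
    if a < b:
        return "is included in"
    if a & b:
        return "overlaps"
    return "excludes"


# Invented members for three concepts that share the name Aus aus,
# plus one concept for a different name.
C1_1 = frozenset({"s1", "s2"})
C1_3 = frozenset({"s1", "s2", "s3", "s4"})
C1_5 = frozenset({"s1", "s2"})
C2_3 = frozenset({"s3", "s5"})

print(relationship(C1_5, C1_1))
print(relationship(C1_3, C1_1))
print(relationship(C1_3, C2_3))
print(relationship(C1_1, C2_3))
```

Given such relationships, a system could merge data only across congruent concepts when precision matters, and accept inclusion or overlap when a broader answer is good enough.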

Page 63: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Exploiting Diverse Sources of Scientific Data

Real Taxonomic Revisions

German mosses
• 14 classifications in 73 years covering 1548 taxa
• only 35% thought to be stable concepts
• 65% of names used in legacy data sets are ambiguous, and we don’t know which ones. We need computers to help understand this…

Smaller classifications are combined into large classifications
• ITIS – integrated taxonomy (also changing), approx. 250,000 taxa

Taxonomic Revision of genus Alteromonas
• 34 years: from 1972 to 2006
• Thanks to George Garrity, Michigan State Univ.

Page 64: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

macleodii(T)

communis

Alteromonas

1972

vaga

Page 65: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

communisvagahaloplanktis

Alteromonasmacleodii(T)

1972 1973

Page 66: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

communisvagahaloplanktisrubra

Alteromonas

1972 1973 1976

macleodii(T)

Page 67: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

communisvagahaloplanktisrubracitrea

Alteromonas

1972 1973 1976 1977

macleodii(T)

Page 68: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

communisvagahaloplanktisrubracitreaesperjianaundina

Alteromonas

1972 1973 1976 1977 1978

macleodii(T)

Page 69: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

communisvagahaloplanktisrubracitreaesperjianaundinaaurantia

Alteromonas

1972 1973 1976 1977 1978 1979

macleodii(T)

Page 70: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

communisvagahaloplanktisrubracitreaesperjianaundinaaurantiaputrifacienshanedai

Alteromonas

1972 1973 1976 1977 1978 1979 1981

macleodii(T)

Page 71: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

communisvagahaloplanktisrubracitreaesperjianaundinaaurantiaputrifacienshanedailuteoviolaceae

Alteromonas

1972 1973 1976 1977 1978 1979 1981 1982

macleodii(T)

Page 72: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

communisvagahaloplanktisrubracitreaesperjianaundinaaurantiaputrifacienshanedailuteoviolaceae

vagacommunis(T)

Marinomonas Alteromonas

commune

vagum

1972 1973 1976 1977 1978 1979 1981 1982 1984

multiglobiferum

japonicumminutiumbiejerinckiimaris

maris

hiroshimense

pelagicumpusillum

jannaschiikreigii

Oceanosprillum

mariswilliamsae

linum(T) macleodii(T)

Page 73: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

communisvagahaloplanktisrubracitreaesperjianaundinaaurantiaputrifacienshanedai

vaga benthicahanedai

Marinomonas Alteromonasputrifaciens(T)

Shewanella

japonicumminutiumbiejerinckiimaris

maris

hiroshimensemultiglobiferumpelagicumpusillumcommunejannaschiikreigiivagum

Oceanosprillum

mariswilliamsae

1972 1973 1976 1977 1978 1979 1981 1982 1984 1986

luteoviolaceae

communis(T)linum(T) macleodii(T)

Page 74: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987

communisvagahaloplanktisrubracitreaesperjianaundinaaurantia

hanedailuteoviolaceaedenitrificans

vaga benthicahanedai

Marinomonas Alteromonas Shewanella

japonicumminutiumbiejerinckiimaris

maris

hiroshimensemultiglobiferumpelagicumpusillumcommunejannaschiikreigiivagum

Oceanosprillum

mariswilliamsae

putrifaciens

putrifaciens(T)communis(T)linum(T) macleodii(T)

Page 75: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

communisvagahaloplanktisrubracitreaesperjianaundinaaurantiaputrifacienshanedailuteoviolaceaedenitrificans

vaga benthicahanedai

Marinomonas Alteromonas Shewanella

japonicumminutiumbiejerinckiimaris

maris

hiroshimensemultiglobiferumpelagicumpusillumcommunejannaschiikreigiivagum

Oceanosprillum

mariswilliamsae

1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988

colwelliana

putrifaciens(T)communis(T)linum(T) macleodii(T)

Page 76: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

vaga benthicahanedai

Marinomonas Shewanella

japonicumminutiumbiejerinckiimaris

maris

hiroshimensemultiglobiferumpelagicumpusillumcommunejannaschiikreigiivagumbiejerinckiipelagicummaris

hiroshimense

Oceanosprillum

mariswilliamsae

communisvagahaloplanktisrubracitreaesperjianaundinaaurantiaputrifacienshanedailuteoviolaceaedenitrificans

tetradonis

Alteromonas

colwelliana

1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990

colwelliana

putrifaciens(T)communis(T)linum(T) macleodii(T)

Page 77: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

vaga benthicahanedaicolwellianaalgae

Marinomonas Shewanella

communisvagahaloplanktisrubracitreaesperjianaundinaaurantiaputrifacienshanedailuteoviolaceaedenitrificans

tetradonisatlanticacarageenovora

Alteromonas

colwelliana

1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992

japonicumminutiumbiejerinckiimaris

maris

hiroshimensemultiglobiferumpelagicumpusillumcommunejannaschiikreigiivagumbiejerinckiipelagicummaris

hiroshimense

Oceanosprillum

mariswilliamsae

putrifaciens(T)communis(T)linum(T) macleodii(T)

Page 78: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

vaga benthicahanedaicolwellianaalgae

Marinomonas Shewanella

communisvagahaloplanktis

putrifacienshanedai

denitrificans

rubracitreaesperjianaundinaaurantia

luteoviolaceae

tetradonisatlanticacarageenovora

Alteromonas

colwelliana

1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995

japonicumminutiumbiejerinckiimaris

maris

hiroshimensemultiglobiferumpelagicumpusillumcommunejannaschiikreigiivagumbiejerinckiipelagicummaris

hiroshimense

Oceanosprillum

mariswilliamsae

distinctafuliginea

putrifaciens(T)communis(T)linum(T) macleodii(T)

Page 79: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

vaga benthicahanedaicolwellianaalgae

Marinomonas Shewanella

communisvagahaloplanktis

putrifacienshanedai

denitrificans

rubracitreaesperjianaundinaaurantia

luteoviolaceae

tetradonisatlanticacarageenovora

Alteromonas

colwelliana

1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995

japonicumminutiumbiejerinckiimaris

maris

hiroshimensemultiglobiferumpelagicumpusillumcommunejannaschiikreigiivagumbiejerinckiipelagicummaris

hiroshimense

Oceanosprillum

mariswilliamsae

distinctafuliginea

atlanticaaurantiacarrageenovoracitreaesperjianaluteoviolaceanigrifacienspisicidarubra

haloplanktishaloplanktis(T)

Pseudoalteromonas

undina

haloplanktistetradonis

putrifaciens(T)communis(T)linum(T) macleodii(T)

Page 80: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

vaga benthicahanedaicolwellianaalgae

Marinomonas Shewanella

communisvagahaloplanktisrubracitreaesperjianaundinaaurantiaputrifacienshanedailuteoviolaceaedenitrificans

tetradonisatlanticacarageenovora

Alteromonas

colwelliana

1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995 1997

japonicumminutiumbiejerinckiimaris

maris

hiroshimensemultiglobiferumpelagicumpusillumcommunejannaschiikreigiivagumbiejerinckiipelagicummaris

hiroshimense

Oceanosprillum

mariswilliamsae

distinctafuliginea

atlanticaaurantiacarrageenovoracitreaesperjianaluteoviolaceanigrifacienspisicidarubra

Pseudoalteromonas

undinaantartica

elyakoviii

haloplanktistetradonis

haloplanktishaloplanktis(T)

putrifaciens(T)communis(T)linum(T) macleodii(T)

Page 81: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

vaga benthicahanedaicolwellianaalgae

Marinomonas Shewanella

communisvagahaloplanktisrubracitreaesperjianaundinaaurantiaputrifacienshanedailuteoviolaceaedenitrificans

tetradonisatlanticacarageenovora

Alteromonas

colwelliana

1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995 1997 2000

japonicumminutiumbiejerinckiimaris

maris

hiroshimensemultiglobiferumpelagicumpusillumcommunejannaschiikreigiivagumbiejerinckiipelagicummaris

hiroshimense

Oceanosprillum

mariswilliamsae

distinctafuliginea

atlanticaaurantiacarrageenovoracitreaesperjianaluteoviolaceanigrifacienspisicidarubra

Pseudoalteromonas

undinaantartica

elyakoviii

fridgidimarinageldimarinawoodyiiamazonensisbalticaoneidensispealeanaviolacea

bacteriolyticaprydzensistunicatadistinctaelyakoviipeptidolytica

haloplanktistetradonis

mediterannea

haloplanktishaloplanktis(T)

putrifaciens(T)communis(T)linum(T) macleodii(T)

Page 82: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

vaga benthicahanedaicolwellianaalgae

Marinomonas Shewanella

communisvagahaloplanktisrubracitreaesperjianaundinaaurantiaputrifacienshanedailuteoviolaceaedenitrificans

tetradonisatlanticacarageenovora

Alteromonas

colwelliana

1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995 1997 2000 2001

japonicumminutiumbiejerinckiimaris

maris

hiroshimensemultiglobiferumpelagicumpusillumcommunejannaschiikreigiivagumbiejerinckiipelagicummaris

hiroshimense

Oceanosprillum

mariswilliamsae

distinctafuliginea

atlanticaaurantiacarrageenovoracitreaesperjianaluteoviolaceanigrifacienspisicidarubra

Pseudoalteromonas

undinaantartica

elyakoviii

fridgidimarinageldimarinawoodyiiamazonensisbalticaoneidensispealeanaviolacea

bacteriolyticaprydzensistunicatadistinctaelyakoviipeptidolyticatetrodonis

japonica

haloplanktistetradonis

mediterannea

haloplanktishaloplanktis(T)

putrifaciens(T)communis(T)linum(T) macleodii(T)

Page 83: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

[Figure: the same timeline of species names in Alteromonas, Marinomonas, Shewanella, Oceanospirillum and Pseudoalteromonas, with the time axis extended to 2002: 1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995 1997 2000 2001 2002. Further labels on this build include denitrificans, livingstonensis and alleyanna. The species labels are fused in the extracted text; some carry a (T) marker.]

Page 84: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

[Figure: the same timeline of species names in Alteromonas, Marinomonas, Shewanella, Oceanospirillum and Pseudoalteromonas, with the time axis extended to 2004: 1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995 1997 2000 2001 2002 2004. Further labels on this build include mariniintestina, saire, schlegeliana, gaetbuli, primoryensis, stellipolaris and litorea, plus the summary counts "12 others" and "5 others". The species labels are fused in the extracted text; some carry a (T) marker.]

Page 85: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

[Figure: the same timeline of species names in Alteromonas, Marinomonas, Shewanella, Oceanospirillum and Pseudoalteromonas, with the time axis extended to 2005: 1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995 1997 2000 2001 2002 2004 2005. The summary counts on this build read "14 others", "8 others" and "2 others". The species labels are fused in the extracted text; some carry a (T) marker.]

Page 86: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

[Figure: the same timeline of species names in Alteromonas, Marinomonas, Shewanella, Oceanospirillum and Pseudoalteromonas, with the time axis extended to 2006: 1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995 1997 2000 2001 2002 2004 2005 2006. The summary counts on this build read "14 others", "13 others" and "2 others". The species labels are fused in the extracted text; some carry a (T) marker.]

Page 87: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

[Figure: two classifications of the order Alteromonadales (class Gammaproteobacteria), labelled May 2004 and November 2004. One groups the genera under the single family Alteromonadaceae; the other spreads them across families including Ferrimonadaceae, Idiomarinaceae, Moritellaceae, Pseudoalteromonadaceae, Psychromonadaceae, Shewanellaceae and Colwelliaceae, with further genera listed as Incertae sedis. Genera named in the figure: Alteromonas, Alishewanella, Aestuariibacter, Ferrimonas, Colwellia, Idiomarina, Glaciecola, Marinobacterium, Marinobacter, Pseudoalteromonas, Microbulbifer, Psychromonas, Teredinibacter, Shewanella, Thalassomonas, Moritella, Algicola, Agarivorans, Salinomonas.]

At the species level

18 “emendations”

21 new species

19 species reassigned to 4 genera

3 new combinations

6 synonyms

2 species to subspecies

2 subspecies to species

50 names, five genera, five families, and two classes but… only 5 validly published species.

At the higher level

1 family, 16 genera -> 8 families, 12 genera

1 unclassified genus -> 7 unclassified genera

Which is correct?

Which is supported/recorded in the data?

What is the impact on analysis?

Page 88: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Meta-utopia - a pipe dream?

What is meta-data?

Your meta data is my data…

Depends on your perspective

How you see the world

What’s important to you

What you want to do with the “data”

It’s all data anyway…

But it’s useful to differentiate for certain purposes

[Figure: an ecological data set with its meta data, shown beside taxonomic data. The taxonomic data is a table with columns Higher Taxon and Taxon (Pinaceae, Picea; Picea, Picea rubens; Picea, Picea abies) and the fields Name: Linnaeus, Year: 1758. Regions of the figure are labelled META DATA and DATA.]

Page 89: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Meta-utopia - a pipe dream?

Schemas aren't neutral

A schema presumes there is a "correct" way of modelling or categorising ideas and that, given enough time and incentive, people can agree on it…

Any hierarchy of concepts necessarily implies the importance of some axes over others.

Page 90: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Geographic/cartographic perspective

Instance of Picea rubens is-a feature that can be mapped

Features inherently have geospatial coordinates.

[Figure: two hierarchies that both contain Picea rubens. Taxonomic: Pinaceae, Picea, then Picea rubens and Picea abies. Geographic: Feature, with Building and Observation beneath it, then Organism occurrence, then Picea rubens.]

Taxonomic perspective

Instance of Picea rubens is a specimen of some biological taxon

Taxa inherently have characteristics used in classification
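The two perspectives can be sketched as two sets of parent links over the same leaf. This is only an illustration: the links are read off the slide's diagram, and how its boxes connect is an assumption.

```python
# Parent links for each perspective. Both hierarchies are an assumption
# about how the boxes in the slide's diagram connect.
TAXONOMIC = {
    "Picea rubens": "Picea",
    "Picea abies": "Picea",
    "Picea": "Pinaceae",
}
GEOGRAPHIC = {
    "Picea rubens": "Organism occurrence",
    "Organism occurrence": "Observation",
    "Observation": "Feature",
    "Building": "Feature",
}


def ancestors(node, parent_of):
    """Walk parent links upward and return the chain of broader classes."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain


taxonomic_view = ancestors("Picea rubens", TAXONOMIC)
geographic_view = ancestors("Picea rubens", GEOGRAPHIC)
print(taxonomic_view)
print(geographic_view)
```

The same instance ends up under unrelated roots, so neither hierarchy can stand in for the other.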

Page 91: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Meta-utopia - a pipe dream?

There's more than one way to describe something

Page 92: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Page 93: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Meta-utopia - a pipe dream?

There's more than one way to describe something

Reasonable people can disagree forever on how to describe something.

Requiring scientists to use the same vocabulary to describe their data enforces homogeneity in ideas.

Which could limit science…

Page 94: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Meta-utopia - a pipe dream?

Metrics influence results

Agreeing to a common metric for measuring important things in a domain necessarily privileges the items that score high on that metric, regardless of those items' overall suitability.

Ranking axes are mutually exclusive

Software that scores high for security scores low for convenience.

Everyone wants to emphasize their high-scoring axes and de-emphasize (or, if possible, ignore altogether) their low-scoring axes.

Page 95: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Meta-utopia - a pipe dream?

People are not altruistic

Scientists have their own immediate deliverables

Doesn’t leave time for thinking about who else might do what with their data

Metadata exists in a competitive world.

People want their work cited and will (ab)use meta-data to do so.

People are busy

e-Scientists understand the importance of excellent metadata

Jo-scientist is mainly concerned about publishing the results.

No time for added extras

Page 96: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Meta-utopia - a pipe dream?

People make mistakes

Even when there's a positive benefit to creating good metadata, people don’t exercise enough care and diligence in their metadata creation.

Mission Impossible?

Simple observation demonstrates people are poor observers of their own behaviours.

Therefore any meta data will be a poor representation

Page 97: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Life Science Identifiers (LSIDs): the vision

WWW provides a globally distributed communication framework

LSID and the LSID Resolution System will provide a simple mechanism to globally resolve locally named objects distributed over the WWW.

LSIDs will allow us to know what kind of object it is, who originated it, who is responsible for it, how to interface to it and what computations might be carried out on it.

Adoption of LSIDs

will facilitate more reliable integration of multiple knowledge bases, each of which has partial information of a shared domain

will encourage stronger global collaboration in life sciences.

Clark T., Martin S., Liefeld T. Globally Distributed Object Identification for Biological Knowledgebases. Briefings in Bioinformatics 5(1):59-70, March 1, 2004.

Page 98: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Life Science Identifiers

URI based naming scheme

urn:lsid:ipni.org:names:1234-1

retrieval framework

http://lsid.sourceforge.net/

[Figure: an LSID resolver with two operations. "Get data" returns the data record; "Get metadata" returns RDF.]

An LSID has data
- gene sequence in GenBank
- ecological data set (in excel, or in a text file)
- image

The data should never change
- can version

An LSID has metadata
- format of the data
- display title for clients
- Dublin core metadata
- anything you want

The metadata can change
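As a minimal sketch of the naming scheme, the snippet below splits an LSID into its colon-separated parts. It assumes the usual layout urn:lsid:authority:namespace:object with an optional trailing revision; `parse_lsid` and the `LSID` tuple are illustrative names, not part of any LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None  # optional version of the data


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts, rejecting malformed input."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


example = parse_lsid("urn:lsid:ipni.org:names:1234-1")
print(example)
```

Because the data behind an LSID should never change, a changed record would get a new revision (or a new identifier) rather than overwrite the old one; only the metadata is free to change in place.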

Page 99: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Issues For Each Community

What gets an LSID?

Real life objects: biological specimen

Abstract concepts: taxon concept or name, e.g. Bellis perennis

Electronic representations of things: image of specimen, description of specimen or concept

For each thing, what’s the data and metadata?

LSIDs: data doesn’t change but meta data can

Should all data become meta data?

Maybe it implies a temporal database approach

Page 100: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Issues For Each Community

Who issues LSIDs?

Owner of data

Not always clear who owns data, especially legacy data

A central authority

One authority responsible for issuing LSIDs for specific types of information

This would help enforce a 1:1 mapping of LSIDs and data items

It MAY also reduce the likelihood of LSIDs becoming unresolvable

A respected authority

This would help enforce a 1:1 mapping for those who use the authority

It may also be more feasible

Free for all (possibly with an index)

List your LSID authority in an index so your LSIDs are easy to find

Perhaps structured delegation has best potential to globally unite science

Page 101: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Organizations Using LSIDs

Biopathways consortium

National Center for Biotechnology Information (NCBI): PubMed, GenBank

European Bioinformatics Institute (EBI)

BioMOBY: a biological database interoperability program (biomoby.org)

LSIDs represent all entities in MOBY Ontologies (Object, Service, and Namespace), as well as all instances of BioMOBY services.

myGrid (mygrid.org.uk)

used throughout as object naming device

TDWG (tdwg.org)

IPNI: plant names

Index Fungorum: fungi names

US Long Term Ecological Research Network (LTER)

SEEK (seek.ecoinformatics.org)

Used in Kepler (actors, components) and TOS (taxon concepts)…

Page 102: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Use of LSIDs

[Figure: the taxon concept urn:lsid:biocast.org:concept:347, Hippocampus erectus Perry 1810 (lined seahorse), shown with the names Hippocampus marginalis Kaup, 1856, Hippocampus tetragonous Mitchill, 1814 and Hippocampus erectus. A box labelled TAX and several ecological data sets each carry the reference 347.]

Ecological Data Sets
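A rough sketch of what the shared concept identifier buys: data sets that use different names can be combined once each name resolves to the same concept LSID. The name-to-concept table follows the slide's figure; the data set names and counts are hypothetical.

```python
from collections import defaultdict

CONCEPT_347 = "urn:lsid:biocast.org:concept:347"

# Names shown on the slide, all read here as pointing at one concept
# (an assumption about what the figure depicts).
NAME_TO_CONCEPT = {
    "Hippocampus erectus": CONCEPT_347,
    "Hippocampus marginalis": CONCEPT_347,
    "Hippocampus tetragonous": CONCEPT_347,
    "Lined seahorse": CONCEPT_347,
}

# Hypothetical rows: (data set, name as recorded, individuals counted).
DATA_SETS = [
    ("survey_a", "Hippocampus erectus", 4),
    ("survey_b", "Hippocampus marginalis", 2),
    ("survey_c", "Lined seahorse", 1),
]


def counts_by_concept(rows):
    """Total the counts per concept LSID, skipping names with no concept."""
    totals = defaultdict(int)
    for _data_set, name, count in rows:
        concept = NAME_TO_CONCEPT.get(name)
        if concept is None:
            continue
        totals[concept] += count
    return dict(totals)


totals = counts_by_concept(DATA_SETS)
print(totals)
```

Grouping on the recorded name instead would have produced three separate species.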

Page 103: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Moving to a world of LSIDs

Using LSIDs alone will not address all issues of data sharing

Data repositories must (re)use LSIDs to cross reference data within and outwith their own repository.

It is important that we use the same LSID to refer to the same entity

If multiple LSIDs exist for the same entity we would be required to decide whether or not two LSIDs were really the same thing.

We would be in a worse situation than we are today, for example when trying to decide if two taxonomic names mean the same.

Generating LSIDs for any self contained data set is a fairly trivial task

Assigning LSIDs from an authoritative repository to existing data, so that they are re-used, is more challenging

Investigate what’s involved…
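The "re-use before minting" rule can be sketched as a small registry that issues a new LSID only when no authoritative one is known. Everything here is hypothetical: the `LsidRegistry` class, the `example.org` and `authority.example` authorities, and the simple name normalisation stand in for whatever a real repository would use.

```python
def normalise(key: str) -> str:
    """Fold case and collapse whitespace so trivially different keys agree."""
    return " ".join(key.casefold().split())


class LsidRegistry:
    def __init__(self, authority: str, namespace: str):
        self.authority = authority
        self.namespace = namespace
        self._by_key = {}

    def adopt(self, key: str, lsid: str) -> None:
        """Record an LSID already issued by an authoritative repository."""
        self._by_key[normalise(key)] = lsid

    def lsid_for(self, key: str) -> str:
        """Return the known LSID for key, minting a local one only if needed."""
        k = normalise(key)
        if k not in self._by_key:
            local_id = len(self._by_key) + 1
            self._by_key[k] = (
                f"urn:lsid:{self.authority}:{self.namespace}:{local_id}"
            )
        return self._by_key[k]


registry = LsidRegistry("example.org", "names")
registry.adopt("Bellis perennis", "urn:lsid:authority.example:names:42")
reused = registry.lsid_for("bellis  PERENNIS")
minted = registry.lsid_for("Picea rubens")
print(reused, minted)
```

The hard part the slide points to is the lookup itself: deciding that a local record and an authority's record are the same entity, which exact-key matching like this does not solve.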

Page 104: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Convert Data Provider to use LSIDs

[Figure: architecture for converting a data provider. The original data repository (target), the Hexacorallia data provider, is mapped to an ontology and holds RDF data to be updated with LSIDs from authority providers; the Hexacorallia data sits in a triple store. Authority LSID resolution services (source) for Specimen, Publication, Concept, Name and Person each return LSID + RDF and are also mapped to the ontology. The Linker Tool matches data from the repository with data in the LSID resolvers and returns the LSID to the repository.]
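The matching step of the linker can be sketched as a ranking of authority records by string similarity to a local value. This is not the linker's actual method, which the slides do not describe; it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, and the authority records are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical authority records: LSID -> label returned by its resolver.
AUTHORITY_PERSONS = {
    "urn:lsid:authority.example:person:1": "Smith, Jane",
    "urn:lsid:authority.example:person:2": "Smythe, John",
    "urn:lsid:authority.example:person:3": "Jones, Alex",
}


def rank_candidates(local_value, authority_records, limit=3):
    """Return up to `limit` (score, lsid, label) tuples, best match first."""
    scored = []
    for lsid, label in authority_records.items():
        score = SequenceMatcher(
            None, local_value.casefold(), label.casefold()
        ).ratio()
        scored.append((score, lsid, label))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:limit]


candidates = rank_candidates("Smith, Jane", AUTHORITY_PERSONS)
best_score, best_lsid, best_label = candidates[0]
print(best_lsid, best_score)
```

The ranked list is what a user would then confirm or skip; the LSID of the confirmed candidate is written back to the target repository.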

Page 105: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Linking….

[Figure: a Linker Client between an authoritative (“source”) provider & linker and a local (“target”) provider. Each provider runs a WASABI Service Request Dispatcher; one exposes LSID, SPARQL and OAI services, the other LSID, SPARQL, Linker and OAI. Triple stores shown: Hexacorallia Thematic and Person. Step shown: request linkable classes and select one to be linked.]

Page 106: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Linking….

[Figure: a Linker Client between an authoritative (“source”) provider & linker and a local (“target”) provider. Each provider runs a WASABI Service Request Dispatcher; one exposes LSID, SPARQL and OAI services, the other LSID, SPARQL, Linker and OAI. Triple stores shown: Hexacorallia Thematic and Person. Step shown: select class to be linked.]

Page 107: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Linking….

[Figure: a Linker Client between an authoritative (“source”) provider & linker and a local (“target”) provider. Each provider runs a WASABI Service Request Dispatcher; one exposes LSID, SPARQL and OAI services, the other LSID, SPARQL, Linker and OAI. Triple stores shown: Hexacorallia Thematic and Person. Step shown: request possible LSIDs.]

Page 108: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

[Screenshot of the linker client with three callouts: person to find LSID for; choice of possible persons with LSIDs; confirm/skip annotations.]

Page 109: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Issues in converting to LSIDs

Mapping to ontology

LSIDs: RDF schema? Ontology? Agreement on ontology - problem?

Replace or annotate existing data?

If we replace an author with a person LSID, what is returned when resolving that LSID is unlikely to match the data that was stored in the DB for the author.

Dependencies between objects with LSIDs

If you link via a taxon name LSID, the resolved name should have an embedded LSID for a publication, so there shouldn’t be any need (in principle) to match publications for names

What about authorities that issue LSIDs but don’t map to other authorities? e.g. name providers not mapping to either publication or specimen providers

Page 110: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Issues in converting to LSIDs

What support would a linking tool need to provide end users?

How would users want to process this data?

How much automation? E.g. above a certain confidence level. Would this be trusted?

Order of matching: e.g. match all instances of persons at once? Match of persons by publication?

Other Issues…

Performance of existing linking tool approach

Lots of data passing going on

Need more efficient approach which matches user needs

Finding authorities that provide linking services

How do scientists find out about authorities with linking services?

How do they know which ones to use?
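One way to picture "automation above a certain confidence level" is a triage rule that links automatically only when the best candidate is both strong and clearly ahead of the runner-up, and otherwise hands the decision to the user. The thresholds and the `triage` function are illustrative assumptions, not values from the project.

```python
def triage(candidates, accept_at=0.95, margin=0.10):
    """Decide what to do with scored candidates.

    candidates: iterable of (score, lsid) pairs with scores in [0, 1].
    Returns (decision, lsid) where decision is "auto_link", "ask_user"
    or "no_match".
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    if not ranked:
        return ("no_match", None)
    best_score, best_lsid = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    # Auto-link only when the best match is strong and unambiguous.
    if best_score >= accept_at and best_score - runner_up >= margin:
        return ("auto_link", best_lsid)
    return ("ask_user", best_lsid)


clear = triage([(0.99, "lsid-a"), (0.60, "lsid-b")])
ambiguous = triage([(0.99, "lsid-a"), (0.97, "lsid-b")])
empty = triage([])
print(clear, ambiguous, empty)
```

Whether scientists would trust the automatic branch is exactly the open question on the slide; keeping a log of auto-linked records for later review would be one way to make it auditable.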

Page 111: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

To Summarise….

We have seen that (Life) Science is Complex & Changing

The fundamental challenges of science that have always been there are still here

Now we have additional opportunities associated with the explosion of scientific information and the move to a virtual world

And now the challenge is how best to exploit these….

e-Science uses computation to aid scientists

By providing appropriate infrastructure and tool support

Speed up scientific processes

Do them repeatedly

Re-evaluation

Can give scientists time for more thoughtful science…

May require a change of emphasis in how scientists work

Must support the inherent features of science, scientists and scientific data

Page 112: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

e-Science: Complex Science

Support decomposition of scientific domains, problems and associated data

Fundamental to data & software analysis and design

Support re-composition, linking or building on the components

Need to know when components or links have changed

Identify the overlaps/linkages in the different domains

Need useful approximations of things to simplify linked domain

Need to understand the approximations or linking points well

Raise level of abstraction

Artefact of storage mechanisms

Implies lingua franca

Need more evaluation of the different approaches

Page 113: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

e-Science: Changing Science

Science is full of legacy data

Today’s scientific research is tomorrow’s legacy data

Provide long-term persistent storage

Any published scientific discovery should store the data as evidence

Data needs to be accurately annotated

Sufficient to repeat analyses to test hypotheses

e-Science already changing the way scientists do science

But to be effective it needs to change even more…

More emphasis on well curated, accessible, persistent data

Evidence for results

Page 114: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Meta Data & Ontologies?

Do we throw out meta data/ontologies, then?

No… To benefit from stored data we need to know what it means!

However, there are no large-scale benefits while there is insufficient coverage of meta data

If only 10% of data has meta data, people won’t use meta data…

Need to reach the tipping point…

Controlled vocabulary and schemas shown useful for large projects or small communities with common goal

Need long-term projects to see if they sustain their value as the community and the science evolve.

Page 115: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Describe or Prescribe?

Descriptions become vocabularies used by others

Folksonomy or ontologies?

Informal versus formal or free versus constrained

Informal can be basis for something formal

Move towards common vocabularies with built in flexibility and extensibility

Issue of what language(s)…

Need more research evaluating these issues…

Page 116: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Reliability of Meta Data

Automatic recording of meta data

From machines, software, workflows…

Avoids labour

Starting to happen

Helps reach critical mass of available meta data

Still need to decide what it is that the machines/software are collecting…

Human input still needed

Purpose of experiment, deviations from planned protocol etc.
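A minimal sketch of automatic recording: a decorator that logs what the software can know for itself (step name, start time, arguments, interpreter version) each time an analysis step runs, and leaves a slot for what only a person can supply. The `record_metadata` decorator, the `PROVENANCE` list and the `mean_depth` step are all hypothetical.

```python
import functools
import platform
from datetime import datetime, timezone

PROVENANCE = []  # one entry per executed step


def record_metadata(func):
    """Wrap an analysis step so each call appends a provenance entry."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc)
        result = func(*args, **kwargs)
        PROVENANCE.append({
            "step": func.__name__,
            "started_utc": started.isoformat(),
            "arguments": {"args": args, "kwargs": kwargs},
            "python": platform.python_version(),
            # Purpose and protocol deviations still need a human.
            "note": None,
        })
        return result
    return wrapper


@record_metadata
def mean_depth(depths):
    return sum(depths) / len(depths)


value = mean_depth([10.0, 20.0, 30.0])
print(value, PROVENANCE[-1]["step"])
```

The empty `note` field is the point of the last two lines of the slide: the machine fills in the routine fields, and the scientist is asked only for the part that cannot be captured automatically.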

Page 117: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Support

Community ontologies need to be easily available to all scientists

Listing the known ontologies on a web site is not enough

Need to understand when (meta) data is fit for purpose

Accurate enough, not overly precise

Need collaborative approaches to extending ontologies

Allow users to be involved to achieve community buy-in

Ontologies are difficult for people to comprehend

Need good visualisation

Need to trust system

Page 118: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Tools

Simple tools would go a long way to help

Contextual data is consistent for many data sets

e.g. observer/location

Tools should support collection and re-use of this data

Make use of (incorporate) existing ontologies into tools

Get the software to do as much work as possible

Good at repetitive tasks, faster than humans

Personalisation

How application specific do tools have to be to be useful?

Generic / Domain specific / Individual?

The more generic the more widely applicable

Pluggable components for personalisation?

Page 119: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Finally…

It will take time and commitment for any of these approaches to work.

Focus on central important resources that are reused in many (sub-)domains

Ensure the data are well managed and curated, identified, described, easily available, lasting and evolving

Observe whether they benefit the community or act as a straitjacket

A good test case for this approach is the development of a taxon concept name resolution service

To allow scientists to find correct names for the concepts they are working with,

Mark up their data,

Resolve their concepts against other scientists’ data so they know they are talking about the same thing.

Is central to communication in all life sciences

Poses many computational, social and data research issues

Page 120: Exploiting Diverse Sources of Scientific Data  the vision,  what has been achieved  and what next…

Acknowledgements

E-Science Institute for sponsoring theme leadership

Malcolm Atkinson

For support and many interesting discussions on exploiting scientific data.

Collaborators on SEEK project,

Matt Jones, Bill Michener, Aimee Stewart, Robert Gales, Josh Madin, Shaun Bowers

Collaborators in TDWG/GBIF Robert Kukla, Roger Hyam,

funding, slides, interesting problems