SysMO-DB: A Community-Based Approach to Data Sharing

Dr Katy Wolstencroft, University of Manchester

SysMO-DB

A data access, model handling and data integration platform for Systems Biology

A web-based resource that promotes shared understanding

Using a common platform and common technologies

Started July 2008

SysMO-DB Dev Team

University of Stellenbosch, South Africa

University of Manchester, UK

Jacky Snoep

Heidelberg Institute for Theoretical Studies, Germany

University of Manchester, UK

Olga Krebs

Wolfgang Müller

Sergejs Aleksejevs

Carole Goble

Stuart Owen

Katy Wolstencroft

Finn Bacall

Franco B du Preez

Pan-European collaboration

Eleven individual projects, 89 institutes

Different research outcomes

A cross-section of microorganisms, incl. bacteria, archaea and yeast

Record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way

Present these processes in the form of computerized mathematical models

Pool research capacities and know-how

Already running since April 2007

Runs for 3-5 years

This year, 2 new projects join and 6 leave

http://www.sysmo.net

Systems Biology of Microorganisms

Challenges

Heterogeneous data and models

Distributed groups of researchers

Modellers and experimentalists have different skills, training, experience

Scientists want to remain in control

Social and technical challenges

Social Challenge: Focus Group

DB team, Focus Group, Projects

Show what is there

Suggest what is possible

Ask for requirements

Give requirements

Tell priorities

Rate outcomes

Suggest improvements

Double check

Transmit

Disseminate

Collect answers

Focus Group SysMO-DB PALS

21 Postdocs and PhD students

Modellers, experimentalists and bioinformaticians

Design and technical collaboration team

Intense collaboration

UK and Continental PALS Chapters

Audits and Sharing: methods, data, models, standards, software, schemas, spreadsheets, SOPs…

20 questions

Deployment into Projects

Technical Challenge

Rapid and incremental development

Just enough and just in time, not just in case

No reinvention

Driven by the PALs

Sustainable and extensible

Migrate to standards

Fitting in with normal lab practices

What do we share

Protocol: Title, Authors, Keywords, Abstract, Materials (Reagents, Reagent Set Up, Equipment), Time Taken, Procedure, Troubleshooting, Critical Steps, Anticipated Results, References

Methods + Data + Results

Nature Protocols

All SysMO Assets

Protocols for Models

Protocol: Title, Authors, Keywords, Description, Assumptions, Equations, Numerical Methods/Algorithms, Computational Tools, Parameter Estimation Techniques, Limitations, References

What do we share

Methods + Data + Results + Models

All SysMO Assets

SOP

A Tree View of Assets

Investigation, Studies, Assay

Construction, Validation

SOP

ISA infrastructure provides a directory structure for experiments

http://isatab.sourceforge.net/
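The Investigation, Study, Assay tree can be sketched in a few lines. This is only an illustration of the nesting, not the ISA-Tab format or the SEEK data model; all class names, titles and SOP names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    title: str
    sops: List[str] = field(default_factory=list)  # names of SOPs attached to this assay


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)


def asset_paths(inv: Investigation) -> List[str]:
    """Flatten the tree into directory-style paths, one per SOP."""
    paths = []
    for study in inv.studies:
        for assay in study.assays:
            for sop in assay.sops:
                paths.append("/".join([inv.title, study.title, assay.title, sop]))
    return paths


# Hypothetical example: one study with a Construction and a Validation assay.
example = Investigation(
    "investigation-1",
    [Study("study-1", [
        Assay("Construction", ["construction-sop"]),
        Assay("Validation", ["validation-sop"]),
    ])],
)

for path in asset_paths(example):
    print(path)
```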

Expertise, tools

Coordinates, data

How do we share

“Just Enough Results Model”

What type of data is it: microarray, growth curve, enzyme activity…

What was measured: gene expression, OD, metabolite concentration…

What do the values in the datasets mean: units, time series, repeats…

Based on:

Minimum information models, e.g. MIAME, MIAPE, MIRIAM

Biological ontologies, e.g. Gene Ontology, MGED, SBO

BioPortal web service used in SysMO-SEEK for concept lookup and visualisation
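As a rough illustration of the "just enough" idea, a dataset description could carry only the three kinds of information listed above and report what is still missing. The field names and the `missing_fields` helper are assumptions made for this sketch, not the actual JERM templates.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class JermRecord:
    """Hypothetical minimal description of one dataset."""
    title: str
    data_type: Optional[str] = None        # what type of data, e.g. "growth curve"
    measured: Optional[str] = None         # what was measured, e.g. "OD"
    units: Optional[str] = None            # what the values mean
    is_time_series: Optional[bool] = None
    repeats: Optional[int] = None


def missing_fields(record: JermRecord) -> List[str]:
    """Return the names of fields that have not been filled in yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# A partly described dataset is still accepted; it is just less findable.
record = JermRecord(title="growth-curve-01", data_type="growth curve", measured="OD")
print(missing_fields(record))
```

Reporting gaps rather than rejecting the record mirrors the "recommend but do not enforce" approach taken elsewhere in the talk.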

How do we share

Share JERM templates developed by SysMO-DB, PALs and consortium

Spreadsheet templates

Database schemas

Encourage uptake throughout SysMO: transcriptomics, metabolomics, proteomics etc.

Tools to help manage data: annotation standards by stealth

Controlled vocabulary plug-in

BioPortal

JERM Model

SysMO JERM: a ‘MIBBI’ for the SysMO-SEEK

What do we need to help you find stuff? Title, person, filename, class

What is experiment specific?

What is experiment specific, but helps us map between them?

Common biological elements: chemicals, genes, proteins, organisms, strains

Identifying Biological Objects

What do you have in your data? Proteins/enzymes, genes/expression levels, metabolites

Where/how do these objects interact? Pathways, flux, experimental conditions

What models describe these interactions?

Possible when using common frameworks, naming schemes and controlled vocabularies

Following Standards

We recommend formats but we do not enforce them

Protocols and SOPs: Nature Protocols

Data: JERM models and community minimum information models

Models: SBML and related standards

Publications: PubMed and DOI

If you follow the prescribed formats, you get more out, but if you don’t, you can still participate

Lowering the adoption barrier

Access Permissions

Just Enough Sharing

...we don’t talk about security

[Diagram: project data stores COSMIC, SysMOLab, MOSES, Alfresco, Wiki, another datastore]

Just Enough sharing

SOP

Fetch on Request

Direct Upload

When do People Share

Data Collection: your own group and maybe your project (collaboration)

Pre-publication: project + maybe consortium (discussion and criticism)

Post-publication: consortium and wider community (advertising)

• Suspicion and fear of scooping

• Reputation

SysMO aims: sharing sooner

Incentives for sharing

Safe haven for data

Credit and attribution

Help with exporting to public repositories (e.g. one-click export to ArrayExpress, PRIDE etc.)

A repository for “supplementary materials” in publications

Linking publications and data

Access other resources through a SEEK gateway

SEEK as a Gateway

JWS Online Plugin

• online simulator, runs in SysMO-SEEK

• upload models in SBML format

• SBGN schemas, with annotations and external links

Incentives for sharing

Credit and attribution: SEEK records who owns what. If data, models, or protocols are reused, scientists get recognition

Accountability: SEEK records who owns what. If you take credit for others’ work, they will see

Data citation: formal credit for data published in SEEK

Data Citation

Persistent identifiers and URLs for the data

Linking people to the data

Safe haven for the data

Guarantees of sustainability

Data MUST be uploaded and archived

If cited, it must be public

SEEK as a Safe Haven

HITS can archive SysMO data for 10 years

All SysMO software is open source and available

Distinction between sustaining the service and the software

Governance and Policy

What is required by SysMO members?

When should they share during their projects?

How long after the project can they keep data private to finish publications?

If their data is stored locally, what is the archive process?

Policy from DMG and funding agencies and NOT SysMO-DB

Governance and Policy

Proposals under discussion:

All data registered in SEEK should be uploaded and archived at the end of a SysMO project

All data from finished projects should be shared

How long after the end? 1 day, 6 months, 1 year?

Scientists can invoke “creator’s privilege” on SysMO assets produced near the end of the project

Extra time to write up and publish before release to the general public, respecting publication cycles
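To make the proposal concrete, the release date of an asset could be computed from the project end date plus a sharing delay, with an optional extension for creator's privilege. The delay and extension lengths are deliberately left as parameters because the talk presents them as undecided; the function name and the dates used are hypothetical.

```python
from datetime import date, timedelta


def release_date(project_end: date, delay_days: int, privilege_days: int = 0) -> date:
    """Date on which an asset would become public under the proposed policy.

    delay_days:     sharing delay after the project ends (value under discussion)
    privilege_days: extra time granted if the creator invokes "creator's privilege"
    """
    if delay_days < 0 or privilege_days < 0:
        raise ValueError("delays must not be negative")
    return project_end + timedelta(days=delay_days + privilege_days)


# Hypothetical project end date, with the shortest delay mentioned (1 day)
# and an illustrative 30-day extension.
end = date(2010, 1, 1)
print(release_date(end, delay_days=1))
print(release_date(end, delay_days=1, privilege_days=30))
```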

SysMO So Far…

People ARE sharing

Over 300 assets in SEEK

SOPs: 102, Models: 17, DataFiles: 95, Investigations: 13, Studies: 26, Assays: 53

PALs: a network of young SysBio researchers

Training and education in data and metadata management spreading through the consortium

Modellers and experimentalists communicating

SysMO Methods Spreading

Virtual Liver Mueller, via HITS

Lungsys

SBCancer

EraSysBio+

Eukaryotic organisms

Interactions between host and pathogen

Human disease

Multi-scale modelling

Why it works for us

A solution that fits in with current practices

Start simple, show benefits, add more

Engage with the people actually doing the work: PhD students, post-docs

Build to the PALs’ requirements

Respect publication cycles

Respect cultural differences

Scientists stay in control

Acknowledgements

SysMO-DB Team

SysMO-PALS

myGrid, HITS and JWS Online

EMBL-EBI, MCISB

http://www.sysmo-db.org
