SysMO-DB: Towards “just enough” data exchange for the SysMO Consortium
Carole Goble, Uni of Manchester, UK
Jacky Snoep, Uni of Manchester, UK / Stellenbosch, South Africa
Isabel Rojas, EML Research gGmbH, Germany
2nd Evaluation Conference, 19-20 May 2009, Vienna, Austria
Started July 2008, 3 years, 3 staff + 3 investigators, 3 teams over 3 sites
Sensitively retrofit a data access, model handling and data integration platform.
Support and manage the diversity of data, models and competencies.
Web-based solution:
• exchange of data, models and processes (intra- and inter-consortia).
• search for data, models and processes across the initiative.
• dissemination of results.
SysMO-DB
SysMO-DB Team
University of Stellenbosch, South Africa / University of Manchester, UK
Jacky Snoep
EML Research gGmbH, Germany
Isabel Rojas
University of Manchester, UK
Olga Krebs
Wolfgang Müller
Sergejs Aleksejevs
Carole Goble
Stuart Owen
Katy Wolstencroft
Own solutions
• Own data solutions and collaboration environments: wikis, e-Groupware, PHProjekt, BaseCamp, PLONE, Alfresco, bespoke commercial … files and spreadsheets.
Suspicion
• Suspicion and caution over sharing.
• Interesting interplay between modellers, experimentalists and bioinformaticians.
Data issues
• Many do not have data, or follow the standards that exist, or know who is doing what.
• Much of the data cannot be compared: different organisms, different strains.
Resource issues
• No extra resources for the consortiums.
• 91 institutes, 11 consortiums, some overlapping.
Principles…
• A series of small victories
• Realistic
• Don't reinvent
• Sustainable and extensible
• Migrate to standards
• Provide instant gratification
• Address doubt and anxiety
• Build it rather than write about it.
Another view on the goal
File management systems: Plone, Alfresco, PHProjekt, eGroupWare, Wikis
Specialist databases that you make your own: BASE, maxD, myExperiment
Specialist public databases you have a bit of: SABIO, JWS Online, myExperiment
Specialist public databases: BRENDA, PDB, BioModels, WikiPathways, KEGG, UniProt, GenBank, SGD, PubMed
[Figure labels: Personal; Pile of spreadsheets on my hard drive; Project; SysMO; Community Supported Data Sets; Public Reference Data Sets]
Some numbers & some consequences
• 1 Software Engineer, 1 Bioinformatician, 1 Bio-database specialist
• 11 projects, 91 institutes
• 20 person days/year/project
• 2.5 person days/year/institute
• “Just in case” approach impossible
• Focus on real needs: “just in time”, “just enough”
• The right 20%
• Help people help themselves
• Communication!
80-20 rule: 80% of the features won't be used anyway
[Figure labels: 20% (useful features), 80%]
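The effort figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted on the slide; the yearly total of 220 person days is derived here (20 × 11) and is not stated in the talk.

```python
# Cross-check of the effort figures quoted on the slide.
PROJECTS = 11
INSTITUTES = 91
DAYS_PER_PROJECT = 20  # person days/year/project, from the slide

# Derived, not stated on the slide: total support days implied per year.
total_days = DAYS_PER_PROJECT * PROJECTS

# The same total spread over all institutes.
days_per_institute = total_days / INSTITUTES

print(total_days)                    # 220
print(round(days_per_institute, 1))  # 2.4, close to the 2.5 quoted on the slide
```

With so few days per institute, building features "just in case" is not an option, which is the point the slide makes.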
Social Approach
Questionnaires
PALS
• 19 Postdocs and PhD students
• All three kinds of people
• Our design and technical collaboration team
• Very intense face to face and virtual collaboration
• UK and Continental PALS Chapters
Audits and Sharing
• Methods, data, models, standards, software, schemas, spreadsheets, SOPs…
Communication via PALs
[Figure labels: DB team, PALS, Projects]
Show what is there
Suggest what is possible
Ask for requirements
Give requirements
Tell priorities
Rate outcomes
Suggest improvements
Double check
Transmit
Disseminate
Collect answers
SysMO-DB PALs Meeting statistics
• 10 months
• 2 PAL all hands meetings
• 2 PAL chapter meetings
• 9 visits to 6 SysMO projects
• Numerous Skype chats, mails, telcons
Impact on development?
See later in talk
“We need a way of collecting, structuring and sharing Standard Operating Procedures”
“Excel spreadsheets are our most common way of collecting and processing data”
“I need a kind of “yellow pages” that tells me who is in what project and what they are working on”
[Figure labels: Modellers, Experimentalists and Bioinformaticians linked by Exchange; Spreadsheet Repository; SBML Models Repository; SOP Repository; Workflow Repository; Consortium: Data, Models, Processes (SOPs and Workflows)]
SysMO Approach
[Figure labels: SysMO-SEEK web portal interface; JWS Online; Assets Catalogue; Yellow Pages; Search; SysMO DB; JERM; Public data; SBML; Nature Protocols; Workflow Management System]
Discovery SysMO-SEEK
• Single, web based, access point
• Access control & versioning management
• Yellow pages (“who is who”): People, Expertise, Equipment
• Assets catalogue (“who has what”): SOPs, Spreadsheets, pre-published models; metadata about data held by projects
• Access to other repositories: Models (JWS Online), Workflows (myExperiment), Public web services (BioCatalogue)
• Call out to external resources, e.g. PubMed
• Does not hold results. Holds metadata on results and links to results.
• A component for SysMO groups to incorporate in their own environments and applications
Demo
Finding and Exchanging Project Data
“Just Enough” Exchange
Data Comparison and Exchange
Public data sources
• model organism databases (e.g. SGD)
• BRENDA …
Data produced by SysMO
• SABIO-RK, iChiP, MeMo …
Local databases & files
• Excel spreadsheets: the most common form of experimental data format.
[Figure labels: Metadata; Proteomics; Metabolomics; Microarray; Single Cell Data; COSMIC and BaCell (Alfresco, document management system)]
SysMO LAB Spreadsheet
Experiment  Measurement number  Glucose  Ethanol  Acetate  Lactate  Formiate  Succinate  Pyruvate  Acetoin  2,3 Butanediol
                                mM       mM       mM       mM       mM        mM         mM        mM       mM
1           1                   3,57     0        16,61    11,57    0         0          0         3,06     0
2           1                   0        0        32,85    7,03     5,73      0          0,56      4,21     0
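The spreadsheet rows above use decimal commas, which is one small reason such data cannot be compared without preparation. As an illustration only, here is a minimal sketch of reading one row into numeric values. It assumes the first two cells are the experiment and measurement numbers and that the nine concentrations follow in the column order shown on the slide; `parse_row` is a hypothetical helper, not part of SysMO-DB.

```python
# Column names taken from the SysMO LAB spreadsheet on the slide.
METABOLITES = ["Glucose", "Ethanol", "Acetate", "Lactate", "Formiate",
               "Succinate", "Pyruvate", "Acetoin", "2,3 Butanediol"]

def parse_row(line):
    """Split one whitespace-separated row into ids and mM concentrations."""
    cells = line.split()
    expected = 2 + len(METABOLITES)
    if len(cells) != expected:
        raise ValueError(f"expected {expected} cells, got {len(cells)}")
    experiment, measurement = int(cells[0]), int(cells[1])
    # Decimal commas ("3,57") become decimal points before conversion.
    values = [float(c.replace(",", ".")) for c in cells[2:]]
    return {"experiment": experiment, "measurement": measurement,
            "mM": dict(zip(METABOLITES, values))}

row = parse_row("1 1 3,57 0 16,61 11,57 0 0 0 3,06 0")
print(row["mM"]["Glucose"])  # 3.57
print(row["mM"]["Acetoin"])  # 3.06
```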
Our Extra Work!!
Challenge
Aim: maintain the independence of the projects
• Data registered in the SEEK Assets Catalogue
• Data remains at the host project site
• Data pulled from host project site on request
1. Need to map to a common metadata model for each data type (microarray, metabolomic…) so data can be found, understood and compared.
   Just Enough Results Models (JERM)
2. Need to create software that interfaces with the different existing project data management setups (Alfresco, eGroupWare, MediaWiki, BASE, Excel…)
   JERM Adapters and Extractors
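The register / remain / pull pattern described on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the SEEK implementation: `AssetEntry`, `AssetsCatalogue` and the `fetch` hook are hypothetical names, and the host site is simulated by a local function.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class AssetEntry:
    """Metadata held centrally; the data itself stays with the project."""
    title: str
    project: str
    data_type: str
    fetch: Callable[[], bytes]  # hypothetical hook that contacts the host site

class AssetsCatalogue:
    def __init__(self):
        self._entries: Dict[str, AssetEntry] = {}

    def register(self, asset_id, entry):
        # Only the description is stored; no data is copied here.
        self._entries[asset_id] = entry

    def search(self, data_type):
        return [aid for aid, e in self._entries.items() if e.data_type == data_type]

    def pull(self, asset_id):
        # Data is requested from the host project only when someone asks.
        return self._entries[asset_id].fetch()

calls = []

def fake_host_fetch():
    """Stand-in for a project site; records that it was contacted."""
    calls.append("fetched")
    return b"experiment,glucose\n1,3.57\n"

catalogue = AssetsCatalogue()
catalogue.register("a1", AssetEntry("Fermentation run", "ExampleProject",
                                    "metabolomics", fake_host_fetch))
```

Registering and searching never touch the host site; only `catalogue.pull("a1")` does, which is what keeps the projects independent.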
JERM: Just Enough Results Model
• Way to “wrap” data sources to match our agreed common data model for each data type
• Minimum information needed to exchange data of each type
[Figure labels: Databases; Content management systems; Excel spreadsheets; Data file store; JERM; Extract; Export; Import; Metadata; Proteomics; Metabolomics; Microarray; Single Cell Data]
What is Metadata?
Information, additional to the raw/processed data itself.
What a potential user of the data would need to know to be able to make full and accurate use of the data in a subsequent scientific analysis.
Machine readable descriptions of Data, Models, Services, Resources, Applications
[COSMIC]
Minimum Information Initiatives
CIMR: Core Information for Metabolomics Reporting
MIABE: Minimal Information About a Bioactive Entity
MIACA: Minimal Information About a Cellular Assay
MIAME: Minimum Information About a Microarray Experiment
MIAME/Env: MIAME / Environmental transcriptomic experiment
MIAME/Nutr: MIAME / Nutrigenomics
MIAME/Plant: MIAME / Plant transcriptomics
MIAME/Tox: MIAME / Toxicogenomics
MIAPA: Minimum Information About a Phylogenetic Analysis
MIAPAR: Minimum Information About a Protein Affinity Reagent
MIAPE: Minimum Information About a Proteomics Experiment
MIARE: Minimum Information About a RNAi Experiment
MIASE: Minimum Information About a Simulation Experiment
MIENS: Minimum Information about an ENvironmental Sequence
MIFlowCyt: Minimum Information for a Flow Cytometry Experiment
MIGen: Minimum Information about a Genotyping Experiment
MIGS: Minimum Information about a Genome Sequence
MIMIx: Minimum Information about a Molecular Interaction Experiment
MIMPP: Minimal Information for Mouse Phenotyping Procedures
MINI: Minimum Information about a Neuroscience Investigation
MINIMESS: Minimal Metagenome Sequence Analysis Standard
MINSEQE: Minimum Information about a high-throughput SeQuencing Experiment
MIPFE: Minimal Information for Protein Functional Evaluation
MIQAS: Minimal Information for QTLs and Association Studies
MIqPCR: Minimum Information about a quantitative Polymerase Chain Reaction experiment
MIRIAM: Minimal Information Required In the Annotation of biochemical Models
MISFISHIE: Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments
STRENDA: Standards for Reporting Enzymology Data
TBC: Tox Biology Checklist
BioPAX: Biological Pathways Exchange, http://www.biopax.org/
FuGE: Functional Genomics Experiment
MGED: Microarray Experimental Conditions
MIBBI: Minimum Information for Biological and Biomedical Investigations, http://www.mibbi.org/index.php/MIBBI_portal
Just Enough Results Model
Inspired by MCISB Key Results initiative and SBRML [Paton et al]
Harvested standards
Analysed current practice and consortium schemas and spreadsheets
Designing the corresponding JERMs
Mapping data sources of the projects to JERMs.
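Mapping a project's data source to a JERM amounts, at its simplest, to renaming the project's own column headers to the agreed field names and flagging whatever is left over. A minimal sketch follows; the headers and field names are invented for illustration and are not taken from the real JERM.

```python
# Hypothetical mapping from one project's spreadsheet headers to agreed
# JERM-style field names; neither side is taken from the real JERM.
HEADER_MAP = {
    "Exp. No": "experiment",
    "Strain": "organism_strain",
    "Glc [mM]": "glucose_mM",
}

def to_jerm(record, header_map):
    """Rename known columns; report the ones that have no mapping yet."""
    mapped, unmapped = {}, []
    for header, value in record.items():
        if header in header_map:
            mapped[header_map[header]] = value
        else:
            unmapped.append(header)
    return mapped, unmapped

mapped, unmapped = to_jerm({"Exp. No": 1, "Glc [mM]": 3.57, "Operator": "n/a"},
                           HEADER_MAP)
print(mapped)    # {'experiment': 1, 'glucose_mM': 3.57}
print(unmapped)  # ['Operator']
```

Reporting the unmapped headers, rather than silently dropping them, gives the project and the DB team something concrete to discuss.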
What does it cover?
Experimental Data Metadata
People
Projects
Assay
Study
Experimental conditions
Factors studied
Models
SOPs
Homogenised terminology and values in the datasets themselves
Workflows
ISA-TAB compliant
Investigation
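One way to picture "just enough" metadata is as a small record whose fields mirror the items listed on this slide, together with a check for what is still missing. The sketch below is an assumption-laden illustration: the class, the field names and the rule that every field must be non-empty are hypothetical and do not reproduce the actual JERM schema.

```python
from dataclasses import dataclass, field, fields
from typing import List

@dataclass
class MinimalRecord:
    """Illustrative 'just enough' description of one dataset."""
    investigation: str = ""
    study: str = ""
    assay: str = ""
    project: str = ""
    people: List[str] = field(default_factory=list)
    experimental_conditions: List[str] = field(default_factory=list)
    factors_studied: List[str] = field(default_factory=list)

def missing_fields(record):
    """Names of fields left empty, i.e. what is still needed for exchange."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]

rec = MinimalRecord(investigation="Example investigation", study="Example study",
                    assay="metabolomics", project="ExampleProject",
                    people=["A. Researcher"])
print(missing_fields(rec))  # ['experimental_conditions', 'factors_studied']
```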
Where is it used?
Minimum metadata for SysMO exchange
• What an experiment is.
Find
• Extract metadata from datasets for the Assets catalogue
Access
• Expose data results through a JERM interface
• Access controlled by consortiums, groups and individuals
Just Enough Results Model
[Figure labels: Metadata; SABIO-RK; BRENDA; myDB; mySpreadSheet; JERM Web Service Access Interface; Access Control; JERM Extractor and Access Wrapper Layer; JERM Template; Source Access and Harvester; Source Extractor; COSMIC (Alfresco); BaCell-SysMO (Alfresco); SysMOLab (Wiki); MOSES (Wiki); another datastore]
In Practice for Spreadsheets
[Figure labels: Native; JERM Template; JERMed]
Steps at each stage: Register, Extract, Matched to the JERM, Adding metadata; then browse and search
Now: whole record
Near future: whole record, filtered record, enriched record
Future: collections of records, meta-analysis
Just Enough Results Model Tools
JERM Source Extractor Generator
• New spreadsheets adopt JERM template
• Legacy spreadsheet JERM mapper.
• Databases have JERM mapper
Spreadsheet Ontology Annotator
• Restrict the values that a range of fields can have.
[Figure: the JERM access diagram again: Metadata; SABIO-RK; BRENDA; myDB; mySpreadSheet; JERM Web Service Access Interface; Access Control; JERM Extractor and Access Wrapper Layer; JERM Template; Source Access and Harvester; Source Extractor]
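The idea behind the Spreadsheet Ontology Annotator, restricting the values a range of fields can have, can be illustrated with a simple check of cell values against controlled vocabularies. This is only a sketch: the vocabularies and the `check_cells` helper are hypothetical, and a real tool would draw its terms from an ontology.

```python
# Hypothetical controlled vocabularies; real ones would come from an ontology.
ALLOWED = {
    "unit": {"mM", "uM", "g/L"},
    "data_type": {"microarray", "metabolomics", "proteomics", "single cell"},
}

def check_cells(rows, allowed):
    """Return (row_index, column, value) for every value outside its vocabulary."""
    problems = []
    for i, row in enumerate(rows):
        for column, vocabulary in allowed.items():
            value = row.get(column)
            if value is not None and value not in vocabulary:
                problems.append((i, column, value))
    return problems

rows = [{"unit": "mM", "data_type": "metabolomics"},
        {"unit": "millimolar", "data_type": "proteomics"}]
print(check_cells(rows, ALLOWED))  # [(1, 'unit', 'millimolar')]
```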
Models
• JWS Online: database of curated models and a model simulator.
• ToBiN: platform for storage and analysis of genome scale metabolic networks (PSYSMO)
• BioModels: database of curated models (EMBL-EBI)
• COPASI: Complex Pathway Simulator (Mendes et al)
• Pre-publication SEEK store
• Semantic SBML (TRANSLUCENT); SBRML (MCISB)
More After the Demo!
Processes
Experimental Processes
Protocol fields: Protocol Title, Authors, Keywords, Abstract, Materials, Reagents, Reagent Set Up, Equipment, Time Taken, Procedure, Troubleshooting, Critical Steps, Anticipated Results, References
Protocols and SOPs
• SOPs assets deposited or linked to
• SOP gathering
• Nature Protocols format recommendation
• High level classification for indexing and tagging
• Got a few, need more.
http://www.molmeth.org
http://openwetware.org
Workflow Management System
Bioinformatics Processes: Workflows
Data preparation, annotation and analysis pipelines
SBML model construction and population
Linking together Data sets, Web Services, R scripts, BioMART, Java libraries, Grid Services, (MATLAB in beta)
Free and Open Source
Data integration: workflows for model parameterisation and validation.
Building models using workflows
Manipulation of SBML models in workflows
LibSBML: data integration & constructing and annotating SBML models
[Li et al]
Ramp up when more data resources become workflow accessible
Libraries of SysMO workflows
Spreadsheet Smart.
Microarray Analysis
SBML Model manipulation
Pathway Analysis
Chemical structure analysis
Protein structure analysis
Kinetic data
Excel Spreadsheet handling
Controlled vocabulary look-ups
http://myexperiment.org
Now…
Demo!!!!!!
Everyone contributed
But obviously we only have time for a few examples
Models
JWS Online model interface
http://jjj.mib.ac.uk
http://jjj.bio.vu.nl
http://jjj.biochem.sun.ac.za
• SysMO models interface at JWS Online
• SBML upload and web services
• JWS update, new interface (to be released soon), SBGN schemas
JWS Online SysMO home
~/sysmo
MOSES models selection
MOSES models
JWS Online interface MOSES model
link to localhost /sysmo
SBML model upload
JWS Online access via web services
~/axis/services/QueryJWS?wsdl
{getRates, getAllModels, getAllBiomodels, getAllBiomodelsIds, getModelsByOrganism, getModelsByCategory, getModelInfo, getNmat, getKmat, getLmat, getSteadyStateTable, getTimecourse, getJacob, getEigenv, getCmat, getEmat, getRateEquations, getRateEquationFormulae, getExtVar, getExternalMetabValues, getInitMetabValues, getParamValues, hasFunction}
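The `?wsdl` address suggests these operations are offered as a SOAP web service. As a hedged illustration of what a request for one of the listed operations might look like, the sketch below builds a SOAP 1.1 envelope with the standard library. The service namespace and the `organism` parameter name are placeholders invented here; the real values are defined in the service's WSDL, and no network call is made.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:QueryJWS"  # placeholder; the real one is defined in the WSDL

def build_request(operation, params=None):
    """Build a SOAP 1.1 request envelope for one of the listed operations."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in (params or {}).items():
        # Parameter names are hypothetical; the WSDL defines the real ones.
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")

xml_text = build_request("getModelsByOrganism", {"organism": "yeast"})
print(xml_text)
```

In practice a SOAP client library that reads the WSDL would generate such requests; the sketch only shows the shape of the message.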
JWS Online new interface (α)
What we have done....
[Figure: the SysMO approach diagram again: Spreadsheet Repository; SBML Models Repository; SOP Repository; Workflow Repository; Consortium: Data, Models, Processes (SOPs and Workflows); SysMO-SEEK web portal interface; JWS Online; Assets Catalogue; Yellow Pages; Search; SysMO DB; JERM; Public data; Standards: SBML, Nature Protocols; Workflow Management System]
Training, Know-how and Dissemination
SysMO-DB Training
• Kick-start toolkits, workflows and SOP templates
SysMO consortium (esp. PALS)
• Social networking for shared content, know-how and best practice
• Contribution and best of breed solutions in place
Outside consortium
• 6 presentations
• 2 tutorials
• More in the pipeline
SABIO-RK User Meeting, June 15-16, 2009, Heidelberg, Germany
Costs supported by SysMO
http://projects.eml.org/sdbv/events/SABIORK_UserMeeting/index.html
Future: more, more, more!
Extend and stabilize software
More JERM, more data in SEEK
• More JERM extractors, data, search possibilities
More Models
• More data into JWS, integrate more tools to SysMO-SEEK
More SOPs
More Workflows
• Facilitate workflow-ready solutions, data collection/analysis workflow, workflow player in SEEK
More semantics
• Closed vocabularies, ontologies
More training
Timetable
SEEK Launch                                  June 2009
JERM Phase 1 demo                            July 2009
Workflow with JWS-Online and SABIO-RK        July 2009
JERM model stabilised                        Sept 2009
Spreadsheet tools                            Nov 2009
Model comparison                             Nov 2009
SEEK controlled vocabularies                 Feb 2010
JERM tooling                                 Feb 2010
MIRIAM comparison                            Mar 2010
Workflow authoring and harvesting            Mar 2010
Workflow Player in SEEK                      June 2010
Training and Outreach                        ongoing
How to get there
Update SEEK and share data
• Do not need to share full content: tell people about existence of data; help people avoid duplicate work; find contacts
• After publication, data ready for sharing with the scientific world
• SysMO-DB will sign an NDA where needed
Retaining data at sites comes with responsibility
• Reliability: sites available continuously and promptly
• Support: must be proof against virus attacks, etc.
• Archiving: beyond the lifetime of the project
Talk to your PAL
• Right requirements
• Right software
• Steer the project
• Lots of work under the hood
Make sure your PAL has a voice in your project.
Look at our wiki
Thanks!
Acknowledgements
• SysMO-DB Team
• SysMO-PALS
• myGrid, EML and JWS Online teams
• OMII-UK, Uni Southampton
• EMBL-EBI, MCISB
Thank you! Questions?