Capture, integration, and sharing of functional genomic data
Steve Oliver
Professor of Genomics
School of Biological Sciences
University of Manchester
http://www.cogeme.man.ac.uk
http://www.bioinf.man.ac.uk
What are biologists interested in?
Complete organisms are much too complicated.
Only very well understood systems have well-defined pathways.
Many biologists focus on one or a small number of genes.
GENOME
TRANSCRIPTOME
PROTEOME
METABOLOME
• Sample generation
  – Origin of sample
    • hypothesis, organism, environment, preparation, paper citations
• Sample processing
  – Gels (1D/2D) and columns
    • images, gel type and ranges, band/spot coordinates
    • stationary and mobile phases, flow rate, temperature, fraction details
• Mass spectrometry
    • machine type, ion source, voltages
• In silico analysis
    • peak lists, database name + version, partial sequence, search parameters, search hits, accession numbers
The nature of proteomics experiment data
A Systematic Approach to Modelling, Capturing and Disseminating Proteomics Experimental Data
http://pedro.man.ac.uk/
The PEDRo UML schema in reduced form
[Figure: UML class diagram. Classes shown: SampleOrigin, Experiment, Organism, TaggingProcess, Analyte, AnalyteProcessingStep, OtherAnalyteProcessingStep, OtherAnalyte, TreatedAnalyte, ChemicalTreatment, Gel, Gel1D, Gel2D, GelItem, Band, Spot, DiGEGel, DiGEGelItem, BoundaryPoint, RelatedGelItem, Column, Fraction, GradientStep, MobilePhaseComponent, PercentX, AssayDataPoint, MassSpecExperiment, MassSpecMachine, IonSource, Electrospray, MALDI, OtherIonisation, mzAnalysis, Quadrupole, Hexapole, IonTrap, CollisionCell, ToF, OthermzAnalysis, Detection, PeakList, Peak, ChromatogramPoint, MSMSFraction, ListProcessing, Peak-SpecificChromatogramIntegration, TandemSequenceData, DBSearch, DBSearchParameters, PeptideHit, ProteinHit, Protein, OntologyEntry.]
The Framework Around PEDRo
1. Lab-generated data is encoded using the PEDRo data entry tool, producing an XML (PEML) file for local storage or for submission to the repository
2. Locally stored PEML files may be viewed in a web browser (with XSLT), allowing web pages to be quickly generated from datasets
3. Upon receipt of a PEML file at the repository site, a validation tool checks the file before entering it into the database
4. The repository (a relational database) holds submitted data, allowing various analyses to be performed, or data to be extracted as a PEML file or another format
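Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration only, not the PEDRo validation tool or repository: the XML layout, the element and attribute names, the table definitions, the accession values and the functions `validate`, `load` and `submit` are all assumptions made for the example. The real PEML schema is much richer.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a PEML file.
# Element and attribute names are assumptions, not the real PEML schema.
PEML_TEXT = """
<Experiment hypothesis="Heat shock alters protein abundance">
  <SampleOrigin organism="Saccharomyces cerevisiae"/>
  <Protein accession="ACC0001"/>
  <Protein accession="ACC0002"/>
</Experiment>
"""

REQUIRED_CHILDREN = ("SampleOrigin",)  # assumed minimal requirement


def validate(root):
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    if root.tag != "Experiment":
        problems.append(f"unexpected root element: {root.tag}")
    if not root.get("hypothesis"):
        problems.append("missing hypothesis attribute")
    for name in REQUIRED_CHILDREN:
        if root.find(name) is None:
            problems.append(f"missing element: {name}")
    return problems


def load(root, conn):
    """Store a validated experiment in a toy relational repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS experiment "
        "(id INTEGER PRIMARY KEY, hypothesis TEXT, organism TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS protein (experiment_id INTEGER, accession TEXT)"
    )
    cur = conn.execute(
        "INSERT INTO experiment (hypothesis, organism) VALUES (?, ?)",
        (root.get("hypothesis"), root.find("SampleOrigin").get("organism")),
    )
    exp_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [(exp_id, p.get("accession")) for p in root.findall("Protein")],
    )
    conn.commit()
    return exp_id


def submit(peml_text, conn):
    """Validate first; only files that pass are entered into the database."""
    root = ET.fromstring(peml_text.strip())
    problems = validate(root)
    if problems:
        raise ValueError("; ".join(problems))
    return load(root, conn)


conn = sqlite3.connect(":memory:")
experiment_id = submit(PEML_TEXT, conn)
```

Rejecting a file before any insert keeps the repository free of partial records, which is the point of placing the validation step ahead of the database.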
INTEGRATION
Why integrate data?
“These 200 genes are up-regulated in my experiment. Are any of their protein products known to interact?”
• Data is stored at a variety of sites and in a variety of formats.
• Databases are designed mainly for browsing (MIPS, SGD, BIND, SCPD, KEGG).
• Need databases that allow complex queries.
• Need to be easily usable by biologists.
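The example question above combines two kinds of data: an expression result and a set of known protein interactions. A minimal sketch of that kind of query, assuming both are already available in one place, might look like this. The gene names, the interaction list and the function `interacting_pairs` are hypothetical and are not taken from any of the databases named above.

```python
# Hypothetical data: in practice the first set would come from an expression
# experiment and the second from a protein interaction data set.
up_regulated = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
interactions = [
    ("GENE_A", "GENE_C"),
    ("GENE_B", "GENE_X"),  # GENE_X is not up-regulated, so this pair is dropped
    ("GENE_D", "GENE_A"),
]


def interacting_pairs(genes, interactions):
    """Return interaction pairs whose two partners are both in `genes`.

    Pairs are treated as unordered, and self-interactions are ignored.
    """
    known = {frozenset(pair) for pair in interactions}
    return sorted(
        tuple(sorted(pair))
        for pair in known
        if len(pair) == 2 and pair <= genes
    )


pairs = interacting_pairs(up_regulated, interactions)
print(pairs)
```

The query itself is short; the difficulty the slide points to is getting both data sets into one place in a form that can be queried at all.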
Genome Information Management System (GIMS)
Paton NW, Khan SA, Hayes A, Moussouni F, Brass A, Eilbeck K, Goble CA, Hubbard SJ, Oliver SG (2000)
Conceptual modelling of genomic information. Bioinformatics 16, 548-557.
GIMS
• Integrates genomic and functional data.
• Consists of two parts:
–GIMS Database
–GIMS User Interface
GIMS data warehouse
[Diagram components: data sources SGD, MIPS and maxD; GIMS Database; Analysis Library; Canned Queries; Browser]
Database implementation
• Uses the object database FastObjects.
• All database classes and analysis programs are written in Java.
• Allows close integration of the programming language with the database.
• Allows fast access to database data from application programs.
• Allows data to be stored in a way that reflects the underlying mechanisms in the organism.
• Very flexible and extensible.
GIMS Contents
Data type | Data source
DNA sequences, chromosome locations of coding regions, e.g. ORFs, tRNAs, centromeres, telomeres etc. | MIPS
Predicted protein sequences, pI, mol weight, number of transmembrane regions | MIPS
Protein attributes (e.g. cellular location, function, protein class, Prosite motifs, phenotype) | MIPS
Protein interaction data (affinity purification, yeast two-hybrid, genetic interactions) | Ho et al. (2002), Gavin et al. (2002), MIPS, Uetz et al. (2000), Ito et al. (2001)
GIMS Contents
Data type | Data source
Metabolic data (reactions, compounds and enzymes) | L-compound, L-enzyme
Transcription factor | SCPD
Transcriptome data | Stanford Microarray Database, University of Manchester (BBSRC COGEME Project)
Ontology data | GO
Sequence similarity | SGD
GIMS User Interface
• Java application.
• Can download from http://img.cs.man.ac.uk/gims
• Communicates with database via RMI.
• On start-up, application is sent information about database classes and canned queries.
• Very flexible.
• Allows user to browse database, ask canned queries, and store and combine data sets.
• Can save results as txt, html or xml.
Selecting Canned Queries
Query categories.
Queries in selected category
Initially empty store.
Parameterising a Query
Previously selected query
Parameters for specific run – selects down-regulated genes in the nucleus
Viewing the Results
Result collection
Operations on
collections
Selecting a Second Query
Setting Its Parameters
Parameters for specific run – selects down-regulated genes in the same experiment that are transcription factors
Obtaining Its Results
Inter-relating Results
Collections selected for operating on
Remove one result from the other
Result of Difference
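The walkthrough above (two parameterised queries, then a difference of their result collections) amounts to set operations on stored results. A minimal sketch follows; it is not the GIMS API. The class `ResultStore`, its methods and the gene names are hypothetical, and the two collections stand in for whatever the canned queries would return.

```python
class ResultStore:
    """Toy store of named result collections (illustrative, not GIMS)."""

    def __init__(self):
        self._collections = {}  # the store starts empty

    def save(self, name, genes):
        self._collections[name] = frozenset(genes)

    def combine(self, operation, first, second, save_as):
        """Apply a set operation to two stored collections and keep the result."""
        a, b = self._collections[first], self._collections[second]
        results = {
            "union": a | b,
            "intersection": a & b,
            "difference": a - b,  # remove the second result from the first
        }
        self._collections[save_as] = results[operation]
        return sorted(results[operation])


store = ResultStore()
# Hypothetical results of the two canned queries in the walkthrough.
store.save("down_regulated_nuclear", {"GENE_1", "GENE_2", "GENE_3", "GENE_4"})
store.save("down_regulated_tfs", {"GENE_2", "GENE_4", "GENE_9"})

nuclear_not_tf = store.combine(
    "difference", "down_regulated_nuclear", "down_regulated_tfs", "nuclear_not_tf"
)
print(nuclear_not_tf)
```

Saving each combined result under a new name lets it be used as input to a further operation, in the same way as the result collections in the interface.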
GIMS empowers the biologist
Resources at the centre
Provenance record on how the data was produced
Workflows that could be used to generate this data
People who have registered an interest in this data
Ontologies describing data
Services that can use or produce this data
Annotations
Data holdings
Literature relevant
Related Data
Biologists at the centre
Provenance record of workflow runs they have made
People
Ontologies
Preferences for Services
Notes
Data holdings
Literature
Workflows they wrote or used
People they collaborate with
myGrid
• EPSRC UK e-Science pilot project.
• Open Source Upper Middleware for Bioinformatics.
• (Web) Service-based architecture -> Grid services.
• 42-month project, currently 24 months in.
• Prototype v1 Release Sept 2004; some services available now.
www.mygrid.org.uk
Workflows are in silico experiments
Annotation Pipeline
What is known about my candidate gene?
Medline
OMIM
GO
BLAST
EMBL
DQP
Query
Application: Workbench demonstrator
The myGrid service components are used in a demonstration application called the “myGrid WorkBench”, which provides a common point of use for the services.
We can select data from the myGrid Information repository (mIR), select a workflow based on its semantic description, and examine the results.
e-Science: Provenance
Like a bench experiment, myGrid records the materials and methods it has used for an in silico experiment in a provenance log.
The log captures where, what, when and how the experiment was run.
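The idea of a provenance log for an in silico experiment can be sketched as below. This is an illustration under stated assumptions, not myGrid code: `run_workflow`, the log fields and the two stand-in services `fetch_sequence` and `find_similar` are hypothetical.

```python
from datetime import datetime, timezone


def run_workflow(name, steps, data):
    """Run `steps` in order on `data`, logging one provenance entry per step."""
    log = []
    for step in steps:
        started = datetime.now(timezone.utc).isoformat()
        result = step(data)
        log.append(
            {
                "workflow": name,       # where: the workflow the step ran in
                "step": step.__name__,  # how: the service that was applied
                "input": data,          # what: the material that was used
                "output": result,       # what: the data that was produced
                "started": started,     # when: UTC timestamp for the step
            }
        )
        data = result  # the output of one step is the input of the next
    return data, log


# Hypothetical stand-ins for real bioinformatics services.
def fetch_sequence(gene_id):
    return f"SEQUENCE_FOR_{gene_id}"


def find_similar(sequence):
    return [f"HIT_1_FOR_{sequence}", f"HIT_2_FOR_{sequence}"]


hits, provenance = run_workflow(
    "annotation_pipeline", [fetch_sequence, find_similar], "GENE_A"
)
for entry in provenance:
    print(entry["step"], "->", entry["output"])
```

Because each entry stores its input and output, the derivation path of any result can be traced back step by step to the starting data.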
Derivation paths ~ workflows, queries
Annotations ~ notes
Evolution paths ~ workflow -> workflow
e-Science: Notification
A notification service can inform the mIR and the user (proxy) that data, workflows, services, etc. have changed and thus prompt actions over data in the mIR.
Notifications are presented to the user with a client in the workbench environment.
User registers interest in notification topics
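Registering interest in topics and being notified of changes is a publish/subscribe pattern. A minimal sketch follows; it is not the myGrid notification service. The class `NotificationService`, its methods, the topic names and the message text are all hypothetical.

```python
from collections import defaultdict


class NotificationService:
    """Toy topic-based notification service (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def register(self, topic, callback):
        """A user (or a proxy acting for them) registers interest in a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Notify every subscriber of `topic`; return how many were notified."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, message)
        return len(callbacks)


service = NotificationService()
inbox = []  # stands in for the notification client in the workbench

service.register("data-changed", lambda topic, msg: inbox.append((topic, msg)))

notified = service.publish("data-changed", "sequence entry updated")
ignored = service.publish("workflow-changed", "new workflow version")

print(inbox)
```

A subscriber only hears about topics it registered for, so the second publish above reaches nobody.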
The myGrid Team
Matthew Addis, Nedim Alpdemir, Rich Cawley, Vijay Dialani, Alvaro Fernandes, Justin Ferris, Rob Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Claire Jennings, Ananth Krishna, Xiaojian Liu, Darren Marvin, Karon Mee, Simon Miles, Luc Moreau, Juri Papay, Norman Paton, Simon Pearce, Steve Pettifer, Milena Radenkovic, Peter Rice, Angus Roberts, Alan Robinson, Martin Senger, Nick Sharman, Paul Watson, Anil Wipat and Chris Wroe.
Need GRID to empower the biologist