Science, Workflows and Collections
Professor Carole Goble
The University of Manchester, UK
©2
Roadmap
How bioinformaticians will work (and are now)
The myGrid project - workflows
Using publications in workflows
Workflow implications for serials
©3
Williams-Beuren Syndrome
Contiguous sporadic gene deletion disorder
1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
Haploinsufficiency of the region results in the phenotype
[Figure: physical map of Chr 7 (~155 Mb) highlighting the ~1.5 Mb region at 7q11.23, with WBS and SVAS patient deletions, clones CTA-315H11 and CTB-51J22, and a ‘Gap’.]
Hannah Tipney
©4
1. Identify new, overlapping sequence of interest
2. Characterise the new sequence at nucleotide and amino acid level
Cutting and pasting between numerous web-based services, e.g. BLAST, InterProScan etc.
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
©5
In Life Sciences: Data, Publication, it’s all the same
It’s just part of the experiment
No separation between data and publications
Publications are the context for data
Break the silo between published papers and published data
©6
Aside: A heretic speaks
Life Scientists read journals
I’m a Computer Scientist. I don’t.
It’s on the Web
It’s in podcast talks or PowerPoint
Google is the Lord’s work
What PhD students are for
Journal publications too outdated
©7
Bioinformatics pipelines on the web
Copy and paste from one web based application to another
Annotate by hand
Disadvantages: time consuming, error prone, tacit procedure so difficult to share both protocol and results
RepeatMasker BLASTn Twinscan
©8
Workflows for Science
©9
“Workflow at its simplest is the movement of documents and/or tasks through a work process.
More specifically, workflow is the operational aspect of a work procedure: how tasks are structured, who performs them, what their relative order is, how they are synchronized, how information flows to support the tasks and how tasks are being tracked”.
Workflows for Science
©10
[Figure: Sequence in → RepeatMasker Web service → BLASTn Web service → Twinscan Web service → Predicted genes out]
Simple scripting language specifies how steps of a pipeline link together
Hides all the fiddling about.
Advantages: automation, quick to write, easier to explain, share, relocate, and record provenance of results in a standard way
Workflows for Science
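A minimal sketch of the idea, in Python rather than the actual myGrid scripting language (which the slides do not show): three stub steps are linked into a pipeline, and each step's input and output are recorded as it runs. The functions `repeat_masker`, `blastn` and `twinscan` are hypothetical stand-ins that only tag a string; they do not call the real tools or Web services.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[str], str]]


# Hypothetical stand-ins: each only tags its input so the data flow is visible.
def repeat_masker(sequence: str) -> str:
    return f"masked({sequence})"


def blastn(masked: str) -> str:
    return f"hits({masked})"


def twinscan(hits: str) -> str:
    return f"genes({hits})"


def run_workflow(steps: List[Step], data: str) -> Tuple[str, List[Dict[str, str]]]:
    """Feed the output of each step into the next, recording what happened."""
    provenance: List[Dict[str, str]] = []
    for name, step in steps:
        result = step(data)
        provenance.append({"step": name, "input": data, "output": result})
        data = result
    return data, provenance


# The pipeline is just an ordered list, so it is easy to share, edit and rerun.
PIPELINE: List[Step] = [
    ("RepeatMasker", repeat_masker),
    ("BLASTn", blastn),
    ("Twinscan", twinscan),
]

if __name__ == "__main__":
    genes, log = run_workflow(PIPELINE, "acatttctac")
    print(genes)
    for entry in log:
        print(entry["step"], ":", entry["input"], "->", entry["output"])
```

Because the pipeline is plain data, swapping or adding a step means editing one list entry, which is the property that replaces the manual copy and paste.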
©11
Workflows describe the scientist’s in silico experiment
Link together and cross reference data in different repositories
And that includes serials!
Remote, third party, external applications and services
Accessible to the workflow machinery
And that includes serials!
Results management
Semantic metadata annotation of data
Provenance tracking of results
Sharing and replicating know-how
Reuse of workflows
Workflows for Science
©13
WBS
The first complete and accurate map of the region of chromosome 7 involved in Williams-Beuren Syndrome
Time to perform one WBS pipeline cut from 2 weeks to 2 hours
Faster, automated, systematic and shareable
©15
Trypanosomiasis in cattle
Chicken genome
Reuse: adapting and sharing best practice and know-how across a community by publishing workflows
Mouse genome
Graves’ Disease
Williams-Beuren Syndrome
©16
Trypanosomiasis in cattle
Identify the genetic difference responsible for resistance to trypanosomiasis and breed into productive cattle.
Mice as a model.
Gene expression and microarray analysis
The literature
Associations between upregulated genes
Links between changed genes and genes in the Tir1 region
©24
•Trypanosomes need cholesterol – and have a scavenger receptor – specific for HDL
•Resistant mice reduce available HDL – slowing trypanosome growth
New hypothesis: Resistance and susceptibility in mice are a function of the cholesterol recycling pathway. Mice love lard.
lipoprotein and cholesterol
©25
Biological pathway, highlighted with RNA molecules (orange) and DNA QTL molecules (pink), discovered with the aid of Chilibot text mining over PubMed.
©27
myGrid/Discovery Net
Specialist Term recognition software
Assigning Gene Ontology terms to papers in MedLine
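A deliberately naive sketch of what assigning controlled terms to text means: whole-phrase matching against a small lexicon. Specialist term recognition software is far more sophisticated than this. The lexicon and its `TERM:` identifiers are placeholders invented for illustration; they are not real Gene Ontology IDs.

```python
import re
from typing import Dict, List

# Illustrative lexicon only: the identifiers are placeholders, not real GO IDs.
LEXICON: Dict[str, str] = {
    "cholesterol metabolism": "TERM:0001",
    "lipoprotein": "TERM:0002",
    "scavenger receptor": "TERM:0003",
}


def assign_terms(text: str, lexicon: Dict[str, str] = LEXICON) -> List[str]:
    """Return the sorted identifiers of lexicon phrases found as whole words."""
    found = set()
    for phrase, term_id in lexicon.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(term_id)
    return sorted(found)


if __name__ == "__main__":
    abstract = "Trypanosomes have a scavenger receptor specific for HDL lipoprotein."
    print(assign_terms(abstract))
```

Exact matching misses plurals and synonyms, which is one reason real systems rely on curated vocabularies and linguistic processing.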
©29
“the development of online submission systems for scientific manuscripts provides a mechanism for including a mapping of the information in the manuscript to controlled terminologies as an integral part of the publishing process. It is not hard to envision that the indexing of a paper to controlled terms for anatomical, gene nomenclature, or functional terminologies would be a necessary requirement for acceptance of a paper for publication. This, then, would enable the rapid incorporation of the paper and its contents into bioinformatics systems.”
Judith Blake, Bio-ontologies—fast and furious, Nature Biotechnology 22, 773-774 (2004)
©30
[Figure: the scholarly knowledge cycle.
Components: Research & e-Science workflows; Learning & Teaching workflows; Data creation / capture / gathering (laboratory experiments, Grids, fieldwork, surveys, media); Data analysis, transformation, mining, modelling; Repositories (institutional, e-prints, subject, data, learning objects); Aggregator services (national, commercial); Presentation services (subject, media-specific, data, commercial portals); Institutional presentation services (portals, Learning Management Systems, u/g and p/g courses, modules); Peer-reviewed publications (journals, conference proceedings); Quality assurance bodies; Learning object creation, re-use.
Flows: Deposit / self-archiving; Harvesting metadata; Searching, harvesting, embedding; Resource discovery, linking, embedding; Publication; Validation.]
The scholarly knowledge cycle.
Liz Lyon, Ariadne, July 2003.
This work is licensed under a Creative Commons License, Attribution-ShareAlike 2.0
© Liz Lyon (UKOLN, University of Bath), 2005
©31
eBank UK Project
Aggregator service harvests metadata from institutional repository (e-crystals archive)
eBank service embedded in PSIgate portal for 3rd party search
Service linking from data to derived research publication
Embedding eBank service in learning workflows
UKOLN (lead), University of Southampton, University of Manchester
http://www.ukoln.ac.uk/projects/ebank-uk/
©33
[Figure: CombeChem experiment plan and process record, 30 January 2004 (gvh, hrm, gms). A numbered process graph (Add, Reflux, Cool, Liquid-liquid extraction, Dry, Filter, Remove Solvent by Rotary Evaporation, Fuse, Column Chromatography) links the To Do List (Plan) to the Process Record, with inputs, literals and observations attached to each process.]
Plan
Dissolve 4-fluorinated biphenyl in butanone
Add K2CO3 powder
Heat at reflux for 1.5 hours
Cool and add Br11OCB
Heat at reflux until completion
Cool and add water (30 ml)
Extract with DCM (3 x 40 ml)
Combine organics, dry over MgSO4 & filter
Remove solvent in vacuo
Fuse compound to silica & column in ether/petrol
Ingredient List
Fluorinated biphenyl 0.9 g
Br11OCB 1.59 g
Potassium Carbonate 2.07 g
Butanone 40 ml
Inputs
Sample of 4-fluorinated biphenyl, Butanone, Sample of K2CO3 powder, Sample of Br11OCB, Water, DCM, MgSO4, Silica, Ether/Petrol ratio
Process Record
Weigh: 0.9031 g (fluorinated biphenyl), 2.0719 g (K2CO3), 1.5918 g (Br11OCB)
Measure: 40 ml (butanone), 30 ml (water), 3 of 40 ml (DCM), excess
Butanone dried via silica column and measured into 100 ml RB flask.
Used 1 ml extra solvent to wash out container.
Started reflux at 13.30. (Had to change heater stirrer.) Only reflux for 45 min, next step 14:15.
Inorganics dissolve, 2 layers. Added brine ~20 ml.
Organics are yellow solution.
Washed MgSO4 with DCM ~50 ml.
Observation Types
weight - grammes
measure - ml, drops
annotate - text
temperature - K, °C
Key
Process, Input, Literal, Observation
Future Questions
Whether to have many subclasses of processes or fewer with annotations
How to depict destructive processes
How to depict taking lots of samples
What is the observation/process boundary? e.g. MRI scan
Provenance
Log what, where, when, who
For data and for publications
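One way to picture a log of what, where, when and who is a small append-only record keyed by the thing it describes, so the same structure serves a data item or a publication. This is only a sketch: the field names, the `subject` key and every value in the usage example are assumptions made for illustration, not the CombeChem schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class ProvenanceEntry:
    what: str       # the action or observation, e.g. "Weigh"
    where: str      # hypothetical location label
    when: datetime  # time the entry was made
    who: str        # hypothetical user name
    subject: str    # the data item or publication the entry is about


class ProvenanceLog:
    """Append-only log; entries are never edited once recorded."""

    def __init__(self) -> None:
        self._entries: List[ProvenanceEntry] = []

    def record(self, what: str, where: str, when: datetime,
               who: str, subject: str) -> ProvenanceEntry:
        entry = ProvenanceEntry(what, where, when, who, subject)
        self._entries.append(entry)
        return entry

    def about(self, subject: str) -> List[ProvenanceEntry]:
        """All entries for one subject, oldest first."""
        matching = [e for e in self._entries if e.subject == subject]
        return sorted(matching, key=lambda e: e.when)


if __name__ == "__main__":
    log = ProvenanceLog()
    # All values below are made up for the example.
    log.record("Reflux", "lab-1", datetime(2004, 1, 30, 13, 30), "analyst-1", "sample-A")
    log.record("Weigh", "lab-1", datetime(2004, 1, 30, 11, 0), "analyst-1", "sample-A")
    log.record("Cite", "library", datetime(2004, 2, 2, 9, 0), "analyst-2", "paper-X")
    for entry in log.about("sample-A"):
        print(entry.when.isoformat(), entry.what, entry.who)
```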
©35
Workflows
Web services
Text mining
Bioinformatics
Semantic mark-up
Publications have to be computational services – web services
They will be read and processed by machines
Licensing that works!
Authorisation, Authentication and digital rights management (e.g. Shibboleth)
Integration of data and publications
Workflows are linking results, whatever the source
Common ids and persistent ids for citation (DOI, LSID, InChI)
No silos
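For a workflow to follow a citation it first has to recognise which kind of identifier it holds. The sketch below classifies a string as a DOI, LSID or InChI by its general shape. The patterns are simplified approximations rather than full validators, and the function name `classify_identifier` and the LSID in the usage example are made up for illustration.

```python
import re

# Simplified shape checks, not complete validators for each scheme.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
LSID_RE = re.compile(
    r"^urn:lsid:[^:\s]+:[^:\s]+:[^:\s]+(:[^:\s]+)?$", re.IGNORECASE
)
INCHI_RE = re.compile(r"^InChI=1S?/\S+$")


def classify_identifier(identifier: str) -> str:
    """Return 'DOI', 'LSID', 'InChI' or 'unknown' for the given string."""
    value = identifier.strip()
    if DOI_RE.match(value):
        return "DOI"
    if LSID_RE.match(value):
        return "LSID"
    if INCHI_RE.match(value):
        return "InChI"
    return "unknown"


if __name__ == "__main__":
    for candidate in [
        "10.1000/182",
        "urn:lsid:example.org:sequences:12345",  # hypothetical LSID
        "InChI=1S/H2O/h1H2",
        "http://www.mygrid.org.uk",
    ]:
        print(candidate, "->", classify_identifier(candidate))
```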
©36
Workflows
Web services
Text mining
Bioinformatics
Semantic mark-up
Semantic publishing at source
In order to automate we need better ways of interpreting the publication content
They will be read and processed by machines
Integration of data and publications
Common vocabularies
Accessible full texts for text mining, not just abstracts.
©38
Workflows
Bioinformatics
Data
Publications
Semantic markup
Provenance
Publish workflows with data with publications
Privacy? Intellectual property?
Licensing models for services so that results and workflows can be reused and shared.
©39
Take home
Machines are reading your journals, not just people
And if the journals are not online then they are unread
Workflows are another form of outcome to publish alongside data, metadata and publications
Google rocks – I don’t use anything else!
http://www.mygrid.org.uk
http://www.ukoln.ac.uk/projects/ebank-uk/
http://www.combechem.org
©40
Acknowledgements
The myGrid Team, esp. Tom Oinn, Chris Wroe, Antoon Goderis, Andy Brass, Paul Fisher, Hannah Tipney, May Tassabehji, Rob Gaizauskas, Ian Roberts
Discovery Net / Inforsense: Vasa Curcin, Moustafa M Ghanem
BioBank / CombeChem: David De Roure, Liz Lyon
Scientists: Peter Murray-Rust, Judith Blake, Mike Ashburner
©41
Digital Library workflows
Workflows for data capture, deposit, preservation, citation, discovery, mining &&….
Multiple workflows interacting together
Workflows may call on each other, in a defined order
Multiple workflows may use “common” services e.g. Assign (identifier)
Require sequential or parallel execution, have dependencies, be time-limited, repetitive
Have an owner (control)
Include essential human interventions
? ? ?
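The points about defined order, dependencies and sequential or parallel execution can be sketched with a dependency graph. The example uses Python's standard `graphlib.TopologicalSorter` to group workflows into batches: batches run in sequence, and the members of one batch have no dependencies on each other, so they could run in parallel. The workflow names and the dependencies between them are assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependencies: each workflow maps to those that must finish first.
DEPENDENCIES: Dict[str, Set[str]] = {
    "capture": set(),
    "assign_identifier": {"capture"},
    "deposit": {"assign_identifier"},
    "preservation": {"deposit"},
    "citation": {"deposit"},
    "discovery": {"deposit"},
}


def execution_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group workflows into ordered batches of mutually independent workflows."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(execution_batches(DEPENDENCIES), start=1):
        print(number, batch)
```

Ownership, time limits and human interventions are not modelled here; they would need extra state on each workflow.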