
Science, Workflows and Collections

Professor Carole Goble

The University of Manchester, UK

©2

Roadmap

How bioinformaticians will work (and are now)

The myGrid project - workflows

Using publications in workflows

Workflow implications for serials

©3

Williams-Beuren Syndrome

Contiguous sporadic gene deletion disorder

1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis

Haploinsufficiency of the region results in the phenotype

[Figure: Physical Map. Chr 7 (~155 Mb); ~1.5 Mb region at 7q11.23; labels: WBS, SVAS, Patient deletions, CTA-315H11, CTB-51J22, 'Gap'.]

Hannah Tipney

©4

1. Identify new, overlapping sequence of interest
2. Characterise the new sequence at nucleotide and amino acid level

Cutting and pasting between numerous web-based services, e.g. BLAST, InterProScan etc.

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
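As a rough illustration of what "characterising at nucleotide level" can start from, the sketch below strips the position numbers from a sequence excerpt laid out like the one above and computes its length and GC fraction. The function names are illustrative assumptions, not part of any tool named in the talk, and GC content is just one simple example of a nucleotide-level property.

```python
import re


def parse_numbered_sequence(text: str) -> str:
    """Strip position numbers and whitespace from a numbered sequence block."""
    # Keep only runs of nucleotide letters; digits and spaces are layout.
    return "".join(re.findall(r"[acgtn]+", text.lower()))


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "gc" for base in seq) / len(seq)


# First line of the excerpt shown on the slide.
excerpt = "12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt"
seq = parse_numbered_sequence(excerpt)
print(len(seq), round(gc_fraction(seq), 3))
```

Doing this by hand for every new sequence is exactly the cut-and-paste work the slide complains about.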

©5

In Life Sciences: Data, Publication, it's all the same

It's just part of the experiment

No separation between data and publications

Publications are the context for data

Break the silo between published papers and published data


©6

Aside: A heretic speaks

Life Scientists read journals

I'm a Computer Scientist. I don't.

It's on the Web

It's in podcast talks or PowerPoint

Google is the Lord's work

What PhD students are for

Journal publications too outdated

©7

Bioinformatics pipelines on the web

Copy and paste from one web-based application to another

Annotate by hand

Disadvantages: time-consuming, error-prone, tacit procedure so difficult to share both protocol and results

RepeatMasker BLASTn Twinscan

©8

Workflows for Science


©9

“Workflow at its simplest is the movement of documents and/or tasks through a work process.

More specifically, workflow is the operational aspect of a work procedure: how tasks are structured, who performs them, what their relative order is, how they are synchronized, how information flows to support the tasks and how tasks are being tracked”.

Workflows for Science

©10

[Diagram: Sequence in; RepeatMasker Web Service; BLASTn Web Service; Twinscan Web Service; Predicted genes out]

Simple scripting language specifies how steps of a pipeline link together

Hides all the fiddling about.

Advantages: automation, quick to write, easier to explain, share, relocate, and record provenance of results in a standard way
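The idea of a script that only says how pipeline steps link together can be sketched in a few lines. This is a minimal sketch, not the myGrid workflow language: the three step functions are hypothetical local stand-ins for the RepeatMasker, BLASTn and Twinscan web services, and they simply pass strings along rather than calling any real service.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], str]]


def run_pipeline(steps: List[Step], data: str) -> Tuple[str, List[str]]:
    """Feed the output of each step into the next and record the order run."""
    trace: List[str] = []
    for name, func in steps:
        data = func(data)
        trace.append(name)
    return data, trace


# Hypothetical stand-ins for remote services (assumptions for illustration).
def mask_repeats(seq: str) -> str:
    return seq.replace("tttt", "nnnn")  # toy masking rule, not RepeatMasker's


def find_similar(seq: str) -> str:
    return seq  # stub: a real step would run a similarity search


def predict_genes(seq: str) -> str:
    return f"predicted genes for {len(seq)} bases"  # stub result


workflow: List[Step] = [
    ("RepeatMasker", mask_repeats),
    ("BLASTn", find_similar),
    ("Twinscan", predict_genes),
]

result, trace = run_pipeline(workflow, "acgtttttacgt")
print(trace)
print(result)
```

Because the workflow is just data (a list of named steps), it can be shared, rerun on another sequence, and its trace kept as a simple record of what was done.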

Workflows for Science

©11

Workflows describe the scientist's in silico experiment

Link together and cross-reference data in different repositories

And that includes serials!

Remote, third-party, external applications and services

Accessible to the workflow machinery

And that includes serials!

Results management

Semantic metadata annotation of data

Provenance tracking of results

Sharing and replicating know-how

Reuse of workflows

Workflows for Science

©12

©13

WBS

The first complete and accurate map of the region of chromosome 7 involved in Williams-Beuren Syndrome

Time to perform one WBS pipeline cut from 2 weeks to 2 hours

Faster, automated, systematic and shareable

©15

Trypanosomiasis in cattle

Chicken genome

Reuse: adapting and sharing best practice and know-how across a community by publishing workflows

Mouse genome

Graves' Disease

Williams-Beuren Syndrome

©16

Trypanosomiasis in cattle

Identify the genetic difference responsible for resistance to trypanosomiasis and breed into productive cattle.

Mice as a model.

Gene expression and microarray analysis

The literature

Associations between upregulated genes

Links between changed genes and genes in the Tir1 region
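One simple way to look for associations between genes in the literature is sentence-level co-occurrence counting. The sketch below is an assumption about the general technique, not a description of the text mining actually used in the project; the abstracts and gene names are made-up toy data.

```python
import re
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def cooccurrence(abstracts: Iterable[str], terms: List[str]) -> Counter:
    """Count, per sentence, how often each pair of terms appears together."""
    counts: Counter = Counter()
    for text in abstracts:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            lowered = sentence.lower()
            # Plain substring matching: crude, and will over-match
            # terms that are prefixes of longer names.
            present = sorted(t for t in terms if t.lower() in lowered)
            counts.update(combinations(present, 2))
    return counts


# Toy data, invented for illustration only.
toy_abstracts = [
    "Gene A is upregulated with gene B. Gene C was unchanged.",
    "Gene B and gene A respond to infection.",
]
toy_terms = ["gene A", "gene B", "gene C"]

pairs = cooccurrence(toy_abstracts, toy_terms)
print(pairs.most_common())
```

Pairs with high counts are only candidates for a link; a scientist still has to read the sentences behind them.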

©17

©18

©20

PubMed Text Mining results

©21

©22

Chilibot text mining in Taverna

©23

Taverna output

Chilibot web page

©24

• Trypanosomes need cholesterol – and have a scavenger receptor – specific for HDL

• Resistant mice reduce available HDL – slowing trypanosome growth

New hypothesis: Resistance and susceptibility in mice are a function of the cholesterol recycling pathway. Mice love lard.

lipoprotein and cholesterol

©25

Biological pathway, highlighted with RNA molecules (orange) and DNA QTL molecules (pink), discovered with the aid of Chilibot text mining over PubMed.

©26

©27

myGrid/Discovery Net

Specialist Term recognition software

Assigning Gene Ontology terms to papers in MedLine

©28

Science: Knowledge-driven

MEDLINE abstract; marked up by SciBorg

HTML-CML version

©29

“the development of online submission systems for scientific manuscripts provides a mechanism for including a mapping of the information in the manuscript to controlled terminologies as an integral part of the publishing process. It is not hard to envision that the indexing of a paper to controlled terms for anatomical, gene nomenclature, or functional terminologies would be a necessary requirement for acceptance of a paper for publication. This, then, would enable the rapid incorporation of the paper and its contents into bioinformatics systems.”

Judith Blake, Bio-ontologies—fast and furious, Nature Biotechnology 22, 773-774 (2004)
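Indexing a paper against controlled terms can be approximated, very roughly, by matching term labels in the text. The sketch below assumes a tiny hypothetical vocabulary: the `EX:` identifiers and labels are invented for illustration and are not real Gene Ontology entries, and real term recognition software does far more than exact label matching.

```python
import re
from typing import Dict, List

# Hypothetical controlled vocabulary (ids and labels are made up).
VOCABULARY: Dict[str, str] = {
    "EX:0001": "cholesterol transport",
    "EX:0002": "gene expression",
    "EX:0003": "meiosis",
}


def index_text(text: str, vocabulary: Dict[str, str]) -> List[str]:
    """Return ids of terms whose label occurs as whole words in the text."""
    found = []
    for term_id, label in vocabulary.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term_id)
    return sorted(found)


abstract = "Unequal crossover during meiosis alters gene expression in the region."
print(index_text(abstract, VOCABULARY))
```

If authors supplied such term ids at submission, as the quotation suggests, this guessing step would not be needed at all.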

©30

[Figure: The scholarly knowledge cycle. Liz Lyon, Ariadne, July 2003. Diagram labels: Learning & Teaching workflows; Research & e-Science workflows; Data creation / capture / gathering (laboratory experiments, Grids, fieldwork, surveys, media); Data analysis, transformation, mining, modelling; Repositories (institutional, e-prints, subject, data, learning objects); Aggregator services (national, commercial); Presentation services (subject, media-specific, data, commercial portals); Institutional presentation services (portals, Learning Management Systems, u/g, p/g courses, modules); Peer-reviewed publications (journals, conference proceedings); Quality assurance bodies; Learning object creation, re-use; Harvesting metadata; Deposit / self-archiving; Publication; Validation; Resource discovery, linking, embedding; Searching, harvesting, embedding.]

This work is licensed under a Creative Commons License Attribution-ShareAlike 2.0

© Liz Lyon (UKOLN, University of Bath), 2005

©31

eBank UK Project

Aggregator service harvests metadata from institutional repository (e-crystals archive)

eBank service embedded in PSIgate portal for 3rd party search

Service linking from data to derived research publication

Embedding eBank service in learning workflows

UKOLN (lead), University of Southampton, University of Manchester

http://www.ukoln.ac.uk/projects/ebank-uk/

©32

Linking data to publications

©33

[Figure: CombeChem experiment plan and process record, 30 January 2004 (gvh, hrm, gms). Panels: To Do List, Plan, Process Record. The labels recoverable from the diagram are listed below.]

Ingredient List
Fluorinated biphenyl 0.9 g
Br11OCB 1.59 g
Potassium Carbonate 2.07 g
Butanone 40 ml

Plan
Dissolve 4-fluorinated biphenyl in butanone
Add K2CO3 powder
Heat at reflux for 1.5 hours
Cool and add Br11OCB
Heat at reflux until completion
Cool and add water (30 ml)
Extract with DCM (3 x 40 ml)
Combine organics, dry over MgSO4 & filter
Remove solvent in vacuo
Fuse compound to silica & column in ether/petrol

Process Record
Processes: Add, Reflux, Cool, Liquid-liquid extraction, Dry, Filter (Buchner), Remove Solvent by Rotary Evaporation, Fuse, Column Chromatography (Ether/Petrol ratio)
Inputs: sample of 4-fluorinated biphenyl, butanone, sample of K2CO3 powder, sample of Br11OCB, water, DCM, MgSO4, silica
Observations: Weigh 0.9031 g (4-fluorinated biphenyl); Weigh 2.0719 g (K2CO3 powder); Weigh 1.5918 g (Br11OCB); Measure 40 ml (butanone); Measure 30 ml (water); Measure 3 of 40 ml (DCM); Measure excess

Annotations
Butanone dried via silica column and measured into 100 ml RB flask.
Used 1 ml extra solvent to wash out container.
Started reflux at 13.30. (Had to change heater stirrer.) Only reflux for 45 min, next step 14:15.
Inorganics dissolve, 2 layers. Added brine ~20 ml.
Organics are yellow solution.
Washed MgSO4 with DCM ~50 ml.

Observation Types
weight - grammes
measure - ml, drops
annotate - text
temperature - K, °C

Key
Process
Input
Literal
Observation

Future Questions
Whether to have many subclasses of processes or fewer with annotations
How to depict destructive processes
How to depict taking lots of samples
What is the observation/process boundary? e.g. MRI scan

Provenance
Log what, where, when, who
For data and for publications
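Logging "what, where, when, who" maps naturally onto a small record type. The following is a minimal sketch under stated assumptions: the class and field names are invented here, and the example values for "where" and "who" are placeholders rather than facts from the CombeChem record; only the date and start time of the reflux come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceEntry:
    what: str   # the action or observation
    where: str  # location, instrument or service
    when: str   # ISO 8601 timestamp
    who: str    # person or agent responsible


class ProvenanceLog:
    """Append-only list of provenance entries."""

    def __init__(self) -> None:
        self.entries: List[ProvenanceEntry] = []

    def record(self, what: str, where: str, who: str,
               when: Optional[datetime] = None) -> ProvenanceEntry:
        # Default to the current UTC time if no timestamp is supplied.
        stamp = (when or datetime.now(timezone.utc)).isoformat()
        entry = ProvenanceEntry(what=what, where=where, when=stamp, who=who)
        self.entries.append(entry)
        return entry


log = ProvenanceLog()
log.record(
    what="Started reflux",
    where="fume cupboard (placeholder)",
    who="experimenter (placeholder)",
    when=datetime(2004, 1, 30, 13, 30),
)
print(log.entries[0])
```

The same record shape works whether the thing being logged is a laboratory step, a workflow run, or the retrieval of a publication.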

©34

Workflows

Web services

Text mining

Bioinformatics

Semantic mark-up

©35


Publications have to be computational services – web services

They will be read and processed by machines

Licensing that works!

Authorisation, Authentication and digital rights management (e.g. Shibboleth)

Integration of data and publications

Workflows are linking results, whatever the source

Common ids and persistent ids for citation (DOI, LSID, InChI)

No silos
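For a workflow to link results "whatever the source", it first has to recognise which kind of identifier it has been handed. The sketch below uses heuristic patterns that are assumptions for illustration, not the official grammars of any of the three schemes; the DOI and LSID example values are made up.

```python
import re


def classify_identifier(value: str) -> str:
    """Guess whether a string looks like an InChI, an LSID or a DOI."""
    value = value.strip()
    # InChI strings start with the literal prefix "InChI=".
    if value.startswith("InChI="):
        return "InChI"
    # LSIDs are URNs: urn:lsid:authority:namespace:object[:revision]
    if re.match(r"^urn:lsid:[^:]+:[^:]+:[^:]+(:[^:]+)?$", value,
                flags=re.IGNORECASE):
        return "LSID"
    # DOIs have a "10." prefix, a registrant code, a slash and a suffix.
    if re.match(r"^(doi:)?10\.\d{4,9}/\S+$", value, flags=re.IGNORECASE):
        return "DOI"
    return "unknown"


examples = [
    "10.1234/example.5678",                  # made-up DOI
    "urn:lsid:example.org:sequences:12345",  # made-up LSID
    "InChI=1S/CH4/h1H4",                     # methane
    "not an identifier",
]
for item in examples:
    print(item, "->", classify_identifier(item))
```

A workflow step could use such a check to decide which resolver service to call next.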

©36


Semantic publishing at source

In order to automate we need better ways of interpreting the publication content

They will be read and processed by machines

Integration of data and publications

Common vocabularies

Accessible full texts for text mining, not just abstracts.

©37

Workflows

Bioinformatics

Data

Publications

Semantic markup

Provenance

©38


Publish workflows with data with publications

Privacy? Intellectual property?

Licensing models for services so we can reuse and share results and workflows.

©39

Take home

Machines are reading your journals, not just people

And if the journals are not online then they go unread

Workflows are another form of outcome to publish alongside data, metadata and publications

Google rocks – I don’t use anything else!

http://www.mygrid.org.uk
http://www.ukoln.ac.uk/projects/ebank-uk/
http://www.combechem.org

©40

Acknowledgements

The myGrid Team, esp. Tom Oinn, Chris Wroe, Antoon Goderis, Andy Brass, Paul Fisher, Hannah Tipney, May Tassabehji, Rob Gaizauskas, Ian Roberts

Discovery Net / Inforsense: Vasa Curcin, Moustafa M Ghanem

BioBank / CombeChem: David De Roure, Liz Lyon

Scientists: Peter Murray-Rust, Judith Blake, Mike Ashburner

©41

Digital Library workflows

Workflows for data capture, deposit, preservation, citation, discovery, mining &&….

Multiple workflows interacting together

Workflows may call on each other, in a defined order

Multiple workflows may use “common” services e.g. Assign (identifier)

Require sequential or parallel execution, have dependencies, be time-limited, repetitive

Have an owner (control)

Include essential human interventions

? ? ?
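Workflows that depend on each other, with some able to run in parallel, can be ordered with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the workflow names and the dependency graph are hypothetical, loosely based on the activities listed on this slide.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependencies: each workflow maps to the workflows
# that must finish before it can start.
dependencies: Dict[str, Set[str]] = {
    "capture": set(),
    "assign_identifier": {"capture"},
    "deposit": {"assign_identifier"},
    "preservation": {"deposit"},
    "citation": {"deposit"},
    "discovery": {"deposit"},
}


def execution_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group workflows into batches; items in a batch have no mutual
    dependencies and could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Ownership, time limits and human interventions, which the slide also lists, are not modelled here and would need extra state per workflow.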