Bioinformatics Workflows Chris Wroe (based on material from the myGrid team & May Tassabehji / Hannah Tipney Medical Genetics, St Marys)

Page 1

Bioinformatics Workflows

Chris Wroe (based on material from the myGrid team & May Tassabehji / Hannah Tipney, Medical Genetics, St Marys)

Page 2

Bioinformatics pipelines on the web

• Copying and pasting from one web-based application to another, with annotation by hand

• Advantages : quick, easy access to distributed resources

• Disadvantages: time consuming, error prone, tacit procedure so difficult to share both protocol and results

[Diagram: RepeatMasker → BLASTn → Twinscan]

Page 3

Automating pipelines

• Using Perl / Matlab scripts to implement a pipeline

• Advantages: automation, quick to write, significant community resources (e.g. BioPerl)

• Disadvantages: hard to explain, hard to relocate, hard to tinker with.

Page 4

Workflows

[Diagram: Sequence in → RepeatMasker web service → BLASTn web service → Twinscan web service → Predicted genes out]

• Simple scripting language aims to specify how steps of a pipeline link together

• High level picture of the pipeline separated from any low level fiddling

• Application logic and low level fiddling encapsulated in remote web services

• Advantages : automation, quick to write, easier to explain, share, relocate, and record provenance of results in a standard way
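
The idea on this slide can be sketched in Python, used here only for illustration (myGrid workflows are written in Scufl, not Python). The three step functions are hypothetical local stand-ins for the remote RepeatMasker, BLASTn and Twinscan web services, and their bodies are placeholders rather than the real algorithms. The sketch tries to show that the workflow itself is just an ordered list of linked steps, kept apart from the logic inside each step, and that a generic runner can record provenance in a uniform way.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for remote web services; the bodies are placeholders.
def repeat_masker(sequence: str) -> str:
    """Placeholder: pretend lowercase bases are repeats and mask them with N."""
    return "".join("N" if base.islower() else base for base in sequence)

def blastn(sequence: str) -> str:
    """Placeholder: a real service would report similarity hits; pass through."""
    return sequence

def twinscan(sequence: str) -> List[str]:
    """Placeholder: treat each unmasked stretch as one 'predicted gene'."""
    return [part for part in sequence.split("N") if part]

# High-level picture of the pipeline: which step feeds which.
WORKFLOW: List[Tuple[str, Callable]] = [
    ("RepeatMasker", repeat_masker),
    ("BLASTn", blastn),
    ("Twinscan", twinscan),
]

def run_workflow(workflow, data):
    """Run the steps in order, feed each output to the next, log provenance."""
    provenance = []
    for name, step in workflow:
        result = step(data)
        provenance.append({"step": name, "input": data, "output": result})
        data = result
    return data, provenance

# Sequence in, predicted genes out.
genes, provenance = run_workflow(WORKFLOW, "ACGTacgtGGCC")
print(genes)
for record in provenance:
    print(record["step"], "->", record["output"])
```

Swapping a step or relocating the pipeline then only touches the WORKFLOW list, which is the property the slide contrasts with hand-written scripts.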

Page 5

Workflow components in myGrid

• Scufl – Simple Conceptual Unified Flow Language

– Developed by myGrid members at EBI.

– Designed to be as simple as possible, with just enough features to support bioinformatics workflows

• Taverna – a tool for writing and running workflows and examining results (http://taverna.sourceforge.net)

• FreeFluo – workflow engine to run workflows (http://freefluo.sourceforge.net)

Page 6

Page 7

Workflow use

• Newcastle University (Anil Wipat, Peter Li)

– Affymetrix microarray analysis workflow

– Gene annotation workflow

• Manchester University (May Tassabehji, PhD student Hannah Tipney, Medical Genetics, St Marys; Wellcome Trust funded)

– Gene alerting service workflow (GAS)

– Gene and protein annotation workflow

• And others

Page 8

Workflow experience +

• Easy to get started with Taverna (1-2 hours tutorial)

• Sharing does happen

• Cuts down the time taken to perform one pipeline from 2 weeks to 2 hours

Page 9

Workflow experience: outstanding issues

• Early days: web services rare; significant time taken to wrap applications as web services (licensing, installation, maintenance)

– Soaplab and Gowlab try to help (http://industry.ebi.ac.uk/soaplab)

• Fiddly bits don’t go away: Many ‘shim’ services needed to ensure the output of one step fits the expected input of another

• Automation produces many results in a short amount of time. Issues of result management and display
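
A shim of the kind mentioned above might look like the following Python sketch. It assumes, purely for illustration, that one step emits a FASTA record while the next step expects a bare sequence string; the function name and the choice of formats are hypothetical and not taken from myGrid.

```python
def fasta_to_raw_sequence(fasta_text: str) -> str:
    """Shim: drop FASTA header lines and join the remaining sequence lines."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))

# Output of an upstream step (made-up record) adapted for a downstream step.
fasta_output = ">example_record hypothetical\nACGTAC\nGGTTAA\n"
raw = fasta_to_raw_sequence(fasta_output)
print(raw)
```

Each shim is small, but a workflow may need one between almost every pair of steps, which is why the fiddly bits do not go away.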

Page 10

Other workflow systems

• Commercial bioinformatics – drug discovery

– Incogen VIBE

– TurboWorx Pipeline Pilot

• eScience

– DiscoveryNet (bioinformatics – proprietary)

– Kepler (US, ecology)

– Triana (UK, physics, astronomy, signal processing)

Page 11

Workflow standards

• Can’t have enough of them! All currently come from e-Business rather than science community

• BPEL – Business Process Execution Language

• WS – Orchestration

• XML Process Definition Language (XPDL)

• Business Process Markup Language (BPML)