Funded by:

EMBRACE and EMBOSS

Integrating everything and

Integrated by everything

Peter Rice, EBI ([email protected])

June 2006

EMBRACE and EMBOSS

EMBRACE is an EC-funded Network of Excellence with 18 partners, developing an integrated set of services for the major bioinformatics data resources and analysis tools.

The EMBRACE name was selected after two previous names were rejected. It stands for "European Model for Bioinformatics Research And Community Education" and has no connection with EMBL.

EMBOSS is now 10 years old, with the project team hosted by EMBL-EBI, providing open source libraries and over 200 applications for sequence analysis.

EMBOSS has its roots at EMBL Heidelberg, but started at the Sanger Centre and the UK EMBnet node. The EMB prefix reflects the EMBL and EMBnet origins: "European Molecular Biology Open Software Suite".

EMBRACE

Network of Excellence - 18 partners with data resources, analysis tools, expertise in grid technology and experimental biologists.

• Graham Cameron, Peter Rice, Alan Bleasby (EBI, Cambridge, GB)
• Toby Gibson (EMBL, Heidelberg, DE)
• Andreas Gisel (Institute of Biomedical Technologies, Section Bari, CNR, IT)
• Teresa Attwood (University of Manchester, GB)
• Marco Pagni (Swiss Institute of Bioinformatics, CH)
• Erik Bongcam-Rudloff (LCB/BMC, Uppsala, SE)
• Vincent Breton (CNRS, Clermont Ferrand, FR)
• Søren Brunak (CBS, Lyngby, DK)
• José-María Carazo (CNB, Madrid, ES)
• Arne Elofsson (DBB, Stockholm, SE)
• Daniel Kahn (INRA/CNRS, Toulouse, FR)
• Ralf Herwig (MPI für Molekulare Genetik, Berlin, DE)
• Eija Korpelainen (CSC, Espoo, FI)
• Christine Orengo (University College London, GB)
• Yitzhak Pilpel (Weizmann Institute of Science, IL)
• Gert Vriend (CMBI, Nijmegen, NL)
• Alfonso Valencia (INTA-CAB, Madrid, ES)
• Christian Bryne (University of Bergen, NO)

EMBRACE Overview

This kind of programming is hard to do.

EMBRACE aims to make it easier, and within the reach of experimental biologists.

To do this, we need an interoperable set of services and clients that can both find and make use of them.

EMBRACE aims to enable ...

•a scientist to invoke the latest and best version of a given program without any concern for its physical location

•the program to find the most up-to-date data without help from the user

•workflows to automatically take advantage of whatever compute power is available

•workflows to deliver results in a way which any user can understand

•the scientist to follow connections to other relevant data and tools using all the straightforward idioms of web browsing and hyperlinks.

EMBRACE: Interconnectivity

[Figure: application user interface and application interface]

EMBRACE: Approaches

•Defining an application interface
  •Design from the view of the user/application

•Browser example
  •User provides a query and a data type
  •Generate a list of results by data resource
  •Expand and browse the list, following links
  •Select some or all as input to analysis tools
  •Requires human-readable definitions

•Automation
  •A similar example, but with a program selecting and launching the analysis
  •Requires machine-readable definitions

EMBRACE Data Content

• DNA sequence information
• Protein sequence information
• Genome annotation
• Macromolecular Structure Data
• Expression information
• Literature
• Orthologs
• Untranslated regions
• Protein Families
• Alignments
• Protein/protein-associations
• Structural domains
• Gene3D
• ORFandDB
• SNPs in regulatory regions
• 3D Electron Microscopy data

EMBRACE Analysis Tools

• EMBOSS
• DNA sequence analysis
• Protein sequence analysis
• Pattern matching
• Genome annotation
• Expert systems
• Hidden Markov Models
• Homology searches
• Phylogenetic analysis
• Protein structure analysis
• Protein structure comparison
• Protein domain mapping
• Microarrays and gene expression
• Bioinformatics workflows
• Bioinformatics tool environments
• Protein structure prediction
• Electron microscopy
• Electron microscope tomography
• Systems biology modelling
• Text mining

EMBRACEgrid

Web services / Grid services

Requires:

• Data management
• Data replication
• Service discovery
• Computing

[Figure: web services and grid services rated against these requirements; ratings shown: KO ?? KO OK and OK ?? OK KO]

• Lack of infrastructure providing low-level services
• Instability and lack of robustness
• Standards still evolving, and implementations lagging behind

Information world / Infrastructure world

EMBRACE: Data Content Services

•Promised deliverables are prototypes
•Webservice technology
•Content provided by EBI and EMBL Heidelberg
•Access to:
  •Nucleotide sequence data resources
  •Protein sequence data resources
  •Protein motif resources

•Technology choices kept flexible
  •SOAP webservices from EBI
  •BioMart from EBI
  •Existing services from other partners

EMBRACE: Analysis Tools Services

•Promised deliverables are prototypes
•Webservice technology
•Content provided by EBI
•Access to:
  •Sequence analysis tools (EMBOSS etc.)
  •Protein structure analysis tools (EMBOSS/EMBASSY etc.)

•Technology choices kept flexible
  •SOAP webservices
  •SOAPlab project (EBI/MyGrid)
  •Life Science Analysis Engine standard (OMG)

•Integration also implies
  •Tools will access data resources via EMBRACE interfaces

EMBRACE: Technology Choice

•Promised deliverable is a survey of webservice and grid technologies
•Will be made publicly available
•To cover:
  •European Grids and Bioinformatics (EGEE etc.)
  •Webservice standards
  •Grid service standards
  •Current standards
  •Emerging standards
  •Recommendations on technology adoption
  •Recommendations on further technology watch

•Technology test cases
  •Designed to demonstrate technology
  •Designed to show improvements in technology
  •Designed to highlight problems

EMBRACE: Test Cases

•EMBRACE is driven by biological test cases:
  •4 initial test cases in the proposal
  •Workshop (Uppsala, 2005) defined new test cases
  •Partners illustrating use of their content/tool resources

•Test cases described in detail
•Template adopted from BioMOBY
•Implement template solutions
•Identify missing components
•Set priorities
•... and fill in the gaps

EMBRACE: Outreach

•First workshops have been internal (inreach)
•In 2006, workshops will be mixed with outreach
•EMBRACE is aimed at skilled bioinformaticians
•Need to address needs of biological researchers

•EMBRACE provides a programming interface to services
•Biologists need a simple "browser"
•EMBRACE will need a simple interface to demonstrate utility

•Example interfaces:
  •Taverna (EBI/MyGrid/OMII-UK)
  •Other workflow systems
  •Simple program examples
  •Simple script examples
  •"The Big Red Button"

EMBRACE Year Two

•Prototype content services to become standard
•Prototype tool services to become standard
•Further prototypes beyond sequence data
•Established technology choice
•Well documented test cases
•Good links to biological research community

•Selected collaborators
  •Willing to explore emerging technologies
  •Biological (and practical) use cases

EMBOSS: History

• EMBOSS started in March 1996

• First requirements based on a list of long-standing problems in existing commercial software (GCG), and the need for public source code

• First "ajax" library written August 1996

• 30 potential developer/user sites identified November 1996 (EMBnet Helsinki)

• Wellcome Trust proposal February 1997 (Sanger, HGMP and EBI)

• Accepted August 1997

• Project started November 1997.

• EMBOSS 1.0.0 released on 15th July 2000.

• EMBOSS 2.0.0 released on 15th July 2002.

• EMBOSS 3.0.0 released on 15th July 2005

• EMBOSS 4.0.0 will be released on 15th July 2006

Original Target Users

Each of the following groups had their own special needs which EMBOSS aimed to satisfy:

• Sanger Centre genomic sequencing and analysis groups

• RFCGR/HGMP registered academic users (about 10,000)

• EMBnet service providers in 30+ other countries with over 30,000 users

• Academic users everywhere

• Pharmaceutical and biotechnology industry

• Bioinformatics developers

Seqret

Seqret is a very simple application

• It reads a sequence from a USA, or Uniform Sequence Address (in any format, from anywhere)

• It writes a sequence to a USA (in any format)

If you tell it the sequence has feature annotation:

• It reads the features (in any format)

• It writes the features (in any format)

Seqret has 13 lines of code

The source code seqret.c

#include "emboss.h"

int main(int argc, char **argv)
{
    AjPSeqall seqall;                   /* input sequence stream */
    AjPSeqout outseq;                   /* output sequence writer */
    AjPSeq seq = NULL;                  /* current sequence */

    embInit("seqret", argc, argv);      /* process the command line */

    seqall = ajAcdGetSeqall("sequence");
    outseq = ajAcdGetSeqout("seqout");

    while (ajSeqallNext(seqall, &seq))  /* loop over every input sequence */
        ajSeqWrite(outseq, seq);
    ajSeqWriteClose(outseq);

    ajExit();
}

EMBOSS Quality Control

• Nightly build with no compiler warnings
• 2,000 test runs (including expected fail conditions)
• 150 valgrind memory leak tests
• Code documentation validation and indexing
• ACD file validation
• ACD documentation completeness
• Program documentation: description, command line qualifiers, example run(s) and input/output files
• Web site updates

Disaster proof software licences

• 1977 Fred Sanger sequences ΦX174 with computing by Rodger Staden

• 1996 EMBOSS started by Peter Rice (Sanger) and Alan Bleasby (SEQNET Daresbury), in collaboration with Thure Etzold (EBI)

• 1997 funding approved by the Wellcome Trust

• 1998 SEQNET relocated to Hinxton (HGMP)

• 1999 Thure goes to LION Bioscience

• 2000 Peter leaves Sanger – EMBOSS goes to Alan at HGMP

• 2001 LION (Peter) adds EMBOSS to SRS and updates EMBOSS

• CCP11 funding for EMBOSS development

• 2002 Peter leaves LION

• 2003 Peter joins EBI – integrating EMBOSS in myGrid services

• Medical Research Council terminates funding for Rodger Staden

• MRC still "owns" the Staden package. Rodger Staden retires.

• HGMP is renamed after Rosalind Franklin (by MRC)

• 2004 April 1st: MRC announces RFCGR will be closed within 15 months

• 2005 Alan Bleasby and Jon Ison move to EBI; Tim Carver moves to Sanger

• All the code is still licensed to everyone under (L)GPL.

Users: Are you a Man or a Mouse?

Command Line

EMBOSS has many possible command lines:

• Prompting for required values

% seqret
What sequence []: embl:paamir
Output file [paamir.fasta]:

• Unix style

% seqret embl:paamir -send 100 -auto
% seqret embl:paamir -se 100 -auto
% seqret -se 100 embl:paamir -auto

• GCG style

% seqret embl:paamir -send=100 -auto
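
The command lines above accept a qualifier written in full or shortened, and the GCG style attaches the value with "=". As a rough illustration of how unambiguous abbreviations might be resolved, here is a minimal sketch in C. It is not the EMBOSS implementation: the function name resolve_qualifier and the small qualifier table are hypothetical, chosen only to mirror the examples on this slide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of qualifier names, for illustration only. */
static const char *const QUALIFIERS[] = {
    "sbegin", "send", "sreverse", "sformat", "auto"
};
enum { NQUAL = sizeof QUALIFIERS / sizeof QUALIFIERS[0] };

/*
 * Resolve a possibly abbreviated qualifier name (leading '-' removed).
 * A trailing "=value" part (GCG style) is ignored when matching.
 * Returns the full qualifier name, or NULL when the name is unknown
 * or matches more than one qualifier.
 */
const char *resolve_qualifier(const char *arg)
{
    size_t len;
    size_t i;
    size_t nmatch = 0;
    const char *match = NULL;

    assert(arg != NULL);
    len = strcspn(arg, "=");            /* length of the name part only */
    if (len == 0)
        return NULL;

    for (i = 0; i < NQUAL; i++) {
        if (strncmp(QUALIFIERS[i], arg, len) != 0)
            continue;                   /* arg is not a prefix of this name */
        if (QUALIFIERS[i][len] == '\0')
            return QUALIFIERS[i];       /* exact match always wins */
        match = QUALIFIERS[i];
        nmatch++;
    }
    return nmatch == 1 ? match : NULL;  /* reject ambiguous prefixes */
}
```

With this table, resolve_qualifier("se") and resolve_qualifier("send=100") both give "send", while "s" is rejected because it could mean several qualifiers.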

Web Interface (wEMBOSS)

Web interface (SRS)

GUI Interfaces: Jemboss

GUI Interfaces: Taverna

Where are we now?

New grant vision

• For the new grant we were asked to present a vision:

• Genomics (whole genome analysis)

• Phylogenetics (beyond phylip)

• Gene expression (microarray data standards)

• Biostatistics (R and BioConductor)

• Proteomics (2d gel, MS, etc)

• Genetic linkage

• Chemistry (small molecules)

• All these ideas came from the 2005 User Survey

• We have funding only for core development (so far)

Extending core EMBOSS

• There are many other things we can do:

• Workflows

• Automatic support for the 100+ interfaces

• Generating XML files

• Notification of changes to ACD standard

• Testing

• Ontologies

• Graphics library

• Database indexing

• Non-sequence data access

EMBOSS Books

• Three books are planned after 4.0.0

• Text ownership stays with the EMBOSS team for reuse

• Publisher: Cambridge University Press

• Programmer's guide

• After a major code refactoring effort

• Automated generation of code examples

• Administrator's guide

• Installing and maintaining EMBOSS code

• Managing data resources

• Supporting in-house developers

• User's guide

• Aimed at experimental biologists

EMBOSS and Industry

• Celera were the first industrial users

• And the first to provide funding (for the SRS interface)

• Hardware manufacturers offer machines and compilers

• IBM, HP, Apple

• Our latest partners are SciTegic/Accelrys

• Pipeline Pilot Independent Software Vendor partnership

Pipelining Heterogeneous Tools

Heterogeneous [BioJava, Perl, PROSITE, EMBOSS, (& GCG)] tools for sequence annotation

The SciTegic Challenge

• Pipeline Pilot runs on Linux
  • BioPerl interface to launch EMBOSS
  • EMBOSS team to maintain the BioPerl code

• Pipeline Pilot runs on Windows
  • EMBOSS team to support EMBOSSWIN

• Why? Because we can do it, and we expect the GCG development team will find it difficult!

We need help

• Encouraging more developers

• CUP books

• Developer training courses - not in Hinxton

• Course in Indiana May 2005

• Sponsorship offer from Newcastle, UK

• Willing to travel anywhere!!!

[email protected]

• Henrikki Almusa and Medicel (Helsinki)

• Suggestions for new applications

• Collaborations in proposed new areas.

Acknowledgements

• (HGMP/RFCGR): Gary Williams, Tim Carver, Hugh Morgan, Claude Beesley, Damian Counsell, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop

• LION: (Thomas Laurent), (Bijay Jassal), Thure Etzold

• Sanger: (Ian Longden), (Richard Bruskiewich), Simon Kelley, (Ewan Birney)

• EBI: Peter Rice, Alan Bleasby, Jon Ison, Lisa Mullan, (Martin Senger), Tom Oinn, Rodrigo Lopez, Mahmut Uludag, Shaun McGlinchey

• EMBnet: UK, Norway, Italy, Germany, Belgium, Argentina, China, Turkey, Israel, Canada, Manchester

• Others: Don Gilbert, Will Gilbert, Rodger Staden, Bill Pearson, Catherine Letondal, Luke McCarthy, Susan Jean Johns, David Bauer, Andrew Lyall, Henrikki Almusa, Melody Clark, ....