Anil Wipat University of Newcastle upon Tyne, UK A Grid based System for Microbial Genome Comparison...


  • A Grid based System for Microbial Genome Comparison and analysis

    Anil Wipat

    University of Newcastle upon Tyne, UK

  • Motivation: Genome Comparison

    The past decade has seen the emergence of whole genome sequencing

    Whole genome sequences can reveal a great deal about the biology of an organism

    Comparing genomes is one of the most effective ways to exploit genome sequence information

    Establishes the differences and similarities at the genetic level

    Aids biologists in understanding pathogenicity, evolution, ecology, metabolism, etc.

  • Microbial genome comparison is commonly applied at different levels:

    DNA (nucleotide sequence): whole-genome nucleotide sequence comparison, e.g. (..atcggatcgtacgagcgatc..) against (..atcccatcgaacgagcgatc..)

    Proteins (amino acid sequence): all-against-all amino acid sequence comparisons between proteins, e.g. (MCSAKMQTR..) against (MSAKMPTR..)
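    As a toy illustration of the two levels, the example fragments on this slide can be compared directly. This is only a minimal sketch: tools such as Blastn and Blastp use alignment heuristics and scoring schemes, whereas the hypothetical helpers `mismatch_positions` and `edit_distance` below implement plain position-wise comparison and Levenshtein distance.

```python
def mismatch_positions(a: str, b: str) -> list[int]:
    """Return 0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("position-wise comparison needs equal-length sequences")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # match or substitute
            ))
        previous = current
    return previous[-1]


# Example fragments taken from the slide.
dna_1 = "atcggatcgtacgagcgatc"
dna_2 = "atcccatcgaacgagcgatc"
protein_1 = "MCSAKMQTR"
protein_2 = "MSAKMPTR"

print(mismatch_positions(dna_1, dna_2))
print(edit_distance(protein_1, protein_2))
```

    The DNA fragments differ at three positions, while the protein fragments differ by one deletion and one substitution.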

  • Motivation: Genome Comparison

    The number of complete genome sequences is rapidly increasing as sequencing technology advances, e.g. ~200 whole genomes have been sequenced

    Sequence analysis and comparison is becoming more computationally intensive

    Large scale genome comparison is already beyond the capability of many laboratories

    How are we going to handle all these genomes? New methods and technologies for genome comparison are required.

  • Microbase Project Overview

    Aims to create a scalable, Grid-enabled analytical system to support microbial genome comparison.

    Aims to support both the biological and bioinformatics community.

    Funded by BBSRC Bioinformatics and e-Science & DTI

    Started April 2003.

    Collaboration with microbiologists and industrial partners providing use cases.

  • Microbase: Functionality

    A system that utilises Grid resources to automatically perform genome comparisons at nucleotide and protein levels

    An information repository that maintains and exposes the results of these comparisons to users as a base level dataset, and provides canned algorithms for analysis

    A Grid-enabled high-performance environment to execute remote user-specified computations

    Data integration with remote, Grid-enabled databases e.g. Genomic, Metabolic, Protein Interaction, Gene Expression databases, etc

  • MicrobaseLite: A Prototype

    The first prototype of the Microbase system

    Automatically performs all-against-all genome comparisons and exposes the resulting datasets

    Provides services for biologists to browse and query genome sequences and comparison results

    Helps the specification of the entire Microbase system and the derivation of use cases

    Implemented using a component-based architecture with Web services interfaces

    Also uses existing Grid technology: the myGrid Notification Service

  • MicrobaseLite: Datasets

    170+ microbial genomes, including bacteria, archaea and eukaryota, held in the GenomePool component

    Results of all-against-all nucleotide sequence comparison (Blastn, MUMmer)

    Results of all-against-all protein sequence comparison (Blastp, Ssearch, Promer)

    Comparison results held in the ComparisonPool component

    Object-oriented data model of interspecies genome rearrangements: the OGRE module component (current research)

  • MicrobaseLite: Architecture

    [Figure: architecture diagram with a client side and a server side. Labels shown: RequestBuilder, ResponseReceiver, Client Proxies, Notification Proxy, Web Services Proxy, DataProcessing, GraphicalViewer, User Tools, Web Services, Query, Microbial Genome Pool (GenomeLoader, BIOSQL, Notification Service, Internal Notification, External Notification), Genome Comparison Pool, Task Scheduler, DNAComparison, ProteinComparison, Post-processing, Comparison Database, OGRE Module (ObjectModelBuilder, Object-oriented Database, Query & Execution)]

  • MicrobaseLite: Microbial Genome Pool

    Provides a Web / Grid service based information repository of microbial genomes

    Maintains a database of 170+ microbial genomes

    A web-service implementation of BioJava interfaces

    Uses the myGrid Notification Service to notify registered clients of new genomes

    Available for use now with a prototype API

    [Figure: Microbial Genome Pool diagram. Labels shown: Clients, Comparison Pool, Notification Service, External Notification, Internal Notification, BIOSQL, GenomeLoader, Web Service API, Microbial Genome Pool]
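    The "notify registered clients of new genomes" behaviour can be pictured as a publish/subscribe pattern. The sketch below is an in-memory stand-in only: the `GenomePool` class and its methods are hypothetical names for illustration and do not reflect the actual Web service API or the myGrid Notification Service interface.

```python
from typing import Callable

GenomeListener = Callable[[str], None]


class GenomePool:
    """Hypothetical in-memory stand-in for the Microbial Genome Pool."""

    def __init__(self) -> None:
        self._genomes: dict[str, str] = {}
        self._listeners: list[GenomeListener] = []

    def register(self, listener: GenomeListener) -> None:
        """Register a client callback to be told about new genomes."""
        self._listeners.append(listener)

    def add_genome(self, genome_id: str, sequence: str) -> None:
        """Store a genome and notify listeners only if it was not known before."""
        is_new = genome_id not in self._genomes
        self._genomes[genome_id] = sequence
        if is_new:
            for listener in self._listeners:
                listener(genome_id)

    def get_genome(self, genome_id: str) -> str:
        return self._genomes[genome_id]


pool = GenomePool()
received: list[str] = []
pool.register(received.append)  # the "client" just records notifications

pool.add_genome("genome_A", "atcggatcgtacgagcgatc")
pool.add_genome("genome_B", "atcccatcgaacgagcgatc")
pool.add_genome("genome_A", "atcggatcgtacgagcgatc")  # already known, no notification
print(received)
```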

  • MicrobaseLite: Genome Comparison Pool

    Retrieves genomes from the Microbial Genome Pool automatically on notification

    Executes a variety of genome comparison tools: Blast, MUMmer, Promer, MSPcrunch

    Incorporates a Task Scheduler for parallel processing

    Uses N1 Grid Engine (batch system) to dispatch comparison tasks to run on Linux clusters

    Comparison outputs are processed and stored in a relational database (MySQL).
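    One way to picture the flow on this slide: when a new genome arrives, comparison tasks against every genome already held are created and then run in parallel. The following is a sketch under assumptions, not the Microbase implementation. `ComparisonTask`, `tasks_for_new_genome`, `run_task` and `dispatch` are hypothetical names, pairs are assumed to be unordered, and a local thread pool stands in for batch submission to N1 Grid Engine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import NamedTuple

TOOLS = ["Blast", "MUMmer", "Promer", "MSPcrunch"]  # tool names from the slide


class ComparisonTask(NamedTuple):
    tool: str
    query: str
    subject: str


def tasks_for_new_genome(new_id: str, known_ids: list[str]) -> list[ComparisonTask]:
    """One task per tool and per already-known genome (assumed unordered pairs)."""
    return [ComparisonTask(tool, new_id, other)
            for other in known_ids
            for tool in TOOLS]


def run_task(task: ComparisonTask) -> str:
    # Placeholder: a real system would submit a batch job and parse its output.
    return f"{task.tool}:{task.query}-vs-{task.subject}"


def dispatch(tasks: list[ComparisonTask], max_parallel: int) -> list[str]:
    """Run tasks with bounded parallelism; results keep the task order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as executor:
        return list(executor.map(run_task, tasks))


known: list[str] = []
results: list[str] = []
for genome_id in ["g1", "g2", "g3"]:  # genomes arriving one at a time
    results.extend(dispatch(tasks_for_new_genome(genome_id, known), max_parallel=4))
    known.append(genome_id)
print(len(results))
```

    Generating tasks incrementally means each pair is compared once per tool: three genomes give three pairs, so twelve tasks with the four tools listed here.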

  • Task Scheduler and scalability

    Execution times of all-against-all comparisons with 10 microbial genomes (Blastp, Blastn, MSPcrunch, MUMmer and PROmer)

    Number of Processors    Execution Time (minutes)
    1                       978.02
    10                      103.03
    20                      57.67
    30                      48.48
    40                      37.33

    [Figure: Task Scheduler diagram. Labels shown: Workstation, Data, Task Scheduler, Job Creation, Job Submission, Job Execution, Job State Checking, Threshold Control, Pre-load, Input, Output, BIOSQL, Microbial Genome Pool, Comparison Database, Genome Comparison Pool, N1 Grid Engine]
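    The execution times reported on this slide can be turned into speedup and parallel efficiency figures. The short calculation below uses only the five published timings; the function name `speedup_table` is a hypothetical helper, and efficiency is taken as speedup divided by processor count.

```python
# Processors -> execution time in minutes, as reported on the slide.
timings = {1: 978.02, 10: 103.03, 20: 57.67, 30: 48.48, 40: 37.33}


def speedup_table(times: dict[int, float]) -> list[tuple[int, float, float]]:
    """Return (processors, speedup, efficiency) rows relative to one processor."""
    baseline = times[1]
    rows = []
    for procs in sorted(times):
        speedup = baseline / times[procs]
        rows.append((procs, speedup, speedup / procs))
    return rows


for procs, speedup, efficiency in speedup_table(timings):
    print(f"{procs:>2} processors: speedup {speedup:5.2f}, efficiency {efficiency:.2f}")
```

    On these figures, 40 processors give roughly a 26-fold speedup over a single processor, which corresponds to a parallel efficiency of about 65%.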

  • MicrobaseLite: User Tools

    Demonstration graphical tools under development

    Genome Browser allows users to view genomes, the comparison results and the results of canned algorithms

    Deployed at client-side, operating via Web services

  • Vision for the full Microbase System

    Continue to explore scalability issues using MicrobaseLite as a platform

    Towards seamless scalability: harnessing of remote clusters on demand

    A system for the submission and enactment of remotely conceived code or workflows for user defined comparative analysis

    Investigating the integration of Taverna core to enact SCUFL workflows within Microbase

  • Conclusions

    Microbase aims to exploit Grid resources to provide a scalable system for microbial genome comparison

    MicrobaseLite produced as a prototype and demonstrator application for the biologist/bioinformatician

    Work now underway on the full Microbase - a system to support remotely conceived computations

  • Acknowledgements

    The Microbase Team: Anil Wipat, Yudong Sun, Matthew Pocock, Keith Flanagan, Pete Lee, and Paul Watson

    The Microbase User Requirements/Use case contributors

    myGrid project (particularly Southampton and EBI)

    The industrial supporters: NonLinear Dynamics, NCIMB, Arrow Therapeutics, Angel Biotech, Complement Genomics, ACS Dobfar, AstraZeneca

    See www.microbase.org.uk

  • Microbial genome comparison is commonly applied at two levels:

    DNA (nucleotide sequence): whole-genome nucleotide sequence comparison, e.g. (..atcggatcgtacgagcgatc..) against (..atcccatcgaacgagcgatc..)

    Proteins (amino acid sequence): all-against-all amino acid sequence comparisons between proteins, e.g. (MCSAKMQTR..) against (MSAKMPTR..)

  • OGRE: Object-oriented Genome REarrangements Model

    A dataset that captures genomic rearrangements between microorganisms

    Object-Oriented (OO) concepts and formalism are being used to classify the results of the nucleotide sequence comparison

    An Ontology and OO-conceptual model is being developed to describe chromosomal rearrangements and to define objects that can represent them

    Algorithms developed to recognise defined rearrangement features in nucleotide sequence comparison data

    Objects made persistent in an OO database
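    To make the idea of representing rearrangements as objects concrete, here is a deliberately naive sketch. It is not the OGRE model or its algorithms: the `Match` and `Rearrangement` classes, the `classify` function and the `max_shift` threshold are all hypothetical, and treating a large coordinate offset as a translocation is a crude proxy chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """A matching region between a reference and a query genome (hypothetical)."""
    ref_start: int
    ref_end: int
    query_start: int
    query_end: int
    same_strand: bool


@dataclass(frozen=True)
class Rearrangement:
    kind: str   # "inversion" or "translocation" in this sketch
    match: Match


def classify(matches: list[Match], max_shift: int = 1000) -> list[Rearrangement]:
    """Label reverse-strand matches as inversions and far-shifted ones as translocations."""
    found = []
    for m in matches:
        if not m.same_strand:
            found.append(Rearrangement("inversion", m))
        elif abs(m.ref_start - m.query_start) > max_shift:
            found.append(Rearrangement("translocation", m))
    return found


# Made-up coordinates, for illustration only.
matches = [
    Match(0, 5000, 0, 5000, True),           # collinear, no rearrangement
    Match(5000, 9000, 9000, 5000, False),    # reverse strand
    Match(9000, 12000, 40000, 43000, True),  # same strand, far from expected position
]
for r in classify(matches):
    print(r.kind, r.match.ref_start, r.match.ref_end)
```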

  • MicrobaseLite: OGRE Module

    Performs object-oriented analysis and storage of genome rearrangements

    An OO dataset captures genomic rearrangements revealed through nucleotide sequence comparison

    Made persistent in an OO database

    Provides Web services interface for external users to query and analyse the OO dataset

    [Figure: OGRE Module diagram. Labels shown: Object-oriented Database, ObjectModelBuilder, Query & Execution, OGRE Module, Comparison Pool, Web Services]