
PROGRAMME EMERGENCE

EDITION 2012

GATB Project

SCIENTIFIC DOCUMENT

ANR-GUI-AAP-05 – Doc Scientifique 2012 – VF

Acronym: GATB

Proposal title in French: Boite à outils « Assemblage pour la Génomique »

Proposal title in English: Genomic Assembly Tool Box

Keywords (scientific approach): Genomic Data Processing, Assembly, Mapping

Keywords (application field): Next Generation Sequencing, Bioinformatics, Genomics, Assembly, Biotechnology

Technology transfer model: Software Program Licensing

International cooperation: ☐ The project proposes an international cooperation

Requested grant: 183372 €

Project duration: 24 months




1. EXECUTIVE SUMMARY

2. CONTEXT, POSITION AND OBJECTIVES OF THE PROPOSAL
2.1. Context, social and economic issues
2.2. Position of the project
2.3. State of the art
2.4. Objectives, originality and novelty of the project

3. SCIENTIFIC AND TECHNICAL PROGRAMME, PROJECT ORGANISATION
3.1. Scientific programme, project structure
3.2. Project management
3.3. Description by task
3.3.1 Task 1: GATB v1.0
3.3.2 Task 2: GATB v2.0
3.3.3 Task 3: Validation
3.3.4 Task 4: Technology Transfer Activities
3.4. Tasks schedule, deliverables and milestones

4. DISSEMINATION AND EXPLOITATION OF RESULTS, INTELLECTUAL PROPERTY
4.1. Technology transfer strategy
4.1.1 Inria technology transfer strategy and associated process
4.1.2 Short overview of the market
4.1.3 Planned technology transfer scheme
4.1.4 Added value of the GATB toolbox
4.1.5 Return on Investment
4.2. State & strategy of the intellectual property
4.3. Technology transfer office role in the milestones of the project
4.4. Resources involved by the technology transfer office during the project

5. CONSORTIUM DESCRIPTION
5.1. Partners description & relevance, complementarity
5.2. Qualification of the project coordinator
5.3. Qualification and contribution of each partner

6. SCIENTIFIC JUSTIFICATION OF REQUESTED RESOURCES
6.1. Partner 1: GenScale
6.2. Partner 2: Inria Technology Transfer Office

7. REFERENCES




1. EXECUTIVE SUMMARY

A few years ago, genomics witnessed an unprecedentedly deep change with the advent of High Throughput Sequencing (HTS), also known as Next Generation Sequencing (NGS). These technologies generate huge volumes of genomic data. Crucial computational developments are currently needed to extract knowledge from this mass of data.

The GATB project focuses on a specific critical HTS treatment: assembly. Genomic assembly consists in reconstructing a genome from sets of very small DNA or RNA sequences, called reads, generated by NGS machines. For complex genomes, billions of reads need to be ordered, leading to time-consuming processing that requires computers with very large memories. This is a serious bottleneck in many HTS analyses, for both academic labs and industrial companies.

The INRIA GenScale team has developed fast, innovative assembly algorithms with a very low memory footprint. Two prototypes, called Monument and Mapsembler respectively, have been developed as proofs of concept. Monument is dedicated to de novo assembly for reconstructing complete genomes. Mapsembler, which is a more general HTS processing tool, offers the possibility to assemble specific regions of interest.

In this project we propose to develop a Genomic Assembly Tool Box allowing end-users to customize the assembly process according to (1) the nature of the genomic data generated by NGS machines, (2) the complexity of the genome to assemble, or (3) the specific biological question to be answered. The final goal is to prepare industrial technology transfer of the Genomic Assembly Tool Box, targeting a wide range of genomic domains (health, agronomy, ecology, etc.).

2. CONTEXT, POSITION AND OBJECTIVES OF THE PROPOSAL

2.1. CONTEXT, SOCIAL AND ECONOMIC ISSUES

A few years ago, with the arrival of High Throughput Sequencing (HTS) technologies, genomics witnessed an unprecedentedly deep change: biological material (DNA and RNA) can now be sequenced in much higher volumes than before, for a price accessible to most academic labs. As an example, approximately 10 years and 10⁹ dollars were necessary to sequence the human genome in the nineties, while nowadays a full human genome is expected to be sequenced in 24 hours for a few thousand dollars. Hence, HTS opened the doors to many applications, which appear to be limited only by the imagination of the users. These include de novo sequencing and resequencing (sequencing an individual of an already sequenced species) of genomes, and RNA-seq (sequencing of transcriptomes: the expressed fraction of a genome). HTS also makes it possible to detect which DNA regions interact with known proteins (ChIP-seq).

The range of biological questions which can be addressed is extremely broad, as it includes, among others, questions related to health (for instance, find genes differentially regulated in cancer), ecology (identify all species present in a given environment), or agronomy (help for plant selection for instance).

Nowadays, almost all biological studies include a first sequencing step, generating a volume of data that computer scientists were not ready to cope with. The intensive usage of these new technologies generates datasets which now reach several terabytes. The amount of data is thus one of the two main bottlenecks in the exploitation of HTS.

The second main bottleneck comes from the type of data that is generated: the technologies do not provide one full sequence per DNA molecule. Instead, they output reads, which are small sequence fragments a few hundred characters long. These reads may contain sequencing errors (insertions/deletions, substitutions). As reads overlap, the original sequence may be reconstructed using an assembly phase. Alternatively, if a reference genome for the studied species is available, i.e. a (more or less) assembled genome identical or close to the genome of the species studied, reads may be mapped onto this reference.

It is indispensable to develop solutions for extracting information from HTS data while tackling the two main bottlenecks: size and type of data. Methods, and consequently software, must be fast and have a low memory footprint in order not to saturate bioinformatics computing centers. As the difficulty is no longer to produce data, the real challenges we are facing are data treatment and analysis.

This project is formulated in this spirit and specifically targets the assembly step.

2.2. POSITION OF THE PROJECT

The project tackles the challenges of the assembly process following two main ideas:

1. To face the HTS data tsunami, assemblers must be fast and have a low memory footprint;

2. To provide high quality results, assemblers need to be customized.

Today, a few assemblers exist (see next section). For most of them, assembling complex genomes requires days of computation. Furthermore, to run this software, computers must be equipped with very large memories (up to 512 GB). This is a very strong constraint, as it requires terabytes of data to be sent to (and stored in) bioinformatics centers. Providing fast assemblers able to run near the source of HTS data, i.e. on computers with standard memory size, would be valuable to anticipate the HTS deluge: today, the trend is to equip genomic labs, hospitals, etc., with next generation sequencers, but not with the substantial computing power currently required to process the data.

The second point deals with the variety of genomes to assemble or, more generally, with the variety of biological questions which can be treated with HTS data. It is clear that assembling the genome of a prokaryote and the genome of a eukaryote doesn’t have the same complexity and doesn’t require algorithms with the same features. Sequencing technologies also differ and generate data of different types, which call for specific assembly treatments. More importantly, answering some specific questions doesn’t need a complete assembly phase, but only a targeted assembly of some particular regions of the genome. All these criteria together prevent the design of a «universal assembler» able to cope with all situations. Thus, users other than experts often have difficulty fully exploiting these (complex) tools, and are often disappointed with assembly results. Yet the current trend is to use (more or less) the same assembly tools for processing a wide range of data. Our approach, as opposed to monolithic assemblers, is to propose a modular assembler that can be customized and adapted to specific assembly treatments.

Compared to historical actors in the domain, such as BGI (Beijing Genomics Institute, China), the Broad Institute (MIT, Harvard) or the Sanger Institute (Cambridge, UK), the GenScale/INRIA team has much shorter experience in the assembly field. However, participation in international competitions (dnGasp, Assemblathon [EARL2011]) has shown that our approach is very competitive, even if it doesn’t perform well in all aspects. In particular, our tools are among the fastest ones and can be executed on rather small memory systems.

2.3. STATE OF THE ART

Genomic assembly consists in reconstructing a genome from a set of sequencing reads, either de novo or reference-guided. Only the former is computationally challenging, as the latter essentially consists of mapping reads to a reference genome and filling the gaps with various strategies, possibly including de novo assembly of unmapped reads [NOG+09, PPDS04]. In the following, we survey existing methods for de novo assembly.

Next-generation de novo genomic assemblers can be divided into two classes: short reads (SR) assemblers and ultra-short reads (USR) assemblers. The former focus on assembling 454 data, i.e. millions of reads of length 200-500 bp. These assemblers are based on a graph data structure (overlap graph), where graph vertices are reads and edges are significant overlaps between reads. Such a data structure limits the number of reads that can be processed, as the graph stores information for O(n²) overlaps.
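As an illustration of why the overlap graph scales quadratically, here is a minimal sketch of a naive construction. It is not the method of any assembler cited here: the function names, the exact suffix/prefix matching and the `min_len` threshold are assumptions made for the example, and real tools tolerate sequencing errors and use indexes to avoid testing every pair.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_len=3):
    """Map each ordered pair of read indices (i, j) to its overlap length."""
    graph = {}
    # Every ordered pair is tested: n * (n - 1) comparisons for n reads.
    for i, j in permutations(range(len(reads)), 2):
        length = overlap_length(reads[i], reads[j], min_len)
        if length:
            graph[(i, j)] = length
    return graph


reads = ["ACGTAC", "TACGGA", "GGATTC"]
graph = build_overlap_graph(reads)
print(graph)  # {(0, 1): 3, (1, 2): 3}
```

With three reads the loop tests six ordered pairs; with a million reads it would test about 10¹² pairs, which is the limitation described above.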

The most high-profile short reads assembly tools are Newbler (commercial, 454 software), Cabog [MDK+08] and Mira [CPWS99]. Typically, these are capable of assembling a million 454 reads in a couple of CPU hours. It should be noted that these assemblers can also process hybrid data sets, e.g. a mixture of Sanger, 454 and ultra-short reads. However, assembly of ultra-short reads using SR assemblers is intrinsically limited to the order of a few million reads, i.e. a fraction of the sequencer output. Among SR assemblers, Celera is to our knowledge the only software implementing parallel assembly, using a coarse-grained model (requires a grid).

Ultra-short reads (USR) assemblers target Illumina and SOLiD sequencers, which produce several orders of magnitude more reads than the 454 technology, although the reads are shorter (30-100 bp). USR assemblers rely on a more succinct graph-based data structure (de Bruijn graph) where, given an integer k (often between 15 and 64), vertices are k-length substrings of reads, and edges correspond to (k-1)-length overlaps. This data structure, introduced by Pevzner and colleagues [PTW01], allows for a high number of reads, as the size of the graph only depends on the genome structure. Differences between USR assemblers stem from the various heuristics used for graph simplification and traversal, and from their error-correction approaches. To cope with sequencing errors in the data, the Euler-USR [CBP09], Allpaths-LG [GMP+11] and SOAPdenovo [LZR+10] assemblers implement pre-assembly error correction, whereas Velvet [ZB08] and ABySS [SWJ+09] perform in-assembly graph simplification to remove vertices corresponding to erroneous reads. As pre-assembly correction is computationally expensive (its running time is comparable to the whole assembly) and in-assembly correction greatly expands the size of the graph, it is still unclear which error correction method is the most suitable in practice. Moreover, among the USR assemblers cited above, only SOAPdenovo and ABySS can handle mammalian-sized genomes, because even constructing error-corrected de Bruijn graphs for such genomes requires an unreasonable amount of memory. ABySS solved this problem by using a distributed approach, while SOAPdenovo discards read information in the de Bruijn graph.
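The de Bruijn graph definition above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the data structure of any assembler cited here: `build_de_bruijn` is a hypothetical name, the graph is a plain dictionary, and reverse complements and sequencing errors are ignored.

```python
def build_de_bruijn(reads, k):
    """Build a de Bruijn graph as {k-mer: set of successor k-mers}.

    Vertices are the distinct k-mers found in the reads. An edge u -> v is
    added when v follows u in some read, i.e. the last k - 1 characters of u
    are the first k - 1 characters of v.
    """
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            successors = graph.setdefault(kmer, set())
            if i + k < len(read):
                successors.add(read[i + 1:i + k + 1])
    return graph


reads = ["ACGTAC", "CGTACG", "GTACGT"]
graph = build_de_bruijn(reads, k=4)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))

# Adding more copies of the same reads leaves the graph unchanged.
assert build_de_bruijn(reads * 10, k=4) == graph
```

The three reads share most of their k-mers, so the graph has only four vertices, and repeating the reads ten times does not add any. This is the property referred to above: the graph grows with the number of distinct k-mers in the genome, not with the number of reads.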

A recent method, Monument [CL11], proposed locally global construction and traversal of overlap graphs. This overcomes the memory limitation of constructing a complete overlap graph, while permitting assemblies of better quality than de Bruijn graphs. Compared to other assemblers, the Monument assembler implements two novel features. First, it uses a new indexing module that dynamically detects and discards entries due to read errors [CCL11]. Hence, the pre-assembly error-correction phase becomes optional, and the in-assembly error correction is no longer memory-bound. Constructing this index requires less memory than other approaches, as erroneous index entries are removed before the full index is constructed.
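The mechanism used by the Monument index is not detailed here, so the sketch below shows a different and much simpler idea that is commonly used to spot entries caused by read errors: a k-mer seen very few times across redundant reads is probably erroneous. The function name `solid_kmers` and the `min_count` threshold are assumptions for illustration. Unlike the index described above, this version counts every k-mer before filtering, so it does not reduce peak memory.

```python
from collections import Counter


def solid_kmers(reads, k, min_count=2):
    """Return the k-mers that occur at least `min_count` times in the reads."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}


# The third read carries one substitution (A -> T at position 4).
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
print(sorted(solid_kmers(reads, k=4)))  # ['ACGT', 'CGTA', 'GTAC']
```

The two k-mers introduced by the substitution (`CGTT` and `GTTC`) appear only once and are dropped, while the k-mers supported by several reads are kept.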

Second, the assembly module of Monument is the only method that can construct longer sequences (scaffolds, i.e. sequences which may contain gaps, as opposed to contigs, sequences without gaps) locally. One main advantage, compared to other methods, is that missing read overlaps (possibly due to sequencing artifacts, such as coverage gaps or localized abundant errors) can be represented by gaps in scaffolds, whereas they would necessarily cause contigs to be interrupted. By pioneering localized scaffold construction, the Monument assembler casts assembly as an embarrassingly parallel problem, which can be efficiently solved on a large cluster of moderately powerful machines.
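To make the contig/scaffold distinction concrete, the sketch below assumes the common convention of writing a gap as a run of `N` characters. The text does not specify how Monument represents gaps, so both this convention and the helper name `split_scaffold` are assumptions.

```python
import re


def split_scaffold(scaffold):
    """Return the gap-free contigs of a scaffold whose gaps are runs of 'N'."""
    return [contig for contig in re.split(r"N+", scaffold) if contig]


scaffold = "ACGTACGTNNNNNGGATCCNNTTAGC"
print(split_scaffold(scaffold))  # ['ACGTACGT', 'GGATCC', 'TTAGC']
```

One scaffold keeps the order and approximate spacing of three contigs that would otherwise be reported as unrelated sequences.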

A pipeline based on the Monument assembler had the lowest running time and second lowest memory usage in a recent competition (Assemblathon 1, [EARL2011]). In terms of result quality, this pipeline outperformed several other pipelines based on popular assemblers (Velvet, Phusion2, CLC).

Extracting information from HTS sequences does not necessarily require fully assembling the reads. In particular, the user may possess a priori information and aim at targeting the region of the genome to be assembled. Following this idea, a new method, Mapsembler [PC11,PC12], was recently proposed. It is an iterative, targeted micro-assembler which processes large datasets of reads on commodity hardware. Mapsembler checks whether given regions of interest can be constructed from the reads and builds a short assembly around each of them, either as a plain sequence or as a graph showing contextual structure.
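The general idea of building a short assembly around a region of interest can be illustrated with a greedy right extension. This is not Mapsembler's algorithm, which is not described here. The function name `extend_right`, the k-mer based extension rule and the stopping conditions are assumptions, and the sketch indexes all reads in memory, so it does not have the memory behaviour of the real tool.

```python
def extend_right(starter, reads, k, max_len=1000):
    """Greedily extend `starter` to the right using k-mers from the reads.

    The last k - 1 characters of the growing sequence are looked up in the
    reads. Extension continues while exactly one next character is observed,
    and stops at a dead end, a branch, a repeated context or `max_len`.
    `starter` must be at least k - 1 characters long.
    """
    # Map each (k-1)-mer to the set of characters that follow it in a read.
    successors = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            successors.setdefault(read[i:i + k - 1], set()).add(read[i + k - 1])

    contig = starter
    seen = set()
    while len(contig) < max_len:
        context = contig[-(k - 1):]
        candidates = successors.get(context, set())
        if len(candidates) != 1 or context in seen:
            break
        seen.add(context)
        contig += next(iter(candidates))
    return contig


reads = ["ATGGCGT", "GCGTACT", "TACTTCA"]
print(extend_right("ATGG", reads, k=4))  # ATGGCGTACTTCA
```

Stopping at a branch rather than choosing one path corresponds to the case where the output is better reported as a graph than as a plain sequence.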

This tool may be used within various frameworks. As it offers the possibility to get the structure of the genome/transcriptome near a region of interest, it may be used to retrieve biological elements of interest such as repeats, SNPs, exon skipping, gene fusion, as well as other structural events, directly from raw sequencing reads.

Another key aspect of Mapsembler is that its memory usage is independent of the size of the read sets. Thus, compared to other assembly tools, the main feature of Mapsembler is that it has no memory limitation. It can thus be applied even to terabyte-sized data sets. In particular, even if it was not initially designed in this spirit, Mapsembler is highly parallelizable and can be adapted into a zero memory whole genome de novo assembly tool.

2.4. OBJECTIVES, ORIGINALITY AND NOVELTY OF THE PROJECT

In this project we propose to develop a Genomic Assembly Tool Box allowing end-users to customize the assembly process according to:

• The nature of the available data generated by NGS sequencers. Different technologies exist providing different types of data. For instance, 454 reads are much longer than Illumina reads and both exhibit different types of errors. To optimize the final assembly, algorithms must be adapted.

• The complexity of the genome to assemble. Assembling genomes of polyploid organisms is much more difficult than assembling genomes of bacteria. It requires more steps and specific data (such as mate-pair reads) to perform the whole final assembly.

• The specific biological question to answer. In many cases, the genome doesn’t need to be fully assembled to extract knowledge. Targeted assembly focusing on specific regions of the genome can simply be the best way to find relevant information.

The Genomic Assembly Tool Box (GATB) will be made of different modules developed in our team, which today are instantiated in two software tools: Monument and Mapsembler. Both tools have been designed to remove current HTS computational barriers: execution time and memory footprint.

From a practical point of view, connecting the modules will be possible using current graphical interfaces such as Galaxy [GALAXY10] or SLICEE [PIAT11]. The modules will communicate via standard APIs of the HTS domain to make them easily exploitable by the scientific and industrial community.




3. SCIENTIFIC AND TECHNICAL PROGRAMME, PROJECT ORGANISATION

3.1. SCIENTIFIC PROGRAMME, PROJECT STRUCTURE

The GATB project aims to design a Genomic Assembly Tool Box based on various research tools and prototypes developed in the GenScale INRIA team. Two software tools have specifically emerged: Mapsembler for processing HTS data and Monument for de novo assembly. Other tools related to assembly are also under investigation. From all these tools, it is possible to identify basic functional modules that can be shared to compose specific assemblers:

• Indexing module: Indexing consists in storing reads efficiently in computer memory. Dealing with billions of reads makes this task critical.

• Read correction module: Reads generated by sequencers are not perfect. They contain errors that can be eliminated by analyzing the read redundancy. Correcting errors allows subsequent tasks to be more efficient.

• Contig/scaffold module: The output of the de novo assembly step is a set of contigs (long fragments of uninterrupted A, C, G, T characters) and/or a set of scaffolds (contigs with gaps). The efficiency of this module is measured by the N50 metric.

• Targeted assembly module: From a specific point of the genome (called the starter), a single contig is built. Besides making it possible to answer many biological questions without reconstructing a full genome, this tool can potentially be used to design a massively parallel assembler.

• Super-scaffolding module: This activity relies on ordering contigs and scaffolds to produce larger scaffolds, and ultimately the final text of the genome.

• Gap-filling module: This is a finishing step to complete the missing assembly regions.
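The N50 metric mentioned for the contig/scaffold module has a standard definition: it is the length L such that contigs of length at least L account for at least half of the total assembled length. A minimal sketch follows; the function name `n50` and the choice to return 0 for an empty assembly are conventions chosen for this example.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    # Accumulate lengths from the longest contig down until half the
    # total assembled length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))  # 70
```

Here the total length is 290. The two longest contigs (80 + 70 = 150) already cover half of it, so the N50 is 70. A higher N50 indicates a less fragmented assembly, although it says nothing about correctness.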

We propose a two-step procedure for designing the Genomic Assembly Tool Box. The first year of the project will be devoted to finalizing modules which have already been validated inside Mapsembler and Monument: indexing, contig/scaffold, and targeted assembly (task 1). During the second year the three other modules, not yet fully validated or still in research phase, will be added (task 2). All these modules will be systematically validated with intensive tests. Furthermore, participation in the international Assemblathon competition is envisioned in order to clearly position our tools with respect to competitor assemblers (task 3). Concurrently with these technical developments, the Inria Rennes Technology Transfer Office will carry out the actions needed to ensure the “transferability” of the toolbox to biotechnology companies and academics (task 4).

3.2. PROJECT MANAGEMENT

The management of the project will be easy to implement since both partners are physically located in the same building. A monthly meeting will be systematically organized for fine synchronization between the GenScale research team and the Inria Rennes Technology Transfer Office. The technical progress and the technology transfer actions of the project will be discussed.

The Inria Rennes Technology Transfer Office will implement the processes of the Inria Technology Transfer and Innovation Department (DTI), involving the DTI experts (the Technology Transfer Associate for Health, Life Sciences & Biotechnologies and the DTI Head of Software Assets) in the meetings with the GenScale research team at least every three months (see task 4). Locally, the Inria Rennes technology transfer officer will discuss and analyze the progress of the project with the project coordinator every month.

3.3. DESCRIPTION BY TASK

3.3.1 TASK 1: GATB V1.0

Objective:

Provide version 1.0 of the genomic assembly toolbox (GATB v1.0) which will be made of the (1) indexing, (2) de novo assembly, and (3) targeted assembly modules.

Task leader: GenScale

Description of the work:

This task will perform the following actions:

1. Test and debug of the 3 modules

2. Make the 3 modules compliant with standard HTS interface

3. Write associated documentation

4. Make the GATB v1.0 deployable on standard OS

Deliverables:

D1.1: indexing module (open access, GPL & CeCILL license)

D1.2: de novo assembly module (open access, GPL & CeCILL license)

D1.3: targeted assembly module (open access, GPL & CeCILL license)

Risks:

No identified risks. Prototypes already exist and have demonstrated their efficiency. This is mainly software engineering work.

3.3.2 TASK 2: GATB V2.0

Objective:

Provide version 2.0 of the genomic assembly toolbox (GATB v2.0), which is composed of the previous version enhanced with 3 new modules: Read correction; Super-scaffolding; Gap-filling.






This task will perform the following actions:

1. Test and debug of the 3 modules

2. Make the 3 modules compliant with standard HTS interface

3. Write associated documentation

4. Make the GATB v2.0 deployable on standard OS

Deliverables:

D2.1: read correction module (open access, CeCILL license)

D2.2: super-scaffolding module (open access, CeCILL license)

D2.3: gap-filling module (open access, CeCILL license)

Risks:

The 3 modules are currently under development. Prototypes are still in their infancy and are not yet completely validated. This is ongoing research inside GenScale.

3.3.3 TASK 3: VALIDATION

Objectives:

Test the GATB on various benchmarks. Promote our tools by participating in international competitions (Assemblathon).



This task will perform the 2 following actions:

1. Assembly of various genomes. Data will come from numerous datasets available in the scientific community. We will also use internal data from projects for which we have tight collaboration with biologists.

2. Participation in international competitions such as the Assemblathon event. This is the best way to compare our results with the state-of-the-art software of the domain. This is also a powerful medium to promote our research.

Deliverables:

D3.1: Validation report of GATB v1.0

D3.2: Validation report of GATB v2.0

D3.3: Results of international competitions.

Risks:

No risk identified.




3.3.4 TASK 4: TECHNOLOGY TRANSFER ACTIVITIES

Objective:

Perform all the tasks that support and complement the technical work, to guarantee that the project result, i.e. the GATB toolbox, has a good impact on the targeted market (bioinformatics companies and academics).

Task leader: technology transfer officer of the Inria Rennes research centre, P. Gelin


Throughout the lifetime of the project, the technology transfer office of the Inria Rennes research centre will operate the “process for the monitoring of technology transfer activities” (described in section 4.1.1) designed and managed by the central Inria Technology Transfer & Innovation Department (DTI), with the help of two DTI experts: Ph Gesnouin, the DTI Technology Transfer Associate for Health, Life Sciences & Biotechnologies, and P. Moreau, the DTI Head of Software Assets.

As described in section 4.1.1, the advances of the project will be periodically (every 6 months) presented to the DTI Technology Transfer Committee (called “CSATT”), dealing with the IP statement of the software components, the specific advantages of the results compared to the ongoing technology advances in the domain, and the evolution of the targeted market (including the possible launch of new commercial products or services). The first step of this work has already been done, as the project was presented to the “CSATT” for the first time on February 8th, 2012.

The work with P. Moreau will guarantee a good quality of the software development process. It will start with a diagnosis of the existing software prototypes of the GenScale team (the “Monument” and “Mapsembler” prototypes) to clearly identify what must be done to preserve the Inria IP control of the components that will be developed on the basis of these prototypes. P. Moreau will also provide advice on the architecture of the components, and specifically on the links which will ease interfacing with other tools, using standard APIs of the HTS domain while securing the Inria IP control of the toolbox.

Ph Gesnouin will provide information collected from prospective companies about their specific needs regarding the potential integration of the GATB toolbox and their interest in beta-tests of the GATB v1.0. He will also provide information about new elements coming from bioinformatics companies, such as the launch of new products or services.

The Inria Rennes Technology Transfer Office will manage the IP protection operations for the software components of the toolbox, each time one of them is finished. This is done with the French agency for the protection of software elements (the “APP”), which delivers an Inter Deposit Digital Number (IDDN) for each component we decide to protect (during 2011, the Inria Rennes TTO protected 49 software components with this APP process).

The Inria Rennes Technology Transfer Office will also prepare drafts of commercial license agreements and specific agreements for the running of beta-tests.




Deliverables:

D4.1: T6: 1st report for the “CSATT” committee

D4.2: T12: identification of the companies and academics interested in beta-tests of the first version of the GATB toolbox and 2nd report for the “CSATT” committee

D4.3: T18: 3rd report for the “CSATT” committee including feedback on beta-tests on GATB v1.0

D4.4: T24: 4th report for the “CSATT” committee including drafts of commercial licenses, expected license fees and identification of the first likely licensees

Risks:

As the GATB toolbox will be a solution to a clearly identified need, the only risk is the launching of an equivalent product by a big bioinformatics company.

3.4. TASKS SCHEDULE, DELIVERABLES AND MILESTONES

Task scheduling

[Gantt chart. Columns: T1-3, T4-6, T7-9, T10-12, T13-15, T16-18, T19-21, T22-24. Rows: Task 1: GATB v1.0; Task 2: GATB v2.0; Task 3: Validation; Task 4: Technology Transfer.]

The first year of the project will be devoted to producing the first version of the Assembly Tool Box (GATB v1.0). The second year will enhance the Assembly Tool Box (GATB v2.0) with modules which are currently in a research phase inside the GenScale team. The validation task will start at T0+6 and includes the Assemblathon competition.

Deliverable scheduling

[Table: deliverables D1.1, D1.2, D1.3, D2.1, D2.2, D2.3, D3.1, D3.2, D3.3, D4.1, D4.2, D4.3 and D4.4 (rows) against project months 1 to 24 (columns), with an X marking each delivery month.]

Assuming the project starts in January 2013, we can expect to participate in Assemblathon 2013 and Assemblathon 2014. As we do not yet have the exact timing of these events, deliverable D3.3 is arbitrarily set to the end of the project. It will comment on the results obtained in these two international competitions.

Synchronization will be necessary between D1.3 and D4.2 for the launch of the beta-tests phase, because version 1.0 of the toolbox must be available.

4. DISSEMINATION AND EXPLOITATION OF RESULTS, INTELLECTUAL PROPERTY

4.1. TECHNOLOGY TRANSFER STRATEGY

4.1.1 INRIA TECHNOLOGY TRANSFER STRATEGY AND ASSOCIATED PROCESS

One of the main objectives of Inria is to increase the number and the impact of technology transfer projects. Inria thus created a process dedicated to the monitoring of the technology transfer projects (called “PSATT”: process for the monitoring of technology transfer activities). This process, managed by the central Inria Technology Transfer & Innovation Department (DTI), allows:

• The involvement of the DTI experts: in the present case, the DTI Technology Transfer Associate for Health, Life Sciences & Biotechnologies (Ph Gesnouin) and the DTI Head of Software Assets (P. Moreau) will be closely involved in the TT process for the GATB tool. The effective involvement of P. Moreau guarantees the technological quality of the software developed during the technology transfer process, which must reach a TRL7 level for a good transfer process, knowing that the development work starts from software component prototypes at a TRL3-4 level.

• A coherent monitoring of the progress of each of the technology transfer project during its lifetime. This monitoring is operated trough the DTI Technology Transfer Committee ("ʺcalled CSATT"ʺ): the project coordinator and the local TT officer periodically present to the CSATT the results and advances of the project concerning:

o The IP statement of the software components developed during the project including the IP aspects of the links with potential external components;

o The specific advantages of the results of the project compared to other technology advances in the domain;

o The evolution of the targeted market and the characterization of the companies, specifically SMEs, which would be the most interesting partners for the licensing of the GATB toolbox, including information about their product roadmaps, their potential interest in hiring the Inria engineer devoted to the development of the GATB toolbox and, for French companies, their ability to incorporate new technologies (which can be seen through their relationship with Oseo, their qualification as a "young innovative company" or their involvement in collaborative ANR or FUI projects).

Each time the project is presented to the "CSATT" by the project coordinator (D. Lavenier) and the local technology transfer officer of the Inria Rennes research center (P. Gelin), this committee (composed of technology transfer experts from Oseo, IT-Translation, French competitive clusters and EPFL) provides opinions about the actions that should be implemented in the roadmap of the project to strengthen the technology transfer objective.

4.1.2 SHORT OVERVIEW OF THE MARKET

Since 2006, new sequencing techniques have appeared with High Throughput Sequencing (HTS), and data processing tools are available either as integrated solutions inside the equipment (Life Technologies Corp, Illumina, and even Oxford Nanopore Technologies with its brand new USB system called "MinION") or as solutions covering the whole data treatment process (CLCBio, Genostar, GenomeQuest).

Among the bioinformatics companies that could be interested in the GATB toolbox, those most likely to be concerned are:

• In France:

o The SME GenomeQuest (which is now majority held by US investors), which provides tools for High Throughput Sequencing data treatment

o The SME Genostar LSC, which focuses on genome annotation

o The SME Korilog, created by a former engineer of the Inria Rennes bioinformatics platform, which develops specific treatment tools and works with Genostar

• The SME CLCBio (Denmark), which sells the “Genomics Workbench” tool for de novo assembly

• 454 Life Sciences (a Roche division), which sells the Newbler tool

• The major US players, whose revenues mainly come from the equipment business (knowing that this equipment includes data treatment tools): Illumina, Life Technologies Corporation, Affymetrix, Beckman Coulter

4.1.3 PLANNED TECHNOLOGY TRANSFER SCHEME

Considering the current context of bioinformatics, we a priori plan a dual licensing scheme of the GATB toolbox and its components with:

• An Open Source distribution under a “viral” license, i.e. GPL v3 (or, in France, under the CeCILL license, which is equivalent to the GPL license), targeting academic use

• An Inria commercial non-exclusive license, targeting companies, with the possibility to adapt specific conditions to the needs of each company

4.1.4 ADDED VALUE OF THE GATB TOOLBOX

The available genomic assemblers (see section 2.3) require a grid and/or a huge amount of memory. The two basic modules of the GATB toolbox are "Monument" and "Mapsembler", and prototypes of those modules have already demonstrated high performance compared to existing tools: "Monument" has a low running time and low memory usage, and the main feature of "Mapsembler" is that it has no memory limitation (see section 2.3). The GATB toolbox, which is a fast, innovative collection of algorithms with a very low memory footprint for assembling specific regions of interest of a genome, should therefore be an interesting new tool for both bioinformatics companies and academic laboratories.

4.1.5 RETURN ON INVESTMENT

The return on investment will come from the commercial non-exclusive license agreements with bioinformatics companies. As experienced in previous equivalent situations, we a priori consider two kinds of agreements: one with large companies, where we will negotiate a global license agreement for an amount of approximately 50 k€, and one with SMEs, where we will negotiate annual fees adapted to their business model.

4.2. STATE & STRATEGY OF THE INTELLECTUAL PROPERTY

The first element of the IP strategy is complete control of the IP rights of the toolbox. The existing prototypes have been developed entirely within the GenScale Inria research team. We will keep complete control of the ownership of all the core components of the toolbox, excluding any use of external components. For the interface functions (such as user interfaces), if existing external components can be linked to the toolbox to increase its value, the chosen components will be software elements distributed under non-restrictive licenses, such as the BSD or Apache licenses. Otherwise, those peripheral components will also be developed internally.

The second element of our IP strategy is the license scheme of the GATB toolbox software. We plan a dual licensing scheme with:

• An Open Source distribution under a “viral” license (GPLv3), authorizing the use of the toolbox by academics and fostering its recognition in the community;

• A commercial license which will allow bioinformatics companies to integrate the toolbox into their treatment systems, avoiding any impact of the GPL license on their own software.

4.3. TECHNOLOGY TRANSFER OFFICE ROLE IN THE MILESTONES OF THE PROJECT

As described in 3.3.4, throughout the lifetime of the project, the technology transfer office of the Inria Rennes research centre will operate the “process for the monitoring of technology transfer activities” with the help of two experts from the central Inria Technology Transfer & Innovation Department (DTI): Ph. Gesnouin, the DTI Technology Transfer Associate for Health, Life Sciences & Biotechnologies, and P. Moreau, the DTI Head of Software Assets. The advances of the project will be presented periodically (every 6 months) to the DTI Technology Transfer Committee (called "CSATT") by the project coordinator and the Inria Rennes Technology Transfer Officer. These periodic status reports will cover the results of the work on software component quality, market evolution and prospective companies, IP protection, and draft agreements.

4.4. RESOURCES INVOLVED BY THE TECHNOLOGY TRANSFER OFFICE DURING THE PROJECT

All the staff of the Inria Rennes Technology Transfer office will be involved in the project:

• Patrice Gelin, Technology Transfer Officer, in charge of the management of Technology Transfer tasks

• Chantal Le Tonqueze, IP manager

• Marie-Anne St Jalmes, corporate lawyer

For this project, the Inria Rennes office will be helped by two experts of the Inria DTI:

• Philippe Gesnouin, the DTI Technology Transfer Associate for Health, Life Sciences & Biotechnologies

• Patrick Moreau, the DTI Head of Software Assets

(See also 5.3)

5. CONSORTIUM DESCRIPTION

5.1. PARTNERS DESCRIPTION & RELEVANCE, COMPLEMENTARITY

GenScale INRIA team

GenScale is an INRIA team devoted to research and development in bioinformatics. The scientific axes of GenScale focus on the processing of genomic data. Research conducted in this group investigates the parallelism potential of the main bioinformatics processes in order to reduce execution time by several orders of magnitude. Topics of interest range from intensive sequence comparisons to NGS processing, including protein structure prediction.

Assembly is an important research axis of GenScale. Since the advent of NGS, pioneering work has been done on this critical activity. In the national landscape, GenScale is currently the only group working specifically on assembly algorithms. Our specificity is to combine innovative data structures to lower the memory footprint, to develop advanced heuristics to provide fast execution times, and to implement parallel techniques allowing algorithms to cope with the huge volume of data to process.

Technology Transfer Office

The Inria Rennes Technology Transfer Office supports the 30 research teams of the centre in setting up research collaboration partnerships and in the technology transfer of their results, focusing its efforts on bilateral partnerships with SMEs and on support for start-up projects. It is also involved in the support of European activities, e.g. through the EIT ICT Labs, Rennes being a satellite node of this multi-node European technological lab.

The annual contractual activity of the Inria Rennes centre is approximately 7 M€ for 50 contracts.

Complementarity

As part of the same institute, both teams have a long experience of working together. All the industrial transfers which have already been performed by GenScale have been done in tight cooperation with the Technology Transfer Office.

5.2. QUALIFICATION OF THE PROJECT COORDINATOR

Dominique Lavenier is a computational scientist by training and heads the IRISA/INRIA GenScale bioinformatics team. He has a long-standing interest in the information technology (IT) aspects of biological data production and analysis. Specifically, he has been working on important questions concerning ultra-high throughput DNA sequencing, including read assembly, mapping and QTL processing. He also has extensive expertise in parallelism, from GPU processing to grid processing. Over the last ten years, D. Lavenier has coordinated the following national projects:

• GenoGRID: A grid for Genomic Applications (ACI Program, 2002-2004)

• RDISK: A Reconfigurable and Parallel Architecture for Browsing Genomic Databases (Inter EPST Program, 2002-2004)

• ReMIX: Reconfigurable Memory for Indexing (ACI MD Program, 2004-2006)

• Seed optimization and indexing of genomic banks on FLASH Memory (ARC INRIA, 2006-2007)

• BioWIC: Bioinformatics Workflow for Intensive Computing (ANR Arpege, 2009-2011)

D. Lavenier also has experience of industrial transfer through the following two software tools, previously developed in his research team in collaboration with the Technology Transfer Office:

• GASSST: This software, a so-called mapper, aligns billions of short reads generated by NGS machines to a reference genome. This is a basic and time-consuming task of NGS processing. In 2011, it was successfully transferred to the GenomeQuest company and integrated into their NGS tool suite, providing a 10-fold speed-up compared to their native mapper. Development of GASSST continues in tight collaboration with GenomeQuest, to meet industrial requirements and follow the evolution of NGS technology.

• PLAST: This sequence comparison software takes two sets of sequences as input and provides an all-to-all comparison. PLAST is currently being transferred to the Korilog company within the KORIBLAST tool. Compared to BLAST, it allows days of computation on huge volumes of data to be reduced to hours. PLAST specifically targets the metagenomic field, where intensive comparison between samples and reference banks is systematically performed.

5.3. QUALIFICATION AND CONTRIBUTION OF EACH PARTNER

Partner | Name | First name | Position | PM (person-months) | Contribution to the project

Inria-GenScale | Lavenier | Dominique | DR CNRS, head of the Inria GenScale research team | 6 | Coordinator

Inria-GenScale | Peterlongo | Pierre | CR INRIA | 6 | Mapsembler designer

Inria-GenScale | Moreews | Francois | IE INRA | 4 | Environment

Inria Rennes TTO | Gelin | Patrice | Technology Transfer Officer, Inria Rennes Bretagne Atlantique | 4 | Coordination of the technology transfer activities of the project

Inria DTI | Gesnouin | Philippe | Technology Transfer Associate for Health, Life Sciences & Biotechnologies, Inria DTI | 1.5 | Connection with bioinformatics companies

Inria DTI | Moreau | Patrick | Head of Software Assets, Inria DTI | 1 | Adviser for the software development

Inria Rennes TTO | Saint-Jalmes | Marie Anne | Corporate lawyer, Inria Rennes Bretagne Atlantique | 1 | Drafting of license agreements and specific contracts for technology transfer partnerships

Inria Rennes TTO | Le Tonquéze | Chantal | IP management, Inria Rennes Bretagne Atlantique | 0.5 | IP protection of the software components

6. SCIENTIFIC JUSTIFICATION OF REQUESTED RESOURCES

6.1. PARTNER 1: GENSCALE

• Equipment

1 workstation (+ screen) for the engineer to be recruited – 3000 €

• Staff

1 Engineer – 24 months (GenScale CDD) – 130 320 €

The activity of this engineer will be devoted to tasks 1, 2 and 3.

• Subcontracting

Expertise from the AlgoRizk company to ensure industrial compliance – 30 000 €

This company, created in 2011 by a former PhD student of GenScale, has strong expertise in the design of high performance software. It also has strong links with the bioinformatics industry. In addition to consulting, the involvement of this company in the GATB project will provide practical help in the optimization of the software.

• Travel

Presentation of GATB at national and international conferences, visits to companies and participation in Assemblathon meetings – 9000 €

• Costs justified by internal procedures of invoicing

INRIA charges a 4% overhead for services.

6.2. PARTNER 2: INRIA TECHNOLOGY TRANSFER OFFICE

Additional resources are not requested by the TTO for its activity dedicated to this project. All the staff and other resources needed for the TTO tasks will be fully supported by Inria.

7. REFERENCES • [CBP09] Mark J Chaisson, Dumitru Brinza, and Pavel A Pevzner, De novo fragment

assembly with short mate-‐‑paired reads: Does the read length matter? Genome Research 19 (2009), no. 2, 336-‐‑346.

• [CPWS99] B. Chevreux, T. Pfisterer, T. Wetter, and S. Suhai, Assembly of Genomic Sequences Assisted by Automatic Finishing, German Conference on Bioinformatics, 1999, pp. 183-‐‑184.

• [LZR+10] Ruiqiang Li, Hongmei Zhu, Jue Ruan, Wubin Qian, Xiaodong Fang, Zhongbin Shi, Yingrui Li, Shengting Li, Gao Shan, Karsten Kristiansen, Songgang Li, Huanming

PROGRAMME EMERGENCE

EDITION 2012

Projet GATB



Yang, Jian Wang, and Jun Wang, De novo assembly of human genomes with massively parallel short read sequencing, Genome Research 20 (2010), no. 2, 265-‐‑272.

• [MDK+08] Jason R Miller, Arthur L Delcher, Sergey Koren, Eli Venter, Brian P Walenz, Anushka Brownley, Justin Johnson, Kelvin Li, Clark Mobarry, and Granger Sutton, Aggressive assembly of pyrosequencing reads with mates, Bioinformatics 24 (2008), no. 24, 2818-‐‑2824.

• [GMP+11] Gnerre S, MacCallum I, Przybylski D, Ribeiro F, Burton J, Walker B, Sharpe T, Hall G, Shea T, Sykes S, Berlin A, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB. High-‐‑quality draft assemblies of mammalian genomes from massively parallel sequence data Proceedings of the National Academy of Sciences USA (January 2011 vol. 108 no. 4 1513-‐‑1518).

• [PTW01] P.A. Pevzner, H. Tang, and M.S. Waterman, An Eulerian path approach to DNA fragment assembly, Proceedings of the National Academy of Sciences of the United States of America 98 (2001), no. 17, 9748.

• [SWJ+09] J.T. Simpson, K. Wong, S.D. Jackman, J.E. Schein, S.J.M. Jones, and I. Birol, ABySS: A parallel assembler for short read sequence data, Genome Research 19 (2009), no. 6, 1117.

• [ZB08] Daniel R Zerbino and Ewan Birney, Velvet: Algorithms for de novo short read assembly using de bruijn graphs, Genome Research 18 (2008), no. 5, 821-‐‑829.

• [NOG+09] C. Nusbaum, T.K. Ohsumi, J. Gomez, J. Aquadro, T.C. Victor, R.M. Warren, D.T. Hung, B.W. Birren, E.S. Lander, and D.B. Jaffe, Sensitive, specific polymorphism discovery in bacteria using massively parallel sequencing, Nature methods 6 (2009), no. 1, 67.

• [PPDS04] Mihai Pop, Adam Phillippy, Arthur L Delcher, and Steven L Salzberg, Comparative genome assembly, Brief Bioinform 5 (2004), no. 3, 237-‐‑248.

• [CL11] Chikhi, R., Lavenier, D.: Localized genome assembly from reads to scaffolds: practical traversal of the paired string graph. Algorithms in Bioinformatics pp. 39-‐‑48 (2011)

• [CCL11] G. Chapuis, R. Chikhi, D. Lavenier. Parallel and memory-‐‑e-‐‑fficient reads indexing for genome assembly, In proceedings of PBC 2011 (2011)

• [EARL11] D. Earl et al., Assemblathon 1: A competitive assessment of de novo short read assembly methods, Genome Research (2011)

• [PC11] P Peterlongo and R. Chikhi, Mapsembler, targeted assembly of larges genomes on a desktop computer, Research report, RR-‐‑7565, http://hal.archives-‐‑ouvertes.fr/inria-‐‑00577218_v1/

• [PC12] P Peterlongo and R. Chikhi, Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer, BMC Bioinformatics, under re-‐‑review.

PROGRAMME EMERGENCE

EDITION 2012

Projet GATB



• [PIAT11] J. Piat, F. Moreews, O. Collin, A. Cornu, D. Lavenier, SLICEE: A Service oriented middleware for intensive scientific computation, 7th IEEE 2011 World Congress on Services (SERVICES 2011), Washington DC, USA, 2011

• [GALAXY10] Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25; 11(8):R86.
