
PROGRAMME EMERGENCE

EDITION 2012

Projet GATB

DOCUMENT SCIENTIFIQUE

ANR-GUI-AAP-05 – Doc Scientifique 2012 – VF

Acronym: GATB

Proposal title in French: Boite à outils « Assemblage pour la Génomique »

Proposal title in English: Genomic Assembly Tool Box

Keywords (scientific approach): Genomic Data Processing, Assembly, Mapping

Keywords (application field): Next Generation Sequencing, Bioinformatics, Genomics, Assembly, Biotechnology

Technology transfer model: Software Program Licensing

International cooperation: ☐ The project proposes international cooperation

Requested grant: 183372 €

Project duration: 24 months


1. EXECUTIVE SUMMARY

2. CONTEXT, POSITION AND OBJECTIVES OF THE PROPOSAL
2.1. Context, social and economic issues
2.2. Position of the project
2.3. State of the art
2.4. Objectives, originality and novelty of the project

3. SCIENTIFIC AND TECHNICAL PROGRAMME, PROJECT ORGANISATION
3.1. Scientific programme, project structure
3.2. Project management
3.3. Description by task
3.3.1 Task 1: GATB v1.0
3.3.2 Task 2: GATB v2.0
3.3.3 Task 3: Validation
3.3.4 Task 4: Technology Transfer Activities
3.4. Tasks schedule, deliverables and milestones

4. DISSEMINATION AND EXPLOITATION OF RESULTS, INTELLECTUAL PROPERTY
4.1. Technology transfer strategy
4.1.1 Inria technology transfer strategy and associated process
4.1.2 Short overview of the market
4.1.3 Planned technology transfer scheme
4.1.4 Added value of the GATB toolbox
4.1.5 Return on Investment
4.2. State & strategy of the intellectual property
4.3. Technology transfer office role in the milestones of the project
4.4. Resources involved by the technology transfer office during the project

5. CONSORTIUM DESCRIPTION
5.1. Partners description & relevance, complementarity
5.2. Qualification of the project coordinator
5.3. Qualification and contribution of each partner

6. SCIENTIFIC JUSTIFICATION OF REQUESTED RESOURCES
6.1. Partner 1: GenScale
6.2. Partner 2: Inria Technology Transfer Office

7. REFERENCES


1. EXECUTIVE SUMMARY

A few years ago, genomics underwent an unprecedented change with the advent of High Throughput Sequencing (HTS), also known as Next Generation Sequencing (NGS). These technologies generate huge volumes of genomic data. Crucial computational developments are now needed to extract knowledge from this mass of data.

The GATB project focuses on a specific, critical HTS treatment: assembly. Genomic assembly consists in reconstructing a genome from sets of very small DNA or RNA sequences, called reads, generated by NGS machines. For complex genomes, billions of reads need to be ordered, leading to time-consuming processing that requires computers with very large memories. This is a serious bottleneck in many HTS analyses, both for academic labs and for industry.

The INRIA GenScale team has developed fast, innovative assembly algorithms with a very low memory footprint. Two prototypes, called Monument and Mapsembler, have been developed as proofs of concept. Monument is dedicated to de novo assembly, for reconstructing complete genomes. Mapsembler, which is a more general HTS processing tool, offers the possibility to assemble specific regions of interest.

In this project we propose to develop a Genomic Assembly Tool Box allowing end-users to customize the assembly process according to (1) the nature of the genomic data generated by NGS machines, (2) the complexity of the genome to assemble, or (3) the specific biological question to answer. The final goal is to prepare the industrial technology transfer of the Genomic Assembly Tool Box, targeting a wide range of genomic domains (health, agronomy, ecology, etc.).

2. CONTEXT, POSITION AND OBJECTIVES OF THE PROPOSAL

2.1. CONTEXT, SOCIAL AND ECONOMIC ISSUES

A few years ago, the arrival of High Throughput Sequencing (HTS) technologies deeply changed genomics: biological material (DNA and RNA) can now be sequenced in much higher volumes than before, at a price accessible to most academic labs. As an example, approximately 10 years and 10^9 dollars were necessary to sequence the human genome in the nineties, while nowadays a full human genome is expected to be sequenced in 24 hours for a few thousand dollars. Hence, HTS opened the door to many applications, which appear to be limited only by the imagination of the users. These include de novo sequencing and resequencing (sequencing an individual of an already sequenced species) of genomes, and RNA-seq (sequencing of transcriptomes, i.e. the expressed fraction of a genome). HTS also makes it possible to detect which DNA regions interact with known proteins (ChIP-seq).

The range of biological questions which can be addressed is extremely broad. It includes, among others, questions related to health (for instance, finding genes differentially regulated in cancer), ecology (identifying all species present in a given environment), or agronomy (for instance, helping with plant selection).

Nowadays, almost all biological studies include a first sequencing step, generating a volume of data that computer scientists were not ready to cope with. The intensive usage of these new technologies generates datasets which now reach several terabytes. The amount of data is thus one of the two main bottlenecks in the exploitation of HTS.

The second main bottleneck comes from the type of data that is generated: the technologies do not provide one full sequence per DNA molecule. Instead, they output reads, which are small sequence fragments a few hundred characters long. These reads may contain sequencing errors (insertions/deletions, substitutions). As reads overlap, the original sequence may be reconstructed using an assembly phase. Alternatively, if a reference genome for the studied species is available, reads may be mapped onto this reference, i.e. a (more or less) assembled genome, identical or close to the genome of the species studied.

It is indispensable to develop solutions for extracting information from HTS data while tackling the two main bottlenecks: size and type of data. Methods, and consequently software, must be fast and have a low memory footprint in order not to saturate bioinformatics computing centers. As producing data is no longer the difficulty, the real challenges we are facing are data processing and analysis.

This project is formulated in this spirit and specifically targets the assembly step.

2.2. POSITION OF THE PROJECT

The project tackles the challenges of the assembly process following two main ideas:

1. To face the HTS data tsunami, assemblers must be fast and have a low memory footprint;

2. To provide high-quality results, assemblers need to be customizable.

Today, a few assemblers exist (see next section). For most of them, assembling complex genomes requires days of computation. Furthermore, to run this software, computers must be equipped with very large memories (up to 512 GB). This is a very strong constraint: terabytes of data have to be sent to (and stored in) bioinformatics centers. Fast assemblers that can be executed near the source of HTS data, i.e. on computers with standard memory size, would be valuable to anticipate the HTS deluge: today, the trend is to equip genomic labs, hospitals, etc. with next-generation sequencers, but without the substantial computing power that is currently required to process the data.

The second point deals with the variety of genomes to assemble or, more generally, with the variety of biological questions that can be addressed with HTS data. Clearly, assembling the genome of a prokaryote and assembling the genome of a eukaryote do not have the same complexity and do not require algorithms with the same features. Sequencing technologies also differ and generate data of different types, which call for specific assembly treatments. More importantly, answering some specific questions does not require a complete assembly phase, but only a targeted assembly of particular regions of the genome. Together, these criteria prevent the design of a «universal assembler» able to cope with all situations. As a result, users other than experts often have difficulty fully exploiting these (complex) tools, and are often disappointed with assembly results. Yet the current trend is to use (more or less) the same assembly tools for processing a wide range of data. Our approach, as opposed to monolithic assemblers, is to propose a modular assembler that can be customized and adapted to specific assembly treatments.

Compared to historical actors in the domain, such as BGI (Beijing Genomics Institute, China), the Broad Institute (MIT, Harvard) or the Sanger Institute (Cambridge, UK), the GenScale/INRIA team has much shorter experience in the assembly field. However, participation in international competitions (dnGasp, Assemblathon [EARL2011]) has shown that our approach is very competitive, even if it does not perform well in all aspects. We have demonstrated that our tools are among the fastest, and that they can be executed on systems with rather small memory.

2.3. STATE OF THE ART

Genomic assembly consists in reconstructing a genome from a set of sequencing reads, either de novo or reference-guided. Only the former is computationally challenging, as the latter essentially consists of mapping reads to a reference genome and filling the gaps with various strategies, possibly including de novo assembly of unmapped reads [NOG+09, PPDS04]. In the following, we survey existing methods for de novo assembly.

Next-generation de novo genomic assemblers can be divided into two classes: short reads (SR) assemblers and ultra-short reads (USR) assemblers. The former focus on assembling 454 data, i.e. millions of reads of length 200-500 bp. These assemblers are based on a graph data structure (the overlap graph), where vertices are reads and edges are significant overlaps between reads. This data structure limits the number of reads that can be processed, as the graph stores information for up to O(n²) overlaps.
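To make the quadratic cost concrete, here is a minimal Python sketch of an overlap graph. It is an illustration only and not the implementation of any assembler cited here: the function names, the exact-match overlap criterion and the `min_overlap` threshold are assumptions, and real tools tolerate sequencing errors and use indexes to avoid comparing every pair of reads.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the longest such match is shorter than `min_overlap`.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads: list[str], min_overlap: int = 3) -> dict[tuple[int, int], int]:
    """Map each ordered pair of read indices (i, j) to its overlap length.

    Every ordered pair of reads is compared, hence n * (n - 1) candidate edges.
    """
    graph = {}
    for i, j in permutations(range(len(reads)), 2):
        length = overlap_length(reads[i], reads[j], min_overlap)
        if length:
            graph[(i, j)] = length
    return graph


reads = ["ACGTAC", "TACGGA", "GGATTC"]
graph = build_overlap_graph(reads)
print(graph)  # {(0, 1): 3, (1, 2): 3}
```

With three reads the six pairwise comparisons are harmless; with millions of reads the same loop is what makes a complete overlap graph impractical.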

The most high-profile short reads assembly tools are Newbler (commercial, 454 software), Cabog [MDK+08] and Mira [CPWS99]. Typically, these are capable of assembling a million 454 reads in a couple of CPU hours. Note that these assemblers can also process hybrid data sets, e.g. a mixture of Sanger, 454 and ultra-short reads. However, assembly of ultra-short reads using SR assemblers is intrinsically limited to the order of a few million reads, i.e. a fraction of the sequencer output. Among SR assemblers, Celera is to our knowledge the only software implementing parallel assembly, using a coarse-grained model (requires a grid).

Ultra-short reads (USR) assemblers target Illumina and SOLiD sequencers, which produce several orders of magnitude more reads than the 454 technology, although the reads are shorter (30-100 bp). USR assemblers rely on a more succinct graph-based data structure (the de Bruijn graph) where, given an integer k (often between 15 and 64), vertices are k-length substrings of reads, and edges correspond to (k-1)-length overlaps. This data structure, introduced by Pevzner and colleagues [PTW01], allows for a high number of reads, as the size of the graph depends only on the genome structure. Differences between USR assemblers stem from the heuristics used for graph simplification and traversal, and from their error-correction approaches. To cope with sequencing errors in the data, the Euler-USR [CBP09], Allpaths-LG [GMP+11] and SOAPdenovo [LZR+10] assemblers implement pre-assembly error correction, whereas Velvet [ZB08] and ABySS [SWJ+09] perform in-assembly graph simplification to remove vertices corresponding to erroneous reads. As pre-assembly correction is computationally expensive (its running time is comparable to that of the whole assembly) and in-assembly correction greatly expands the size of the graph, it is still unclear which error-correction method is the most suitable in practice. Moreover, among the USR assemblers cited above, only SOAPdenovo and ABySS can handle mammalian-sized genomes, because even constructing error-corrected de Bruijn graphs for such genomes requires an unreasonable amount of memory. ABySS solves this problem by using a distributed approach, while SOAPdenovo discards read information in the de Bruijn graph.
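The de Bruijn graph definition above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name is hypothetical, reads are assumed error-free, reverse complements are ignored, and the graph is held in a plain dictionary, which is far less compact than what production assemblers use.

```python
def build_de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Build a de Bruijn graph whose vertices are the k-mers found in the reads.

    An edge u -> v exists when the last k-1 characters of u equal the
    first k-1 characters of v and both k-mers occur in the reads.
    """
    vertices = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    return {
        u: {u[1:] + base for base in "ACGT" if u[1:] + base in vertices}
        for u in vertices
    }


# Three overlapping reads contain nine k-mer occurrences but only four distinct k-mers.
reads = ["ACGTA", "CGTAC", "GTACG"]
graph = build_de_bruijn(reads, k=3)
for u in sorted(graph):
    print(u, "->", sorted(graph[u]))
# ACG -> ['CGT']
# CGT -> ['GTA']
# GTA -> ['TAC']
# TAC -> ['ACG']
```

Adding more reads covering the same sequence would not add vertices, which illustrates why the size of the graph follows the genome rather than the number of reads.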

A recent method, Monument [CL11], proposed locally global construction and traversal of overlap graphs. This overcomes the memory limitation of constructing a complete overlap graph, while permitting assemblies of better quality than de Bruijn graphs. Compared to other assemblers, the Monument assembler implements two novel features. First, it uses a new indexing module that dynamically detects and discards entries due to read errors [CCL11]. Hence, the pre-assembly error-correction phase becomes optional, and in-assembly error correction is no longer memory-bound. Constructing this index requires less memory than other approaches, as erroneous index entries are removed before the full index is constructed.
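The text does not describe how the Monument index detects erroneous entries. As a generic illustration of the underlying idea only, the following Python sketch uses a common technique that is an assumption here, not the method of [CCL11]: k-mers seen fewer than `min_abundance` times are treated as likely sequencing errors and dropped. Unlike the index described above, this sketch counts every k-mer before filtering, so it does not reproduce the memory savings.

```python
from collections import Counter


def solid_kmers(reads: list[str], k: int, min_abundance: int) -> set[str]:
    """Return the k-mers seen at least `min_abundance` times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_abundance}


# Three error-free copies of a region and one read with a substitution (G -> T).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACTTAC"]
index = solid_kmers(reads, k=4, min_abundance=2)
print(sorted(index))  # ['ACGT', 'CGTA', 'GTAC']
```

The three k-mers introduced by the erroneous read (ACTT, CTTA, TTAC) each occur once and are therefore excluded from the index.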

Second, the assembly module of Monument is the only method that can construct longer sequences (scaffolds, i.e. sequences which may contain gaps, as opposed to contigs, sequences without gaps) locally. One main advantage, compared to other methods, is that missing read overlaps (possibly due to sequencing artifacts, such as coverage gaps or localized abundant errors) can be represented by gaps in scaffolds, whereas they would necessarily cause contigs to be interrupted. By pioneering localized scaffold construction, the Monument assembler casts assembly as an embarrassingly parallel problem, which can be efficiently solved on a large cluster of moderately powerful machines.

A pipeline based on the Monument assembler had the lowest running time and second-lowest memory usage in a recent competition (Assemblathon 1, [EARL2011]). In terms of result quality, this pipeline outperformed several other pipelines based on popular assemblers (Velvet, Phusion2, CLC).

Extracting information from HTS sequences does not necessarily require fully assembling the reads. In particular, the user may have an a priori piece of information and aim at targeting the region of the genome to be assembled. Following this idea, a new method, Mapsembler [PC11, PC12], was recently proposed. It is an iterative micro and targeted assembler which processes large datasets of reads on commodity hardware. Mapsembler checks for the presence of given regions of interest that can be constructed from reads and builds a short assembly around each of them, either as a plain sequence or as a graph showing the contextual structure.

This tool may be used within various frameworks. As it offers the possibility to get the structure of the genome/transcriptome near a region of interest, it may be used to retrieve biological elements of interest such as repeats, SNPs, exon skipping and gene fusions, as well as other structural events, directly from raw sequencing reads.

Another key aspect of Mapsembler is that its memory usage is independent of the size of the read sets. Thus, compared to any other assembly tool, Mapsembler has the distinctive feature of having no memory limitation. It can thus be applied even to terabyte-sized data sets. In particular, even though it was not initially designed in this spirit, Mapsembler is highly parallelizable and can be adapted into a zero-memory whole-genome de novo assembly tool.
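The idea of assembling only around a region of interest can be illustrated with a greedy extension from a starter sequence. The Python sketch below is hypothetical and is not Mapsembler's algorithm: the function name, the k-mer based extension rule and the stopping conditions are assumptions. It also keeps all k-mers in memory, so it does not reproduce the memory behaviour described above.

```python
def extend_right(starter: str, reads: list[str], k: int, max_length: int = 1000) -> str:
    """Greedily extend `starter` to the right, one base at a time.

    Extension stops when no k-mer from the reads continues the sequence, when
    several do (a branch), when a k-mer repeats (a cycle), or at `max_length`.
    """
    if k < 2 or len(starter) < k - 1:
        raise ValueError("need k >= 2 and a starter of at least k - 1 characters")
    kmer_set = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    contig = starter
    visited = {contig[-k:]}
    while len(contig) < max_length:
        suffix = contig[-(k - 1):]
        candidates = [base for base in "ACGT" if suffix + base in kmer_set]
        if len(candidates) != 1:
            break  # dead end or ambiguous branch
        next_kmer = suffix + candidates[0]
        if next_kmer in visited:
            break  # avoid looping forever on a repeat
        visited.add(next_kmer)
        contig += candidates[0]
    return contig


reads = ["ACGTTG", "GTTGCA", "TGCATT"]
print(extend_right("ACGT", reads, k=4))  # ACGTTGCATT
```

Because each starter is extended independently, many starters could be processed in parallel, which is the property the text relies on when it describes targeted assembly as highly parallelizable.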

2.4. OBJECTIVES, ORIGINALITY AND NOVELTY OF THE PROJECT

In this project we propose to develop a Genomic Assembly Tool Box allowing end-users to customize the assembly process according to:

• The nature of the available data generated by NGS sequencers. Different technologies exist, providing different types of data. For instance, 454 reads are much longer than Illumina reads, and the two exhibit different types of errors. To optimize the final assembly, algorithms must be adapted.

• The complexity of the genome to assemble. Assembling genomes of polyploid organisms is much more difficult than assembling genomes of bacteria. It requires more steps and specific data (such as mate-pair reads) to perform the whole final assembly.

• The specific biological question to answer. In many cases, the genome does not need to be fully assembled to extract knowledge. Targeted assembly focusing on specific regions of the genome can be the best way to find relevant information.

The Genomic Assembly Tool Box (GATB) will be made of different modules developed in our team which, today, are instantiated in two software tools: Monument and Mapsembler. Both tools have been designed to remove the current HTS computational barriers: execution time and memory footprint.

From a practical point of view, connecting the modules will be possible using existing graphical interfaces such as Galaxy [GALAXY10] or SLICEE [PIAT11]. The modules will communicate via standard APIs of the HTS domain to make them easily exploitable by the scientific and industrial community.

 


3. SCIENTIFIC AND TECHNICAL PROGRAMME, PROJECT ORGANISATION

3.1. SCIENTIFIC PROGRAMME, PROJECT STRUCTURE

The GATB project aims to design a Genomic Assembly Tool Box based on various research tools and prototypes developed in the GenScale INRIA team. Two software tools have specifically emerged: Mapsembler for processing HTS data and Monument for de novo assembly. Other tools related to assembly are also under investigation. From all these tools, it is possible to identify basic functional modules that can be shared to compose specific assemblers:

• Indexing module: indexing consists in storing reads efficiently in computer memory. Dealing with billions of reads makes this task critical.

• Read correction module: reads generated by sequencers are not perfect. They contain errors that can be eliminated by analyzing read redundancy. Correcting errors allows the following tasks to be more efficient.

• Contig/scaffold module: the output of the de novo assembly step is a set of contigs (long fragments of uninterrupted A, C, G, T characters) and/or a set of scaffolds (contigs with gaps). The efficiency of this module is measured by the N50 metric.

• Targeted assembly module: from a specific point of the genome (called the starter), a single contig is built. Besides making it possible to answer many biological questions without reconstructing a full genome, this tool can potentially be used to design a massively parallel assembler.

• Super-scaffolding module: this activity relies on ordering contigs and scaffolds to produce larger scaffolds, and ultimately the final text of the genome.

• Gap-filling module: this is a finishing step to complete the missing assembly regions.
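The N50 metric mentioned for the contig/scaffold module is commonly defined as the largest length L such that contigs of length at least L together cover at least half of the total assembled length. A minimal Python sketch of that standard definition follows; the function name is ours and the example lengths are invented for illustration.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Contigs are visited from longest to shortest; the N50 is the length of the
    contig at which the running sum first reaches half of the total length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Total length is 290, so half is 145; 80 + 70 = 150 reaches it at the 70-long contig.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

A higher N50 means the assembly is made of longer pieces, but the metric says nothing about correctness, which is why validation on benchmarks is still needed.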

We propose a two-step procedure for designing the Genomic Assembly Tool Box. The first year of the project will be devoted to finalizing modules which have already been validated inside Mapsembler and Monument: indexing, contig/scaffold, and targeted assembly (task 1). During the second year, the three other modules, not yet fully validated or still in the research phase, will be added (task 2). All these modules will be systematically validated with intensive tests. Furthermore, participation in the international Assemblathon competition is envisioned in order to clearly position our tools with respect to competing assemblers (task 3). Concurrently with these technical developments, the Inria Rennes Technology Transfer Office will carry out the actions needed to ensure the "transferability" of the toolbox to biotechnology companies and academics (task 4).

3.2. PROJECT MANAGEMENT

The management of the project will be easy to implement since both partners are physically located in the same building. A monthly meeting will be systematically organized for fine synchronization between the GenScale research team and the Inria Rennes Technology Transfer Office. The technical progress and the technology transfer actions of the project will be discussed.

The Inria Rennes Technology Transfer Office will implement the processes of the Inria Technology Transfer and Innovation Department (DTI), involving the DTI experts, the Technology Transfer Associate for Health, Life Sciences & Biotechnologies and the DTI Head of Software Assets, in the meetings with the GenScale research team at least every three months (see task 4). Locally, the Inria Rennes technology transfer officer will discuss and analyze the project's progress with the project coordinator every month.

3.3. DESCRIPTION BY TASK

3.3.1 TASK 1: GATB V1.0

Objective:

Provide version 1.0 of the genomic assembly toolbox (GATB v1.0), which will be made of the (1) indexing, (2) de novo assembly, and (3) targeted assembly modules.

Task leader: GenScale

Description of the work:

This task will perform the following actions:

1. Test and debug the 3 modules

2. Make the 3 modules compliant with standard HTS interfaces

3. Write the associated documentation

4. Make GATB v1.0 deployable on standard operating systems

Deliverables:

D1.1: indexing module (open access, GPL & CeCILL license)

D1.2: de novo assembly module (open access, GPL & CeCILL license)

D1.3: targeted assembly module (open access, GPL & CeCILL license)

Risks:

No identified risks. Prototypes already exist and have demonstrated their efficiency. This is mainly software engineering work.

3.3.2 TASK 2: GATB V2.0

Objective:

Provide version 2.0 of the genomic assembly toolbox (GATB v2.0), which is composed of the previous version enhanced with 3 new modules: read correction, super-scaffolding and gap-filling.

Task leader: GenScale


Description  of  the  work:  

This  task  will  perform  the  following  actions:  

1. Test and debug the 3 modules

2. Make the 3 modules compliant with the standard HTS interface

3. Write the associated documentation

4. Make GATB v2.0 deployable on standard operating systems

Deliverables:  

D2.1: read correction module (open access, CeCILL license)

D2.2: super-scaffolding module (open access, CeCILL license)

D2.3: gap-filling module (open access, CeCILL license)

Risks:  

The 3 modules are currently under development. Prototypes are still in their infancy and are not yet completely validated. This is ongoing research within GenScale.

3.3.3 TASK 3: VALIDATION

Objectives:    

Test the GATB on various benchmarks. Promote our tools by participating in international competitions (Assemblathon).

Task  leader:  GenScale  

Description  of  the  work:  

This task will perform the following 2 actions:

1. Assembly of various genomes. Data will come from the numerous datasets available within the scientific community. We will also use internal data from projects in which we collaborate closely with biologists.

2. Participation in international competitions such as the Assemblathon event. This is the best way to compare our results with the state-of-the-art software of the domain. It is also a powerful medium to promote our research.

Deliverables:  

D3.1:  Validation  report  of  GATB  v1.0  

D3.2:  Validation  report  of  GATB  v2.0  

D3.3:  Results  of  international  competitions.  

Risks:  

No  risk  identified.  

3.3.4 TASK 4: TECHNOLOGY TRANSFER ACTIVITIES

Objective:    

Perform all the tasks that support and complete the technical work, to guarantee that the project result, i.e. the GATB toolbox, has a good impact on the targeted market (bioinformatics companies and academics).

Task  leader:  technology  transfer  officer  of  the  Inria  Rennes  research  centre,  P.  Gelin  

Description  of  the  work:  

During the whole lifetime of the project, the technology transfer office of the Inria Rennes research centre will operate the "process for the monitoring of technology transfer activities" (described in section 4.1.1), designed and managed by the central Inria Technology Transfer & Innovation Department (DTI), with the help of two DTI experts: Ph. Gesnouin, the DTI Technology Transfer Associate for Health, Life Sciences & Biotechnologies, and P. Moreau, the DTI Head of Software Assets.

As described in section 4.1.1, the advances of the project will be presented periodically (every 6 months) to the DTI Technology Transfer Committee (called "CSATT"). These presentations will deal with the IP statement of the software components, the specific advantages of the results compared to the ongoing technology advances in the domain, and the evolution of the targeted market (including the possible launch of new commercial products or services). The first step of this work has already been done: the project was presented to the "CSATT" for the first time on February 8th, 2012.

The work with P. Moreau will guarantee a good quality of the software development process. It will start with a diagnosis of the existing software prototypes of the GenScale team (the "Monument" and "Mapsembler" prototypes) to clearly identify what must be done to preserve Inria's IP control of the components that will be developed on the basis of these prototypes. P. Moreau will also provide advice on the architecture of the components, specifically about the links that will ease interfacing with other tools, using standard APIs of the HTS domain while securing Inria's IP control of the toolbox.

Ph. Gesnouin will provide information collected from prospect companies about their specific needs regarding the potential integration of the GATB toolbox and their interest in beta-tests of GATB v1.0. He will also provide information about new developments coming from bioinformatics companies, such as the launch of new products or services.

The Inria Rennes Technology Transfer Office will manage the IP protection operations for the software components of the toolbox, each time one of them is finished. This is done with the French agency for the protection of software elements (the "APP"), which delivers an Inter Deposit Digital Number (IDDN) for each component we decide to protect (during 2011, the Inria Rennes TTO protected 49 software components with this APP process).

The   Inria   Rennes   Technology   Transfer   Office   will   also   prepare   drafts   of   commercial  license  agreements  and  specific  agreements  for  the  running  of  beta-­‐‑tests.  

Deliverables:  

D4.1: T6: 1st report for the "CSATT" committee

D4.2: T12: identification of the companies and academics interested in beta-tests of the first version of the GATB toolbox, and 2nd report for the "CSATT" committee

D4.3: T18: 3rd report for the "CSATT" committee, including feedback on the beta-tests of GATB v1.0

D4.4: T24: 4th report for the "CSATT" committee, including drafts of commercial licenses, expected license fees and identification of the first likely licensees

Risks:  

As  the  GATB  toolbox  will  be  a  solution   to  a  clearly   identified  need,   the  only  risk   is   the  launching  of  an  equivalent  product  by  a  big  bioinformatics  company.  

3.4. TASKS SCHEDULE, DELIVERABLES AND MILESTONES

 

Task  scheduling  

 

T1-3   T4-6   T7-9   T10-12   T13-15   T16-18   T19-21   T22-24

Task  1:  GATB  v1.0                  

Task  2:  GATB  v2.0                  

Task  3:  Validation                  

Task  4:  Technology  Transfer                    

 

The first year of the project will be devoted to producing the first version of the Assembly Tool Box (GATB v1.0). The second year will enhance the Assembly Tool Box (GATB v2.0) with modules that are currently in a research phase within the GenScale team. The validation task will start at T0+6 and includes the Assemblathon competition.

 

Deliverable  Scheduling  

 

Deliverable | Due month (T0 + n)
D1.1 | 6
D1.2 | 9
D1.3 | 12
D2.1 | 18
D2.2 | 21
D2.3 | 24
D3.1 | 24
D3.2 | 24
D3.3 | 24
D4.1 | 6
D4.2 | 12
D4.3 | 18
D4.4 | 24

 

Assuming the project starts in January 2013, we can expect to participate in Assemblathon 2013 and Assemblathon 2014. As we do not yet have the exact timing of these events, deliverable D3.3 is arbitrarily set to the end of the project. It will comment on the results obtained in these two international competitions.

Synchronization will be necessary between D1.3 and D4.2 for the launch of the beta-test phase, because version 1.0 of the toolbox must be available.
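The synchronization constraint between deliverables can be checked mechanically. The following is only an illustrative sketch: the due months are those read from the deliverable schedule of this section, while the names `DELIVERY_MONTH`, `DEPENDENCIES`, `schedule_conflicts` and `due_in` are hypothetical and not part of the project plan.

```python
# Due months (T0 + n) as read from the deliverable schedule of this section.
DELIVERY_MONTH = {
    "D1.1": 6, "D1.2": 9, "D1.3": 12,
    "D2.1": 18, "D2.2": 21, "D2.3": 24,
    "D3.1": 24, "D3.2": 24, "D3.3": 24,
    "D4.1": 6, "D4.2": 12, "D4.3": 18, "D4.4": 24,
}

# Hypothetical encoding of "the first deliverable must be available
# no later than the second one".
DEPENDENCIES = [
    ("D1.3", "D4.2"),  # GATB v1.0 must be complete to launch the beta-tests
]


def schedule_conflicts(months, dependencies):
    """Return the (prerequisite, dependent) pairs whose due months conflict."""
    return [(pre, dep) for pre, dep in dependencies if months[pre] > months[dep]]


def due_in(months, month):
    """Return the sorted list of deliverables due in the given month."""
    return sorted(name for name, due in months.items() if due == month)


if __name__ == "__main__":
    print("Conflicts:", schedule_conflicts(DELIVERY_MONTH, DEPENDENCIES))
    print("Due at T12:", due_in(DELIVERY_MONTH, 12))
```

With the schedule above, D1.3 and D4.2 fall in the same month, so the constraint holds only if D1.3 is not delayed; any slip of D1.3 would show up as a conflict.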

 

4. DISSEMINATION AND EXPLOITATION OF RESULTS, INTELLECTUAL PROPERTY

4.1. TECHNOLOGY TRANSFER STRATEGY

4.1.1 INRIA TECHNOLOGY TRANSFER STRATEGY AND ASSOCIATED PROCESS

One of the main objectives of Inria is to increase the number and the impact of technology transfer projects. Inria thus created a process dedicated to the monitoring of technology transfer projects (called "PSATT": process for the monitoring of technology transfer activities). This process, managed by the central Inria Technology Transfer & Innovation Department (DTI), allows:

• The involvement of the DTI experts: in the present case, the DTI Technology Transfer Associate for Health, Life Sciences & Biotechnologies (Ph. Gesnouin) and the DTI Head of Software Assets (P. Moreau) will be closely involved in the TT process for the GATB tool. The effective involvement of P. Moreau guarantees the technological quality of the software developed during the technology transfer process, which must reach a TRL7 level for a good transfer process, knowing that the development work starts from software component prototypes at a TRL3-4 level.

• A coherent monitoring of the progress of each technology transfer project during its lifetime. This monitoring is operated through the DTI Technology Transfer Committee (called "CSATT"): the project coordinator and the local TT officer periodically present to the CSATT the results and advances of the project concerning:

o The IP statement of the software components developed during the project, including the IP aspects of the links with potential external components;

o The specific advantages of the results of the project compared to the other technology advances in the domain;

o The evolution of the targeted market and the characterization of the companies, specifically SMEs, which would be the most interesting partners for the licensing of the GATB toolbox, including information about their product roadmaps, their potential interest in hiring the Inria engineer devoted to the development of the GATB toolbox and, for French companies, their ability to incorporate new technologies (which can be seen through their relationship with Oseo, their qualification as a "young innovative company" or their involvement in collaborative ANR or FUI projects).

Each time the project is presented to the "CSATT" by the project coordinator (D. Lavenier) and the local technology transfer officer of the Inria Rennes research centre (P. Gelin), this committee (composed of technology transfer experts from Oseo, IT-Translation, French competitive clusters and EPFL) provides opinions about the actions that should be implemented in the roadmap of the project to strengthen the technology transfer objective.

4.1.2 SHORT OVERVIEW OF THE MARKET

Since 2006, new sequencing techniques have appeared with High Throughput Sequencing (HTS). Data processing tools are available either as integrated solutions inside the equipment (Life Technologies Corp, Illumina, and even Oxford Nanopore Technologies with its brand new USB system called "MinION") or as solutions covering the whole data treatment process (CLCBio, Genostar, GenomeQuest).

Among the bioinformatics companies that could be interested in the GATB toolbox, those most likely to be concerned are:

• In France:

o The SME GenomeQuest (which is now majority held by US investors) provides tools for High Throughput Sequencing data treatment

o The SME Genostar LSC focuses on genome annotation

o The SME Korilog, created by a former engineer of the Inria Rennes bioinformatics platform, develops specific treatment tools and works with Genostar

• The SME CLCBio (Denmark), which sells the "Genomics Workbench" tool for de novo assembly

• 454 Life Sciences (a Roche division), which sells the Newbler tool

• The major US players, whose revenues mainly come from the equipment business (knowing that this equipment includes data treatment tools): Illumina, Life Technologies Corporation, Affymetrix, Beckman Coulter

4.1.3 PLANNED TECHNOLOGY TRANSFER SCHEME

Considering the current context of bioinformatics, we a priori plan a dual licensing scheme for the GATB toolbox and its components, with:

• An Open Source distribution under a "viral" license, i.e. GPL v3 (or, in France, the CeCILL license, which is equivalent to the GPL license), targeting academic use

• An Inria commercial non-exclusive license, targeting companies, with the possibility of fitting specific conditions to the needs of each company

4.1.4 ADDED VALUE OF THE GATB TOOLBOX

The available genomic assemblers (see section 2.3) require a grid and/or a huge amount of memory. The two basic modules of the GATB toolbox are "Monument" and "Mapsembler", and prototypes of those modules have already demonstrated high performance compared to existing tools: "Monument" has a low running time and low memory usage, and the main feature of "Mapsembler" is that it has no memory limitation (see section 2.3). The GATB toolbox, a fast and innovative collection of algorithms with a very low memory footprint for assembling specific regions of interest of a genome, should therefore be an attractive new tool for both bioinformatics companies and academic laboratories.

4.1.5 RETURN ON INVESTMENT

The return on investment will come from the commercial non-exclusive license agreements with bioinformatics companies. Based on our experience of previous equivalent situations, we a priori consider two kinds of agreements: one with large companies, where we will negotiate a global license agreement for an amount of approximately 50 k€, and one with SMEs, where we will negotiate annual fees adapted to their business model.

 

4.2. STATE & STRATEGY OF THE INTELLECTUAL PROPERTY

The first element of the IP strategy is complete control of the IP rights of the toolbox. The existing prototypes have been developed entirely within the GenScale Inria research team. We will keep complete control of the ownership of all the core components of the toolbox, excluding any use of external components. For the interface functions (such as user interfaces), if existing external components can be linked to the toolbox to increase its value, the chosen components will be software elements distributed under non-restrictive licenses, such as the BSD or Apache licenses. Otherwise, those peripheral components will also be developed internally.

The second element of our IP strategy is the license scheme of the GATB toolbox software. We plan a dual licensing scheme with:

• An Open Source distribution under a "viral" license (GPLv3), authorizing the use of the toolbox by academics and helping it gain recognition in the community;

• A commercial license, which will allow bioinformatics companies to integrate the toolbox into their treatment systems while avoiding any impact of the GPL license on their own software.

 

4.3. TECHNOLOGY TRANSFER OFFICE ROLE IN THE MILESTONES OF THE PROJECT

As described in 3.3.4, during the whole lifetime of the project, the technology transfer office of the Inria Rennes research centre will operate the "process for the monitoring of technology transfer activities" with the help of two experts of the central Inria Technology Transfer & Innovation Department (DTI): Ph. Gesnouin, the DTI Technology Transfer Associate for Health, Life Sciences & Biotechnologies, and P. Moreau, the DTI Head of Software Assets. The advances of the project will be presented periodically (every 6 months) to the DTI Technology Transfer Committee (called "CSATT") by the project coordinator and the Inria Rennes Technology Transfer Officer. These periodic status reports will include the results of the work on the quality of the software components, the market evolution and the prospect companies, the IP protection and the draft agreements.

4.4. RESOURCES INVOLVED BY THE TECHNOLOGY TRANSFER OFFICE DURING THE PROJECT

All  the  staff  of  the  Inria  Rennes  Technology  Transfer  office  will  be  involved  in  the  project:    

• Patrice Gelin, Technology Transfer Officer, in charge of the management of the Technology Transfer tasks

• Chantal Le Tonqueze, IP manager

• Marie-Anne St Jalmes, corporate lawyer

For this project, the Inria Rennes office will be helped by two experts of the Inria DTI:

• Philippe Gesnouin, the DTI Technology Transfer Associate for Health, Life Sciences & Biotechnologies

• Patrick Moreau, the DTI Head of Software Assets

(See  also  5.3)  

 

5. CONSORTIUM DESCRIPTION

5.1. PARTNERS DESCRIPTION & RELEVANCE, COMPLEMENTARITY

GenScale  INRIA  team  

GenScale is an INRIA team devoted to research and development in bioinformatics. The scientific axes of GenScale focus on the processing of genomic data. Research conducted in this group investigates the parallelism potential of the main bioinformatics processes in order to reduce execution time by several orders of magnitude. Topics of interest range from intensive sequence comparisons to NGS processing, including protein structure prediction.

Assembly is an important research axis of GenScale. Since the advent of the NGS domain, pioneering work has been done on this critical activity. In the national landscape, GenScale is currently the only group working specifically on assembly algorithms. Our specificity is to combine innovative data structures to lower the memory footprint, to develop advanced heuristics to provide fast execution times, and to implement parallel techniques allowing algorithms to cope with the huge volume of data to process.

Technology  Transfer  Office  

The Inria Rennes Technology Transfer Office supports the 30 research teams of the centre in setting up research collaboration partnerships and in the technology transfer of their results, focusing its efforts on bilateral partnerships with SMEs and on support for start-up projects. It is also involved in the support of European activities, e.g. through the EIT ICT Labs, Rennes being a satellite node of this multi-node European technological lab.

The annual contractual activity of the Inria Rennes centre is approximately 7 M€ for 50 contracts.

Complementarity  

As  part  of  the  same  institute,  both  teams  have  a  long  experience  of  working  together.  All  the  industrial  transfers  which  have  already  been  performed  by  GenScale  have  been  done  in  tight  cooperation  with  the  Technology  Transfer  Office.    

5.2. QUALIFICATION OF THE PROJECT COORDINATOR

Dominique Lavenier is a computational scientist by training and heads the IRISA/INRIA GenScale bioinformatics team. He has a long-standing interest in the information technology (IT) aspects of biological data production and analysis. Specifically, he has been working on important questions concerning ultra-high throughput DNA sequencing, including read assembly, mapping and QTL processing. He also has great expertise in parallelism, from GPU processing to grid processing. Over the last ten years, D. Lavenier has coordinated the following national projects:

• GenoGRID: A Grid for Genomic Applications (ACI Program, 2002-2004)

• RDISK: A Reconfigurable and Parallel Architecture for Browsing Genomic Databases (Inter EPST Program, 2002-2004)

• ReMIX: Reconfigurable Memory for Indexing (ACI MD Program, 2004-2006)

• Seed optimization and indexing of genomic banks on FLASH Memory (ARC INRIA, 2006-2007)

• BioWIC: Bioinformatics Workflow for Intensive Computing (ANR Arpege, 2009-2011)

D. Lavenier also has experience of industrial transfers through the following two software tools, previously developed in his research team in collaboration with the Technology Transfer Office:

• GASSST: This software, called a mapper, aims to align billions of short reads generated by NGS machines to a reference genome. This is a basic and time-consuming task of NGS processing. In 2011, it was successfully transferred to the GenomeQuest company and integrated into their NGS tool suite, providing a 10-fold speed-up compared to their native mapper. Development of GASSST continues in close collaboration with GenomeQuest, to meet industrial requirements and follow the evolution of NGS technology.

• PLAST: This is a sequence comparison software tool that takes two sets of sequences as input and provides an all-to-all comparison. PLAST is currently being transferred to the Korilog company within the KORIBLAST tool. Compared to BLAST, it allows days of computation on huge volumes of data to be reduced to hours. PLAST specifically targets the metagenomic field, where intensive comparison between samples and reference banks is systematically performed.

5.3. QUALIFICATION AND CONTRIBUTION OF EACH PARTNER

 

Partner | Name | First name | Position | PM | Contribution to the project
Inria-GenScale | Lavenier | Dominique | DR CNRS, head of the Inria GenScale research team | 6 | Coordinator
Inria-GenScale | Peterlongo | Pierre | CR INRIA | 6 | Mapsembler designer
Inria-GenScale | Moreews | Francois | IE INRA | 4 | Environment
Inria Rennes TTO | Gelin | Patrice | Technology Transfer Officer, Inria Rennes Bretagne Atlantique | 4 | Coordination of the technology transfer activities of the project
Inria DTI | Gesnouin | Philippe | Technology Transfer Associate for Health, Life Sciences & Biotechnologies, Inria DTI | 1.5 | Connection with bioinformatics companies
Inria DTI | Moreau | Patrick | Head of Software Assets, Inria DTI | 1 | Adviser for the software development
Inria Rennes TTO | Saint-Jalmes | Marie Anne | Corporate lawyer, Inria Rennes Bretagne Atlantique | 1 | Drafting of license agreements and specific contracts for technology transfer partnerships
Inria Rennes TTO | Le Tonquéze | Chantal | IP management, Inria Rennes Bretagne Atlantique | 0.5 | IP protection of the software components

 

6. SCIENTIFIC JUSTIFICATION OF REQUESTED RESOURCES

6.1. PARTNER 1: GENSCALE

• Equipment

1  workstation  (+  screen)  for  the  engineer  to  recruit  –  3000  €  

• Staff

1 Engineer, 24 months (GenScale CDD): 130 320 €

The activity of this engineer will be devoted to tasks 1, 2 and 3.

• Subcontracting

Expertise  from  the  AlgoRizk  company  to  ensure  industrial  compliance  –  30  000  €  

This company, created in 2011 by a former PhD student of GenScale, has great expertise in the design of high performance software. It also has strong links with the bioinformatics industry. In addition to consulting, the involvement of this company in the GATB project will provide practical help in the optimization of the software.

• Travel

Presentation of GATB at national and international conferences, visits to companies and participation in Assemblathon meetings: 9000 €

• Costs justified by internal procedures of invoicing

INRIA charges a 4% overhead for services.

6.2. PARTNER 2: INRIA TECHNOLOGY TRANSFER OFFICE

The TTO does not request additional resources for its activity dedicated to this project. All the staff and other resources needed for the TTO tasks will be fully supported by Inria.

 

7. REFERENCES • [CBP09]   Mark   J   Chaisson,   Dumitru   Brinza,   and   Pavel   A   Pevzner,   De   novo   fragment  

assembly  with  short  mate-­‐‑paired  reads:  Does  the  read  length  matter?  Genome  Research  19  (2009),  no.  2,  336-­‐‑346.  

• [CPWS99]   B.   Chevreux,   T.   Pfisterer,   T.   Wetter,   and   S.   Suhai,   Assembly   of   Genomic  Sequences  Assisted  by  Automatic  Finishing,  German  Conference  on  Bioinformatics,  1999,  pp.  183-­‐‑184.  

• [LZR+10]  Ruiqiang  Li,  Hongmei  Zhu,  Jue  Ruan,  Wubin  Qian,  Xiaodong  Fang,  Zhongbin  Shi,   Yingrui   Li,   Shengting   Li,   Gao   Shan,   Karsten  Kristiansen,   Songgang   Li,  Huanming  

Page 20: GATB - team.inria.fr · Added value of the GATB toolbox 15! 4.1.5!Return on Investment 15! 4.2.!State & strategy of the intellectual property .....15! 4.3.!Technology transfer office

PROGRAMME EMERGENCE

EDITION 2012

Projet GATB

DOCUMENT SCIENTIFIQUE  

 

ANR-­‐‑GUI-­‐‑AAP-­‐‑05  –  Doc  Scientifique  2012  –  VF     20/21  

Yang,  Jian  Wang,  and  Jun  Wang,  De  novo  assembly  of  human  genomes  with  massively  parallel  short  read  sequencing,  Genome  Research  20  (2010),  no.  2,  265-­‐‑272.  

• [MDK+08]   Jason  R  Miller,  Arthur  L  Delcher,  Sergey  Koren,  Eli  Venter,  Brian  P  Walenz,  Anushka   Brownley,   Justin   Johnson,   Kelvin   Li,   Clark   Mobarry,   and   Granger   Sutton,  Aggressive  assembly  of  pyrosequencing  reads  with  mates,  Bioinformatics  24   (2008),  no.  24,  2818-­‐‑2824.  

• [GMP+11]  Gnerre  S,  MacCallum  I,  Przybylski  D,  Ribeiro  F,  Burton  J,  Walker  B,  Sharpe  T,  Hall   G,   Shea   T,   Sykes   S,   Berlin   A,   Aird   D,   Costello   M,   Daza   R,   Williams   L,   Nicol   R,  Gnirke  A,  Nusbaum  C,  Lander  ES,  Jaffe  DB.  High-­‐‑quality  draft  assemblies  of  mammalian  genomes  from  massively  parallel  sequence  data  Proceedings  of  the  National  Academy  of  Sciences  USA  (January  2011  vol.  108  no.  4  1513-­‐‑1518).  

• [PTW01]  P.A.  Pevzner,  H.  Tang,  and  M.S.  Waterman,  An  Eulerian  path  approach  to  DNA  fragment   assembly,   Proceedings   of   the   National   Academy   of   Sciences   of   the   United  States  of  America  98  (2001),  no.  17,  9748.  

• [SWJ+09]   J.T.   Simpson,   K.   Wong,   S.D.   Jackman,   J.E.   Schein,   S.J.M.   Jones,   and   I.   Birol,  ABySS:  A  parallel   assembler   for   short   read   sequence  data,  Genome  Research  19   (2009),  no.  6,  1117.  

• [ZB08]  Daniel   R  Zerbino   and   Ewan  Birney,  Velvet:  Algorithms   for   de   novo   short   read  assembly  using  de  bruijn  graphs,  Genome  Research  18  (2008),  no.  5,  821-­‐‑829.  

• [NOG+09] C. Nusbaum, T.K. Ohsumi, J. Gomez, J. Aquadro, T.C. Victor, R.M. Warren, D.T. Hung, B.W. Birren, E.S. Lander, and D.B. Jaffe, Sensitive, specific polymorphism discovery in bacteria using massively parallel sequencing, Nature Methods 6 (2009), no. 1, 67.

• [PPDS04] Mihai Pop, Adam Phillippy, Arthur L Delcher, and Steven L Salzberg, Comparative genome assembly, Brief Bioinform 5 (2004), no. 3, 237-248.

• [CL11] R. Chikhi and D. Lavenier, Localized genome assembly from reads to scaffolds: practical traversal of the paired string graph, Algorithms in Bioinformatics (2011), pp. 39-48.

• [CCL11] G. Chapuis, R. Chikhi, and D. Lavenier, Parallel and memory-efficient reads indexing for genome assembly, in Proceedings of PBC 2011 (2011).

• [EARL11] D. Earl et al., Assemblathon 1: A competitive assessment of de novo short read assembly methods, Genome Research (2011).

• [PC11] P. Peterlongo and R. Chikhi, Mapsembler, targeted assembly of large genomes on a desktop computer, Research report RR-7565, http://hal.archives-ouvertes.fr/inria-00577218_v1/

• [PC12] P. Peterlongo and R. Chikhi, Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer, BMC Bioinformatics, under re-review.

• [PIAT11] J. Piat, F. Moreews, O. Collin, A. Cornu, and D. Lavenier, SLICEE: A service-oriented middleware for intensive scientific computation, 7th IEEE 2011 World Congress on Services (SERVICES 2011), Washington DC, USA, 2011.

• [GALAXY10] J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team, Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences, Genome Biol. 11 (2010), no. 8, R86.