REPET: pipelines for the identification and annotation of transposable elements in genomic sequences Timothée Flutre INRA, Unité de Recherche en Génomique-Info (URGI) 03/09/2009



Context


TEs in plant genomes

[Figure: proportion of TEs versus the rest of the genome in plant genomes; labels: ~20%, ~30%, ~60%, ~90%]

Genome sizes

● A. thaliana: ~180 Mb

● O. sativa: ~400 Mb

● Z. mays: ~2,400 Mb

● T. aestivum: ~16,000 Mb


TE genomic annotation

[Figure: overview of the two pipelines]

● The TEdenovo pipeline: genome → repeat search (self-alignment) → high-scoring segment pairs → clustering (+ multiple alignment) → consensus for the TE families

● The TEannot pipeline: library of known TEs or consensus + genome → similarity search (~Blast) → TE copies


Biological challenges

● Nested or degenerate elements

● Other genomic repeats: segmental duplications, satellites, tandem repeats


Technical challenges

● Handle big volumes of data:

– Storage in MySQL tables

● Handle long computation times:

– Interaction with computer clusters


De novo identification of TE families


Build de novo TE consensus

Flutre et al., in prep.


Clustering the variants


Benchmark

● Drosophila melanogaster genome sequence

– Release 4: ~130 Mb (mainly euchromatic)

● Berkeley Drosophila Genome Project:

– 126 reference sequences of TEs found in the D. melanogaster genome (high-quality library)

● Matches between de novo consensus and reference sequences fall into four categories, CC, CI, IC and II, combining a complete or incomplete consensus with a complete or incomplete reference

– Complete = match length >= 95% of the sequence length
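The four match categories can be sketched in code. This is only an illustration: it assumes that the first letter refers to the consensus and the second to the reference (the slide does not spell out the order), and that "match length" is measured as a fraction of each sequence's own length. The function name and its parameters are hypothetical.

```python
def match_category(match_len_on_consensus, consensus_len,
                   match_len_on_reference, reference_len,
                   threshold=0.95):
    """Return 'CC', 'CI', 'IC' or 'II' for one consensus/reference match.

    Assumption: first letter = consensus, second letter = reference.
    A sequence counts as "complete" when the match spans at least
    `threshold` of its length.
    """
    if consensus_len <= 0 or reference_len <= 0:
        raise ValueError("sequence lengths must be positive")
    consensus_complete = match_len_on_consensus / consensus_len >= threshold
    reference_complete = match_len_on_reference / reference_len >= threshold
    first = "C" if consensus_complete else "I"
    second = "C" if reference_complete else "I"
    return first + second
```

Under these assumptions, a consensus matched over 96% of its length against a reference matched over 99% of its length would be counted as "CC".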


Results on D. melanogaster

● First steps of the TEdenovo pipeline:

– 220,000 HSPs from BLASTER (7.4% coverage)

– 4004 clusters and consensus: 3443 from GROUPER, 441 from RECON, 120 from PILER


⇒ 54 reference sequences retrieved with precise boundaries (“CC” alignments with de novo consensus)

⇒ close to the optimum: 62 families with 2 complete copies, 51 with 3 complete copies


Combined approach

Clustering methods      Ref. sequences in CC matches
Grouper                 45
Recon                   44
Piler                   33
Grp + Rec               54
Grp + Pil               46
Rec + Pil               48
Grp + Rec + Pil         54

[Venn diagram of GROUPER, RECON and PILER; region counts as extracted: 7, 8, 1, 3, 28, 1, 7]

⇒ Combining several programs at the clustering step gives better results!
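Counting what a combination of programs recovers amounts to taking the union of the reference sequences each program recovers on its own. The sketch below illustrates that bookkeeping only; the function name is hypothetical and the toy sets are made up, they are not the D. melanogaster benchmark data.

```python
from itertools import combinations


def recovered_by_combinations(recovered):
    """Map each non-empty combination of programs to the number of
    distinct reference sequences recovered by at least one of them."""
    counts = {}
    programs = sorted(recovered)
    for size in range(1, len(programs) + 1):
        for combo in combinations(programs, size):
            union = set().union(*(recovered[p] for p in combo))
            counts[" + ".join(combo)] = len(union)
    return counts


# Hypothetical toy input (NOT the benchmark results).
toy = {
    "GROUPER": {"ref1", "ref2", "ref3"},
    "RECON": {"ref2", "ref3", "ref4"},
    "PILER": {"ref3", "ref5"},
}
counts = recovered_by_combinations(toy)
```

A combination can never recover fewer references than its best member, which is why the combined rows of the table are at least as high as the single-program rows.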


A. thaliana and O. sativa

● TAIR release 9 for A. thaliana and IRGSP build 5 for O. sativa

● First steps of the TEdenovo pipeline:

– HSPs from BLASTER:

   ● A. thaliana: 207,000 (coverage 13.5%)

   ● O. sativa: 12,086,000 (coverage 3.1%)

– clusters and consensus:

   ● A. thaliana: 6,639 (5,344 from GROUPER, 994 from RECON and 301 from PILER)

   ● O. sativa: 88,042 (81,400 from GROUPER, 5,071 from RECON and 1,571 from PILER)


TE classification

● False-positive filtering based on consensus analysis (TE structural features)

● Removal of redundant consensus


● Search for terminal repeats (LTR and TIR)

● Search for polyA and SSR-like tails

● Detect matches with known TEs via tblastx and blastx

● Detect matches with host's genes via blastn

● Detect matches with TE HMM profiles
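Two of the structural searches listed above can be illustrated with very small functions. This is a deliberately naive sketch, not the pipeline's implementation: it only accepts exact matches, and the function names and the default lengths are assumptions chosen for the example.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def polya_tail_length(sequence, min_length=10):
    """Length of the run of 'A' at the 3' end, or 0 if it is
    shorter than `min_length` (an assumed cut-off)."""
    match = re.search(r"A+$", sequence.upper())
    if match is None:
        return 0
    length = len(match.group())
    return length if length >= min_length else 0


def has_exact_tir(sequence, length=10):
    """True if the first `length` bases equal the reverse complement
    of the last `length` bases (exact terminal inverted repeat)."""
    seq = sequence.upper()
    if length <= 0 or len(seq) < 2 * length:
        return False
    return seq[:length] == reverse_complement(seq[-length:])
```

Real consensus sequences are degenerate, so a practical search would tolerate mismatches and indels, typically by aligning the two ends rather than comparing them exactly.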


● Category: 'class I' or 'class II' for the TEs, 'SSR', 'HostGene', 'NoCat' or '?' for the others

● Type: 'LTR' / 'LARD' / 'LINE' / 'SINE' / 'TIR' / 'MITE' / 'Helitron' / 'Polinton' for the TEs, 'NA' or '?' for the other categories

● Completeness: 'yes', 'no', 'NA' or '?'

● Comments: briefly explain the classification

● Confusedness: 'yes' or 'no'

Flutre et al., in prep.
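The classification fields listed above map naturally onto a small record type. The sketch below mirrors the allowed values literally as they appear on the slide; the class name, the field names and the validation rules are hypothetical and are not taken from the REPET code.

```python
from dataclasses import dataclass

CATEGORIES = {"class I", "class II", "SSR", "HostGene", "NoCat", "?"}
TE_CATEGORIES = {"class I", "class II"}
TE_TYPES = {"LTR", "LARD", "LINE", "SINE", "TIR", "MITE", "Helitron", "Polinton"}
COMPLETENESS = {"yes", "no", "NA", "?"}
CONFUSEDNESS = {"yes", "no"}


@dataclass(frozen=True)
class ConsensusClassification:
    """One classified consensus (hypothetical record layout)."""
    name: str
    category: str
    te_type: str
    completeness: str
    comments: str
    confusedness: str

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        # TEs take a TE type; the other categories take 'NA' or '?'.
        if self.category in TE_CATEGORIES:
            allowed_types = TE_TYPES
        else:
            allowed_types = {"NA", "?"}
        if self.te_type not in allowed_types:
            raise ValueError(
                f"type {self.te_type!r} not allowed for {self.category!r}")
        if self.completeness not in COMPLETENESS:
            raise ValueError(f"unknown completeness: {self.completeness!r}")
        if self.confusedness not in CONFUSEDNESS:
            raise ValueError(f"unknown confusedness: {self.confusedness!r}")
```

Validating at construction time means an inconsistent combination, such as an 'SSR' consensus typed as 'LTR', is rejected before it reaches the final library.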


TE families

● ATREP11, non-autonomous Helitron (1053 nt)

[Figure: overview and “zoom-in” of the multiple alignment: all-by-all matches + consensus + reference sequence. Sequences shown:]

– reference sequence (Repbase)

– from RECON (279 copies)

– from RECON (137 copies)

– from RECON (85 copies)

– from RECON (62 copies)

– from GROUPER (0 copy), shown twice

– from GROUPER (34 copies)


Take-home messages (1)

● Final TE library (classified, non-redundant consensus)

– D. melanogaster: 748 consensus (461 for R+P)

– A. thaliana: 1584 consensus (999 for R+P)

– O. sativa: 5243 consensus (for R+P only)

➔ TEdenovo builds TE consensus with well-defined boundaries.

➔ A TE family can be represented by several consensus for its different structural variants.


Genomic annotation of TE copies


TE annotation pipeline

Quesneville et al., PLoS Comp. Bio., 2005


TE fragment connections (1)

● MATCHER:

– Take all matches found by BLASTER, RepeatMasker and CENSOR;

– Connect the TE fragments via dynamic programming (find the optimal path);

– Filter the redundancy and clean the conflicts.


[Figure: TE annotations before MATCHER and after MATCHER]
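Connecting fragments "via dynamic programming" can be illustrated with a classic chaining recurrence. The sketch below is a simplified stand-in, not MATCHER itself: it assumes all fragments match the same TE on the same strand, scores a chain as the plain sum of fragment scores (no gap penalties), and requires fragments to be strictly ordered on both the genome and the TE. The `Fragment` type and `best_chain` function are hypothetical.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    g_start: int   # start on the genome
    g_end: int     # end on the genome
    t_start: int   # start on the TE consensus
    t_end: int     # end on the TE consensus
    score: float


def best_chain(fragments):
    """Return the highest-scoring chain of collinear fragments."""
    frags = sorted(fragments, key=lambda f: (f.g_start, f.g_end))
    n = len(frags)
    if n == 0:
        return []
    best = [f.score for f in frags]   # best chain score ending at fragment i
    prev = [None] * n                 # back-pointer to rebuild the chain
    for i in range(n):
        for j in range(i):
            # j may precede i only if it ends before i starts,
            # both on the genome and on the TE consensus.
            compatible = (frags[j].g_end < frags[i].g_start
                          and frags[j].t_end < frags[i].t_start)
            if compatible and best[j] + frags[i].score > best[i]:
                best[i] = best[j] + frags[i].score
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i is not None:
        chain.append(frags[i])
        i = prev[i]
    return chain[::-1]
```

The quadratic loop is fine for the handful of fragments found around one locus; the redundancy filtering and conflict cleaning mentioned on the slide are separate steps that this sketch does not attempt.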


TE fragment connections (2)

● “Long join” procedure, criteria shown on the figure:

– ∆identity < 2%

– gap < 5 kb

– mismatches < 500 bp

– Σ TE lengths = 95%

– all identities > max(identity1, identity2) - 2%

(The figure separates alternative cases with “OR”.)
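One possible reading of these thresholds is sketched below as a predicate. The grouping is an assumption, since the extracted slide does not show how the criteria combine: here the identity difference is treated as a common precondition, a short gap with a small mismatch is one case, and a gap made up of at least 95% other TEs that are all "younger" (higher identity) is the alternative case. The function and parameter names are hypothetical.

```python
def can_long_join(identity1, identity2, gap_bp, mismatch_bp,
                  nested_te_bp=0, nested_identities=()):
    """Decide whether two fragments of the same TE may be joined.

    Identities are percentages. The way the criteria are grouped
    is an ASSUMPTION, not something the slide confirms.
    """
    # Assumed common precondition: similar identity to the consensus.
    if abs(identity1 - identity2) >= 2.0:
        return False
    # Assumed case 1: short gap on the genome, small mismatch on the TE.
    if gap_bp < 5000 and mismatch_bp < 500:
        return True
    # Assumed case 2: the gap is at least 95% covered by other TEs,
    # all with identity above max(identity1, identity2) - 2.
    if gap_bp > 0 and nested_te_bp / gap_bp >= 0.95:
        floor = max(identity1, identity2) - 2.0
        return bool(nested_identities) and all(
            identity > floor for identity in nested_identities)
    return False
```

The second case captures the nested pattern: an older copy split in two by later insertions, whose own identities to their consensus are expected to be higher than that of the interrupted copy.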


Results on A. thaliana

● Input:

– TAIR release 9 (120 Mb)

– TE library of 1531 de novo consensus

● Output:

– 45,558 TE fragments

– 39,675 TE copies (67 “long joins”)

– 21.80% genome coverage

– 115 consensus without any copy
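A genome coverage figure like the one above can be computed by merging overlapping annotations so that no base is counted twice. The sketch below is a generic illustration rather than REPET's code; it assumes annotations are 1-based inclusive (start, end) pairs on a single sequence, and the function name is hypothetical.

```python
def coverage_percent(intervals, sequence_length):
    """Percentage of the sequence covered by at least one interval.

    `intervals` holds 1-based inclusive (start, end) pairs.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    covered = 0
    current = None  # the merged interval being extended
    for start, end in sorted(intervals):
        if current is None or start > current[1]:
            # No overlap: close the previous merged interval.
            if current is not None:
                covered += current[1] - current[0] + 1
            current = [start, end]
        else:
            # Overlap: extend the merged interval if needed.
            current[1] = max(current[1], end)
    if current is not None:
        covered += current[1] - current[0] + 1
    return 100.0 * covered / sequence_length
```

For a whole genome, the same merge would be run per chromosome and the covered bases summed before dividing by the total length.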


Integration in TriAnnot

● Optimize the statistical filter parameters

● Build a library of consensus by combining TREP with custom sequences from F. Choulet

● Launch the TEannot pipeline as part of the TriAnnot pipeline (block 03 - TE modelling)


Take-home messages (2)

● A full TE annotation requires handling nested patterns and thus reconstructing TE copies

➔ TEannot implements an efficient dual strategy to connect TE fragments:

– MATCHER

– “long join” procedure


REPET on the web !!

http://urgi.versailles.inra.fr

Authors & contributors: Hadi Quesneville, Timothée Flutre, Olivier Inizan, Claire Hoede, Anna-Sophie Fiston-Lavier, Elodie Duprat, Gaël Faroux, Delphine Autard, Benoît Bely


Acknowledgements

● URGI, INRA Versailles: Hadi Quesneville, Emmanuelle Permal, Joëlle Amselem, Olivier Inizan, Claire Hoede, Victoria Dominguez, Isabelle Luyten and Sébastien Reboux

● GDEC, INRA Clermont: Catherine Feuillet, Philippe Leroy, Frédéric Choulet

● Univ. Perpignan: Olivier Panaud, François Sabot, Cristian Chaparro

● Elodie Duprat (UPMC), Anna-Sophie Fiston-Lavier (Stanford)


Thank you !


Genomes annotated with REPET

Done (8)

● D. melanogaster (Flybase) with M. Ashburner (Univ. Cambridge) and C. Bergman (Univ. Manchester)

● A. thaliana (TAIR) with V. Colot (ENS) and N. Buisine (MNHN)

● Fusarium graminearum with C. Kistler (Univ. Minnesota)

● Laccaria bicolor with F. Martin (INRA-Nancy)

● Meloidogyne incognita with P. Abad (INRA Antibes)

● Ectocarpus siliculosus with M. Cock (Roscoff)

● Leptosphaeria maculans with T. Rouxel (INRA Versailles)

● Acyrthosiphon pisum with D. Tagu (INRA Rennes) and A. Wilson (Miami)

In progress …

● O. sativa (IRGSC) with O. Panaud (Perpignan)

● T. aestivum (IWGSC) with C. Feuillet (INRA Clermont-Ferrand)

● 12 Drosophilidae genomes with D. Petrov (Univ. Stanford)

● Spodoptera frugiperda, Helicoverpa armigera with P. Fournier (INRA Montpellier)

● Melampsora larici-populina with F. Martin (INRA-Nancy)

● Tuber melanosporum with F. Martin (INRA-Nancy)

● Stagonospora nodorum with T. Rouxel (INRA Versailles)

● Blumeria graminis with P. Spanu (Imp. Coll. London)

● Botrytis cinerea with M-H. Lebrun (INRA Versailles)

● Emiliania huxleyi with Betsy Read (Cal State University)