REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33...
Transcript of REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33...
![Page 1: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/1.jpg)
REPET: pipelines for the
identification and annotation
of transposable elements
in genomic sequences
Timothée Flutre
INRA, Unité de Recherche en Génomique-Info (URGI)
03/09/2009
![Page 2: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/2.jpg)
03/09/2009 T. Flutre - INRA 2/33
Context
![Page 3: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/3.jpg)
03/09/2009 T. Flutre - INRA 3/33
TEs in plant genomes
~20%~30%
~60% ~90%
proportion of TEs
rest of the genome
Genome sizes
● A. thaliana: ~180 Mb
● O. sativa: ~400 Mb
● Z. mays: ~2,400 Mb
● T. aestivum: ~16,000 Mb
![Page 4: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/4.jpg)
03/09/2009 T. Flutre - INRA 4/33
TE genomic annotation
High-segment scoring pairs
TE families
(+ multiple alignment)
(self-alignment)
The TEdenovo pipeline The TEannot pipeline
library of known TEs
Similarity search (~Blast)
genome
con
sen
sus
Repeat search
Clustering
+genome
TE copies
![Page 5: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/5.jpg)
03/09/2009 T. Flutre - INRA 5/33
Biological challenges... nested or degenerated elements
nested
... other genomic repeats
segmental duplications
satellites
tandem repeats
degenerated
![Page 6: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/6.jpg)
03/09/2009 T. Flutre - INRA 6/33
Technical challenges
● Handle big volumes of data:– Storage in MySQL tables
● Handle long computation times:– Interaction with computer clusters
![Page 7: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/7.jpg)
03/09/2009 T. Flutre - INRA 7/33
De novo identification of TE families
![Page 8: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/8.jpg)
03/09/2009 T. Flutre - INRA 8/33
Build de novo TE consensus
Flutre et al., in prep.
![Page 9: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/9.jpg)
03/09/2009 T. Flutre - INRA 9/33
Clustering the variants
![Page 10: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/10.jpg)
03/09/2009 T. Flutre - INRA 10/33
Benchmark
● Drosophila melanogaster genome sequence– Release 4: ~ 130 Mb (mainly euchromatic)
completeconsensus
completereference
incompleteconsensus
in-completereference
CC CI
IC II
Complete = match length >= 95%
● Berkeley Drosophila Genome Project:
126 reference sequences of TEs found in the D. melanogaster genome
high-quality library
![Page 11: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/11.jpg)
03/09/2009 T. Flutre - INRA 11/33
Results on D. melanogaster
● First steps of the TEdenovo pipeline:– 220,000 HSPs from BLASTER (7.4% coverage)
– 4004 clusters and consensus:● 3443 from GROUPER● 441 from RECON● 120 from PILER
![Page 12: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/12.jpg)
03/09/2009 T. Flutre - INRA 12/33
Results on D. melanogaster
● First steps of the TEdenovo pipeline:– 220,000 HSPs from BLASTER (7.4% coverage)
– 4004 clusters and consensus:● 3443 from GROUPER● 441 from RECON● 120 from PILER
⇒ 54 reference sequences retrieved with precise boundaries (“CC” alignments with de novo consensus)
⇒ close to the optimum: 62 families with 2 complete copies, 51 with 3 complete copies
![Page 13: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/13.jpg)
03/09/2009 T. Flutre - INRA 13/33
Combined approach
Grouper 45
44
Piler 33
54
46
48
54
Clustering methods
Ref. sequences in CC matches
Recon
Grp + Rec
Grp + Pil
Rec + Pil
Grp + Rec + Pil
GROUPER
RECON
PILER
7
8
1
3 28
1
7
⇒ Combine several programs at the clustering step gives better results !
![Page 14: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/14.jpg)
03/09/2009 T. Flutre - INRA 14/33
A. thaliana and O. sativa
● TAIR release 9 for A. thaliana and IRGSP build 5 for O. sativa
● First steps of the TEdenovo pipeline:– HSPs from BLASTER:
● A. thaliana: 207,000 (coverage 13.5%)● O. sativa: 12,086,000 (coverage 3.1%)
– clusters and consensus:● A. thaliana: 6,639 (5,344 from GROUPER, 994 from
RECON and 301 from PILER)
● O. sativa: 88,042 (81,400 from GROUPER, 5,071 from RECON and 1,571 from PILER)
![Page 15: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/15.jpg)
03/09/2009 T. Flutre - INRA 15/33
TE classification
● False-positive filtering based on consensus analysis (TE structural features)
● Removal of redundant consensus
![Page 16: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/16.jpg)
03/09/2009 T. Flutre - INRA 16/33
TE classification
● False-positive filtering based on consensus analysis (TE structural features)
● Removal of redundant consensus
● Search for terminal repeats (LTR and TIR)
● Search for polyA and SSR-like tails
● Detect matches with known TEs via tblastx and blastx
● Detect matches with host's genes via blastn
● Detect matches with TE HMM profiles
![Page 17: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/17.jpg)
03/09/2009 T. Flutre - INRA 17/33
TE classification
● False-positive filtering based on consensus analysis (TE structural features)
● Removal of redundant consensus
● Category: 'class I' or 'class II' for the TEs, 'SSR', 'HostGene', 'NoCat' or '?' for the others
● Type: 'LTR' / 'LARD' / 'LINE' / 'SINE' / 'TIR' / 'MITE' / 'Helitron' / 'Polinton' for the TEs, 'NA' or '?' for the other categories
● Completeness: 'yes', 'no', 'NA' or '?'
● Comments: briefly explain the classification
● Confusedness: 'yes' or 'no'
Flutre et al., in prep.
● Search for terminal repeats (LTR and TIR)
● Search for polyA and SSR-like tails
● Detect matches with known TEs via tblastx and blastx
● Detect matches with host's genes via blastn
● Detect matches with TE HMM profiles
![Page 18: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/18.jpg)
03/09/2009 T. Flutre - INRA 18/33
TE families
● ATREP11, non-autonomous Helitron (1053 nt)
Overview of the multiple alignment
“zoom-in”
from RECON(279 copies)
from RECON(137 copies)
from RECON(85 copies)
from RECON(62 copies)
from GROUPER(0 copy)from GROUPER(0 copy)
from GROUPER(34 copies)
reference sequence (Repbase) all-by-all matches + consensus + reference sequence
![Page 19: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/19.jpg)
03/09/2009 T. Flutre - INRA 19/33
TE families
● ATREP11, non-autonomous Helitron (1053 nt)
Overview of the multiple alignment
“zoom-in”
from RECON(279 copies)
from RECON(137 copies)
from RECON(85 copies)
from RECON(62 copies)
from GROUPER(0 copy)from GROUPER(0 copy)
from GROUPER(34 copies)
reference sequence (Repbase) all-by-all matches + consensus + reference sequence
![Page 20: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/20.jpg)
03/09/2009 T. Flutre - INRA 20/33
TE families
● ATREP11, non-autonomous Helitron (1053 nt)
Overview of the multiple alignment
“zoom-in”
from RECON(279 copies)
from RECON(137 copies)
from RECON(85 copies)
from RECON(62 copies)
from GROUPER(0 copy)from GROUPER(0 copy)
from GROUPER(34 copies)
reference sequence (Repbase) all-by-all matches + consensus + reference sequence
![Page 21: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/21.jpg)
03/09/2009 T. Flutre - INRA 21/33
TE families
● ATREP11, non-autonomous Helitron (1053 nt)
reference sequence
all-by-all matches + consensus + reference sequence
![Page 22: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/22.jpg)
03/09/2009 T. Flutre - INRA 22/33
Take-home messages (1)
● Final TE library (classified, non-redundant consensus)
– D. melanogaster: 748 consensus (461 for R+P)
– A. thaliana: 1584 consensus (999 for R+P)
– O. sativa: 5243 consensus (for R+P only)
➔ TEdenovo builds TE consensus with well-defined boundaries.
➔ A TE family can be represented by several consensus for its different structural variants.
![Page 23: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/23.jpg)
03/09/2009 T. Flutre - INRA 23/33
Genomic annotation of TE copies
![Page 24: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/24.jpg)
03/09/2009 T. Flutre - INRA 24/33
TE annotation pipeline
Quesneville et al., PLoS Comp. Bio., 2005
![Page 25: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/25.jpg)
03/09/2009 T. Flutre - INRA 25/33
TE fragment connections (1)
● MATCHER:– Take all matches found by BLASTER, RepeatMasker
and CENSOR;
– Connect the TE fragments via dynamic programming (find the optimal path);
– Filter the redundancy and clean the conflicts.
![Page 26: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/26.jpg)
03/09/2009 T. Flutre - INRA 26/33
TE fragment connections (1)
● MATCHER:– Take all matches found by BLASTER, RepeatMasker
and CENSOR;
– Connect the TE fragments via dynamic programming (find the optimal path);
– Filter the redundancy and clean the conflicts.
TE annotations before MATCHER
TE annotations after MATCHER
![Page 27: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/27.jpg)
03/09/2009 T. Flutre - INRA 27/33
TE fragment connections (2)
● “Long join” procedure:
∆identity < 2%
gap < 5kb
mismatches < 500bp
Σ TE lengths = 95%
all identities > max(identity1,identity2) - 2%
OR
![Page 28: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/28.jpg)
03/09/2009 T. Flutre - INRA 28/33
Results on A. thaliana
● Input:– TAIR release 9 (120 Mb)
– TE library of 1531 de novo consensus
● Output:– 45,558 TE fragments
– 39,675 TE copies (67 “long joins”)
– 21.80% genome coverage
– 115 consensus without any copy
![Page 29: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/29.jpg)
03/09/2009 T. Flutre - INRA 29/33
Integration in TriAnnot
● Optimize the statistical filter parameters● Build a library of consensus by combining
TREP with custom sequences from F. Choulet● Launch the TEannot pipeline as part of the
TriAnnot pipeline (block 03 - TE modelling)
![Page 30: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/30.jpg)
03/09/2009 T. Flutre - INRA 30/33
Take-home messages (2)
● A full TE annotation requires to handle nested patterns and thus to reconstruct TE copies
➔ TEannot implements an efficient dual strategy to connect TE fragments:
– MATCHER
– “long join” procedure
![Page 31: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/31.jpg)
03/09/2009 T. Flutre - INRA 31/33
REPET on the web !!
http://urgi.versailles.inra.fr
Authors & contributors: Hadi Quesneville, Timothée Flutre, Olivier Inizan, Claire Hoede, Anna-Sophie Fiston-Lavier, Elodie Duprat, Gaël Faroux, Delphine Autard, Benoît Bely
![Page 32: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/32.jpg)
03/09/2009 T. Flutre - INRA 32/33
Acknowledgements
● URGI, INRA Versailles: Hadi Quesneville, Emmanuelle Permal, Joëlle Amselem, Olivier Inizan, Claire Hoede, Victoria Dominguez, Isabelle Luyten and Sébastien Reboux
● GDEC, INRA Clermont: Catherine Feuillet, Philippe Leroy, Frédéric Choulet
● Univ. Perpignan: Olivier Panaud, François Sabot, Cristian Chaparro
● Elodie Duprat (UPMC), Anna-Sophie Fiston-Lavier (Stanford)
![Page 33: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/33.jpg)
03/09/2009 T. Flutre - INRA 33/33
Thank you !
![Page 34: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/34.jpg)
03/09/2009 T. Flutre - INRA 34/33
![Page 35: REPET: pipelines for the identification and annotation of ... · 03/09/2009 T. Flutre - INRA 30/33 Take-home messages (2) A full TE annotation requires to handle nested patterns and](https://reader035.fdocuments.in/reader035/viewer/2022071219/6056a5984cac4049305d2fbf/html5/thumbnails/35.jpg)
03/09/2009 T. Flutre - INRA 35/33
Genomes annotated with REPET
Done (8)
● D. melanogaster (Flybase) with M. Ashburner (Univ. Cambridge) and C. Bergman (Univ. Manchester)
● A. thaliana (TAIR) with V. Colot (ENS) and N. Buisine (MNHN)
● Fusarium graminearum with C. Kistler (Univ. Minnesota)
● Laccaria bicolor with F. Martin (INRA-Nancy)
● Meloidogyne incognita with P. Abad (INRA Antibes)
● Ectocarpus siliculosus with M. Cock (Roscoff)
● Leptospheria maculans with T. Rouxel (INRA Versailles)
● Acyrthosiphon pisum with D. Tagu (INRA Rennes) and A. Wilson (Miami)
In progress …
● O. sativa (IRGSC) with O. Panaud (Perpignan)
● T. aestivum (IWGSC) with C. Feuillet (INRA Clermont-Ferrand)
● 12 Drosophilidae genomes with D. Petrov (Univ. Stanford)
● Spodoptera frugiperda, Helicoverpa armigera with P. Fournier (INRA Montpellier)
● Melampsora larici-populina with F. Martin (INRA-Nancy)
● Tuber melanosporum with F. Martin (INRA-Nancy)
● Stagonospora nodorum with T. Rouxel (INRA Versailles)
● Blumeria graminis with P. Spanu (Imp. Coll. London)
● Botrytis cinerea with M-H. Lebrun (INRA Versailles)
● Emiliania huxleyi with Betsy Read (Cal State University)
…