Identification of long non-coding RNAs in livestock species
Oana Palasca February 15th, 2018
Overview
• http://www.faang-europe.org
• Community effort to establish a database of ncRNAs in livestock species
• Annotation based on experimental evidence approach (RNA-Seq)
• Initiated as a hackathon in October 2016
Background
Long non-coding RNAs
• Capped, polyadenylated, alternatively spliced transcripts
• Roles in regulation of transcription/translation, chromatin modification, etc.
Challenges in identification of lncRNA genes
• Little sequence conservation – difficult to predict “de novo”
• Low expression levels – difficult to distinguish from transcriptional noise
• Low consistency between biological replicates
McMullen et al., Clinical Science, 2016
Background
Current annotation status, Ensembl 91
[Figure: number of genes and transcripts per species (human, mouse, pig, cow, sheep, horse, chicken). Panels: lincRNAs; protein coding.]
Workflow
Data collection and filtering
• Data obtained from the European Nucleotide Archive
• Cow, horse, pig, sheep, chicken
• Selection criteria: Illumina, paired-end, stranded, >100 bp
• Quality check – FastQC
• Strandedness check – Salmon [1]
    • 15% samples unstranded
    • 10% samples with first read mapping to the forward strand
• ~900 samples, 32 BRENDA tissue ontology terms
• Muscle and brain available for all species
[1] Patro et al., Nature Methods, 2017
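The strandedness check could be summarised along the following lines. This is a minimal sketch, not the project's actual code: it assumes Salmon was run with automatic library type detection and that each sample's `lib_format_counts.json` carries an `expected_format` field holding a Salmon library type code (e.g. `IU`, `ISF`, `ISR`). The function names are hypothetical.

```python
import json
from collections import Counter
from typing import Dict


def classify_library_type(code: str) -> str:
    """Map a Salmon library type code to a strandedness class.

    In Salmon's notation a trailing U means unstranded, SF means the
    first read comes from the forward strand, SR from the reverse strand.
    """
    code = code.strip().upper()
    if code.endswith("U"):
        return "unstranded"
    if code.endswith("SF"):
        return "first_read_forward"
    if code.endswith("SR"):
        return "first_read_reverse"
    raise ValueError(f"unrecognised library type: {code!r}")


def expected_format(lib_format_json: str) -> str:
    """Extract the detected library type from the JSON text.

    Assumption: the key is called 'expected_format'.
    """
    return json.loads(lib_format_json)["expected_format"]


def strandedness_fractions(samples: Dict[str, str]) -> Dict[str, float]:
    """Fraction of samples per strandedness class.

    `samples` maps a sample id to the text of its lib_format_counts.json.
    """
    counts = Counter(
        classify_library_type(expected_format(text)) for text in samples.values()
    )
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Samples classified as unstranded would fail the "stranded" selection criterion, while the forward/reverse distinction determines the strand setting passed to the downstream aligner and assembler.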
Comparison of pipelines
STAR [1]/Cufflinks [2] (SC) vs. HISAT2/StringTie [3] (HS)
Input:
• 3 samples from chicken (kidney, liver, heart – pooled)
• 5 samples from cow (heart, cerebral cortex, spleen, liver, kidney)
• With/without annotation
[1] Dobin et al., 2013
[2] Trapnell et al., 2010
[3] Pertea et al., 2016
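To make the "with/without annotation" distinction concrete, the sketch below builds StringTie command lines for the assembly and merge steps of the HS pipeline. It only constructs argument lists and does not run anything. The helper names and file names are hypothetical, and the exact options used in the project are not stated in the slides. The only StringTie options relied on are `-o` (output GTF), `-G` (reference annotation) and `--merge`.

```python
from typing import List, Optional


def stringtie_assemble_cmd(bam: str, out_gtf: str,
                           annotation: Optional[str] = None) -> List[str]:
    """Command line for assembling transcripts from one sorted BAM file."""
    cmd = ["stringtie", bam, "-o", out_gtf]
    if annotation is not None:
        # "With annotation": -G passes a reference GTF to guide assembly.
        cmd += ["-G", annotation]
    return cmd


def stringtie_merge_cmd(gtfs: List[str], out_gtf: str,
                        annotation: Optional[str] = None) -> List[str]:
    """Command line for merging per-tissue assemblies into one transcript set."""
    cmd = ["stringtie", "--merge", "-o", out_gtf]
    if annotation is not None:
        cmd += ["-G", annotation]
    return cmd + list(gtfs)


if __name__ == "__main__":
    tissues = ["heart", "liver", "kidney"]
    for tissue in tissues:
        print(" ".join(stringtie_assemble_cmd(f"{tissue}.bam", f"{tissue}.gtf")))
    print(" ".join(stringtie_merge_cmd([f"{t}.gtf" for t in tissues], "merged.gtf")))
```

Running each configuration once with `annotation=None` and once with a reference GTF gives the "w/o annot" and "annot" variants compared in the deck.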
Comparison of pipelines
Total number of transcripts, per tissue and by merging all tissues
[Figure: size of the transcript sets obtained by running the two pipelines. Panels: COW (heart, c. cortex, spleen, liver, kidney, merged) and CHICKEN (kidney, liver, heart, merged). Y axis: transcripts. Series: SC annot, SC w/o annot, HS annot, HS w/o annot.]
Comparison of pipelines
Impact of the merging step: stringtie-merge on the SC output and cuffmerge on the HS output
[Figure: size of the transcript sets obtained by running the two pipelines. Panels: COW (heart, c. cortex, spleen, liver, kidney, merged, merge_swap) and CHICKEN (kidney, liver, heart, merged, merge_swap). Y axis: transcripts. Series: SC annot, SC w/o annot, HS annot, HS w/o annot.]
Comparison of pipelines
Recovery rate compared to annotation
[Figure, two panels, chicken and cow, series SC w/o annot and HS w/o annot: (1) Number of transcripts exactly overlapping the annotation; (2) Percentage of transcripts exactly overlapping the annotation from the total transcripts predicted.]
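The two recovery metrics can be illustrated with a small sketch. The slides do not say how "exactly overlapping" was computed, so the definition used here is an assumption: a predicted transcript counts as an exact match when its chromosome, strand and full list of exon coordinates are identical to those of an annotated transcript. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]                    # (start, end)
Transcript = Tuple[str, str, List[Exon]]  # (chromosome, strand, exons)


def structure_key(chrom: str, strand: str, exons: List[Exon]):
    """Hashable description of a transcript structure, independent of exon order."""
    return (chrom, strand, tuple(sorted(exons)))


def exact_matches(predicted: Dict[str, Transcript],
                  annotated: Dict[str, Transcript]) -> List[str]:
    """Ids of predicted transcripts whose structure equals an annotated one."""
    reference = {structure_key(*t) for t in annotated.values()}
    return [tid for tid, t in predicted.items() if structure_key(*t) in reference]


def recovery_percentage(predicted: Dict[str, Transcript],
                        annotated: Dict[str, Transcript]) -> float:
    """Exact matches as a percentage of all predicted transcripts."""
    if not predicted:
        return 0.0
    return 100.0 * len(exact_matches(predicted, annotated)) / len(predicted)
```

A looser criterion, such as matching only the intron chain, would tolerate differences at transcript ends and would generally report higher recovery.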
Comparison of pipelines
Assessment of capacity of reconstructing lncRNAs in mouse
Data
• 30 RNA-Seq mouse samples from ENCODE/CSHL, Illumina, paired-end, 100 bp
• GENCODE mouse annotation down-sampled so as to contain a set of genes and transcripts similar to e.g. cow (e.g. 20,000 PCGs orthologous with cow + 3,000 randomly selected PCGs + 2,000 miRNAs/snoRNAs)
Pipelines to test
• STAR -> Cufflinks -> cuffmerge/stringtie-merge -> FEELnc
• HISAT2 -> StringTie -> stringtie-merge/cuffmerge -> FEELnc
Output
• Sensitivity and specificity for predicting lncRNAs and PCGs, based on the exact overlap with the remaining set of the annotation
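The planned output metrics could be computed as sketched below. This is an illustration under stated assumptions, not the project's evaluation code: the held-out part of the annotation provides a true class per transcript ("lncRNA" or "PCG"), the pipeline provides a predicted class for each held-out transcript it recovers by exact overlap, and a held-out transcript that is not recovered counts as "not called". Sensitivity and specificity use the standard definitions TP/(TP+FN) and TN/(TN+FP). Function and variable names are hypothetical.

```python
from typing import Dict, Tuple


def confusion_counts(truth: Dict[str, str], predicted: Dict[str, str],
                     positive: str = "lncRNA") -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for one class over the held-out transcripts."""
    tp = fp = tn = fn = 0
    for tid, true_label in truth.items():
        # None when the transcript was not reconstructed at all.
        called_positive = predicted.get(tid) == positive
        if true_label == positive:
            if called_positive:
                tp += 1
            else:
                fn += 1
        else:
            if called_positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true positives that were called positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn: int, fp: int) -> float:
    """Fraction of true negatives that were not called positive."""
    return tn / (tn + fp) if (tn + fp) else 0.0
```

Calling `confusion_counts` with `positive="PCG"` gives the corresponding figures for protein-coding genes. Transcript assembly benchmarks sometimes report precision, TP/(TP+FP), under the name "specificity", so the definition in use should be stated alongside the results.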
Further work
• Synteny analysis
• Correlation of expression levels across organisms
• Structural predictions
• Public database, connected to existing repositories, e.g. EBI
Acknowledgements
• Daniel Fischer
• Sarah Djebali
• Christian Anton
• Alicja Pacholewska
• Nadezhda Doncheva
• Thomas Derrien
• Lel Eory
• Frank Panitz
• Magda Mielczarek
• Christa Kuhn
• Alan Archibald
• Jan Gorodkin