Ensembl use of RNASeq
Steve Searle, Joint project leader, Ensembl Genebuild team
2 of 34
Introduction
Ways we use RNASeq data in Ensembl:
• Build a complete gene set from scratch for individual or pooled RNASeq data sets
• Incorporate into a new Ensembl gene set
• Add novel models into a gene set
• UTR
• Filtering models
• Improve old gene sets
4 of 34
RNASeq Pipeline: Alignment and Initial Processing
• Reads are aligned to the genome with a quick ungapped alignment using BWA
• Transcriptome reads are split over introns, so we need to allow for this:
  • Align with up to 50% mismatches to get intron-spanning reads to align
• The alignments are then processed to collapse overlapping reads into blocks representing exons
• Read pairing is used (if available) to group the exon blocks into approximate transcript structures
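The step of collapsing overlapping reads into exon blocks can be illustrated with a simple interval merge. This is only a minimal sketch, not the pipeline's own code: the function name `collapse_reads`, the coordinate convention (1-based, inclusive) and the example reads are all assumptions made for illustration, and read pairing is not covered.

```python
from typing import List, Tuple

# (start, end) of one read alignment, 1-based inclusive.
# Assumption: all reads are on the same sequence region and strand.
Interval = Tuple[int, int]


def collapse_reads(reads: List[Interval]) -> List[Interval]:
    """Merge overlapping read alignments into exon-like blocks."""
    blocks: List[Interval] = []
    for start, end in sorted(reads):
        if blocks and start <= blocks[-1][1]:
            # Read overlaps the current block: extend the block if needed.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            # No overlap: this read starts a new block.
            blocks.append((start, end))
    return blocks


# Hypothetical read alignments forming two clusters.
reads = [(100, 175), (150, 225), (200, 275), (900, 975), (950, 1025)]
exon_blocks = collapse_reads(reads)
print(exon_blocks)  # [(100, 275), (900, 1025)]
```

In this sketch the gap between the two blocks would be a candidate intron, to be confirmed by the later spliced alignment.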
6 of 34
RNASeq Pipeline: Intron Alignment
We align split reads using Exonerate, which has a good splice model but is not a short-read aligner.
Intron alignment is made faster in 2 ways:
• Don’t realign all the reads:
  • Introns are resolved by realigning partially aligned reads.
  • Use Exonerate word length to define which reads to realign
• Align to a single transcript:
  • Reads are realigned either to the rough transcript sequence or to the genomic span of the rough transcript.
• Limiting the search space allows us to do a more sophisticated Exonerate alignment with a splice model and a shorter word length.
• Aligning to the genomic span of the transcript can identify small exons that were missed by the BWA alignment; these can be incorporated into the final model.
[Figure: Exonerate spliced alignment. Panels: Partially aligned reads, Split reads, Collapsed Intron Features, Final Models]
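Collapsing split reads into intron features can be sketched as counting how many reads support each distinct intron. This is an illustrative sketch only: the tuple layout, the `min_reads` threshold and the function name `collapse_introns` are assumptions, not part of the Ensembl code.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (seq_region, intron_start, intron_end, strand) implied by one split read.
Intron = Tuple[str, int, int, str]


def collapse_introns(
    split_reads: Iterable[Intron], min_reads: int = 1
) -> List[Tuple[Intron, int]]:
    """Collapse identical introns and report the number of supporting reads."""
    counts = Counter(split_reads)
    # Keep introns with enough support, sorted by position for readability.
    return sorted((intron, n) for intron, n in counts.items() if n >= min_reads)


# Hypothetical introns taken from five split-read alignments.
split_reads = [
    ("1", 1000, 1500, "+"),
    ("1", 1000, 1500, "+"),
    ("1", 1000, 1520, "+"),
    ("1", 2000, 2400, "+"),
    ("1", 1000, 1500, "+"),
]
for intron, depth in collapse_introns(split_reads):
    print(intron, depth)
```

Raising `min_reads` is one simple way to drop introns supported by a single, possibly spurious, read.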
[Figure: BLASTP Coverage (PE12)]
9 of 34
Website Display of RNASeq pipeline results
Data visible in Ensembl:
• Transcript models
• Intron features
• BAM files of BWA alignments
10 of 34
[Figure: Human gene ZMPSTE24. Tracks: RNASeq introns by tissue, RNASeq models by tissue & merged, CCDS, GENCODE transcript]
14 of 34
RNASeq Volume
We are collecting more and more RNASeq data
We now have sizeable RNASeq sets for 12+ species
The pipeline is now being used in production
Further automation has allowed us to speed up model building:
• Process spreadsheet data to automate the pipeline setup and configuration
• Parse metadata out of spreadsheets into the final BAM files
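As a rough illustration of moving spreadsheet metadata into BAM files, the sketch below turns rows of a CSV export into SAM `@RG` header lines. The column names (`run_id`, `sample`, `tissue`), the mapping onto the `ID`, `SM` and `DS` tags, and the function name are hypothetical; the slides do not say which header fields the pipeline actually uses.

```python
import csv
import io
from typing import List


def read_group_lines(sheet_csv: str) -> List[str]:
    """Build one SAM @RG header line per spreadsheet row."""
    lines = []
    for row in csv.DictReader(io.StringIO(sheet_csv)):
        fields = [
            "@RG",
            f"ID:{row['run_id'].strip()}",   # read group identifier
            f"SM:{row['sample'].strip()}",   # sample name
            f"DS:{row['tissue'].strip()}",   # free-text description
        ]
        # SAM header fields are tab-separated.
        lines.append("\t".join(fields))
    return lines


# Hypothetical spreadsheet content exported as CSV.
sheet = """run_id,sample,tissue
run1,animal_A,liver
run2,animal_A,brain
"""
for line in read_group_lines(sheet):
    print(line)
```

The resulting lines could then be written into a BAM header with a tool such as `samtools reheader`; that step is outside this sketch.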
16 of 34
Using RNASeq in the Ensembl genebuild pipeline
Some species have little species-specific data, e.g. Nile tilapia:
• 131 proteins in UniProt
• 35 cDNAs, 119,531 ESTs
• Rely on data from related species
RNASeq supplements the above data:
• Species-specific
• Fills gaps, alternate splice sites, faster genebuild
17 of 34
[Figure: Genebuild process flowchart. Boxes: Raw Computes, Targeted stage, Similarity stage, cDNAs/ESTs, UTR addition, Final gene set, Filtering, Filtering, TranscriptConsensus, LayerAnnotation, Annotation Projection (primates)]
18 of 34
[Figure: Genebuild process flowchart with RNASeq. Boxes: Raw Computes, Targeted stage, Similarity stage, cDNAs/ESTs, UTR addition, Final gene set, Filtering, Filtering, Merged RNA-Seq models, Annotation Projection (primates)]
19 of 34
RNASeq helps with: 1. Choice of splice site
[Figure tracks: RNASeq, Similarity models, Ensembl model]
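One way to picture how RNASeq can guide the choice of splice site is to score each candidate model by the fraction of its introns that exactly match an RNASeq intron. The sketch below is an assumption-laden illustration, not the scoring used in the genebuild: the names `introns_of` and `intron_support`, the exact-match rule and the coordinates are all hypothetical.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]    # (start, end), 1-based inclusive
Intron = Tuple[int, int]  # (start, end), 1-based inclusive


def introns_of(exons: List[Exon]) -> List[Intron]:
    """Derive the introns lying between consecutive exons of one model."""
    ordered = sorted(exons)
    return [
        (ordered[i][1] + 1, ordered[i + 1][0] - 1)
        for i in range(len(ordered) - 1)
    ]


def intron_support(exons: List[Exon], rnaseq_introns: Set[Intron]) -> float:
    """Fraction of a model's introns that exactly match an RNASeq intron."""
    introns = introns_of(exons)
    if not introns:
        return 0.0  # single-exon model: no splice sites to confirm
    return sum(intron in rnaseq_introns for intron in introns) / len(introns)


# Hypothetical RNASeq introns and two similarity models that differ
# only in the acceptor site of the second exon.
rnaseq_introns = {(201, 499), (601, 899)}
model_a = [(100, 200), (500, 600), (900, 1000)]
model_b = [(100, 200), (506, 600), (900, 1000)]

best = max([model_a, model_b], key=lambda m: intron_support(m, rnaseq_introns))
print(intron_support(model_a, rnaseq_introns))  # 1.0
print(intron_support(model_b, rnaseq_introns))  # 0.5
```

Here `model_a` would be preferred because both of its introns are confirmed by the RNASeq data.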
20 of 34
RNASeq helps with: 2. UTR addition
[Figure tracks: RNASeq model, Similarity model, Ensembl model]
21 of 34
RNASeq helps with: 3. New models
[Figure tracks: RNASeq introns, RNASeq model, Similarity model, Ensembl model]
22 of 34
Species with RNASeq used in generating Ensembl gene set
Released:
• Zebrafish
• Tasmanian Devil
• Coelacanth
• Tilapia
In progress:
Dog, Turtle, Rat, Cat, Chicken, Platyfish
RNASeq is becoming a central part of the genebuild process, and many species will have RNASeq components going forward
24 of 34
Gene set Update Pipeline using RNASeq
1. RNA-Seq
• The RNA-Seq pipeline is highly automated; many species take around a week to process
2. Split core gene set into single-transcript genes
3. Transcript scoring / filtering
• UTR addition done at the same time
4. Layering
• avoiding pseudogenes
• gap filling with fragments
5. Rebuild core set
6. Transfer pseudogenes + ncRNAs
The gene set update pipeline is fast and uses existing code in a novel way with very few alterations
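The layering step (step 4) can be pictured as taking models from prioritised layers, where a lower-priority model is kept only if it does not overlap anything accepted from a higher layer. The sketch below is a simplified illustration under that assumption; the layer order, the model names and the function `layer_models` are hypothetical and do not reproduce the Ensembl implementation.

```python
from typing import List, Tuple

Model = Tuple[str, int, int]  # (name, start, end), 1-based inclusive


def overlaps(a: Model, b: Model) -> bool:
    """True if the genomic spans of two models overlap."""
    return a[1] <= b[2] and b[1] <= a[2]


def layer_models(layers: List[List[Model]]) -> List[Model]:
    """Accept models layer by layer, highest priority first."""
    accepted: List[Model] = []
    for layer in layers:
        # Compare only against models accepted from higher-priority layers,
        # so models within the same layer never block each other.
        blocking = list(accepted)
        for model in layer:
            if not any(overlaps(model, kept) for kept in blocking):
                accepted.append(model)
    return accepted


# Hypothetical layers: best-supported models first, fragments last.
layers = [
    [("rnaseq_1", 100, 500), ("rnaseq_2", 450, 800)],
    [("similarity_1", 300, 600), ("similarity_2", 2000, 2500)],
    [("fragment_1", 2400, 2600), ("fragment_2", 5000, 5200)],
]
for name, start, end in layer_models(layers):
    print(name, start, end)
```

In this example `fragment_2` fills a gap that no higher layer covers, while `similarity_1` and `fragment_1` are dropped because better-supported models already occupy those loci.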
[Figure: Filter and add UTRs. Tracks: RNASeq model, Ensembl models, RNASeq Introns]
[Figure labels: Add ‘UTR’, Extend CDS. Tracks: RNASeq models, Ensembl models, RNASeq Introns]
31 of 34
Results: Monodelphis and Platypus
Genes Transcripts
19,466 32,541
21,324 22,307
132
Genes Transcripts
17,951 26,836
21,695 23,581
204
before merge
after merge
joined genes
32 of 34
Gene set update pipeline: Summary
Quick, straightforward method of tidying up gene sets
Adds species-specific models into gene sets that were previously based mostly on proteins from other species
Much more efficient than a new genebuild
Future work:
Lots of other species we could apply this to
See what effect it has on primates / projection builds (in progress)
33 of 34
Ensembl Use of NHPRT data
Primates in Ensembl currently: Chimp, Gorilla, Rhesus macaque, Marmoset, Mouse lemur*, Squirrel monkey+, Baboon+, Orangutan, Gibbon, Tarsier* (+ = Pre!, * = 2x)
Run RNASeq pipeline on NHPRT primates in Ensembl to generate:
– Transcript models
– Introns
– BAM files of alignments
(would like individual tissue RNASeq data for this)
Use NHPRT RNASeq in Ensembl gene builds on new species, e.g. Baboon
Use NHPRT RNASeq to improve existing Ensembl gene sets, e.g. Rhesus macaque
Consider other uses:
– Targeted improvement of models for ‘important’ genes (disease related)
– Long non-coding genes
– Alignment to human
34 of 34
Steve Searle
Bronwen Aken
Daniel Barrell
Susan Fairley
Carlos Garcia Giron
Thibaut Hourlier
Andreas Kahari
Rishi Nag
Magali Ruffier
Amy Tang
Jan-Hinnerk Vogel
Amonida Zadissa
Acknowledgements
John E Collins
Stephen Keenan
Henrik Kaessmann
Jessica Alfoldi
Illumina (Human Body Map data)