Systematic evaluation of spliced alignment programs for RNA-seq data


Engström et al. (Nature Methods 2013)

Presented by Monica Dragan



bioinformatics.ca

Mapping the reads to a reference genome or a transcriptome database

Deep sequencing (with NGS)


Why RNA sequencing?

Functional studies

Gene prediction is difficult


Mapping strategies depend on read length

Read length < 50 bp: short (unspliced) aligners

BWA, Bowtie

Read length > 50 bp: spliced alignment programs (in mRNA sequences the introns have been removed)

GSNAP, MapSplice, STAR, PALMapper, TopHat, ReadsMap, PASS, SMALT

Outline

Challenges in RNA sequence alignment

The aim of this paper

Existing spliced-alignment software

Conclusions


Challenges in RNA-seq alignment

Large number of reads (~100M) = computationally expensive

Compression with Burrows-Wheeler Transform
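To make the transform itself concrete, here is a minimal sketch of a Burrows-Wheeler transform and its inverse. It uses the naive sorted-rotations construction, which is far too slow for a real genome; the function names and the `$` sentinel are illustrative assumptions, not taken from any particular aligner.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
```

The transform groups identical characters into runs, which is what makes the index compressible while still being searchable.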


RNA splicing: mRNA transcripts do not include the introns, so the alignment program must handle gapped (or spliced) alignment with very large gaps


RNA splicing / alternative splicing: a single gene may code for multiple proteins





Paired read separation issue



Pseudogenes: pseudogenes often have sequences highly similar to functional, intron-containing genes, so RNA reads can be incorrectly mapped to them

The human genome contains over 14,000 pseudogenes [Pei et al. Genome Biol 2012]



Duplications: may correspond to biased PCR amplification of particular fragments


The aim of this paper

Assess the performance of 26 RNA-seq alignment protocols, based on 11 programs, on real and simulated human and mouse transcriptomes

Alignment protocols were evaluated on Illumina 76-nucleotide paired-end RNA-seq data from the human leukemia cell line K562 (1.3 × 10^9 reads), mouse brain (1.1 × 10^8 reads) and two simulated data sets

Existing spliced-alignment software

TopHat

MapSplice

STAR

GSNAP

TopHat
Trapnell, Pachter, and Salzberg (2009)

unspliced alignment

- reads that map to more than 10 locations
- reads that have more than a few mismatches

assemble islands of sequences

Such an approach will identify only known or predicted combinations of exons

TopHat

unspliced alignment, followed by spliced alignment

Known junction signals: GT-AG, GC-AG, and AT-AC

If an alignment extends into an intron region, realign the reads to the adjacent exons instead
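The role of the junction signals can be sketched as follows. This is a simplified illustration, not TopHat's actual implementation: the function `candidate_introns`, its parameters and the `flank` tolerance are hypothetical. Given the end of an upstream island and the start of a downstream island, it lists the intron intervals whose first and last dinucleotides form one of the three known donor/acceptor pairs.

```python
# Donor/acceptor dinucleotide pairs named on the slide.
KNOWN_SIGNALS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def candidate_introns(genome, left_island_end, right_island_start, flank=2):
    """Return half-open (start, end) intron intervals flanked by a known signal.

    left_island_end:    index just past the last base of the upstream island
    right_island_start: index of the first base of the downstream island
    flank:              how many bases each boundary may shift (assumed value)
    """
    hits = []
    first_start = max(0, left_island_end - flank)
    last_end = min(len(genome), right_island_start + flank)
    for start in range(first_start, left_island_end + flank + 1):
        for end in range(right_island_start - flank, last_end + 1):
            if end - start < 4:  # too short to hold both dinucleotides
                continue
            donor = genome[start:start + 2]
            acceptor = genome[end - 2:end]
            if (donor, acceptor) in KNOWN_SIGNALS:
                hits.append((start, end))
    return hits


if __name__ == "__main__":
    toy_genome = "AAACCC" + "GTAAGTTTTTCAG" + "GGGTTT"
    print(candidate_introns(toy_genome, 6, 19))
```

Restricting the search to these signals keeps the number of candidate junctions small, at the cost of missing junctions with other boundary dinucleotides.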


MapSplice
Wang et al. (2010)

Similar to TopHat

Reads = tags

A tag T has an exonic alignment if it can be aligned in its entirety to a consecutive sequence of nucleotides in the reference genome G.

T has a spliced alignment if its alignment to G requires one or more gaps.


Step 1: exonic alignment

Step 2: spliced alignment

Find the spliced alignment of segment t_{j+1} to the genomic interval between anchors t_j and t_{j+2}

Consider all possible positions of the splice site and map according to the Hamming distance
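The splice-site search in Step 2 can be sketched as below. This is a minimal illustration under stated assumptions, not the MapSplice code: the function `best_splice_split` and its coordinate arguments are hypothetical. The unaligned segment is split at every possible position; the left part is compared with the genome directly after the upstream anchor, the right part with the genome directly before the downstream anchor, and the split with the smallest total Hamming distance wins.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_splice_split(segment, genome, donor_start, acceptor_end):
    """Try every split point of the segment between two anchors.

    donor_start:  genome index right after the upstream anchor t_j
    acceptor_end: genome index where the downstream anchor t_{j+2} begins
    Returns (k, distance): segment[:k] is placed at donor_start and
    segment[k:] is placed so that it ends at acceptor_end.
    """
    n = len(segment)
    if donor_start < 0 or acceptor_end - donor_start < n:
        raise ValueError("interval between anchors is shorter than the segment")
    best = None
    for k in range(n + 1):
        left = genome[donor_start:donor_start + k]
        right = genome[acceptor_end - (n - k):acceptor_end]
        distance = hamming(segment[:k], left) + hamming(segment[k:], right)
        if best is None or distance < best[1]:
            best = (k, distance)
    return best


if __name__ == "__main__":
    toy_genome = "AAAA" + "CCT" + "GTTTTTAG" + "TTA" + "GGGG"
    print(best_splice_split("CCTTTA", toy_genome, 4, 18))
```

Ties between split positions are common in real data; this sketch simply keeps the first one, whereas a real aligner would need an explicit tie-breaking rule.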


Step 3: merge candidate segment alignments


STAR
Dobin et al. (2012)

Maximal Mappable Prefix (at read location i) = the longest substring of the read, starting at position i, that exactly matches one or more substrings of the reference genome

poor genomic alignment

Detect: (a) splice junctions, (b) mismatches, (c) tails
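The maximal mappable prefix search can be illustrated with a naive sketch. STAR itself searches an index rather than scanning the genome; here plain string search stands in for the index, and the helper names `find_all`, `maximal_mappable_prefix` and `seed_chain` are hypothetical. Restarting the search at the first unmapped base is what lets a read that crosses a splice junction fall apart into two seeds.

```python
def find_all(text, pattern):
    """All start positions of pattern in text (overlapping matches included)."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def maximal_mappable_prefix(read, genome, i=0):
    """Longest prefix of read[i:] with an exact match in the genome.

    Returns (length, genome_positions); length is 0 if read[i] never occurs.
    """
    length, positions = 0, []
    for end in range(i + 1, len(read) + 1):
        hits = find_all(genome, read[i:end])
        if not hits:
            break
        length, positions = end - i, hits
    return length, positions


def seed_chain(read, genome):
    """Repeat the search from the first unmapped base until the read is used up."""
    seeds, i = [], 0
    while i < len(read):
        length, positions = maximal_mappable_prefix(read, genome, i)
        if length == 0:
            i += 1  # base with no exact match at all; skip it
            continue
        seeds.append((i, length, positions))
        i += length
    return seeds


if __name__ == "__main__":
    toy_genome = "TTACGGATCA" + "GTAAAAAAAG" + "CCTGAGTCTA"
    toy_read = "GGATCA" + "CCTGAG"  # spans the toy intron
    print(seed_chain(toy_read, toy_genome))
```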


GSNAP
Wu and Nacu (2010)

Efficient detection of indels and splice pairs:

For large genomes, it is more efficient to preprocess the genome rather than the reads, creating genomic index files that provide the genomic positions for a given prefix/suffix.

Works with candidate regions in the reference genome (keeps track of the read locations of the 12-residue segments that support each candidate region).
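The idea of indexing the genome once and then collecting candidate regions per read can be sketched as follows. This is a simplified illustration, not GSNAP's data structure: the functions `build_kmer_index` and `candidate_regions` are hypothetical, and every genome position is indexed, which a real index for a large genome would likely avoid. Each read k-mer votes for the diagonal (genome position minus read position) on which it matches, and well-supported diagonals are the candidate regions.

```python
from collections import Counter, defaultdict


def build_kmer_index(genome, k=12):
    """Map every k-mer of the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def candidate_regions(read, index, k=12):
    """Count supporting k-mers per diagonal, best-supported diagonal first.

    Returns a list of (diagonal, support) pairs, where
    diagonal = genome position - read position.
    """
    support = Counter()
    for read_pos in range(len(read) - k + 1):
        kmer = read[read_pos:read_pos + k]
        for genome_pos in index.get(kmer, ()):
            support[genome_pos - read_pos] += 1
    return support.most_common()


if __name__ == "__main__":
    toy_genome = "TTACGGATCAGTAAAAAAAGCCTGAGTCTA"
    toy_index = build_kmer_index(toy_genome, k=4)  # small k for the toy example
    print(candidate_regions("GGATCAGT", toy_index, k=4))
```

Because the index is built once per genome, its cost is shared across all reads, which is the efficiency argument made on the slide.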

For a more powerful use of the algorithms:

use available gene annotations, which allow it to avoid erroneously mapping reads to pseudogenes

use the information from both ends of a paired-end read


Conclusions

Mismatches and basewise accuracy

MapSplice, PASS and TopHat display a low tolerance for mismatches. Consequently, a large proportion of reads with low base-call quality scores were not mapped by these methods.

GSNAP, GSTRUCT, MapSplice, PASS, SMALT and STAR allow mismatches and can also output an incomplete alignment when they are unable to map an entire sequence.

Reads from mouse were mapped (against the mouse reference assembly) at a greater rate and with fewer mismatches than those from K562 (the cancer cell line K562 has accumulated many mutations relative to the human reference assembly).

Indel frequency and accuracy

GSTRUCT produced the most uniform distribution of indels (coefficient of variation (CV) = 0.32)

TopHat produced the most variable distribution (CV = 1.5 and 1.1 splice junctions)
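For reference, the coefficient of variation is the standard deviation divided by the mean, so a value near 0 means the counts are spread evenly. The sketch below assumes the indel counts have been collected per bin (for example per position in the read) and uses the population standard deviation; both choices are assumptions for illustration, and the counts are made-up toy values, not data from the paper.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts):
    """CV = standard deviation / mean of a list of per-bin counts."""
    average = mean(counts)
    if average == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(counts) / average


if __name__ == "__main__":
    uniform_counts = [10, 10, 10, 10]  # toy values: perfectly even
    skewed_counts = [0, 0, 0, 40]      # toy values: all indels in one bin
    print(coefficient_of_variation(uniform_counts))
    print(coefficient_of_variation(skewed_counts))
```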

Size distribution of indels for the human K562 data set

Precision and recall, stratified by indel size

GEM and PALMapper output included more indels than any other method

GEM and PALMapper report many false indels (precision)

GSNAP and GSTRUCT exhibit high sensitivity for deletions, independent of size (recall)

The TopHat2 protocol is the most sensitive method for long insertions (recall)
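Precision and recall stratified by indel size can be computed along the following lines. This is a sketch under stated assumptions: indels are represented as hypothetical `(chrom, position, kind, size)` tuples, a reported indel counts as correct only if it matches a true indel exactly, and the example calls are invented toy data.

```python
def precision_recall(reported, truth):
    """Precision = TP / reported, recall = TP / truth (0.0 if the set is empty)."""
    reported, truth = set(reported), set(truth)
    true_positives = len(reported & truth)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def stratified_precision_recall(reported, truth):
    """Precision and recall per indel size; size is the 4th tuple field."""
    sizes = {indel[3] for indel in reported} | {indel[3] for indel in truth}
    result = {}
    for size in sorted(sizes):
        reported_of_size = {x for x in reported if x[3] == size}
        truth_of_size = {x for x in truth if x[3] == size}
        result[size] = precision_recall(reported_of_size, truth_of_size)
    return result


if __name__ == "__main__":
    truth = {("chr1", 100, "del", 1), ("chr1", 200, "ins", 1), ("chr1", 300, "del", 3)}
    reported = {("chr1", 100, "del", 1), ("chr1", 150, "del", 1), ("chr1", 300, "del", 3)}
    print(stratified_precision_recall(reported, truth))
```

An aligner that reports many indels can reach high recall while its precision drops, which is the pattern described for GEM and PALMapper.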

Spliced alignment

Highly accurate junction discovery for ReadsMap, GSNAP, GSTRUCT, MapSplice and TopHat

The number of false junction calls was greatly reduced when junctions were filtered by supporting alignment counts (plot c)

Protocols using annotation recovered nearly all of the known junctions in expressed transcripts (plot d)

For novel-junction discovery, GSTRUCT outperformed other methods
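Filtering junctions by supporting alignment counts can be sketched as below. The representation of a junction as a `(chrom, intron_start, intron_end)` tuple, the function name `filter_junctions` and the threshold of 2 are assumptions for illustration, not the settings used in the paper.

```python
from collections import Counter


def filter_junctions(spliced_alignments, min_support=2):
    """Keep junctions supported by at least min_support alignments.

    spliced_alignments: iterable with one (chrom, intron_start, intron_end)
    tuple per spliced alignment that crosses the junction.
    Returns a dict mapping each retained junction to its support count.
    """
    support = Counter(spliced_alignments)
    return {junction: count for junction, count in support.items()
            if count >= min_support}


if __name__ == "__main__":
    alignments = [("chr1", 100, 500)] * 3 + [("chr1", 100, 620)]
    print(filter_junctions(alignments))
```

Raising the threshold removes more false junctions but also discards true junctions in weakly expressed transcripts, so the cutoff is a trade-off rather than a fixed value.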

Conclusions

GSNAP, GSTRUCT, MapSplice and STAR compared favorably to the other methods

MapSplice seems to be a conservative aligner with respect to mismatch frequency, indel and exon junction calls.

The most significant issue with GSNAP, GSTRUCT and STAR is the presence of many false exon junctions in the output.

Both GSNAP and GSTRUCT require considerable computing time when parameterized for sensitive spliced alignment

Thank you!

Remaining challenges:

Remaining challenges include exploiting gene annotation without introducing bias, correctly placing multimapped reads, achieving optimal yet fast alignment around gaps and mismatches, and reducing the number of false exon junctions reported. Ongoing developments in sequencing technology will demand efficient processing of longer reads with higher error rates and will require more extensive spliced alignment as reads span multiple exon junctions. We expect performance of the aligners evaluated here to improve as current shortfalls are addressed. Differential treatment of these issues will enhance and expand the range of RNA-seq aligners suited to varied computational methodologies and analysis aims.

Some RNA-seq aligners, including GSNAP [5], RUM [6], and STAR [7], map reads independently of the alignments of other reads, which may explain their lower sensitivity for these spliced reads

GSNAP [5] and STAR [7] also make use of annotation, although they use it in a more limited fashion in order to detect splice sites

have shown how suffix arrays (Manber and Myers, 1990), compressed using a Burrows-Wheeler Transform (BWT) (Burrows and Wheeler, 1994), can rapidly map reads that are exact matches or have a few mismatches or short insertions or deletions (indels) relative to the reference.

A third approach, provided by the QPALMA program (Bona et al., 2008), can align individual reads across exon-exon junctions using Smith-Waterman-type alignments and a specifically trained splice site model.


Functional Genomics, SS2014

Monday, 10 March 2014

