Seyed Abolfazl Motahari RNA-seq -...
Seyed Abolfazl Motahari
RNA-seq Data Analysis
Basics
Next Generation Sequencing
Biological Samples Data
Data Volume
Cost
Disease diagnosis
Biological protection
Drug design
Production of new biological materials
Control of biological systems
Data analysis
Big Data Analysis in Biology
DNA Fragment
Read
Sequencer
ACCGTAACCTACTTAGTA
The sequence generated by a sequencing machine from a DNA fragment.
Read
DNA Fragment
ACCGTAACCTACTTAGTA
Reads
Sequencer
Data from a pair of reads sequenced from ends of the same DNA fragment. The genomic distance between the reads is approximately known and is used to constrain assembly solutions.
CTAGTAACCTTAACCGTA
Paired-end Read
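A minimal sketch of how a pair of reads relates to one fragment, assuming an Illumina-style layout in which read 1 comes from the start of the fragment and read 2 from the opposite strand at the other end. The function names and the read length of 5 are illustrative choices, not part of the slides.

```python
# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_len: int):
    """Return (read1, read2, inner_distance) for one fragment.

    Assumption: read 1 is the first read_len bases of the fragment and
    read 2 is the reverse complement of the last read_len bases.
    """
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment[-read_len:])
    # Number of unsequenced bases between the two reads.
    inner_distance = len(fragment) - 2 * read_len
    return read1, read2, inner_distance


fragment = "ACCGTAACCTACTTAGTA"  # fragment shown on the slide
read1, read2, inner_distance = paired_end_reads(fragment, 5)
print(read1, read2, inner_distance)
```

The inner distance is the "approximately known" quantity that an assembler can use as a constraint.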
[Figure: mate-pair library preparation. Labels: Long DNA Fragment (12 kb); Biotinylated (B); Shearing; Capture]
Similar to paired-end reads, but with longer inserts.
Mate-pair Read
Maxam-Gilbert Sequencing
Sanger Sequencing
Basic Methods
Polony Sequencing
454 Pyrosequencing
Illumina Sequencing
SOLiD Sequencing
Ion Torrent Semiconductor Sequencing
DNA Nanoball Sequencing
Heliscope Single Molecule Sequencing
Single Molecule Real-time Sequencing
Next Generation Methods
Nanopore DNA Sequencing
Tunneling Currents DNA Sequencing
Sequencing by Hybridization
Sequencing with Mass Spectrometry
Microfluidic Sanger Sequencing
Microscopy-based Sequencing
Methods in Development
Sequencing Technologies
Technology    Read Length (bp)   Error Rate   Paired-end
Sanger        up to 2k           2%           Yes
ABI/SOLiD     75                 2%           Yes
Illumina      100-150            2%           Yes
Roche/454     400-600            4%           No
IonTorrent    200                4%           No
PacBio        up to 15k          18%          Yes
Potentially, all sequencing technologies can be used to sequence mate-pair libraries obtained by circularizing long DNA fragments.
Comparison of Sequencers
Biological samples
Genomes (DNA) Transcriptome (RNA)
Size selection
Capturing
Whole Genome
Capturing Regions
NGS Applications
RNA-seq miRNA-seq
CLIP-seq
Whole Exome Epigenomics
ChIP-seq methyl-seq DNase-seq
Raw Data
Pre-processing
Assembly
De Novo/Alignment
App
DNA Prep
Library Prep
Chip Prep
Sequencing
Typical Workflow
Tasks
UNIX
Advantages of Unix
- Unix is more flexible and can be installed on many different types of machines, including mainframe computers, supercomputers, and microcomputers.
- Unix is more stable and does not go down as often as Windows does, and therefore requires less administration and maintenance.
- Unix has greater built-in security and permissions features than Windows.
- Unix possesses much greater processing power than Windows.
- Unix is the leader in serving the Web. About 90% of the Internet relies on Unix operating systems running Apache, the world's most widely used Web server, which is free.
- Software upgrades from Microsoft often require the user to purchase new or more hardware or prerequisite software. That is not the case with Unix.
- The mostly free or inexpensive open-source operating systems, such as Linux and BSD, with their flexibility and control, prove to be very attractive to (aspiring) computer wizards. Many of the smartest programmers are developing state-of-the-art software free of charge for the fast-growing "open-source movement".
- Unix also inspires novel approaches to software design, such as solving problems by interconnecting simpler tools instead of creating large monolithic application programs.
UNIX
Command Line
Graphical User Interface (GUI)
Command Line Interface (CLI)
General Command
command [-options] [args ...]
The “prompt”
The current directory (“path”)
The host
MacBook-Pro:abolfazl$ bowtie -v 3 -S human reads.fastq > aligned.sam
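A small sketch that takes the example line apart into the pieces labeled on the slide (host, current directory, command, options and arguments). It assumes the prompt has the form `host:directory$`; the function name `split_prompt_line` is made up for illustration.

```python
import shlex


def split_prompt_line(line: str):
    """Split 'host:directory$ command ...' into host, directory and tokens."""
    prompt, _, command = line.partition("$")
    host, _, directory = prompt.partition(":")
    # shlex.split tokenizes the command the way a POSIX shell would.
    return host, directory, shlex.split(command)


line = "MacBook-Pro:abolfazl$ bowtie -v 3 -S human reads.fastq > aligned.sam"
host, directory, tokens = split_prompt_line(line)

print("host:     ", host)
print("directory:", directory)
print("command:  ", tokens[0])
print("the rest: ", tokens[1:])
```

In this command, `-v 3` and `-S` are options, `human` and `reads.fastq` are arguments, and `> aligned.sam` is handled by the shell, which redirects the program's output into a file.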
Unix File System
/home/Abolfazl/Data
The Path
/
  Applications
  bin
  home
    Abolfazl
      Data
    Sajjad
  usr
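A short sketch of how the example path breaks down into the levels of the directory tree, using Python's standard `pathlib` module. Nothing is read from disk, so the directories do not need to exist.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/home/Abolfazl/Data")

# Each component is one level of the tree, starting at the root "/".
print(path.parts)

# The parent is the directory that contains Data.
print(path.parent)

# The name is the last component of the path.
print(path.name)
```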
>gi|123440403|ref|NC_008800.1| Yersinia enterocolitica subsp. enterocolitica 8081 chromosome, complete genome
GATCTTTTTATTTAAAGATCTCTTTATTAGATCTCTTATTAGGATCATGACCTTCTGTGGATAAGTGATTATTCACATTTAAGATCATGTGATTAAGGAGGATCGTTTGCTGTGAATGATCGGTGATCCTATTGCGTATAAGCTGGGATCTAAATGGCATGTTATGCACAGGCACTTTAAGTTACTAAGGTTGTTATGTGGATATGTACTGCTTATACCCTGCTTTCAAGCTTACTTATCCACATTCGTTCGCGTGATCTTTAAGCAAATTAGAGTAAATTAATCCAGTTTTTAACCCAAATCTCTGCCGGATCCTCAGGAATTTCATGTTTGATGACGTCAATTTCTAAAATATCACCCACACGAATGGCTCCCTGGATGATCAGTTGCTGATCCAATTTTCTGACCGCACCACAGAAAGTGTCATATTCTGAACTGCCCAAACCAACAGCACCAAAGCGAACCTGTGAGAGATCCGGTCTCTGCTGCTCGATTTGTTCTAATAAGGGTTGAAGATTGTCTGGCAGATCACCTGCACCATGAGTGGAAGTGATTATTAGCCACATACCATCCAAGGTCAGCTCGTCTAATTCCGGGCCATGCAGAGTTTCTGTCGTGAAACCCGCCTCTTCTAATTTCTCAGCTAAATGTTCAGCAACGTATTCAGCACTGCCAAGCGTACTGCCACTGATCAAGGTAATGTCAGCCATAAAGACCCCAACCGAAGTAATGAACCGGTATTGTACGCTGTGAATCAGCTGGGATCTACCTGTGGATAATGTGGGTATAGTTATTTAGTGCTCAGGGCACGATGGTACGCATGATGGGGTTTTGCAGGGAAATAAGAGTCTCGGTTGACTGGATCTCATCAATAGTTTGGATCTTGTTGATAAGTACCTGTTGCAGTGCATCTATCGATTTACACATGACCTTAATAAAGATGCTGTAATGGCCAGTGGTGTAATAGGCCTCGACAACTTCTTCTAAACTTTCCAGTTTTTTTAATGCAGAAGGGTAATCTTTGGCACTTTTCAAAATGATGCCGATGAA
Header
Sequence
Fasta Format
Genome Format
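A minimal FASTA reader, as a sketch of the header/sequence layout: every line starting with `>` opens a new record, and the lines that follow are joined into one sequence. The function name `parse_fasta` is illustrative, and the example holds only the first 40 bases of the genome above, wrapped over two lines.

```python
def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Sequence lines may be wrapped, so join them back together.
    return {name: "".join(parts) for name, parts in records.items()}


example = """>gi|123440403|ref|NC_008800.1| Yersinia enterocolitica subsp. enterocolitica 8081 chromosome, complete genome
GATCTTTTTATTTAAAGATC
TCTTTATTAGATCTCTTATT
"""

genome = parse_fasta(example)
for name, sequence in genome.items():
    print(name.split("|")[3], len(sequence), sequence[:10])
```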
Read Format (Fastq)
Header
FASTQ files extend FASTA files in that they provide both sequence and quality. A FASTQ record thus typically consists of four lines:
1. A line starting with @ containing the sequence identifier
2. The actual sequence
3. A line starting with +, after which the sequence identifier is optional
4. A line with quality values, which are encoded as ASCII characters
As such, the 2nd and 4th lines must have the same length. One such entry is given below, showing one sequence starting with "ATGTCT":
@HWI-ST999:102:D1N6AACXX:1:1101:1235:1936 1:N:0:
ATGTCTCCTGGACCCCTCTGTGCCCAAGCTCCTCATGCATCCTCCTCAGCAACTTGTCCTGTAGCTGAGGCTCACTGACTACCAGCTGCAG
+
1:DAADDDF<B<AGF=FGIEHCCD9DG=1E9?D>CF@HHG??B<GEBGHCG;;CDB8==C@@>>GII@@5?A?@B>CEDCFCC:;?CCCAC
Quality Scores
Sequence
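A sketch of reading one four-line FASTQ record and turning the quality line into numbers. It assumes the common Phred+33 encoding (quality = ASCII code minus 33); the slides do not state the offset, so treat it as an assumption. The function names are illustrative.

```python
def parse_fastq_record(lines):
    """Check the four-line layout and return (identifier, sequence, quality)."""
    header, sequence, separator, quality = lines
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality


def phred_scores(quality: str, offset: int = 33):
    """Decode quality characters; offset 33 is an assumption (Phred+33)."""
    return [ord(ch) - offset for ch in quality]


record = [
    "@HWI-ST999:102:D1N6AACXX:1:1101:1235:1936 1:N:0:",
    "ATGTCTCCTGGACCCCTCTGTGCCCAAGCTCCTCATGCATCCTCCTCAGCAACTTGTCCTGTAGCTGAGGCTCACTGACTACCAGCTGCAG",
    "+",
    "1:DAADDDF<B<AGF=FGIEHCCD9DG=1E9?D>CF@HHG??B<GEBGHCG;;CDB8==C@@>>GII@@5?A?@B>CEDCFCC:;?CCCAC",
]

identifier, sequence, quality = parse_fastq_record(record)
scores = phred_scores(quality)
print(len(sequence), scores[:3])

# A Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
print([round(10 ** (-q / 10), 4) for q in scores[:3]])
```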
RNA-seq
Why?
1. To assemble the transcriptome
2. To find the expression levels
Assembly paradigms (depending on other sources):
1. De Novo
2. Genome Guided
3. Transcriptome Guided
Nature Biotechnology, Volume 28, Number 5, May 2010, p. 421
Advancing RNA-Seq analysis
Brian J Haas & Michael C Zody
Brian J. Haas and Michael C. Zody are at the Broad Institute, Cambridge, Massachusetts, USA.
New methods for analyzing RNA-Seq data enable de novo reconstruction of the transcriptome.
Sequencing of RNA has long been recognized as an efficient method for gene discovery [1] and remains the gold standard for annotation of both coding and noncoding genes [2]. Compared with earlier methods, massively parallel sequencing of RNA (RNA-Seq) [3] has vastly increased the throughput of RNA sequencing and allowed global measurement of transcript abundance. Two reports in this issue introduce approaches for RNA-Seq analysis that capture genome-wide transcription and splicing in unprecedented detail. Trapnell et al. [4] describe a software package, Cufflinks, for simultaneous discovery of transcripts and quantification of expression levels and apply it to study gene expression and splicing during the differentiation of mouse myoblast cells. Taking a similar approach, Guttman et al. [5] use software called Scripture to reannotate the transcriptomes of three mouse cell lines, defining complete gene models for hundreds of new large intergenic noncoding RNAs (lincRNAs) [6].
Although transcript sequencing has been possible for nearly 20 years, until recently it required the construction of clone libraries. Projects to determine full-length gene structures for human, mouse and other important models have taken years to complete [7]. With new sequencing technologies, no cloning is needed, allowing direct sequencing of cDNA fragments. In a matter of days and at a small fraction of the cost of earlier projects, one can achieve reasonably complete coverage of a transcriptome [8]. But this approach has been hindered by a substantial challenge: without cloning, one cannot know a priori which reads came from which transcripts. Recent studies analyzed gene expression and alternative splicing by mapping short RNA-Seq reads to previously known or predicted transcripts [9,10]. Although highly informative, such studies are inherently limited to known genes and to alternative splicing across previously identified splice junctions. To fully leverage RNA-Seq data for biological discovery, one should be able to reconstruct transcripts and accurately measure their relative abundance without reference to an annotated genome.
Previous efforts to reconstruct transcripts from short RNA-Seq reads have followed two general strategies (Fig. 1). The first, a de novo assembly approach implemented in the ABySS software [11], reduces the annotation problem to that of aligning full-length cDNAs, which is well handled by several algorithms. This method is also applicable to the discovery of transcripts that are missing or incomplete in the reference genome and to RNA-Seq data from organisms lacking a genome reference.
[Figure 1 panel labels: RNA-Seq reads; Align reads to genome; Assemble transcripts from spliced alignments; Assemble transcripts de novo; Align transcripts to genome; Genome; More abundant; Less abundant]
Figure 1 Strategies for reconstructing transcripts from RNA-Seq reads. The ‘align-then-assemble’ approach (left) taken by Trapnell et al. [4] and Guttman et al. [5] first aligns short RNA-Seq reads to the genome, accounting for possible splicing events, and then reconstructs transcripts from the spliced alignments. The ‘assemble-then-align’ approach (right) first assembles transcript sequences de novo, that is, directly from the RNA-Seq reads. These transcripts are then splice-aligned to the genome to delineate intron and exon structures and variations between alternatively spliced transcripts. As de novo assembly is likely to work only for the most abundant transcripts, the align-then-assemble method should be more sensitive, although this warrants further investigation. RNA-Seq reads are colored according to the transcript isoform from which they were derived. Protein-coding regions of reconstructed transcript isoforms are depicted in dark colors.
NEWS AND VIEWS
De Novo Assembly
Reads
Transcriptome Assembly
Quantitative Analysis
Assembly with Available Genome
RNA-seq Reads
Genome
DNA
Transcriptome Assembly
Quantitative Analysis
RNA
Assembly with Available Transcriptome
RNA-seq Reads
Transcriptome
RNA
Quantitative Analysis
RNA
Assembly with Available Genome
Pipeline
Sequencing: RNA-seq reads (2 x 100 bp) (.fastq files)
Quality Control: FASTX/FastQC
Read Alignment: Bowtie/TopHat
Transcript assembly: Cufflinks
Gene identification: Cufflinks (cuffmerge)
Differential expression: Cuffdiff (A:B Comparison)
Visualization: IGV
Inputs: Reference genome (.fasta files), Gene annotation (.gtf files)
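A sketch of how the pipeline steps could be chained for two samples, A and B. It only builds and prints the command lines; nothing is executed. All file and directory names (`A_1.fastq`, `genome_index`, `genes.gtf`, `assemblies.txt`, and so on) are hypothetical, and the options shown are a minimal selection, so check each tool's manual before running anything.

```python
import shlex


def per_sample_commands(name, reads1, reads2, index_base, annotation_gtf):
    """Quality control, alignment and assembly commands for one sample."""
    return [
        # FastQC report; the output directory must already exist.
        ["fastqc", "-o", f"qc_{name}", reads1, reads2],
        # Spliced alignment of the read pair against the indexed genome.
        ["tophat", "-o", f"tophat_{name}", "-G", annotation_gtf,
         index_base, reads1, reads2],
        # Transcript assembly from the TopHat alignments.
        ["cufflinks", "-o", f"cufflinks_{name}",
         f"tophat_{name}/accepted_hits.bam"],
    ]


def comparison_commands(names, annotation_gtf, genome_fasta):
    """Merge the per-sample assemblies, then compare expression."""
    bams = [f"tophat_{name}/accepted_hits.bam" for name in names]
    return [
        # assemblies.txt (hypothetical name) lists each sample's
        # cufflinks_<name>/transcripts.gtf, one path per line.
        ["cuffmerge", "-o", "merged", "-g", annotation_gtf,
         "-s", genome_fasta, "assemblies.txt"],
        ["cuffdiff", "-o", "diff_out", "-L", ",".join(names),
         "merged/merged.gtf", *bams],
    ]


samples = {"A": ("A_1.fastq", "A_2.fastq"), "B": ("B_1.fastq", "B_2.fastq")}
commands = []
for name, (reads1, reads2) in samples.items():
    commands += per_sample_commands(name, reads1, reads2,
                                    "genome_index", "genes.gtf")
commands += comparison_commands(list(samples), "genes.gtf", "genome.fasta")

for command in commands:
    print(shlex.join(command))
```

The resulting alignment and annotation files can then be loaded into IGV for visualization.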
Quality Control
Quality assessment is the first step of the RNA-seq bioinformatics pipeline.
Filtering data: removing low-quality sequences or bases (trimming), adapters, contaminants, and overrepresented sequences to ensure a coherent final result.
Packages:
• FastQC: developed in Java. Results are presented as permanent HTML reports.
• FASTX: converts FASTQ to FASTA format, reports quality statistics, removes sequencing adapters, and filters and trims sequences based on quality.
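A minimal sketch of quality trimming, to make the idea concrete: low-quality bases are cut from the 3' end of a read until a base reaches the threshold. This is a simplified illustration, not the algorithm used by FastQC or FASTX; the threshold of 20 and the Phred+33 offset are assumptions.

```python
def trim_3prime(sequence: str, quality: str, min_q: int = 20, offset: int = 33):
    """Drop bases from the 3' end while their quality is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' decodes to 40, '#' to 2 and '!' to 0 under the Phred+33 assumption.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIIII#!")
print(trimmed_seq, trimmed_qual)
```

A read that becomes too short after trimming is usually discarded; that length filter is left out here.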
Read Mapping Strategies
Unspliced aligners: bowtie, bwa, soap
Spliced aligners: tophat, MapSplice, SpliceMap
Garber et al. Nature Methods 8, 469–477 (2011)
Short Read Aligners
Aligners
Module 2 – RNA-seq alignment and visualization bioinformatics.ca
Which read aligner should I use?
http://wwwdev.ebi.ac.uk/fg/hts_mappers/
RNA Bisulfite DNA microRNA
Short Read Aligners
Columns: Name; Description; Paired-end; Use FASTQ quality; Gapped; Multi-threaded

Name: Bowtie
Description: Based on the Burrows-Wheeler transform. 1.3 GB memory footprint for the human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.
Paired-end: Yes; Use FASTQ quality: Yes; Gapped: No; Multi-threaded: Yes

Name: BWA
Description: Based on the Burrows-Wheeler transform. It is a bit slower than Bowtie but allows indels in alignment.
Paired-end: Yes; Use FASTQ quality: No; Gapped: Yes; Multi-threaded: Yes

Name: SHRiMP
Description: Indexes the reference genome as of version 2. Uses masks to generate possible keys.
Paired-end: Yes; Use FASTQ quality: Yes; Gapped: Yes; Multi-threaded: Yes

Name: SOAP, SOAP2, SOAP3 and SOAP3-dp
Description: SOAP: robust with a small (1-3) number of gaps and mismatches. SOAP2: uses a bidirectional BWT and is much faster than the first version. SOAP3: GPU-accelerated version. SOAP3-dp: also GPU-accelerated.
Paired-end: Yes; Use FASTQ quality: No; Gapped: SOAP3-dp: Yes; Multi-threaded: Yes
Bowtie
Exons
bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]
Tophat
tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
[Figure: exons 1, 2, 3 and isoforms]
Tophat
• TopHat is a ‘splice-aware’ RNA-seq read aligner
• Requires a reference genome
• Breaks reads into pieces, uses the ‘bowtie’ aligner to first align these pieces
• Then extends alignments from these seeds and resolves exon edges (splice junctions)
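The seed-and-extend idea in the bullets above can be sketched with a toy example. This is not TopHat's actual algorithm: it uses exact string matching instead of bowtie, handles a single junction per read, and the exon and intron sequences are invented for illustration.

```python
def splice_align(read: str, genome: str, seg_len: int = 8):
    """Return (intron_start, intron_end) for a read spanning one junction.

    Positions are 0-based and the end is exclusive. Returns None when the
    read aligns without a gap or cannot be placed.
    """
    # Step 1: break the read into pieces and exact-match each one (seeds).
    seeds = []
    for offset in range(0, len(read) - seg_len + 1, seg_len):
        position = genome.find(read[offset:offset + seg_len])
        if position >= 0:
            seeds.append((offset, position))
    if len(seeds) < 2:
        return None
    (left_off, left_pos), (right_off, right_pos) = seeds[0], seeds[-1]
    left_shift = left_pos - left_off      # genome position of read base 0 (left block)
    right_shift = right_pos - right_off   # same for the right block
    if right_shift <= left_shift:
        return None  # no gap between the seeds, so no junction to report
    # Step 2: extend the left seed base by base until the read and genome differ.
    i = left_off + seg_len
    while i < right_off and read[i] == genome[left_shift + i]:
        i += 1
    # Step 3: the rest of the read must continue into the right seed.
    if read[i:right_off] != genome[right_shift + i:right_shift + right_off]:
        return None
    return left_shift + i, right_shift + i


# Invented sequences: two exons separated by a GT...AG intron.
exon1 = "ACGTTGCAAGGCTTAC"
intron = "GTAAGTCCCCCCCCAG"
exon2 = "TTGACCGATAGGCATC"
genome = exon1 + intron + exon2
read = exon1[4:] + exon2[:12]  # a read that crosses the exon1/exon2 junction

junction = splice_align(read, genome)
print(junction, genome[junction[0]:junction[1]])
```

The middle piece of the read spans the junction and finds no exact match, so the two outer pieces act as seeds and the extension step locates the exon edges.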
Trapnell et al. 2009
Trapnell et al. Nature Biotechnology 28, 511–515 (2010)
Cufflinks
Integrative Genomics Viewer (IGV)
Genomic Coordinates
Data Panel
Annotation Heatmap
Cytoband
Genome Features
Track Names
IGV