Seyed Abolfazl Motahari RNA-seq -...

Seyed Abolfazl Motahari

RNA-seq Data Analysis

Basics

Next Generation Sequencing

Biological Samples Data

Data Volume

Cost

Disease diagnosis

Biological protection

Drug design

Production of new biological materials

Control of biological systems

Data analysis

Big Data Analysis in Biology

DNA Fragment

Read

Sequencer

ACCGTAACCTACTTAGTA

The sequence generated by a sequencing machine from a DNA fragment.

Read

DNA Fragment

ACCGTAACCTACTTAGTA

Reads

Sequencer

Data from a pair of reads sequenced from ends of the same DNA fragment. The genomic distance between the reads is approximately known and is used to constrain assembly solutions.

CTAGTAACCTTAACCGTA

Paired-end Read
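The paired-end definition above can be illustrated with a minimal Python sketch. It assumes the common convention that the second read is reported as the reverse complement of the fragment's far end; the function names and the middle of the fragment (a run of G's) are hypothetical, and only the two read sequences come from the slide.

```python
# Illustrative only: derive one paired-end read pair from a DNA fragment.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment, read_length):
    """Read 1 comes from the start of the fragment; read 2 is the
    reverse complement of the end (assumed convention)."""
    if read_length > len(fragment):
        raise ValueError("read_length exceeds fragment length")
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2

# Hypothetical fragment: slide read + made-up insert + far end of the fragment.
fragment = "ACCGTAACCTACTTAGTA" + "G" * 20 + "TACGGTTAAGGTTACTAG"
read1, read2 = paired_end_reads(fragment, 18)
print(read1)
print(read2)
# Distance between the outer ends of the two reads (the fragment length).
print(len(fragment))
```

The known fragment length is what lets an assembler constrain where the two reads may be placed relative to each other.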

Long DNA Fragment

[Figure: a long DNA fragment (12kb) with biotinylated (B) ends; steps shown: shearing, capture]

Similar to the paired-end reads with longer inserts

Mate-pair Read

Basic Methods
- Maxam-Gilbert Sequencing
- Sanger Sequencing

Next Generation Methods
- Polony Sequencing
- 454 Pyrosequencing
- Illumina Sequencing
- SOLiD Sequencing
- Ion Torrent Semiconductor Sequencing
- DNA Nanoball Sequencing
- Heliscope Single Molecule Sequencing
- Single Molecule Real-time Sequencing

Methods in Development
- Nanopore DNA Sequencing
- Tunneling Currents DNA Sequencing
- Sequencing by Hybridization
- Sequencing with Mass Spectrometry
- Microfluidic Sanger Sequencing
- Microscopy-based Sequencing

Sequencing Technologies

Technology    Read Length (bp)   Error Rate   Paired-end
Sanger        up to 2k           2%           Yes
ABI/SOLiD     75                 2%           Yes
Illumina      100-150            2%           Yes
Roche/454     400-600            4%           No
IonTorrent    200                4%           No
PacBio        up to 15k          18%          Yes

Potentially, all sequencing technologies can be used to sequence mate-pair libraries obtained by the circularization of long DNA fragments.

Comparison of Sequencers

Biological samples

Genomes (DNA) Transcriptome (RNA)

Size selection

Capturing

Whole Genome

Capturing Regions

NGS Applications

RNA-seq miRNA-seq

CLIP-seq

Whole Exome Epigenomics

ChIP-seq methyl-seq DNase-seq

Raw Data

Pre-processing

Assembly

De Novo/Alignment

App

DNA Prep

Library Prep

Chip Prep

Sequencing

Typical Workflow

UNIX

Advantages of Unix
- Unix is more flexible and can be installed on many different types of machines, including mainframe computers, supercomputers and microcomputers.
- Unix is more stable and does not go down as often as Windows does, and therefore requires less administration and maintenance.
- Unix has greater built-in security and permissions features than Windows.
- Unix possesses much greater processing power than Windows.
- Unix is the leader in serving the Web. About 90% of the Internet relies on Unix operating systems running Apache, the world's most widely used Web server, which is free.
- Software upgrades from Microsoft often require the user to purchase new or more hardware or prerequisite software. That is not the case with Unix.
- The mostly free or inexpensive open-source operating systems, such as Linux and BSD, with their flexibility and control, prove to be very attractive to (aspiring) computer wizards. Many of the smartest programmers are developing state-of-the-art software free of charge for the fast-growing "open-source movement".
- Unix also inspires novel approaches to software design, such as solving problems by interconnecting simpler tools instead of creating large monolithic application programs.

Command Line

Graphical User Interface (GUI)

Command Line Interface (CLI)

General Command

command [-options] [args ...]

The “prompt”

The current directory (“path”)

The host

MacBook-Pro:abolfazl$ bowtie -v 3 -S human reads.fastq > aligned.sam
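As a hedged illustration of the general form `command [-options] [args ...]`, the Python sketch below splits the example command into its parts. The function name `parse_command_line` is hypothetical. The `>` redirection is interpreted by the shell rather than by bowtie, so the sketch separates it from the arguments.

```python
import shlex

def parse_command_line(line):
    """Split a simple shell command into command, arguments and stdout target.

    Only handles a single '>' redirection surrounded by spaces; this is a
    teaching sketch, not a shell parser.
    """
    tokens = shlex.split(line)
    stdout_target = None
    if ">" in tokens:
        index = tokens.index(">")
        stdout_target = tokens[index + 1]
        tokens = tokens[:index]
    return {
        "command": tokens[0],
        "arguments": tokens[1:],
        "stdout": stdout_target,
    }

parsed = parse_command_line("bowtie -v 3 -S human reads.fastq > aligned.sam")
print(parsed["command"])    # the program to run
print(parsed["arguments"])  # options and positional arguments
print(parsed["stdout"])     # file that receives standard output
```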

Unix File System

/home/Abolfazl/Data

The Path
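A small Python sketch, using only the standard library, shows how the path above decomposes into the root directory and the nested directory names. It only manipulates the string and does not require the directory to exist.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/home/Abolfazl/Data")

print(path.parts)          # root followed by each directory name
print(path.parent)         # the directory that contains Data
print(path.name)           # the last component
print(path.is_absolute())  # True for paths that start at the root "/"
```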

/
  Applications
  bin
  home
    Abolfazl
      Data
    Sajjad
  usr

>gi|123440403|ref|NC_008800.1| Yersinia enterocolitica subsp. enterocolitica 8081 chromosome, complete genome
GATCTTTTTATTTAAAGATCTCTTTATTAGATCTCTTATTAGGATCATGACCTTCTGTGGATAAGTGATTATTCACATTTAAGATCATGTGATTAAGGAGGATCGTTTGCTGTGAATGATCGGTGATCCTATTGCGTATAAGCTGGGATCTAAATGGCATGTTATGCACAGGCACTTTAAGTTACTAAGGTTGTTATGTGGATATGTACTGCTTATACCCTGCTTTCAAGCTTACTTATCCACATTCGTTCGCGTGATCTTTAAGCAAATTAGAGTAAATTAATCCAGTTTTTAACCCAAATCTCTGCCGGATCCTCAGGAATTTCATGTTTGATGACGTCAATTTCTAAAATATCACCCACACGAATGGCTCCCTGGATGATCAGTTGCTGATCCAATTTTCTGACCGCACCACAGAAAGTGTCATATTCTGAACTGCCCAAACCAACAGCACCAAAGCGAACCTGTGAGAGATCCGGTCTCTGCTGCTCGATTTGTTCTAATAAGGGTTGAAGATTGTCTGGCAGATCACCTGCACCATGAGTGGAAGTGATTATTAGCCACATACCATCCAAGGTCAGCTCGTCTAATTCCGGGCCATGCAGAGTTTCTGTCGTGAAACCCGCCTCTTCTAATTTCTCAGCTAAATGTTCAGCAACGTATTCAGCACTGCCAAGCGTACTGCCACTGATCAAGGTAATGTCAGCCATAAAGACCCCAACCGAAGTAATGAACCGGTATTGTACGCTGTGAATCAGCTGGGATCTACCTGTGGATAATGTGGGTATAGTTATTTAGTGCTCAGGGCACGATGGTACGCATGATGGGGTTTTGCAGGGAAATAAGAGTCTCGGTTGACTGGATCTCATCAATAGTTTGGATCTTGTTGATAAGTACCTGTTGCAGTGCATCTATCGATTTACACATGACCTTAATAAAGATGCTGTAATGGCCAGTGGTGTAATAGGCCTCGACAACTTCTTCTAAACTTTCCAGTTTTTTTAATGCAGAAGGGTAATCTTTGGCACTTTTCAAAATGATGCCGATGAA

Header

Sequence

Fasta Format

Genome Format
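A minimal sketch of reading FASTA in Python, assuming the usual layout of a `>` header line followed by one or more sequence lines. The function name `read_fasta` is hypothetical, and the example uses only the first 100 bases of the record shown on the slide, wrapped over two lines.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)

example = io.StringIO(
    ">gi|123440403|ref|NC_008800.1| Yersinia enterocolitica subsp. "
    "enterocolitica 8081 chromosome, complete genome\n"
    "GATCTTTTTATTTAAAGATCTCTTTATTAGATCTCTTATTAGGATCATGA\n"
    "CCTTCTGTGGATAAGTGATTATTCACATTTAAGATCATGTGATTAAGGAG\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header)
    print(len(sequence), sequence[:20])
```

For real genomes a generator like this avoids holding more than one record in memory at a time.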

Read Format (Fastq)

Header

FASTQ files extend FASTA files in that they provide both sequence and quality. Each FASTQ record thus typically consists of four lines:

1. A line starting with @ containing the sequence identifier
2. The actual sequence
3. A line starting with +, after which the sequence identifier is optional
4. A line with quality values, encoded as ASCII characters

As such, the 2nd and 4th lines must have the same length. One such entry, for a sequence beginning "ATGTCT", is given below.

@HWI-ST999:102:D1N6AACXX:1:1101:1235:1936 1:N:0:
ATGTCTCCTGGACCCCTCTGTGCCCAAGCTCCTCATGCATCCTCCTCAGCAACTTGTCCTGTAGCTGAGGCTCACTGACTACCAGCTGCAG
+
1:DAADDDF<B<AGF=FGIEHCCD9DG=1E9?D>CF@HHG??B<GEBGHCG;;CDB8==C@@>>GII@@5?A?@B>CEDCFCC:;?CCCAC

Quality Scores

Sequence
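The four-line structure and the quality encoding can be sketched in Python as follows. This assumes the Phred+33 (Sanger) encoding, which is an assumption about the file and not something the slide states. The function names and the read identifier are hypothetical; the sequence and quality strings are the first ten characters of the entry shown above.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (identifier, sequence, quality)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return header[1:], sequence, quality

def phred_scores(quality, offset=33):
    """Decode ASCII quality characters (offset 33 is assumed here)."""
    return [ord(symbol) - offset for symbol in quality]

def error_probability(score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)

record = [
    "@example_read\n",  # hypothetical identifier
    "ATGTCTCCTG\n",
    "+\n",
    "1:DAADDDF<\n",
]
identifier, sequence, quality = parse_fastq_record(record)
scores = phred_scores(quality)
print(identifier, sequence)
print(scores)
print([round(error_probability(q), 4) for q in scores])
```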

RNA-seq

Why?
1. To assemble the transcriptome
2. To find the expression levels

Assembly Paradigms (depending on other sources):
1. De Novo
2. Genome Guided
3. Transcriptome Guided

NATURE BIOTECHNOLOGY VOLUME 28 NUMBER 5 MAY 2010 421

Brian J. Haas and Michael C. Zody are at the Broad Institute, Cambridge, Massachusetts, USA.

Advancing RNA-Seq analysis

Brian J Haas & Michael C Zody

New methods for analyzing RNA-Seq data enable de novo reconstruction of the transcriptome.

Sequencing of RNA has long been recognized as an efficient method for gene discovery [1] and remains the gold standard for annotation of both coding and noncoding genes [2]. Compared with earlier methods, massively parallel sequencing of RNA (RNA-Seq) [3] has vastly increased the throughput of RNA sequencing and allowed global measurement of transcript abundance. Two reports in this issue introduce approaches for RNA-Seq analysis that capture genome-wide transcription and splicing in unprecedented detail. Trapnell et al. [4] describe a software package, Cufflinks, for simultaneous discovery of transcripts and quantification of expression levels and apply it to study gene expression and splicing during the differentiation of mouse myoblast cells. Taking a similar approach, Guttman et al. [5] use software called Scripture to reannotate the transcriptomes of three mouse cell lines, defining complete gene models for hundreds of new large intergenic noncoding RNAs (lincRNAs) [6].

Although transcript sequencing has been possible for nearly 20 years, until recently it required the construction of clone libraries. Projects to determine full-length gene structures for human, mouse and other important models have taken years to complete [7]. With new sequencing technologies, no cloning is needed, allowing direct sequencing of cDNA fragments. In a matter of days and at a small fraction of the cost of earlier projects, one can achieve reasonably complete coverage of a transcriptome [8]. But this approach has been hindered by a substantial challenge: without cloning, one cannot know a priori which reads came from which transcripts. Recent studies analyzed gene expression and alternative splicing by mapping short RNA-Seq reads to previously known or predicted transcripts [9,10]. Although highly informative, such studies are inherently limited to known genes and to alternative splicing across previously identified splice junctions. To fully leverage RNA-Seq data for biological discovery, one should be able to reconstruct transcripts and accurately measure their relative abundance without reference to an annotated genome.

Previous efforts to reconstruct transcripts from short RNA-Seq reads have followed two general strategies (Fig. 1). The first, a de novo assembly approach implemented in the ABySS software [11], reduces the annotation problem to that of aligning full-length cDNAs, which is well handled by several algorithms. This method is also applicable to the discovery of transcripts that are missing or incomplete in the reference genome and to RNA-Seq data from organisms lacking a genome reference.

[Figure 1 panel labels: RNA-Seq reads; Align reads to genome; Assemble transcripts from spliced alignments; Assemble transcripts de novo; Align transcripts to genome; More abundant; Less abundant; Genome]

Figure 1 Strategies for reconstructing transcripts from RNA-Seq reads. The ‘align-then-assemble’ approach (left) taken by Trapnell et al. [4] and Guttman et al. [5] first aligns short RNA-Seq reads to the genome, accounting for possible splicing events, and then reconstructs transcripts from the spliced alignments. The ‘assemble-then-align’ approach (right) first assembles transcript sequences de novo, that is, directly from the RNA-Seq reads. These transcripts are then splice-aligned to the genome to delineate intron and exon structures and variations between alternatively spliced transcripts. As de novo assembly is likely to work only for the most abundant transcripts, the align-then-assemble method should be more sensitive, although this warrants further investigation. RNA-Seq reads are colored according to the transcript isoform from which they were derived. Protein-coding regions of reconstructed transcript isoforms are depicted in dark colors.

NEWS AND VIEWS

De Novo Assembly

Reads

Transcriptome Assembly

Quantitative Analysis

Assembly with Available Genome

RNA-seq Reads

Genome

DNA

Transcriptome Assembly


RNA

Assembly with Available Transcriptome

RNA-seq Reads

Transcriptome

RNA


RNA

Assembly with Available Genome

Pipeline

1. Sequencing: RNA-seq reads (2 x 100 bp), stored as .fastq files
2. Quality Control: FASTX/FastQC
3. Read Alignment: Bowtie/Tophat
4. Transcript assembly: Cufflinks
5. Gene identification: Cufflinks (cuffmerge)
6. Differential expression: Cuffdiff (A:B Comparison)
7. Visualization: IGV

Additional inputs: Reference genome (.fasta files), Gene annotation (.gtf files)

Quality Control

Filtering data: removing low-quality sequences or bases (trimming), adaptors, contaminants, or overrepresented sequences to ensure a coherent final result.

Quality assessment is the first step of the bioinformatics pipeline of RNA-Seq.
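A minimal sketch of the trimming idea in Python: bases are removed from the 3' end of a read while their quality is below a threshold. This is a simplified illustration and not the algorithm used by FASTX or FastQC. The function name, the threshold of 20 and the Phred+33 offset are assumptions.

```python
def trim_low_quality_tail(sequence, quality, min_quality=20, offset=33):
    """Trim bases from the 3' end while their Phred score is below min_quality.

    Assumes Phred+33 encoded quality characters (an assumption, not
    something the pipeline on the slide specifies).
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]

# Toy read: 'I' decodes to Q40 and '#' to Q2 under Phred+33.
trimmed_sequence, trimmed_quality = trim_low_quality_tail("ACGTACGT", "IIIII###")
print(trimmed_sequence, trimmed_quality)
```

A read whose every base falls below the threshold is trimmed to an empty string and would normally be discarded.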

Packages:

• FastQC: developed in Java. Results are presented as permanent HTML reports.

• FASTX:
  - conversion from FASTQ to FASTA format
  - statistics on quality
  - removing sequencing adapters, filtering and cutting sequences based on quality

FastQC

Read Mapping Strategies

Unspliced aligners: bowtie, bwa, soap

Spliced aligners: tophat, MapSplice, SpliceMap

Garber et al. Nature Methods 8, 469–477 (2011)

Short Read Aligners

Aligners

Module 2 – RNA-seq alignment and visualization, bioinformatics.ca

Which read aligner should I use?

http://wwwdev.ebi.ac.uk/fg/hts_mappers/

RNA Bisulfite DNA microRNA

Short Read Aligners

Columns: Name; Description; Paired-end; Use FASTQ quality; Gapped; Multi-threaded

Bowtie
  Description: Based on the Burrows-Wheeler transform. 1.3 GB memory footprint for the human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.
  Paired-end: Yes; Use FASTQ quality: Yes; Gapped: No; Multi-threaded: Yes

BWA
  Description: Based on the Burrows-Wheeler transform. It is a bit slower than Bowtie but allows indels in alignment.
  Paired-end: Yes; Use FASTQ quality: No; Gapped: Yes; Multi-threaded: Yes

SHRiMP
  Description: Indexes the reference genome as of version 2. Uses masks to generate possible keys.
  Paired-end: Yes; Use FASTQ quality: Yes; Gapped: Yes; Multi-threaded: Yes

SOAP, SOAP2, SOAP3 and SOAP3-dp
  Description: SOAP: robust with a small (1-3) number of gaps and mismatches. SOAP2: uses a bidirectional BWT, much faster than the first version. SOAP3: GPU-accelerated version. SOAP3-dp: also GPU-accelerated.
  Paired-end: Yes; Use FASTQ quality: No; Gapped: SOAP3-dp: Yes; Multi-threaded: Yes

http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
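Since Bowtie and BWA are described as based on the Burrows-Wheeler transform, a naive Python sketch of the transform and its inverse may help. It sorts all rotations explicitly, which is fine for a short string but is not how real aligners build their indexes. The function names and the `$` terminator convention are assumptions of this sketch.

```python
def burrows_wheeler_transform(text, terminator="$"):
    """Return the BWT of text: the last column of the sorted rotation matrix."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(bwt, terminator="$"):
    """Recover the original text by repeatedly prepending the BWT column and sorting."""
    table = [""] * len(bwt)
    for _ in range(len(bwt)):
        table = sorted(bwt[i] + table[i] for i in range(len(bwt)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]

transformed = burrows_wheeler_transform("ACAACG")
print(transformed)
print(inverse_bwt(transformed))
```

The transform tends to group identical characters together and is reversible, which is what makes it useful as the basis of a compact, searchable index.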


Bowtie

Exons

bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]
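To illustrate what an unspliced aligner computes, the Python sketch below finds every position where a read matches a reference with at most a given number of mismatches, similar in spirit to Bowtie's `-v` option. It scans the reference directly, whereas Bowtie uses a Burrows-Wheeler index, so this is a conceptual sketch only. The function names are hypothetical, and the reference is the first 30 bases of the FASTA example.

```python
def hamming_distance(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def align_with_mismatches(read, reference, max_mismatches):
    """Return (position, mismatches) for every ungapped placement of the
    read with at most max_mismatches mismatches."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = hamming_distance(read, window)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

reference = "GATCTTTTTATTTAAAGATCTCTTTATTAG"
exact_read = "TTTAAAGATC"    # copied from the reference
mutated_read = "TTTAACGATC"  # same read with one substituted base

exact_hits = align_with_mismatches(exact_read, reference, 0)
strict_hits = align_with_mismatches(mutated_read, reference, 0)
relaxed_hits = align_with_mismatches(mutated_read, reference, 1)
print(exact_hits)
print(strict_hits)
print(relaxed_hits)
```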

Tophat

tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]


[Figure: exons 1, 2 and 3, and their isoforms]

Tophat

• TopHat is a ‘splice-aware’ RNA-seq read aligner

• Requires a reference genome

• Breaks reads into pieces, uses the ‘bowtie’ aligner to first align these pieces

• Then extends alignments from these seeds and resolves exon edges (splice junctions)
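The steps above can be sketched in Python under strong simplifying assumptions: pieces are matched exactly (standing in for Bowtie), every piece occurs only once in the genome, and the splice junction falls exactly on a piece boundary. TopHat itself handles the general case, so this is only a toy illustration. All names and sequences are hypothetical.

```python
def map_segments(read, genome, segment_length):
    """Break a read into pieces and place each by exact match.

    Returns (offset_in_read, position_in_genome) pairs; position is -1
    when a piece is not found.
    """
    placements = []
    for offset in range(0, len(read), segment_length):
        segment = read[offset:offset + segment_length]
        placements.append((offset, genome.find(segment)))
    return placements

def infer_junctions(placements, segment_length):
    """Report an intron as (start, end) wherever consecutive pieces map
    farther apart on the genome than they are in the read."""
    junctions = []
    for (off_a, pos_a), (off_b, pos_b) in zip(placements, placements[1:]):
        if pos_a < 0 or pos_b < 0:
            continue  # unmapped piece: no evidence either way
        gap = (pos_b - pos_a) - (off_b - off_a)
        if gap > 0:
            junctions.append((pos_a + segment_length, pos_b))
    return junctions

# Toy genome: two exons separated by an intron; the read spans the junction.
exon1, intron, exon2 = "GATTACAGGC", "GTAAGTTTTTTTTCAG", "CCTGAATCGA"
genome = exon1 + intron + exon2
read = exon1 + exon2

placements = map_segments(read, genome, 10)
junctions = infer_junctions(placements, 10)
print(placements)
print(junctions)
```

An unspliced aligner would fail to place this read as a whole, because the read is contiguous in the transcript but not in the genome.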


Trapnell et al. 2009

Trapnell et al. Nature Biotechnology 28, 511–515 (2010)

Cufflinks

Integrative Genomics Viewer (IGV)

Genomic Coordinates

Data Panel

Annotation Heatmap

Cytoband

Genome Features

Track Names
