Introduction To Next Generation Sequencing (NGS) Data Analysis

Page 1: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Introduction To Next Generation Sequencing (NGS) Data Analysis

Jenny Wu
UCI Genomics High Throughput Facility

Page 2: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Outline

• Goals: Practical guide to NGS data processing
• Bioinformatics in NGS data analysis
  – Basics: terminology, data formats, general workflow, etc.
  – Data Analysis Pipeline
    • Sequence QC and preprocessing
    • Downloading reference sequences
    • Sequence mapping
    • Downstream analysis workflow and software
• RNA-Seq data analysis
  – Concepts: spliced alignment, normalization, coverage, differential expression
  – Popular RNA-Seq pipeline: Tuxedo suite vs. Tophat-HTSeq
  – Data visualization with Genome Browsers and R packages
  – Downstream pathway analysis
• ChIP-Seq data analysis workflow and software
• NGS bioinformatics resources
• Summary

Page 3: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Why Next Generation Sequencing

One can generate hundreds of millions of short sequences (up to 250 bp) in a single run, in a short period of time, at a low per-base cost.

• Illumina/Solexa GA II, HiSeq 2500, 3000, X
• Roche/454 FLX, Titanium
• Life Technologies/Applied Biosystems SOLiD

Reviews:
Michael Metzker (2010) Nature Reviews Genetics 11:31
Quail et al. (2012) BMC Genomics Jul 24;13:341

Page 4: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Why Bioinformatics

(wall.hms.harvard.edu)

Informatics

Page 5: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Bioinformatics Challenges in NGS Data Analysis

• “Big Data” (thousands of millions of lines long)
  – Can’t do ‘business as usual’ with familiar tools
  – Impossible memory usage and execution time
  – Manage, analyze, store, transfer and archive huge files
• Need for powerful computers and expertise
  – Informatics groups must manage compute clusters
  – New algorithms and software are required, and they are often open source and Unix/Linux based
  – Collaboration of IT experts, bioinformaticians and biologists

Page 6: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Basic NGS Workflow

Olson et al.

Page 7: Introduction To Next Generation  Sequencing (NGS) Data Analysis

NGS Data Analysis Overview

Olson et al.

Page 8: Introduction To Next Generation  Sequencing (NGS) Data Analysis


Page 9: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Terminology

Experimental Design:
• Coverage (sequencing depth): The number of nucleotides from reads that are mapped to a given position.
  average coverage = read length × number of reads / genome size
• Paired-End Sequencing: Both ends of the DNA fragment are sequenced, allowing highly precise alignment.
• Multiplexing/Demultiplexing: "Barcode" sequences are added to each sample so that samples can be distinguished, which allows a large number of samples to be sequenced on one lane.

Data analysis:
• Quality Score: Each called base comes with a quality score, which measures the probability of a base call error.
• Mapping: Aligning reads to a reference to identify their origin.
• Assembly: Merging fragments of DNA in order to reconstruct the original sequence.
• Duplicate reads: Reads that are identical. They can be identified after mapping.
• Multi-reads: Reads that can be mapped to multiple locations equally well.
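The average coverage formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the example numbers (100 bp reads, a 3 Gb genome) are assumptions chosen only to make the arithmetic easy to follow.

```python
import math


def average_coverage(read_length: int, num_reads: int, genome_size: int) -> float:
    """Average coverage = read length * number of reads / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return read_length * num_reads / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Invert the formula: number of reads required for a target average coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 30 million 100 bp reads on a 3 Gb genome.
print(average_coverage(100, 30_000_000, 3_000_000_000))
# Reads needed for 30x average coverage with the same read length and genome size.
print(reads_needed(30, 100, 3_000_000_000))
```

For paired-end data, each mate would count as one read in this calculation.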

Page 10: Introduction To Next Generation  Sequencing (NGS) Data Analysis

What does the data look like?

Common NGS Data Formats

For a full list, go to http://genome.ucsc.edu/FAQ/FAQformat.html

Page 11: Introduction To Next Generation  Sequencing (NGS) Data Analysis

File Formats

• Reference sequences, reads:
  – FASTA
  – FASTQ (FASTA with quality scores)
• Alignments:
  – SAM (Sequence Alignment/Map)
  – BAM (binary version of SAM)
• Features, annotation, scores:
  – GFF3/GTF (General Feature Format / Gene Transfer Format)
  – BED/BigBed
  – WIG/BigWig

http://genome.ucsc.edu/FAQ/FAQformat.html

Page 12: Introduction To Next Generation  Sequencing (NGS) Data Analysis

FASTA Format (Reference Seq)

Page 13: Introduction To Next Generation  Sequencing (NGS) Data Analysis

FASTQ Format (Illumina Example)

@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ

Each read record has four lines:
1. Header
2. Read bases
3. Separator (with optional repeated header)
4. Read quality scores

Header fields include: flow cell ID, lane, tile, tile coordinates, and barcode.

NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads.

(Passarelli, 2012)
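A minimal sketch of reading four-line FASTQ records and decoding the quality string, assuming the Phred+33 encoding used by recent Illumina pipelines (quality Q = ASCII code minus 33, error probability p = 10^(-Q/10)). The function names and the toy record are hypothetical and are not taken from the slides; older Illumina files use a different offset, which is why `offset` is a parameter.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str
    bases: str
    qualities: List[int]


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(quality):
            raise ValueError("bases and quality strings differ in length")
        yield FastqRecord(header[1:], bases, [ord(c) - offset for c in quality])


def error_probability(q: int) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


# Hypothetical toy record (not from the slides).
TOY_FASTQ = "@read1\nACGTN\n+\nII5+!\n"
for record in read_fastq(io.StringIO(TOY_FASTQ)):
    print(record.header, record.bases, record.qualities)
```

Under this encoding, the characters `B` and `J` that dominate the Illumina example above decode to Q33 and Q41.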

Page 14: Introduction To Next Generation  Sequencing (NGS) Data Analysis

GFF3 and GTF format

Khetani RS et al.

GTF format:

GFF3 format:
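Both GTF and GFF3 are tab-delimited with nine columns; they differ mainly in the last (attributes) column, which uses `key "value";` pairs in GTF and `key=value;` pairs in GFF3. The sketch below parses one GTF line. The example line, gene names, and function name are hypothetical, and the naive split on `;` assumes no attribute value contains a semicolon.

```python
GTF_COLUMNS = ("seqname", "source", "feature", "start", "end", "score", "strand", "frame")


def parse_gtf_line(line: str) -> dict:
    """Parse one GTF line into a dict; coordinates are 1-based and inclusive."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for item in fields[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record


# Hypothetical example line (not from the slides).
EXAMPLE_LINE = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'
exon = parse_gtf_line(EXAMPLE_LINE)
print(exon["feature"], exon["end"] - exon["start"] + 1, exon["attributes"])
```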

Page 15: Introduction To Next Generation  Sequencing (NGS) Data Analysis


Page 16: Introduction To Next Generation  Sequencing (NGS) Data Analysis

General Data Pipeline

Page 17: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Why QC?

Sequencing runs cost money
• Consequences of not assessing the data
• Sequencing a poor library on multiple runs: throwing money away!

Data analysis costs money and time
• Cost of analyzing data, CPU time $$
• Cost of storing raw sequence data $$$
• Hours of analysis could be wasted $$$$
• Downstream analysis can be incorrect.

Page 18: Introduction To Next Generation  Sequencing (NGS) Data Analysis

How to QC?

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, available on HPC
Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y

$ module load fastqc
$ fastqc s_1_1.fastq

Page 19: Introduction To Next Generation  Sequencing (NGS) Data Analysis

FastQC: Example

Page 20: Introduction To Next Generation  Sequencing (NGS) Data Analysis


Page 21: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Premade Genome Sequence Indexes and Annotation

http://ccb.jhu.edu/software/tophat/igenomes.shtml

Page 22: Introduction To Next Generation  Sequencing (NGS) Data Analysis

The UCSC Genome Browser Homepage

Get genome annotation here!

General information

Specific information—new features, current status, etc.

Get reference sequences here!

Page 23: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Downloading Reference Sequences

Page 24: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Downloading Reference Annotation

Page 25: Introduction To Next Generation  Sequencing (NGS) Data Analysis


Page 26: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Sequence Mapping Challenges

• Alignment (mapping) is often the first step once analysis-ready reads are obtained.
• The task: to align sequencing reads against a known reference.
• Difficulties: high volume of data, size of the reference genome, computation time, read length constraints, and ambiguity caused by repeats and sequencing errors.

Page 27: Introduction To Next Generation  Sequencing (NGS) Data Analysis

How to choose an aligner?

• There are many short read aligners, and they vary a lot in performance (accuracy, memory usage, speed, flexibility, etc.).
• Factors to consider: application, platform, read length, downstream analysis, etc.
• Constant trade-off between speed and sensitivity (e.g. MAQ vs. Bowtie2).
• Guaranteed high accuracy will take longer.
• Popular choices: Bowtie2, BWA, Tophat2, STAR.

Page 28: Introduction To Next Generation  Sequencing (NGS) Data Analysis


Page 29: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Application Specific Software

Page 30: Introduction To Next Generation  Sequencing (NGS) Data Analysis


Page 31: Introduction To Next Generation  Sequencing (NGS) Data Analysis
Page 32: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Two Major Approaches

1. Gene or exon level differential expression (DE): DESeq2, edgeR, DEXSeq…
2. Transcript assembly: Trinity, Velvet-Oases, Trans-ABySS, Cufflinks, Scripture…

Page 33: Introduction To Next Generation  Sequencing (NGS) Data Analysis

RNA-Seq Pipeline for DE

Page 34: Introduction To Next Generation  Sequencing (NGS) Data Analysis

RNA-Seq: Spliced Alignment

http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png
“Systematic evaluation of spliced alignment programs for RNA-seq data”, Nature Methods, 2013

• Some reads will span two different exons

• Need long enough reads to be able to reliably map both sides

• Use a splice aware aligner!

Page 35: Introduction To Next Generation  Sequencing (NGS) Data Analysis

How much sequence do I need?

• Oversimplified answer: 20-50 million paired-end (PE) reads per sample (Human/Mouse)

Depends on:
– Size and complexity of your transcriptome.
– Goal of experiment: DE, transcript discovery.
– Tissue type, library type, RNA quality, read length, single-end…

Page 36: Introduction To Next Generation  Sequencing (NGS) Data Analysis

RNA-Seq: Normalization

Gene-length bias
• Differential expression of longer genes tends to appear more significant, because long genes yield more reads.

RNA-Seq normalization methods:
• Scaling factor based: total count, upper quartile, median, DESeq, TMM in edgeR
• Quantile, RPKM (Cufflinks)
• ERCC

Normalize by gene length and by number of reads mapped, e.g. RPKM/FPKM (reads/fragments per kilobase per million mapped reads).

Page 37: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Definition of Expression levels

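RPKM: Reads Per Kilobase per Million mapped reads:

RPKM = 10^9 × C / (N × L)

where C is the number of reads mapped to the gene, N is the total number of mapped reads in the sample, and L is the gene length in base pairs.

FPKM: Fragments Per Kilobase per Million mapped reads (for paired-end reads)

Mortazavi, et al. 2008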
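As a worked example of the RPKM definition, RPKM = 10^9 × C / (N × L), with C the reads mapped to the gene, N the total mapped reads, and L the gene length in bp. The sketch below is illustrative only; the function name and the read counts are made-up numbers, not data from the slides.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers: two genes in a library of 20 million mapped reads.
# The long gene collects four times as many reads, but both have the same RPKM.
short_gene = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=20_000_000)
long_gene = rpkm(read_count=2_000, gene_length_bp=8_000, total_mapped_reads=20_000_000)
print(short_gene, long_gene)
```

FPKM would use the same arithmetic with fragments (read pairs) counted instead of individual reads.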

Page 38: Introduction To Next Generation  Sequencing (NGS) Data Analysis

RNA-Seq: Differential Expression

• Discrete vs. continuous data:
  – Microarray fluorescence intensity data: continuous, modeled using a normal distribution
  – RNA-Seq read count data: discrete, modeled using a negative binomial distribution

Microarray software can NOT be directly used to analyze RNA-Seq data!

Page 39: Introduction To Next Generation  Sequencing (NGS) Data Analysis


Page 40: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Popular RNA-Seq DE Pipeline

Pipeline 1 (The Tuxedo Protocol)

Pipeline 2 (The Alternative Protocol)

Page 41: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Classic RNA-Seq (Tuxedo Protocol)

1. Spliced read mapping (produces SAM/BAM)
2. Transcript assembly and quantification (produces GTF/GFF)
3. Merge assembled transcripts from multiple samples
4. Differential expression analysis

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Page 42: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Classic vs. Advanced RNA-Seq workflow

Page 43: Introduction To Next Generation  Sequencing (NGS) Data Analysis

1. Spliced Alignment: Tophat

Tophat: a spliced short read aligner for RNA-seq.

$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq

$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq

$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq

$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
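The four Tophat commands above differ only in the sample name, so they can be generated rather than typed by hand. The Python sketch below builds the same command lines using only the flags shown on the slide. The helper name `tophat_command` is hypothetical, and the file naming pattern (`<sample>_1.fq`, `<sample>_2.fq`, `<sample>_thout`) is simply copied from the example.

```python
import shlex


def tophat_command(sample: str, threads: int = 8,
                   annotation: str = "genes.gtf", index: str = "genome") -> str:
    """Build one paired-end Tophat command line for a sample such as 'C1_R1'."""
    args = [
        "tophat",
        "-p", str(threads),       # number of threads
        "-G", annotation,         # known gene annotation
        "-o", f"{sample}_thout",  # per-sample output directory
        index,                    # reference index base name
        f"{sample}_1.fq",         # first mate
        f"{sample}_2.fq",         # second mate
    ]
    return shlex.join(args)


# Two conditions (C1, C2) with two replicates each (R1, R2), as on the slide.
samples = [f"C{condition}_R{replicate}" for condition in (1, 2) for replicate in (1, 2)]
commands = [tophat_command(sample) for sample in samples]
for command in commands:
    print(command)
```

The printed lines could be written to a shell script or submitted as cluster jobs; running them still requires Tophat and the reference index to be installed.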

Page 44: Introduction To Next Generation  Sequencing (NGS) Data Analysis

2. Transcript assembly and abundance quantification: Cufflinks

Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam

$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam

$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam

$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam

Page 45: Introduction To Next Generation  Sequencing (NGS) Data Analysis

3. Final Transcriptome assembly: Cuffmerge

$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt

$ more assemblies.txt

./C1_R1_clout/transcripts.gtf

./C1_R2_clout/transcripts.gtf

./C2_R1_clout/transcripts.gtf

./C2_R2_clout/transcripts.gtf

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Page 46: Introduction To Next Generation  Sequencing (NGS) Data Analysis

4. Differential Expression: Cuffdiff

Cuffdiff: a program that compares transcript abundance between samples.

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam

Page 47: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Alternative Pipeline with HTSeq

Tophat2 → HTSeq → DESeq2/edgeR

$ htseq-count -f bam -s no C1_R1_thout/sorted.bam genes.gtf > hsc/C1_R1.counts

$ htseq-count -f bam -s no C1_R2_thout/sorted.bam genes.gtf > hsc/C1_R2.counts

$ htseq-count -f bam -s no C2_R1_thout/sorted.bam genes.gtf > hsc/C2_R1.counts

$ htseq-count -f bam -s no C2_R2_thout/sorted.bam genes.gtf > hsc/C2_R2.counts

Page 48: Introduction To Next Generation  Sequencing (NGS) Data Analysis

HTSeq Output: Gene Count Table
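Each htseq-count run writes a two-column, tab-separated file (gene ID, read count) for one sample. A gene count table for DESeq2 or edgeR is obtained by joining these per-sample files on gene ID. The sketch below shows one way to do that in Python. It assumes that the summary rows at the end of each file start with `__` (for example `__no_feature`), as in recent htseq-count releases. The function names and the toy counts are hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_counts(handle: TextIO) -> Dict[str, int]:
    """Read one htseq-count output, skipping the '__' summary rows."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # ignore blank lines
        gene, count = row
        if gene.startswith("__"):
            continue  # summary counters such as __no_feature
        counts[gene] = int(count)
    return counts


def merge_counts(per_sample: Dict[str, Dict[str, int]]) -> List[list]:
    """Join per-sample counts into a table: header row, then one row per gene."""
    samples = list(per_sample)
    genes = sorted(set().union(*per_sample.values()))
    table = [["gene"] + samples]
    for gene in genes:
        table.append([gene] + [per_sample[s].get(gene, 0) for s in samples])
    return table


# Toy data standing in for hsc/C1_R1.counts and hsc/C2_R1.counts.
toy_files = {
    "C1_R1": "geneA\t10\ngeneB\t0\n__no_feature\t5\n__ambiguous\t1\n",
    "C2_R1": "geneA\t25\ngeneB\t3\n__no_feature\t7\n__ambiguous\t0\n",
}
per_sample = {name: read_counts(io.StringIO(text)) for name, text in toy_files.items()}
count_table = merge_counts(per_sample)
for row in count_table:
    print("\t".join(str(value) for value in row))
```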

Page 49: Introduction To Next Generation  Sequencing (NGS) Data Analysis

DESeq2

http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html

April 21st workshop!

Page 50: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Downstream Analysis

Pathway and functional analysis:
• Gene Ontology over-representation
• Gene Set Enrichment Analysis (GSEA)
• Signaling Pathway Impact Analysis
• Software
  – DAVID, GSEA, WGCNA, Blast2GO, topGO, BiNGO...
  – IPA, GeneGO MetaCore, iPathway Guide

Page 51: Introduction To Next Generation  Sequencing (NGS) Data Analysis


Page 52: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Integrative Genomics Viewer (IGV)

http://www.broadinstitute.org/igv

Available on HPC. Use ‘module load igv’ and ‘igv’

Page 53: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Visualizing RNA-Seq mapping with IGV

http://www.broadinstitute.org/igv/UserGuide
Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Thorvaldsdóttir H et al. Brief Bioinform. 2013

Page 54: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Genomic Data Visualization

R packages for plots:

• ggplot2
• ggbio
• GenomeGraphs

Page 55: Introduction To Next Generation  Sequencing (NGS) Data Analysis


Page 56: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Galaxy: Web based platform for analysis of large datasets

http://hpc-galaxy.oit.uci.edu/root
https://main.g2.bx.psu.edu/

Galaxy: A platform for interactive large-scale genome analysis. Genome Res. 2005. 15: 1451-1455

Page 57: Introduction To Next Generation  Sequencing (NGS) Data Analysis


Page 58: Introduction To Next Generation  Sequencing (NGS) Data Analysis

What is ChIP-Seq?

• Chromatin Immunoprecipitation (ChIP) Sequencing

• ChIP - A technique of precipitating a protein antigen out of solution using an antibody that specifically binds to the protein.

• Sequencing – A technique to determine the order of nucleotide bases in a molecule of DNA.

• Used in combination to study the interactions between protein and DNA.

Page 59: Introduction To Next Generation  Sequencing (NGS) Data Analysis

ChIP-Seq Applications

Enables the accurate profiling of

• Transcription factor binding sites• Polymerases• Histone modification sites• DNA methylation

Page 60: Introduction To Next Generation  Sequencing (NGS) Data Analysis

A View of ChIP-Seq Data

• Typically, reads (35-55 bp) are quite sparsely distributed over the genome.
• Controls (i.e. no pull-down by antibody) often show smaller peaks at the same locations.

Rozowsky et al., Nature Biotech, 2009

Page 61: Introduction To Next Generation  Sequencing (NGS) Data Analysis

ChIP-Seq Analysis Pipeline

Sequencing → Base calling → Short read sequences → Read QC → Short read alignment → Peak calling → Enriched regions

Downstream of peak calling:
• Differential peaks
• Motif discovery
• Combine with gene expression
• Visualization with genome browser

Page 62: Introduction To Next Generation  Sequencing (NGS) Data Analysis

ChIP-Seq: Identification of Peaks

• Several methods exist to identify peaks, but they mainly fall into 2 categories:
  – Tag density
  – Directional scoring
• In the tag density method, the program searches for large clusters of overlapping sequence tags within a fixed-width sliding window across the genome.
• In directional scoring methods, the bimodal pattern in the strand-specific tag densities is used to identify protein binding sites.
• Determining the exact binding sites from short reads generated from ChIP-Seq experiments:
  – SISSRs (Site Identification from Short Sequence Reads) (Jothi 2008)
  – MACS (Model-based Analysis of ChIP-Seq) (Zhang et al, 2008)
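The tag density idea can be illustrated with a toy sliding-window counter. This is only a sketch of the counting step, not how SISSRs or MACS work internally: real peak callers also use controls, strand information, and a statistical model. The window size, step, threshold, function names, and tag positions below are all arbitrary assumptions.

```python
from typing import List, Sequence, Tuple

Window = Tuple[int, int, int]  # (start, end, tag count), end is exclusive


def tag_density_windows(tag_starts: Sequence[int], window: int = 200,
                        step: int = 50, min_tags: int = 5) -> List[Window]:
    """Slide a fixed-width window along one chromosome and keep dense windows."""
    if not tag_starts:
        return []
    starts = sorted(tag_starts)
    hits = []
    lo = hi = 0  # indices bounding the tags inside the current window
    for w_start in range(0, starts[-1] + 1, step):
        w_end = w_start + window
        while lo < len(starts) and starts[lo] < w_start:
            lo += 1
        while hi < len(starts) and starts[hi] < w_end:
            hi += 1
        if hi - lo >= min_tags:
            hits.append((w_start, w_end, hi - lo))
    return hits


def merge_windows(windows: Sequence[Window]) -> List[Window]:
    """Merge overlapping dense windows into candidate enriched regions."""
    merged: List[Window] = []
    for start, end, count in windows:
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, prev_count = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_count, count))
        else:
            merged.append((start, end, count))
    return merged


# Hypothetical tag start positions: a cluster near 1 kb plus one isolated tag.
tags = [1000, 1010, 1020, 1030, 1040, 1050, 5000]
regions = merge_windows(tag_density_windows(tags))
print(regions)
```

With these toy positions the clustered tags merge into a single candidate region, while the isolated tag at 5000 is ignored.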

Page 63: Introduction To Next Generation  Sequencing (NGS) Data Analysis

ChIP-Seq: Output

• A list of enriched locations
• Can be used to:
  – Determine the biological function of transcription factors, in combination with RNA-Seq
  – Identify genes co-regulated by a common transcription factor
  – Identify common transcription factor binding motifs

Page 64: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Resources in NGS data analysis

• Stackoverflow.com

Page 65: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Languages in Bioinformatics

Page 66: Introduction To Next Generation  Sequencing (NGS) Data Analysis

Summary

• NGS technologies are transforming molecular biology.
• Bioinformatics analysis is a crucial part of NGS applications
  – Data formats, terminology, general workflow
  – Analysis pipeline
  – Software for various NGS applications
• RNA-Seq and ChIP-Seq data analysis
• Pathway analysis
• Data visualization
• Bioinformatics resources

Thank you!

Page 67: Introduction To Next Generation  Sequencing (NGS) Data Analysis