NGS data overview - unipr.itbiochimica.unipr.it/biocomp/Corso_NGS.pdf · Applications DNA-Seq:...

NGS data overview Università degli studi di parma Dipartimento di Bioscienze Davide Carnevali [email protected]



NGS

Next-generation sequencing refers to non-Sanger-based high-throughput DNA sequencing technologies. Millions or billions of DNA strands can be sequenced in parallel, yielding substantially more throughput and minimizing the need for the fragment-cloning methods that are often used in Sanger sequencing of genomes.


Thousands of $ -> cents

Millions of $ -> thousands of $


Technologies


Illumina

Library preparation:
•  Fragmentation
•  Adapter ligation: single/paired end, barcodes (multiplexing)
•  Amplification

Sequencing -> reads


Library preparation


Library adapters

•  Single/paired end (reads)
•  Multiplexing (barcodes): sequencing of multiple samples together


Library amplification


Library sequencing


Applications

DNA-Seq: Sequencing of whole or targeted (e.g., exome) genomic DNA. Used to detect SNPs, indels, larger structural polymorphisms, and CNVs (copy number variations). DNA-seq is also used for de novo assembly.

ChIP-seq: ChIP (chromatin immunoprecipitation) is used to enrich genomic DNA for regulatory elements, followed by sequencing and mapping of the enriched DNA to a reference genome. The initial statistical challenge is to identify regions where the mapped reads are enriched relative to a sample that did not undergo ChIP; a subsequent task is to identify differential binding across a designed experiment.

Metagenomics: Sequencing generates sequences from samples containing multiple species, typically microbial communities sampled from niches such as the human oral cavity. Goals include inference of species composition (when sequencing typically targets phylogenetically informative genes such as 16S) or metabolic contribution.

RNA-seq: Differential expression studies or novel transcript discovery.

Bisulfite-seq: Bisulfite treatment of DNA to determine its pattern of methylation.


SNPs

A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation that occurs commonly within a population (e.g., at a frequency of at least 1%), in which a single nucleotide (A, T, C or G) in the genome differs between members of a biological species or between paired chromosomes.


Variant

A “variation” or “variant” refers to an allele sequence that is different from the reference at as little as a single base or for a longer (potentially much longer) interval.

In general the distinction between “variation” and “polymorphism” is that polymorphisms are by definition variable sites within or between populations. “Variation” makes no assumption about degree of polymorphism except by comparison between a sample and the reference.


Variant Calling

The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations.

http://samtools.github.io/hts-specs/VCFv4.2.pdf
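As a rough illustration of what a VCF data line holds, the sketch below splits one tab-separated record into the eight fixed columns of the VCF 4.2 specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The example record and the function name `parse_vcf_line` are made up for illustration; real files are better read with a dedicated library.

```python
# Minimal sketch: parse one VCF data line (not a full VCF parser).
# The record below is an invented example, not data from the course.

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Split a tab-separated VCF data line into its 8 fixed columns."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # 1-based position on CHROM
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are allowed
    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags
    info = {}
    for item in record["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True
    record["INFO"] = info
    return record

example = "chr1\t10177\t.\tA\tAC,G\t50\tPASS\tDP=14;DB"
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["INFO"])
```

Header lines (those starting with `#`) are not handled here and would need to be skipped before calling the function.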


ChIP-seq

ChIP-sequencing is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins.


RNA-seq

RNA-Seq is an approach to transcriptome profiling that uses deep-sequencing technologies. It provides a far more precise measurement of the levels of transcripts and their isoforms than earlier methods such as hybridization-based microarrays.


stranded RNA-seq


Bisulfite-seq

•  Bisulfite converts unmethylated C's into U's but leaves methylated C's unchanged; the U's are then read as T's in the sequencing reads

•  Mapping the reads to converted versions of the reference shows where methylation occurs
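The two bullets above can be sketched in a few lines. This is a toy model only: the sequence, the set of methylated positions and the function names are invented, the read is assumed to be already aligned to the same strand of the reference, and real bisulfite aligners handle both strands and sequencing errors.

```python
# Toy sketch of bisulfite conversion and methylation calling.
# All sequences and positions are invented for illustration.

def bisulfite_convert(seq, methylated):
    """Return the sequence as it would appear in a read after bisulfite treatment.

    Unmethylated C -> U, which is read as T; methylated C
    (0-based positions listed in `methylated`) stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )

def call_methylation(reference, read):
    """Compare an aligned read with the reference at every reference C."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base == "C":
            if read_base == "C":
                calls[i] = "methylated"
            elif read_base == "T":
                calls[i] = "unmethylated"
    return calls

reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated={1})
print(read)                               # ACGTTTGA
print(call_methylation(reference, read))
```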


Bioinformatic workflow

•  Quality Control
•  Alignment
•  Experiment-specific analyses: variant calling (VC), peak calling (PC), differential expression (DE)
•  Data visualization


Reads: fastq format

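@PRESLEY_0005:2:1:1455:1033#0/1
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+PRESLEY_0005:2:1:1455:1033#0/1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Line 1: read name (after '@')
Line 2: sequence
Line 3: separator ('+', optionally followed by the read name again)
Line 4: quality scores (Phred), one character per base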
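A minimal sketch of reading such records, assuming the common layout of exactly four lines per read (multi-line sequences are not handled). The function name `read_fastq` and the short example record are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual

# Invented record; a real file would be opened with open("reads.fastq")
fastq_text = "@read1\nACGT\n+\nII9C\n"
records = list(read_fastq(io.StringIO(fastq_text)))
print(records)
```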


Reads quality: Phred scale

Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error probabilities P:

$Q = -10 \log_{10} P$, or equivalently $P = 10^{-Q/10}$

Illumina: Phred+33 encoding, Q from 0 to 93 (ASCII 33-126)
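The relation between Q, P and the ASCII encoding can be checked with a short sketch. The function names are illustrative, and the default offset of 33 follows the Phred+33 encoding stated above.

```python
import math

def phred_to_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Phred score for an error probability: Q = -10 log10(P)."""
    return -10 * math.log10(p)

def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in qual]

scores = decode_quality("II9C")
print(scores)              # [40, 40, 24, 34]
print(phred_to_prob(30))   # about 0.001, i.e. one expected error in 1000 base calls
```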


Reads Quality Control

•  Check reads quality and contamination
•  Remove poor quality bases (trimming)
•  Remove adapter contamination (clipping)

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
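To make trimming and clipping concrete, here is a deliberately naive sketch. It is not the algorithm of FastQC (which only reports quality) or of any particular trimming tool: trimming removes bases from the 3' end while their score is below a threshold, and clipping cuts at the first exact adapter match. The threshold, the reads and the adapter string are assumptions chosen for the example.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def clip_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]

# Invented reads: the last two bases of the first read have very low quality
print(trim_3prime("ACGTACGTTT", "IIIIIII5#!"))
# The second read runs into an example adapter sequence after 4 bases
print(clip_adapter("ACGTAGATCGGAAGAGC", "I" * 17, "AGATCGGAAGAGC"))
```

Real tools typically also tolerate mismatches in the adapter and partial adapter matches at the read end, which this sketch ignores.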


Alignment (to a reference genome)

Alignment is the process of determining the most likely source within the genome sequence for an observed DNA sequencing read, given knowledge of which species the sequence has come from.


Short read alignment is tricky for several reasons:
1.  The reference genome is really big. Searching (in) big things is harder than searching (in) small things.
2.  You aren’t always looking for exact matches in the reference genome.


Alignment Basic algorithm

Seed and extend (e.g., BLAST)
Seeds: parts of the read with exact matches of a fixed size (e.g., size = 11)
Tradeoff between speed and accuracy


We gain speed by skipping directly to the stage when we have a viable proto-alignment: a perfect match of the seed. We lose accuracy because we might miss the best alignment; the best alignment of the read with the target sequence might involve a difference between the read and the target that is inside the seed. If we require an exact match at the seeding stage, we will miss optimal alignments that have this feature.


The size of the seed affects the number of matches found during the seeding stage of alignment. In general, larger seeds yield fewer exact matches, while smaller seeds match at more locations in the target.
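The seed-and-extend idea and its tradeoff can be sketched as follows. This is a simplified illustration, not BLAST or any production aligner: the seed is just the first k bases of the read, the extension is ungapped, and the reference, reads and function names are invented.

```python
from collections import defaultdict

def build_seed_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_and_extend(read, reference, k, max_mismatches=2):
    """Seed with the read's first k bases (exact match), then extend without gaps."""
    index = build_seed_index(reference, k)
    hits = []
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

reference = "GGACGTTCAACGTACA"
print(seed_and_extend("ACGTAC", reference, k=4))   # seed ACGT found at 2 and 9
# This read would align at position 9 with one mismatch, but the mismatch
# falls inside the seed, so no seed hit is found and the alignment is missed.
print(seed_and_extend("AGGTAC", reference, k=4))
# Shorter seeds match at more places: 'AC' occurs 3 times, 'ACGT' twice.
print(len(build_seed_index(reference, 2)["AC"]), len(build_seed_index(reference, 4)["ACGT"]))
```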


Alignment


NGS Alignment

•  Hash table-based implementations, in which the hash may be created using either the reference genome or the set of sequencing reads (e.g., SOAP) (spaced seeds)

•  Burrows-Wheeler transform (BWT)-based methods, which first create an efficient index (FM-index) of the reference genome assembly in a way that facilitates rapid searching with a low memory footprint (e.g., Bowtie, BWA)


Alignment: Hash table–based

Hash table: data structure that is able to index complex and nonsequential data in a way that facilitates rapid searching


Alignment: Burrows Wheeler transform (BWT)-based

The Burrows–Wheeler transform rearranges a character string (e.g. reference genome) into runs of similar characters. The transformation is reversible, without needing to store any additional data
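A small sketch of the transform and its inverse, using the naive sorted-rotations construction, which is only practical for short strings. One assumption worth stating: an end-of-text sentinel (`$`, smaller than any base) is appended so that the inverse can identify the original rotation; the function names are illustrative.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (fine for short strings)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    text += sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(transformed, sentinel="$"):
    """Invert the transform by repeatedly prepending the BWT column and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

print(bwt("ACAACG"))               # GC$AAAC
print(inverse_bwt(bwt("ACAACG")))  # ACAACG
```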


The FM-index is a compressed full-text substring index based on the Burrows-Wheeler transform. It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as to locate the position of each occurrence.
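The counting and locating steps can be illustrated with backward search. The sketch below is uncompressed on purpose: it keeps the full suffix array and full occurrence tables, whereas a real FM-index samples both to save memory. Function and variable names are assumptions made for the example.

```python
def build_fm_index(text, sentinel="$"):
    """Build suffix array, C table and Occ table for text + sentinel."""
    text += sentinel
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in suffix_array)  # the BWT column
    alphabet = sorted(set(text))
    # counts[c]: number of characters in the text strictly smaller than c
    counts, total = {}, 0
    for c in alphabet:
        counts[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in last[:i]
    occ = {c: [0] * (len(last) + 1) for c in alphabet}
    for i, ch in enumerate(last):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return suffix_array, counts, occ

def backward_search(pattern, index):
    """Return the sorted 0-based positions of all exact occurrences of pattern."""
    suffix_array, counts, occ = index
    lo, hi = 0, len(suffix_array)
    for ch in reversed(pattern):  # process the pattern right to left
        if ch not in counts:
            return []
        lo = counts[ch] + occ[ch][lo]
        hi = counts[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])

index = build_fm_index("ACAACG")
print(backward_search("AC", index))   # [0, 3]
print(backward_search("ACG", index))  # [3]
```

The number of occurrences is simply `hi - lo` after the loop, which is why counting is cheaper than locating.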


Bowtie algorithm

Query: AATGATACGGCGACCACCGAGATCTA

Reference

BWT( Reference )


Query: AATGTTACGGCGACCACCGAGATCTA (differs from the query above at one base, A to T at the fifth position)


RNA-seq reads alignment

Reads from RNA-seq can span exon-exon junctions, so they call for a splicing-aware aligner (e.g., TopHat).


TopHat (spliced aligner)


Alignment: SAM format

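Sequence Alignment/Map (SAM) format is TAB-delimited. Apart from the header lines, which start with the '@' symbol, each alignment line consists of 11 mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), optionally followed by TAG:TYPE:VALUE fields.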


Each bit in the FLAG field is defined as:

http://broadinstitute.github.io/picard/explain-flags.html
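As a sketch of how the FLAG bits combine, the snippet below decodes a flag value into the bit meanings given in the SAM specification. The wording of each description and the function name `explain_flag` are my own; the value 337 is the flag of the TopHat example record shown in these slides.

```python
# Bit values follow the SAM specification; the descriptions are paraphrased.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]

def explain_flag(flag):
    """List the meaning of every bit set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]

print(explain_flag(337))  # 337 = 256 + 64 + 16 + 1
```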


@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
. . . . .
. . . . .
. . . . .

@PG ID:TopHat VN:2.0.11 CL:tophat -p 2 --b2-very-sensitive -o output_dir --library-type fr-firststrand Genome rep1_1.fastq rep1_2.fastq

TUPAC_0006:2:44:16964:12525#0 337 chr1 10017 0 76M chr5 11709 0 CCTAACCCTATCCCTAACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA BBBBBBBBBBBBBBBBBBBBBBBdbbac_ffd`dZdd`^d\a`a]]ZWc_^dcfffde`ecfbfbdffffefffff AS:i:-10 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:10A9T55 YT:Z:UU NH:i:20 CC:Z:chr5 CP:i:10617 XS:A:+
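A minimal sketch of splitting an alignment line into the 11 mandatory SAM fields plus optional tags, and of breaking a CIGAR string into operations. The record used here is a shortened, invented variant of the example above (the real file is tab-separated), and the function names are hypothetical; for real work a library such as pysam is the usual choice.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into named fields and tags."""
    cols = line.rstrip("\n").split("\t")
    record = {name: (int(val) if name in INT_FIELDS else val)
              for name, val in zip(SAM_FIELDS, cols[:11])}
    tags = {}
    for item in cols[11:]:
        tag, typ, value = item.split(":", 2)  # TAG:TYPE:VALUE
        tags[tag] = int(value) if typ == "i" else value
    record["TAGS"] = tags
    return record

def parse_cigar(cigar):
    """Turn a CIGAR string such as '30M1000N46M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

# Shortened, invented record modelled on the slide's example
line = "\t".join(["read1", "337", "chr1", "10017", "0", "8M", "chr5", "11709", "0",
                  "CCTAACCC", "BBBBBBBB", "NM:i:2", "MD:Z:3A4", "XS:A:+"])
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], rec["TAGS"])
print(parse_cigar("30M1000N46M"))  # N marks a skipped region, e.g. an intron in a spliced read
```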


Alignment: BAM format

BAM is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and indexable representation of nucleotide sequence alignments.

http://samtools.sourceforge.net/


NGS data analysis

From alignment to...
•  Peak Calling (MACS)
•  Variant Calling (GATK)
•  Differential Expression analysis (DESeq)


Other formats

Expression: bedGraph, (big)wig
Annotation: BED, GFF/GTF

BED and bedGraph use a 0-based coordinate system in which the end position is not included (half-open intervals); the other formats are 1-based and include the end position.
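The difference between the two conventions comes down to the start coordinate, as this small sketch shows. The function names are illustrative; the interval is the first row of the bedGraph example in these slides.

```python
def bed_to_one_based(start, end):
    """0-based, half-open (BED/bedGraph) -> 1-based, closed (GFF/GTF style)."""
    return start + 1, end

def one_based_to_bed(start, end):
    """1-based, closed -> 0-based, half-open."""
    return start - 1, end

def bed_length(start, end):
    """Number of bases covered by a 0-based, half-open interval."""
    return end - start

print(bed_to_one_based(163488, 163538))  # (163489, 163538)
print(bed_length(163488, 163538))        # 50
```

Only the start shifts by one; the end value is numerically the same in both systems, and in the 1-based closed system the length is `end - start + 1`.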


chrom start end value

chr19 163488 163538 1
chr19 527211 527261 2
chr19 527435 527485 2
chr19 583006 583007 1
chr19 583007 583056 2
chr19 583056 583057 1
chr19 626610 626657 1
chr19 845596 845646 1
chr19 845696 845746 1
chr19 871830 871880 1
chr19 978730 978776 1
chr19 1079712 1079762 1
chr19 1092579 1092629 1
chr19 1125251 1125301 1
chr19 1272443 1272493 4

The bedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data.

http://genome.ucsc.edu/goldenPath/help/bedgraph.html


The wiggle (WIG) format is an older format for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. The bigWig (compressed) format is the recommended format for almost all graphing track needs.

http://genome.ucsc.edu/goldenPath/help/wiggle.html


Other formats wiggle

variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format.

variableStep chrom=chr19 span=50
163488 1
variableStep chrom=chr19 span=50
527211 2
variableStep chrom=chr19 span=50
527435 2
variableStep chrom=chr19 span=1
583006 1
variableStep chrom=chr19 span=49
583007 2
variableStep chrom=chr19 span=1
583056 1
variableStep chrom=chr19 span=47
626610 1
variableStep chrom=chr19 span=50
845596 1
variableStep chrom=chr19 span=50


Other formats wiggle

fixedStep is for data with regular intervals between new data values and is the more compact wiggle format.

fixedStep chrom=chr19 start=400601 step=100
11
22
33
21
18
9
5
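To show why fixedStep is compact, the sketch below expands a fixedStep block into explicit (chromosome, position, value) tuples: only the first position is stored and every later one is derived from the step. The function name is hypothetical, and the optional `span` parameter is not handled (each value is treated as covering a single base).

```python
def parse_fixed_step(lines):
    """Expand a fixedStep wiggle block into (chrom, position, value) tuples."""
    header = lines[0].split()
    if header[0] != "fixedStep":
        raise ValueError("not a fixedStep declaration: " + lines[0])
    params = dict(item.split("=", 1) for item in header[1:])
    chrom = params["chrom"]
    start = int(params["start"])  # wiggle positions are 1-based
    step = int(params["step"])
    return [(chrom, start + i * step, float(value))
            for i, value in enumerate(lines[1:])]

block = ["fixedStep chrom=chr19 start=400601 step=100",
         "11", "22", "33", "21", "18", "9", "5"]
for row in parse_fixed_step(block):
    print(row)
```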


BED lines have three required fields (chrom, chromStart, chromEnd) and nine additional optional fields.

The 9 additional optional BED fields are: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts.


BED file example

chr1 62901974 62917475 NM_001017416 0 + 62905538 62916652 0 9 164,239,121,105,161,692,171,202,1559, 0,3495,5184,5891,6855,8434,11037,12161,13942,

chr1 62901974 62917475 NM_001017416 0 +

chrom start end name score strand
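The last three columns of the 12-column line (blockCount, blockSizes, blockStarts) describe the exons relative to the start coordinate. A sketch of expanding them into absolute 0-based, half-open intervals follows; the function name `bed12_blocks` is hypothetical, and fields are split on any whitespace.

```python
def bed12_blocks(line):
    """Return the blocks (exons) of a BED12 line as (chrom, start, end) tuples."""
    f = line.split()
    chrom, chrom_start = f[0], int(f[1])
    sizes = [int(x) for x in f[10].split(",") if x]   # trailing comma is allowed
    starts = [int(x) for x in f[11].split(",") if x]  # offsets relative to chromStart
    return [(chrom, chrom_start + s, chrom_start + s + size)
            for s, size in zip(starts, sizes)]

line = ("chr1 62901974 62917475 NM_001017416 0 + 62905538 62916652 0 9 "
        "164,239,121,105,161,692,171,202,1559, "
        "0,3495,5184,5891,6855,8434,11037,12161,13942,")
blocks = bed12_blocks(line)
print(len(blocks))   # 9
print(blocks[0])     # ('chr1', 62901974, 62902138)
print(blocks[-1])    # ('chr1', 62915916, 62917475)
```

The last block ends exactly at the end coordinate of the BED line, which is a useful sanity check when parsing.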


GFF (General Feature Format) lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly.


GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following attribute by exactly one space.


The attribute list must begin with the two mandatory attributes: gene_id and transcript_id.


GTF file example

chr1 hg19_refGene exon 62901975 62902138 0.000000 + . gene_id "NM_001017416"; transcript_id "NM_001017416";
chr1 hg19_refGene exon 62901975 62902233 0.000000 + . gene_id "NM_001017415"; transcript_id "NM_001017415";
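A sketch of reading one such line: the first eight tab-separated fields are taken as they are and the ninth is split into type/value attribute pairs. The function name is hypothetical, and the simple split on `;` assumes that attribute values contain no semicolons.

```python
def parse_gtf_line(line):
    """Parse a tab-separated GTF line into a dict with an 'attributes' sub-dict."""
    cols = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attr = cols
    attributes = {}
    for item in attr.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final ';'
        key, value = item.split(" ", 1)
        attributes[key] = value.strip('"')
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),  # 1-based, end included
        "score": score, "strand": strand, "frame": frame,
        "attributes": attributes,
    }

line = "\t".join(["chr1", "hg19_refGene", "exon", "62901975", "62902138",
                  "0.000000", "+", ".",
                  'gene_id "NM_001017416"; transcript_id "NM_001017416";'])
exon = parse_gtf_line(line)
print(exon["start"], exon["end"], exon["attributes"])
```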


Visualization

Use genome browsers to visualize SAM/BAM alignments and expression and annotation tracks (files).

A genome browser is a graphical interface for displaying information from a biological database of genomic data.

http://genome.ucsc.edu/

http://www.broadinstitute.org/igv/
