Introduction to NGS: Experimental Design and Quality Control. Antonio Rueda.


NGS technologies

Library preparation

Amplification


From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069

Sequencing

The identity of each base of a cluster is read off from sequential images

Concepts

• Reads: each of the sequences read by the sequencer. Each position of a read is read in one sequencing cycle. A base and a quality value are reported for each position.

• Each read has a unique identifier.

Concepts

• Sequencing quality (base quality):
– All sequencers are assumed to make errors during the sequencing process, so a quality value is reported for every sequenced base.

Concepts

• Mapping: placing each read at its position in the reference genome. Other formats exist, but SAM/BAM is the most widely used.

Concepts

• Coverage: the number of times each position is read. When analysing coverage, two points are very important:

– The coverage itself. We use the mean coverage to measure it. It must be high enough to give confidence in the sequenced region.

– The coverage must be uniform across all the regions we want to sequence. We use the % of covered positions and the standard deviation to quantify it.
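The three coverage metrics named above (mean coverage, % of covered positions, standard deviation) can be sketched as follows. This is a minimal illustration: the per-position depth list is a hypothetical input, and the `min_depth` threshold that defines a "covered" position is an assumption, not something fixed by the slides.

```python
from statistics import mean, pstdev

def coverage_summary(depths, min_depth=1):
    """Summarise per-position read depth over the target region.

    depths: one depth value per target position (hypothetical input).
    min_depth: depth at or above which a position counts as covered (assumed).
    """
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "mean_coverage": mean(depths),
        "pct_covered": 100.0 * covered / len(depths),
        "std_dev": pstdev(depths),
    }

# Toy example: ten positions, two of them with no reads at all.
depths = [30, 32, 28, 0, 0, 31, 29, 30, 33, 27]
summary = coverage_summary(depths)
print(summary)
```

A high mean with a large standard deviation or a low % of covered positions points to uneven coverage, which is why the mean alone is not enough.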

Concepts

• Variant calling: searching for differences between the reads and the reference genome.

Reference

Experimental Design

RNA-seq / Transcriptomics
o Quantitative
o Descriptive
o Alternative splicing
o miRNA profiling

Resequencing
o Mutation calling
o Profiling
o Genome annotation

De novo sequencing

Copy number variation

ChIP-seq / Epigenomics
o Protein-DNA interactions
o Active transcription factor binding sites
o Histone methylation

Metagenomics

Metatranscriptomics

Exome sequencing

Targeted sequencing

DNA sequencing - 1

• Whole GENOME resequencing
– Need reference genome
– Variation discovery

DNA sequencing - 2

• Whole GENOME "de novo" sequencing
– Uncharacterized genomes with no reference genome available
– Known genomes where significant structural variation is expected
– Long reads or mate-pair libraries. Sequencing mostly done by Roche 454 and also Illumina
– Assembly of reads is needed: computationally intensive

• E.g. bacterial genome sequencing

DNA sequencing - 3

• Whole EXOME resequencing
– Need reference genome
• Available for human and mouse
– Variation discovery on ORFs
• 2% of the human genome (lower cost)
• 85% of disease mutations are in the exome
– Need probes complementary to exons
• Nimblegen
• Agilent

• E.g. human exome

DNA sequencing - 4

• Targeted resequencing
– Capture of specific regions in the genome

• Custom gene panel sequencing
– Allows covering a large number of genes related to a disease
– E.g. disease gene panel

• Lower cost and quicker than capillary sequencing

• Multiplexing is possible

• Need custom probes complementary to the genomic regions
– Nimblegen
– Agilent

Introduction to NGS Technologies

Transcriptomics - 1

• RNA-Seq
– Sequencing of mRNA
– rRNA depleted samples
– Very high dynamic range
– No prior knowledge of expressed genes
– Gives information (richer than microarrays) about:
• Differential expression of known or unknown transcripts during a treatment or condition
• Isoforms
• New alternative splicing events
• Non-coding RNAs
• Post-transcriptional mutations or editing
• Gene fusions

2 Oct, 2013


Transcriptomics - 1

• RNA-Seq – Sequencing of mRNA

– Detecting gene fusions

Common Problems

• Signal errors.
• Intensity signal error (454, Ion Torrent).

Common Problems

• Diploid and polyploid genomes:
– Error or heterozygous variant?
– Use coverage!
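One way to see why coverage helps separate a sequencing error from a heterozygous variant is a simple binomial calculation: how likely is it that the observed number of non-reference reads comes from errors alone? The sketch below is only an illustration. The 1% per-base error rate is an assumed value, and real variant callers use more elaborate models.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def prob_alt_reads_from_errors(alt_reads, coverage, error_rate=0.01):
    """Chance of seeing at least `alt_reads` non-reference reads by error alone.

    error_rate is an assumed per-base error probability, not a platform constant.
    """
    return binom_tail(alt_reads, coverage, error_rate)

# Same 25% non-reference fraction at two different depths.
low = prob_alt_reads_from_errors(1, 4)     # 1 of 4 reads
high = prob_alt_reads_from_errors(10, 40)  # 10 of 40 reads
print(f"4x: {low:.4f}   40x: {high:.2e}")
```

At 4x, one discordant read is easily explained by an error. At 40x, ten discordant reads are not, so the site is far more likely to be a real heterozygous variant.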

Common Problems

• Polymerase errors (all platforms, excluding SOLiD)
• Ligase errors (SOLiD)
• Mapping errors
• Variant calling errors
• Human error

Maybe, Quality Control is needed!!

Comparison

Illumina
• Features: short fragments, high throughput, cheap, GC bias
• Applications: resequencing, de novo sequencing, ChIP-seq, RNA-seq, Methyl-seq

SOLiD
• Features: short fragments, high throughput, cheap, color-space
• Applications: resequencing, ChIP-seq, RNA-seq, Methyl-seq

Roche 454
• Features: long fragments, low throughput, expensive, homopolymer (poly-nucleotide) errors
• Applications: de novo sequencing, amplicon sequencing, metagenomics, RNA-seq

Pipeline similar for all NGS platforms, and LOTS of data:

DNA sample -> library preparation -> NGS instrument (sequencing) -> data -> analysis

NGS is relatively cheap, but think carefully about the question you want to answer, because the analysis will not work magic

QC and read cleaning

Basic steps NGS data processing


Mapping



Mapping

DNA Binding site


Practical session

• Download programs:
– FastQC
– Qualimap

• Download files (course web page):
– Fastq file
– Bam file

http://www.bioinformatics.babraham.ac.uk/projects/download.html

https://www.dropbox.com/sh/0zpbl8b2yw5doxp/zqbO27c-L9

Quality Control: Raw Data

• Number of input reads
• Base quality
• Reads quality
• Biases
• Software: FastQC

Where are we?

Sequence processing

Mapping

Variant calling

Variant annotation

Transcript quantification

DE analysis

miRNA prediction

Fastq format

• Fastq format "is a fasta with qualities":
1. Header line (like fasta but starting with "@")
2. Sequence (string of nucleotides)
3. "+" and sequence ID (optional)
4. Quality values of the sequence, each encoded as a single-byte ASCII character

• File extension: .fastq
• Sequence quality encoding
o Base quality must be encoded in just 1 byte!
o Each base has a corresponding quality value: the quality in position n corresponds to the base in position n
o Encoding procedure: error probability P -> Phred transformation, Q = -10 log10(P), rounded to an integer -> ASCII encoding

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
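The four-line record above can be read with a few lines of Python. This is a minimal sketch, not a full parser: it assumes exactly four lines per record and a Phred+33 offset, and the function name `parse_fastq_record` is made up for illustration.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (read_id, sequence, scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so score = ASCII code - 33.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores

record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
read_id, seq, scores = parse_fastq_record(record)
print(read_id, len(seq), scores[:5])
```

The length check enforces the rule that the quality in position n belongs to the base in position n.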

Fastq format. Sequence quality encoding

• Phred + 33
o Sanger [0,40], Illumina 1.8 [0,41], Illumina 1.9 [0,41]

• Phred + 64
o Illumina 1.3 [0,40], Illumina 1.5 [3,40]
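Because the two offsets use overlapping but different ASCII ranges, the encoding can often be guessed from the characters seen in the quality strings. The sketch below is a heuristic built only on the ranges listed above (Phred+33 up to Q41, Phred+64 starting at Q0). The function name is hypothetical, and it returns `None` when the observed characters fit both encodings.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from observed quality characters."""
    codes = [ord(ch) for q in quality_strings for ch in q]
    lowest, highest = min(codes), max(codes)
    if lowest < 64:    # below '@': impossible for Phred+64 with scores >= 0
        return 33
    if highest > 74:   # above 'J': would exceed Q41 under Phred+33
        return 64
    return None        # every character is valid under both encodings

print(guess_phred_offset(["!''*((((***+))%%%++"]))
print(guess_phred_offset(["hhhhhggggfffeeed"]))
```

In practice the guess gets more reliable as more reads are inspected, since a handful of high-quality reads may fall entirely in the ambiguous range.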

Prob. of incorrect base call    Phred quality score    Base call accuracy
1 in 10                         10                     90%
1 in 100                        20                     99%
1 in 1000                       30                     99.9%
1 in 10000                      40                     99.99%
1 in 100000                     50                     99.999%
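The rows of this table follow from the standard Phred relation Q = -10 log10(P). A short sketch of the conversion in both directions, plus the final ASCII step, is given below. The helper names are illustrative, and the default offset of 33 is an assumption.

```python
from math import log10

def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability, rounded to an integer."""
    return round(-10 * log10(p))

def phred_to_ascii(q, offset=33):
    """Single character that encodes score q (offset 33 assumed by default)."""
    return chr(q + offset)

for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error prob {p:g}, accuracy {100 * (1 - p):.3f}%")
```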


Good quality

Reasonable quality

Poor quality

Shows an overview of the range of quality values across all bases at each position in the fastq file
• The central red line is the median value
• The yellow box represents the inter-quartile range (25-75%)
• The upper and lower whiskers represent the 10% and 90% points
• The blue line represents the mean quality
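The values drawn in this plot (median, quartiles, 10% and 90% points, mean) can be computed per read position as sketched below. This is not FastQC's own code: the linear-interpolation percentile is one common definition chosen here for illustration, and the input is a hypothetical list of already decoded Phred scores.

```python
from statistics import mean, median

def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of a sorted list (fraction in [0, 1])."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight

def per_base_quality(reads_scores):
    """reads_scores: per-read lists of Phred scores (reads may differ in length)."""
    longest = max(len(s) for s in reads_scores)
    summary = []
    for pos in range(longest):
        column = sorted(s[pos] for s in reads_scores if len(s) > pos)
        summary.append({
            "mean": mean(column),
            "median": median(column),
            "q25": percentile(column, 0.25),
            "q75": percentile(column, 0.75),
            "p10": percentile(column, 0.10),
            "p90": percentile(column, 0.90),
        })
    return summary

# Toy data: quality drops sharply at the last position.
reads = [[38, 37, 30], [40, 36, 20], [39, 35, 10], [40, 38, 2], [37, 34, 25]]
stats = per_base_quality(reads)
for pos, row in enumerate(stats, start=1):
    print(pos, row)
```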

Per base sequence quality

• Good data
• Consistent
• High quality along the read

Per base sequence quality

• Bad data
• High variance
• Quality decreases towards the end of the read


Per sequence quality scores

• Good data
• Most of the reads are high-quality sequences

• Bad data
• Bimodal distribution

Low quality reads

Shows whether a subset of sequences has universally low quality values

Per base sequence content

• Good data
• Smooth over the read

• Bad data
• Sequence position bias
• Library contamination (overrepresented sequence)?

Plots the proportion of each base position in a file for which each of the four normal DNA bases has been called -> little / no difference between the different bases in a random library
The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other
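The per-position base proportions described above can be tabulated with a counter per read position. The sketch below is an illustration on a hypothetical toy input. It also counts N calls, so the same table gives the N percentage per position.

```python
from collections import Counter

def per_base_content(sequences, alphabet="ACGTN"):
    """Percentage of each base call at every read position."""
    longest = max(len(s) for s in sequences)
    table = []
    for pos in range(longest):
        counts = Counter(s[pos] for s in sequences if len(s) > pos)
        total = sum(counts.values())
        table.append({base: 100.0 * counts[base] / total for base in alphabet})
    return table

seqs = ["ACGT", "ACGA", "TCGN", "GCTA"]
content = per_base_content(seqs)
for pos, row in enumerate(content, start=1):
    print(pos, row)
```

In this toy input position 2 is 100% C, the kind of position-specific bias that would stand out in the plot.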

Per base GC content

• Good data
• Smooth over the read

Plots the GC content of each base position in a file -> little / no difference between the different bases (random library)
The overall GC content should reflect the GC content of the underlying genome

• Bad data
• Sequence position bias
• Library contamination (overrepresented sequence)?

Per sequence GC content

• Good data
• Normal distribution
• Distribution fits with expected
• Organism dependent

• Bad data
• Distribution does not fit with expected
• Library contamination?

Measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content
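The per-read GC percentage behind this plot is simple to compute. The sketch below builds the observed histogram only. It does not fit the modelled normal distribution, and ignoring N calls in the denominator is a choice made here for illustration.

```python
from collections import Counter

def gc_percent(sequence):
    """GC percentage of one read, ignoring anything that is not A, C, G or T."""
    called = [b for b in sequence.upper() if b in "ACGT"]
    if not called:
        return None
    return 100.0 * sum(b in "GC" for b in called) / len(called)

def gc_histogram(sequences):
    """Number of reads per rounded GC percentage (0-100)."""
    values = (gc_percent(s) for s in sequences)
    return Counter(round(v) for v in values if v is not None)

hist = gc_histogram(["ATGC", "AATT", "GGCC", "ATGC", "NNNN"])
print(sorted(hist.items()))
```

A second peak in such a histogram, away from the GC content expected for the organism, is the kind of signal that suggests contamination.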

Per base N content

• Good data

• Bad data
• N calls are biased towards particular base positions

Plots the percentage of base calls at each position for which an N was called
It is not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence

Sequence length distribution

• Some sequencers output reads of different length (for example, Roche 454)

• Some sequencers generate sequence fragments of uniform length

Sequence duplication levels

• Good data
o Low level of duplication
o May indicate a very high level of coverage in the target sequence

• Bad data
o High number of duplicates
o May indicate some kind of enrichment bias (e.g. PCR over-amplification)

• In transcriptomics, a higher number of duplicated sequences is expected
• In genomics, a low number of duplicated sequences is expected
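A rough duplication level can be estimated by counting exact-match reads, as in the sketch below. This is a simplification for illustration: it compares whole read sequences only and is not claimed to reproduce how FastQC computes its duplication plot.

```python
from collections import Counter

def duplication_summary(sequences):
    """Exact-match duplication statistics for a list of read sequences."""
    counts = Counter(sequences)
    total = len(sequences)
    distinct = len(counts)
    return {
        "total_reads": total,
        "distinct_sequences": distinct,
        # Reads that are extra copies of a sequence already seen.
        "pct_duplicated": 100.0 * (total - distinct) / total,
        "most_common": counts.most_common(1)[0],
    }

reads = ["ACGT", "ACGT", "ACGT", "TTGA", "GGCA", "TTGA"]
dup = duplication_summary(reads)
print(dup)
```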

Overrepresented sequences and K-mer content

• The exact same sequences appear too many times…
• Is that a problem? It depends…
o PCR primers, adapters, …
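Both checks can be approximated by counting: whole reads that make up more than some fraction of the library, and short k-mers that recur across reads. The sketch below is illustrative only. The `min_fraction` cut-off and the toy reads are assumptions, and deciding whether a hit is a primer or an adapter still requires comparing it with the sequences used in library preparation.

```python
from collections import Counter

def overrepresented(sequences, min_fraction=0.001):
    """Sequences whose share of all reads is at least min_fraction (assumed cut-off)."""
    counts = Counter(sequences)
    total = len(sequences)
    return [(seq, n / total) for seq, n in counts.most_common()
            if n / total >= min_fraction]

def kmer_counts(sequences, k):
    """Count every k-mer over all reads."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

reads = ["AGATCGGAAG"] * 3 + ["ACGTACGTAC", "TTGACCATGA"]
print(overrepresented(reads, min_fraction=0.5))
print(kmer_counts(reads, 5).most_common(3))
```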

Typical artifacts

Sequence adapters

Typical artifacts

Platform dependent

Sequence Filtering

• It is important to remove bad quality data -> our confidence in the downstream analysis will improve

Sequence Filtering

• Sequence filtering:
o Mean quality
o Read length
o Read length after trimming
o Percentage of bases above a quality threshold
o Adapter trimming
o Adapter reads

Minimum quality threshold
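Several of these criteria can be combined in a small filter: trim the low-quality tail, then keep the read only if it is still long enough and good enough. All thresholds below are assumed example values, not recommendations, and adapter trimming is left out because it depends on the adapter sequences used.

```python
def trim_low_quality_tail(sequence, scores, min_quality=20):
    """Drop bases from the 3' end until one reaches min_quality."""
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], scores[:end]

def passes_filters(scores, min_length=4, min_mean_quality=25,
                   good_quality=20, min_pct_good=80.0):
    """Apply the length, mean-quality and %-good-bases criteria (assumed thresholds)."""
    if len(scores) < min_length:
        return False
    mean_quality = sum(scores) / len(scores)
    pct_good = 100.0 * sum(q >= good_quality for q in scores) / len(scores)
    return mean_quality >= min_mean_quality and pct_good >= min_pct_good

seq, scores = "ACGTACGT", [35, 36, 34, 30, 28, 15, 10, 2]
trimmed_seq, trimmed_scores = trim_low_quality_tail(seq, scores)
print(trimmed_seq, trimmed_scores, passes_filters(trimmed_scores))
```

Trimming first and filtering afterwards matters: the untrimmed read above would be judged on its poor tail, while the trimmed read is judged on the bases that are actually kept.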

Sequence Filtering

• Sequence filtering tools
o Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
o Galaxy (https://main.g2.bx.psu.edu/)
o SeqTK (https://github.com/lh3/seqtk)
o Cutadapt (https://code.google.com/p/cutadapt/)
o …


Where are we?

Sequence processing

Mapping

Variant calling

Variant annotation

Transcript quantification

DE analysis

miRNA prediction

Quality Control: Mapping Data

• Coverage
• Mapped reads
• Uniformity
• Biases
• Software: BamQC

Qualimap Example

• http://qualimap.bioinfo.cipf.es/samples/HG00096.chrom20_result/qualimapReport.html


Quality Control: Capture Data

• Sensitivity
• Specificity
• Uniformity
• Biases
• Software: ngsCAT

ngsCAT Example

• http://www.bioinfomgp.org/ngscat/documentation/start

