Introduction to NGS: Experimental Design and Quality Control. Antonio Rueda.
Introduction to NGS
Experimental Design and Quality Control
Antonio Rueda
NGS technologies
Library preparation
Amplification
From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
Sequencing
The identity of each base of a cluster is read off from sequential images
Concepts
• Reads: each of the sequences read by the sequencer. Each position of a read was read in one sequencing cycle. For each position, a base and a quality value are reported.
• Each read has a unique identifier.
Concepts
• Base quality (sequencing quality):
– All sequencers make errors during the sequencing process, so a quality value is reported for each sequenced base.
Concepts
• Mapping: placing each read at its position in the reference genome. Although other formats exist, SAM/BAM is the most widely used.
Concepts
• Coverage: the number of times each position is read. When analysing coverage, two points are very important:
– Depth. We measure it with the mean coverage. It must be high enough to give confidence in the sequenced region.
– Uniformity. Coverage must be uniform across all the regions we want to sequence. We quantify it with the % of positions covered and the standard deviation.
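The three coverage figures named above can be sketched in a few lines. This is a minimal illustration, not the method of any particular tool: the function name `coverage_summary`, the `min_depth` parameter and the example depths are assumptions made for the sketch, and the input is taken to be a plain list with one read depth per target position.

```python
from statistics import mean, pstdev


def coverage_summary(depths, min_depth=1):
    """Summarise per-position read depth over a target region.

    depths: one integer per position (how many reads cover it).
    min_depth: depth from which a position counts as "covered" (assumed default).
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return {
        "mean_coverage": mean(depths),                 # is the depth sufficient?
        "pct_covered": 100.0 * covered / len(depths),  # how much of the target is seen?
        "std_dev": pstdev(depths),                     # how uniform is the depth?
    }


# Hypothetical depths for a 10 bp region with a 2 bp gap.
example_depths = [30, 32, 28, 0, 0, 31, 29, 30, 33, 27]
summary = coverage_summary(example_depths)
print(summary)
```

A high mean with a large standard deviation or a low percentage of covered positions points to uneven coverage, which is the situation the second bullet warns about.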
Concepts
• Variant calling: searching for differences between the reads and the reference genome.
Reference
Experimental Design
RNA-seq / Transcriptomics
o Quantitative
o Descriptive
o Alternative splicing
o miRNA profiling
Resequencing
o Mutation calling
o Profiling
o Genome annotation
De novo sequencing
Copy number variation
ChIP-seq / Epigenomics
o Protein-DNA interactions
o Active transcription factor binding sites
o Histone methylation
Metagenomics
Metatranscriptomics
Exome sequencing
Targeted sequencing
DNA sequencing - 1
• Whole GENOME Resequencing
– Need reference genome
– Variation discovery
DNA sequencing - 2
• Whole GENOME “de novo” sequencing
– Uncharacterized genomes with no reference genome available
– Known genomes where significant structural variation is expected
– Long reads or mate-pair libraries. Sequencing mostly done by Roche 454 and also Illumina
– Assembly of reads is needed: computationally intensive
• E.g. bacterial genome sequencing
DNA sequencing - 3
• Whole EXOME Resequencing
– Need reference genome
• Available for Human and Mouse
– Variation discovery on ORFs
• 2% of the human genome (lower cost)
• 85% of disease mutations are in the exome
– Need probes complementary to exons
• Nimblegen
• Agilent
• E.g. Human exome
DNA sequencing - 4
• Targeted Resequencing
– Capture of specific regions in the genome
• Custom gene panel sequencing
– Allows covering a large number of genes related to a disease
– E.g. Disease gene panel
• Cheaper and quicker than capillary sequencing
• Multiplexing is possible
• Need custom probes complementary to the genomic regions
– Nimblegen
– Agilent
Introduction to NGS Technologies
Transcriptomics - 1
• RNA-Seq
– Sequencing of mRNA
– rRNA-depleted samples
– Very high dynamic range
– No prior knowledge of expressed genes needed
– Gives information (richer than microarrays) about:
• Differential expression of known or unknown transcripts during a treatment or condition
• Isoforms
• New alternative splicing events
• Non-coding RNAs
• Post-transcriptional mutations or editing
• Gene fusions
2 Oct, 2013
Transcriptomics - 1
• RNA-Seq
– Sequencing of mRNA
– Detecting gene fusions
Common Problems
• Signal errors
• Intensity signal errors (454, Ion Torrent)
Common Problems
• Diploid and polyploid genomes:
– Error or heterozygous variant?!
– USE COVERAGE!
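Why coverage helps to tell a sequencing error from a heterozygous variant can be shown with a rough calculation. The sketch below asks how likely it is that sequencing errors alone produce at least the observed number of non-reference reads at one position. It assumes independent errors at a fixed per-base rate (1% here, an illustrative figure, not a platform specification) and ignores mapping errors; the function name is hypothetical and real variant callers use richer models.

```python
from math import comb


def prob_at_least_k_errors(depth, alt_reads, error_rate=0.01):
    """P(X >= alt_reads) for X ~ Binomial(depth, error_rate).

    depth: number of reads covering the position.
    alt_reads: reads showing the non-reference base.
    error_rate: assumed per-base error probability (illustrative value).
    """
    return sum(
        comb(depth, k) * error_rate**k * (1 - error_rate) ** (depth - k)
        for k in range(alt_reads, depth + 1)
    )


# Same 50% non-reference fraction, very different evidence.
low_coverage = prob_at_least_k_errors(depth=4, alt_reads=2)
high_coverage = prob_at_least_k_errors(depth=40, alt_reads=20)
print(f"depth 4:  {low_coverage:.2e}")
print(f"depth 40: {high_coverage:.2e}")
```

With the same allele fraction, the error-only explanation becomes far less plausible as depth grows, which is the point of the slide.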
Common Problems
• Polymerase errors (all platforms except SOLiD)
• Ligase errors (SOLiD)
• Mapping errors
• Variant calling errors
• Human error
Maybe quality control is needed!
Comparison
Roche 454
• Long fragments
• Low throughput
• Expensive
• Poly-nucleotide (homopolymer) errors
• Applications: De novo sequencing, Amplicon sequencing, Metagenomics, RNASeq
Illumina
• Short fragments
• High throughput
• Cheap
• GC bias
• Applications: Resequencing, De novo sequencing, ChipSeq, RNASeq, MethylSeq
SOLiD
• Short fragments
• High throughput
• Cheap
• Color-space
• Applications: Resequencing, ChipSeq, RNASeq, MethylSeq
Common to all NGS platforms: a pipeline & LOTS of DATA
DNA sample -> Library preparation -> NGS instrument (sequencing) -> Data -> Data analysis
NGS is relatively cheap, but think about the question you want to answer, because the analysis will not work magic
Basic steps of NGS data processing
• QC and read cleaning
• Mapping
• DNA binding site
Hands-on
• Download programs:
– FastQC
– Qualimap
• Download files (course web page):
– Fastq file
– Bam file
Quality Control: Raw Data
• Number of input reads
• Base quality
• Read quality
• Biases
• Software: FastQC
Where are we?
Sequence processing
Mapping
Variant calling
Variant annotation
Transcript quantification
DE analysis
miRNA prediction
Fastq format
• The Fastq format “is a fasta with qualities”:
1. Header line (like fasta but starting with “@”)
2. Sequence (string of nucleotides)
3. “+” and sequence ID (optional)
4. Quality values of the sequence, each encoded as a single-byte ASCII character
• File extension: .fastq
• Sequence quality encoding
o Base quality must be encoded in just 1 byte!
o Each base has a corresponding quality value: the quality at position n refers to the base at position n
o Encoding procedure: Error probability -> Phred transformation (inverted scale, integer value) -> ASCII encoding
Example record:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
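The four-line record structure described on this slide can be read with a short generator. This is a minimal sketch that assumes every record occupies exactly four lines (no wrapped sequences); the function name `read_fastq` is chosen for illustration, and for real work an established parser is usually preferable.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            # Quality at position n belongs to the base at position n.
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# The example record from the slide, held in memory instead of a file.
fastq_text = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(io.StringIO(fastq_text)))
for read_id, sequence, quality in records:
    print(read_id, len(sequence), len(quality))
```

To read a file instead, the same function can be given an open text handle, for example `open("reads.fastq")`, where the file name is a placeholder.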
Fastq format. Sequence quality encoding
• Phred + 33
o Sanger [0,40], Illumina 1.8 [0,41], Illumina 1.9 [0,41]
• Phred + 64
o Illumina 1.3 [0,40], Illumina 1.5 [3,40]
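Decoding a quality string means subtracting the offset (33 or 64) from each character's ASCII code. The sketch below shows this, together with a deliberately crude offset guess. Both function names are hypothetical, and the heuristic is an assumption of this example: characters below ASCII 59 cannot occur in the +64 encodings, but a high-quality Phred+33 read may contain none of them, so the guess can be wrong and should be checked over many reads.

```python
def decode_qualities(quality_string, offset=33):
    """Convert an ASCII quality string into a list of integer Phred scores."""
    scores = [ord(char) - offset for char in quality_string]
    if min(scores, default=0) < 0:
        raise ValueError("negative score: the offset is probably wrong")
    return scores


def guess_offset(quality_string):
    """Crude heuristic: any character below ASCII 59 implies Phred+33."""
    return 33 if any(ord(char) < 59 for char in quality_string) else 64


print(decode_qualities("!5I", offset=33))
print(decode_qualities("@Th", offset=64))
print(guess_offset("!''*((((***+))%%%++"))
```

Decoding a Phred+64 file with offset 33 does not fail; it silently inflates every score by 31, which is why the encoding should be established before any quality filtering.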
Prob. of incorrect base call | Phred quality score | Base call accuracy
1 in 10 | 10 | 90%
1 in 100 | 20 | 99%
1 in 1000 | 30 | 99.9%
1 in 10000 | 40 | 99.99%
1 in 100000 | 50 | 99.999%
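The rows of this table follow from the standard Phred definition, Q = -10 · log10(P), where P is the probability that the base call is wrong. A small sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_from_error_prob(error_prob):
    """Phred score for a given probability of an incorrect base call."""
    return round(-10 * math.log10(error_prob))


def error_prob_from_phred(phred):
    """Probability of an incorrect base call for a given Phred score."""
    return 10 ** (-phred / 10)


# Rebuild the table: error probability, Phred score, base call accuracy.
for phred in (10, 20, 30, 40, 50):
    error_prob = error_prob_from_phred(phred)
    print(f"1 in {round(1 / error_prob)} | {phred} | {100 * (1 - error_prob):.3f}%")
```

Because the scale is logarithmic, every 10 points of quality correspond to a tenfold drop in the error probability.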
Quality zones in the plot: good quality, reasonable quality, poor quality
Shows an overview of the range of quality values across all bases at each position in the fastq file
• The central red line is the median value
• The yellow box represents the inter-quartile range (25-75%)
• The upper and lower whiskers represent the 10% and 90% points
• The blue line represents the mean quality
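The statistics drawn in this plot (median, quartiles, 10% and 90% points, mean) can be computed per read position as sketched below. This is an illustration under stated assumptions, not FastQC's implementation: it assumes Phred+33 quality strings, uses simple linear-interpolation percentiles, and the function names are hypothetical. FastQC's exact percentile and binning rules may differ.

```python
def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of a sorted, non-empty list."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def per_position_summary(quality_strings, offset=33):
    """Box-plot statistics of Phred scores for each position in the reads."""
    columns = {}
    for quality in quality_strings:
        for position, char in enumerate(quality):
            columns.setdefault(position, []).append(ord(char) - offset)
    summary = []
    for position in sorted(columns):
        values = sorted(columns[position])
        summary.append({
            "position": position + 1,             # 1-based, as in the plot
            "p10": percentile(values, 0.10),      # lower whisker
            "q1": percentile(values, 0.25),       # bottom of the box
            "median": percentile(values, 0.50),   # central line
            "q3": percentile(values, 0.75),       # top of the box
            "p90": percentile(values, 0.90),      # upper whisker
            "mean": sum(values) / len(values),    # mean line
        })
    return summary


# Three hypothetical reads of different lengths (Phred+33).
stats = per_position_summary(["II5", "I5", "I!"])
for row in stats:
    print(row)
```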
Per base sequence quality
• Good data
• Consistent
• High quality along the read
Per base sequence quality
• Bad data
• High variance
• Quality decreases towards the end of the read
Per sequence quality scores
• Good data
• Most of the reads are high-quality sequences
• Bad data
• Bimodal distribution
• Low-quality reads
Shows whether a subset of sequences has universally low quality values
Per base sequence content
• Good data
• Smooth over the read
• Bad data
• Sequence position bias
• Library contamination (overrepresented sequence)?
Plots, for each base position in the file, the proportion of calls for each of the four normal DNA bases -> little / no difference between the different bases in a random library
The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other
Per base GC content
• Good data
• Smooth over the read
Plots the GC content of each base position in the file -> little / no difference between the different base positions (random library)
The overall GC content should reflect the GC content of the underlying genome
• Bad data
• Sequence position bias
• Library contamination (overrepresented sequence)?
Per sequence GC content
• Good data
• Normal distribution
• Distribution fits the expected one
• Organism dependent
• Bad data
• Distribution does not fit the expected one
• Library contamination?
Measures the GC content across the whole length of each sequence in the file and compares it to a modelled normal distribution of GC content
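The per-read GC content behind this plot can be sketched as follows. The example only builds the observed histogram; comparing it to a modelled normal distribution, as the slide describes, is left out. Function names are hypothetical, and ignoring N calls in the denominator is an assumption of this sketch.

```python
from collections import Counter


def gc_percent(sequence):
    """GC percentage of one read, ignoring N and other non-ACGT symbols."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None  # nothing to measure (for example a read made only of Ns)
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences):
    """Count reads per rounded GC percentage (0..100)."""
    histogram = Counter()
    for sequence in sequences:
        gc = gc_percent(sequence)
        if gc is not None:
            histogram[round(gc)] += 1
    return histogram


# Hypothetical reads.
reads = ["ATGC", "AATT", "GGCC", "ATGC"]
histogram = gc_histogram(reads)
for gc_value in sorted(histogram):
    print(gc_value, histogram[gc_value])
```

A second peak in such a histogram, away from the GC content expected for the organism, is the kind of pattern the slide associates with possible library contamination.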
Per base N content
• Good data
• Bad data
• N calls are biased towards particular base positions
Plots the percentage of base calls at each position for which an N was called
It’s not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence
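The per-position N percentage can be computed with a single pass over the reads. This is a minimal sketch; the function name is hypothetical and reads of different lengths are handled by counting, at each position, only the reads long enough to reach it.

```python
def n_percent_by_position(sequences):
    """Percentage of N calls at each read position (index 0 = first base)."""
    totals = []    # reads that reach each position
    n_counts = []  # N calls at each position
    for sequence in sequences:
        for index, base in enumerate(sequence.upper()):
            if index == len(totals):
                totals.append(0)
                n_counts.append(0)
            totals[index] += 1
            if base == "N":
                n_counts[index] += 1
    return [100.0 * n / total for n, total in zip(n_counts, totals)]


# Hypothetical reads with Ns accumulating towards the 3' end.
n_profile = n_percent_by_position(["ACGN", "ACNN", "AC"])
print(n_profile)
```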
Sequence length distribution
• Some sequencers output reads of different length (for example, Roche 454)
• Some sequencers generate sequence fragments of uniform length
Sequence duplication levels
• Good data
o Low level of duplication
o May indicate a very high level of coverage in the target sequence
• Bad data
o High number of duplicates
o May indicate some kind of enrichment bias (e.g. PCR over-amplification)
• In transcriptomics, a higher number of duplicated sequences is expected
• In genomics, a low number of duplicated sequences is expected
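A simple way to get a feel for duplication levels is to count exact-match duplicates among the reads. The sketch below does only that: it treats two reads as duplicates when their full sequences are identical, which is an assumption of this example. FastQC applies its own sampling and reporting scheme, so its numbers will not match this exactly. The function name is hypothetical.

```python
from collections import Counter


def duplication_summary(sequences):
    """Exact-match duplication statistics for a collection of reads."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequences given")
    distinct = len(counts)
    return {
        "total_reads": total,
        "distinct_sequences": distinct,
        # Reads that are extra copies of a sequence already seen.
        "pct_duplicated": 100.0 * (total - distinct) / total,
        "most_common": counts.most_common(3),
    }


# Hypothetical reads.
dup_stats = duplication_summary(["AAA", "AAA", "AAA", "CCC", "GGG", "GGG"])
print(dup_stats)
```

The `most_common` entry doubles as a first look at overrepresented sequences such as adapters or primers.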
Overrepresented sequences and K-mer content
• Exact same sequences too many times…
• Is that a problem? It depends…
o PCR primers, adapters, …
Typical artifacts
Sequence adapters
Typical artifacts
Platform dependent
Sequence Filtering
• It is important to remove bad-quality data -> our confidence in downstream analysis will improve
Sequence Filtering
• Sequence filtering:
o Mean quality
o Read length
o Read length after trimming
o Percentage of bases above a quality threshold
o Adapter trimming
o Adapter reads
Minimum quality threshold
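Several of the criteria listed above (mean quality, read length after trimming, percentage of bases above a threshold) can be combined in a small filter. This is a sketch under assumptions: Phred+33 encoding, a naive 3' trimming rule, and threshold values chosen only for illustration. The function names are hypothetical, and adapter trimming is not covered; dedicated tools such as those listed on the next slide handle it.

```python
def trim_3prime(sequence, quality, min_quality=20, offset=33):
    """Drop trailing bases whose Phred score is below min_quality."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def passes_filters(quality, min_mean_quality=20, min_length=30,
                   threshold=20, min_fraction_above=0.8, offset=33):
    """Apply length, mean-quality and fraction-above-threshold filters."""
    if not quality or len(quality) < min_length:
        return False
    scores = [ord(char) - offset for char in quality]
    if sum(scores) / len(scores) < min_mean_quality:
        return False
    above = sum(1 for score in scores if score >= threshold)
    return above / len(scores) >= min_fraction_above


# Hypothetical read: good start, poor 3' end ('#' is Q2 in Phred+33).
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII##")
print(trimmed_seq, trimmed_qual)
print(passes_filters(trimmed_qual, min_length=4))
```

Trimming first and filtering on the trimmed read matches the "read length after trimming" criterion: a read can start long enough and still be discarded once its low-quality tail is removed.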
Sequence Filtering
• Sequence filtering tools
o Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
o Galaxy (https://main.g2.bx.psu.edu/)
o SeqTK (https://github.com/lh3/seqtk)
o Cutadapt (https://code.google.com/p/cutadapt/)
o …
Where are we?
Sequence processing
Mapping
Variant calling
Variant annotation
Transcript quantification
DE analysis
miRNA prediction
Quality Control: Mapping Data
• Coverage
• Mapped reads
• Uniformity
• Biases
• Software: BamQC
Qualimap Example
• http://qualimap.bioinfo.cipf.es/samples/HG00096.chrom20_result/qualimapReport.html
Quality Control: Capture Data
• Sensitivity
• Specificity
• Uniformity
• Biases
• Software: ngsCAT
ngsCAT Example
• http://www.bioinfomgp.org/ngscat/documentation/start