A technical and methodological introduction to NGS (data) analysis. biomina Geert Vandeweyer 2015-04-24

Page 1

A technical and methodological introduction

to NGS (data) analysis.

biomina

Geert Vandeweyer 2015-04-24

Page 2

Outline

- Next Generation Sequencing:
  - Technological principles
  - Applications
- NGS Data Description
- Example Application: NGS based Variant Calling
  - From sequence to variant: Analysis flow
  - From variant to knowledge: Interpretation flow
- Final Remarks: Reducing computational complexity

Page 3

The digital code of DNA, Leroy Hood and David Galas Nature 421, 444-448, 23 January 2003

NGS Principles: Sanger


Page 9

NGS Applications: DNA-Seq

Whole Genome Sequencing
• Novel organisms, de novo reference genome

Page 10

NGS Applications: DNA-Seq

Whole Genome Sequencing
• Novel organisms, de novo reference genome
• Structural variation detection

Page 11

NGS Applications: DNA-Seq

“Selective” Sequencing
• Whole exome sequencing

Page 12

NGS Applications: DNA-Seq

“Selective” Sequencing
• Whole exome sequencing
• Gene panel resequencing
  • Candidate genes for disease
  • All genes in pathway
  • ...
  => PCR, Capture, MIPs, ...

Page 13

NGS Applications: DNA-Seq

“Selective” Sequencing
• Whole exome sequencing
• Gene panel resequencing
• ChIP-Seq

Page 14

NGS Applications: DNA-Seq

“Selective” Sequencing
• Whole exome sequencing
• Gene panel resequencing
• ChIP-Seq
• 16S metagenomics

Page 15

NGS Applications: RNA-Seq

Whole Transcriptome Sequencing
• Gene/Transcript variant identification

Page 16

NGS Applications: RNA-Seq

Whole Transcriptome Sequencing
• Gene/Transcript variant identification
• Gene Expression
  • Unbiased detection
  • Highly quantitative

Page 17

NGS Applications: RNA-Seq

“Selective” Sequencing
• Ribosome Profiling


Page 19

NGS Data Description

What kind of data are we working with?
- Sanger Sequencing:
  - 1 amplicon / reaction
  - 1 sequence / amplicon (or 2)
  - Visual inspection for overlapping peaks

Page 20

NGS Data Description

What kind of data are we working with?
- Sanger Sequencing:
  - 1 amplicon / reaction
  - 1 sequence / amplicon (or 2)
  - Visual inspection for overlapping peaks
- Next-Generation Sequencing:
  - Massively parallel sequencing
    - small panel : few hundred target amplicons
    - exome panel: > 200,000 target amplicons

Page 21

NGS Data Description

What kind of data are we working with?
- Sanger Sequencing:
  - 1 amplicon / reaction
  - 1 sequence / amplicon (or 2)
  - Visual inspection for overlapping peaks
- Next-Generation Sequencing:
  - Massively parallel sequencing
    - small panel : few hundred targets
    - exome panel: > 200,000 targets
  - Multiple amplicons / target
    - optimal design: > 40 unique fragments covering every nucleotide in targets.

Page 22

NGS Data Description

What kind of data are we working with?
- Next-Generation Sequencing:
  - Massively parallel sequencing
    - small panel : few hundred targets
    - exome panel: > 200,000 targets
  - Multiple amplicons / target
    - optimal design: > 40 unique fragments covering every nucleotide in targets.

=> Amount of data : > 8,000,000 sequences / sample
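The sequence count on this slide follows from multiplying the two design numbers. A minimal sketch of that arithmetic, assuming one sequence per unique fragment; the function name is illustrative and does not come from the slides.

```python
def min_sequences_per_sample(n_targets: int, fragments_per_target: int) -> int:
    """Lower bound on sequences per sample when every target is
    covered by the requested number of unique fragments."""
    return n_targets * fragments_per_target


# Slide numbers: exome panel with > 200,000 targets, > 40 fragments per target.
print(min_sequences_per_sample(200_000, 40))
```

Both inputs are lower bounds, so the result is a lower bound as well.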

Page 23

NGS Data Description

What kind of data are we working with?
- Data format : FASTQ
- FASTA :
>Sequence_Name
AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
GCTCTAGATCGATCTATACCGAT
- Add Quality (fasta-Q => FASTQ)
@Read_Name
AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
+
BCEECEEFFECGECGECFGFF@?<<=??<>53@##
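To make the four-line FASTQ layout concrete, here is a minimal reader sketch. It assumes the simple case of exactly four lines per read (no wrapped sequences). The names `FastqRecord` and `read_fastq` are illustrative, and the example read is a shortened, made-up record.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per read.

    Reading stops at end of file or at the first empty header line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up record with one quality character per base.
example = io.StringIO("@Read_Name\nAACTACTAGA\n+\nBCEECEEFFE\n")
for record in read_fastq(example):
    print(record.name, record.sequence, record.quality)
```

The length check matters because every base needs exactly one quality character.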

Page 24

NGS Data Description

What kind of data are we working with?
- Data format : FASTQ
@Read_Name
AACTACATACTGATAGTATATCTC
+
BCEECEECGECGECFGFF@?<<

Standard == Sanger Format : Quality = phred + 33, ascii-encoded

Page 25

NGS Data Description

What kind of data are we working with?
- Data format : FASTQ
@Read_Name
AACTACATACTGATAGTATATCTC
+
BCEECEECGECGECFGFF@?<<

Phred Score : correlates with the risk of error (probability that the basecall is wrong)
“High Quality” : P(error) < 0.001, i.e. Q > 30

Page 26

NGS Data Description

What kind of data are we working with?
Quality = phred + 33, ascii-encoded

=> Example:
Value == B
Ascii-decode : 66
Phred : 66 - 33 = 33
Chance of error = 1/10^3.3
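The worked example above can be reproduced in a few lines. This is a small sketch assuming the Sanger (Phred+33) encoding described on the slide; the function names are illustrative.

```python
def phred_from_char(char: str, offset: int = 33) -> int:
    """Decode one ASCII quality character into a Phred score."""
    return ord(char) - offset


def error_probability(phred: int) -> float:
    """Probability that the basecall is wrong: 10^(-Q/10)."""
    return 10 ** (-phred / 10)


q = phred_from_char("B")        # ord("B") is 66, so Q = 66 - 33 = 33
print(q, error_probability(q))  # 1 / 10^3.3, roughly 0.0005
```

With Q = 30 the same formula gives the "high quality" threshold of 0.001.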

Page 27

NGS Data Description

What kind of data are we working with?

- Data format : FASTQ << WARNING >>

@Read_Name
AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
+
BCEECEEFFECGECGECFGFF@?<<=??<>53@##

=> Standard == Sanger Format : Other scales exist !
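One of the other scales is Phred+64, used by older Illumina pipelines. The sketch below is a rough heuristic for telling the two offsets apart from the characters that occur in the quality strings. The thresholds (59 and 74) are assumptions based on the commonly quoted character ranges of these encodings, and the function name is illustrative. When in doubt, check the documentation of the sequencing pipeline instead.

```python
from typing import Iterable, Optional


def guess_phred_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Returns None when the observed characters fit both encodings.
    Assumes at least one non-empty quality string is supplied.
    """
    codes = [ord(char) for quality in quality_strings for char in quality]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' are not expected in the +64 encodings.
        return 33
    if highest > 74:
        # Characters above 'J' would mean Q > 41 under Phred+33 (assumed rare).
        return 64
    return None


print(guess_phred_offset(["BCEECEEFFECGECGECFGFF@?<<=??<>53@##"]))
```

The slide's example string contains `#`, which only fits the Phred+33 scale.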


Page 29

NGS Based Variant Calling
From sequence to variant: Analysis flow

Pre-Processing
- Adapter Trimming: Remove artificial sequences
- Quality-Trimming: Remove low quality sequences to improve specificity
- Generate QC Reports: Visual inspection of main quality parameters

Sequence – To – Variant
- Read Mapping: Place reads on the reference genome (BWA)
- Optimize Mapping: Remove duplicate reads (picard), recalibrate base quality scores (GATK), realign around indels (GATK)
- Call and Annotate Variants: Call variants (GATK) and annotate using ANNOVAR and snpEff (VariantDB)

Page 30

Adapter Trimming Pre - Processing

NGS Based Variant Calling From sequence to variant: Analysis flow

[Figure: read layout showing Sequence Read 1, Sequence Read 2 and the Sequence Barcode]

Scan all reads for presence of artificial sequence & remove it from the reads.
Note: Adapters are present when length(targeted fragment) < read_length
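The scan-and-remove idea can be sketched as follows. This simplified version uses exact string matching only, whereas dedicated trimming tools also tolerate sequencing errors. The function name, the `min_overlap` parameter and the adapter sequence in the example are illustrative assumptions. Use the adapter sequence of the actual library kit.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an adapter, and everything after it, from the 3' end of a read."""
    # Case 1: the complete adapter occurs somewhere in the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends in the first few bases of the adapter.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"  # example sequence only
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER))  # full adapter inside read
print(trim_adapter("ACGTACGTAGATC", ADAPTER))             # partial adapter at 3' end
```

A small `min_overlap` removes more adapter remnants but also trims more genuine bases that match the adapter start by chance.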

Page 31

Pre - Processing

NGS Based Variant Calling From sequence to variant: Analysis flow

Low quality leads to high error rates (cf. Phred Score)
=> Due to chemical degradation, 3’ ends have a lower quality
=> We want a limit of 1 error in 1000 positions
=> Trim everything on 3’ end with quality < 30
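A minimal sketch of this 3' trimming rule, assuming Phred+33 encoded qualities. It removes trailing bases one at a time until it reaches a base with quality of at least 30. Real trimmers often use sliding windows or running sums instead, so treat this as an illustration of the principle. The function name is hypothetical.

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str,
                min_phred: int = 30, offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end while their Phred score is below min_phred."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' = Q40, '?' = Q30, '5' = Q20, '#' = Q2 under Phred+33.
print(trim_3prime("ACGTACGT", "IIIII?5#"))
```

Low-quality bases inside the read are left untouched, because only the 3' end is inspected.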

Quality-Trimming

Page 32

Pre - Processing

NGS Based Variant Calling From sequence to variant: Analysis flow

Quality should improve after trimming

Generate QC Reports

Page 33

Pre - Processing

NGS Based Variant Calling From sequence to variant: Analysis flow

Base composition should be about 25% each for G, C, T and A

Generate QC Reports

[Figure: base composition plots, Good run vs. Failed Run]
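The base composition check can be approximated with a per-position count. This sketch assumes a non-empty list of reads. The function name is illustrative, and QC tools report the same kind of numbers as a plot.

```python
from collections import Counter
from typing import Dict, List


def base_composition(reads: List[str]) -> List[Dict[str, float]]:
    """Fraction of A, C, G and T at every read position.

    Other symbols such as N count towards the total, so the four
    fractions can sum to less than 1.
    """
    length = max(len(read) for read in reads)
    composition = []
    for index in range(length):
        counts = Counter(read[index] for read in reads if index < len(read))
        total = sum(counts.values())
        composition.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return composition


for position, fractions in enumerate(base_composition(["ACGT", "AGGT", "ACCT", "TCGT"])):
    print(position, fractions)
```

Large deviations from the expected fractions at particular positions hint at adapter remnants or a failed run.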

Page 34

Read Mapping Sequence – To – Variant

NGS Data analysis From sequence to variant: Analysis flow

Burrows-Wheeler Transformation:
- Highly efficient method to scan string for substring matches
- Principle: Build Prefix Trie, scan top-down using reverse search.
  1. Permute String (generate all cyclic rotations)
  2. Sort Permuted Strings
  3. Last Column = Burrows Wheeler Transformation of string.
  4. Build prefix trie from BWT
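Steps 1 to 3 can be written down directly, reading "permute" as "take all cyclic rotations". For the search step, this sketch does not build an explicit prefix trie. It counts matches with backward search over the BWT (the FM-index idea), which is a swapped-in simplification rather than the slide's exact construction. A `$` terminator is appended and assumed not to occur in the input. Function names are illustrative, and the naive `occ` helper is slow. Real mappers precompute these counts.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted cyclic rotations."""
    text += "$"  # terminator, sorts before A/C/G/T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(bwt_string: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text using backward search."""
    counts = Counter(bwt_string)
    first_row = {}  # first row (in sorted rotations) starting with each character
    total = 0
    for char in sorted(counts):
        first_row[char] = total
        total += counts[char]

    def occ(char: str, end: int) -> int:
        # Occurrences of char in bwt_string[:end].
        return bwt_string[:end].count(char)

    low, high = 0, len(bwt_string)  # half-open interval of matching rows
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        low = first_row[char] + occ(char, low)
        high = first_row[char] + occ(char, high)
        if low >= high:
            return 0
    return high - low


transformed = bwt("banana")
print(transformed, count_matches(transformed, "ana"))
```

The pattern is processed from its last character to its first, which is the "reverse search" mentioned on the slide.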

Page 35

Pre - Processing

NGS Based Variant Calling From sequence to variant: Analysis flow

Generate QC Reports Insert Size

Page 36

Pre - Processing

NGS Based Variant Calling From sequence to variant: Analysis flow

Generate QC Reports Capture Efficiency


Page 38

Optimize Mapping Sequence – To – Variant

NGS Based Variant Calling From sequence to variant: Analysis flow

- Remove Duplicate reads (picard)
  => Reduce computational time
  => Reduce amplification bias
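The principle behind duplicate removal can be illustrated with a simplified sketch: reads that map to the same position and strand are treated as copies of one fragment, and only the best one is kept. This is not Picard's actual algorithm, which works on alignment files and handles read pairs. The `AlignedRead` fields and the function name are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class AlignedRead(NamedTuple):
    name: str
    chrom: str
    position: int       # mapped start position
    strand: str         # "+" or "-"
    mean_quality: float


def remove_duplicates(reads: List[AlignedRead]) -> List[AlignedRead]:
    """Keep one read, the one with the highest mean quality, per (chrom, position, strand)."""
    best: Dict[Tuple[str, int, str], AlignedRead] = {}
    for read in reads:
        key = (read.chrom, read.position, read.strand)
        if key not in best or read.mean_quality > best[key].mean_quality:
            best[key] = read
    return list(best.values())


reads = [
    AlignedRead("r1", "chr1", 100, "+", 31.0),
    AlignedRead("r2", "chr1", 100, "+", 36.5),  # duplicate of r1, better quality
    AlignedRead("r3", "chr1", 100, "-", 30.0),  # other strand, not a duplicate
]
print([read.name for read in remove_duplicates(reads)])
```

Collapsing PCR copies this way keeps a single amplified fragment from being counted as independent evidence for a variant.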

Page 39

Optimize Mapping Sequence – To – Variant

NGS Based Variant Calling From sequence to variant: Analysis flow

- Realign around indels (GATK)
  => InDels are hard to align
  => P(>1 SNPs) < P(1 indel)

If, at a certain locus, both an InDel AND multiple SNPs are found:
  => Replace SNPs by one InDel
  => Reduction of false positives

Page 40

Call And Annotate Variants Sequence – To – Variant

NGS Based Variant Calling From sequence to variant: Analysis flow

- Call Variants (GATK)
  - Search for positions with statistically significant evidence for a non-reference nucleotide
  - Take into account: base-quality, position in read, strand bias, ...
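To give a feeling for "statistically significant evidence", the toy test below asks how likely the observed number of non-reference bases would be if all of them were sequencing errors. It is a plain binomial tail probability with a fixed error rate. GATK uses a much richer model that includes the factors listed above, so this is only an illustration. The function name and the default error rate of 0.001 (Q30) are assumptions.

```python
from math import comb


def p_value_sequencing_error(depth: int, alt_count: int,
                             error_rate: float = 0.001) -> float:
    """P(at least alt_count non-reference bases among depth reads) if
    every non-reference base is an independent sequencing error."""
    if not 0 <= alt_count <= depth:
        raise ValueError("alt_count must lie between 0 and depth")
    return sum(
        comb(depth, k) * error_rate ** k * (1 - error_rate) ** (depth - k)
        for k in range(alt_count, depth + 1)
    )


# 40 reads cover the position: one non-reference base vs. twenty.
print(p_value_sequencing_error(40, 1))
print(p_value_sequencing_error(40, 20))
```

A single deviating read is compatible with sequencing error, while twenty out of forty is not, which is the pattern expected for a heterozygous variant.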

Page 41

Call And Annotate Variants Sequence – To – Variant

NGS Based Variant Calling From sequence to variant: Analysis flow

Annotate Variants (ANNOVAR, snpEff, ...)
- Add information to the variant to ease interpretation
  - Effect on Gene transcription (RefSeq, Ensembl, UCSC)
  - Quality parameters (GATK)
  - Occurrence in control populations (dbSNP, ESP, HapMap, 1KG, ...)
  - Known pathogenic variations (dbSNP, OMIM, ...)
  - Effect on gene function (PolyPhen, MutationTaster, Sift, ...)
  - ...


Page 43

NGS Based Variant Calling From variant to knowledge: Interpretation flow

Step 1 : Quality Filtering: GATK Variant Recalibration

“The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact.”

“The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.”

Page 44

NGS Based Variant Calling From variant to knowledge: Interpretation flow

Step 1 : Quality Filtering: GATK Variant Recalibration
Train model on known variants (both positive and negative)

Page 45

NGS Based Variant Calling From variant to knowledge: Interpretation flow

Step 1 : Quality Filtering: GATK Variant Recalibration
Apply model to experimental data

Page 46

NGS Based Variant Calling From variant to knowledge: Interpretation flow

Step 2 : Select an inheritance model

- De Novo: Variant not present in either parent
- Dominant: Variant present in affected parent
- Recessive: Variant homozygous in patient, heterozygous in both parents
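The three inheritance models translate into simple genotype filters for a patient and both parents. In this sketch genotypes are encoded as the number of variant alleles (0, 1 or 2). That encoding, the function name and the `affected_parent` argument are assumptions made for illustration.

```python
from typing import Optional


def matches_model(model: str, patient: int, father: int, mother: int,
                  affected_parent: Optional[str] = None) -> bool:
    """Check whether trio genotypes (variant allele counts) fit an inheritance model."""
    if model == "de_novo":
        # Variant in patient, not present in either parent.
        return patient > 0 and father == 0 and mother == 0
    if model == "dominant":
        # Variant in patient and in the affected parent.
        if affected_parent not in ("father", "mother"):
            raise ValueError("dominant model needs affected_parent='father' or 'mother'")
        parent = father if affected_parent == "father" else mother
        return patient > 0 and parent > 0
    if model == "recessive":
        # Homozygous in patient, heterozygous in both parents.
        return patient == 2 and father == 1 and mother == 1
    raise ValueError(f"unknown model: {model!r}")


print(matches_model("de_novo", patient=1, father=0, mother=0))
print(matches_model("recessive", patient=2, father=1, mother=1))
```

Applying one such filter to every variant of a trio removes most candidates before any functional assessment.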

Page 47

NGS Based Variant Calling From variant to knowledge: Interpretation flow

Step 3 : Effect on gene function (~ from high to low)

- Variant causes gain/loss of stop/start codon?
- Variant causes aberrant splicing of the transcript?
- Variant replaces a highly conserved nucleotide/amino acid?
- Variant replaces an amino acid, and is not reported in control populations?
- Variant can modify binding of regulatory elements?
- ...

Extended annotation is critical

Manual inspection of > 20,000 variants/sample is impossible: automation is needed


Page 49

Final Remarks Reducing Computational complexity: Web-Tools

Sequence-to-Variants: Galaxy
- A website offering an easy way to run complete pipelines
- No programming skills needed, very useful for dynamic analysis
(http://www.usegalaxy.org)

Page 50

Final Remarks

Sequence-to-Variants: Galaxy
- A website offering an easy way to run complete pipelines
- No programming skills needed, very useful for dynamic analysis
- Support for almost all types of analysis
  - Variant Calling
  - RNA seq : Expression / transcript identification
  - MetaGenomics
  - ChIP-seq
- Many organisms available by default (on main servers)
- New organisms can be added on request (on Biomina Servers)

Public Server: http://www.usegalaxy.org
Biomina Server: http://www.biomina.be/apps/galaxy

Reducing Computational complexity: Web-Tools

Page 51

Final Remarks

Variant Interpretation: VariantDB
- Extensive annotation
- Flexible filtering options
- Automatic updates
- Multiple output formats:
  - online (tabular)
  - offline (CSV)
  - API (JSON)

Reducing Computational complexity: Web-Tools

Page 52

Final Remarks Reducing Computational complexity: Future ?

Page 53

Final Remarks

Future NGS assays will be:
- Real-Time
- On-Site
- Low-Cost
- ...

Reducing Computational complexity: Future ?

Page 54

biomina