Data analysis pipelines for NGS applications

Data analysis pipelines for NGS applications. Sergi Beltran Agulló. VHIR-CNAG Course, 11th February 2015

Page 1: Data analysis pipelines for NGS applications

Data analysis pipelines for NGS applications

Sergi Beltran Agulló

VHIR-CNAG Course, 11th February 2015

Page 2: Data analysis pipelines for NGS applications

BIOREPOSITORY LABORATORY SEQUENCING

QC ANALYSIS TRANSFER

LIMS

Full Traceability of CNAG’s Workflow (ISO 9001)

Page 3: Data analysis pipelines for NGS applications

[Figure: sequencing library construct labelled P7, P5, Index, SP read 1, SP read 2, DNA insert]

WG_BS_Seq

WG_Seq

mRNA Seq

smallRNA_Seq

Target capture

ChIP_Seq

…and many more

Sequencing Platform (M. Gut)

Page 4: Data analysis pipelines for NGS applications

cBot: automatic reagent dispenser

Flow cell: glass slide with a lawn of oligonucleotides

Sequencing Platform (M. Gut)

Page 5: Data analysis pipelines for NGS applications

Flow cell: glass slide with a lawn of oligonucleotides and sequencing library

HiSeq2000 – the sequencer

Sequencing Platform (M. Gut)

Page 6: Data analysis pipelines for NGS applications

Sequencing-by-synthesis (SBS)

[Figure: sequencing-by-synthesis diagram of a template strand and the growing complementary strand]

First base incorporated

Cycle 1: Add sequencing reagents

Remove unincorporated bases

Detect signal

Cycle 2-n: Add sequencing reagents and repeat

All four labelled nucleotides in one reaction

High accuracy

Base-by-base sequencing

No problems with homopolymer repeats


Sequencing Platform (M. Gut)

Page 7: Data analysis pipelines for NGS applications

100 microns

Page 8: Data analysis pipelines for NGS applications

Sequencing Output: FASTQ files

- Developed by the Wellcome Trust Sanger Institute

@SEQ_ID

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

+

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

- Usually, each sequence (read) takes 4 lines: identifier, bases, a "+" separator, and one quality character per base

- Sequence identifiers, description and quality encoding can be different
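
The four-line layout above can be read with a few lines of Python. This is a minimal sketch, not CNAG's tooling: the names `FastqRecord`, `parse_fastq` and `phred_scores` are made up for illustration, and the default offset of 33 assumes Sanger/Phred+33 quality encoding, which (as the slide notes) is not universal.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, qual)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string; offset=33 is an assumption (Phred+33)."""
    return [ord(ch) - offset for ch in quality]


# The example record shown on the slide.
record_text = """@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
"""
records = list(parse_fastq(record_text.splitlines()))
print(records[0].identifier, phred_scores(records[0].quality)[:5])
```

Files that wrap sequences over several lines would need a more tolerant parser than this one.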

Page 9: Data analysis pipelines for NGS applications

SEQUENCING

CNAG-LIMS

MAPPING QUALITY CTRL

MAP/BAM

VARIANT CALLING

RNA-Seq QUANTIFICATION

ASSEMBLY & METAGENOMICS

BISULFITE

SHAPE-BASED 3D MODELLING OF RNA SEQUENCES

Hi-C TO 3D MODELS OF GENOMIC DOMAINS AND GENOMES

STRUCTURAL VARIANTS

FASTQ

Data Analysis Pipelines at CNAG

Page 10: Data analysis pipelines for NGS applications

Data Analysis Pipelines at CNAG

Page 11: Data analysis pipelines for NGS applications

Assembly means aligning and merging fragments of DNA in order to reconstruct the original sequence. This is needed because DNA sequencing technology cannot read whole genomes in one go, only small pieces of between 20 and 1000 bases. Typically the short fragments, called reads, result from shotgun sequencing of genomic DNA or gene transcripts (ESTs). (adapted from Wikipedia)

Adapted from Li-Jun Ma; Natalie D. Fedorova. Mycology, pages 9 - 24

Assembly: definition
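
To make "aligning and merging fragments" concrete, here is a toy greedy overlap merger. It is only a sketch of the idea under strong assumptions (error-free reads, one strand, no repeats) and is not the CNAG de novo assembly pipeline; the function names and the example reads are hypothetical.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter than min_len."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads sampled from the sequence ATGGCGTACCTGA.
example_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(example_reads))
```

Real assemblers also have to cope with sequencing errors, both strands, repeats and millions of reads, which this sketch ignores.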

Page 12: Data analysis pipelines for NGS applications

CNAG de novo assembly pipeline

Assembly and Annotation Team (T. Alioto)

Removed

Page 13: Data analysis pipelines for NGS applications

CNAG genome projects

Assembly and Annotation Team (T. Alioto)

Removed

Page 14: Data analysis pipelines for NGS applications

Assembly: Metagenomics

http://wiki.biomine.skelleftea.se

Page 15: Data analysis pipelines for NGS applications

Clinical Applications: Human Microbiome Project

Page 16: Data analysis pipelines for NGS applications

Data Analysis Pipelines at CNAG

Page 17: Data analysis pipelines for NGS applications

Mapping to reference genome

Adapted from Wikipedia

100bp read 100bp read

Page 18: Data analysis pipelines for NGS applications

Adapted from Wikipedia

100bp read 100bp read

Mapping to reference genome

Page 19: Data analysis pipelines for NGS applications

Adapted from Wikipedia

100bp read 100bp read 100bp read 100bp read

Mapping to reference genome
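
The idea of placing reads on a reference can be illustrated with a toy seed-and-verify mapper. This is a sketch under simplifying assumptions (ungapped alignment, forward strand only, the first k bases must match exactly) and does not represent the aligner used at CNAG; the function names, the reference string and the reads are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to its 0-based start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_mismatches: int = 1) -> List[int]:
    """Return 0-based positions where the read aligns without gaps."""
    hits = []
    for pos in index.get(read[:k], []):  # seed: exact match of the first k bases
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


# Hypothetical reference and reads.
reference = "ACGTTAGCCATGGATCCGTA"
index = build_kmer_index(reference, k=4)
print(map_read("AGCCATGG", reference, index, k=4))  # exact match
print(map_read("AGCCATGC", reference, index, k=4))  # one mismatch at the end
```

A read with a mismatch inside its first k bases is missed by this sketch; production aligners use several seeds and gapped alignment to avoid that.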

Page 20: Data analysis pipelines for NGS applications

Mapping: Exome sequence example

Alignments are stored in a BAM file, which is the binary version of the SAM (Sequence Alignment/Map) format
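
Since BAM is the binary form of SAM, the text form is the easiest way to see what an alignment record holds. The sketch below splits one SAM line into its 11 mandatory tab-separated fields; the names `SamRecord`, `parse_sam_line` and `is_reverse_strand` and the example line are made up for illustration. In practice BAM files are read with dedicated tools rather than by hand.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment operations
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read bases
    qual: str    # base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of a SAM alignment line (optional tags are ignored)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]), fields[9], fields[10],
    )


def is_reverse_strand(record: SamRecord) -> bool:
    """Bit 0x10 of FLAG marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


# Hypothetical alignment line, for illustration only.
example_line = "read1\t99\tchr1\t1000\t60\t8M\t=\t1200\t208\tAGCCATGG\tIIIIIIII"
sam_record = parse_sam_line(example_line)
print(sam_record.rname, sam_record.pos, sam_record.cigar, is_reverse_strand(sam_record))
```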

Page 21: Data analysis pipelines for NGS applications

Exome sequencing metrics

Removed

Page 22: Data analysis pipelines for NGS applications

Variant Calling

Identification of genetic differences in comparison to a reference (strict definition)

- Designs: Pedigree, trio, group, somatic

Removed

Page 23: Data analysis pipelines for NGS applications

CNAG’s Variant Calling Pipeline

J. Camps, S. Derdak, S. Laurie, E. Serra, R. Tonda, JR Trotta, S Beltran

Removed

Page 24: Data analysis pipelines for NGS applications

CNAG’s Variant Calling Pipeline: Sensitive and Precise

- NA12878 50x Whole Genome FASTQs from Illumina Platinum Genomes (http://www.illumina.com/platinumgenomes/) analyzed with the pipeline

- Results compared independently for SNPs and INDELs against the NIST reference set: Zook et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014 Mar;32(3):246-51.

- Results (on callable region):

S. Derdak, A. Kanterakis, S. Laurie, E. Serra, R. Tonda, S Beltran

Removed
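
"Sensitive and precise" refers to two standard metrics obtained by comparing a call set with a reference set. The sketch below shows the arithmetic, computed separately for SNPs and INDELs as on the slide. It is not the actual benchmarking procedure: the variants are invented, the helper names are hypothetical, and real comparisons also need variant normalisation and restriction to the callable region.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def is_snp(variant: Variant) -> bool:
    """Treat single-base ref and alt alleles as a SNP; everything else as an INDEL."""
    return len(variant[2]) == 1 and len(variant[3]) == 1


def benchmark(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)."""
    tp = len(calls & truth)
    fp = len(calls - truth)
    fn = len(truth - calls)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


def benchmark_by_type(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, Dict[str, float]]:
    """Evaluate SNPs and INDELs independently."""
    return {
        "SNP": benchmark({v for v in calls if is_snp(v)}, {v for v in truth if is_snp(v)}),
        "INDEL": benchmark({v for v in calls if not is_snp(v)}, {v for v in truth if not is_snp(v)}),
    }


# Invented variants, for illustration only.
truth_set = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A"),
             ("chr1", 300, "AT", "A"), ("chr2", 80, "C", "CGG")}
call_set = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 60, "T", "C"),
            ("chr1", 300, "AT", "A")}
metrics = benchmark_by_type(call_set, truth_set)
print(metrics)
```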

Page 25: Data analysis pipelines for NGS applications

Results: PDF Report

Removed

Page 26: Data analysis pipelines for NGS applications

Results: Relevant Fields in gVCF and Excel

General (6): Chrom, Pos, Ref, Alt, RS, GMAF

Call Specific (5): Genotype, Genotype Quality, Depth, GT Probability Likelihood, Strand Bias

Functional Annotation (12): Gene Name, Coding/Non-coding, Transcript Biotype, Variant Effect, Variant Effect Impact, Functional Class, Codon Change, Amino Acid affected, Transcript ID, Transcript Length, Exon rank in transcript

Effect Prediction (12): Sift Prediction & Score, Polyphen2 HDIV Prediction and score, Polyphen2 HVAR Prediction and score, Mutation Taster Prediction and score, Phylop Score, Gerp++ Score, SiPhy 29 Mammal Score, CADD Score

Control Populations (6): ESP6500 European-Americans, ESP6500 African-Americans, 1000GP-phase 1 Europeans, 1000GP-phase 1 Africans, 1000GP-phase 1 Asians, ExAC

Page 27: Data analysis pipelines for NGS applications

Results: Secondary data analysis

[Plot: inter-mutational distance by mutation number (all chromosomes)]

[Plot: normalized copy number by chromosomal position (chr 6)]
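
The inter-mutational distance named on this slide is the gap between each mutation and the previous one on the same chromosome. A minimal sketch of that calculation follows; the function name and the positions are hypothetical, and this is not necessarily how the plotted values were produced.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def intermutation_distances(mutations: List[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Return (chrom, pos, distance to the previous mutation on that chromosome)."""
    by_chrom: Dict[str, List[int]] = defaultdict(list)
    for chrom, pos in mutations:
        by_chrom[chrom].append(pos)
    distances = []
    for chrom, positions in by_chrom.items():
        ordered = sorted(positions)
        # The first mutation on a chromosome has no predecessor and is skipped.
        for previous, current in zip(ordered, ordered[1:]):
            distances.append((chrom, current, current - previous))
    return distances


# Invented positions, for illustration only.
example_mutations = [("chr6", 100), ("chr6", 1150), ("chr6", 150), ("chr1", 10)]
print(intermutation_distances(example_mutations))
```

Plotting these distances on a logarithmic axis against the mutation index makes clusters of closely spaced mutations stand out as dips.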

Page 28: Data analysis pipelines for NGS applications

Removed

Page 29: Data analysis pipelines for NGS applications

Examples in Rare Diseases

Page 30: Data analysis pipelines for NGS applications


RD-Connect: Integration and Sharing

WP1: Coordination

WP2: Patient registries

WP3: Biobanks

WP4: Bioinformatics

WP5: Unified platform

Hanns Lochmüller (Newcastle and TREAT-NMD)

Domenica Taruscio (ISS and EPIRARE)

Lucia Monaco (Fondaz. Telethon & EuroBioBank)

WP6: Ethical/legal/social

Ivo Gut (CNAG Barcelona)

Christophe Béroud (INSERM Marseille)

WP7: Impact/Innovation

Mats Hansson (Uppsala)

Kate Bushby (Newcastle and EUCERD/EJARD)

Page 31: Data analysis pipelines for NGS applications

Removed

Page 32: Data analysis pipelines for NGS applications

Data Analysis Pipelines at CNAG

Page 33: Data analysis pipelines for NGS applications

Microarrays RNA-seq

Nature Methods 8, 469-477 (2011)

RNA-Seq Differential Expression

Page 34: Data analysis pipelines for NGS applications

RNA-Seq Analysis Pipeline

A. Esteve, S. Heath

Page 35: Data analysis pipelines for NGS applications

A. Esteve, S. Heath

Page 36: Data analysis pipelines for NGS applications

Summary

- NGS has multiple applications, usually with higher precision than microarrays.

- NGS has direct clinical applicability.

- Sequencing can greatly speed up research and diagnostics.

- Analysis is far from standardized, but results are already very accurate.

- CNAG offers full collaborations (from experiment design to user-friendly analysed results).

Page 37: Data analysis pipelines for NGS applications

www.cnag.eu

3rd CNAG Symposium

Feb 26th, 2015


ISO 9001:2008