Variant visualisation and quality control€¦ · the form oflibraries for different programming...

31
Variant visualisation and quality control You really should be making plots! 25/06/14 Paul Theodor Pyl 1

Transcript of Variant visualisation and quality control€¦ · the form oflibraries for different programming...

Page 1: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Variant visualisation and quality control

You really should be making plots!

25/06/14 Paul Theodor Pyl 1

Page 2: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Classical Sequencing Example

DNA .BAM .VCF

A single sample sequencing run

25/06/14 Paul Theodor Pyl 2

Aligner Variant Caller

Page 3: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Comparative Sequencing Example

CaseDNA

.BAM .VCF

Control DNA

.BAM .VCF

A comparative genomics example

Comparisons

25/06/14 Paul Theodor Pyl 3

Page 4: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Scary Sequencing Example

What to do with 1000 .BAM files (~200GB each)

DNA (many Samples) .BAM   .VCF  

.BAM

.BAM

.BAM

.BAM

.BAM

.BAM

.BAM

.BAM

.BAM

.BAM

.BAM

.BAM

25/06/14 Paul Theodor Pyl 4

Page 5: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

How are we dealing with this?

25/06/14 Paul Theodor Pyl 5

Page 6: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

SNV Calling

•  Pretty well established for diploid monoclonal populations (i.e. non-cancer human sample); e.g. GATK; samtools

•  Can be more problematic in interesting samples, e.g. cancer: – Math might make unreasonable assumptions

(copy number, clonality, etc. …)

•  Specialised tools exists: e.g. MuTect

25/06/14 Paul Theodor Pyl 6

Page 7: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Example VCF

25/06/14 Paul Theodor Pyl 7

#CHROM POS ID REF ALT QUALFILTER INFO Format Control Tumour1 113988213rs… C A 65 PASS GMAF=0.02AD:DP:GT 0:42:0/0 24:38:0/12 101733683- G C 60 PASS GMAF=0.3 AD:DP:GT 0:18:0/0 5:14:0/1…

Page 8: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Visualisation is Key

25/06/14 Paul Theodor Pyl 8

Page 9: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

What the VCF file will tell you

25/06/14 Paul Theodor Pyl 9

Page 10: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

What the VCF file won’t tell you

25/06/14 Paul Theodor Pyl 10

Page 11: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

What the VCF file won’t tell you

25/06/14 Paul Theodor Pyl 11

Page 12: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Example VCF revisited

25/06/14 Paul Theodor Pyl 12

#CHROM POS ID REF ALT QUALFILTER INFO Format Control Tumour1 113988213rs… C A 65 PASS GMAF=0.02AD:DP:GT 0:42:0/0 24:38:0/12 101733683- G C 60 PASS GMAF=0.3 AD:DP:GT 0:18:0/0 5:14:0/1…

Page 13: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Example: CDC27

25/06/14 Paul Theodor Pyl 13

Page 14: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

25/06/14 Paul Theodor Pyl 14

Page 15: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Half-time Summary

•  Visualisation gives context •  A list of positons (i.e. VCF file) is likely to

miss out on some of that •  Good to know: – Regions that are always hard (e.g. CDC27)

– Regions that show specific artifacts from library prep / sequencer (if the samples are all processed the same)

25/06/14 Paul Theodor Pyl 15

Page 16: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Important post-processing Steps (Alignments)

•  After alignment: – Remove duplicates –  InDel realignment (GATK)

25/06/14 Paul Theodor Pyl 16

 ATTAC-­‐-­‐ACAC      TTACACAC  GATTACTTACAC  

 ATTAC-­‐-­‐ACAC      TTAC-­‐-­‐ACAC  GATTACTTACAC  

 ATTACACAC      TTACACAC  GATTACTTACAC  

Page 17: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Important post-processing Steps (Variants)

•  Ensembl Variant Effect Predictor – R package: ensemblVEP (or use command line

tool) •  Location / overlapping genes etc. •  GMAF (1000genomes or HapMap) •  SIFT / PolyPhen Scores

•  Annotate with available data –  e.g. mismatch rates in other samples of the same

cohort –  Local mismatch rate within a sample (e.g.

genomic distance to the next 10 mismatches)

25/06/14 Paul Theodor Pyl 17

Page 18: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Visualisation Tools

•  Genome Browser, e.g. IGV – Programmatic access? (IGV can be scripted)

•  h5vc R/Bioconductor package – Processing BAM files into nucleotide tallies

– Analysing and visualising on those – Shamelessly advertising my own software J

25/06/14 Paul Theodor Pyl 18

Page 19: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

+   +   G   T  -­‐   -­‐  

A   C   G   T  

Nucleotide Tallies

•  Table of (mis)matches, coverages, deletions, insertions, softclips, …

25/06/14 Paul Theodor Pyl 19

A   C   G   T  A   C   G   T  

[…]

Gen

om

ic P

osi

tio

n

Sample One

+   +   G   T  -­‐   -­‐  

A   C   G   T  

A   C   G   T  A   C   G   T  

[…]

Sample Two

+   G  -­‐  

C   G  

C   G  S1   S2  

[…]

Coverage

Page 20: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Genomics Analyses on Tallies

•  Having the data as a matrix: – Easy subsetting (e.g. selecting all controls) – Easy building of summary statistics

•  applying functions to the matrix •  E.g. summarise control samples

•  Many Analyses, especially variant calling and visualisation, can be performed on tallies (we don’t need the BAM’s for it)

25/06/14 Paul Theodor Pyl 20

Page 21: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

G   T  

A   C   G   T  

A   C   G   T  

G   T  

A   C   G   T  

A   C   G   T  

G   T  

A   C   G   T  

A   C   G   T  A   C   G   T  

Genomics Analyses on Tallies

25/06/14 Paul Theodor Pyl 21

[…]

Gen

om

ic P

osi

tio

n

Control 1 Control 2

Case 1 Case 2

Population level statistics apply(a, c(1,3), sum)

Background of mismatches in control samples apply(a[,3:4,], c(1,3), mean)

Pairwise Comparisons a[,c(2,4),]

Page 22: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

What is HDF5

•  Hierarchical Data Format – Efficient storage of numerical data

•  Two kinds of objects – Groups – Folders

– Datasets – Files

25/06/14 Paul Theodor Pyl 22

.BAM  

150GB  

HDF5

6.3 GB  

Page 23: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

HDF5 – A brief overview

25/06/14 Paul Theodor Pyl 23

•  Introduced in 1987 –  National Center for Supercomputing

•  Maintained by the HDFGroup

•  Production Use – NASA –  Imaging – The Lord of the Rings

Page 24: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

What to store in our HDF5 file

•  4 data-sets per Chromosome •  Counts –  4D : [bases x samples x strands x positions]

•  Coverages –  3D : [samples x strands x positions]

•  Deletions –  3D : [samples x strands x positions]

•  Reference –  1D : [positions]

25/06/14 Paul Theodor Pyl 24

Page 25: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

The ‘h5vc’ package

•  Available in R/Bioconductor •  Functionality: – creating / interacting with HDF5 tally files – Variant calling

– Data exploration – Plotting –  ...

25/06/14 Paul Theodor Pyl 25

2014, pages 1–3BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btu026

Genome analysis Advance Access publication January 21, 2014

h5vc: scalable nucleotide tallies with HDF5Paul Theodor Pyl*, Julian Gehring, Bernd Fischer and Wolfgang Huber*EMBL Heidelberg, Genome Biology Unit, Meyerhofstr. 1, 69117 Heidelberg, GermanyAssociate Editor: John Hancock

ABSTRACT

Summary: As applications of genome sequencing, including exomes

and whole genomes, are expanding, there is a need for analysis tools

that are scalable to large sets of samples and/or ultra-deep coverage.

Many current tool chains are based on the widely used file formats

BAM and VCF or VCF-derivatives. However, for some desirable ana-

lyses, data management with these formats creates substantial imple-

mentation overhead, and much time is spent parsing files and collating

data. We observe that a tally data structure, i.e. the table of counts of

nucleotides ! samples! strands!genomic positions, provides a rea-

sonable intermediate level of abstraction for many genomics analyses,

including single nucleotide variant (SNV) and InDel calling, copy-

number estimation and mutation spectrum analysis. Here we present

h5vc, a data structure and associated software for managing tallies.

The software contains functionality for creating tallies from BAM files,

flexible and scalable data visualization, data quality assessment, com-

puting statistics relevant to variant calling and other applications.

Through the simplicity of its API, we envision making low-level analysis

of large sets of genome sequencing data accessible to a wider range

of researchers.

Availability and implementation: The package h5vc for the statis-

tical environment R is available through the Bioconductor project. The

HDF5 system is used as the core of our implementation.

Contact: [email protected] or [email protected]

Supplementary information: Supplementary data are available at

Bioinformatics online.

Received on November 26, 2013; revised on December 20, 2013;

accepted on January 12, 2014

1 MOTIVATION

There is interest in analyses of cancer genome data across largecohorts (Kandoth et al., 2013), but the standard file formats arenot well suited to the task. The BAM format (Li et al., 2009)provides low-level information (alignments), but is resource-hungry, especially for data from many samples at high depth.On the other hand, the VCF format (Danecek et al., 2011) pro-vides high-level information and focuses on reporting positivevariant calls, while reporting of negative calls is usually not at-tempted and can be expected to encounter scalability limitations.However, absence of evidence is not evidence of absence: justconsidering every position that is not mentioned in a VCF filea ‘no variant’ would imply a high false-negative rate, especially inthe face of subclonality and uneven coverage.There is a need foran auxiliary format that is scalable, compact and accessible frommultiple platforms.

2 HDF5

We use HDF5 (The HDF Group, 2010) as the core of our im-plementation. HDF5 is designed to store large arrays of numer-ical data efficiently, scales well with the size of the datasets,supports compression and is available on many platforms inthe form of libraries for different programming languages includ-ing C/Cþþ, Java, Python, Matlab and R.Our implementation relies on the rhdf5 Bioconductor pack-

age (Fischer and Pau, 2012) for low-level access functions toHDF5 files. We store the mismatch tally in a dataset calledCounts and further quantities in the datasets Coverages,Deletions and Reference. The four datasets, which can be thoughtof as large arrays of integers, are defined as follows:

Counts: [bases! samples! strands!positions]Coverages: [samples! strands!positions]Deletions: [samples! strands!positions]Reference: [positions]

Within an HDF5 file, data are stored in a hierarchical struc-ture consisting of groups and datasets. This layout is analogousto a file system where groups represent folders and datasets rep-resent files. We use groups to represent the organizatorial unitscohort and chromosome. In the filesystem analogy, the Countsdataset of e.g. chromosome chr7 of cohort ExampleCohort willbe stored at location/ExampleCohort/chr7/Counts inthe HDF5 file (Supplementary Table S1 and SupplementaryFig. S1).

3 FEATURES

The tally file size is determined mainly by the genome size andthe number of samples and not by the depth of coverage. Byexplicitly including the sample as a dimension of the data matrix,we can scale from single-sample or pairwise comparisons tocohort-level analyses involving thousands of samples withouthaving to open thousands of file connections and parsing asmany files. The use of R/Bioconductor (Gentleman et al.,2004) for analyses and HDF5 for data storage provides platformindependence and allows scientists to interact with their data onmultiple operating systems. HDF5 tallies are small in compari-son with BAM files, e.g. a dataset of 21 human exome sequen-cing samples used #150 GB of storage (at#100 million reads persample), whereas the tally file took only 6.3 GB independent ofthe per-sample coverage. The tally can be interacted withthrough any of the languages that have HDF5 libraries(Section 2). Representing the mismatch tallies of a wholecohort within one array allows for convenient analyses acrosspositions and samples. The central tool for interacting with*To whom correspondence should be addressed.

! The Author 2014. Published by Oxford University Press.This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), whichpermits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Bioinformatics Advance Access published February 5, 2014

by guest on February 7, 2014http://bioinform

atics.oxfordjournals.org/D

ownloaded from

Page 26: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Applying Functions Block-wise

25/06/14 Paul Theodor Pyl 26

variantCalls <- h5dapply( filename = "example.tally.hfs5", group = "/ExampleStudy/16", blocksize = 100000, names = c("Counts", "Coverages"), dims = c(4, 3), range = c(29000000, 30000000), FUN = callVariants, sampledata = sampleData

)

+   +   G   T  -­‐   -­‐  

A   C   G   T  

A   C   G   T  A   C   G   T  

[…]

Gen

om

ic P

osi

tio

n

Sample One

Block #1

Block #n

Page 27: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Tutorial Tomorrow

•  Example Workflow – Creating Tally Files – Variant Calling

– Visualisation and Quality Control – Creating Reports – …

25/06/14 Paul Theodor Pyl 27

Page 28: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

My Current Workflow

•  Alignment (e.g. gsnap) •  Postprocessing (GATK) – Remove PCR duplicates –  InDel realignment

•  Tallying (h5vc) •  Variant Calling (e.g. h5vc) •  Ensembl VEP •  ReportingTools (Interactive HTML tables)

25/06/14 Paul Theodor Pyl 28

Page 29: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Final Summary

•  (comparative) variant calling is not completely solved yet Ø We need to do some quality control

•  Plotting variants can be helpful –  Tables of variants risk missing important context

•  We should try to formalise the intuitions we use for visual inspection (we’re working on it)

•  HDF5-based nucleotide tallies allow for analysis and visualisation of SNVs in context

25/06/14 Paul Theodor Pyl 29

Page 30: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Acknowledgement

•  EMBL and the Huber Lab – Wolfgang – Bernd Fischer – Simon Anders – Julian Gehring

25/06/14 Paul Theodor Pyl 30

Page 31: Variant visualisation and quality control€¦ · the form oflibraries for different programming languagesinclud-ing C/Cþþ, Java, Python, Matlab and R. Our implementation relies

Tutorial this afternoon

•  http://192.168.0.9/materials/4_Thursday/labs/ –  ExampleData.zip –  Tutorial.Rmd –  Tutorial.R –  Tutorial.pdf

•  Get newest version of h5vc:

source("http://192.168.0.9/biocLite.R") biocLite(”h5vc")

25/06/14 Paul Theodor Pyl 31