Introduction to Variant Calling - Bioconductor

Introduction to Variant Calling

Michael Lawrence

June 24, 2014


Outline

Introduction

Calling variants vs. reference

Downstream of variant calling

VariantTools package

Visualization


Variant calls

Definition
- A variant call is a conclusion that there is a nucleotide difference vs. some reference at a given position in an individual genome or transcriptome.
- Usually accompanied by an estimate of variant frequency and some measure of confidence.

Use cases

DNA-seq: variants
- Genetic associations with disease
- Mutations in cancer
- Characterizing heterogeneous cell populations

RNA-seq: allele-specific expression
- Allelic imbalance, often differential
- Association with isoform usage (splicing QTLs)
- RNA editing (allele absent from genome)

ChIP-seq: allele-specific binding

Variant calls are more general than genotypes

Genotypes make additional assumptions
- A genotype identifies the set of alleles present at each locus.
- The number of alleles (the ploidy) is decided and fixed.
- Most genotyping algorithms output genotypes directly, under a blind diploid assumption and special consideration of SNPs and haplotypes.

Those assumptions are not valid in general
- Non-genomic input (RNA-seq) does not represent a genotype.
- Cancer genome samples are subject to:
  - Copy number changes
  - Tumor heterogeneity
  - Tumor/normal contamination

So there is a mixture of potentially non-diploid genotypes, and there is no interpretable genotype for the sample.

Typical variant calling workflow

FASTQ -> Alignment -> Filters -> Tally -> Calling -> Annotation, Comparison

- Alignment: BWA, GSNAP, gmapR
- Filters: remove PCR dups (Picard), realign indels (GATK)
- Tally: samtools, bam_tally, gmapR
- Calling: GATK, VarScan2, VariantTools

[Figure: pileup of aligned reads against a short reference, with columns labeled Het, Hom-alt, Low Freq and Error, and the resulting table of calls:]

    POS REF ALT
    2   T   A
    4   A   G
    8   C   T

Sources of technical error

Errors can occur at each stage of data generation:
- Library prep
- Sequencing
- Alignment

Variant information for filtering

Information we know about each variant, and how it is useful:

    Information        Utility
    Base Qualities     Low quality indicates sequencing error
    Read Positions     Bias indicates mapping issues
    Genomic Strand     Bias indicates mapping issues
    Genomic Position   PCR dupes; self-chain, homopolymers
    Mapping Info       Aligner-dependent quality score/flags
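
The slides do not say how strand bias is measured. One common choice, used here purely as an assumption, is Fisher's exact test on the plus/minus strand counts of the ref and alt alleles. The sketch below uses made-up counts; the helper name `strand_bias_p` and the 0.01 cutoff are hypothetical.

```r
# Hypothetical helper: p-value of Fisher's exact test on a 2x2 table of
# strand counts (rows: ref/alt allele, columns: plus/minus strand).
strand_bias_p <- function(ref_plus, ref_minus, alt_plus, alt_minus) {
  counts <- matrix(c(ref_plus, ref_minus, alt_plus, alt_minus),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(allele = c("ref", "alt"),
                                   strand = c("plus", "minus")))
  fisher.test(counts)$p.value
}

# Made-up counts: the alt allele is seen almost only on the plus strand.
p_biased <- strand_bias_p(48, 52, 19, 1)
# Made-up counts: the alt allele is balanced across strands.
p_balanced <- strand_bias_p(48, 52, 10, 10)

# Arbitrary illustrative cutoff; the result is kept as an annotation.
biased <- c(biased = p_biased, balanced = p_balanced) < 0.01
```

A small p-value only suggests that the alt reads are unevenly distributed across strands; whether that reflects a mapping issue still needs to be judged in context.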

Typical QC filters

Source: doi:10.1038/nbt.2514

These filters are heuristics that aim to reduce the FDR; however, they will also generate false negatives and are best applied as soft filters (annotations).
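
The difference between a soft and a hard filter can be sketched in base R. The data frame and column names below are made up for illustration, and the two thresholds (at least two alt reads, at least 4% alt read fraction) are borrowed from the VariantTools pipeline slide; this is not how any particular package stores its filter results.

```r
# Hypothetical candidate variants (values made up for illustration).
calls <- data.frame(
  pos        = c(2, 4, 8, 15),
  altDepth   = c(12, 30, 1, 3),
  totalDepth = c(25, 30, 40, 100)
)

# Soft filters: each one becomes a logical annotation column
# (TRUE means the call passes); no rows are dropped.
calls$pass_min_alt  <- calls$altDepth >= 2
calls$pass_min_freq <- calls$altDepth / calls$totalDepth >= 0.04
calls$pass_all      <- calls$pass_min_alt & calls$pass_min_freq

# Hard filter: the failing rows are discarded outright.
hard_filtered <- calls[calls$pass_all, ]
```

With soft filters, a downstream analysis can still recover a call that failed one heuristic, which is the point made above about false negatives.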

Whole-genome sequencing and problematic regions

- Many genomic regions are inherently difficult to interpret.
- Including homopolymers, simple repeats
- These will complicate the analysis with little compensating benefit and should usually be excluded.


VariantTools pipeline

[Figure: pipeline diagram with the stages Input, Tally, QA, Call, Post Filter and Output (Variants). Box labels recovered from the figure:]
- Unique alignments
- Mapping quality > 13
- Require > 23 base quality
- Mask simple repeats
- Ignore Picard duplicates
- Overlapping ends in same pair are clipped
- Not overlapping HP (> 6 nt)
- At least two alt reads
- At least 4% alt read fraction
- Binomial likelihood ratio test: p(var) = 0.2 / p(error) = 0.001
- Max count in neighborhood
- dbSNP positions not considered; mostly useful for WGS
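
One plausible reading of the calling step, sketched in base R: compare the binomial likelihood of the observed alt count under a variant model (p = 0.2) with its likelihood under an error model (p = 0.001), and require the minimum alt count and alt fraction as well. The function names are hypothetical, and the exact form of the test inside VariantTools may differ; only the parameter values come from the slide.

```r
# Log likelihood ratio: variant model (p_var) vs. error model (p_error).
binom_log_lr <- function(alt_depth, total_depth, p_var = 0.2, p_error = 0.001) {
  dbinom(alt_depth, total_depth, p_var, log = TRUE) -
    dbinom(alt_depth, total_depth, p_error, log = TRUE)
}

# Hypothetical combination of the three calling criteria on the slide.
call_variant <- function(alt_depth, total_depth,
                         min_alt = 2, min_freq = 0.04) {
  alt_depth >= min_alt &
    alt_depth / total_depth >= min_freq &
    binom_log_lr(alt_depth, total_depth) > 0
}

# Made-up tallies: 1, 2, 3 and 10 alt reads out of 50.
called <- call_variant(c(1, 2, 3, 10), rep(50, 4))
```

In this sketch, two alt reads out of 50 meet the count threshold but are still better explained by the error model, so the likelihood ratio is what rejects them.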

UCSC self-chain as indicator of mappability

- UCSC publishes the self-chain score as a generic indicator of intragenomic similarity that is independent of any aligner
- About 6% of the genome fits this definition
- Virtually all (GSNAP) multi-mapping is in self-chains
- Lower unique coverage in self-chains

Aligner matters: coverage and mappability

[Figure: GSNAP coverage vs. BWA coverage (both axes 0 to 200), in panels for FN, FP and TP calls, split into self-chained and unchained regions.]

Aligning indels is error prone

Resolved by indel realignment

Homopolymers are problematic

Discard variants over or next to homopolymers (>6nt)

[Figure: example sequence CTGCGAAAAAAAA with FAIL and PASS labels; bar chart of FDR (0.0 to 0.3) by relationship to nearest homopolymer (inside/adjacent vs. outside).]
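
A minimal base R sketch of such a filter, assuming "over or next to" means inside the run or in the position immediately flanking it. The function name `near_homopolymer` is hypothetical, and the real pipeline works on genomic ranges rather than on a character string.

```r
# Flag positions inside, or directly adjacent to, a homopolymer run
# longer than `min_len` bases.
near_homopolymer <- function(dna, min_len = 6) {
  bases  <- strsplit(dna, "")[[1]]
  runs   <- rle(bases)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  flagged <- rep(FALSE, length(bases))
  for (i in which(runs$lengths > min_len)) {
    lo <- max(1, starts[i] - 1)              # one base before the run
    hi <- min(length(bases), ends[i] + 1)    # one base after the run
    flagged[lo:hi] <- TRUE
  }
  flagged
}

# The example sequence from the slide: positions 6 to 13 are an A run of
# length 8, so positions 5 to 13 are flagged.
flagged_pos <- which(near_homopolymer("CTGCGAAAAAAAA"))
```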

Choosing the homopolymer length cutoff

- We fit two logistic regressions to find the optimal length cutoff for our filter
- Response, TP: whether the variant call is a true positive
- Length as linear predictor:
    TP ~ I(hp.dtn <= 1) + hp.length
- Indicator for when length exceeds 7:
    TP ~ I(hp.dtn <= 1) + I(hp.length > 7)
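
The two model formulas can be fitted with `glm()` as sketched below. The data are simulated, not the data behind the slides; `hp.dtn` is assumed to be the distance to the nearest homopolymer and `hp.length` its length. The AIC comparison at the end is a generic way to compare the fits and is an addition here, not the evaluation used in the talk.

```r
set.seed(1)
n <- 2000

# Simulated (not real) calls with homopolymer distance and length.
d <- data.frame(
  hp.dtn    = sample(0:20, n, replace = TRUE),
  hp.length = sample(3:12, n, replace = TRUE)
)

# Simulated truth: calls next to long homopolymers are less often TPs.
p_tp <- plogis(3 - 1.5 * (d$hp.dtn <= 1) - 2 * (d$hp.length > 7))
d$TP <- rbinom(n, 1, p_tp)

# Length as a linear predictor.
fit_linear <- glm(TP ~ I(hp.dtn <= 1) + hp.length,
                  family = binomial, data = d)

# Indicator for when length exceeds 7.
fit_indicator <- glm(TP ~ I(hp.dtn <= 1) + I(hp.length > 7),
                     family = binomial, data = d)

# Generic model comparison (an assumption, not the slide's method).
model_aic <- AIC(fit_linear, fit_indicator)
```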

Logistic regression results

[Figure: fraction of TP retained (0.7 to 1.0) vs. fraction of FP retained (0.4 to 1.0). Curves are grouped by model, TP ~ I(dtn.hp <= 1) + hp.length and TP ~ I(dtn.hp <= 1) + I(hp.length > 7), and by sample: 10 YRI x 90 CEU, 50 YRI x 50 CEU, 90 YRI x 10 CEU.]

Effect of coverage extremes on frequencies

[Figure: density of altDepth/totalDepth (0.00 to 1.00) by coverage bin: (1,40], (40,120], (120,Inf].]

- Coverage sweet-spot (40-120) matches expected distribution.
- High coverage (>120) has much lower frequencies than expected; mapping error?
- Low coverage also different
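
The binning behind such a plot can be sketched with `cut()`. The tallies below are made up for illustration; only the bin boundaries (40 and 120) and the ratio altDepth/totalDepth come from the slide.

```r
# Hypothetical tallies (values made up for illustration).
tallies_df <- data.frame(
  altDepth   = c(5, 20, 60, 30, 15),
  totalDepth = c(10, 40, 120, 150, 300)
)

# Variant frequency as plotted on the slide's x-axis.
tallies_df$freq <- tallies_df$altDepth / tallies_df$totalDepth

# Same bins as the figure legend: (1,40], (40,120], (120,Inf].
tallies_df$coverage_bin <- cut(tallies_df$totalDepth,
                               breaks = c(1, 40, 120, Inf))

# Median frequency per coverage bin.
by_bin <- aggregate(freq ~ coverage_bin, data = tallies_df, FUN = median)
```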

Coverage extremes and self-chained regions

[Figure: FDR (0.0 to 0.4) by coverage bin ([0,10], [10,20], ..., [90,100], [100,Inf]), in panels for self-chained and unchained regions.]

Variant density filter performance

Discard variants clumped on the chromosome.

[Figure: FDR (0.0 to 0.5) by neighborhood score (> 0.1 vs. ≤ 0.1), with FAIL and PASS labels.]


Downstream of variant calling

[Figure: workflow diagram. Labels recovered from the figure:]
- Calling (vs. reference): GATK, VarScan2, VariantTools, producing a table of calls:

    POS REF ALT
    2   T   A
    4   A   G
    8   C   T

- Functional annotations (genomic context, coding consequences, disease associations): Annovar, Ensembl VEP, VariantAnnotation, VariantFiltering, ensemblVEP
- Two-sample comparisons (mutation calling, RNA-editing): mutect, strelka, VarScan2, VariantTools
- Other labels in the figure: Interpretation, Direct

Calling mutations through filtering

- We have two sets of variant calls (vs. reference) and need to decide which are specific to one (i.e., the tumor)
- We have to decide whether the variant frequency is:
  - Non-zero in tumor but
  - Zero in normal
- Variant frequencies are a function of:
  - Copy number changes
  - Tumor/normal contamination
  - Sub-clonality (tumor heterogeneity)
  - Mutations
- Mutations are often present at low frequency and may even show up in the normal data due to contamination

VariantTools mutation calling algorithm

A mutation must pass the following filters:
- The variant was only called in the tumor
- There was sufficient coverage in normal to detect a variant, assuming the likelihood ratio model and given a power cutoff
- The raw frequency in normal is sufficiently lower than the frequency in tumor (avoids near-misses in normal)
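
The three filters above can be sketched in base R as follows. This is an illustration under stated assumptions, not the VariantTools implementation: the function names, the power cutoff of 0.8 and the frequency ratio of 0.1 are hypothetical, and the likelihood ratio model reuses p(var) = 0.2 and p(error) = 0.001 from the pipeline slide.

```r
# Smallest alt count at which the binomial likelihood ratio favors
# "variant" (p_var) over "error" (p_error) at a given depth.
min_alt_for_call <- function(total_depth, p_var = 0.2, p_error = 0.001) {
  k <- 0:total_depth
  log_lr <- dbinom(k, total_depth, p_var, log = TRUE) -
    dbinom(k, total_depth, p_error, log = TRUE)
  idx <- which(log_lr > 0)
  if (length(idx) == 0) return(total_depth + 1)  # no count can be called
  k[idx[1]]
}

# Power: probability that a variant at frequency p_var reaches that count.
power_to_detect <- function(total_depth, p_var = 0.2, p_error = 0.001) {
  k_min <- min_alt_for_call(total_depth, p_var, p_error)
  pbinom(k_min - 1, total_depth, p_var, lower.tail = FALSE)
}

# Hypothetical combination of the three filters for a single position.
is_mutation <- function(called_in_normal, normal_alt, normal_total,
                        tumor_alt, tumor_total,
                        power_cutoff = 0.8, max_freq_ratio = 0.1) {
  !called_in_normal &&
    power_to_detect(normal_total) >= power_cutoff &&
    (normal_alt / normal_total) <= max_freq_ratio * (tumor_alt / tumor_total)
}

# Made-up examples: tumor shows 12 alt reads out of 40.
ok        <- is_mutation(FALSE, 0, 30, 12, 40)  # passes all three filters
low_cov   <- is_mutation(FALSE, 0, 5, 12, 40)   # normal coverage too low
near_miss <- is_mutation(FALSE, 3, 30, 12, 40)  # alt reads visible in normal
```

The coverage check matters because an absent call in a poorly covered normal sample says little about whether the variant is really absent there.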

Functional annotations with VariantAnnotation

The VariantAnnotation package

- Handles import/export of variants from/to VCF
- Defines central data structures for representing variants
  - VCF objects represent full complexity of VCF as a derivative of SummarizedExperiment
  - VRanges extends GRanges for special handling of variants
- Annotates variants with:
  - Genomic context: locateVariants()
  - Coding consequences: predictCoding()
  - SIFT/PolyPhen
- Filters VCF files as a stream (filterVcf())

Learn more
Thursday lab on annotating variants


Overview

- Convenient interface for tallying mismatches and indels
- Several built-in variant filters
- Combines filters into a default calling algorithm
- Other utilities: call wildtype, ID verification
- Integrates:
  - VRanges data structure from VariantAnnotation
  - Tallying with bam_tally via gmapR
  - FilterRules framework from IRanges

Tallying

The underlying bam_tally from Tom Wu’s GSTRUCT accepts a number of parameters, which we specify as a TallyVariantsParam object. The genome is required; we also mask out the repeats.

    library(VariantTools)
    # Load the repeat annotations used as a mask
    data(repeats, package = "VariantToolsTutorial")
    param <- TallyVariantsParam(TP53Genome(), mask = repeats)

Tallies are generated via the tallyVariants function:

    # 'bam' refers to the BAM file of alignments, defined elsewhere
    tallies <- tallyVariants(bam, param)

VRanges

- The tally results are stored in a VRanges object
- Extension of GRanges to describe variants
- One element/row per position + alt combination
- Adds these fixed columns:

    ref               ref allele
    alt               alt allele
    totalDepth        total read depth
    refDepth          ref allele read depth
    altDepth          alt allele read depth
    sampleNames       sample identifiers
    softFilterMatrix  FilterMatrix of filter results
    hardFilters       FilterRules used to subset object

VRanges features

- Rough, lossy, two-way conversion between VCF and VRanges
- Matching/set operations by position and alt (match, %in%)
- Recurrence across samples (tabulate)
- Provenance tracking of applied hard filters
- Convenient summaries of soft filter results (FilterMatrix)
- Lift-over across genome builds (liftOver)
- VRangesList, stackable into a VRanges by sample
- All of the features of GRanges (overlap, etc.)

Tally statistics

In addition to the alleles and read depths, tallyVariants provides:

    Raw counts      Count before quality filter for alt/ref/total
    Mean quality    Mean base quality for alt/ref
    Strand counts   Plus/minus counts for alt/ref
    Uniq read pos   Number of unique read positions for alt/ref
    Mean read pos   Mean read position (cycle) for alt/ref
    Var read pos    Variance in read position for alt/ref
    MDFNE           Median distance from nearest end for alt/ref
    Read pos bins   Counts in user-defined read pos bins for alt
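
To make the read-position statistics concrete, here is a base R sketch that computes them from a vector of read positions for one allele. The positions and read length are made up, and the distance convention (0 for the first or last cycle) is an assumption; bam_tally may count distances differently.

```r
# Hypothetical read positions (cycles) at which the alt allele was
# observed, for reads of length 75.
read_len     <- 75
alt_read_pos <- c(3, 3, 10, 40, 71, 74)

# Distance from the nearest read end (assumed convention: 0 at the
# first or last cycle).
dist_nearest_end <- pmin(alt_read_pos - 1, read_len - alt_read_pos)

stats <- list(
  uniq_read_pos = length(unique(alt_read_pos)),  # unique read positions
  mean_read_pos = mean(alt_read_pos),            # mean read position
  var_read_pos  = var(alt_read_pos),             # variance in read position
  mdfne         = median(dist_nearest_end)       # median distance from nearest end
)
```

A small MDFNE or few unique read positions means the supporting reads all carry the alt base near their ends or at the same cycle, which is the kind of read position bias the filtering table associates with mapping issues.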

Filtering framework

VariantTools implements its filters within the FilterRules framework from IRanges. The default variant calling filters are constructed by VariantCallingFilters:

    calling.filters <- VariantCallingFilters()

Post-filters are filters that attempt to remove anomalies from the called variants:

    post.filters <- VariantPostFilters()
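
The idea of a set of named filter rules can be illustrated with a base R analogy. This is not the FilterRules API: the rules are plain functions in a named list, the rule names and data are made up, and the thresholds are the two from the pipeline slide.

```r
# Base R analogy only; this is not the FilterRules API.
rules <- list(
  minAltDepth = function(x) x$altDepth >= 2,
  minAltFreq  = function(x) x$altDepth / x$totalDepth >= 0.04
)

# Hypothetical candidates (values made up for illustration).
candidates <- data.frame(altDepth = c(1, 5, 3), totalDepth = c(20, 50, 200))

# One column per rule, one row per candidate: similar in spirit to a
# matrix of soft filter results.
filter_matrix <- sapply(rules, function(rule) rule(candidates))

# Number of candidates passing each rule, and those passing all rules.
passed_per_rule <- colSums(filter_matrix)
passed_all <- candidates[rowSums(filter_matrix) == ncol(filter_matrix), ]
```

Keeping the rules as named, separate objects is what makes it possible to report which filter removed which candidate.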

Filter tallies into variant calls

The filters are then passed to the callVariants function:

    variants <- callVariants(tallies, calling.filters, post.filters)

Or more simply in this case:

    variants <- callVariants(tallies)

Interoperability via VCF

We can export the variant calls to a VCF file:

    writeVcf(variants, "variants.vcf", index = TRUE)


Visualizing variants with IGV SRAdb

Creating a connection to IGV

    library(SRAdb)
    startIGV("lm")
    sock <- IGVsocket()

Exporting our calls as VCF

    vcf <- writeVcf(variants, "variants.vcf", index = TRUE)

Creating an IGV session

Create an IGV session with our VCF, BAMs and custom p53 genome:

    rtracklayer::export(genome, "genome.fa")
    session <- IGVsession(c(bam.paths, vcf), "session.xml", "genome.fa")

Load the session:

    IGVload(sock, session)

Browsing regions of interest

IGV will (manually) load BED files as a list of bookmarks:

    rtracklayer::export(interesting.variants, "bookmarks.bed")


IGV session, from R

VariantExplorer package

- The VariantExplorer package by Julian Gehring is an unreleased package for visually diagnosing variant calls
- Produces static ggbio plots and interactive web-based plots based on epivizr
- The epivizr package (Hector Corrada Bravo) is a browser-based genomic visualization platform that pulls data directly from a running R session
- Get epivizr:

    devtools::install_github("epivizr", "epiviz")