Introduction to Variant Calling - Bioconductor · Variantcallsaremoregeneralthangenotypes...
Transcript of Introduction to Variant Calling - Bioconductor · Variantcallsaremoregeneralthangenotypes...
Introduction to Variant Calling
Michael Lawrence
June 24, 2014
Outline
Introduction
Calling variants vs. reference
Downstream of variant calling
VariantTools package
Visualization
Outline
Introduction
Calling variants vs. reference
Downstream of variant calling
VariantTools package
Visualization
Variant calls
DefinitionI A variant call is a conclusion that there is a nucleotide
difference vs. some reference at a given position in anindividual genome or transcriptome,
I Usually accompanied by an estimate of variant frequency andsome measure of confidence.
Use cases
DNA-seq: variants
I Genetic associations with diseaseI Mutations in cancerI Characterizing heterogeneous cell populations
RNA-seq: allele-specific expression
I Allelic imbalance, often differentialI Association with isoform usage (splicing QTLs)I RNA editing (allele absent from genome)
ChIP-seq: allele-specific binding
Variant calls are more general than genotypesGenotypes make additional assumptions
I A genotype identifies the set of alleles present at each locus.I The number of alleles (the ploidy) is decided and fixed.I Most genotyping algorithms output genotypes directly, under a
blind diploid assumption and special consideration of SNPsand haplotypes.
Those assumptions are not valid in general
I Non-genomic input (RNA-seq) does not represent a genotype.I Cancer genome samples are subject to:
I Copy number changesI Tumor heterogeneityI Tumor/normal contamination
So there is a mixture of potentially non-diploid genotypes, andthere is no interpretable genotype for the sample
Typical variant calling workflow
FASTQ
Typical variant calling workflow
AlignmentBWA GSNAP
FASTQ
gmapR
Typical variant calling workflow
AlignmentBWA GSNAP
FASTQ
FiltersRemove
PCR Dups(Picard)
Realign Indels
(GATK)
gmapR
Typical variant calling workflow
AlignmentBWA GSNAP
FASTQ
FiltersRemove
PCR Dups(Picard)
Realign Indels
(GATK)
Tallysamtools bam_tally gmapR
gmapR
ACT
TT A
C G
TTTTTTTTT
AA
A
T
TTTT
CCCCCCC
GGGGGGGGG
TTT
TTTTTTT
T
A CCCCCCCCC
GGGGGGGGGG
C
CCCC
GTTTC
CCC
A
AA
AAA
AA
A
AA
A
AA
AA
AC
Typical variant calling workflow
AlignmentBWA GSNAP
FASTQ
FiltersRemove
PCR Dups(Picard)
Realign Indels
(GATK)
Tallysamtools bam_tally gmapR
gmapR
CallingGATK VarScan2
VariantTools
ACT
TT A
C G
TTTTTTTTT
AA
A
T
TTTT
CCCCCCC
GGGGGGGGG
TTT
TTTTTTT
T
A CCCCCCCCC
GGGGGGGGGG
C
CCCC
GTTTC
CCC
A
AA
AAA
AA
A
AA
A
AA
AA
AC
ACT
TT A
C G
TTTTTTTTT
AA
A
T
TTTT
CCCCCCC
GGGGGGGGG
TTT
TTTTTTT
T
A CCCCCCCCC
GGGGGGGGGG
C
CCCC
GTTTC
CCC
A
AA
AAA
AA
A
AA
AA
AC
Het
Hom-alt
Low Freq
Error
POS REF ALT
2 T A4 A G8 C T
Typical variant calling workflow
AlignmentBWA GSNAP
FASTQ
FiltersRemove
PCR Dups(Picard)
Realign Indels
(GATK)
Tallysamtools bam_tally gmapR
gmapR
CallingGATK VarScan2
VariantTools
ACT
TT A
C G
TTTTTTTTT
AA
A
T
TTTT
CCCCCCC
GGGGGGGGG
TTT
TTTTTTT
T
A CCCCCCCCC
GGGGGGGGGG
C
CCCC
GTTTC
CCC
A
AA
AAA
AA
A
AA
A
AA
AA
AC
ACT
TT A
C G
TTTTTTTTT
AA
A
T
TTTT
CCCCCCC
GGGGGGGGG
TTT
TTTTTTT
T
A CCCCCCCCC
GGGGGGGGGG
C
CCCC
GTTTC
CCC
A
AA
AAA
AA
A
AA
AA
AC
Het
Hom-alt
Low Freq
Error
POS REF ALT
2 T A4 A G8 C T
Annotation Comparison
Sources of technical error
Errors can occur at each stage of data generation:I Library prepI SequencingI Alignment
Variant information for filtering
Information we know about each variant, and how it is useful:
Information UtilityBase Qualities Low quality indicates sequencing errorRead Positions Bias indicates mapping issuesGenomic Strand Bias indicates mapping issuesGenomic Position PCR dupes; self-chain, homopolymersMapping Info Aligner-dependent quality score/flags
Typical QC filters
10.1038/nbt.2514
These filters are heuristicsthat aim to reduce theFDR; however, they willalso generate falsenegatives and are bestapplied as soft filters(annotations).
Whole-genome sequencing and problematic regions
I Many genomic regions are inherently difficult to interpret.I Including homopolymers, simple repeats
I These will complicate the analysis with little compensatingbenefit and should usually be excluded.
Outline
Introduction
Calling variants vs. reference
Downstream of variant calling
VariantTools package
Visualization
VariantTools pipeline
At least twoalt reads
At least 4%alt read fractionC
all
Po
st F
ilter
Max Countin Neighborhood
Ou
tpu
t
Inp
ut
Variants
Tal
ly
Unique Alignments
MappingQuality > 13
Require > 23Base Quality
Mask SimpleRepeats
Ignore PicardDuplicates
QA
dbSNP positionsnot considered;
mostly useful for WGS
Not overlappingHP ( > 6nt )
Overlappingends in same pair
are clipped
Binomial LikelihoodRatio Test:
p(var) = 0.2 /p(error) = 0.001
UCSC self-chain as indicator of mappability
I UCSC publishes the self-chain score as a generic indicator ofintragenomic similarity that is independent of any aligner
I About 6% of the genome fits this definitionI Virtually all (GSNAP) multi-mapping is in self-chainsI Lower unique coverage in self-chains
Aligner matters: coverage and mappability
BWA coverage
GS
NA
Pco
vera
ge
0
50
100
150
200
0
50
100
150
200
0 50 100 150 200 0 50 100 150 200 0 50 100 150 200
FN
self−chained
FP
self−chained
TP
self−chained
unchained unchained unchained
FN FP TP
Aligning indels is error proneResolved by indel realignment
Homopolymers are problematic
Discard variants over or next to homopolymers (>6nt)
FAIL
PASS
CTGCGAAAAAAAA
CTGCGAAAAAAAA
0.0
0.1
0.2
0.3
inside/adjacent outside
Relationship to Nearest Homopolymer
FD
R
Choosing the homopolymer length cutoff
I We fit two logistic regressions to find the optimal length cutofffor our filter
I Response, TP: whether the variant call is a true positiveI Length as linear predictor:
I TP ~ I(hp.dtn <= 1) + hp.lengthI Indicator for when length exceeds 7:
I TP ~ I(hp.dtn <= 1) + I(hp.length > 7)
Logistic regression results
0.7
0.8
0.9
1.0
0.4 0.6 0.8 1.0Fraction of FP retained
Fra
ctio
n of
TP
ret
aine
d
group TP ~ I(dtn.hp <= 1) + hp.length TP ~ I(dtn.hp <= 1) + I(hp.length > 7)
sample 10 YRI x 90 CEU 50 YRI x 50 CEU 90 YRI x 10 CEU
Effect of coverage extremes on frequencies
0
2
4
6
8
0.00 0.25 0.50 0.75 1.00
altDepth/totalDepth
Den
sity
Coverage (1,40]
(40,120]
(120,Inf]I Coverage sweet-spot
(40-120) matches expecteddistribution.
I High coverage (>120) hasmuch lower frequencies thanexpected; mapping error?
I Low coverage also different
Coverage extremes and self-chained regions
self−chained unchained
0.0
0.1
0.2
0.3
0.4[0
,10]
[10,
20]
[20,
30]
[30,
40]
[40,
50]
[50,
60]
[60,
70]
[70,
80]
[80,
90]
[90,
100]
[100
,Inf]
[0,1
0]
[10,
20]
[20,
30]
[30,
40]
[40,
50]
[50,
60]
[60,
70]
[70,
80]
[80,
90]
[90,
100]
[100
,Inf]
Coverage Bin
FD
R
Variant density filter performance
Discard variants clumped on the chromosome.
FAIL
PASS
0.0
0.1
0.2
0.3
0.4
0.5
> 0.1 ≤ 0.1
Neighborhood Score
FD
R
Outline
Introduction
Calling variants vs. reference
Downstream of variant calling
VariantTools package
Visualization
Downstream of variant calling
Calling(vs reference)
GATK VarScan2
VariantTools
POS REF ALT
2 T A4 A G8 C T
Functional Annotations
Genomic context, coding consequences, disease assocations
Annovar Ensembl VEP
VariantAnnotation
VariantFiltering
Two sample comparisons
Mutation callingRNA-editing
ensemblVEP
mutect strelka
VariantTools
Interpretation
VarScan2
Direct
Calling mutations through filtering
I We have two sets of variant calls (vs. reference) and need todecide which are specific to one (i.e., the tumor)
I We have to decide whether the variant frequency is:I Non-zero in tumor butI Zero in normal
I Variant frequencies are a function of:I Copy number changesI Tumor/normal contaminationI Sub-clonality (tumor heterogeneity)I Mutations
I Mutations often present at low frequency and may even showup in the normal data due to contamination
VariantTools mutation calling algorithm
A mutation must pass the following filters:I The variant was only called in the tumorI There was sufficient coverage in normal to detect a variant,
assuming the likelihood ratio model and given a power cutoffI The raw frequency in normal is sufficiently lower than the
frequency in tumor (avoids near-misses in normal)
Functional annotations with VariantAnnotation
The VariantAnnotation package
I Handles import/export of variants from/to VCFI Defines central data structures for representing variants
I VCF objects represent full complexity of VCF as a derivative ofSummarizedExperiment
I VRanges extends GRanges for special handling of variantsI Annotates variants with:
I Genomic context: locateVariants()I Coding consequences: predictCoding()I SIFT/PolyPhen
I Filters VCF files as a stream (filterVcf())
Learn moreThursday lab on annotating variants
Outline
Introduction
Calling variants vs. reference
Downstream of variant calling
VariantTools package
Visualization
Overview
I Convenient interface for tallying mismatches and indelsI Several built-in variant filtersI Combines filters into a default calling algorithmI Other utilities: call wildtype, ID verificationI Integrates:
I VRanges data structure from VariantAnnotationI Tallying with bam_tally via gmapRI FilterRules framework from IRanges
Tallying
The underlying bam_tally from Tom Wu’s GSTRUCT accepts anumber of parameters, which we specify as a TallyVariantsParamobject. The genome is required; we also mask out the repeats.
library(VariantTools)data(repeats, package = "VariantToolsTutorial")param <- TallyVariantsParam(TP53Genome(), mask = repeats)
Tallies are generated via the tallyVariants function:tallies <- tallyVariants(bam, param)
VRanges
I The tally results are stored in a VRanges objectI Extension of GRanges to describe variantsI One element/row per position + alt combinationI Adds these fixed columns:
ref ref allelealt alt alleletotalDepth total read depthrefDepth ref allele read depthaltDepth alt allele read depthsampleNames sample identifierssoftFilterMatrix FilterMatrix of filter resultshardFilters FilterRules used to subset object
VRanges features
I Rough, lossy, two-way conversion between VCF and VRangesI Matching/set operations by position and alt (match, %in%)I Recurrence across samples (tabulate)I Provenance tracking of applied hard filtersI Convenient summaries of soft filter results (FilterMatrix)I Lift-over across genome builds (liftOver)I VRangesList, stackable into a VRanges by sampleI All of the features of GRanges (overlap, etc)
Tally statistics
In addition to the alleles and read depths, tallyVariants provides:
Raw counts Count before quality filter for alt/ref/totalMean quality Mean base quality for alt/refStrand counts Plus/minus counts for alt/refUniq read pos Number of unique read positions for alt/refMean read pos Mean read position (cycle) for alt/refVar read pos Variance in read position for alt/refMDFNE Median distance from nearest end for alt/refRead pos bins Counts in user-defined read pos bins for alt
Filtering framework
VariantTools implements its filters within the FilterRules frameworkfrom IRanges. The default variant calling filters are constructed byVariantCallingFilters:calling.filters <- VariantCallingFilters()
Post-filters are filters that attempt to remove anomalies from thecalled variants:post.filters <- VariantPostFilters()
Filter tallies into variant calls
The filters are then passed to the callVariants function:variants <- callVariants(tallies, calling.filters,
post.filters)
Or more simply in this case:variants <- callVariants(tallies)
Interoperability via VCF
We can export the variant calls to a VCF file:writeVcf(variants, "variants.vcf", index = TRUE)
Outline
Introduction
Calling variants vs. reference
Downstream of variant calling
VariantTools package
Visualization
Visualizing variants with IGV SRAdb
Creating a connection to IGV
library(SRAdb)startIGV("lm")sock <- IGVsocket()
Exporting our calls as VCF
vcf <- writeVcf(variants, "variants.vcf", index = TRUE)
Creating an IGV session
Create an IGV session with our VCF, BAMs and custom p53genome:rtracklayer::export(genome, "genome.fa")session <- IGVsession(c(bam.paths, vcf), "session.xml",
"genome.fa")
Load the session:IGVload(sock, session)
Browsing regions of interest
IGV will (manually) load BED files as a list of bookmarks:rtracklayer::export(interesting.variants, "bookmarks.bed")
IGV section, from R
VariantExplorer package
I The VariantExplorer package by Julian Gehring is anunreleased package for visually diagnosing variant calls
I Produces static ggbio plots and interactive web-based plotsbased on epivizr
I The epivizr package (Hector Corrada Bravo) is abrowser-based genomic visualization platform that pulls datadirectly from a running R session
I Get epivizr:devtools::install_github("epivizr", "epiviz")