MPG Workshop SNP calling 020410 v6 - Broad Institute NGS workshop I: SNP calling Mark DePristo...

MPG NGS workshop I: SNP calling. Mark DePristo, Manager, Medical and Population Genetic Analysis, Genome Sequencing and Analysis Group, Medical and Population Genetics Program, Broad Institute of Harvard and MIT. 02/04/10


Page 1: MPG NGS workshop I: SNP calling

Mark DePristo

Manager, Medical and Population Genetic Analysis
Genome Sequencing and Analysis Group
Medical and Population Genetics Program

Broad Institute of Harvard and MIT
02/04/10

Page 2: Three-slide background on SNP calling in the GATK

Page 3: SNP calling workflow

Data input and output (file size*)

•  Call-ready BAM files (cleaned, deduplicated, recalibrated, with well-formatted header): 200 Gb
•  Raw variants (VCF) (all sites confidently containing non-reference bases; with genotypes): 1 Gb
•  Filtered variants (VCF) (true segregating variation separated from machine artifacts): 1 Gb

Processing tools

•  GATK unified genotyper
•  GATK variant analysis
•  GATK variant filtration
•  Expert user judgement

Ease of use: the unified genotyper is very easy. The variant analysis and filtration tools are easy to use, but parameter selection requires significant expertise and judgement.

Runtime* (as listed on the slide): 10 hrs, Instant, 30 min, Days**

* Runtime and file sizes are for a single-sample 30x whole genome BAM

** Potentially requires many rounds of experimentation and evaluation

Page 4: GATK single sample genotype likelihoods

$$L(G \mid D) = P(G)\,P(D \mid G) = \prod_{b \in \{\text{good bases}\}} P(b \mid G)$$

L(G | D) is the likelihood for the genotype, P(G) is the prior for the genotype, and P(D | G) is the likelihood of the data given the genotype. The first equality is the Bayesian model; the second is the independent base model.

•  Priors applied during multi-sample calculation; P(G) = 1

•  Likelihood of data computed using pileup of bases and associated quality scores at given locus

•  Only "good bases" are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS

•  P(b | G) uses platform-specific confusion matrices

•  L(G | D) computed for all 10 genotypes

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
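The independent base model above can be sketched in a few lines of Python. This is an illustration only, not the GATK implementation. It assumes a uniform error model (a miscalled base is equally likely to be any of the other three) in place of the platform-specific confusion matrices, and it filters "good bases" on base quality alone. The function names and the `min_base_q` cutoff are hypothetical.

```python
import itertools
import math

BASES = "ACGT"

# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
GENOTYPES = ["".join(g) for g in itertools.combinations_with_replacement(BASES, 2)]


def phred_to_error(q):
    """Convert a phred-scaled base quality to an error probability."""
    return 10 ** (-q / 10.0)


def p_base_given_allele(base, allele, err):
    # Assumption: errors are spread evenly over the three other bases.
    # The slide says the GATK uses platform-specific confusion matrices.
    return 1.0 - err if base == allele else err / 3.0


def genotype_log10_likelihoods(pileup, min_base_q=10):
    """pileup: list of (base, base_quality) pairs observed at one locus.

    Returns {genotype: log10 L(G | D)} with P(G) = 1, multiplying P(b | G)
    over the "good" bases only (here: base quality >= min_base_q).
    """
    good = [(b, q) for b, q in pileup if q >= min_base_q]
    result = {}
    for g in GENOTYPES:
        log_l = 0.0
        for base, q in good:
            err = phred_to_error(q)
            # A read is equally likely to come from either chromosome.
            p = 0.5 * p_base_given_allele(base, g[0], err) \
                + 0.5 * p_base_given_allele(base, g[1], err)
            log_l += math.log10(p)
        result[g] = log_l
    return result


# Five A reads and four G reads at Q30, plus one low-quality G that is dropped.
pileup = [("A", 30)] * 5 + [("G", 30)] * 4 + [("G", 2)]
likelihoods = genotype_log10_likelihoods(pileup)
best = max(likelihoods, key=likelihoods.get)
print(best, round(likelihoods[best], 2))
```

Working in log10 space avoids underflow when many bases are multiplied together, which matters at deep coverage.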

Page 5: We apply a generalization of the single sample SNP caller to Pilot 1

•  This approach allows us to combine weak single sample calls to discover variation among samples with high confidence

[Diagram: sample-associated reads for Individual 1, Individual 2, ..., Individual N feed per-sample genotype likelihoods, which feed a joint estimate across samples, which yields genotype frequencies, allele frequency and SNPs.]

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
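To see why combining weak single-sample calls raises confidence, here is a toy joint estimate in Python. The slide shows only the data flow, so the model below is an assumption: a biallelic site, Hardy-Weinberg genotype frequencies, and a grid search over the alternate allele frequency. It is not the algorithm the GATK uses, and every name in it is hypothetical.

```python
import math


def site_log10_likelihood(sample_likelihoods, alt_freq):
    """log10 P(all samples' data | alternate allele frequency).

    sample_likelihoods: one (L_hom_ref, L_het, L_hom_alt) tuple per
    individual, each value being P(D | G) on a linear scale and > 0.
    Assumption: genotype frequencies follow Hardy-Weinberg proportions.
    """
    p_ref = 1.0 - alt_freq
    priors = (p_ref * p_ref, 2.0 * p_ref * alt_freq, alt_freq * alt_freq)
    total = 0.0
    for likes in sample_likelihoods:
        total += math.log10(sum(l * p for l, p in zip(likes, priors)))
    return total


def best_alt_frequency(sample_likelihoods):
    """Grid search over allele counts 0..2N; returns (alt_freq, log10 likelihood)."""
    n_chrom = 2 * len(sample_likelihoods)
    grid = [k / n_chrom for k in range(n_chrom + 1)]
    scored = [(site_log10_likelihood(sample_likelihoods, f), f) for f in grid]
    log_l, freq = max(scored)
    return freq, log_l


# Three individuals, each only weakly favouring a het call (10:1 over hom-ref).
samples = [(0.1, 1.0, 0.001)] * 3
freq, log_l = best_alt_frequency(samples)
gain_joint = log_l - site_log10_likelihood(samples, 0.0)
gain_single = best_alt_frequency(samples[:1])[1] - site_log10_likelihood(samples[:1], 0.0)
print(freq, round(gain_joint, 2), round(gain_single, 2))
```

The printed gains are log10 likelihood ratios against "no variant at this site". Because the samples are treated as independent, the joint gain is the sum of the per-sample gains, so three marginal calls give roughly three times the evidence of one.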

Page 6: Making raw variant calls with the GATK unified genotyper

Page 7: Running the Unified Genotyper

java -Xmx2048m -jar GenomeAnalysisTK.jar \
  -R /broad/1KG/reference/human_b36_both.fasta \
  -T UnifiedGenotyper \
  -D dbsnp_129_b36.rod \
  -varout NA19240.raw.vcf \
  -confidence 50 \
  --heterozygosity 1.000000e-03 \
  -I NA19240.SLX.bam

•  -confidence 50: minimum phred-scaled confidence required to emit a SNP

•  --heterozygosity 1.000000e-03: 1 het per 1000 reference bases on average for a Yoruban

•  -I NA19240.SLX.bam: BAM file containing NA19240 SLX reads

Raw VCF calls (NA19240.raw.vcf):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19240
1 36496 . T A 53.13 0 <ATTRIBUTES> GT:DP:GQ 1/0:6:84.70
1 45162 rs10399749 C T 331.37 0 <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00
1 48677 . G A 399.86 0 <ATTRIBUTES> GT:DP:GQ 1/0:25:99.00

<ATTRIBUTES> stands for a long string of variant annotations (more info in a few slides).

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
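A phred-scaled confidence Q corresponds to an error probability of 10^(-Q/10), which is how a threshold such as 50 can be read. The short Python sketch below shows the conversion and also restates the heterozygosity setting as expected hets per kilobase. The helper names are hypothetical and are not part of the GATK.

```python
import math


def phred_to_prob(q):
    """Error probability implied by a phred-scaled score."""
    return 10 ** (-q / 10.0)


def prob_to_phred(p):
    """Phred-scaled score for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


# The emission threshold and the QUAL values of the three example records.
for q in (50, 53.13, 331.37, 399.86):
    print(q, phred_to_prob(q))

# A heterozygosity of 1e-3 means about one het per 1000 reference bases.
heterozygosity = 1.000000e-03
print(heterozygosity * 1000)
```

Under this reading, a threshold of 50 admits only sites whose modeled chance of being wrong is about 1 in 100,000. That is a statement about the model, not about the real false positive rate.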

Page 8: SNP calling artifacts

•  SNP calls are generally infested with false positives
   –  From systematic machine artifacts, mismapped reads, aligned indels/CNV

   –  Raw SNP calls might have between 5-20% FPs among novel calls

•  Separating true variation from artifacts depends very much on the particulars of one's data and project goals
   –  Whole genome deep data, WG low-pass, hybrid capture and pooled PCR have significantly different error modes

Page 9: Filtering artifacts out of your SNP calls

•  The GATK uses a three pass approach
   –  First emit all sites potentially containing a true variant

   –  Aggregate SNP covariates in the raw VCF to determine the relationship between each covariate and error [warning: requires user expertise]

   –  Finally, apply the filters chosen from that analysis to the raw VCF using the GATK VariantFiltration tool

•  We are currently working on a robust, easy-to-use automated tool

Page 10: Variant annotations and filters

VCF record for an A/G SNP at 22:49582364 (heterozygous genotype A/G in all three individuals):

22  49582364 . A G 198.96 0 AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;MQ=71.31;MQ0=22;QD=2.29;SB=-31.76 GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78

INFO field   Meaning
AC           No. of chromosomes carrying alt allele
AB           Allele balance of ref/alt in hets
AN           Total no. of chromosomes
AF           Allele frequency
DP           Depth of coverage
QD           QUAL score over depth
HRun         Length of longest contiguous homopolymer
MQ           RMS MAPQ of all reads
MQ0          No. of MAPQ 0 reads at locus
SB           Estimated strand bias (SB) score

See http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator for more information
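As a quick check on how the annotations relate to each other, the example record can be parsed with a few lines of Python. This is a sketch that assumes whitespace-separated columns and that every INFO entry is a numeric `key=value` pair, which holds for this record but not for VCF in general. `parse_info` is a hypothetical helper.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'AB=0.67;AC=3' into a dict of floats.

    Assumption: every entry has the form key=value with a numeric value.
    Flag-style entries without '=' would need separate handling.
    """
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = float(value)
    return fields


record = ("22 49582364 . A G 198.96 0 "
          "AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;"
          "MQ=71.31;MQ0=22;QD=2.29;SB=-31.76 "
          "GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78")

cols = record.split()
qual = float(cols[5])
info = parse_info(cols[7])
genotypes = [sample.split(":")[0] for sample in cols[9:]]

# AF should equal AC / AN, and QD should equal QUAL / DP.
print(info["AC"] / info["AN"], round(qual / info["DP"], 2), genotypes)
```

Three heterozygous samples contribute 3 alternate alleles out of 6 chromosomes, which matches AC=3, AN=6 and AF=0.50 in the record.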

Page 11: Selecting filtering thresholds

[Figure: four panels, one per annotation (AB, DP, MQ0, SB). x-axis: covariate bin value; y-axis: transition/transversion ratio, 0.0 to 3.0; legend: titv, dbSNP/100.]

Selected filters are: AB > 0.75 || DP > 300 || MQ0 > 40 || SB > -0.10 || 3 SNPs within 10 bp

Note: left-most values are SNPs without the displayed annotation.

See http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalker for more information
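The four annotation thresholds in the selected filter can be written as a small Python predicate. This is a sketch that does not use the GATK's expression engine. Skipping annotations that are absent is an assumption, since the slide does not say how missing values are treated, and `fails_filters` is a hypothetical name.

```python
def fails_filters(info):
    """Return the names of the thresholds that a SNP violates.

    info: dict of annotation name -> numeric value.
    Assumption: annotations missing from `info` are skipped.
    """
    thresholds = {
        "AB": lambda v: v > 0.75,
        "DP": lambda v: v > 300,
        "MQ0": lambda v: v > 40,
        "SB": lambda v: v > -0.10,
    }
    return [name for name, test in thresholds.items()
            if name in info and test(info[name])]


# Annotation values taken from the 22:49582364 example record.
example = {"AB": 0.67, "DP": 87, "MQ0": 22, "SB": -31.76}
print(fails_filters(example))                # no threshold violated
print(fails_filters(dict(example, DP=450)))  # depth above 300
```

Returning the list of violated thresholds, rather than a single boolean, makes it easier to see which filter removes the most calls when tuning cutoffs.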

Page 12: Running VariantFiltration

java -Xmx2048m -jar GenomeAnalysisTK.jar \
  -R /broad/1KG/reference/human_b36_both.fasta \
  -T VariantFiltration \
  -B variant,VCF,NA19240.raw.vcf \
  -D dbsnp_129_b36.rod \
  --clusterWindowSize 10 \
  --filterExpression "AB > 0.75 || DP > 300 || MQ0 > 40 || SB > -0.10" \
  -l INFO \
  -o NA19240.filtered.vcf

•  --filterExpression: expression describing SNPs that should be filtered out

•  --clusterWindowSize 10: filters out any group of 3 SNPs within 10 bp of each other

Filtered VCF calls (NA19240.filtered.vcf):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19240
1 36496 . T A 53.13 GATK_FILTER <ATTRIBUTES> GT:DP:GQ 1/0:6:84.70
1 45162 rs10399749 C T 331.37 0 <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00
1 48677 . G A 399.86 0 <ATTRIBUTES> GT:DP:GQ 1/0:25:99.00

SNPs with poor characteristics have their FILTER field filled in.
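The cluster rule (any group of 3 SNPs within 10 bp) can be illustrated with a sliding window over sorted positions. The Python below is a sketch, not the VariantFiltration code. It assumes "within 10 bp" means the last and first SNP of a run differ by at most 10, a boundary convention the slide does not specify, and `clustered_positions` is a hypothetical helper.

```python
def clustered_positions(positions, cluster_size=3, window=10):
    """Return positions that belong to any run of `cluster_size` SNPs
    spanning at most `window` bp.

    positions: sorted SNP coordinates on a single chromosome.
    Assumption: a run qualifies when last - first <= window.
    """
    flagged = set()
    for i in range(len(positions) - cluster_size + 1):
        run = positions[i:i + cluster_size]
        if run[-1] - run[0] <= window:
            flagged.update(run)
    return sorted(flagged)


# The first three SNPs span 9 bp and are flagged; the rest are too sparse.
print(clustered_positions([100, 104, 109, 250, 400, 405]))
```

Tight clusters of SNPs often come from misaligned reads around indels, which is why a positional filter complements the per-site annotation thresholds.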

Page 13: Raw and filtered autosomal calls for YRI daughter and trio

Call set                    Callable      #variants  dbSNP %  Ti/Tv known      Ti/Tv novel      HapMap3          HapMap3
                            bases [1]                         (est. FP [2])    (est. FP [2])    sensitivity [3]  concordance [3]

Single individual calls from the GATK
Raw NA19240                 2.70B (89%)   4.52M      77.83    2.07 (1.9%)      1.81 (18.1%)     99.41            99.85
Filtered NA19240                          4.26M      80.42    2.10 (~0.0%)     2.01 (5.6%)      99.14            99.85

Daughter + parents multi-sample calls from the GATK
Raw YRI trio together       2.5B (81%)    6.24M      71.65    2.07 (1.9%)      1.80 (18.8%)     99.62            99.85
Filtered YRI trio together                5.60M      74.86    2.11 (~0.0%)     2.02 (5.0%)      99.29            99.85

1.  % of all 3.1B bases of the B36 human genome called with at least Q50 confidence
2.  Calculated as 1 - (titv_Observed - 0.5) / (titv_Expected - 0.5) with titv_Expected of 2.1
3.  NA19240 sensitivity and concordance results
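The estimated false positive rate in footnote 2 is simple arithmetic, sketched here in Python. The formula and the expected Ti/Tv of 2.1 come from the slide. Clamping negative results to zero is an assumption, made to match the "~0.0%" entries shown for observed ratios of 2.10 and 2.11. The function name is hypothetical.

```python
def estimated_fp_rate(titv_observed, titv_expected=2.1, titv_random=0.5):
    """Footnote 2: 1 - (observed - 0.5) / (expected - 0.5).

    0.5 is the Ti/Tv ratio of purely random errors, so a call set at the
    expected ratio scores 0 and a call set at 0.5 scores 1.
    Assumption: negative values are clamped to 0.
    """
    rate = 1.0 - (titv_observed - titv_random) / (titv_expected - titv_random)
    return max(0.0, rate)


for label, titv in [("raw known", 2.07), ("raw novel", 1.81),
                    ("filtered novel", 2.01), ("filtered known", 2.10)]:
    print(f"{label}: {100 * estimated_fp_rate(titv):.1f}%")
```

For example, an observed novel Ti/Tv of 1.81 gives 1 - 1.31 / 1.6, about 0.181, in line with the 18.1% reported for the raw NA19240 calls.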

Page 14: Example novel variant

Chr1:67634785 in 3' untranslated region

Page 15: Example scripts

•  1000 Genomes SLX YRI BAM files:
   –  Locally available at: /humgen/gsa-hpprojects/1kg/1kg_pilot2/useTheseBamsForAnalyses/<sample>.SLX.bam

   –  Available for download at 1000genomes.org

•  Scripts and VCF files:
   –  /humgen/gsa-scr1/pub/tutorials/MPG_workshop

Page 16: Appendix

Page 17: Choosing a minimum confidence score for a SNP

[Figure: three panels sharing the x-axis "SNPs with confidence score within interval" (0 to 500), with the default threshold marked. Panel y-axes: "% SNPs in dbSNP 129" (dbSNP rate, 0 to 100), "Ti/Tv ratio" (Ti/Tv rate, 0.5 to 2.5), and "True positive SNPs" (true positives on 1KG custom Illumina chip, cumulative, 0 to 3000).]

•  Each point on the plot includes ~3000 SNPs from NA19240

•  The density of points across the confidence interval indicates the number of SNPs

•  ~0.5% of SNPs have Q < 100, and only 2% have Q < 200

•  The default Q50 threshold results in a highly sensitive call set